
www.nature.com/scientificreports

OPEN

Comparative analyses of whole-genome protein sequences from multiple organisms

Makio Yokono 1,2, Soichirou Satoh 3 & Ayumi Tanaka 1

Received: 7 June 2017
Accepted: 16 April 2018
Published: xx xx xxxx

Phylogenies based on entire genomes are a powerful tool for reconstructing the Tree of Life. Several methods have been proposed, most of which employ an alignment-free strategy. Average sequence similarity methods differ from most other whole-genome methods because they are based on local alignments. However, previous average similarity methods fail to reconstruct a correct phylogeny when compared against other whole-genome trees. In this study, we developed a novel average sequence similarity method. Our method correctly reconstructs the phylogenetic tree of in silico evolved E. coli proteomes. We applied the method to reconstruct a whole-proteome phylogeny of 1,087 species from all three domains of life, Bacteria, Archaea, and Eucarya. Our tree was automatically reconstructed without any human decisions, such as the selection of organisms. The tree exhibits a concentric circle-like structure, indicating that all the organisms have similar total branch lengths from their common ancestor. Branching patterns of the members of each phylum of Bacteria and Archaea are largely consistent with previous reports. The topologies are largely consistent with those reconstructed by other methods. These results strongly suggest that this approach has sufficient taxonomic resolution and reliability to infer phylogeny, from phylum to strain, of a wide range of organisms.

The reconstruction of phylogenetic trees is a powerful tool for understanding organismal evolutionary processes. Molecular phylogenetic analysis using ribosomal RNA (rRNA) clarified the phylogenetic relationship of the three domains, bacterial, archaeal, and eukaryotic1. In addition to rRNA, various orthologous genes have been used for reconstructing phylogenetic trees of life. However, the level of phylogenetic resolution allowed by single genes, particularly at the basal level of the Tree of Life, is insufficient owing to the saturation of nucleotide or amino acid substitutions. One solution to this problem is to use concatenated orthologous genes2,3. However, this method is difficult to apply to distantly related organisms and viral genomes, because genes are frequently horizontally transferred4, and the number of vertically inherited orthologs shared by distantly related organisms is limited5. It is reasonable to use orthologs to elucidate species ; however, organismal evolution has occurred not only through the molecular evolution of orthologs but also through the evolution of non-orthologous genes, because both orthologous and non-orthologous genes (species-specific genes) contribute to the identity of organisms. This information is not included in alignment-based phylogenies of orthologous genes. Another approach is to reconstruct a phylogenetic tree based on whole-genome sequences6–8. An enormous number of complete genome sequences are now available. This enables the reconstruction of whole-genome phylogenies for a large number of organisms. Several tree reconstruction methods have been used with complete genomes9–11, including gene order, gene content, nucleotide composition, metabolic pathway reaction content, and single-nucleotide polymorphism analyses. These approaches use hidden evolutionary information that is not available in single-gene or concatenated-gene phylogenetic analyses. However, all these methods have their own limitations, including low resolution and reliability8,12. The primary problem with most whole-genome phylogenies is that they are based on alignment-free methodology, which does not incorporate standard molecular evolutionary concepts, although the reconstructed phylogenies are similar to sequence-based trees in many cases8. Although whole-genome phylogenetic approaches have many limitations, these methods should be further developed,

1Institute of Low Temperature Science, Hokkaido University, Sapporo, 060-0819, Japan. 2Nippon Flour Mills Co., Ltd., Innovation Center, Atsugi, 243-0041, Japan. 3Graduate School of Life and Environmental Sciences, Kyoto Prefectural University, Kyoto, 606-8522, Japan. Correspondence and requests for materials should be addressed to A.T. (email: [email protected])

Scientific Reports | (2018) 8:6800 | DOI:10.1038/s41598-018-25090-8

[Figure 1 panel labels: (a) in silico evolution of E. coli 536 ORFs over five generations, 20% mutation per generation, genomes numbered 01 to 32; (b) all genes with our distance-based method; (c) (b) + distance correction; (d) a single gene (404 A.A.) with our distance-based method; (e) a single gene (404 A.A.) with alignment-based method; (f) a single gene (97 A.A.) with alignment-based method.]

Figure 1. (a) Artificial evolution of all open reading frames from E. coli 536. Thirty-two genomes after the fifth generation were used to build the phylogenetic trees shown in b–f. (b,c) Trees reconstructed from all genes evolved in silico using our average sequence similarity method. (b) Distance matrix was constructed with D. (c) Distance matrix was constructed with C. (d–f) Trees reconstructed from a single gene. (d) Tree reconstructed from mutated ECP_0844 genes (404 amino acids in length) using the average sequence similarity method developed in this study. (e) Similarity dendrogram constructed from mutated ECP_0844 genes, or (f) mutated ECP_0843 genes (97 amino acids in length), using the multiple sequence alignment program ClustalX v. 2.1 (ref. 19).

particularly because the methods can potentially use hidden evolutionary information that cannot be incorporated into alignment-based phylogenetic analyses.

Average sequence similarity methods have been proposed for the reconstruction of whole-genome trees13–15. In contrast to other whole-genome tree methods, these average similarity methods use aligned sequence information in which evolutionary distances are calculated from the BLAST bit score16 of best-matched pairs. However, this approach has been criticized from various viewpoints8: (1) The approach uses local alignments (BLAST)16 instead of global alignment. (2) The approach does not include standard molecular phylogenetic concepts. (3) Many of these methods use reciprocal best matches in which non-orthologous genes are included, thereby introducing noise. The major problem concerning average sequence similarity approaches is the inclusion of non-orthologous genes, and, in fact, whole-genome trees are improved by excluding non-orthologous genes from the analyses14. However, it should be noted that many genes have evolved from ancestral non-orthologous genes through gene duplication followed by divergence, and have obtained new functions along these paths. For this reason, non-orthologous gene information should be included to better understand genomic evolution. The inclusion of non-orthologous genes in alignment-based phylogenetics is nevertheless challenging. In a previous report, we developed a whole-proteome method using average sequence similarities that includes non-orthologous gene pairs17. The phylogenetic tree of photosynthetic organisms reconstructed by this method is similar to previously reported trees. However, it is not evident whether this method can be applied to a Tree of Life including all three domains.
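To make the idea of an average sequence similarity distance concrete, the following minimal sketch turns per-protein best-hit bit scores into a single proteome-to-proteome distance. It is an illustration under stated assumptions, not the formula used in this paper or in the cited methods: the normalization by the self-hit bit score, the `1 - mean` form, the function name `average_similarity_distance`, and all numbers are hypothetical.

```python
def average_similarity_distance(best_hit_bits, self_hit_bits):
    """Return 1 - mean(best-hit bit score / self-hit bit score).

    Assumed (hypothetical) normalization: each query protein's best-hit bit
    score against the other proteome is divided by its bit score against
    itself, and the ratios are averaged over all query proteins.
    """
    ratios = []
    for protein, self_bits in self_hit_bits.items():
        # A protein with no hit in the other proteome contributes zero similarity.
        hit_bits = best_hit_bits.get(protein, 0.0)
        ratios.append(min(hit_bits / self_bits, 1.0))
    return 1.0 - sum(ratios) / len(ratios)


# Hypothetical bit scores for three proteins of proteome A.
self_bits = {"p1": 200.0, "p2": 100.0, "p3": 400.0}   # A against itself
hits_vs_b = {"p1": 150.0, "p2": 50.0}                 # A against B; p3 has no hit

distance_ab = average_similarity_distance(hits_vs_b, self_bits)
distance_aa = average_similarity_distance(self_bits, self_bits)
print(distance_ab, distance_aa)
```

Because every protein contributes a ratio, proteins without a counterpart (such as `p3` here) still influence the distance, which is the sense in which non-orthologous information enters an average similarity measure.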
In this report, we have modified our previous method and succeeded in automatically reconstructing a Tree of Life including 1,087 species encompassing all three domains of life, Bacteria, Archaea, and Eucarya. Our phylogeny has high resolution and reliability for both closely and distantly related organisms, despite the exclusion of all human decisions for the selection of organisms and genes. We also show that the inclusion of non-orthologous genes improves the resolution and reliability of phylogenetic trees compared with trees generated by methods that only use reciprocal best hit pairs. Furthermore, our tree supports the three-domain Tree of Life topology.

Results

In silico evolution of E. coli proteome and phylogenetic analysis. We first applied in silico evolution to the E. coli 536 proteome to evaluate the applicability of our method (Fig. 1a). The E. coli 536 proteome was randomly mutated (20% mutation of amino acid residues per generation) in silico, and a phylogenetic tree of


the resulting proteomes was reconstructed using our new average sequence similarity method (Fig. 1b). The tree was correctly reconstructed, reflecting the actual evolutionary process used to generate the true phylogeny. The relationship between branch lengths and generations (mutation time) was then examined in more detail. The internal branch lengths per generation are nearly identical, except in the initial phase of in silico evolution (Fig. 1b). The distance between two strains is less than 0.85 if the two strains have a common ancestor in the third or subsequent generations (SupportingFile1.xls, sheet 3). The distance is greater than 0.95 if the common ancestor is in the second generation. As shown in the saturation curve (Supplementary Fig. 1d, black line), D is approximately linearly related to evolution time (mutation value) when D is lower than 0.85 (C < 12, corresponding to ~70% amino acid replacement). These results suggest that D can be used as an evolutionary distance, and that if D is lower than 0.85, the phylogenetic tree can be regarded as reliable. The lengths of the internal branches became the same after correcting the distances (Fig. 1c) using the saturation curve constructed by 10% mutation per generation (Supplementary Fig. 1d, black line).
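The branching scheme of the in silico evolution can be sketched as follows. This is a simplified illustration, not the procedure actually applied to the E. coli 536 proteome: it evolves a single random 400-residue sequence (a hypothetical stand-in for a proteome), mutates exactly 20% of the positions per generation, and uses the fraction of differing positions (p-distance) as a stand-in for the BLAST-based distance D. The function names and seeds are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, rate, rng):
    """Replace exactly round(rate * len(seq)) residues with a different amino acid."""
    residues = list(seq)
    n_mut = round(rate * len(residues))
    for pos in rng.sample(range(len(residues)), n_mut):
        residues[pos] = rng.choice([a for a in AMINO_ACIDS if a != residues[pos]])
    return "".join(residues)


def evolve(ancestor, generations=5, rate=0.20, seed=0):
    """In every generation each lineage splits into two independently mutated children."""
    rng = random.Random(seed)
    lineages = [ancestor]
    for _ in range(generations):
        lineages = [mutate(parent, rate, rng) for parent in lineages for _ in range(2)]
    return lineages


def p_distance(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


init_rng = random.Random(42)
ancestor = "".join(init_rng.choice(AMINO_ACIDS) for _ in range(400))
leaves = evolve(ancestor)                      # 2**5 = 32 sequences

sister = p_distance(leaves[0], leaves[1])      # lineages that split in the last generation
distant = p_distance(leaves[0], leaves[31])    # lineages that split in the first generation
print(len(leaves), round(sister, 3), round(distant, 3))
```

In this toy model the distance between sister sequences cannot exceed 0.4, whereas sequences that diverged at the first split are much further apart and approach saturation, which mirrors the dependence of the distance on the generation of the common ancestor described above.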

Reconstruction of a Tree of Life. Next, we reconstructed a phylogenetic tree from the proteomes of hundreds of organisms from all domains of life using our new method. We used the following two criteria to properly evaluate the method: 1) It is well known that the topology of a phylogenetic tree can be dramatically affected by gene and taxon selection, and that the presence of certain genes and/or taxa can sometimes seriously disturb tree topology. To exclude arbitrariness in our genome selection, we used all the genomes available that had been completed as of 2011. This included a wide range of organisms from all three domains of life, Bacteria, Archaea, and Eucarya (SupportingFile1.xls, sheet 4). This criterion helps show whether our average sequence similarity method is applicable to a wide range of organisms. 2) All of the procedures were performed automatically, which completely excludes human factors in tree reconstruction.

Figures 2 and 3 show the phylogenetic tree of 31 Eucarya, 62 Archaea, and 994 Bacteria, whereas Supplementary Fig. 3 shows the same tree without two parasites (Encephalitozoon cuniculi and Nanoarchaeum equitans). The three domains are clearly separated, as reported in other whole-genome phylogenies6,18. One conspicuous characteristic of our tree is that the total branch lengths from the common ancestor are nearly the same among all of the organisms, resulting in a concentric circle-like structure for the tree. To explain this phenomenon, we compared single-gene and whole-proteome trees from the E. coli population that we had evolved in silico. The branch lengths of the whole-proteome tree corresponding to the same generation time are identical among the various evolutionary lines, which gives rise to a concentric circle-like structure, as shown in Fig. 1b and c. In contrast, branch lengths of the single-gene tree reconstructed by our new method, or by the alignment-based Neighbor-Joining (NJ) method using a traditional distance matrix corrected for multiple substitutions calculated from the entire alignment19, differ between the evolutionary lines (Fig. 1d–f). This result led to the idea that the branch lengths became similar by averaging the branch lengths of the large number of in the proteome. This feature was not significantly altered after our saturation correction (Supplementary Fig. 1e), nor by the tree reconstruction method (Supplementary Fig. 1f and g).
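The averaging argument can be illustrated with a small simulation. The sketch below is a toy model under stated assumptions, not the analysis behind Fig. 1: every site is hit independently with probability 0.20 per generation for five generations, the root-to-tip length of a gene is taken as the fraction of sites hit at least once, and the gene length, number of lineages, number of genes, and seed are hypothetical.

```python
import random
import statistics


def root_to_tip(gene_len, generations, rate, rng):
    """Fraction of sites hit at least once along one lineage (toy branch length)."""
    p_hit = 1 - (1 - rate) ** generations
    return sum(rng.random() < p_hit for _ in range(gene_len)) / gene_len


def spread_across_lineages(n_genes, n_lineages=16, gene_len=100,
                           generations=5, rate=0.20, seed=1):
    """Standard deviation, across lineages, of the gene-averaged root-to-tip length."""
    rng = random.Random(seed)
    tips = []
    for _ in range(n_lineages):
        per_gene = [root_to_tip(gene_len, generations, rate, rng)
                    for _ in range(n_genes)]
        tips.append(statistics.mean(per_gene))
    return statistics.pstdev(tips)


single_gene_spread = spread_across_lineages(n_genes=1)
proteome_spread = spread_across_lineages(n_genes=500)
print(single_gene_spread, proteome_spread)
```

With a single gene, the root-to-tip lengths scatter noticeably between lineages; averaging over hundreds of genes shrinks that scatter roughly with the square root of the number of genes, which is one way to picture why whole-proteome branch lengths line up on concentric circles.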

Comparison of entire tree topology. We then compared the topologies of the three domains generated by our method with previous reports to evaluate our method's topological robustness. Our tree topology is largely consistent with previous reports that focus on the phylogenetic relationships of restricted groups (see Supplementary Text 1). Therefore, we compared our tree topology with that of other large-scale, multi-gene or entire-genome organismal trees. Ciccarelli et al. reported a Tree of Life constructed with the maximum likelihood method using concatenated genes18; this was used as a benchmark for evaluating our method. For this purpose, we constructed a tree (Supplementary Fig. 4a) using the same organisms as reported by Ciccarelli et al. (Fig. 2 and Table S4 in Ciccarelli et al.18). Ciccarelli's tree has 163 branching points supported by 80–100% bootstrap values. Among the 163 branching points, 152 (93.3%) are identical to our tree (Supplementary Fig. 4a). This result indicates that our tree is very similar to a Tree of Life constructed by the maximum likelihood method using Ciccarelli's dataset. Although comparison with other reports cannot fully evaluate our new approach, because the true Tree of Life is impossible to know, the phylogenetic topology that we recover is quite close to previous reports, for both intra-generic and cross-domain analyses, suggesting that our whole-proteome average sequence similarity approach has high resolution for both closely and distantly related organisms, encompassing Bacteria, Archaea, and Eucarya.

We used all the complete genomes available as of 2011 to exclude arbitrariness in genome selection (Fig. 2). However, a great number of genome sequences have become available since 2011. Phylogenetic analyses including uncultivated, newly discovered Archaea phyla suggest the emergence of Eucarya from within Archaea20–22, which differs from our tree. One possible reason for this discrepancy is that our tree (Fig. 2) does not include these newly discovered phyla. Another potential factor is that our tree contains a much larger number of Bacteria (944) versus Eucarya (31) and Archaea (29), and this bias may affect tree topology. Therefore, we constructed a new phylogenetic tree of 106 organisms (Fig. 4) that includes these newly discovered Archaea lineages, and we used a similar number of organisms from each domain, Bacteria (40), Eucarya (38) and Archaea (38)23. The three domains remain clearly separated in the new tree, still supporting the three-domain tree of life. Based on these results, we conclude that the three-domain tree of life is supported by our present method. Next, we compared the branching pattern of this tree with recently published trees. In our tree, Lokiarchaeota branched deeply in Archaea, which is consistent with reports that support the three-domain tree of life23,24 and also with a tree constructed from 45 concatenated genes25. In contrast, Lokiarchaeota branches off more recently within the Archaea in the two-domain tree of life constructed from concatenated ribosomal proteins26. The branching pattern of Archaea members in our tree


[Figure: circular phylogenetic tree. The extracted tip labels are fragmentary; recognizable taxa include Pseudomonas, Shewanella, Haemophilus influenzae, Vibrio cholerae, Buchnera aphidicola, Wolbachia, Rickettsia, Yersinia, Klebsiella pneumoniae, Salmonella enterica, and Campylobacter jejuni.]
teric ent ser str Cam pyl 1 8 en bsp erica blin 91 py oba 95 Salmo nella ca su ent Du 287/ lobac cter mo teri bsp ar m str f ter lari R Sal la en a su serov naru 9 Campylob con M21 l nteric terica Galli 12510 cisus 00 almone la e sp en var str P a 138 S onel sub sero ritidis i C cter 26 Salm ca rica Ente C amp curvu enteri te ar ampy ylob s 525 la ubsp en serov loba acte 92 monel a s erica H cter ho r fetu Sal teric ent elicob m s su lla en bsp 469 acter inis bsp ne ica su 5 pylori ATC C fetu Salmo ter ATCC 3 Hel HP BAA s 82- la en ii 7 u ico b AG -38 40 lmonel erguson 12 H acter 1 1 Sa hia f e 10 elico b pylor ic senteriae Sd19 nteria acte i J99 Escher y se H r pylo igella d gella dy elico ba ri 2669 Sh Shi cter 5 6 Helic pylor i P Ss04 oba 12 sonnei ) q cter gella 27 pylori Shi i Sb2 94 Helicob G27 boydi 083 - Helico acter ella C 3 bacte pylo Shig ydii CD 1 r acino ri Shi4 lla bo r 840 He nychi 70 Shige ri 5 st licoba s str S lexne r 301 s cter hep heeba lla f 2a st aticus Shige neri ATC ella flex 7T A C 5144 Shig tr 245 /69 Wolin 9 ri 2a s 2348 ella su flexne H6 str E ccinog a i O127 enes D Shigell ia col Nitrati SM 1 erich a e ru 740 Esch oli ED1 Sulfuro ptor s ichia c C O1 Sulfu vum sp p SB1 Escher coli APE rimona NBC37 55-2 erichia s denit -1 Esch t rificans I89 DSM 12 i UT Naut 51 erichia col ilia prof Esch i S88 Arcob undicola A a col 3 a acter b mH erichi i CFT07 utzleri R Esch hia col M4018 Escheric e i li F11 ichia co Aquife Escher Hydroge x a e li 53 6 C4115 nivirg a olicus V erichi a co H7 str E Hydroge sp 128 -5-R F5 Esch coli O157 i nobaculu 1-1 a str Saka i m sp Y04 Escherichi i O157H7 Sul AAS1 hia col DL933 d b fur ihydr oge Escheric 157H7 E nibi um azo chia coli O Sulf rense Az- Escheri i urihydroge Fu1 N026 nibium sp hia coli UM Pers YO3AOP Escheric x ephonella m 1 -3-5 arina EX coli SMS Thermode -H1 Escherichia o Desulfov ib sulfovib rio I39 Desulf rio vul garis yellowston chia coli IA ovibri o vulga subsp v ulg ii DSM 11 Escheri 8 o ris subsp 
vulgar aris DP4 347 ia coli 5 363 is str Hi Escherich r Desul ldenborou g e fovib rio vu h a coli HS Desu lf lgari s str M Escherichi ovibrio desulf iyazaki F Desulfovi uric ans subsp li ATCC 8739 l brio desul furic desulfurican s Escherichi a co tr DH1 0B c ans subs p desul str G2 0 li str K -12 subs fur icans str AT Escherichia co c Lawson CC 2777 4 f ia intrace 952 llulari s PHE/MN1- erichia co li BW2 Desulf 00 Esch r W3 110 i ovi brio magnet coli str K-12 subst icus RS- 1 Escherichia 55 c str K- 12 sub str MG16 alkenivorans AK-01 Esche richia coli 7A Desulfoc occus ol eovorans cherichia coli E2437 E o Hx d3 Es 55989 A Desul fobacteriu Esch eric hia coli m autotrophicum HRM2 scherichia coli SE 11 o Syntrophoba ct E r er fumaroxidans MPOB m Esche richia coli IAI1 aciditrophicu s SB unc lass if Esch erich ia coli 101-1 u ied Deltaprot A i eoba cter ia miscellaneou s m E110019 c Desulfotalea psychrophila LSv54 o Escheri chia c oli B171 Geoba ct er s p FRC -32 Escheric hia coli E2 2 l

g s uraniireducens Rf4 M Escherichia coli B7A

Geobacter bemidjiensis Bem l

o Thalassiosira pseudonana CCMP1335

Geobacter metallireducens GS-15 o v

l Cyanidioschyzo n merol ae

ter sulfurreduce ns PCA F

h Geoba c l u Volvox carteri

Geobacter lovleyi SZ e

l Chla myd omonas reinh ardtii

elobac ter propionicus DSM 237 9 e P e

4 Ostreo coccu e a Desu lfuromonas acetoxidans DSM 68 s tauri

380 u carbinolicus DSM 2 Ostreococc us lucimarinu s CCE9901

C 1

Dehalococc oides sp BAV Physcomitre lla patens n

t

o E 1 b

ehalococcoides sp CBDB Oryza sativa Jap

h D onica Group

195 n nogenes Arab a Dehalococcoides ethe idopsis thalian a

1 7 Viti 9 1 s vini o bium m inu tum Pei1 m phylotype Rs-D fera

Elu simicro group 1 bacteriu

d Termite l Populus o

unculture trichocarpa e

g

34 a is MED1 z Dokdonia donghaens

HTCC25 59 Trypanos oma cruzi st lanticus rain C z

Croceibacter at L Bren er

Plasmodi L7 um z

bacterium BBF falcipar um 3D7

Flavobacteria

7 o

MED21 Ence i

blandensis phalit ozo on t D oekiella cuni culi

Leeuwenh ii KT0803 GB-M1

amell a forset Entamoeba o

Gr 3-P histolytica H

nsii 2 M-1IMS a

ibacter irg e Entamoeb S o ( lar

Po a dispa r SA

D152 Caen W760 sp ME or a ibact er habditis eleg

Polar D ans

TCC2170 roso phila melan r sp H oga a

Maribacte 2501 A ste r

ata HTCC nophele s g lea biform ambi ae a Robiginita AL38 M

cterium B onode lphi s ba s do cteria 6 mestica Flavoba P02/8 Mus musculu

chrophi lum JI s

ium psy 1

acter Homo sap S

Flavob soniae UW10 755 iens i ohn 700 bacterium j is AT CC Cryp toc oc

Flavo us torqu cus neofor e oflex 1 Sc man Psychr s sp PR WSS hizosac s va r ne riphagu lleri G charomy ofo rman

Algo 06 mue Yarro ces pom s B -35

334 us Sulcia ata wia lipo be 01A

t ATCC t agul lytica

insonii Candida sca co ch i t b a hut

Cytophag 4 r Hc Homalod

CC 2313 ller i st Deba

arina AT lcia mue ryomy Scheffe R lla m tus Su ces rsom r

Microsci andida S hansen yces sti

C accharo ii CBS7 pitis CB

e myce 67 S 605

3 C s cerevis 4 o 34 and iae

NCTC 9 ida glab a lis rata V es fragi 46 Kluyve CBS 13

teroid s YCH romy ce 8

Bac ragili Halo s lacti

r oides f qua A s

Bacter 82 dratu shbya m g h d wa os

n VPI-54 2 lsby sypii A

micro C 848 Haloba i DSM TCC 16 1 m

etaiotao s ATC 3 2 Halob cter 790 0895 i s th iu

roide ulgatu TCC 850 CFP Natronomona acte m sp NR

Bacte eroides v asonis A 277 movar rium s C-1 i

Bact ist 33 geno s alin ar o hae pharaon um

alis ATCC ymp R1 acteroides d 3 i o chon Halorubru s DSM r

Parab is W8 dotri Halo 216

l gingival pseu arc m lacu 0

ides ula m sprofundi

o Porphyromonas gingiv tero 55 5 Ar ar zobac 138 chaeogl ismo r

Porphyromonastus A M tui ATC i

ida 5a2 DS M obus f ATC C 4923

Cand ruber micola DSM 246 etha ulg C 430 9 e

iaticus ter 26 1 noco idus DS 49 as d DSM r d

r lus s Methan pus M 43 Salinibac Chlorobiumeroide li culum 0 h act ans DSM 1303 ospirillum lab 4

tus Amoebophi aeob 7 Methano rean ph -1 cu hungatei JF um Z

Candida obium m ferrooxid BU S lleu

Chlor tiforme rvum NCIB 832 Meth s mar

Chlorobiu hra pa anosp isnigr -1 n pidum TL3 Metha i JR haerul o i e 1

phaeoclat robaculum obium te M 27 5 noregu a p

ctyon Chlo Chlor M 26 la alustr

di eol um DS uncu bo is E1

Pelo ides DS Met ltu one i -9 p t red m 6A

C obi um lut ibrio hano 8 c aeov D3 saet ethano

Chlor ph ii Ca Meth a t geni o bium mat anoc hermo c archa

Meth phi p Chloro rochro BS1 occoid la eon

des 1 anosa es b PT R

chlo eroi Met rcina b urto C-I

t SM 27 l c c

obium oba rii D hano arkeri nii DSM 62

Chlor hae 0 sar m p estua 1 M cina str 4

351 p biu eth mazei G Fusa 2 a loro CC ano ro

Ch m AT sarc o h ssiu ina 1

Prosthecochlorishala a acet

on t ivor a t Methanothe an

rpe s C2

lorohe 7 A

i Ch Methan rmo n

0 bacter thermM 5 osp ethanop y

haer Me l

Hardjo-bovis JB19 tha a stad yrus

) 0 autotrophicu

jo-bovis- 13L5 nobre kandle ii serovar L1 vib tma

uz nae DS ri e

B sen cr -1 act AV serovar HardFio 01 P Me er s s str Del 19

ii r es s 2C K Me than m M 3091 t en i st Am n th ith ta t

ai str 566 1 r sp Met anoco ocald ii A H

toc s logena C han cc oc TCC e ovar L Pa Pari 2CP- oco us occus 3506

Leptospira borgpetera borgpetersCopenhagen deha M ccus a a ar strain toc 1 0 er et eo ja 1 a rov ans ser toc ct 9-5 han vann licu nna Leptospirs se rrog HD10 oba Anaeromyxobactegenans 10 Me oco iel s Na sch s

r Pa train Pa s yx Fw thano ccu ii nka ii DSM

rrogan inte ova ovoru rom dehalo sp cocc s ma SB i

inte spira ser atoc s teri nae ter ter Meth r -3 26

ra lexa ar P c A ac anococ us m ipal 61 a ospi Lepto a bif ba xob aripa udis s pt serov y Metha C m e Le ospir flexa ovibrio ro cus ludis 5 Lept ra bi Bdell Anaeromyxobacnae nococcu ma

A 22 ripa S2

ptospi 16 s ma ludi

Le K Nitro s

us D sopu ripal C7 e Picroph u nth ilis mil dis

e xa u C

cus 6 ilu s mari 6

r versat T s t

coc 5119 her orridu tim Myxo 6 acte CC T mo us t 5 he pl s SCM c e orib m AT rmo asm DS us K tu pla a M 97 1 um So c dat ula 6 sma vol 9

di aps 7 can 0

Can c n607 T-2 acido ium

Stigmatellaellulos aurantiaca DW4/3-1ium G

er tiaca 5 4 Thermococcu phi SS

ngium c act tatus Elli an V lum 1 a m e idob aur Py The DSM 1

Sora Ac ter usi BAA-83-1 oru ro

bac TCC fern co rmoc s 728 timonas A in ccu oc onnurineus

r m s

tus Soli mma rae PB90 fu cus

e iphilu rio ko G id Pyr su dak NA

i andida muciniphila c oco Pyro s DS Nano C a yla are 1 E a th V ccus ab M nsi

Opitutus ter P 36 equ ar Me 3 s K chaeu T yssi G hor 8 OD itans

Akkermansi la ruddii 1 Ki m

E ikos n4

45 Cand 5 hii - p 36 OT M

s CarsonelSM ida Th 3

h a D tus erm

arin 1 Kor ofilu u m m h

Candidatu H 38 Ignico arch pe

s J1 aeu

rellula nde

i baltica S 3 ccus m cry n

stop -18 s H c hos

Bla 9 Stap pto rk

pneumoniaee TW hylo pitalis K fil 5

Desu um OP r

opirellula 39 IC

P therm a o Rhod G Aer lf IN4 F

6 uroco us mari 8

-5 op Hype /I cc y hlamydophila caviae /C 3 u Ca yrum rther u C N

t pneumoniae CWL02 ld s nus

o /B

neumoniae AR is Fe ivir pe mus butylicuskamch DS

ydophila fel s S26/434 s ga rni F1

tis T m x a a hermo a K tkens u

Chlamydophila Chlam phila octiti X Pyroba qu 1

r ila abortu C ilin is mydo rachom Pyr prote gen 12 a p t c 2

ChlamydophilaChla p a 3 T ob ul us n sis M 545 1n i di /UCH-1/pr u

y g 5 he Pyroba acu m IC-1 am tis D/UW-3/HAR-1 ig 0 3 Th rm is eutr 6

Chlamydoph lum lan a

is L2b A/ 78 0 erm us oph 67 Chl is H 78 Meta culum a cali dicu m

t 5 W H 11 us th Pyroba m il

homat ma sp W th er llo dif us e s p m on D V24S r

cho u s erm oph sphae rsenat ti SM

trac ra UWE2cc us 6 Sul Sul cul s J 41 p t o 7 t

a hila c cc 1 oph ilu fol f ra se um ae CM 8 a

ho co s sp CC93S991 2 s obu olobus icum 4 c D ilu H 11 n K

Chlamydia ho 0 e B dula rop c

Chlamydia lamydi c Ther in s 8 s a DSM 54

Chlamydia muridarumSyne neN s sp R p RS99 7 o DeinococcHB Sulfol tokod hilum 8

L Ch amoebop cu c cid DS

Sy p CC99 R m oc 27 ocal M 53 1351

dia ub omicrob cu Su obu aii str IM

Synechococcuhococ occus s s l 4 a rob f s solf dari C ec sp BL1005 Dein g us radiodur S olobus str 48

( occus s act eo ulf us D 7 2 h S tochlamy 2 t atar Syn 0 o h o o

o CC96 er ium co e Sulfo lo isla SM 6

Synechoc ococcus 81 xylan cc rm bus icus

H 3 roseum a ndicu o

W us li Sul lobu i 3

datus Pr Synechocech p 3 931 7 Roseif de s ans sl P2 9

yn cus sps 0 oph D folobus islandicuss a M1nd s

S s 93 IT Chl s S R S YN155 r

Candi u M 5 R H ilus DSM ert M 1 ulfol isla ic

cc IT tr RCC30 1 oro osei lexus er i 1 ndic us YG

o M s 1 92 1 p DSM V 1 Sul obu a c a 3 r

nechococo tr s sp 0 0 flexu etosi 51 CD11 0 fo us 1 Sy ch s s 57 IT 96 1 Chl flex 0 lob s isla 57

sp RS c

M Chl p 5 LS21 1 r ne H tr 930 o s ag us c h 994 9 5 us 4

Sy marinu W s oro r on au i ndi

s sp s IT 12 of ast - 1 sl 5

marinus hococcu M 6 le greg 1 cus e

s s r 93 fle and ccu u st xu enh ra 64

T 198 x ic M162 a e

Synec occ marinu rinus str AS I u s ans D nt us

coccu c s M MP 5 s sp ol ia r nus zi h

ro ho tr 5 au i D cu M1 7

c s CC 951 37 ra Y-40 SM s ATCC 2377 42

s rochloroco mari s 1 1 SM 1

P ne s nu IT P 1 nt 94 5 e Sy i ia 0-

r is str M M c

Prochlo 92 n a or tr C A cus fl 85 394 m st s C IT 1 o

r r

ococcuus t M TL A Bi J 1

Prochlorococcu c pa s tr 2 fidobac-10 9

lor c s a

Prochlorococcuso masp u s NA TL -

ch c in s fl

e o ub r Bif

r s s marinusa Pro o T c l s str tr NA te h a id

h s m ro ri c p ob a ro rinu s marinurinu T ph Bi ac um anima

t b s

P a 2 ropher e t i u s f

m s cu 0 4 ry Propi idobac te

ri e

us us us ma marinus1 s m Bi u

cc in 511 a w f Bi m r yma whi ido fidob ad li a

co Prochlorococcua orococ CC 1 hi onibacteriu ter s subsp l h

o m CY01T 0 a

coccus A 8 pp ba iu ol s 8 m acteri esc r hlor cu p C lei ct

e c c Prochlchlorococc C e long

d o sp C pplei TW r en

e

a co chloro ce P 4 iu a 2 um long o

Pr o Pro ece s p 2 Kineococ m acnesm um tis A ctis AD or he s 4 03 ma str 08

l Pro ot 7 8 long c i h e TCC c C 6 rin /27 NCC2 a

t anoth c

Le Twi e

ro he C C CC 700 e Act u um subs 0

P Cy Cyan 1 t P C 3 ifson m 1 11

c 0 o p 3 cu s KPA17120 y n s P 1 t D 70 57

85 a p inob s radiotolerans S JO10 0

r y e s s sp P K ia Be 5 3

c 294 t

H C e is ocur C p infan

W th IES-84st xyli sub acteri u A h

i o y CC 0 1 la te

ni n c 2 v n 2 e o

e o a o 30 2 ia ib berg s y h hococcu 1 4 6 4 Bre a

y t C c 7 2 rh ac um P ti s ATCC

a is AT 41 C s

wa ne C iz vibac Cla te p ia

a y CC P C 79 o r xyli str

e iabil HSC2 cave

er eruginosaS N Synecr P CY9 C 1 Ar phi viba m RS302

i a 7310 us P - Arth R te ic

v C t la 1569 a m s hrob enibact h

pha a C 6 at u Acid rium cter ig rn

P g t robact DC22 CT 0C ae o t t 0 a

d toc sp n n 1

cos os ga tus BP oth ac line michi e 1 6 7

migena C n n CB DS Cro N C 81 s elo lo Ar ter erium 0 s

nabaen e er e 1 ns is 0 M 7 a

Microcystis a A 3 mus thr sp r ga s 12

m 5 a u b a spu 1 b BL 2 u ne 3 e

a oba F salmo b

coccu 4 A B2 re 2 s 3

c o nsis su 3

h 7 cellu sc p t

a dulari ococcusC ccus elonga ct 4 m l ec 3Ba2- er e i

n ch C n ninarum c No P No lolyt chlor s hi

Nostoc punctiformeLyngbya sp PCSy p 7 T bsp g a

s 1 JA-2- ca Fran Fra C1 an o Syne 1 icus sep o e s sp JA-3-3 r Fran e

c J di oph n o

e Th a o k nki s

a synechoco n ATCC 33 th C110 s sp 742 i id 11 ia en ed is n um erythraeum IMS101 I bacte kia a sp E

h er t o n cu C S e s on N

a mo s B p Cc ol C r y MB a s alni i icu P

a chococcu li p c P C Thermo ococ bi n r sp HT JS I AN1p us s B

rin fida S is AC 3 20

a 9 S al p A 3

ech 6 61 9 8 t

m Syne 9 2 1 tr in o N1 6 2 b Trichodesmi e fu i ra 4 e c s 0 p sp c o i Syn 2 13 t CC 4

r violaceusS PC Z S tomyc sca r a c o o o A AS94 N S Str tre r p 2 a

C hl ter G Corynebac a 64 c G s 5 p Y ar ic

o M M 0 Co ept tomyces co X a 9 i s ne Co es gr e

ry s n C

a e ryne om i N

o n oge Cor ry c n Ac AS50 0 B

e es M1 GA 7 0 ne yce iseus ol -

Gloeobacg py G 8 ba a 4 o M 1 yneba ba 4

y pyogene 6 o c terium s av CN 0

p S 0 d ter sub s s cte e m A 5 Coryne S-

u cu pyogen fre Cor Co ium lic

c c G c ri e s 2 c ococcus M n 2 t um r ol p 0

d o t ogenes 3 ry eriu kroppensmi a

i yne or A3 5 c y s Ma 5 u gri

o toco ep occus p e ne je rea ti m t r s n tr 1 Sa bact m li seu

i p - ba ba s M l e s ik

re St g AS82 I J diph lyti 2 t trep cu ccha Co

t S c o s G S 4 0 cte ct eiu s

S o y M S 4 e eriu c t A-4 NB

c p nes MGAS107 s 1 R ryn ri r m um D ed

Streptoco s 039 0 rop iu the

s e e hodo um g m aurim t 68 RC

u n is Rho N eb m K4 ii c e r oc ol ri 0 c a D

o rept c g R ac ae 1 1

o pyogene o e yspo e SM 7 SM St c s GAS1 5 b ho coccu do a luta ff 1 335 y u rd t i NC

a o er c

P p R i t pyogen 7 Mycobac doco coc p cus pyoge s 56 s ia iu mi ie ucosum 1 4438 0

e s pyogenesu MGAS31 4 u ra e ns T 0

tr c c 3V/ 5 s cus f m cum C 9 ( cu c 40 c 9 ccus arc

StreptococcusS pyogenestococcu MGAS102c o o Myc jost ryth gl YS- 1312 5 c o 1 r p c c 90 e

c o o 260 e 51 H M in u R re o t 1 t A 1 Myco ii ry t 31 A

A treptococt t p p 1 9 y teriu oba opacus ic rae am ATCC

S p p equi e e c RH th a I 9

S re s MGCS10r CO 1 B Myco M M 4 tre t t ctiae ia 6 D3 yco o ro a NR icu S e y b m c A FM

S S cus pyogenes M a ct 3 6 bact Mycobacteriumc a t po

Streptococcus ala i a R o er 1 m 7009 m H 6 ba c smegm

b 1 D e bac t iu li RL

u ui subs act ae CJB e Mycoba a e B s P 0 A

idemicu l oniae eri c r m 4 T a agal iae 4 c t i 1 23 p CC c us ag M31 1 m teriu te erium u 5 7

e ag s acti E 2 1 4 um m a R 2 5

t treptococ u act - r bsce 4 3

N S 5 iu a

r v 13

S cc m 8

cocc gal e R umonia 4 G 1 m lep cterium a tis

o agal pneu 19F g n 0

p zoo c 18 R s b ssu 3

coccus eq pto o s a s l rae p ilvu s

bs t e u ep a t 2 t

coccus niae 103 M sp MC J a r r 5 s e

Streptococcusto agalactia cu ia iwan 6 yco rae L l MC c ep cc A 1 m e m r 6 9 T S

Stre ct o J 1 - sp KM P n

ui su coc a c J N i i

Strepto St tococcus agalactia 10 i 2

to 7058 e 4 83 My My bac Br49 Y P

Strep ae Ta neumo 1 9 Z MD S R 15

pto oniae eP ia 6 Y n agal ep tococcus pne p P - R L c Myc -

us eq trep r moniae TIG n S 66 Myco c terium avium G R 5

Streptococcus pyogenes M49 59 S s o A 1 s 9 M ob o 2 S -

9 My C i e Stre u St us G 6 CN bac 3 1 a moni m 1 3 Mycobac yco K

e C a oba

ococcuscc cc u y CH s ct t Strep pneum e r r K c

o o e t obac bac t n

c n a a S UA15 ba eri er ep i ophilu c o

r to pneu oc p g s Brachyspira hyodysenteriaMycobactMy t t n TCC 700n i philu cte u eri iu

St p cus o ubs n hilus LMGm 1 te ep us u u i t m m

Streptococc r m s ans Myc cob er r u s c u t ( b H M t i ri c u s g her eriu um um ma m a

St Stre o e e t y iu v subsp V tococ n mu ac co o

cocc c n ia a hermo obac m ul iu

o t tu rin m t p oniae nA halli s us s 3 3 m tu c m

ba ter H

G rep thermopc 3 eri bo ococcup s o C s s 3 6 0 ber u eran Streptococcus pneut e u 3 3 bovis be 10 pa St r m r u oc 4 ct teri iu m repto p t c t c cu 1 um tu v o c u s c c l eri m i c rcu 4

St re S o e G1 I Borre Trep s M s ratuber

t i o to 1 um t u

c n c ococcuZYH s u bov A l lo A S occus p t i BCG os c

to p o ococ t m F gy

t t p 05 c bercu sis

p us pneums p re 98HAH3 o 2122 i p t re ris M a u is s C e u e t l lia nem t 9 Tr y tr c r re S Bo ub be BC 9 culo e

t t S uis is SK1p 3 s C n c 1 Tre

S gordoni tu epone tr Pa D

o S S suis 9 s 8 Ca rr e rculo losis / s emo 8 Ach rc 97 C c s s b a o tococc Streptoc s 5 el r G sis K-

o 1 u V5 1 Aste icat po 1 t / 6 n d e WA ulo s 5 1 D d ia s 55 p cu 6 Di e s ccu 9 2 ole n sis H37R t c

s C i id t

e 8 sp cr i 5 5 ma pallidumnt subsp pallidums SS r T eur 1 1 K c her ae 91 ema r Strep r t C 9 C

t O 6 c t r yell a i is F1 1

a ococcu s c 2001 y a pla c H37 ok S t toco i U 3 tyoglom t B 1 0 ococ sub a D e T o n us ms ol p t u l 2 Ca o 1173P

aecalis s 5 Bor pa yo c

p s f O 39 h g d sma a AT v r p re ep s u F E1 r e t i e l ows wi i Ph 1 R s u s um o d e i re r 8045 r 4 rb garinii PB i

S t t r lli 17 Str u c DSM a 1 m a reli l DA a 3 He m i 3 S c lactis c s EGD- r D M t y la a y v du ri i 2574 8 3 s ox u 5 C 2 2

c o T u t l 4 o e e o s reuteri F27 b 2 T o s o a a r

o c 5 s 2 t 8 e P u s H C faeci a 4 n h idla e

5 s li s m su c h yd pl

c o 7 CFS 4 8 e s 4 r 2 o

s C 6 H s th P c 35 t - 1 e u ob i e tches- bsp sakei 23 t e CC 7 26 e e r

o eute 2 e n a t ccus c 2 2 T t u h fzel u t tr CLIP e 68 C CC53 t r e T T On s K a l S cillu 4 c 68 r ot r sm

0 7 m f u wii PG r a mentum IF p H m 4 h r y

p erococcu u F 0 m l act herm he o 4 s H n S 80 l o r t s r l 3 gen C 2 l s y g Borr bsp

e L ccus 3 l 5 y A a 2 o t e 0 2 A W e C a he t e i b t b p11 SL r 0 o t T io o ii PK o P Th i a r 2 2 6 A 2 G o m o n s AT 33 s D str Am k t o d a 5 t c r t 8 L r c r p B llu rum W o 4 T a a h r

En t 2 4 s h eri g Coprothermo n t

ctoco 4 2a e s a a m e t m mo bro u

S a r s C N m rmu a b u l ma o i

C / n h e p M r s 1 H 97- 7 H u e n a a s eu n Cli n er elia b e e B 0 l Des m 00- t H 33 8 o r i yel rr us fer 3 1 s a o e r o o str s t

La c F 0 A s r a d D o Mesopla -8 nta e i BB1 r a um b o A all o 0 s s s Z 8 2 m m

l 6 s m m om p m Lactoba 9 b i E a e t r Sy t t mo phi elia baci t a g m c 2 7 c i b T l 1 4 5 r e o d m o e og og i 1

2 ua s 9 F 3 u D A

c str c st 1 n F B s s s 6 A r u s o o lo id subsp cremor BAA-3 o A a 8 l a Thermo h a i NC 7 A 1 6 e r o m a x s b a asei BL t c s hracis u 7 4 0 i o Enteroco acill o i ATCC e a r 5 a u e S or l e r s r o l o t b o cto i us 1 - n c e ws um t 3 r r 2 h t hy mod y u a sp to lu ur 1 y o r e t 1098 5 t 1 h ican u t o a a L t h s 03 Q 1 b lfit s pla s n n s str t 11 t 2 e h a a r i M du

c r e n b u d t t r h t 4 2 i l l hytop re

cillus sakei su c s H 4 s R m g T v u s s C 8 p if 3 a roph e m i g 3 e e t s r 1 r f ga m Kosmotoga ol e La e a r c s c pe 32 n e A i s

Lactobacillus reuteri JCM 11 4 i C v dr i i s str

o nn A 1 B F - o a hermosiph do g kian e 9 o 8 c i v t ob r ph li tt illu g i ovar n c s s tr u e d 6 pen s a t c x o t s brevis ATCC 36 TCC s i 8 - c i i a 6 o i 1 n l i C o c C b t r D o S sma florum L1 synthetic r a b oni i l Clo n M A o a 1 l A C k u o illus c s s E t l t es tr H-6- 7 33 53 t eria monocyt s e Cald og m s a 4

s t o c s s C 9 1 a i f 1 e u i b R rf

t s c ku e a Ca o r J Ureapl yt T l ac Ni a e m b

Lactob A ia cereu s u 4 u f t S c i 2

y 4 u T r oph t y l y o u a s llus ant u n 9 s d m a omonas a ea p U la i e nes serotype s u a C K r c a Q 9 bacter p m l n G s r s a og a c i r l u e 2 e e b A l ti 4 c Ure opl e ur i s

c r l e B a - en n u s obac l l 3 s n H i t e u c r h 5 r ri CC r cereus r N t s

l i i T p u te illus case i l 9 u u p c sma A n L cho - cc CC o L h u 9 e l ca t 2 o es ATCC 82 o l i s L B-149 k e Fus i po e Lactoba a h us A l a t dicell n c A s t gd

t s u t t h o s cus i c ko e e M t a le 1 s y N i c t g l s h l F t p ZS eri ser A q s i l r a s n i e n n cytogene ster c r 9 ridi h i a G u i i r f riu ila e asm r n us anthra c c H i i c Clo i e of e c a i e acillus reuteri 1 co ri a n B l s T m 2 ap i Baci n C i e o ld N u l

o l S 5 t g i r m p a moden d e AT o o i e c u s li r a or e Li m a l s p m l s e n r l Lact s oeni PSU- a cill s V u r 5 C r i as

i i a B r r K o RR u u r C s 3 t B s s m 5 or s l 7

io i 0 c H u a i

r l K d i l r o ta s eticus DPC 45 B u l a n a p um m ob a m s ll m h e u T C s u geni las

r m c u l 1 u o Lactobac s h B e C u m RKU- tt fe

6 u e s o o lga s u u N u 0 l d o oni on y o N i i 0 C m m i s s b o e 1 n s r s i e l - o ul a l b C l m ma parv

b i d M Myc Y c s Ba e c i f l m s ulosir man o na h ovar l l l A m th 1 b ing a

ed a u r p c 9 e X n 1 M

a 1 l m Lactobacillu L i i i y y 4 t roteoly ri l u r o a o ac i C m af t s u s 5 1 s wo a r , i m s m o i t C p W l l a l rid earia T ma Ba - C o a n M OY P bu r e C C t g 6 s 3 nostoc citreum KM i c c s n a x 4 l 5 h S r l e B r o sse e C p t s e s s C Ic n u a p e n u 5 k h I af Fu

Lactobac i a 1 T s l D Lactob h n h s c C i e s c ri e hns e 3 C r 9 g c b 1 C l e 8 - e a a u l a 5 o 1 B e o h C y c l te B ae s t l p c u C d r r o t ia c e a 2 o a a a l i t i o C

t l l l o ir 1 i t oplasma mo 3 o o s u i s l l D r t iu 0 e canu e u s 8 o a l s o 0 d o n nie c p

r f B o ga r h l 4 f 1 - n R 0 o s s t jo l B s B l s B C l s l s i A i a e 8 d a i n i s l er u SM

s s l n o s l s n i 1 i l i o ei u ri M a wel x i s n M r l r s c v u l s up 1 o M 6 d i 7 i u o p 2

c e s 0 i b s e e o

B i s t M 3 d u s u 2 c i illus th u o r o e l A B A p m ar a t l d

i u ob

u i L t o M s B c m 6 i BAA-11 L l s s t m l Z-2 TM i h o llus sp p r y s um l t m t l 3 p L 3 d i i l l i r i 2 3 c u s C i e n ub l t y r m t s i c l t p y t 5 i 0 a mesenteroid a i s r C t sp 2 i s 2 s r s u um s

us l h G l i u h o a n tic st i C m u l y a 7 Colli t a i u N r i subs Leuco s u 7 0 i u t My i b r i r l Bacillus cereus ATC 2 t c lu s l O d i c y vu

a i t i s

2 i c 5 n c c t d l BF Oenococcu r u a nsi 4 d r tor u 1 t o i s 2 i l d m c p 6 or s i e x u c c B m i n se a 5 M d s r l y

ill c c 4 l oc 43 s l c r I c 1 acidophilu 5 0 m N o c il B d u l o M 5 t i r o Li d i b l i W i i

2 1 s steri a l s i i s 5 o s

B C i m h i 5 O c i A 2 i o v us DS o d s a e n d u u el t i u u 0 u d i ct i i 1 m k i i 90 a P o

T m s u E d p a s r 7 3 u nu y r A d p i t l 2 s u coplas i e c H i d t e T m 9 T u 3 3 kii u C

s Ba a i i c B u B p 1 l m i a um s s s Y n 19 ac i v M 5 u i i M Li c m l B C b 5 p ac l 3 h s be ell l c 3 6 m a B l s i i B e t k m B l i u lu u J R u u A C m g e erova

l S u i s i a A

s n C

Listeria monocytoge 8 4 m t R a u s b u l eoba u b H t m u a n r F i e C j l a u m u 4 p i o e 1 D 9 u 3 b n c 1 S r t cle a u a i l A a N b a m r o I s m e e ri ec m s R y m

r i 5 s s 1 ingie S S u s u G o C o u J ob S m M F er s s m c c a i lo 4 t 1 e d m u T i e b t r sc e l B u o s a G 5 R s n i m s r h p C 5 c g a d m tob C u v r n r a cc ella 1 M C u

p

s wo b d F T 1 M

u l P u d n r y m r l s a u T 52 2 C

U b ru 4 3

R i e m m b Lactobacillusct helv i i m f h M o y s a l 1 t e 1 A

d ov s e a p J e l B m e c i s c S m 7 cteri T u b e bile 1 b n i o at M 52 i B tali a l M s v l i C a a u i F u i u e y 9 b M y s b u 7 c s o ii 8 c B r A i S u m o p p i n e u t o h a hu s f t ma pneum o e E o m s r r - Myc a r a a H M m o r AT e s n r 3 s des subsp c 6 n I u o e

e w _ a s c u t

s - l l B t C e f e o La u o Q oeni ATCC o m C c a u t i 6 B r g ar u r s f um m m

Lac t e e e f t m r p arolyticus f elb c i ic u u C t i DS P 2 nu l u r s k u 4 c u u t l o e u i o t r r 0 e u u i n ae u r

s u ei p w C r l r c r um JCVI-1 p i l e o a s t a a i t N a 3 c a N u u r g 1 B a u p r J h s i i i e i i Y a i p y

u i u b 1 3 v 6 r e u r l t m c d 0 c um m l s n l 3 u u n e u f i T e n l i u r o O r N u b d d n Clo i e i i a s s r c e l B c b o l u i l i c l D B l a r CC c 6 l

i opl a t r 5 n M a 0 s p l e uob u t l i

s t r i n d u l 3 n l 6 r i s a a e 3 u m g l tr A _ d N g m M s u p e u i i h su i d s n i o s u b r l i t a u t u l a n o u r i le o o 5 Myco

s y n 3 n n R c 4 l u u M r f l g a y S d e u tr 3 t p E s s u u a

2

A 0 s m e lus l e m

l a m r e t F i i c p i c u a u f m b C s f tr p p u t 1 i 6 d b g i t C m t cillu u e u r c u y m u s s i i h i K il s i a asm a a n H1 m s 2 M C p n 0 b u 672 o A 3 n e bsp yco c s n i e r a S s a t a r p m c s c a 0 e C m e G TCC

p u l 3 a v t I e O 27 a b i e m m a b p J p u t tu c Myc t AT m i 3 3 u e a l l a b A i s s r d

s s s M b

u i a b Exi b s i rid U p s s i i e s y c s 0 s n C i s c n W d ie u i s l m

o F c 1 B s oe s s s s E g l B n 5 u h i u A u b i u p 0 B r e oniae M1 a B

o s u m a B b r u s bac A t s t u c s S p plasm u

s 40 pl y s

n b B s

T s t A A 2 u a s Lactobacillus delbrueckii subsp bulgaricus ATCC 1184 o e n a pu s o b CC 7 s u 5 I 5 r s r n

i e 8 m u p s 1 a g s S b r / u p B u s ium u o m n t m

mesenteroi e o u L D

r c c 2 M u p t 0 Q t s 0 H 2781 N 1 A u n opl 3 asm m 5 s r s ATCC A s s b p s r e 8 a A g

A s

u u ting s r m t b ucl b a o g s c 8

s A U s s A

c 5 p

e s u s 1 Oenococcus 5 i e s SM 89 y r t

o F u u l s T u r ub s M n e 1 C s 1 s a s s v s r e T u o l 0 a r s t A r s u H s o b t

act T i s a s e 8 x l 6

Figure 2. Phylogenetic tree of 1,087 species reconstructed from a comparison of all protein sequences from all the species. Branch colors reflect taxonomic information (division) obtained from the NCBI website.

is largely consistent with a Bayesian phylogeny using markers including more than ten thousand amino acid positions27, except that Halobacteria branch off deeply within the Archaea in our tree (Fig. 4). The branching pattern of Eucarya is similar to that of an RNA polymerase tree23.

Impact of laterally transferred genes on tree topology. Our method uses all of the genes in the genome and does not exclude laterally transferred genes from the analyses, despite the extensive occurrence of lateral gene transfer throughout the evolution of life and its potentially confounding influence. We used two different approaches to examine the impact of laterally transferred genes on our tree topology. First, using our in silico evolution of the E. coli 536 genome, we artificially introduced lateral gene transfers between two of the descendants (Supplementary Fig. 4e). The laterally transferred genes were randomly selected. In both cases, the branching pattern did not change when the number of transferred genes was lower than 40% of the total genes. When 50% of the genes were transferred, the branching pattern changed. Second, lateral gene transfer was introduced in silico between real genomes (Supplementary Fig. 4c and 4d). The genes to be transferred were randomly selected from Salmonella or a cyanobacterium, and added to the E. coli 536 genome. When 10 or 30% of the genes were artificially transferred from Salmonella to E. coli 536, the topology did not change. However, when 10 or 30% of the genes were artificially transferred from a distantly related organism (Cyanobacteria) to E. coli 536, the branching position of E. coli O157:H7 EDL933 changed slightly, but that of the other organisms did not. The branching position of E. coli 536 itself, which acquired a large number of laterally transferred


Figure 3. Smaller version of Fig. 2. Species within each clade were collapsed according to taxonomic information (division).

genes in the experiment, did not change at all. Although we could not exactly quantify the impact of laterally transferred genes on our tree topology, the above two experiments indicate that the tree topology constructed by our method is not greatly affected by laterally transferred genes.

Discussion

Our tree was automatically reconstructed; that is, human decisions, such as alignment and selection of genomes (organisms), were completely excluded. Gloeobacter is the most deeply rooted cyanobacterium (2.7 Gy ago)28, and E. coli and Salmonella diverged about 0.12 Gy ago, with all E. coli strains having diverged more recently29. The phylogenetic topology of these organisms was correctly reproduced by our method. Other topologies, including relationships involving the Archaea, are also consistent with previous reports. These results strongly suggest that our new approach has sufficient taxonomic resolution for taxa ranging from domain to strain across a wide range of organisms spanning the Tree of Life.

However, drastic decreases in genome size generate incorrect topologies. For example, Encephalitozoon cuniculi GB-M1, a parasite with a greatly reduced genome of about 2,000 protein-coding genes, does not cluster with other Fungi in our analysis. Similarly, Nanoarchaeum, which has the smallest known genome in the Archaea, with only 536 protein-coding genes30, is not correctly located in our tree. However, these organisms may have a sufficient number of ORFs to calculate evolutionary distances using our method, because other organisms with reduced genomes (e.g., Prochlorococcus with 1,800 protein-coding genes) are well resolved in our analysis, and correct trees have been reconstructed from artificially reduced genomes using the previous version of our method17. The incorrect placement of these organisms in our tree suggests that drastic changes in genome sequence have accompanied the reduction in genome size in these organisms, which is consistent with previous reports31,32.
Our phylogenetic analysis of in silico evolved E. coli proteomes shows a linear relationship between generation time and proteome tree branch length when the evolutionary distance D is below 0.85. We calibrated the relationship between geologic/evolutionary time and branch length in our Tree of Life using the fossil record and molecular markers (Supplementary Fig. 1h, red). Branch length per year increases after 0.6 Gy ago. Branch lengths relate to the degree of genome diversification in our tree, because of the phylogenetic distance calculations we use. Therefore, if the values from the fossil record and molecular markers that we used are correct, some important biological or environmental event, such as a rise in atmospheric O2 concentration, may have occurred around this time (about 0.6 Gy ago) and could have facilitated the rapid diversification of the genome. Branch length saturation (Fig. 1b and c) cannot be a major reason for this phenomenon, because similar profiles are also obtained after correction by the saturation factor (Supplementary Fig. 1h, blue).

Another important feature of our tree is that it exhibits a concentric circle-like structure. This was not caused by the tree reconstruction method itself, BioNJ33, because the same tree structure was also produced by the NJ method (for all 1,087 species) and the Fitch–Margoliash method (a 721-species subset) (Supplementary Fig. 1f and g). This indicates that all the organisms have similar total branch lengths from their common ancestor. This feature might be a typical characteristic of an average similarity tree. However, this topology is not consistent with a report that substitution rates vary significantly between species, and that this variation is associated with species characteristics such as generation times in animals and plants34. The evolutionary rates of Bacteria have also been reported to be influenced by generation times35. In contrast, branch lengths generated by our present method are almost the


Figure 4. Phylogenetic tree of 116 species reconstructed from a comparison of all protein sequences using the Fitch–Margoliash method. The 116 species were obtained by random and equal sampling from the latest genomes of the three domains23. The inset shows the polar tree layout. Scale bars, 0.05.

same among different lineages, with the exception of some parasites, suggesting that rates of genome evolution are not greatly affected by species. We cannot presently explain this discrepancy, and encourage further study.

There are two approaches to the reconstruction of a phylogenetic tree: one is alignment-based, and the other is alignment-free. Whole-genome phylogenetic inference methods, such as gene order and gene content, are alignment-free, with sequence information used only for the annotation of genes6,8,36. Traditional alignment-based approaches are impractical for inferring whole-genome phylogenies, because creating global multiple sequence alignments of all the amino acids or, even more difficult, all the nucleotides of whole-genome sequences for large numbers of even somewhat divergent organisms is presently unrealistic, owing to computational limits and complexity. Another alignment-based approach is to create multiple sequence alignments of single or concatenated genes or gene products. However, this method requires human decisions, and thus introduces biases, concerning gene and/or organism selection. Moreover, even though it becomes more powerful as more genes are added, it eventually becomes computationally intractable as the size of the dataset increases. Our average sequence similarity method is quite different from these other two alignment-based approaches, because it reconstructs a whole-proteome phylogeny based on the local alignments of amino acid sequences. Previously, species trees have been reconstructed by using only orthologs from these local alignments. Some average sequence similarity methods use the mean normalized blastp scores


Figure 5. Representation of best-matched proteins on a two-dimensional display. The vertical axis represents the logarithmic E-values (E) of the best-matched proteins of Arabidopsis to the proteins of Chlamydomonas. The horizontal axis represents the logarithmic E-values of the best-matched proteins of Arabidopsis to Arabidopsis (Ebest). In this case, the best-matched proteins are identical to the query proteins. Both axes run from −150 to 0. The plot was fitted with a straight line from the origin (red line). We estimated the average of T from the slope of the line.

of reciprocal pairs to restrict the phylogenetic analysis to orthologous pairs14. A major criticism of this approach is that not all orthologous genes may be correctly defined using only reciprocal best matches8. However, this criticism does not apply to our analyses, because we found that the inclusion of non-orthologous gene pairs reconstructs a better topology with our method. In our study, we compared phylogenetic trees reconstructed using reciprocal pairs (Supplementary Fig. 1i) with those reconstructed using directional pairs (Fig. 2). A large portion of the reciprocal pairs is expected to be orthologous pairs. In contrast, a large portion of our directional pairs has high E-values, around 10^1 (Fig. 5), indicating undetectable homology between the paired genes. E-values of these pairs with undetectable homology were included in the calculation of genomic evolutionary distances in our present study. Nevertheless, the overall tree topology was more correctly reconstructed using our directed pairs approach than with the reciprocal pairs method. For example, merolae, a Rhodophyte, is most deeply rooted in the clade in the tree produced by the reciprocal pairs method. This result contradicts the idea that only orthologous pairs must be used for the average sequence similarity method.

Why do directed pairs result in the correct topology? In our new method, BLAST is run between all of the individual proteome sequences; therefore, the E-values obtained are the expectations that particular proteins are present in the targeted genome. All of the probabilities (E-values) are converted to information content scores by conversion to a logarithmic value. Scores of all the pairs are then used for calculating pairwise genomic evolutionary distances. For this reason, our calculated phylogenetic distances describe the probability that genome A evolved from genome B. Concepts of orthologous genes and function are not included in this calculation process.
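The difference between directional and reciprocal best-hit pairs can be illustrated with a minimal Python sketch. It assumes that the best hits have already been extracted from blastp output into plain dictionaries; the function names and the toy identifiers (x1, y1, and so on) are hypothetical and are not taken from the authors' programs.

```python
def directional_pairs(best_xy):
    """Return every (query, best hit) pair found when searching X against Y."""
    return set(best_xy.items())


def reciprocal_pairs(best_xy, best_yx):
    """Return only pairs whose reverse search (Y against X) gives back the query."""
    return {(x, y) for x, y in best_xy.items() if best_yx.get(y) == x}


# Toy best-hit tables (invented for illustration only).
best_xy = {"x1": "y1", "x2": "y1", "x3": "y2"}  # X queries -> best hit in Y
best_yx = {"y1": "x1", "y2": "x4"}              # Y queries -> best hit in X

directional = directional_pairs(best_xy)
reciprocal = reciprocal_pairs(best_xy, best_yx)

print(len(directional), sorted(reciprocal))
```

In this toy case x1 and x2 both pair with y1 in the directional set, whereas only (x1, y1) survives the reciprocal filter, so the directional set retains pairs that a reciprocal analysis would discard.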
The absence of orthology concepts from the calculation is one of the reasons why we use all of the best-matched pairs for calculating genomic evolutionary distances. This calculation process is quite similar to that of calculating protein evolutionary distances from amino acid substitution probabilities (Supplementary Fig. 1a). In traditional phylogenetic analyses, probabilities of amino acid substitutions are converted to scores (e.g., by the PAM250 score matrix) through conversion to logarithmic values, and all these scores are summed to obtain global similarity scores between any two sequences. Over extremely long evolutionary periods, genes evolve not only from obviously orthologous genes, but also from genes in which any sign of homology has been lost. This may be another reason why our inclusion of gene pairs with very high E-values results in an improved phylogenetic tree. This strategy is quite different from the reconstruction of a species tree from a single or concatenated gene dataset, in which the selection of vertically inherited orthologs is indispensable, because the selected genes must have the same evolutionary history as the species. Our method takes into account the fact that organismal evolution occurs not only by the molecular evolution of orthologous genes, but also by the acquisition of species-specific genes. Ideally, this information has to be incorporated into the reconstruction of a species tree. The present approach attempts to include this type of information.

The process of phylogenetic tree reconstruction using any distance matrix method can be divided into two parts: first, the calculation of evolutionary distances, and second, the reconstruction of a phylogenetic tree using those calculated evolutionary distances. Laterally transferred genes are generally considered to cause noise in calculating evolutionary distances. This is particularly true when the phylogenetic tree is constructed by an alignment-based method with either single or concatenated gene datasets.
However, the number of detectable, strictly vertically inherited orthologous genes among all of the genomes of life is relatively small, with most genes in all of life having been laterally transferred and then vertically inherited18. Laterally transferred genes do cause noise on the one hand, but on the other can provide important information if treated correctly. We tried to evaluate the


impact of laterally transferred genes by introducing artificial lateral gene transfer (Supplementary Fig. 4c–4e). Although the impact of laterally transferred genes on the tree topology was not quantitatively evaluated in our experiments, tree topologies were not greatly affected by the lateral gene transfer events. This may be an advantage of our method. However, it should be mentioned that the phylogenetic position of Halobacterium sp. NRC-1 in our tree is different from that of Ciccarelli et al.18. This difference may be due to the large number of laterally transferred genes in this archaeal organism. Halobacterium sp. NRC-1 has acquired 1,089 genes by lateral gene transfer from the bacterial domain, which implies that 41% of its total genes (1,089/2,630) were laterally transferred. This corroborates our in silico experiments, and further demonstrates that a very large number of laterally transferred genes can disturb tree topology37. Recently, a list of horizontally transferred gene candidates has been compiled for both Bacteria and Archaea38. The exclusion of highly transferred genes from analyses based on such lists is a possible strategy for overcoming the problem. In contrast, horizontally transferred genes are not much of a problem in Eucarya, because horizontal transfer is quite limited in that domain39. Further study is required concerning the impact of laterally transferred genes.

A second unsolved question is whether it is reasonable to allow multiple query genes to pair with the same gene. At present, we have no answer as to whether our procedure is entirely adequate for calculating evolutionary distances, though it does appear to approximate distances well enough to reconstruct acceptable phylogenies.

Distance-based tree topology also depends on the algorithm used to reconstruct a tree from a distance matrix. BioNJ33 may have produced a more accurate topology than NJ, because of the huge size of our data set.
According to our preliminary calculation with a smaller number of proteomes, the Fitch–Margoliash method, even without the global rearrangement option, can reconstruct a more realistic topology than the NJ method with our data (Supplementary Fig. 1f and g). However, the Fitch–Margoliash method cannot presently be applied to a 1,087-proteome phylogeny owing to calculation time constraints, regardless of the global rearrangement option.

Recently, extensive research has attempted to elucidate the origin of eukaryotes, with many reports supporting a two-domain Tree of Life20–22. However, questions have been raised concerning this hypothesis23,24 because of difficulties in determining a root, and problems with using concatenated genes having independent evolutionary histories. The selection of genes for phylogenetic analysis greatly affects tree topology. Our method attempts to overcome these problems by using all of the protein sequences in a genome for tree reconstruction, without a concatenation process. It should be noted that alignment-free whole-genome phylogenies that utilize protein domain content and abundance also support the three-domain phylogeny24. Improvement of whole-genome phylogenetic methods will be required to fully resolve the two- versus three-domain issue.

Our present method automatically reconstructs a reasonable Tree of Life. However, theoretical analysis of our method has not been sufficiently undertaken. Further improvements of the method are expected from establishing ways of incorporating various properties of the evolutionary events that led to extant genomes in the Tree of Life, and from understanding their theoretical background.

Methods

Herein, we propose, develop, and demonstrate a novel average sequence similarity whole-proteome method for reconstructing the Tree of Life. In our new method, phylogenetic distances are calculated by averaging similarities between directional best-hit pairs.
Although the method has many problems (a long calculation time, and the limited number of species with available whole-genome sequences), it is potentially a very powerful tool because it includes local alignments, in spite of being a whole-genome phylogenetic method. Previous average sequence similarity approaches aimed to use only orthologous pairs by employing reciprocal best-match pairs with an E-value cutoff, although non-orthologous pairs could not be completely excluded. Our aim is to automatically reconstruct a Tree of Life while excluding human decision-making processes, such as the selection of organisms, determination of gene pairs, construction of distance matrices, and tree reconstruction itself. Instead of reciprocal best-match pairs, our method uses the E-values of directed best-hit pairs. A large number of non-orthologous pairs are included in these pairs, because many genes are not shared between any two organisms.

The procedure is described in detail below. All of the deduced amino acid sequences from the ORFs of one organism (X) are used as query sequences, and blastp searches are performed against all of the amino acid sequences from the ORFs of another organism (Y)40. For each query ORF (x), the E-value (E) for its best-matched ORF (y) is calculated (Supplementary Fig. 1a). If organisms X and Y are identical, the best-matched pair is that of the identical ORFs and its E-value is referred to as Ebest. Next, the log-transformed E is divided by the log-transformed Ebest, and the quotient is referred to as T.

$$T = \log_{E_{\mathrm{best}}} E = \frac{\log_{10} E}{\log_{10} E_{\mathrm{best}}} \qquad (1)$$

This can be transformed as follows:

$$\left(E_{\mathrm{best}}\right)^{T} = E \qquad (2)$$

Here, the E-value means the probability that a query sequence appears in an unrelated database sequence, provided that the size of the targeted database is constant. Thus, Ebest indicates the probability of an event (A) that randomly changes a sequence to x by coincidence, and E indicates the probability of an event (B) that randomly changes a sequence to the sequence shared between x and y by coincidence. Therefore, T shows the frequency of event A occurring when event B occurs once, and 1−T reflects the relative frequency of the change of the common ancestral sequence of x and y to x (Supplementary Fig. 1b). T is in the range 0 to 1, and when x is similar to y, T nears 1. All pairs of Log10E and Log10Ebest corresponding to all ORFs of X are plotted on a two-dimensional graph, as shown in Fig. 5. We divide the dots into fourteen groups corresponding to the values of Ebest and examine the


distribution pattern of E, to examine whether the dots can be fitted by a straight line from the origin. The patterns are very similar among the fourteen groups, as shown in Supplementary Fig. 1c. This relationship is observed with all combinations of two genomes (Supporting File1.xls, sheets 1 and 2). Based on these results, the dots are fitted with a straight line. The straight line from the origin (Fig. 5, red line) is determined by a least-squares fit. We then use the slope of the line as an average of T (Tx→y), and that average is used to estimate a distance between X and Y (Dx↔y). Here, Tx→y is not always the same as Ty→x, because proteome sizes and the number of intrinsic genes differ between organisms. The distance (Dx↔y) is approximated by subtracting the average value of Tx→y and Ty→x from 1, and is used for the construction of our distance matrix.

$$D_{x \leftrightarrow y} = 1 - \frac{T_{x \to y} + T_{y \to x}}{2} \qquad (3)$$

The best-matched genes of Y are categorized based on a reverse blastp search against X. For the D calculation of the reciprocal tree, the best-matched genes of Y are used only when the reverse search returns the original X query gene. Tree topologies are more correctly reconstructed using our directed pairs approach than using reciprocal pairs, indicating that the inclusion of non-orthologous pairs results in a more correct calculation of genomic distances. Our tree topology is nearly consistent with previous reports that include Bacteria, Archaea, and Eucarya. One conspicuous characteristic of this tree is that the total branch lengths from the common ancestor to the tips are almost the same among all the organisms, resulting in a concentric circle-like structure for the tree. The reliability of this newly developed sequence similarity method was supported by the analysis of Escherichia coli proteomes that were evolved in silico.

A problem with our method is that it requires a very long calculation time. For example, the method required almost three years of calculation time to obtain the E-values for all the pairs from 1,087 proteomes with 24 CPU cores. However, these pairwise BLAST calculations can be performed in parallel; therefore, the calculation time depends directly on the number of CPU cores.
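The distance calculation of Eqs. (1)–(3) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' program: the function names are hypothetical, the (E, Ebest) values are invented toy data, and the floor applied to E-values reported as zero (E_FLOOR) is an assumption introduced here so that the logarithm is defined. The slope of a least-squares line through the origin is computed with the standard closed form, sum(x·y)/sum(x²), with x = log10 Ebest and y = log10 E.

```python
import math

# Hypothetical floor for E-values reported as 0.0 (an assumption, not from the paper).
E_FLOOR = 1e-180


def log10_e(e_value):
    """Return log10 of an E-value, flooring zeros so the logarithm is defined."""
    return math.log10(max(e_value, E_FLOOR))


def mean_t(pairs):
    """Slope of the least-squares line through the origin (average T).

    pairs: iterable of (E, Ebest) tuples, one per query ORF.
    With x = log10(Ebest) and y = log10(E), slope = sum(x*y) / sum(x*x).
    """
    sxy = 0.0
    sxx = 0.0
    for e, e_best in pairs:
        x = log10_e(e_best)
        y = log10_e(e)
        sxy += x * y
        sxx += x * x
    if sxx == 0.0:
        raise ValueError("all log10(Ebest) values are zero; slope is undefined")
    return sxy / sxx


def distance(t_xy, t_yx):
    """Equation (3): D = 1 - (T_x->y + T_y->x) / 2."""
    return 1.0 - (t_xy + t_yx) / 2.0


# Toy (E, Ebest) data, invented for illustration only.
x_to_y = [(1e-50, 1e-100), (1e-30, 1e-60), (1e-20, 1e-40)]  # points on y = 0.5 x
y_to_x = [(1e-75, 1e-100), (1e-60, 1e-80)]                  # points on y = 0.75 x

t_xy = mean_t(x_to_y)
t_yx = mean_t(y_to_x)
d = distance(t_xy, t_yx)
print(round(t_xy, 3), round(t_yx, 3), round(d, 3))
```

Because Tx→y and Ty→x are fitted separately and then averaged, the resulting D is symmetric in X and Y, which is what a distance matrix method such as NJ or BioNJ requires.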

Data Availability. E-values for all the pairs from 1,087 proteomes, which were used for the construction of our distance matrix, are available from the corresponding author on reasonable request. Distance matrices and tree files in FigTree format will be available from FigShare (https://figshare.com/10.6084/m9.figshare.5677705).

References

1. Woese, C. R., Kandler, O. & Wheelis, M. L. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proceedings of the National Academy of Sciences 87, 4576–4579 (1990).
2. Brown, J. R., Douady, C. J., Italia, M. J., Marshall, W. E. & Stanhope, M. J. Universal trees based on large combined protein sequence data sets. Nature Genetics 28, 281–285 (2001).
3. Swingley, W. D., Blankenship, R. E. & Raymond, J. Integrating Markov clustering and molecular phylogenetics to reconstruct the cyanobacterial species tree from conserved protein families. Molecular Biology and Evolution 25, 643–654 (2008).
4. Gogarten, J. P. & Townsend, J. P. Horizontal gene transfer, genome innovation and evolution. Nature Reviews Microbiology 3, 679–687 (2005).
5. Dagan, T., Artzy-Randrup, Y. & Martin, W. Modular networks and cumulative impact of lateral transfer in genome evolution. Proceedings of the National Academy of Sciences 105, 10039–10044 (2008).
6. Delsuc, F., Brinkmann, H. & Philippe, H. Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 6, 361–375 (2005).
7. Rannala, B. & Yang, Z. Phylogenetic inference using whole genomes. Annu. Rev. Genomics Hum. Genet. 9, 217–231 (2008).
8. Snel, B., Huynen, M. A. & Dutilh, B. E. Genome trees and the nature of genome evolution. Annu. Rev. Microbiol. 59, 191–209 (2005).
9. Huson, D. H. & Steel, M. Phylogenetic trees based on gene content. Bioinformatics 20, 2044–2049 (2004).
10. Sato, N. Comparative analysis of the genomes of cyanobacteria and . Genome Informatics 13, 173–182 (2002).
11. Suyama, M. & Bork, P. Evolution of prokaryotic gene order: genome rearrangements in closely related species. Trends in Genetics 17, 10–13 (2001).
12. Coenye, T., Gevers, D., de Peer, Y. V., Vandamme, P. & Swings, J. Towards a prokaryotic genomic . FEMS Microbiology Reviews 29, 147–167 (2005).
13. Clarke, G. P., Beiko, R. G., Ragan, M. A. & Charlebois, R. L. Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. Journal of Bacteriology 184, 2072–2080 (2002).
14. Gophna, U., Doolittle, W. F. & Charlebois, R. L. Weighted genome trees: refinements and applications. Journal of Bacteriology 187, 1305–1316 (2005).
15. Henz, S. R., Huson, D. H., Auch, A. F., Nieselt-Struwe, K. & Schuster, S. C. Whole-genome prokaryotic phylogeny. Bioinformatics 21, 2329–2335 (2005).
16. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990).
17. Satoh, S., Mimuro, M. & Tanaka, A. Construction of a Phylogenetic Tree of Photosynthetic Based on Average Similarities of Whole Genome Sequences. PLoS ONE 8, e70290, https://doi.org/10.1371/journal.pone.0070290 (2013).
18. Ciccarelli, F. D. et al. Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287 (2006).
19. Larkin, M. A. et al. Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947–2948 (2007).
20. Spang, A. et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015).
21. Zaremba-Niedzwiedzka, K. et al. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature 541, 353–358 (2017).
22. Eme, L., Spang, A., Lombard, J., Stairs, C. W. & Ettema, T. J. Archaea and the origin of eukaryotes. Nature Reviews Microbiology 15, nrmicro. 2017, 2133 (2017).
23. Da Cunha, V., Gaia, M., Gadelle, D., Nasir, A. & Forterre, P. Lokiarchaea are close relatives of Euryarchaeota, not bridging the gap between prokaryotes and eukaryotes. PLoS Genetics 13, e1006810 (2017).
24. Nasir, A., Kim, K. M., Da Cunha, V. & Caetano-Anollés, G. Arguments Reinforcing the Three-Domain View of Diversified Cellular Life. Archaea 2016 (2016).
25. Williams, T. A. et al. Integrative modeling of gene and genome evolution roots the archaeal tree of life. Proceedings of the National Academy of Sciences, 201618463 (2017).
26. Hug, L. A. et al. A new view of the tree of life. Nature Microbiology 1, 16048 (2016).


27. Raymann, K., Brochier-Armanet, C. & Gribaldo, S. The two-domain tree of life is linked to a new root for the Archaea. Proceedings of the National Academy of Sciences 112, 6670–6675 (2015).
28. Blank, C. & Sanchez‐Baracaldo, P. Timing of morphological and ecological innovations in the cyanobacteria–a key to understanding the rise in atmospheric oxygen. Geobiology 8, 1–23 (2010).
29. Lecointre, G., Rachdi, L., Darlu, P. & Denamur, E. Escherichia coli molecular phylogeny using the incongruence length difference test. Molecular Biology and Evolution 15, 1685–1695 (1998).
30. Waters, E. et al. The genome of Nanoarchaeum equitans: insights into early archaeal evolution and derived parasitism. Proceedings of the National Academy of Sciences 100, 12984–12988 (2003).
31. Itoh, T., Martin, W. & Nei, M. Acceleration of genomic evolution caused by enhanced mutation rate in endocellular symbionts. Proceedings of the National Academy of Sciences 99, 12944–12948 (2002).
32. Moran, N. A. & , G. R. Genomic changes following host restriction in bacteria. Current Opinion in Genetics & Development 14, 627–633 (2004).
33. Gascuel, O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution 14, 685–695 (1997).
34. Thomas, J. A., Welch, J. J., Lanfear, R. & Bromham, L. A generation time effect on the rate of molecular evolution in invertebrates. Molecular Biology and Evolution, msq009 (2010).
35. Weller, C. & Wu, M. A generation‐time effect on the rate of molecular evolution in bacteria. Evolution 69, 643–652 (2015).
36. Comin, M. & Verzotto, D. Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms for Molecular Biology 7, 34 (2012).
37. Nelson-Sathi, S. et al. Acquisition of 1,000 eubacterial genes physiologically transformed a at the origin of . Proceedings of the National Academy of Sciences 109, 20537–20542 (2012).
38. Jeong, H. & Nasir, A. A Preliminary List of Horizontally Transferred Genes in Prokaryotes Determined by Tree Reconstruction and Reconciliation. Frontiers in Genetics 8, 112 (2017).
39. Ku, C. & Martin, W. F. A natural barrier to lateral gene transfer from prokaryotes to eukaryotes revealed from genomes: the 70% rule. BMC Biology 14, 89 (2016).
40. Ito, H., Yokono, M., Tanaka, R. & Tanaka, A. Identification of a novel vinyl reductase gene essential for the biosynthesis of monovinyl chlorophyll in Synechocystis sp PCC6803. Journal of Biological Chemistry 283, 9002–9011, https://doi.org/10.1074/jbc.M708369200 (2008).

Acknowledgements

We thank Dr. Atsushi Takabayashi and Dr. Ryouichi Tanaka for their valuable discussions. We also thank Steven M. Thompson, from Edanz Group (www.edanzediting.com/ac), for editing a draft of this manuscript.

Author Contributions

M.Y. designed and performed the research. M.Y. and S.S. wrote programs. M.Y. and A.T. wrote the main manuscript. All authors reviewed the manuscript.

Additional Information

Supplementary information accompanies this paper at https://doi.org/10.1038/s41598-018-25090-8.

Competing Interests: M.Y. is an employee of Nippon Flour Mills Co., Ltd. Part of the calculation was done on a computer lent free of charge by Hokkaido University to the Innovation Center of Nippon Flour Mills Co., Ltd.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2018
