Genomic analysis of the uncultivated marine crenarchaeote

Steven J. Hallam*†, Konstantinos T. Konstantinidis*, Nik Putnam‡, Christa Schleper§, Yoh-ichi Watanabe¶, Junichi Sugaharaሻ, Christina Preston**, Jose´ de la Torre††, Paul M. Richardson‡, and Edward F. DeLong*‡‡

*Massachusetts Institute of Technology, Cambridge, MA 02139; ‡Joint Genome Institute, Walnut Creek, CA 94598; §Department of Biology, University of Bergen, Jahnebakken 5, N-5020 Bergen, Norway; ¶Department of Biomedical Chemistry, University of Tokyo, Tokyo 113-0033, Japan; ࿣Institute for Advanced Biosciences, Keio University, Tsuruoka 997-0017, Japan; **Monterey Bay Aquarium Research Institute, Moss Landing, CA 95069; and ††University of Washington, Seattle, WA 98195

Communicated by Carl R. Woese, University of Illinois at Urbana–Champaign, Urbana, IL, September 27, 2006 (received for review June 24, 2006) are ubiquitous and abundant microbial constituents well as the identification and structural elucidation of non- of soils, sediments, lakes, and ocean waters. To further describe the thermophilic crenarchaeotal core lipids (8, 24, 25). cosmopolitan nonthermophilic Crenarchaeota, we analyzed the Fosmid libraries enriched in C. symbiosum genomic DNA pre- genome sequence of one representative, the uncultivated viously were constructed and screened for phylogenetic and func- symbiont Cenarchaeum symbiosum. C. symbiosum genotypes coin- tionally informative gene sequences (19, 22). In a directed effort to habiting the same host partitioned into two dominant populations, genetically characterize C. symbiosum, we systematically selected corresponding to previously described a- and b-type ribosomal overlapping fosmid clones to assemble the full genome comple- RNA variants. Although they were syntenic, overlapping a- and ment. The composite full genome sequence of one ribotype of C. b-type ribotype genomes harbored significant variability. A single symbiosum, along with overlapping regions from related, sympatric tiling path comprising the dominant a-type genotype was assem- genetic variants, provides new perspective on the biological prop- bled and used to explore the genomic properties of C. symbiosum erties of nonthermophilic Crenarchaeota, their predicted metabolic and its planktonic relatives. Of 2,066 ORFs, 55.6% matched genes pathways, population biology, and gene representation in environ- with predicted function from previously sequenced genomes. The mental samples. remaining genes partitioned between functional RNAs (2.4%) and Results hypotheticals (42%) with limited homology to known functional genes. The latter category included some genes likely involved in Genome Assembly and Population Structure. The C. symbiosum the archaeal–sponge symbiotic association. Conversely, 525 C. genome was assembled from a set of 155 completed fosmid symbiosum ORFs were most highly similar to sequences from sequences selected from an environmental library enriched for C. symbiosum Supporting Text marine environmental genomic surveys, and they apparently rep- genomic DNA (see , Tables 2 and 3, and Fig. 4, which are published as supporting information on the PNAS resent orthologous genes from free-living planktonic Crenarcha- web site) (19, 22). The fosmids AF083071 and AF083072 corre- eota. In total, the C. symbiosum genome was remarkably distinct sponding to previously described a- and b-type ribosomal variants from those of other known and shared many core meta- (22) served as nucleation points for the separation of sequence bolic features in common with its free-living planktonic relatives. variants into discrete genomic bins (Tables 2 and 3). Because only ͉ ͉ ͉ ͉ a relatively small number of clones comprise this library, each Archaea Crenarchaea environmental genomics marine microbiology fosmid insert is expected to originate from an independent donor population genomics genome. Therefore, any assembly derived from this sample must represent a composite of related, but potentially nonidentical, lanktonic Crenarchaeota of the domain Archaea (1, 2) now are genotypes; for the purposes of this study, a population genome Precognized to comprise a significant component of marine equivalent. Remarkably, a single tiling path containing the com- microbial biomass, Ϸ1028 cells in today’s oceans (3–5). Although plete genomic complement of C. symbiosum could be assembled marine Crenarchaeota span the depth continuum (6), their numbers from this complex data set, which corresponded to the a-type are greatest in waters just below the photic zone (3, 7). Isotopic population of sequence variants (Table 4, which is published as analyses of lipids suggest that marine Crenarchaeota have the supporting information on the PNAS web site). capacity for autotrophic carbon assimilation (8–12). The recent C. symbiosum population structure was evaluated by analyzing isolation of Nitrosopumilus maritimus, the first cultivated nonther- fosmid sequence variation over the length of the assembled tiling mophilic crenarchaeon, demonstrated conclusively that bicarbon- path (Fig. 1). Overlapping fosmid sequences ranged between Ϸ80% ate and ammonia can serve as sole carbon and energy sources for and 100% nucleotide identity, with the a- and b-type variants at least some members of this lineage (13). The presence and dominating at the extremes. Overlapping a- and b-type fosmids, distribution of gene fragments in environmental samples, bearing although virtually indistinguishable at the level of gene content and weak homology to one of the subunits of ammonia monooxygenase (amoA), also has been recently reported (14–18). Comparative Author contributions: E.F.D. designed research; S.J.H., C.S., C.P., J.d.l.T., P.M.R., and E.F.D. environmental genomic studies have also identified the presence of performed research; S.J.H., K.T.K., N.P., C.S., Y.-i.W., J.S., P.M.R., and E.F.D. analyzed data; a multiple genes in metabolic pathways potentially associated with and S.J.H., K.T.K., and E.F.D. wrote the paper. crenarchaeotal ammonia oxidization and CO2 fixation (19). The authors declare no conflict of interest. Cenarchaeum symbiosum, the sole archaeal symbiont of the Freely available online through the PNAS open access option. marine sponge mexicana (20), falls well within the Abbreviations: TCA, tricarboxylic acid; BSR, BLAST score ratios; SAR, Sargasso Sea; COG, lineage of ubiquitous and abundant planktonic marine Cren- clusters of an orthologous group; WGS, whole-genome shotgun. archaeota (18, 20, 21). Although yet uncultivated, C. symbio- Data deposition: The sequence reported in this paper has been deposited in the GenBank sum can be harvested in significant quantities from host database (accession no. DP000238). tissues, where it comprises up to 65% of the total microbial †Present address: University of British Columbia, Vancouver, BC, Canada V6T 1Z3. biomass (20, 21). These enriched uniarchaeal preparations of ‡‡To whom correspondence should be addressed. E-mail: [email protected]. C. symbiosum have facilitated DNA analyses (20, 22, 23), as © 2006 by The National Academy of Sciences of the USA

18296–18301 ͉ PNAS ͉ November 28, 2006 ͉ vol. 103 ͉ no. 48 www.pnas.org͞cgi͞doi͞10.1073͞pnas.0608549103 Fig. 1. C. symbiosum fosmid population structure. 600 120 160 )

(Upper) Fosmids partition into two distinct popula- p aba + b b ^ K ) ( tion bins corresponding to a-type and b-type ribo- s % e 300 60 80 ( d somal variants. Average nucleotide identity of each i t y

75 o t e i fully sequenced fosmid is plotted against the posi- l t c 0 0 u 0 n tion of each fosmid in the assembled a-type scaffold. N 0.8 0.9 1.0 0.8 0.9 1.0 e 0.8 0.9 1.0

d 80

Blue lines represent the set of fosmids falling within i

e

the a-type population, and red lines indicate the set d i of fosmids falling within the b-type population. (In- t 85 o b e

sets) Histograms represent the overall sequence di- c l vergence among overlapping fosmids. The distribu- u 90 n tion of observed sequence similarity (percentage . G 95 identity) in high-scoring segment pairs for align- V a ments between fosmid clones assigned to population A ‘‘a’’ (Left), between ‘‘a ϩ b’’ populations (Center), 100 * and between fosmid clones assigned to population b 40 Kb

K 200 ‘‘b’’ (Right). (Lower) Number of nucleotide polymor- / s phisms per 1 kb of orthologous sequence shared n 150 o i between overlapping fosmids within the a-type pop- t u

t 100 Ͼ i ulation exhibiting 95% nucleotide identity to the t s genomic scaffold. Gaps in the distribution represent b 50 u genomic intervals covered by a single fosmid clone S 0 see Supporting Text). ٙ, Dashed line represents 500 1000 1500 2000) identity cut-off (Ϸ92% average nucleotide identity) Genomic scaffold position (Kbp) for type-a fosmids used in tiling path construction *, Sequence divergence within type-a fosmids based on comparison of fosmids sharing Ͼ95% average nucleotide identity. organization, differed in average nucleotide identity by Ϸ15% (Fig. 56% of all predicted protein-encoding genes could be assigned to 1). Average nucleotide identity within each set of overlapping a- or functional or conserved roles based on homology searches (see b-type fosmids was Ϸ98%, although the range of variation within Materials and Methods). The distribution of tRNAs was uneven, the b-type population was considerably higher (Fig. 1). To facilitate with the clear majority mapping to two distinct regions of the analyses, fosmid sequences were partitioned by using a 93% identity genome (Fig. 2). cut-off, roughly corresponding to a standard demarcation of bac- terial based on whole-genome analysis (26, 27). To estimate Expanded Gene Families. The C. symbiosum genome contained an the representation of a- and b-type donors in the fosmid library, a- estimated 79 expanded gene families accounting for over 25% of its and b-type sequences were queried against the set of fosmid end coding potential (see Materials and Methods and Table 6, which is sequence reads Ն200 bp in length (see Materials and Methods). This published as supporting information on the PNAS web site). The approach identified on average 11 matches per a-type fosmid, and majority of families were predicted to encode hypothetical proteins 8 matches per b-type fosmid, consistent with 60% representation of with no more than three representatives. However, 15 families the a-type donor genomes in the DNA library (see supporting contained at least four representatives (Table 6). Many families, information for further information). including the two largest (containing 34 and 15 members, respec- To explore the coherence and diversity of donor genotypes Ͼ tively), were predicted to encode hypothetical proteins with limited within the a-type population, overlapping fosmid inserts with 95% homology to surface-layer or extracellular matrix proteins. Repre- nucleotide identity were evaluated for nucleotide polymorphisms sentatives of these families often contained high levels of nucleotide (see Supporting Text). On average, these overlapping sequences exhibited 25–30 nucleotide polymorphisms per 1 kbp. The majority of these cases involved variation within intergenic regions or Table 1. C. symbiosum genome features synonymous changes within ORFs (77% synonymous compared with 23% nonsynonymous changes). However, ‘‘hot-spots’’ of nu- Specifications* cleotide variation were detected in some orthologous genes Size, bp 2,045,086 (Ͼ50–80 polymorphisms per kbp). These changes often were Average G ϩ C content, % 57.74 associated with the presence of a variant allele within one or more Predicted ORFs 2,066 ORF density, gene͞kb 0.986 of the expanded gene families (see below), typically originating Average ORF length (bp) 924 from a donor genotype not covered by sequenced fosmids. To gain Coding percentage, % 91.2 insight into processes shaping allelic diversity between populations, ORF content the ratio of nonsynonymous to synonymous substitutions for 836 Predicted functional 1,067 orthologous C. symbiosum genes common to a- and b-type popu- Predicted functional in COGs 1,024 lations was determined (Fig. 5, which is published as supporting Conserved hypothetical 88 information on the PNAS web site). Hypothetical 863 RNA genes 16S-23S rRNA operon 1 Genome Features. The assembled C. symbiosum genome sequence 5S rRNA 1 is represented by a 2,045,086-bp single circular chromosome, with tRNAs 45 a 57.74% average G ϩ C content (Table 1). No clear origin of Expanded gene families† replication could be identified using standard criteria (28, 29). A Number of families 79 total of 2,017 protein-encoding genes were predicted in the genome Number of genes in families 263 sequence, as well as a single copy of a linked small subunit–large Coding percentage, % 26.78 subunit ribosomal RNA (rRNA) operon, 1 copy of a 5S rRNA, 45 *See Materials and Methods for fosmid assembly parameters. Ϫ predicted transfer RNAs (tRNA) (Table 5, which is published as †Based on following cutoffs: expection Ն 1e 20, bitscore Ն 100, identity Ն MICROBIOLOGY supporting information on the PNAS web site). Approximately 40%, and overlap Ն 100 aa.

Hallam et al. PNAS ͉ November 28, 2006 ͉ vol. 103 ͉ no. 48 ͉ 18297 1 cycle, were present, as predicted (19), in the fully assembled C. symbiosum genome. A single operon encoding oxoacid:ferredoxin oxidoreductase subunits suggested that C. symbiosum utilizes at a minimum an incomplete branched TCA cycle for production of intermediates in cofactor and amino acid biosynthesis. Because succinate dehydrogenase and fumarase also are involved in the 3-hydroxypropionate cycle, it is possible they may serve dual roles A N as both hydroxypropionate and TCA cycle components in these 5S rR Crenarchaeota. With the exception of glucokinase (EC 2.7.1.2) and pyruvate kinase (EC 2.7.1.40), C. symbiosum appears to contain an intact form of the Embden–Meyerhof–Parnas (EMP) pathway for the metabolism of hexose sugars. The lack of glucokinase and Cenarchaeum symbiosum pyruvate kinase may indicate that the EMP functions in the gluconeogenic direction rather than the glycolytic pathway, as has been proposed for other Archaea (31). Several alternatives to glucokinase, however, including two genes predicted to encode carbohydrate kinases of unknown specificity and one gene pre- dicted to encode a ROK-family ribokinase, were identified. Simi-

-23S S larly, a gene predicted to encode phosphoenolpyruvate synthase 6 1 also was present. The absence of glucose 1-dehydrogenase (EC 1.1.1.47), gluconolactonase (EC 3.1.1.17), and 2-keto-3-deoxy glu- conate aldolase (EC 4.1.2.–) homologues suggested that C. sym- biosum does not use the Entner–Duodoroff (ED) pathway in the catabolism of hexose sugars. In addition to the EMP pathway, an COG Category intact nonoxidative pentose phosphate pathway was identified, J Translation,ribosomal structure and biogenesis G Carbohydrate transport and metabolism providing a mechanism for production of NADPH and ribose K Transcription E Amino acid transport and metabolism L DNA replication, recombination and repair F Nucleotide transport and metabolism sugars for nucleotide biosynthesis. D Cell division and chromosome partitioning H Coenzyme metabolism T Signal transduction mechanisms I Lipid metabolism M Cell envelope biogenesis, outer membrane P Inorganic ion transport and metabolism N Cell motility and secretion Q Secondary metabolites biosynthesis, transport,catabolism Energy Metabolism. Homologues of genes potentially associated U Intracellular trafficking and secretion R General function prediction only O Posttranslational modif., turnover, chaperones S Function unknown with chemolithotrophic ammonia oxidation, including ammonia C Energy production and conversion - Not in COGs monooxygenase, ammonia permease, urease, a urea transport Expanded Gene Family 1 3 5 7 9 and RNAs 11 13 system, putative nitrite reductase, and nitric oxide reductase ac- 2 4 6 8 10 12 cessory protein (19) were represented in the C. symbiosum genome Fig. 2. The C. symbiosum genome. Nested circles from outermost to inner- (Table 6). Several loci, including the ammonia permease and most represent the following information. (i) Gene content predicted on the ammonia monooxygenase subunit C, were expanded gene-family forward strand. (ii) Gene content predicted on the reverse strand. The color of members (Table 6). Homologues for (bacterial) hydroxylamine predicted ORFs is based on COG functional categories (see key for color oxidoreductase (EC 1.7.3.4) and cytochromes c554 and c552 were not designations). (iii) Conservation of predicted C. symbiosum genes in the identified. If C. symbiosum does indeed derive energy directly from unassembled set of WGS data from the SAR. (iv) Conservation of predicted C. ammonia oxidation, it appears to employ different mechanisms symbiosum genes in the set of published and completed microbial genomes than typical nitrifying bacteria for oxidizing hydroxylamine. Con- (see Supporting Text). The height of the bars in circles 3 and 4 indicates the BSR sistent with this hypothesis, 14 genes predicted to encode domains for the set of predicted C. symbiosum proteins, queried against the public ͞ genomes and SAR, respectively, spanning a range of BSR values between related of the plastocyanin azurin family of blue type (I) copper Ϸ30% and 100% amino acid identity. (v) The extent of polymorphisms within proteins were identified with potential to substitute for cytochromes the type-a population shown in Fig. 1 mapped on the genome (see Fig. 1 for as mobile electron carriers (32, 33). details). (vi) Expanded gene families (discussed in the text; see key for color designations and Table 6 for additional information); note that high numbers Other Genomic and Metabolic Features. The complement of core of polymorphisms (circle 5) frequently coincide with the expanded protein genes expected for Archaea in general were mostly present in the families. (vii) tRNA and rRNA gene positions. (viii)Gϩ C content deviation genome of C. symbiosum. (More detailed discussion of specific from the mean (57.5%) in 1,000-bp windows. metabolic features of can be found in Supporting Text). Genes required to synthesize all 20 aa, with the exception of proline, were present in the C. symbiosum genome. Nearly complete sets of genes polymorphism, corresponding to hot-spots of allelic diversity (Fig. required for the de novo synthesis of biotin, vitamin B12, riboflavin, 4, which is published as supporting information on the PNAS web thiamine, and pyridoxine all were identified. In the case of folic acid site). biosynthesis, genes encoding all steps for the conversion of the C1 The largest family contained genes ranging between 2 and 35 kbp carrier tetrahydrofolate (THF) to methyl-THF were identified. The in size, predicted to encode numerous big archaeal proteins (bap) C. symbiosum genome contains the full repertoire of genes neces- Ϸ of unknown function. The bap genes represent 15% of the entire sary for chromosomal replication fork assembly and function, genome and 56% of all expanded gene families. Sequence analysis including components of the origin recognition complex (cdc6), two identified putative signal peptide cleavage sequences in just over topoisomerases, single- and double-stranded helicases, three copies 70% (24͞34) of all predicted bap family members. Membrane- of a predicted bacterial͞archaeal-type DNA primase, a two-subunit spanning domains in bap family members were not identified. All eukaryal͞archaeal DNA primase system, RNase H, sliding clamp, but three family members contained one or more WD40 or ␤ and DNA ligase (cdc9). Genes encoding two distinct DNA poly- propeller domain repeats, suggesting potentially shared protein- merases were identified, including a single B family DNA polymer- binding domains or interactions (30). ase I elongation subunit related to those of thermophilic Crenar- chaeota (23) and a second euryarchaeotal-like DNA polymerase II Central Metabolism. Multiple components of a putative CO2 assim- including both large and small subunits. As predicted (34), a ilation pathway based on a modified 3-hydroxypropionate cycle, eukaryal-like, single copy histone H3–H4, the first to be found in and enzymes involved in the oxidative tricarboxylic acid (TCA) any crenarchaeote, was present in the assembled C. symbiosum

18298 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0608549103 Hallam et al. genome. Ten of the 45 predicted tRNAs in C. symbiosum contain putative introns (Table 5) (35). Most of the exon–intron boundaries 100 form the conserved bulge–helix–bulge motif (BHB), although 80 58 73 several appear to adopt structurally divergent forms (36). Such 60 divergent features previously have been correlated with the pres- 40 ence of two distinct copies of the splicing endonuclease (endA) (37–40). Consistent with these observations, the C. symbiosum 63 80 genome encodes two copies of endA. 77 60 Phylogeny and Comparative Environmental Genomics. We phyloge- netically compared the newly available C. symbiosum ORFs (in- cluding ribosomal proteins, elongation factors, SecY, and DNA repair proteins), to orthologues from other lineages (unpublished data; Fig. 6, which is published as supporting information on the PNAS web site). The lack of close relatives, and paucity of cren- archaeotal genomes in current databases, complicated these anal- yses because of unbalanced taxon representation. In aggregate, phylogenetic analyses of the r-protein alignments and of conserved nonribosomal proteins did not resolve the phylogenetic placement of C. symbiosum beyond previous results of rRNA analyses. To explore the shared coding potential between C. symbiosum and its planktonic relatives, marine metagenomic data (15) were 80 63 75 aligned to the assembled C. symbiosum genome (Fig. 2; see 60 Materials and Methods). The distribution of planktonic crenarcha- 40 eotal homologues over the length of the C. symbiosum genome 100 SAR6 varied considerably between different marine samples. Coverage 80 was greatest in the Sargasso Sea sample 3 (SAR3) with over 4,000 60 60 74 unique reads averaging 65% amino acid identity and 78% amino 40 acid similarity over the length of the aligning read. This finding represents Ϸ1.25% of the total sequence population in the SAR3 100 SAR7 sample. The depth of sequence coverage was uneven, varying 80 57 between 1- and Ͼ20-fold between homologous intervals. More than 60 72 20% of the aligning sequences were derived from mate pairs 40 mapping within the average range of insert sizes (3–6 kb), suggest- 0 500 1000 1500 2000 ing gene order is conserved between C. symbiosum and its plank- Nucleotide position (Kbp) Expanded Gene Family tonic relatives over short syntenic intervals. Numerous gaps in 1 2 3 sequence coverage also were identified, indicating a significant proportion of C. symbiosum genes are absent or not well conserved Fig. 3. Comparative analysis of C. symbiosum and SAR sample bins. Coverage within planktonic Crenarchaeota (Fig. 3). plots for individual SAR sample bins aligned to the C. symbiosum genomic C symbiosum scaffold (see Materials and Methods). Each vertical bar represents an individ- All protein-encoding sequences predicted in the . ual WGS read. The y axis on each plot corresponds to the percentage amino genome were queried against the SAR whole-genome shotgun acid sequence similarity for each aligning read. The average percentage (WGS) data as well as the set of public genomes (see Materials and amino acid identity (top) and similarity (bottom) for SAR WGS reads aligning Methods). The resulting alignments were compared by using to the C. symbiosum genome is shown to the far right of each coverage plot. BLAST score ratios (BSR) to identify highly conserved genes For simplified visualization of gaps in the alignment, all matches are replotted shared between C. symbiosum, SAR, and public genomes (Fig. 2). near the base of the x axis to form a normalized 1D plot spanning the reference A total of 65 genes with a BSR Ն 30 were more highly conserved sequence. To illustrate the gene content of gaps within the distribution of between C. symbiosum and the public genomes (Table 7, which is aligning WGS reads, several regions corresponding to expanded gene families published as supporting information on the PNAS web site). Of 1–3 (Table 4) unique to the C. symbiosum genome are highlighted. (For these, 43 fell into defined clusters of orthologous group (COG) coverage plots for SAR sample 3 aligned to five archaeal reference genomes, see Fig. 7, which is published as supporting information on the PNAS web site.) categories, including 10 genes associated with DNA replication, recombination, and repair (L), 9 genes associated with amino acid transport and metabolism (E), and 6 genes associated with post- and public genomes or were not well conserved at all (Fig. 2 and translational modification, protein turnover, and chaperones (O). data not shown). The latter case, represented by gaps in both the The distribution of genes within these three categories was far from circular genome map (Fig. 2) and coverage plots (Fig. 3), encom- random. For instance, within the first category, seven genes were passed Ͼ800 genes with a BSR Ͻ 30, corresponding to Ϸ39% of all most similar to bacterial associated DNA modification methyltrans- predicted protein-encoding genes in the C. symbiosum genome. ferases, and within the third category, five genes were homologous to serine protease inhibitors (serpins). A total of 525 genes with a Discussion BSR Ն 30 were more highly conserved between C. symbiosum and the SAR data set, corresponding to Ϸ26% of all predicted protein- Population Structure and Genomic Coherence. The C. symbiosum encoding genes in the C. symbiosum genome (Table 8, which is genome represents a composite sequence assembled from individ- published as supporting information on the PNAS web site). This ual, closely related sympatric donor genotypes. As such, the genetic set of shared genes spanned the complete spectrum of COG plasticity and population structure of host-associated C. symbiosum categories, with highest representation in energy production and cells is in part reflected in genome sequence variants. Rearrange- conversion (C), amino acid transport and metabolism (E), trans- ments or transpositions between overlapping syntenic regions were lation, ribosomal structure, and biogenesis (J), transcription (K), seldom detected, and in only two instances were recent intragenic and DNA replication, recombination, and repair (L). The remain- recombination events unambiguously detected. Although we sam- MICROBIOLOGY ing gene predictions either were shared equally between the SAR pled only a small fraction of C. symbiosum donor genomes within

Hallam et al. PNAS ͉ November 28, 2006 ͉ vol. 103 ͉ no. 48 ͉ 18299 the host, the rarity of recombination events suggests that recom- appear now to fall well within the Crenarchaeota based on bination and hybridization are less important than are clonal trees with well supported nodes having broad taxon represen- diversification and genetic drift in C. symbiosum populations. This tation (1). dynamic is qualitatively different from recently described acid-mine In aggregate, analyses of individual or concatenated protein drainage archaeal populations, where widespread recombination is alignments and phylogenetic analyses did not resolve the phyloge- proposed to generate multiple mosaic genotypes (41). netic placement of C. symbiosum beyond previous rRNA analyses The separation of syntenic a- and b-type sequences during (1, 47). This finding is mostly attributable to the currently poor assembly potentially reflects periodic selection within the sponge representation of Crenarchaeota in existing genomic databases. host tissues, partitioning population variants into distinct sequence Despite the relatively low number of crenarchaeotal genomes clusters (42, 43). Given that gene content, order, and orientation available for comparison, our results generally supported previous among overlapping a- and b-type genotypes are identical, selective rRNA phylogenetic analyses, placing C. symbiosum peripheral to forces are likely acting on individual genes or their expression. The the crenarchaeotal lineage of cultivated hyperthermophiles. frequency of highly variable alleles (indicated by large peaks in Fig. 1) indicates that selective pressures act with variable intensity on Symbiosis. Little is known about specific functional relationships different regions of the C. symbiosum genome. Fine scale deter- between C. symbiosum and its sponge host, Axinella mexicana. Close mination of a- and b-type genotype spatial distribution in host relatives of C. symbiosum, however, seem commonly associated tissues, as well as deeper sampling of allelic variation in C. symbio- with other Axinella or other sponge hosts (48, 49), so similar sum populations, are necessary to further explore the nature, associations may be common in the marine environment. With expression, and consequences of this intrapopulation genetic respect to potential symbiotic metabolic interactions, carbon and variability. nitrogen exchanges are common in many microbial–eukaryal sym- bioses, and sponge-associated nitrification also has been reported Functional and Metabolic Relationships. The results presented here, previously (50). One possible interaction, consistent with the C. in combination with previous studies (13, 16, 19), support the notion symbiosum genome complement and the nitrifying phenotype of N. that C. symbiosum and its planktonic marine relatives may derive maritimus, is removal of nitrogenous host-waste products (e.g., cellular energy directly from the oxidation of ammonia. C. symbio- ammonia, urea). This could simultaneously fuel the symbiont’s sum harbors a number of genes known to be relevant to ammonia respiratory energy metabolism and might even provide new carbon metabolism. Whether ammonia is the sole source of energy for to the host, via archaeal chemolithotrophic CO2 fixation, and either C. symbiosum or planktonic Crenarchaeota is unclear, but this subsequent symbiont–host carbon exchange. appears to be so for its recently cultivated crenarchaeotal relatives Viable and dividing populations of C. symbiosum have been (13). Although carbon fixation remains to be formally demon- observed to persist in a single sponge host individual for up to strated for C. symbiosum, the hydroxypropionate pathway appears 5 years (21). As an apparently nonmotile extracellular symbiont, present in this archaeal symbiont (Table 6) (19). Closely related C. symbiosum likely has developed mechanisms to inhibit or homologues of virtually all genes associated with ammonia oxida- evade host consumption and defend against viral predation. A tion and carbon fixation in C. symbiosum also were present in significant number of predicted genes encode domains homol- environmental samples having known high crenarchaeotal repre- ogous to cell surface, regulatory, or defense mechanisms, in- sentation (Fig. 3 and Table 8) (19). Combined with well docu- cluding numerous restriction modification systems to protect mented high crenarchaeotal biomass in marine plankton (3, 6, 11, 44), these data are consistent with a major role for crenarchaeotal against foreign DNA, autotransporter adhesins potentially in- nitrification in carbon and nitrogen cycling in the sea. volved in mediating cell–cell contact, proteases that possibly Comparisons with the SAR WGS data set indicated that the modify or degrade extracellular matrix proteins, glycosyltrans- majority of other core metabolic subsystems identified in C. sym- ferases involved in cell wall biogenesis, and secreted seine biosum also were well conserved in planktonic Crenarchaeota.In protease inhibitors with the potential to mediate evasion of addition to information processing systems, homologies of C. innate host defense systems. Many of these genomic features are symbiosum biosynthetic and housekeeping subsystems (including not found in the planktonic relatives of C. symbiosum, and glycolysis, gluconeogenesis, pentose phosphate conversion, TCA therefore may be specifically associated with the symbiotic cycle, cofactor and vitamin metabolism, amino acid biosynthesis, lifestyle of C. symbiosum. Given the few known archaeal– oxidative phosphorylation, and ATP synthesis) were most fre- metazoan symbioses, the C. symbiosum genome sequence pro- quently found in environmental samples known to contain high vides a unique opportunity to further explore the genetic levels of free-living planktonic Crenarchaeota. These results suggest features mediating archaeal–eukaryal host contact, communi- that the majority of core metabolic functions found in C. symbiosum cation, and trophic exchange. It also provides a reference point also are present in its planktonic relatives. Conversely, a consider- for interpreting the genomic inventory, metabolic features, and able number of C. symbiosum-unique genes also were found, which evolution of its free-living relatives, which are abundant com- potentially are involved in the archaeal–sponge symbiotic associa- ponents of microbial plankton and may exert significant influ- tion (see below). ence on energy and matter cycling in the sea.

Evolutionary Relationships. The domain Archaea remains well Materials and Methods defined by two major subkingdoms, the Crenarchaeota and Library Construction, Specifications, and Sequencing. C. symbiosum Euryarchaeota (2). Cultivation-independent phylogenetic sur- cell enrichment, DNA extraction from sponge tissue, and fosmid veys (45) have revealed many new environmentally significant library construction and sequencing protocols have been previously clades within the domain Archaea. These environmental described (19, 51). Complete genome annotation files are available clades now outnumber archaeal lineages having cultivated through the Joint Genome Institute’s Integrated Microbial Ge- representatives (1, 18, 46, 47). Ribosomal rRNA-based phy- nomes system (http:͞͞img.jgi.doe.gov͞cgi-bin͞pub͞main.cgi) and logenetic analyses of predominant cultivated and uncultivated through the National Center for Biotechnology Information’s web archaeal groups suggest that Crenarchaeota and Euryarchaeota portal (www.ncbi.nlm.nih.gov). Individual fosmid sequences can be are best represented by polytomies, ‘‘star radiations,’’ with obtained from GenBank under the accession nos. DQ397540– poor intragroup resolution relative to their common ancestral DQ397640 and DQ397827–DQ397878, corresponding to a- and node (1). As a consequence, groups previously thought to be b-type population bins, respectively. The complete a-type genome more deeply branching, for example, the Korarchaeota (46), sequence can be obtained from GenBank under accession no.

18300 ͉ www.pnas.org͞cgi͞doi͞10.1073͞pnas.0608549103 Hallam et al. DP000238. Additional information can be found in the Supporting proteins longer than 300 aa were split into 300-aa-long consecutive Text. fragments, which then were queried against the SAR database. Evaluation of gene conservation was based on analysis of BSR Nucleotide Polymorphism Determination. To identify orthologous among C. symbiosum, SAR, and public genomes (54) by using regions (defined here as the reciprocal best matches with the tBLASTn. BLASTn algorithm (52) and a minimum cut-off of 50% identity Coverage plots relating the set of WGS reads from individual over a minimal 700-bp interval), the complete genomic scaffold was SAR sample bins, SAR1–7 (www.venterinstitute.org͞sargasso) divided into 1,000-bp-long (1-kbp-long) consecutive fragments and (15), to the C. symbiosum genomic scaffold were generated by using searched against the set of fosmids. The same analysis was per- the Promer program implemented in MUMmer 3.18 (55). See formed at the gene level as well, by using the set of genes annotated Supporting Text for further details. on the genomic scaffold as reference sequences. When the total length of orthologous sequences (gene-level comparisons) between Phylogenetic Analysis. Phylogenetic analyses were performed by a fosmid and the genomic scaffold was longer than 5 kbp, the fosmid using maximum-likelihood methods implemented in PHYML was considered to derive from C. symbiosum, and the average (http:͞͞atgc.lirmm.fr͞phyml) (56). See Supporting Text for further nucleotide identity between the fosmid and the genome was details. calculated directly from the resulting BLASTn output. Orthologous regions between C. symbiosum fosmids (1-kbp window compari- We thank Celine Brochier, Asuncion Martinez, Tsultrim Palden, Tracy sons) subsequently were aligned by using clustalw (53). The number Mincer, Matthew Sullivan, Maureen Coleman, Jarod Chapman, Sam Pitluck, Chris Detter, Krishna Palaniappan, and the Joint Genome of invariable and variable bases in the 1-kbp fragments, which are Institute staff for computational and technical assistance. J.S. and shown in Fig. 1, was calculated directly from the clustalw alignments Y.-i.W. thank Nozomu Yachie, Masaru Tomita, and Akio Kanai at Keio for the fosmids that showed Ͼ95% average nucleotide identity to University for tRNA analysis with SPLITS. We thank two anonymous the genomic scaffold. reviewers and Norman Pace, whose advice and suggestions greatly improved the final manuscript. This study was supported by National Comparative Analysis Between C. symbiosum, Public Genomes, and Science Foundation Grants MCB0236541, MCB0509923, and the SAR Data Set. The SAR database contained the complete set of MCB0348001 (to E.F.D.), a Gordon and Betty Moore Foundation unassembled, vector-trimmed, WGS sequences (15), whereas the Award (to E.F.D.), the U.S. Department of Energy’s Office of Science, Biological, and Environmental Research Program and the University of public genomes database included all whole-genome sequences California–Lawrence Livermore National Laboratory under Contract accessible through National Center for Biotechnology Informa- W-7405-ENG-48, Lawrence Berkeley National Laboratory Contract tion’s ftp site as of December 2006 (260 genomes in total). Because DE-AC03–765F00098, and Los Alamos National Laboratory Contract the SAR average read-length is only Ϸ818 bases, C. symbiosum W-7405-ENG-36.

1. Robertson CE, Harris JK, Spear JR, Pace NR (2005) Curr Opin Microbiol 8:638–642. 26. Konstantinidis KT, Tiedje JM (2005) J Bacteriol 187:6258–6264. 2. Woese CR, Kandler O, Wheelis ML (1990) Proc Natl Acad Sci USA 87:4576–4579. 27. Konstantinidis KT, Tiedje JM (2005) Proc Natl Acad Sci USA 102:2567–2572. 3. Karner MB, DeLong EF, Karl DM (2001) Nature 409:507–510. 28. Zhang R, Zhang CT (2005) Archaea 1:335–346. 4. DeLong EF (1992) Proc Natl Acad Sci USA 89:5685–5689. 29. Kelman Z (2000) Trends Biochem Sci 25:521–523. 5. Fuhrman JA, McCallum K, Davis AA (1992) Nature 356:148–149. 30. Smith TF, Gaitatzes C, Saxena K, Neer EJ (1999) Trends Biochem Sci 24:181–185. 6. Massana R, Murray AE, Preston CM, DeLong EF (1997) Appl Environ Microbiol 63:50–56. 31. Hutchins AM, Holden JF, Adams MW (2001) J Bacteriol 183:709–715. 7. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard NU, Martinez A, Sullivan 32. Mattar S, Scharf B, Kent SB, Rodewald K, Oesterhelt D, Engelhard M (1994) J Biol Chem MB, Edwards R, Brito BR, et al. (2006) Science 311:496–503. 269:14939–14945. 8. Damste JS, Schouten S, Hopmans EC, van Duin AC, Geenevasen JA (2002) J Lipid Res 33. Scharf B, Engelhard M (1993) Biochemistry 32:12894–12900. 43:1641–1651. 34. Cubonova L, Sandman K, Hallam SJ, Delong EF, Reeve JN (2005) J Bacteriol 187:5482– 9. Pearson A, McNichol AP, Benitez-Nelson BC, Hayes JM, Eglinton TI (2001) Geochem 5485. Cosmochim Acta 65:3123–3137. 10. Wuchter C, Schouten S, Boschker HT, Sinninghe Damste JS (2003) FEMS Microbiol Lett 35. Sugahara J, Yachie N, Sekine Y, Soma A, Matsui M, Tomita M, Kanai A (2006) In Silico 219:203–207. Biol 6:0039. 11. Herndl GJ, Reinthaler T, Teira E, van Aken H, Veth C, Pernthaler A, Pernthaler J (2005) 36. Marck C, Grosjean H (2003) RNA 9:1516–1531. Appl Environ Microbiol 71:2303–2309. 37. Tocchini-Valentini GD, Fruscoloni P, Tocchini-Valentini GP (2005) Proc Natl Acad Sci 12. Ingalls AE, Shah SR, Hansman RL, Aluwihare LI, Santos GM, Druffel ER, Pearson A USA 102:8933–8938. (2006) Proc Natl Acad Sci USA 103:6442–6447. 38. Yoshinari S, Fujita S, Masui R, Kuramitsu S, Yokobori S, Kita K, Watanabe Y (2005) 13. Konneke M, Bernhard AE, de la Torre JR, Walker CB, Waterbury JB, Stahl DA (2005) Biochem Biophys Res Commun 334:1254–1259. Nature 437:543–546. 39. Calvin K, Hall MD, Xu F, Xue S, Li H (2005) J Mol Biol 353:952–960. 14. Francis CA, Roberts KJ, Beman JM, Santoro AE, Oakley BB (2005) Proc Natl Acad Sci USA 40. Yoshinari S, Itoh T, Hallam SJ, Delong EF, Yokobori SI, Yamagishi A, Oshima T, Kita K, 102:14683–14688. Watanabe YI (2006) Biochem Biophys Res Commun 346:1024–1032. 15. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen 41. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, I, Nelson KE, Nelson W, et al. (2004) Science 304:66–74. Rubin EM, Rokhsar DS, Banfield JF (2004) Nature 428:37–43. 16. Wuchter C, Abbas B, Coolen MJ, Herfort L, van Bleijswijk J, Timmers P, Strous M, Teira 42. Cohan FM (2001) Syst Biol 50:513–524. E, Herndl GJ, Middelburg JJ, et al. (2006) Proc Natl Acad Sci USA 103:12317–12322. 43. Cohan FM (2002) Annu Rev Microbiol 56:457–487. 17. Treusch AH, Leininger S, Kletzin A, Schuster SC, Klenk HP, Schleper C (2005) Environ 44. DeLong EF, Wu KY, Prezelin BB, Jovine RV (1994) Nature 371:695–697. Microbiol 7:1985–1995. 45. Pace NR (1997) Science 276:734–740. 18. Schleper C, Jurgens G, Jonuscheit M (2005) Nat Rev Microbiol 3:479–488. 46. Barns SM, Delwiche CF, Palmer JD, Pace NR (1996) Proc Natl Acad Sci USA 93:9188–9193. 19. Hallam SJ, Mincer TJ, Schleper C, Preston CM, Roberts K, Richardson PM, DeLong EF 47. DeLong EF (1998) Curr Opin Genet Dev 8:649–654. (2006) PLoS Biol 4:e95. 48. Holmes B, Blanch H (July 4, 2006) Mar Biol, 10.1007͞s00227-006-0361-x. 20. Preston CM, Wu KY, Molinski TF, DeLong EF (1996) Proc Natl Acad Sci USA 93:6241– 49. Margot H, Acebal C, Toril E, Amils R, Fernandez Puentes J (2002) Mar Biol 140:739–745. 6246. 21. Preston CM (1998) PhD thesis (Univ of California, Santa Barbara). 50. Diaz MC, Ward BB (1997) Mar Ecol Prog Ser 156:97–107. 22. Schleper C, DeLong EF, Preston CM, Feldman RA, Wu KY, Swanson RV (1998) J Bacteriol 51. Stein JL, Marsh TL, Wu KY, Shizuya H, DeLong EF (1996) J Bacteriol 178:591–599. 180:5003–5009. 52. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) J Mol Biol 215:403–410. 23. Schleper C, Swanson RV, Mathur EJ, DeLong EF (1997) J Bacteriol 179:7803–7811. 53. Thompson JD, Higgins DG, Gibson TJ (1994) Nucleic Acids Res 22:4673–4680. 24. DeLong EF, King LL, Massana R, Cittone H, Murray A, Schleper C, Wakeham SG (1998) 54. Rasko DA, Myers GS, Ravel J (2005) BMC Bioinformatics 6:2. Appl Environ Microbiol 64:1133–1138. 55. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL (2004) 25. Schouten S, Hopmans EC, Pancost RD, Damste JS (2000) Proc Natl Acad Sci USA Genome Biol 5:R12. 97:14421–14426. 56. Guindon S, Gascuel O (2003) Syst Biol 52:696–704. MICROBIOLOGY

Hallam et al. PNAS ͉ November 28, 2006 ͉ vol. 103 ͉ no. 48 ͉ 18301 Hallam et al., PNAS 2006 Supplementary material Supporting Text

C. symbiosum Population Structure. To determine the role of recombination in shaping sequence diversity within the C. symbiosum fosmid population, alignments of overlapping a- and b-type fosmids with greater than 2-fold coverage, were manually inspected. Only two recent intragenic recombination events, one within the a-type population, shared between three overlapping fosmids, and one between two overlapping a- and b-type fosmids were unambiguously identified (data not shown). Moreover, rearrangements or transpositions between overlapping fosmids were seldom detected. Fosmids were absolutely syntenic over a given coding interval, unless interrupted by exact insertions or deletions.

To gain insight into processes shaping allelic diversity between populations, the ratio of nonsynonymous to synonymous substitutions (Ka/Ks; ref. 1) for 836 orthologous C. symbiosum genes common to a- and b-type populations was determined (Fig. 5). Overall, 12 ORFs predicted to encode hypothetical proteins were identified with Ka/Ks >1. These may represent adaptive loci helping to shape or reinforce genotypic isolation.

Electron Transport and ATP Synthesis. Three electron transport complexes, including a complete respiratory NADH dehydrogenase (complex I), succinate dehydrogenase (complex II), and cytochrome C oxidase (complex IV) were unambiguously identified. In addition, an operon predicted to encode a Rieske iron--sulfur cluster protein, cytochrome b, and type (I) copper protein with the potential to function as a cytochrome c reductase (complex III) was identified. Finally, a gene cluster predicted to encode a complete archaeal ATP synthase (complex V) was identified, completing the aerobic respiratory circuit required for the indirect coupling of electron transfer to ATP synthesis. In the case of the NADH dehydrogenase (nuo), a gene cluster encoding the core respiratory complex nuoABCDHIJKLMN, was identified. However, genes encoding the electron-input module nuoEFG, involved in NADH binding and oxidation were not identified, supporting the hypothesis that C. symbiosum uses alternative electron carriers, potentially including type (I) copper proteins or ferredoxins. In addition to the proposed pathway of ammonia oxidation described above, seven genes predicted to encode Fe-S cluster oxidoreductases and five genes predicted to encode ferredoxins were identified with the potential to contribute electrons into the respiratory chain or to act as low-potential electron donors for various enzymatic reactions. Amino Acid and Cofactor Biosynthesis. Although 19 complete amino acid biosynthetic pathways were identified, we could not identify all genes for L-proline biosynthesis. Genes predicted to encode pyrroline-5-carboxylate reductase (EC 1.5.1.2) and ornithine cyclodeaminase (EC 4.3.1.12) involved in the conversion of 1-pyrroline-5-carboxylate and ornithine to L-proline, were not identified. However, several genes predicted to encode aminopeptidases were present, as was an oligo-transport system containing a linked permease and an unlinked ATPase component, with the potential to mediate uptake of free peptides. A solute binding protein with the potential to interact with this transport complex was identified in close proximity to the predicted permease components. Transporter Systems. The largest group of transporters in C. symbiosum consisted of the multisubunit ABC family and included transporters for Mn2+/Zn2+, Ni2+, Fe3+ hydroxamate, phosphate/phosphonate, nitrate/sulfonate/bicarbonate, branched chain amino acids, dipeptides, and multidrug resistance. A variety of genes encoding proteins involved in the transport or exchange of ammonia, urea, Na+/H+, K+/H+, Mg2+/Co2+, Mn2+/Fe2+, Ca+/Na+ and unspecified divalent and heavy metal cations, were also identified. Protein Translocation and Secretion. The C. symbiosum genome encoded a complete set of archaeal signal recognition particle (SRP) components for targeting secretary and membrane proteins [srp19, srp54, a 7S RNA gene and SRP receptor homologue (ftsY)]. The ftsY gene was found as part of a larger gene cluster containing additional secretory pathway components including a secY translocase homologue, and a second gene predicted to encode an integral membrane protein with similarity to the Sec accessory subunit YidC. Definitive homologues for secE and secG, two additional subunits of the archaeal Sec translocation pore complex, were not identified. Three copies of a conserved hypothetical

S1 Hallam et al., PNAS 2006 Supplementary material gene with limited homology to eukaryotic vacuolar sorting factors and several components of the Sec- independent twin argine translocation (tat) system, including tatA and tatC were also identified in the genome sequence.

Signaling, Motility, and Cell Surface Features. The absence of genes encoding classical two- component sensory and motility systems suggests that C. symbiosum is nonmotile. A total of 6 genes predicted to encode signaling kinases, including 4 serine/threonine kinases, 1 signal transduction histidine kinase and 1 unusual protein kinase of unknown function, were identified with the potential to transduce cell surface events or regulate cellular function. Two genes predicted to encode protein phosphatases specific to serine/threonine and tyrosine were also identified, consistent with the operation of reversible phosphorylation cascades and regulatory networks in C. symbiosum.

C. symbiosum contains numerous genes encoding cell surface features, including membrane proteins mediating signaling, cell envelope formation, phospholipid binding and modification, glycosylation, and serine or metal-dependent protein degradation.

Information Processing and Chromatin Dynamics. The C. symbiosum genome contained the full repertoire of genes necessary for chromosomal replication fork assembly and function, and two distinct DNA polymerases were identified, including a single B family DNA polymerase I elongation subunit related to those of thermophilic Crenarchaeota (1), and a second euryarchaeotal-like DNA polymerase II including both large and small subunits. Numerous genes involved in DNA repair were also present, including RadA recombinase, nucleotidyltransferase, O-methyltransferase, and a complete uvr nucleotide excision repair system. In addition to DNA replication and repair systems, 3 genes predicted to encode ATPases typically associated with chromosome partitioning and maintenance, including a homologue of structural maintenance of chromosomes (smc), a membrane-associated ATPase (minD) and a gene predicted to encode the cell division protein ftsZ, were identified. Previous analyses indicated the presence of a eukaryal-like histone homologue in C. symbiosum (2), and this single copy of histone H3-H4 was present in the assembled C. symbiosum genome. Genes predicted to encode a histone H1 DNA binding protein, and 3 histone acetyltransferase genes were also present. As well, 27 helicase genes were identified with varying roles in DNA replication, transcription and repair, including 9 superfamily II members implicated in ATP-dependent chromatin remodeling. Moreover, 25 genes encoding restriction modification methytransferases were identified. Only 2 of these genes had closely related archaeal homologues. Five of the methyltransferases could be unambiguously linked to genes predicted to encode restriction endonucleases of untested specificity. C. symbiosum contains a complete set of genes necessary for transcription initiation including preinitiation complex formation and RNA polymerase assembly. The presence of 3 divergent copies of the TATA binding protein (tbp) and 5 copies of the transcription initiation factor TFIIB (tfb) indicates that C. symbiosum has the potential to generate alternative preinitiation complexes. Over 40 genes predicted to encode transcriptional regulators with the capacity to modulate preinitiation complex formation were identified. The majority of these genes are predicted to encode members of the Lrp/AsnC family of transcriptional regulators. In two instances related groups of transcriptional regulators formed expanded gene families composed of 5 and 4 members, respectively (Table 6). A total of 10 translation initiation and 4 elongation factors were present in the C. symbiosum genome including 2 copies of the translation initiation factor 2B (eIF2B). In addition, a single gene predicted to encode a small bacterial-type cold shock protein (cspB) potentially involved in RNA binding and translational control was also identified. A highly conserved cspB gene was previously identified on a genomic DNA fragment derived from a planktonic marine crenarchaeon, 4B7 (3). So far, crenarchaeotal cspB homologues have only been found in mesophiles or psychrophiles, a feature distinguishing C. symbiosum and cold-living relatives from other thermophilic lineages. Aminoacyl-tRNA synthetases are encoded for every amino acid with the exception of glutamine. However, 4 subunits of a glutamyl-tRNA amidotranseferase (gat) encoded by separate gatED and gatBA

S2 Hallam et al., PNAS 2006 Supplementary material operons likely mediated Glutamyl-tRNA activation. The genomic distribution of aminoacyl-tRNA synthetases was uneven, coinciding with the pattern observed for individual tRNAs (see previous and data not shown). Genes predicted to encode selenocysteinyl-tRNA, as well selenophosphate synthase or selenocysteine synthase required for its activation, were not identified. A gene related to the selenocysteine specific elongation factor (selB) was present, but its function remains currently unknown.

C. symbiosum harbors a number of genes predicted to encode chaperonins involved in cellular stress responses and protein folding and refolding processes. An operon predicted to encode a heat-shock or stress response complex composed of the genes grpE, hsp70 (dnaK), and hsp40 (dnaJ) was identified. Three additional copies of genes containing dnaK-like domains, and 2 copies each of the small heat shock protein hsp20, the alpha and beta (gimC) subunits of prefoldin, and an hsp60 related thermosome alpha and beta subunits, were also identified. In addition to the classical chaperonin systems, 4 genes predicted to encode peptidyl-prolyl cis trans isomerase, and at least 10 genes predicted to encode protein disulfide isomerase or thioredoxin, were identified with the potential to assist in de novo protein folding and oxidative stress responses.

Supporting Materials and Methods

Sequencing and Assembly Approach. Before fosmid selection, end sequencing generated 2,779 nonredundant reads greater than 200 base pairs (bp) per read, averaging 500 bp per read. Of the set of nonredundant reads, 1,041 clones were represented by paired-ends covering both the 5' and 3' ends. Overall, 66.2% of the library was represented by at least one end sequence, and 49.6% was represented by paired-end sequences. Five successive phases of fosmid selection, sequencing and assembly, were conducted over a four-year period. Initial selection was based on the following three criteria: (i) paired end sequences predicted to contain ORFs most similar to archaeal genes, (ii) linkage with previously reported fosmids harboring phylogenetic anchors, including genes encoding the small subunit ribosomal RNA, radA recombinase (4), and (iii) DNA polymerase subunits (1) and sets of paired ends, assembled in opposing orientations and predicted to contain ORFs homologous to two or more archaeal genes. Two previously described fosmids, AF083071 and AF083072, harboring a-type and b-type SSU ribosomal RNA genes respectively were included as seeds for fosmid walking excursions (5) (Fig. 4). Library Composition and Coverage. The fraction of C. symbiosum DNA contained within the library was estimated based on the number of small subunit ribosomal RNA genes (SSU rRNA) identified per plate in relation to the average size of completed archaeal genomes, obtained from the NCBI genomic biology browser (www.ncbi.nlm.nih.gov/Genomes/). On average, one C. symbiosum SSU rRNA gene was identified every 3 plates in the library (data not shown). Based on the representation of C. symbiosum SSU rRNA genes extrapolated to an average 2 Mb genome size, between 20% and 25% of the library, or 400-500 fosmids should de derived from C. symbiosum donor genomes. This represents ª8X coverage assuming a random distribution of fosmid clones. However, this coverage estimate does not take into account the richness and evenness of individual genotypes within the population or cloning biases associated with library production. Given that the population is not homogeneous and contains inherent naturally occurring variation between donor genomes with the potential to impact the relative genomic position of end sequences, the actual coverage area is in actuality much smaller. Tiling Path Construction. Following each phase of selection and sequencing, completed fosmids were aligned by using the Sequencher package (www.genecodes.com) to join sequences ≥95% identical to one another. Resulting contigs were used as queries in blastn searches against the set of completed fosmids to identify potential overlapping regions separated by small-scale rearrangements, insertions and deletions (6). The results of these searches were visually inspected and compared by using a combination of two prototype visualization tools, the Interactive Multiple Sequence Arranger and Relationship Viewer (IMARV) and blastView2, under development at the Joint Genome Institute. Binning fosmids before tiling path construction eliminated cross-assembly errors between the a- and b- type populations. A tiling path was manually assembled from within the a-type population bin by using

S3 Hallam et al., PNAS 2006 Supplementary material a 93% nucleotide identity cut-off over a 1,000 bp minimal overlapping interval. Overlapping fosmids were further trimmed to a minimal overlapping interval with 100% nucleotide identity to remove polymorphic sites and generate a representative genomic sequence for downstream annotation efforts. To minimize the introduction of chimeric junctions in ORFs encoded by two or more overlapping fosmids, efforts were made to maximize the total nucleotide contribution of individual fosmids within a given contig.

DNA Sequence Analysis. Contigs were annotated by using the FGENESB pipeline for automatic annotation of bacterial genomes from Softberry (www.softberry.com/berry.phtml, Mount Kisco, NY) by using the following parameters and cut-offs: ORF size = 100 aa, Expectation = 1e-10. Predicted ORFs were queried against the KEGG, COG and GenBank nonredundant (NR) databases. SSU and LSU -8 rRNA genes were identified by blastn query against NR with expectation cut-offs of 1 ¥ 10 . Automated FGENESB annotation of fosmids was manually refined and corrected by using the genome annotation and visualization tool Artemis (www.sanger.ac.uk/Software/Artemis) (7). Putative tRNA genes were identified by using the program tRNAscan-SE 1.21 (www.genetics.wustl.edu/eddy/tRNAscan-SE) (8) set to the archaeal tRNA covariance model, and SPLITS 1.0 (http://splits.iab.keio.ac.jp) (9) set to the following parameters; -d 2 -p 0.65 -F 5. The putative tRNA genes containing possible intron(s) detected with only tRNAscan-SE 1.21 were verified by the prediction of the bulge-helix-bulge motif for the hallmark of the exon-intron boundaries [Marck, C. and Grosjean, H. (2003) RNA 9:1516-1531] by using SPLITS 1.0. The 5S rRNA sequence was initially identified on the basis of blastn searches of the unassembled fosmid sequences by using representative archaeal reference sequences. The 7S RNA sequence was initially identified on the basis of blastn searches by using a 17-nt conserved region (AGG(C/T)CCGGAAGGGAGCA) deduced from archaeal 7S sequences. The tentative 5' and 3' ends of the gene were assigned from the conservation between a- and b-type genome sequences. The similarity of the predicted secondary structures to the consensus folding of 7S RNA and the presence of conserved motifs [Zwieb C, van Nues RW, Rosenblad MA, Brown JD, Samuelsson T. (2005) RNA 11:7-13] were also considered. The complete tiling path was assembled in Artemis by sequential addition of ordered and oriented contigs and gap spanning fosmids with accompanying feature tables, and visualized with the program GenomeViz www.uniklinikum-giessen.de/genome/genomeviz/download.html (10). For MUMmer plots comparing Sargasso Sea metagenomic data to the C. symbiosum genome, the following parameters and cut-offs were used: breaklength = 60, minimum cluster length = 20, and match length = 10. Test alignments intended to explore the specificity and depth of coverage of SAR alignments to the C. symbiosum genome were performed with the following archaeal reference genomes: Archaeoglobus fulgidus (NC_000917), Methanothermobacter thermautotrophicus (NC_000916), Thermoplasma volcanium (NC_002689), Pyrobaculum aerophilum (AE009441) and Sulfolobus solfataricus (NC_002754). Resulting delta files were converted into coordinate files for plotting and sequence analysis by using the show-coords program and visualized in graphical format (coverage plot) by using the MUMmerplot program. Conservation of predicted C. symbiosum genes in the unassembled set of whole genome shotgun data from the Sargasso Sea. The BSR represents the ratio of the bit score for the set of predicted C. symbiosum proteins queried against the Sargasso Sea or public genomes database divided by the bit score of C. symbiosum queried against itself (self-match). Application of the BSR reduces biases associated with database size and the length of matching segments by normalizing the bit scores derived from blast algorithms. Ka/Ks Calculations. The amino acid sequences of the ortholgous genes between fosmids were aligned by using clustalw. TRANALIGN implemented in the Emboss package was used to provide a codon- based nucleotide alignment of the orthologous genes. Calculation of sequence divergence at synonymous (Ka) and nonsynonymous (Kn) sites was performed on the nucleotide alignments by using the program DIVERGE implemented in the GCG package, based on the method of Li (11). The ratio of the number of nonsynonymous substitutions per nonsynonymous site to the number of synonymous substitutions per synonymous site for the complete protein sequence provides an index for selection where Ka/Ks = 1 represents neutral evolution, Ka/Ks << 1 is consistent with purifying selection and Ka/Ks >> 1 is consistent with positive selection.

S4 Hallam et al., PNAS 2006 Supplementary material

Supplementary Figure Legends

Fig. 4. Workflow summary for fosmid selection, sequencing, and assembly. (1) A fosmid library containing 2,100 clones enriched for C. symbiosum genomic DNA was constructed (see Supporting Materials and Methods). (2) The library was screened for small subunit (SSU) ribosomal RNA (rRNA) genes to unambiguously identify C. symbiosum-derived fosmids. SSU rRNA-containing fosmids were subsequently sequenced to completion. (3) The library was end-sequenced to provide phylogenetic and functional information and as a resource for extending contigs and ordering scaffolds. (4) Completed fosmids were queried against the set of all end sequences to identify overlapping fosmids capable of extending the anchor position. Fosmid end sequences were also aligned to one another to identify fosmids with overlapping ends in opposing directions to be used as seeds in subsequent rounds of selection and sequencing. (5) This process was repeated five times, culminating in a total of 155 archaeal fosmids. (6) With each successive phase, completed fosmids were aligned to one another as well as to the set of all end sequences to generate extended contigs and ordered scaffolds (see Supporting Materials and Methods).

Fig. 5. Ka/Ks values between orthologous genes common to a-type and b-type populations. The frequency of different Ka/Ks values for 836 orthologous C. symbiosum genes common to a-type and b- type populations. Each number on the x axis represents the highest number within a range of values, for example: 0 < X £ 0.05, 0.05 < X £ 0.15, etc. The arrow marks the point where Ka/Ks >1.0

Fig. 6. secY phylogenetic relationships. Maximum-likelihood analyses were performed as described in Materials and Methods. Fig. 7. Comparative analysis of archaeal reference genomes and SAR3 sample bin. Coverage plots for SAR sample 3 aligned to five archaeal reference genomes (see Materials and Methods). METTH, Methanothermobacter thermautotrophicus (NC_000916); ARCFU, Archaeoglobus fulgidus (NC_000917); THEVO, Thermoplasma volcanium (NC_002689); PYRAE, Pyrobaculum aerophilum (AE009441); SULSO, Sulfolobus solfataricus (NC_002754). Each vertical bar represents an individual whole genome shotgun (WGS) read. To the right of the plots the average percent amino acid identity (upper number) and similarity (lower number) for SAR WGS reads aligning to the reference genome is shown. For simplified visualization of gaps in the alignment, all matches are replotted near the base of the x axis to form a normalized 1D plot spanning the reference sequence. References 1. Schleper, C., Swanson, R. V., Mathur, E. J., DeLong, E. F. (1997) J Bacteriol 179:7803-7811. 2. Cubonova, L., Sandman, K., Hallam, S. J., Delong, E. F., Reeve, J. N. (2005) J Bacteriol 187:5482- 5485. 3. Beja, O., Koonin, E. V., Aravind, L., Taylor, L. T., Seitz, H., Stein, J. L., Bensen, D. C., Feldman, R. A., Swanson, R. V., DeLong, E. F. (2002) Appl Environ Microbiol 68:335-345. 4. Sandler, S. J., Hugenholtz, P., Schleper, C., DeLong, E. F., Pace, N. R., Clark, A. J. (1999) J Bacteriol 181:907-915. 5. Schleper, C., DeLong, E. F., Preston, C. M., Feldman, R. A., Wu, K. Y., Swanson, R. V. (1998) J Bacteriol 180:5003-5009. 6. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J. (1990) J. Mol. Biol. 215:403-410. 7. Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M.-A., Barrell, B. (2000) Bioinformatics 16:944-945. 8. Lowe, T., Eddy, S. (1997) Nucleic Acids Res. 25:955-964. 9. Sugahara, J., Yachie, N., Sekine, Y., Soma, A., Matsui, M., Tomita, M., Kanai, A. (2006) In Silico Biology 6:0039. 10. Ghai, R., Hain, T., Chakraborty, T. (2004) BMC Bioinformatics 5:198. 11. Li, W. H., Wu, C. I., Luo, C. C. (1985) Mol Biol Evol 2:150-174.

S5 8 4 10 7

9 6 5 3

2 (1) 1

1

32-42 Kb (2) (3) (5)

1 2 (4)

3 1

2 4 1 3

2 1 5 3 b-type fosmid a-type fosmid

4 6 5‘ end sequence 2 1 3‘ end sequence 3 5 end extension 4 2 1 paired ends 3 7 5 contig

9 8 (6) { } + { } 10 8 0.500

0.450

0.400

0.350

0.300

0.250 Ka/Ks locus_tag description start end dir aa 1.05 CENSYa_1517 hypothetical protein 1532797 1534716 - 640 1.06 CENSYa_1518 hypothetical protein 1535028 1535657 - 210 1.08 CENSYa_1164 hypothetical protein 1196045 1196224 - 60 Frequency0.200 1.17 CENSYa_0627 hypothetical protein 576327 576548 - 74 1.20 CENSYa_1707 hypothetical protein 1702352 1702945 + 198 1.28 CENSYa_0319 hypothetical protein 272229 273338 - 370 0.150 1.30 CENSYa_1132 hypothetical protein 1160688 1160861 - 58 1.34 CENSYa_1072 hypothetical protein 1115432 1115707 + 92 1.52 CENSYa_1624 hypothetical protein 1623774 1623935 + 54 1.66 CENSYa_2022 hypothetical protein 1987509 1987766 - 86 0.100 2.52 CENSYa_1922 hypothetical protein 1888620 1888757 + 46 4.62 CENSYa_0197 hypothetical protein 178657 178881 + 75

0.050

0.000 0.05 0.15 0.20 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 1.50 2.50 5.00 Ka/Ks secY

Cenarchaeum symbiosum Aeropyrum pernix (Q9YDD0) 79 Pyrobaculum aerophilum (NP_560730) 100 40 Sulfolobus solfataricus (Q9UX84) Crenarchaeota 100 90 Sulfolobus acidocaldarius (P49978) 51 Sulfolobus tokodaii (NP_376288) Nanoarchaeum equitans (NP_963461) Thermococcus kodakarensis (BAD85707) 100 Pyrococcus furiosus (Q8U019) 37 100 Pyrococcus horikoshii (O59442) 79 Pyrococcus abyssi (Q9V1V8) 83 Methanocaldococcus jannaschii (Q60175) 100 100 Methanococcus maripaludis (CAF30978) Methanococcus vannielii (P28541)

45 38 Methanothermobacter thermautotrophicus ( ) Methanopyrus kandleri (AAM01243) Ferroplasma acidarmanus (ZP_00609679) 100 48 Picrophilus torridus (YP_023441)

69 Thermoplasma acidophilum (CAC12372) Euryarchaeota 100 89 Thermoplasma volcanium (BAB59494) Archaeoglobus fulgidus (O28377) Methanosarcina barkeri (YP_303654) 100 100 Methanosarcina acetivorans (NP_616040)

22 Methanococcoides burtonii (EAM9948) 32 Methanosarcina mazei (AAM31843) 47 Methanospirillum hungatei (EAP16414) Natronomonas pharaonis (YP 331175) 42 69 Haloarcula marismortui (CAA44838) 0.1 100 100 Haloarcula marismortui (P28542) Halobacterium salinarum (Q9HPB1)

50 Haloferax volcanii (AAL73212) 100 Haloferax volcanii (Q977V3) 100

80 50

60 67

SULSO %SIM

40

20

0 500000 1000000 1500000 2000000 2500000 gi|15896971|ref|NC_002754.1|

100

80 52

60 69

PYRAE %SIM

40

20

0 500000 1000000 1500000 2000000 gi|18308975|gb|AE009441.1|

100

80 54 71 60

THEVO %SIM SAR3 40

20

0 200000 400000 600000 800000 1000000 1200000 1400000 gi|13540831|ref|NC_002689.2|

100

80 56 73 60

ARCFU %SIM

40

20

0 500000 1000000 1500000 2000000 gi|11497621|ref|NC_000917.1|

100

80 55 72 60

METTH %SIM

40

20

0 200000 400000 600000 800000 1000000 1200000 1400000 1600000 gi|15678031|ref|NC_000916.1| Table 2. Cenarchaeum symbiosum a-type fosmid locus index

Contig Read depthb rRNA cenFOSa JGI GenBank 1 2 G+C hits depth SSU LSU 5S

1 C01A04 3634001 DQ397560 40230 0.60 14 0.348 2 C01A05 3436581 DQ397540 39291 0.59 21 0.534 3 C01B03 3436583 DQ397541 43871 0.58 30 0.684 4 C01B12 na DQ397629 34772 0.57 12 0.345 5 C01C01 3436585 DQ397542 41173 0.54 16 0.389 6 C01C06 3436586 DQ397543 41548 0.59 19 0.457 7 C01C08 na DQ397630 46268 0.52 87 1.880 8 C01D10 3634003 DQ397561 39454 0.59 14 0.355 1 9 C01F04 3634552 DQ397588 37889 0.59 34 0.897 10 C01F06 3634553 DQ397589 34540 0.61 13 0.376 11 C01G02 3634554 DQ397590 39399 0.60 20 0.508 12 C01G08 3634555 DQ397591 41923 0.60 32 0.763 13 C02A01 3436588 DQ397544 33294 0.59 14 0.420 14 C02B01 4001027 DQ397627 40010 0.57 22 0.550 15 C02B11 3436590 DQ397545 45105 0.59 23 0.510 16 C03A04 3634556 DQ397592 41463 0.58 18 0.434 17 C03A05 3436597 DQ397546 39899 0.59 17 0.426 18 C03B12 3634558 DQ397593 38692 0.57 13 0.336 19 C03C07 4001018 DQ397618 41693 0.51 9 0.216 20 C03E10 3634559 DQ397594 39960 0.57 22 0.551 21 C03F09 3634560 DQ397595 39172 0.59 31 0.791 22 C04A07 na DQ397631 42553 0.58 24 0.564 23 C04F04 3436601 DQ397547 39026 0.59 15 0.384 24 C04F12 na DQ397632 40810 0.51 8 0.196 25 C04G08 3436602 DQ397548 46732 0.56 32 0.685 26 C04G09 na DQ397633 33298 0.60 7 0.210 27 C04H09 3436603 DQ397549 41052 0.57 29 0.706 1 1 28 C05A01 3436604 DQ397550 43380 0.58 17 0.392 29 C05A02 4001028 DQ397628 37577 0.58 20 0.532 30 C05B02 3634045 DQ397587 39977 0.58 15 0.375 1 31 C05C02 3634005 DQ397562 42705 0.59 24 0.562 32 C05D06 3634006 DQ397563 42904 0.58 26 0.606 33 C05D08 3634007 DQ397564 39634 0.59 20 0.505 34 C05D09 3436606 DQ397551 43893 0.59 17 0.387 35 C05E01 3634562 DQ397596 42625 0.58 15 0.352 36 C05E04 na DQ397634 40921 0.56 25 0.611 37 C05G03 3634008 DQ397565 45533 0.57 14 0.307 38 C05G09 3634564 DQ397597 38695 0.54 15 0.388 39 C05H05 na DQ397635 43064 0.56 15 0.348 40 C06D08 3634009 DQ397566 38715 0.57 33 0.852 41 C06E05 3634565 DQ397598 42858 0.61 48 1.120 42 C06H12 3634566 DQ397599 37698 0.60 21 0.557 43 C07B06 3634010 DQ397567 41389 0.58 17 0.411 44 C07B08 3634567 DQ397600 41295 0.60 15 0.363 45 C07D04 3634011 DQ397568 41538 0.58 34 0.819 1 1 46 C07D05 3634569 DQ397601 43888 0.54 25 0.570 47 C07D08 3634012 DQ397569 43141 0.58 10 0.232 48 C07D10 na DQ397636 36261 0.56 18 0.496 49 C07E08 3436610 DQ397553 40832 0.60 31 0.759 50 C07G05 3436609 DQ397552 39781 0.55 12 0.302 51 C07H12 4001020 DQ397620 40298 0.51 12 0.298 52 C08B03 3634573 DQ397602 40416 0.51 11 0.272 53 C08B04 3634574 DQ397603 46177 0.59 22 0.476 54 C08D01 3634015 DQ397570 44495 0.54 16 0.360 55 C08E06 3634017 DQ397571 40636 0.54 12 0.295 56 C08F03 3436611 DQ397554 36219 0.58 16 0.442 57 C08G10 3634575 DQ397604 37232 0.59 31 0.833 58 C08H08 na DQ397637 35308 0.53 8 0.227 59 C09A12 3634018 DQ397572 43887 0.58 8 0.182 60 C09C06 3634020 DQ397573 39895 0.60 15 0.376 61 C09C10 3436612 DQ397555 39564 0.59 17 0.430 62 C09D08 3634577 DQ397605 35982 [11079] 0.56 21 0.584 63 C09E09 3634021 DQ397574 42577 0.60 14 0.329 64 C09E11 3634579 DQ397606 41244 0.61 12 0.291 65 C101G10 na AF083071 32998 0.55 21 0.636 1 1 66 C11A07 4001019 DQ397619 38449 0.60 11 0.286 67 C11D05 3634023 DQ397575 41867 0.60 15 0.358 68 C12A09 3634580 DQ397607 38974 0.60 16 0.411 69 C12B02 3634581 DQ397608 36797 0.59 19 0.516 70 C13A04 4001024 DQ397624 38317 0.54 8 0.209 71 C13A10 3436613 DQ397556 40870 0.60 20 0.489 72 C13D11 3634025 DQ397576 38832 0.54 11 0.283 73 C13E07 3634026 DQ397577 38880 0.52 12 0.309 74 C13F02 3634582 DQ397609 43826 0.57 28 0.639 75 C13F12 3634027 DQ397578 39464 0.58 23 0.583 76 C14A09 3634583 DQ397610 45665 0.60 16 0.350 77 C15A07 3634584 DQ397611 43680 0.60 33 0.755 78 C15D07 4001023 DQ397623 43122 0.54 7 0.162 79 C15E09 3634030 DQ397579 43895 0.60 18 0.410 80 C15H04 3634587 DQ397612 44579 0.59 29 0.651 81 C16E09 na DQ397638 40387 0.60 27 0.669 82 C16G08 4001026 DQ397626 39063 0.51 5 0.128 83 C17B05 3634588 DQ397613 40924 0.57 25 0.611 84 C18C03 3436615 DQ397557 38829 0.51 8 0.206 85 C18D02 3634032 DQ397580 38412 0.58 13 0.338 86 C18F05 3634033 DQ397640 39931 0.52 2 0.050 87 C18H11 3634035 DQ397581 41492 0.58 13 0.313 88 C19A05 3634036 DQ397582 40370 0.54 11 0.272 89 C19D04 4001022 DQ397622 42161 0.52 8 0.190 90 C19D08 3634591 DQ397614 40691 0.60 12 0.295 91 C19F09 3634037 DQ397583 41893 0.57 22 0.525 92 C19G04 3634038 DQ397584 46841 0.60 6 0.128 93 C19H10 3436619 DQ397558 43787 0.55 23 0.525 94 C20B06 4001025 DQ397625 45783 0.56 20 0.437 95 C20B09 3634593 DQ397615 39122 0.59 15 0.383 96 C20C07 3634594 DQ397616 42246 0.59 28 0.663 97 C20G08 3634041 DQ397585 40803 0.57 23 0.564 98 C21E11 4001021 DQ397621 41768 0.57 22 0.527 99 C21G04 3634597 DQ397617 36742 0.59 28 0.762 100 C21G07 3436620 DQ397559 38150 0.60 14 0.367 101 C21H07 3634044 DQ397586 38934 0.56 15 0.385 102 C98C02 na DQ397639 40000 0.57 29 0.725

102 4143895 0.57 19.16 0.47 3 4 1

a cenFOS = Cenarchaeum symbiosum, a = a-type b Read depth based on the number of matching end sequences over the length of a given fosmid with >93% nucleotide identity over 90% of the read length Table 3. Cenarchaeum symbiosum b-type fosmid locus index

Contig Read depthb rRNA cenFOSa JGI GenBank 1 2 G+C hits depth SSU LSU 5S

1 C01A01 3436580 DQ397827 37585 0.61 10 0.266 2 C01A08 3436582 DQ397828 41419 0.59 13 0.314 3 C01B01 3436584 DQ397829 40703 0.61 14 0.344 4 C01B07 3634002 DQ397845 41876 0.58 13 0.310 5 C01C10 3436587 DQ397830 40511 0.61 11 0.272 6 C01E05 3634004 DQ397846 40941 0.56 9 0.220 7 C02C04 3436591 DQ397831 42768 0.57 7 0.164 8 C02C07 3436592 DQ397832 39120 0.60 24 0.613 9 C02E10 3436593 DQ397833 42767 0.57 7 0.164 10 C02E12 3436594 DQ397834 39848 0.59 8 0.201 11 C02G05 3436595 DQ397835 44971 0.61 14 0.311 12 C02H06 3436596 DQ397836 39848 0.59 8 0.201 13 C03F10 4001029 DQ397866 41348 0.58 9 0.218 14 C03G11 3436598 DQ397837 41322 0.55 11 0.266 15 C03H05 3436599 DQ397838 39721 0.50 9 0.227 16 C04C05 na DQ397869 40287 0.58 18 0.447 17 C04D12 3634561 DQ397857 40123 0.50 14 0.349 18 C04E08 3436600 DQ397839 40224 0.63 11 0.273 19 C04G10 na DQ397870 40996 0.62 11 0.268 20 C05D01 3436605 DQ397840 37104 0.58 5 0.135 21 C05G02 3634563 DQ397858 39265 0.57 10 0.255 22 C06A09 3436607 DQ397841 38485 0.60 18 0.468 23 C06F03 na DQ397877 37987 0.53 14 0.369 24 C07C04 3436608 DQ397842 41652 0.54 9 0.216 25 C07D01 3634568 DQ397859 43795 0.60 15 0.343 26 C07E09 3634570 DQ397860 42614 0.61 14 0.329 27 C07F02 3634571 DQ397875 41724 0.53 7 0.168 28 C07F03 3634013 DQ397847 42905 0.61 31 0.723 29 C07F11 3634014 DQ397848 34352 0.59 23 0.670 30 C07H03 3634572 DQ397861 38612 0.59 9 0.233 31 C08E01 3634016 DQ397849 37303 0.62 14 0.375 32 C09B07 3634019 DQ397850 40824 0.62 17 0.416 33 C09D12 3634578 DQ397876 39299 0.49 9 0.229 34 C12D02 4001030 DQ397867 38783 0.56 19 0.490 35 C13A11 3634024 DQ397873 40499 0.50 7 0.173 36 C13G08 3634028 DQ397851 38845 0.61 12 0.309 37 C15C12 3436614 DQ397843 41214 0.55 12 0.291 38 C15D08 3634029 DQ397852 38754 0.62 17 0.439 39 C15G10 3634586 DQ397862 33778 0.53 5 0.148 40 C16A06 na DQ397878 33861 0.50 6 0.177 41 C16D03 3634031 DQ397853 42367 0.58 17 0.401 42 C17D04 3436617 DQ397844 43042 0.60 13 0.302 1 43 C17E03 3436618 DQ397872 37670 0.52 10 0.265 44 C18C04 3436616 DQ397871 36161 0.48 9 0.249 45 C20A12 3634592 DQ397863 41539 0.57 23 0.554 46 C20E06 3634039 DQ397854 45538 0.56 7 0.154 47 C20G06 3634040 DQ397855 39443 0.57 13 0.330 48 C21A03 3634042 DQ397856 38527 0.58 11 0.286 49 C21B03 3634595 DQ397864 40572 0.61 20 0.493 50 C21C08 4001032 DQ397868 37513 0.60 9 0.240 51 C21E02 3634596 DQ397865 39981 0.62 17 0.425 52 C21F09 3634043 DQ397874 37092 0.52 8 0.216 53 C60A05 na AF083072 42432 0.59 16 0.377 1 1

53 2119910 0.57 12.58 0.31 1 1 1

a cenFOS = Cenarchaeum symbiosum, b = b-type b Read depth based on the number of matching end sequences over the length of a given fosmid with >93% nucleotide identity over 90% of the read length Table 4. C. symbiosum tiling path construction a

1 > DQ397601 DQ397602 DQ397577 DQ397629 DQ397609 DQ397542 DQ397586 DQ397618 DQ397632 DQ397558 DQ397587 DQ397566 DQ397580 DQ397605

DQ397569 DQ397572 DQ397612 DQ397631 DQ397564 DQ397597 DQ397543 DQ397553 DQ397627 DQ397556 DQ397583 DQ397623

DQ397592 DQ397610 DQ397625 DQ397604 DQ397617 DQ397611 DQ397596 DQ397636 DQ397599 DQ397616 DQ397598 DQ397591

DQ397594 DQ397555 DQ397614 DQ397554 DQ397565 DQ397552 DQ397557 DQ397588 DQ397584 DQ397633 DQ397635 DQ397571 DQ397624 DQ397640 DQ397620 DQ397621 16-23S rRNA

DQ397578 AF083071 DQ397550 DQ397585 DQ397607 DQ397581 DQ397619 DQ397634 DQ397568 DQ397549 DQ397546 DQ397608 DQ397606 DQ397559 DQ397639

DQ397563 DQ397613 DQ397590 DQ397576 DQ397540 DQ397561 DQ397570 DQ397638 DQ397615 DQ397541 DQ397595 DQ397582 DQ397545 DQ397579 DQ397548 DQ397547 DQ397628

DQ397603 DQ397573 DQ397600 DQ397560 DQ397567 DQ397562 DQ397589 DQ397575 DQ397551 DQ397574 DQ397544 DQ397593

DQ397626 DQ397622 DQ397630 < 2,045,089 bp DQ397637

a Only those fosmids partitioning into the a-type population bin are depicted in this table (see methods). Each fosmid is identified by a unique Genbank accession number. Fosmids shown in bold were selected to bridge or extend contigs during later rounds of iterative assembly. indicates the specific fosmid used in tiling path construction (Note: individual boxes are not drawn to scale). Table 5. Summary of tRNA genes identified in C. symbiosum genome

tRNAs Decoding Standard 20 AA 45

Selenocysteine tRNAs (TCA) 0

Possible suppressor tRNAs (CTA,TTA) 0

Total tRNAs 45

intron summary

tRNA with introns Trpa Met Ser Glna Val Tyr Leu Arg Glu CCA CAT GGA CTG CAC GTA CAA CCT TTC

10 1 2 1 1 1 1 1 1 1

positionb 29/30, "37/38" 31/32; 19/20 "37/38" 24/25, "37/38" 40/41 "37/38" "37/38" 15/16 62/63

motifc HBh', hBHBh' hBHBh', hBHBh' hBHBh' HBh', hBHBh' hBHBh' hBH hBHBh' HBh' hBHBh'

four box tRNA sets

Isotype anticodon count Total

Ala AGC GGC CGC TGC

1 1 1 3

Gly ACC GCC CCC TCC

1 1 1 3

Pro AGG GGG CGG TGG

1 1 1 3

Thr AGT GGT CGT TGT

1 1 1 3

Val AAC GAC CAC TAC

1 1 1 3

six box tRNA sets

isotype anticodon count Total

Ser AGA GGA CGA TGA ACT GCT

1 1 1 1 4

Arg ACG GCG CCG TCG CCT TCT

1 1 1 1 1 5

Leu AAG GAG CAG TAG CAA TAA

1 1 1 1 4

two box tRNA sets

isotype anticodon count total

Phe AAA GAA

1 1

Asn ATT GTT

1 1

Lys CTT TTT

1 1 2

Asp ATC GTC

1 1

Glu CTC TTC

1 1 2

His ATG GTG

1 1

Gln CTG TTG

1 1 2

two box and other tRNA sets

isotype anticodon count total

Ile AAT GAT TAT

1 1

Met CAT

3d 3 Tyr ATA GTA

1 1

suppressor CTA TTA

0

Cys ACA GCA

1 1

Trp CCA

1 1

Sec TCA

0

a single gene has two introns. b "37/38" may not be the exact position, but conventional intron position. c for definition, see Marck, C. and Grosjean, H. (2003), RNA, 9, 1516-1531 d one of them may be a tRNAIle decoding AUA codon after modification at the first letter of anticodon. indicates missing tRNA Table 6. Expanded gene families identified in C. symbiosum genome^

Family Gene IDs Locus Description Locus_tag CENSYa_0381, CENSYa_0382, CENSYa_0481, CENSYa_0508, CENSYa_0566, CENSYa_0581, CENSYa_0582, CENSYa_0583, CENSYa_0623, CENSYa_0643, CENSYa_0708, CENSYa_0709, CENSYa_0815, CENSYa_0818, CENSYa_0819, CENSYa_0841, CENSYa_0843, CENSYa_0845, 1 34 hypothetical protein CENSYa_0847, CENSYa_0848, CENSYa_0849, CENSYa_0878, CENSYa_0896, CENSYa_0916, CENSYa_0950, CENSYa_1154, CENSYa_1345, CENSYa_1360, CENSYa_1492, CENSYa_1494, CENSYa_1517, CENSYa_1808, CENSYa_1945, CENSYa_1946 CENSYa_0716, CENSYa_0722, CENSYa_0727, CENSYa_0728, CENSYa_0729, CENSYa_0730, 2 15 hypothetical protein CENSYa_0731, CENSYa_0737, CENSYa_0738, CENSYa_0739, CENSYa_0740, CENSYa_0741, CENSYa_0744, CENSYa_0745, CENSYa_0746 CENSYa_0528, CENSYa_0592, CENSYa_0593, CENSYa_0895, CENSYa_0989, CENSYa_1333, 3 14 hypothetical protein CENSYa_1515, CENSYa_1518, CENSYa_1519, CENSYa_1520, CENSYa_1521, CENSYa_1540, CENSYa_1543, CENSYa_1681 CENSYa_1794, CENSYa_1795, CENSYa_1796, CENSYa_1031, CENSYa_1033, CENSYa_1472, 4 8 copper binding protein, plastocyanin/azurin family CENSYa_1477, CENSYa_1831 5 6 hypothetical protein CENSYa_0355, CENSYa_0786, CENSYa_0905, CENSYa_0982, CENSYa_1449, CENSYa_1909 6 5 transcriptional regulator CENSYa_0493, CENSYa_0603, CENSYa_1209, CENSYa_1361, CENSYa_1476 7 5 serine protease inhibitor CENSYa_0537, CENSYa_1228, CENSYa_1605, CENSYa_1682, CENSYa_1965 * 8 5 hypothetical protein CENSYa_1229, CENSYa_1248, CENSYa_1474, CENSYa_1479, CENSYa_1683 9 5 hypothetical protein CENSYa_0083, CENSYa_0775, CENSYa_0776, CENSYa_0777, CENSYa_1762 10 4 transcriptional regulator CENSYa_1045, CENSYa_1047, CENSYa_1049, CENSYa_1052 11 4 transcription initiation factor TFIIB CENSYa_1331, CENSYa_1825, CENSYa_1876, CENSYa_1884 12 4 conserved hypothetical protein CENSYa_2027, CENSYa_2028, CENSYa_2029, CENSYa_2030 13 4 bacterial-type DNA primase CENSYa_0146, CENSYa_0148, CENSYa_0130, CENSYa_0163 14 4 ABC-type branched-chain amino acid transport system, periplasmic component CENSYa_0483, CENSYa_0616, CENSYa_1226, CENSYa_1357 * 15 4 outermembrane autotransporter CENSYa_0635, CENSYa_0636, CENSYa_0646, CENSYa_0653 16 3 esterase/lipase CENSYa_0305, CENSYa_1240, CENSYa_1732 * 17 3 autotransporter adhesin CENSYa_1634, CENSYa_1636, CENSYa_1777 * 18 3 superfamily II DNA/RNA helicase CENSYa_0153, CENSYa_0154, CENSYa_0155 * 19 3 hypothetical protein CENSYa_0151, CENSYa_0162, CENSYa_0568 20 3 hypothetical protein CENSYa_0159, CENSYa_0161, CENSYa_1565 21 3 protein-disulfide isomerase CENSYa_0223, CENSYa_0976, CENSYa_1269 22 3 hypothetical protein CENSYa_1579, CENSYa_1734, CENSYa_1793 23 3 AAA+ ATPase CENSYa_0523, CENSYa_1848, CENSYa_1969 24 3 flavin-nucleotide-binding protein CENSYa_0608, CENSYa_0648, CENSYa_1975 25 3 hypothetical protein CENSYa_0803, CENSYa_0808, CENSYa_1294

26 3 glycosyltransferase CENSYa_0994, CENSYa_0995, CENSYa_1000 27 3 conserved hypothetical protein CENSYa_1046, CENSYa_1048, CENSYa_1053

28 3 hypothetical protein CENSYa_1244, CENSYa_1247, CENSYa_1475

29 3 hypothetical protein CENSYa_0169, CENSYa_0553, CENSYa_1921

30 2 conserved protein implicated in secretion CENSYa_1885, CENSYa_1915

31 2 TATA-box binding protein CENSYa_0093, CENSYa_1233

32 2 HNH endonuclease CENSYa_0107, CENSYa_0194

33 2 glutathione synthase CENSYa_0125, CENSYa_0926

34 2 helicase CENSYa_0143, CENSYa_0147

35 2 nuclease CENSYa_0164, CENSYa_1106 36 2 conserved hypothetical protein CENSYa_0165, CENSYa_1103

37 2 20S proteasome, alpha and beta subunit CENSYa_0175, CENSYa_0990

38 2 aspartyl/asparaginyl-tRNa+synthetase CENSYa_0241, CENSYa_0611

39 2 hypothetical protein CENSYa_0243, CENSYa_0247

40 2 L-asparaginase CENSYa_0248, CENSYa_1418 *

41 2 hypothetical protein CENSYa_0318, CENSYa_0625

42 2 hypothetical protein CENSYa_0319, CENSYa_1371

43 2 transcriptional regulator CENSYa_0345, CENSYa_1992 44 2 ferredoxin-sulfite/nitrite reductase CENSYa_0358, CENSYa_0360

45 2 periplasmic serine protease (ClpP class) CENSYa_0375, CENSYa_1165 *

46 2 ammonia monooxygenase subunit C CENSYa_0399, CENSYa_0669

47 2 ammonia permease CENSYa_0526, CENSYa_1453

48 2 transcriptional activator, adenine-specific DNa+methyltransferase CENSYa_0569, CENSYa_0595

49 2 conserved hypothetical protein CENSYa_0606, CENSYa_0964

50 2 hypothetical protein CENSYa_0631, CENSYa_0632

51 2 hypothetical protein CENSYa_0633, CENSYa_0644 52 2 hypothetical protein CENSYa_0634, CENSYa_0645

53 2 3'-phosphoadenosine 5'-phosphosulfate sulfotransferase (PAPS reductase)/FAD synthetase CENSYa_0640, CENSYa_0655 54 2 Ser/Thr protein kinase CENSYa_0647, CENSYa_1861

55 2 cysteine sulfinate desulfinase/cysteine desulfurase CENSYa_0657, CENSYa_1572

56 2 hypothetical protein CENSYa_0659, CENSYa_0662

57 2 transposase CENSYa_0661, CENSYa_1086 58 2 hypothetical protein CENSYa_0697, CENSYa_1118 59 2 hypothetical protein CENSYa_0720, CENSYa_0735 60 2 conserved hypothetical protein CENSYa_0766, CENSYa_0913 61 2 hypothetical protein CENSYa_0807, CENSYa_1292 62 2 hypothetical protein CENSYa_0853, CENSYa_0854 63 2 hypothetical protein CENSYa_0890, CENSYa_1775 64 2 conserved hypothetical protein CENSYa_0891, CENSYa_1776 65 2 hypothetical protein CENSYa_0903, CENSYa_2032 66 2 Fe-S oxidoreductase CENSYa_1007, CENSYa_1012 67 2 hypothetical protein CENSYa_1030, CENSYa_1781 68 2 hypothetical protein CENSYa_1100, CENSYa_1101 69 2 hypothetical protein CENSYa_1218, CENSYa_1938 70 2 hypothetical protein CENSYa_1319, CENSYa_1393 71 2 hypothetical protein CENSYa_1526, CENSYa_1527 72 2 hypothetical protein CENSYa_1562, CENSYa_2053 73 2 hypothetical protein CENSYa_1563, CENSYa_2052 74 2 Na+-transporting NADH ubiquinone oxidoreductase subunit CENSYa_1659, CENSYa_1714 75 2 ferredoxin CENSYa_1727, CENSYa_1728 76 2 Zn-dependent oxidoreductase CENSYa_1751, CENSYa_2016 77 2 chaperonin GroEL CENSYa_1851, CENSYa_2001 78 2 glutamine synthetase CENSYa_1853, CENSYa_1872 79 2 hypothetical protein CENSYa_2009, CENSYa_2010

total 263 ^ based on following cutoffs: expection ≥ 1e-20, bitscore ≥ 100, identity ≥ 40% and overlap ≥ 100 aa

* indicates gene content shared between C. symbiosum and public genomes not identified in Sargasso Sea

Expanded gene families (1-13) appear in the sixth circle of the genome map (Figure 3) Table 7. C. symbiosum gene content more conserved in public genome database than Sargasso Sea a locus_tag COG Description Best matching genome BLAST score ratio 1 CENSYa_0012 conserved hypothetical protein Vibrio parahaemolyticus chromosome II 30.6 2 CENSYa_0032 T serine/threonine protein phosphatase Alkaliphillus metalliredigenes 35.6 3 CENSYa_0092 conserved hypothetical protein Sulfolobus tokodaii 39.7 4 CENSYa_0153 K superfamily II DNA/RNA helicase, SNF2 family Thermodesulfovibrio commune 54.3 5 CENSYa_0154 K superfamily II DNA/RNA helicase, SNF2 family Thermodesulfovibrio commune 37.0 6 CENSYa_0155 K superfamily II DNA/RNA helicase, SNF2 family Thermodesulfovibrio commune 56.2 7 CENSYa_0156 hypothetical protein Thermodesulfovibrio commune 45.6 8 CENSYa_0168 F GMP synthase, glutamine amidotransferase domain Acidobacteria-3 sp. Ellin6076 32.5 9 CENSYa_0180 R helicase Archaeoglobus fulgidus 32.7 10 CENSYa_0181 hypothetical protein Thermoplasma volcanium 33.2 11 CENSYa_0221 S hydroxymethylpyrimidine phosphate kinase Haloferax volcanii 30.3 12 CENSYa_0248 E L-asparaginase Anaeromyxobacter dehalogenans 31.0 13 CENSYa_0375 N periplasmic serine protease (ClpP class) Corynebacterium glutamicum 30.4 14 CENSYa_0440 hypothetical protein Rhodoferax ferrireducens 30.3 15 CENSYa_0483 E ABC-type branched-chain amino acid transport Pyrobaculum aerophilum 33.0 16 CENSYa_0537 O serine protease inhibitor Thermococcus kodakaraensis 40.4 17 CENSYa_0616 E ABC-type branched-chain amino acid transport Aeropyrum pernix 30.7 18 CENSYa_0628 L site-specific DNA methylase Chloroflexus aurantiacus 51.1 19 CENSYa_0638 L DNA mismatch repair enzyme Anabaena variabilis plasmid A 39.3 20 CENSYa_0639 L superfamily II helicase Anabaena variabilis plasmid A 32.6 21 CENSYa_0683 L DNA modification methylase Chloroflexus aurantiacus 34.0 22 CENSYa_0721 hypothetical protein Gemmata obscuriglobus 37.0 23 CENSYa_0736 hypothetical protein Gemmata obscuriglobus 42.3 24 CENSYa_0824 L DNA modification methylase Methanococcus jannaschii 38.8 25 CENSYa_0830 L DNA modification methylase Nostoc sp. PCC7120 43.2 26 CENSYa_0864 J RNase P/RNase MRP subunit p29 Methanococcus jannaschii 31.8 27 CENSYa_0888 L DNA modification methylase Helicobacter pylor 52.3 28 CENSYa_0952 hypothetical protein Myxococcus xanthus 37.1 29 CENSYa_0954 R ATPase involved in DNA repair Methanobacterium thermoautotrophicum 51.0 30 CENSYa_0956 ATPase Methanobacterium thermoautotrophicum 53.1 31 CENSYa_0957 conserved hypothetical protein Methanobacterium thermoautotrophicum 30.7 32 CENSYa_1008 helicase Arthrobacter sp. FB24 33.2 33 CENSYa_1041 hypothetical protein Ralstonia solanacearum 54.7 34 CENSYa_1050 hypothetical protein Nitrosomonas europaea 34.6 35 CENSYa_1092 hypothetical protein Methanopyrus kandleri 30.4 36 CENSYa_1108 L adenine specific DNA methylase Fibriobacter succinogenes 30.3 37 CENSYa_1117 hypothetical protein Haloarcula marismortui 30.4 38 CENSYa_1150 L DNA modification methylase Nitrosococcus oceani 49.9 39 CENSYa_1222 E ABC-type branched-chain amino acid transport Pyrobaculum aerophilum 38.4 40 CENSYa_1223 E ABC-type branched-chain amino acid transport Archaeoglobus fulgidus 41.3 41 CENSYa_1226 E ABC-type branched-chain amino acid transport Aeropyrum pernix 32.2 42 CENSYa_1228 O serine protease inhibitor Thermococcus kodakaraensis 33.5 43 CENSYa_1240 esterase/lipase Rhodospirillum rubrum 31.0 44 CENSYa_1277 E anthranilate phosphoribosyltransferase Kineococcus radiotolerans 40.5 45 CENSYa_1328 I poly(3-hydroxyalkanoate) synthetase Haloarcula marismortui 36.5 46 CENSYa_1357 E ABC-type branched-chain amino acid transport Pyrobaculum aerophilum 35.8 47 CENSYa_1418 E L-asparaginase Anaeromyxobacter dehalogenans 34.5 48 CENSYa_1431 tetrahydromethanopterin S-methyltransferase, Methanosarcina barkeri 36.2 49 CENSYa_1480 hypothetical protein Thermosynechococcus elongatus 33.1 50 CENSYa_1510 O molecular chaperone (small heat shock protein) Sulfolobus solfataricus 37.0 51 CENSYa_1552 K RNA 3'-terminal phosphate cyclase Methanopyrus kandleri 31.9 52 CENSYa_1605 O serine protease inhibitor Anabaena variabilis 32.0 53 CENSYa_1613 R dioxygenase Methanopyrus kandleri 38.2 54 CENSYa_1629 DNA helicase Burkholderia pseudomallei 30.8 55 CENSYa_1632 L DNA G/T mismatch repair endonuclease Neisseria gonorrhoeae 30.1 56 CENSYa_1633 P Mg2 and Co2 transporter Dehalococcoides ethenogenes 30.4 57 CENSYa_1634 U autotransporter adhesin Fibriobacter succinogenes 67.4 58 CENSYa_1650 hypothetical protein Nocardioides sp. 30.2 59 CENSYa_1682 O serine protease inhibitor Anabaena variabilis 38.3 60 CENSYa_1818 S uncharacterized conserved protein Methanobacterium thermoautotrophicum 41.9 61 CENSYa_1914 M dolichol-phosphate mannosyltransferase Thermococcus kodakaraensis 31.9 62 CENSYa_1923 S uncharacterized conserved protein Methanopyrus kandleri 45.5 63 CENSYa_1933 hypothetical protein Shewanella denitrificans 39.0 64 CENSYa_1965 O serine protease inhibitor Dictyoglomus thermophilum 35.0 65 CENSYa_2033 hypothetical protein Stigmatella aurantica 31.6 a Identification based on a blast score ratio cut-off ≥30. The blast score ratio represents the bitscore of the C. symbiosum protein queried against all public genomes divided by the self-match bitscore of the C. symbiosum sequence queried against itself (see methods).

COG category descriptions are found in Figure 3 Table 8. C. symbiosum gene content more conserved in Sargasso Sea than public genome databasea locus_tag COG Description BLAST score ratio 1 CENSYa_0001 - rieske iron sulphur protein 75.2 2 CENSYa_0002 C cytochrome b subunit of the bc complex 42.7 3 CENSYa_0003 - copper binding protein, plastocyanin/azurin 53.1 4 CENSYa_0004 - hypothetical protein 53.2 5 CENSYa_0006 R hydrolase of the metallo-beta-lactamase 42.4 6 CENSYa_0007 O prefoldin, chaperonin cofactor 41.9 7 CENSYa_0009 - hypothetical protein 52.8 8 CENSYa_0011 R nucleic acid-binding protein 41.3 9 CENSYa_0017 - hypothetical protein 47.5 10 CENSYa_0020 - ribonuclease HI 60.3 11 CENSYa_0023 N signal peptidase I 34.1 12 CENSYa_0024 - hypothetical protein 41.9 13 CENSYa_0026 - hypothetical protein 59.6 14 CENSYa_0027 - hypothetical protein 64.1 15 CENSYa_0031 F dihydroorotase 54.5 16 CENSYa_0038 R uncharacterized conserved protein, putative 43.6 17 CENSYa_0041 - hypothetical protein 33.9 18 CENSYa_0042 - hypothetical protein 37.3 19 CENSYa_0051 O cytochrome c biogenesis protein 42.2 20 CENSYa_0054 L Mg-dependent DNase 46.7 21 CENSYa_0056 M membrane-associated Zn-dependent protease 66.5 22 CENSYa_0059 - hypothetical protein 53.4 23 CENSYa_0063 - hypothetical membrane protein 35.3 24 CENSYa_0069 - hypothetical protein 53.9 25 CENSYa_0073 - hypothetical protein 76.6 26 CENSYa_0074 - hypothetical protein 54.5 27 CENSYa_0077 J phenylalanyl-tRNA synthetase beta subunit 33.1 28 CENSYa_0079 J phenylalanyl-tRNA synthetase alpha subunit 41.5 29 CENSYa_0081 - hypothetical protein 53.7 30 CENSYa_0086 G transketolase, N-terminal subunit 61.0 31 CENSYa_0089 - hypothetical protein 41.6 32 CENSYa_0095 E cysteine synthase 43.3 33 CENSYa_0096 - conserved hypothetical protein 45.5 34 CENSYa_0097 R nitric oxide reductase Q protein 69.5 35 CENSYa_0098 C dehydrogenase 62.1 36 CENSYa_0103 L single stranded DNA-specific exonuclease 50.2 37 CENSYa_0106 - conserved hypothetical protein 56.4 38 CENSYa_0111 C succinyl-CoA synthetase, beta subunit 77.7 39 CENSYa_0112 - hypothetical protein 54.3 40 CENSYa_0113 J isoleucyl-tRNA synthetase 40.6 41 CENSYa_0119 J exonuclease of the beta-lactamase fold 55.6 42 CENSYa_0120 - hypothetical protein 41.5 43 CENSYa_0121 K transcriptional regulator 32.8 44 CENSYa_0126 E acetylornithine 69.0 45 CENSYa_0127 J diphthamide biosynthesis methyltransferase 55.4 46 CENSYa_0128 M cytidylyltransferase 54.1 47 CENSYa_0159 - hypothetical protein 31.9 48 CENSYa_0161 - hypothetical protein 37.6 49 CENSYa_0176 H methylase involved in ubiquinone/menaquinone 64.7 50 CENSYa_0177 G permease of the major facilitator superfamily 54.7 51 CENSYa_0179 - hypothetical protein 55.0 52 CENSYa_0182 L DNA topoisomerase IB 51.2 53 CENSYa_0185 - hypothetical protein 60.2 54 CENSYa_0187 - hypothetical protein 66.1 55 CENSYa_0196 - hypothetical protein 85.3 56 CENSYa_0202 E 3-dehydroquinate dehydratase 50.2 57 CENSYa_0204 E archaeal shikimate kinase 52.5 58 CENSYa_0208 L Rad3-related DNA helicase 45.2 59 CENSYa_0213 E aspartate/tyrosine/aromatic aminotransferase 44.6 60 CENSYa_0214 E prephenate dehydrogenase 65.9 61 CENSYa_0222 - small membrane protein 49.6 62 CENSYa_0223 O protein-disulfide isomerase 56.3 63 CENSYa_0228 - hypothetical protein 61.1 64 CENSYa_0233 - hypothetical protein 70.9 65 CENSYa_0235 - hypothetical protein 55.5 66 CENSYa_0238 J Asp-tRNAAsn/Glu-tRNAGln amidotransferase B 51.4 67 CENSYa_0240 - hypothetical protein 33.9 68 CENSYa_0252 - conserved hypothetical protein 31.7 69 CENSYa_0253 J deoxyhypusine synthase 33.2 70 CENSYa_0259 - hypothetical protein 61.8 71 CENSYa_0260 S conserved membrane protein 57.5 72 CENSYa_0262 J RNA-binding protein 58.9 73 CENSYa_0267 - hypothetical protein 47.1 74 CENSYa_0268 - conserved hypothetical protein 43.0 75 CENSYa_0269 - hypothetical protein 44.7 76 CENSYa_0273 - hypothetical protein 47.8 77 CENSYa_0276 C dehydrogenase 81.0 78 CENSYa_0278 - hypothetical protein 48.8 79 CENSYa_0287 E cysteine desulfurase/selenocysteine lyase 48.8 80 CENSYa_0289 - hypothetical protein 94.4 81 CENSYa_0300 E asparagine synthase (glutamine-hydrolyzing) 30.4 82 CENSYa_0301 - transcriptional regulator 80.4 83 CENSYa_0302 - conserved hypothetical protein 44.2 84 CENSYa_0303 K DNA-directed RNA polymerase, subunit K/omega 58.9 85 CENSYa_0304 - hypothetical protein 65.2 86 CENSYa_0307 - conserved hypothetical protein 61.5 87 CENSYa_0308 - hypothetical protein 44.5 88 CENSYa_0309 - hypothetical protein 71.5 89 CENSYa_0312 - hypothetical protein 73.5 90 CENSYa_0314 - hypothetical protein 55.6 91 CENSYa_0315 - hypothetical protein 59.1 92 CENSYa_0316 R dehydrogenase 36.3 93 CENSYa_0320 - hypothetical protein 37.5 94 CENSYa_0321 J pseudouridine synthase 62.7 95 CENSYa_0322 F cytidylate kinase 65.0 96 CENSYa_0323 - membrane protein 68.8 97 CENSYa_0331 N signal recognition particle GTPase 31.4 98 CENSYa_0332 O prefoldin, molecular chaperone 44.1 99 CENSYa_0341 S uncharacterized protein conserved in archaea 60.4 100 CENSYa_0345 K transcriptional regulator 66.9 101 CENSYa_0347 J arginyl-tRNA synthetase 49.1 102 CENSYa_0348 R hydrolase of the HAD superfamily 39.4 103 CENSYa_0357 R TPR repeat 58.3 104 CENSYa_0359 P rhodanese-related sulfurtransferase 58.4 105 CENSYa_0362 - conserved hypothetical protein 37.3 106 CENSYa_0365 R metal-dependent phosphoesterase 47.7 107 CENSYa_0368 - hypothetical protein 63.6 108 CENSYa_0383 R archaeal enzyme of ATP-grasp superfamily 43.1 109 CENSYa_0384 K archaeal DNA-binding protein 45.5 110 CENSYa_0385 - hypothetical protein 58.1 111 CENSYa_0387 - multitransmembrane protein 64.1 112 CENSYa_0389 - hypothetical protein 32.8 113 CENSYa_0391 - hypothetical protein 91.7 114 CENSYa_0392 - hypothetical protein 43.0 115 CENSYa_0393 - hypothetical protein 51.0 116 CENSYa_0394 - ammonia monooxygenase subunit B 74.9 117 CENSYa_0399 - ammonia monooxygenase subunit C 95.6 118 CENSYa_0402 - ammonia monooxygenase subunit A 94.2 119 CENSYa_0404 - hypothetical protein 67.4 120 CENSYa_0408 - Zn ribbon protein 51.1 121 CENSYa_0409 - C2H2 Zn finger protein 40.8 122 CENSYa_0410 - multiple Zn finger protein 65.7 123 CENSYa_0415 - conserved hypothetical protein 61.4 124 CENSYa_0416 - conserved hypothetical protein 62.6 125 CENSYa_0418 O DnaJ-class molecular chaperone 35.2 126 CENSYa_0424 - hypothetical protein 50.3 127 CENSYa_0426 - hypothetical protein 34.8 128 CENSYa_0430 S uncharacterized conserved protein 44.9 129 CENSYa_0432 C galactose-1-phosphate uridylyltransferase 58.8 130 CENSYa_0433 - hypothetical protein 45.7 131 CENSYa_0434 K histone acetyltransferase 58.3 132 CENSYa_0442 - hypothetical protein 47.8 133 CENSYa_0447 - ribosomal protein L30E 47.6 134 CENSYa_0449 K transcriptional regulator 44.4 135 CENSYa_0452 O Urease accessory protein 39.4 136 CENSYa_0453 O Urease accessory protein 34.7 137 CENSYa_0457 E urea active transporter 55.1 138 CENSYa_0458 - hypothetical protein 51.1 139 CENSYa_0461 - conserved hypothetical protein 69.0 140 CENSYa_0467 - hypothetical protein 71.8 141 CENSYa_0469 J dimethyladenosine transferase (rRNA 37.1 142 CENSYa_0470 J methylase of polypeptide chain release factor 49.4 143 CENSYa_0479 E aminopeptidase N 43.4 144 CENSYa_0484 E oligoendopeptidase F 58.4 145 CENSYa_0489 - hypothetical protein 70.6 146 CENSYa_0491 L superfamily II helicase 52.4 147 CENSYa_0492 - hypothetical protein 69.2 148 CENSYa_0496 - hypothetical protein 73.6 149 CENSYa_0498 - conserved hypothetical protein 33.3 150 CENSYa_0501 - conserved hypothetical protein 33.3 151 CENSYa_0502 - conserved hypothetical protein 72.7 152 CENSYa_0504 R thioredoxin 42.3 153 CENSYa_0509 - hypothetical protein 46.4 154 CENSYa_0511 - hypothetical protein 33.5 155 CENSYa_0512 - hypothetical protein 46.4 156 CENSYa_0514 - hypothetical protein 53.8 157 CENSYa_0515 S tRNA pseudouridine synthase D 34.3 158 CENSYa_0516 - streptogramin lyase 39.2 159 CENSYa_0520 C dehydrogenase 46.4 160 CENSYa_0522 - hypothetical protein 54.6 161 CENSYa_0530 - copper binding protein, plastocyanin/azurin 52.3 162 CENSYa_0531 S metal-dependent protease of the PAD1/JA B1 46.1 163 CENSYa_0547 R flavin-nucleotide-binding protein 63.6 164 CENSYa_0548 C flavin-dependent oxidoreductase 64.4 165 CENSYa_0549 C alcohol dehydrogenase, class IV 77.4 166 CENSYa_0557 J ribosomal biogenesis protein/nucleolar protein 47.9 167 CENSYa_0561 J N2,N2-dimethylguanosine tRNA methyltransferase 41.8 168 CENSYa_0563 K transcriptional regulator 36.4 169 CENSYa_0565 - uncharacterized conserved protein 42.0 170 CENSYa_0569 - transcriptional activator, adenine-specific DNA 54.2 171 CENSYa_0570 P ABC-type phosphate transport system, periplasmic 40.1 172 CENSYa_0579 - hypothetical protein 51.4 173 CENSYa_0580 - hypothetical protein 36.4 174 CENSYa_0584 - hypothetical protein 61.2 175 CENSYa_0587 I 3-ketoacyl-CoA thiolase/acetyl-CoA 44.1 176 CENSYa_0588 C flavin-dependent oxidoreductase 48.7 177 CENSYa_0589 - hypothetical protein 49.8 178 CENSYa_0590 G glucose-6-phosphate isomerase 42.0 179 CENSYa_0595 - transcriptional activator, adenine-specific DNA 54.3 180 CENSYa_0596 - MunI restriction endonuclease 31.1 181 CENSYa_0600 - uncharacterized membrane-associated protein 68.6 182 CENSYa_0601 - conserved hypothetical protein 57.9 183 CENSYa_0602 E ABC-type dipeptide/oligopeptide/nickel transport 73.4 184 CENSYa_0608 R flavin-nucleotide-binding protein 33.8 185 CENSYa_0630 - hypothetical protein 72.7 186 CENSYa_0641 - hypothetical protein 40.9 187 CENSYa_0647 - Ser/Thr protein kinase 35.1 188 CENSYa_0648 R conserved hypothetical protein 44.4 189 CENSYa_0669 - ammonia monooxygenase/methane monooxygenase, 96.6 190 CENSYa_0670 - conserved hypothetical protein 34.4 191 CENSYa_0765 - hypothetical protein 67.2 192 CENSYa_0766 - conserved hypothetical protein 66.1 193 CENSYa_0767 - hypothetical protein 48.9 194 CENSYa_0768 R ICC-like phosphoesterase 42.7 195 CENSYa_0769 E carbamoylphosphate synthase large subunit 78.7 196 CENSYa_0779 J translation initiation factor 1 52.6 197 CENSYa_0780 T serine/threonine protein kinase 38.5 198 CENSYa_0785 - hypothetical protein 35.0 199 CENSYa_0787 - hypothetical protein 54.9 200 CENSYa_0788 - hypothetical protein 61.5 201 CENSYa_0789 - hypothetical protein 60.5 202 CENSYa_0793 L ATP-dependent endonuclease 57.1 203 CENSYa_0798 G hydroxypyruvate reductase 32.1 204 CENSYa_0800 - hypothetical protein 75.0 205 CENSYa_0814 - single-stranded DNA-specific exonuclease 64.7 206 CENSYa_0831 - hypothetical protein 67.4 207 CENSYa_0833 E methylmalonyl-CoA epimerase/lactoylglutathione 62.5 208 CENSYa_0834 - hypothetical protein 40.5 209 CENSYa_0835 - hypothetical protein 33.2 210 CENSYa_0837 T adenylate cyclase 49.6 211 CENSYa_0850 - hypothetical protein 55.4 212 CENSYa_0857 J ribosomal protein L3 79.1 213 CENSYa_0862 J ribosomal protein S3 61.6 214 CENSYa_0867 J ribosomal protein L24 42.5 215 CENSYa_0873 - hypothetical protein 55.4 216 CENSYa_0876 - hypothetical protein 81.4 217 CENSYa_0884 K transcriptional regulator 55.6 218 CENSYa_0889 - hypothetical protein 38.5 219 CENSYa_0890 - hypothetical protein 36.2 220 CENSYa_0897 - hypothetical protein 56.7 221 CENSYa_0900 - hypothetical protein 30.2 222 CENSYa_0907 - conserved hypothetical protein 65.3 223 CENSYa_0909 - conserved hypothetical protein 52.0 224 CENSYa_0913 - conserved hypothetical protein 43.5 225 CENSYa_0919 T universal stress protein 36.3 226 CENSYa_0924 - DnaJ-class molecular chaperone 44.3 227 CENSYa_0931 G permease 58.5 228 CENSYa_0932 - SAM dependent methyltransferase 40.6 229 CENSYa_0935 - hypothetical protein 39.7 230 CENSYa_0936 L NTP pyrophosphohydrolase 41.6 231 CENSYa_0938 J ribosomal protein L31E 40.9 232 CENSYa_0939 D cell division GTPase 42.0 233 CENSYa_0949 - hypothetical protein 38.0 234 CENSYa_0951 - thiol-disulfide isomerase 64.7 235 CENSYa_0953 C HEAT repeat 77.6 236 CENSYa_0965 D ATPases involved in chromosome partitioning 47.2 237 CENSYa_0966 S uncharacterized protein conserved in archaea 37.0 238 CENSYa_0967 - hypothetical protein 70.0 239 CENSYa_0972 G glyceraldehyde-3-phosphate dehydrogenase 59.5 240 CENSYa_0976 O protein-disulfide isomerase 47.8 241 CENSYa_0977 I 3-hydroxy-3-methylglutaryl CoA synthase 69.5 242 CENSYa_0978 H cobyrinic acid a,c-diamide synthase/dethiobiotin 30.3 243 CENSYa_0984 O thiol-disulfide isomerase 45.0 244 CENSYa_0985 - conserved protein implicated in secretion 47.9 245 CENSYa_0986 - hypothetical protein 68.6 246 CENSYa_0987 - hypothetical protein 46.8 247 CENSYa_0999 - hypothetical protein 31.5 248 CENSYa_1001 M glycosyltransferase 44.9 249 CENSYa_1003 E asparagine synthase (glutamine-hydrolyzing) 44.8 250 CENSYa_1004 - SAM dependent methyltransferase 39.4 251 CENSYa_1013 K transcriptional regulator 70.2 252 CENSYa_1014 K Mn-dependent transcriptional regulator 78.0 253 CENSYa_1016 R RNA-binding protein 44.3 254 CENSYa_1017 - hypothetical protein 79.0 255 CENSYa_1019 - hypothetical protein 48.1 256 CENSYa_1022 - conserved hypothetical protein 53.9 257 CENSYa_1037 - coiled coil protein 79.0 258 CENSYa_1039 - conserved hypothetical protein 64.2 259 CENSYa_1040 - conserved hypothetical protein 56.3 260 CENSYa_1059 - cobalamin binding protein 39.4 261 CENSYa_1060 - secreted periplasmic Zn-dependent protease 36.8 262 CENSYa_1061 S protoporphyrinogen oxidase 54.6 263 CENSYa_1068 E asparagine synthase (glutamine-hydrolyzing) 40.2 264 CENSYa_1074 - hypothetical protein 58.1 265 CENSYa_1075 - hypothetical protein 45.4 266 CENSYa_1077 H thiamine monophosphate kinase 30.3 267 CENSYa_1078 G phosphomannomutase 38.5 268 CENSYa_1080 - hypothetical protein 58.2 269 CENSYa_1082 K transcriptional regulator 83.2 270 CENSYa_1094 - hypothetical protein 35.5 271 CENSYa_1109 - hypothetical protein 62.2 272 CENSYa_1122 - hypothetical protein 50.1 273 CENSYa_1124 - hypothetical protein 52.8 274 CENSYa_1125 M glycosyltransferase 72.5 275 CENSYa_1126 C ferredoxin 50.7 276 CENSYa_1129 P Kef-type K transport system, membrane 64.7 277 CENSYa_1130 - hypothetical protein 40.7 278 CENSYa_1136 J tRNA splicing endonuclease 55.9 279 CENSYa_1137 - DNA repair helicase 53.6 280 CENSYa_1142 - hypothetical protein 32.5 281 CENSYa_1144 - hypothetical protein 50.6 282 CENSYa_1145 - secreted periplasmic Zn-dependent protease 45.0 283 CENSYa_1147 - hypothetical protein 62.8 284 CENSYa_1148 - hypothetical protein 45.0 285 CENSYa_1151 F phosphoribosylaminoimidazole (AIR) synthetase 55.9 286 CENSYa_1152 P Kef-type K transport system, membrane component 49.1 287 CENSYa_1157 J lysyl-tRNA synthetase (class I) 52.1 288 CENSYa_1158 - hypothetical protein 53.3 289 CENSYa_1161 - hypothetical protein 76.7 290 CENSYa_1162 - hypothetical protein 39.8 291 CENSYa_1168 H glutamate-1-semialdehyde aminotransferase 56.9 292 CENSYa_1169 R TPR repeat 47.2 293 CENSYa_1170 - hypothetical protein 41.0 294 CENSYa_1172 - double stranded beta helix fold enzyme 31.0 295 CENSYa_1173 - hypothetical protein 58.7 296 CENSYa_1179 - conserved hypothetical protein 49.8 297 CENSYa_1185 H 1,4-dihydroxy-2-naphthoate 52.4 298 CENSYa_1196 Q short-chain alcohol dehydrogenase 49.2 299 CENSYa_1197 - permease 57.6 300 CENSYa_1198 S uncharacterized conserved protein 55.7 301 CENSYa_1199 - hypothetical protein 57.5 302 CENSYa_1208 - hypothetical protein 34.9 303 CENSYa_1212 - hypothetical protein 38.2 304 CENSYa_1215 S uncharacterized conserved protein 33.3 305 CENSYa_1217 - hypothetical protein 36.6 306 CENSYa_1220 - conserved hypothetical protein 30.4 307 CENSYa_1221 - hypothetical protein 30.7 308 CENSYa_1234 - conserved hypothetical protein 48.9 309 CENSYa_1235 M glycosyltransferases involved in cell wall 38.5 310 CENSYa_1238 H thiamine biosynthesis enzyme 33.8 311 CENSYa_1251 - hypothetical protein 61.9 312 CENSYa_1253 J alanyl-tRNA synthetase 41.2 313 CENSYa_1255 J leucyl-tRNA synthetase 45.5 314 CENSYa_1262 P ABC-type phosphate/phosphonate transport system, 52.6 315 CENSYa_1263 P ABC-type phosphate/phosphonate transport system, 45.4 316 CENSYa_1264 - hypothetical protein 59.5 317 CENSYa_1265 - hypothetical protein 67.9 318 CENSYa_1266 - conserved hypothetical protein 42.7 319 CENSYa_1269 O protein-disulfide isomerase 34.9 320 CENSYa_1271 - ribokinase 39.2 321 CENSYa_1272 R sugar kinase 59.6 322 CENSYa_1289 J peptide chain release factor 1 53.3 323 CENSYa_1290 - hypothetical protein 39.2 324 CENSYa_1296 T protein-tyrosine phosphatase 50.1 325 CENSYa_1297 - hypothetical protein 65.5 326 CENSYa_1298 - hypothetical protein 95.9 327 CENSYa_1299 - hypothetical protein 54.8 328 CENSYa_1303 S metal-dependent protease of the PAD1/JA B1 59.4 329 CENSYa_1304 - hypothetical protein 60.1 330 CENSYa_1305 - conserved hypothetical protein 45.5 331 CENSYa_1306 E ABC-type dipeptide/oligopeptide/nickel transport 52.1 332 CENSYa_1307 E ABC-type dipeptide/oligopeptide/nickel transport 62.3 333 CENSYa_1310 S membrane protein 48.3 334 CENSYa_1311 - conserved hypothetical protein 37.6 335 CENSYa_1312 - solute binding protein 40.0 336 CENSYa_1313 Q SAM dependent methyltransferase 41.3 337 CENSYa_1316 C dehydrogenase 51.1 338 CENSYa_1321 - hypothetical protein 48.4 339 CENSYa_1324 T universal stress protein 40.1 340 CENSYa_1337 - conserved hypothetical protein 47.2 341 CENSYa_1338 - hypothetical protein 30.4 342 CENSYa_1340 - conserved hypothetical protein 33.3 343 CENSYa_1347 - hypothetical protein 45.0 344 CENSYa_1351 R ABC-type transport system involved in Fe-S 51.4 345 CENSYa_1353 H uroporphyrinogen-III synthase/methyltransferase 52.5 346 CENSYa_1366 P phosphate uptake regulator 62.2 347 CENSYa_1367 L DNA modification methylase 54.6 348 CENSYa_1368 R inorganic polyphosphate/ATP-NAD kinase 69.7 349 CENSYa_1369 G UDP-glucuronosyltransferase 47.9 350 CENSYa_1370 E aspartate kinases 46.1 351 CENSYa_1375 - hypothetical protein 60.6 352 CENSYa_1376 - SAM dependent methyltransferase 48.5 353 CENSYa_1383 - RNA-binding protein 35.0 354 CENSYa_1384 E ketol-acid reductoisomerase 86.5 355 CENSYa_1386 J selenocysteine-specific translation elongation 43.9 356 CENSYa_1387 - hypothetical protein 38.3 357 CENSYa_1390 - dioxygenase ferredoxin protein 55.9 358 CENSYa_1391 - membrane-associated phospholipid phosphatase 48.6 359 CENSYa_1392 - hypothetical protein 58.1 360 CENSYa_1397 L single-stranded DNA-binding replication protein 47.9 361 CENSYa_1399 - conserved hypothetical protein 71.1 362 CENSYa_1400 - conserved hypothetical protein 56.6 363 CENSYa_1402 H glutamyl-tRNA reductase 55.4 364 CENSYa_1403 H siroheme synthase/precorrin-2 oxidase 37.8 365 CENSYa_1407 S uncharacterized conserved protein 55.4 366 CENSYa_1408 K transcriptional regulator containing HTH domain 60.7 367 CENSYa_1410 - hypothetical protein 66.0 368 CENSYa_1411 - hypothetical protein 54.0 369 CENSYa_1414 - hypothetical protein 56.7 370 CENSYa_1416 - conserved hypothetical protein 41.9 371 CENSYa_1419 - hypothetical protein 55.2 372 CENSYa_1427 J tyrosyl-tRNA synthetase 37.7 373 CENSYa_1429 - hypothetical protein 70.2 374 CENSYa_1432 - uncharacterized protein 74.9 375 CENSYa_1438 - hypothetical protein 48.0 376 CENSYa_1444 C archaeal/vacuolar-type H -ATPase subunit D 81.0 377 CENSYa_1447 C archaeal/vacuolar-type H -ATPase subunit E 51.5 378 CENSYa_1451 C archaeal/vacuolar-type H -ATPase subunit I 55.1 379 CENSYa_1452 P rhodanese-related sulfurtransferase 47.9 380 CENSYa_1454 - hypothetical protein 44.7 381 CENSYa_1456 C archaeal/vacuolar-type H -ATPase subunit C 51.2 382 CENSYa_1457 - hypothetical protein 40.0 383 CENSYa_1458 - conserved hypothetical protein 55.4 384 CENSYa_1459 - hypothetical protein 50.5 385 CENSYa_1460 - hypothetical protein 65.6 386 CENSYa_1462 - hypothetical protein 52.8 387 CENSYa_1467 - hypothetical protein 66.6 388 CENSYa_1468 F orotidine-5'-phosphate decarboxylase 48.7 389 CENSYa_1471 S membrane protein 55.6 390 CENSYa_1481 - hypothetical protein 31.3 391 CENSYa_1488 M mannose-1-phosphate guanyltransferase 30.7 392 CENSYa_1495 - hypothetical protein 72.2 393 CENSYa_1497 E Xaa-Pro aminopeptidase 60.5 394 CENSYa_1498 Q SAM dependent methyltransferase 37.4 395 CENSYa_1499 H phosphopantothenoylcysteine 35.2 396 CENSYa_1502 R archaeal suger kinase 43.6 397 CENSYa_1503 P phosphate uptake regulator 47.7 398 CENSYa_1504 K transcriptional regulator 50.1 399 CENSYa_1505 R ABC-type transport system, involved in 62.6 400 CENSYa_1507 M hypothetical protein 49.4 401 CENSYa_1529 F thymidylate kinase 40.5 402 CENSYa_1530 R HD superfamily phosphohydrolases 34.3 403 CENSYa_1532 - hypothetical protein 64.6 404 CENSYa_1533 - hypothetical protein 40.8 405 CENSYa_1548 O FKBP-type peptidyl-prolyl cis-trans 47.3 406 CENSYa_1553 - copper export protein 58.4 407 CENSYa_1554 - hypothetical protein 60.6 408 CENSYa_1558 - NADH-dependent FMN reductase 62.9 409 CENSYa_1565 - hypothetical protein 44.9 410 CENSYa_1569 - hypothetical protein 32.7 411 CENSYa_1570 - conserved hypothetical protein 30.9 412 CENSYa_1577 C Fe-S oxidoreductase 89.3 413 CENSYa_1578 - hypothetical protein 67.3 414 CENSYa_1579 - conserved hypothetical protein 48.8 415 CENSYa_1584 - hypothetical protein 38.7 416 CENSYa_1608 L RNA-binding protein 52.9 417 CENSYa_1614 I mevalonate kinase 34.9 418 CENSYa_1615 R archaeal gamma-glutamyl kinase 55.9 419 CENSYa_1618 J glutamyl-tRNA synthetase 47.5 420 CENSYa_1621 Q SAM dependent methyltransferase 44.4 421 CENSYa_1622 - hypothetical protein 62.0 422 CENSYa_1641 T Mn2 -dependent serine/threonine protein kinase 36.1 423 CENSYa_1645 F nucleotide kinase (related to CMP and AMP 42.5 424 CENSYa_1646 J queuine/archaeosine tRNA-ribosyltransferase 50.5 425 CENSYa_1647 - hypothetical protein 96.8 426 CENSYa_1648 - hypothetical protein 77.4 427 CENSYa_1652 G archaeal fructose-1,6-bisphosphatase 62.1 428 CENSYa_1653 - hypothetical protein 48.5 429 CENSYa_1654 L replication factor C/ATPase involved in DNA 45.4 430 CENSYa_1655 - conserved hypothetical protein 55.7 431 CENSYa_1656 - hypothetical protein 39.9 432 CENSYa_1657 H glutamyl-tRNA reductase 41.2 433 CENSYa_1658 - hypothetical protein 57.0 434 CENSYa_1659 C Na -transporting NADH ubiquinone oxidoreductase 66.0 435 CENSYa_1662 I acetyl/propionyl-CoA carboxylase, alpha subunit 52.3 436 CENSYa_1664 C NADH-ubiquinone oxidoreductase, subunit A 80.4 437 CENSYa_1666 C NADH-ubiquinone oxidoreductase, subunit C 46.4 438 CENSYa_1668 C NADH-ubiquinone oxidoreductase, subunit H 71.5 439 CENSYa_1669 C NADH-ubiquinone oxidoreductase, subunit I 89.6 440 CENSYa_1670 C NADH-ubiquinone oxidoreductase, subunit J 78.1 441 CENSYa_1672 C formate hydrogenlyase subunit 3/Multisubunit Na 67.8 442 CENSYa_1673 C NADH-ubiquinone oxidoreductase, subunit L 84.9 443 CENSYa_1675 H geranylgeranyl pyrophosphate 68.3 444 CENSYa_1684 - pyrroloquinoline quinone (Coenzyme PQQ) 54.1 445 CENSYa_1687 - hypothetical protein 35.6 446 CENSYa_1688 M membrane-associated Zn-dependent protease 57.2 447 CENSYa_1700 E argininosuccinate lyase 39.4 448 CENSYa_1701 R metal-sulfur cluster biosynthetic enzyme 85.3 449 CENSYa_1703 - hypothetical protein 67.9 450 CENSYa_1704 C succinate dehydrogenase, subunit D 61.4 451 CENSYa_1708 K transcriptional regulator 37.3 452 CENSYa_1714 C flavodoxin reductases (ferredoxin-NADPH 65.4 453 CENSYa_1716 - hypothetical protein 60.3 454 CENSYa_1721 M UDP-N-acetylglucosamine-1-phosphate transferase 53.6 455 CENSYa_1723 L Cdc46/Mcm DNA replication licensing factor 58.2 456 CENSYa_1724 - hypothetical protein 52.4 457 CENSYa_1727 C ferredoxin 53.7 458 CENSYa_1728 C ferredoxin 56.3 459 CENSYa_1730 J histidyl-tRNA synthetase 43.0 460 CENSYa_1741 O prefoldin, molecular chaperone, alpha subunit 51.4 461 CENSYa_1743 - hypothetical protein 47.2 462 CENSYa_1744 - hypothetical protein 52.3 463 CENSYa_1753 R TPR repeat 36.3 464 CENSYa_1757 S uncharacterized conserved protein 40.4 465 CENSYa_1758 F cytosine deaminase 43.7 466 CENSYa_1761 - uncharacterized conserved protein 55.8 467 CENSYa_1766 J RNase P subunit 40.7 468 CENSYa_1769 - hypothetical protein 36.2 469 CENSYa_1771 - oxygen-sensitive ribonucleoside-triphosphate 42.3 470 CENSYa_1772 F oxygen-sensitive ribonucleoside-triphosphate 46.0 471 CENSYa_1778 - hypothetical protein 73.2 472 CENSYa_1779 R nucleic-acid-binding protein 70.2 473 CENSYa_1782 L DNA topoisomerase I 40.0 474 CENSYa_1788 S uncharacterized protein conserved in archaea 43.5 475 CENSYa_1789 - hypothetical protein 49.2 476 CENSYa_1790 - hypothetical protein 56.5 477 CENSYa_1791 - secreted periplasmic Zn-dependent protease 38.0 478 CENSYa_1793 - conserved hypothetical protein 56.0 479 CENSYa_1795 - copper binding protein, plastocyanin/azurin 30.9 480 CENSYa_1797 - hypothetical protein 45.5 481 CENSYa_1798 - streptogramin lyase 37.1 482 CENSYa_1801 - hypothetical protein 57.9 483 CENSYa_1802 K superfamily II DNA/RNA helicase, SNF2 family 69.5 484 CENSYa_1805 - hypothetical protein 65.3 485 CENSYa_1806 P Mn2 and Fe2 transporters of the NRAMP family 45.0 486 CENSYa_1809 C FAD/FMN-containing dehydrogenase 37.9 487 CENSYa_1810 - conserved hypothetical protein 46.9 488 CENSYa_1812 O DnaJ-class molecular chaperone 38.4 489 CENSYa_1815 R TPR repeat 43.6 490 CENSYa_1816 - hypothetical protein 64.7 491 CENSYa_1819 - hypothetical protein 50.5 492 CENSYa_1834 - hypothetical protein 53.3 493 CENSYa_1835 - hypothetical protein 30.1 494 CENSYa_1838 - hypothetical protein 63.0 495 CENSYa_1840 L archaeal DNA polymerase II, small subunit/DNA 48.1 496 CENSYa_1841 - hypothetical protein 42.6 497 CENSYa_1842 H precorrin-6B methylase 60.0 498 CENSYa_1843 - hypothetical protein 55.2 499 CENSYa_1844 O peptidyl-prolyl cis-trans isomerase (rotamase) 38.2 500 CENSYa_1845 C NADH dehydrogenase, FAD-containing subunit 57.0 501 CENSYa_1846 - hypothetical protein 52.3 502 CENSYa_1852 - hypothetical protein 40.4 503 CENSYa_1854 S uncharacterized protein conserved in archaea 41.6 504 CENSYa_1855 - hypothetical protein 38.3 505 CENSYa_1858 - hypothetical protein 43.5 506 CENSYa_1861 T Ser/Thr protein kinase 43.0 507 CENSYa_1862 J tRNA nucleotidyltransferase (CCA-adding enzyme) 34.0 508 CENSYa_1865 S uncharacterized protein conserved in archaea 38.1 509 CENSYa_1869 F phosphoribosylpyrophosphate synthetase 47.5 510 CENSYa_1870 M small-conductance mechanosensitive channel 51.6 511 CENSYa_1873 - hypothetical protein 35.0 512 CENSYa_1874 - secreted periplasmic Zn-dependent protease 33.2 513 CENSYa_1875 - hypothetical protein 44.1 514 CENSYa_1877 P NhaP-type Na /H and K /H antiporter 53.5 515 CENSYa_1879 - hypothetical protein 55.4 516 CENSYa_1881 P 3'-phosphoadenosine 5'-phosphosulfate (PAPS) 41.8 517 CENSYa_1882 - hypothetical protein 49.1 518 CENSYa_1886 - hypothetical protein 71.5 519 CENSYa_1887 S uncharacterized membrane-associated protein 35.8 520 CENSYa_1891 - hypothetical protein 60.2 521 CENSYa_1894 - hypothetical protein 31.9 522 CENSYa_1898 - hypothetical protein 59.0 523 CENSYa_1899 R short-chain alcohol dehydrogenase related to 56.5 524 CENSYa_1900 - hypothetical protein 55.1 525 CENSYa_1901 - hypothetical protein 56.8 526 CENSYa_1902 - thiamine biosynthesis protein 38.5 527 CENSYa_1905 - conserved hypothetical protein 46.2 528 CENSYa_1907 S uncharacterized conserved protein 56.0 529 CENSYa_1911 F phosphoribosylformylglycinamidine (FGAM) 33.4 530 CENSYa_1919 - hypothetical protein 35.1 531 CENSYa_1920 K RNA-binding protein homologous to eukaryotic 41.0 532 CENSYa_1926 L eukaryotic-type DNA primase, large subunit 51.1 533 CENSYa_1927 L eukaryotic-type DNA primase, catalytic (small) 39.2 534 CENSYa_1928 S uncharacterized conserved protein 61.7 535 CENSYa_1929 F GMP synthase, glutamine amidotransferase domain 49.4 536 CENSYa_1930 R TPR repeat 53.2 537 CENSYa_1936 - hypothetical protein 46.8 538 CENSYa_1939 R uncharacterized membrane protein required for 32.4 539 CENSYa_1942 S uncharacterized conserved protein 64.5 540 CENSYa_1949 M nucleoside-diphosphate-sugar epimerase 32.7 541 CENSYa_1954 L single-stranded DNA-binding replication protein 45.5 542 CENSYa_1958 S uncharacterized protein conserved in archaea 37.8 543 CENSYa_1959 - hypothetical protein 47.2 544 CENSYa_1960 O molecular chaperone GrpE 37.5 545 CENSYa_1963 - hypothetical protein 45.3 546 CENSYa_1966 - hypothetical protein 38.5 547 CENSYa_1971 R metal-sulfur cluster biosynthetic enzyme 75.9 548 CENSYa_1972 Q short-chain alcohol dehydrogenase 44.8 549 CENSYa_1975 - conserved hypothetical protein 62.6 550 CENSYa_1976 - conserved hypothetical protein 59.9 551 CENSYa_1978 G fructose-2,6-bisphosphatase 50.7 552 CENSYa_1979 - heme/copper-type cytochrome/quinol oxidase, 55.6 553 CENSYa_1982 - hypothetical protein 62.9 554 CENSYa_1984 - hypothetical protein 46.2 555 CENSYa_1985 - hypothetical protein 73.7 556 CENSYa_1992 K transcriptional regulator 79.0 557 CENSYa_1994 E threonine-phosphate decarboxylase 37.9 558 CENSYa_1997 H cobalamin-5-phosphate synthase 61.5 559 CENSYa_1998 H 4-diphosphocytidyl-2C-methyl-D-erythritol 39.1 560 CENSYa_2000 F adenine deaminase 54.0 561 CENSYa_2004 - conserved hypothetical protein 36.8 562 CENSYa_2012 - hypothetical protein 39.3 563 CENSYa_2016 C Zn-dependent oxidoreductase 52.8 564 CENSYa_2017 R TPR repeat 30.1 565 CENSYa_2018 J diphthamide synthase subunit 70.3 566 CENSYa_2023 R DNA-binding protein containing a Zn-ribbon 37.7 567 CENSYa_2035 E ATP phosphoribosyltransferase 100.0 568 CENSYa_2037 E histidinol-phosphate decarboxylase 43.4 569 CENSYa_2038 R phosphatase 37.2 570 CENSYa_2046 H dinucleotide-utilizing enzyme 40.4 a Identification based on a blast score ratio cut-off ≥30. The blast score ratio represents the bitscore of the C. symbiosum protein queried

against Sargasso Sea reads divided by the self-match bitscore of the C. symbiosum sequence queried against itself (see methods).

COG category descriptions are found in Figure 3