(12) United States Patent
Swanson et al.

(10) Patent No.: US 6,632,937 B1
(45) Date of Patent: Oct. 14, 2003

(54) NUCLEIC ACIDS AND PROTEINS FROM CENARCHAEUM SYMBIOSUM

(75) Inventors: Ronald V. Swanson, La Jolla, CA (US); Robert A. Feldman, Poway, CA (US); Christa Schleper, Darmstadt (DE)

(73) Assignee: Diversa Corporation, San Diego, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/408,020

(22) Filed: Sep. 29, 1999

Related U.S. Application Data
(60) Provisional application No. 60/102,294, filed on Sep. 29, 1998.

(51) Int. Cl.: C07H 21/04; C07K 14/195
(52) U.S. Cl.: 536/23.7; 530/350
(58) Field of Search: 536/23.7; 530/350

(56) References Cited

PUBLICATIONS

X57760, A. Fainsod, Jul. 8, 1992.*
AF016442, Wilson et al., Aug. 7, 1997.*
AF083071, Schleper et al., Oct. 6, 1998.*
Baker et al., "Protein structure prediction and structural genomics," Science (Oct. 5, 2001) vol. 294, pp. 93-96.*
Ainsworth et al., "T. aestivum AGP-S mRNA," Database EMBL Accession No. X66080 (May 13, 1992) XP002136936 (abstract).
DeLong et al., "Application of Genomics for Understanding the Evolution of Hyperthermophilic and Nonthermophilic Crenarchaeota," Biological Bulletin 196(3):363-365 (Jun. 1999).
Krejci et al., "Rattus norvegicus acetylcholinesterase-associated collagen," Database EMBL Accession No. AF007583 (Nov. 1, 1997) XP002130622 (abstract).
Preston et al., "A psychrophilic crenarchaeon inhabits a marine sponge: Cenarchaeum symbiosum gen. nov., sp. nov.," Proc. Natl. Acad. Sci. USA 93:6241-6246 (Jun. 1996); & "Cenarchaeum symbiosum small subunit ribosomal RNA gene sequence," Database Genbank Accession No. U51469 (Aug. 13, 1996) XP002130621 (abstract).
Schleper et al., "Genomic analysis reveals chromosomal variation in natural populations of the uncultured psychrophilic archaeon Cenarchaeum symbiosum," J. Bacteriol. 180(19):5003-5009, Database EMBL XP-002136935; & "Cenarchaeum symbiosum strain B," Database EMBL Accession No. AF083072 (Sep. 23, 1998) XP002136935 (abstract).
Schleper et al., "Characterization of a DNA polymerase from the uncultivated psychrophilic archaeon Cenarchaeum symbiosum," Journal of Bacteriology 179(24):7803-7811 (Dec. 1997) XP00872756; & "Cenarchaeum symbiosum DNA polymerase gene," Database Genbank Accession No. AF028831 (Jan. 6, 1998) XP002130624 (abstract).
Stein et al., "Characterization of Uncultivated Prokaryotes: Isolation and Analysis of a 40-Kilobase-Pair Genome Fragment from a Planktonic Marine Archaeon," Journal of Bacteriology 178(3):591-599 (Feb. 1996) XP002050143.
Suzuki et al., "Pig gad65 mRNA for glutamic acid decarboxylase," Database EMBL Accession No. D31848 (Apr. 28, 1995) XP002130623 (abstract).

* cited by examiner

Primary Examiner: Ardin H. Marschel
Assistant Examiner: Marjorie A. Moran
(74) Attorney, Agent, or Firm: Fish & Richardson P.C.

(57) ABSTRACT

The present application relates to nucleic acids and polypeptides from Cenarchaeum symbiosum. Methods of making the polypeptides and antibodies against the polypeptides are also described.

2 Claims, 7 Drawing Sheets

U.S. Patent Oct. 14, 2003 Sheet 2 of 7 US 6,632,937 B1

[Sheets 2 and 3 of 7: drawing content not recoverable from the extracted text.]

[Sheet 4 of 7: block diagram of COMPUTER SYSTEM 100, showing a PROCESSOR, INTERNAL STORAGE 115, DATA RETRIEVING DEVICE 118, DISPLAY 120, and elements 125A, 125B and 125C.]

FIGURE 3

[Sheet 5 of 7: flow diagram of process 200. Boxes, in order: START; STORE NEW SEQUENCE TO A MEMORY (202); OPEN DATABASE OF SEQUENCES; READ FIRST SEQUENCE IN DATABASE (206); PERFORM COMPARISON OF NEW SEQUENCE AND STORED SEQUENCE (210); decision (212), whose YES branch leads to DISPLAY STORED SEQUENCE NAME TO USER (214); decision MORE SEQUENCES IN DATABASE, whose YES branch leads to GO TO NEXT SEQUENCE IN DATABASE (224) and whose NO branch leads to 220.]

FIGURE 4
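
The loop charted in FIG. 4 (store a new sequence, walk the database one stored sequence at a time, and report the name of each stored sequence that the comparison accepts) can be sketched as below. This is only an illustrative sketch and not the patented implementation: the function name `find_matching_sequences`, the dictionary standing in for the sequence database, and the use of case-insensitive exact equality as the comparison test are all assumptions made here, since the flow diagram does not spell out the comparison itself.

```python
from typing import Dict, List


def find_matching_sequences(new_sequence: str, database: Dict[str, str]) -> List[str]:
    """Return the names of stored sequences accepted by the comparison.

    Assumption: the comparison is case-insensitive exact equality. A real
    system would more likely plug in a homology-based comparison here.
    """
    query = new_sequence.upper()             # "store new sequence to a memory"
    matches: List[str] = []
    for name, stored in database.items():    # read first, then next, sequence
        if stored.upper() == query:           # compare new and stored sequence
            matches.append(name)              # stands in for "display stored sequence name"
    return matches                            # loop ends when no sequences remain


# Hypothetical three-entry database used only for illustration.
database = {"seq_a": "ATGGCC", "seq_b": "ATGGCA", "seq_c": "atggcc"}
print(find_matching_sequences("ATGGCC", database))
```

Collecting the names in a list rather than printing inside the loop keeps the comparison step separate from the display step, which mirrors the separate boxes in the flow diagram.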

[Sheet 6 of 7: flow diagram of process 250. Boxes, in order: START (252); STORE A FIRST SEQUENCE TO A MEMORY (254); STORE A SECOND SEQUENCE TO A MEMORY (256); READ FIRST CHARACTER OF FIRST SEQUENCE (260); READ FIRST CHARACTER OF SECOND SEQUENCE (262); decision (264), whose YES branch leads to READ NEXT CHARACTER OF FIRST AND SECOND SEQUENCES; a further decision about remaining characters to read (label only partly legible), whose NO branch leads to DISPLAY HOMOLOGY LEVEL BETWEEN THE FIRST AND SECOND SEQUENCES (276).]

FIGURE 5
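
The character-by-character comparison charted in FIG. 5 can be illustrated with a short sketch. It rests on assumptions that the flow diagram itself does not state: the two sequences are compared position by position without gaps, and the homology level is reported as the number of matching positions divided by the length of the first sequence. The function name `homology_level` is hypothetical, and this is a simplification rather than the method of the patent.

```python
def homology_level(first: str, second: str) -> float:
    """Fraction of positions in `first` that match the same position in `second`.

    Assumptions: ungapped, position-by-position comparison; positions of
    `first` that have no counterpart in `second` count as mismatches.
    """
    if not first:
        raise ValueError("first sequence must not be empty")
    same = 0
    # zip stops at the shorter sequence, i.e. when either one has no more
    # characters to read.
    for a, b in zip(first.upper(), second.upper()):
        if a == b:
            same += 1
    return same / len(first)


print(homology_level("ACGT", "ACGA"))
```

Programs such as BLASTN and FASTA, which the specification names for homology determinations, align sequences with gaps and scoring matrices; the sketch above does not attempt that.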

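[Sheet 7 of 7: flow diagram of process 300. Boxes, in order: START; STORE A FIRST SEQUENCE TO MEMORY (304); OPEN DATABASE OF SEQUENCE FEATURES (306); READ FIRST FEATURE FROM DATABASE (308); COMPARE FEATURE ATTRIBUTES WITH THE FIRST SEQUENCE (310); DISPLAY FOUND FEATURE TO THE USER; decision MORE FEATURES IN DATABASE, whose YES branch leads to READ NEXT FEATURE IN DATABASE and whose NO branch ends the process. Reference numerals 316, 324 and 326 also appear on the sheet.]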

FIGURE 6 US 6,632,937 B1 1 2 NUCLEC ACIDS AND PROTEINS FROM 1995. Molecular phylogenetic analysis of a soil microbial CENARCHAEUMSYMBIOSUM community. Eur: J. Soil Sci. 46, 415-421; Hershberger, K. L. et al. 1996. Wide diversity of Crenarchaeota. Nature 384, RELATED APPLICATIONS 420; MacGregor, B.J. 1997. Crenarchaeota in Lake Michi gan sediment. Appl. Env, Microb. 63, 1178–1181 et al.; The present application claims benefit of U.S. Provisional Schleper, C.et al. 1997. Recovery of crenarchaeotal riboso Patent Application Serial No. 60/102,294, filed Sep. 29, mal DNA sequences from freshwater-lake Sediments. Appl. 1998, the disclosure of which is incorporated herein by Env: Microb. 63, 321-323) The ecological distribution of reference in its entirety. these organisms was initially Surprising, Since their closest BACKGROUND OF THE INVENTION cultivated relatives are all thermophilic or hyperthermo philic. No representative of this new archaeal group has yet The identification and characterization of organisms been obtained in pure culture, So the phenotypic and meta which inhabit a diverse range of ecosystems leads to a bolic properties of these organisms, as well as their impact greater understanding of the operation of Such ecosystems. on the environment and global nutrient cycling, remain In addition, because the physiology of Such organisms is 15 unknown. Since growth temperature and habitat character adapted to function in the particular habitat which the istics vary So widely between non-thermophilic and the organism inhabits, the enzymes which carry out the organ hyperthermophilic Creanarchaeota, these groups are likely ism's physiological processes may possess characteristics to differ greatly with respect to their specific physiology and which provide advantages when they are utilized in thera metabolism. peutic procedures, industrial applications, or research appli To gain a better perspective on the genetic and physi cations. 
Furthermore, by determining the Sequences of these ological characteristics of non-thermophilic crenarchaeotes, organisms genes, insight into their biochemical pathways a genomic Study of Cenarchaeum Symbiosum was begun. and processes may be gained without the necessity of This archaeon lives in Specific association with the marine culturing the organisms in the laboratory, thereby enabling Sponge mexicana off the coast of California, allow the physiological characterization of organisms which are 25 ing access to relatively large amounts of biomass from this recalcitrant to growth in the laboratory. . (Preston, C. M. et al. 1996. A psychrophilic crenar Molecular phylogenetic Surveys have recently revealed an chaeon inhabits a marine Sponge: Cenarchaeum Symbiosum ecologically widespread Crenarchaeal group that inhabits gen. nov., sp. nov. Proc. Natl. Acad. Sci. USA 93, cold and temperate terrestrial and marine environments. To 6241-6246) The approach taken herein differs in several date these organisms have resisted isolation in pure culture, respects from now Standard genomic characterization of So their phenotypic and genotypic characteristics remain cultivated organisms, and also from comparable Studies of largely unknown. In order to characterize the physiology of uncultivated obligate parasites or Symbionts. C. Symbiosum these , to develop methodological approaches for has not been completely physically separated from the characterizing uncultivated microorganisms and identifying tissueS of its metazoan host. Therefore, its genetic material their presence in a Sample, and to identify enzymes produced 35 needs to be identified within the context of complex by these archae which may be useful in therapeutic, genomic libraries that contain Significant amounts of eucary industrial, or laboratory applications, genomic analyses of otic DNA, as well as DNA derived from members of the non-thermophilic crenarchaeote Cenarchaeum Symbio Bacteria. Sum was undertaken. 
Molecular phylogenetic Surveys of mixed microbial Non-thermophilic Crenarchaeota are one of the more 40 populations have revealed the existence of many new lin abundant, widespread and frequently recovered prokaryotic eages undetected by classical microbiological approaches. groups revealed by molecular phylogenetic approaches. (DeLong, E. F. 1997. Marine microbial diversity: the tip of These microorganisms were originally detected in high the iceberg. Tibtech 15, 2-9.; Pace, N. R. 1997. A molecular abundance in temperate ocean waters and polar Seas. view of microbial diversity and the biosphere. Science 276, (DeLong, E. F. 1992. Archaea in coastal marine environ 45 734–740) Furthermore, quantitative rRNA hybridization ments. Proc. Natl. Acad. Sci. 89,5685–5689; DeLong, E. F experiments demonstrate that Some of these novel prokary et al. 1994. High abundance of Archaea in Antarctic marine otic groups represent major components of natural microbial picoplankton. Nature 371, 695-697; Fuhrman, J. A., et al. communities. These molecular phylogenetic approaches Davis. 1992. Novel major archaebacterial group from have altered current views of microbial diversity and marine plankton. Nature 356, 148-149; Massana, R., et al. 50 ecology, and have demonstrated that traditional cultivation 1997. Vertical distribution and phylogenetic characterization techniques may recover only a Small, Skewed fraction of of marine planktonic Archaea in the Santa Barbara Channel. naturally occurring microbes. However, phylogenetic iden Appl. Env, Microb. 63, 50–56; McInerney, J. O. et al. 1995. tification using Single gene Sequences provides a limited Recovery and phylogenetic analysis of novel archaeal rRNA perspective on other biological properties, particularly for Sequences from a deep-Sea deposit feeder. Appl. Env, 55 novel lineages only distantly related to cultivated and char Microb. 61, 1646–1648; Preston, C. M. et al. 1996. A acterized organisms. 
Consequently, additional approaches psychrophilic crenarchaeon inhabits a marine Sponge: Cen are necessary to better characterize ecologically abundant archaeum Symbiosum gen. nov, sp. nov. Proc. Natl. Acad. and potentially biotechnologically useful microorganisms, Sci. USA 93, 6241-6246) Representatives have now been many of which resist cultivation attempts. reported in terrestrial environments and freshwater lake 60 Sediments, indicating a widespread distribution. (Bintrim, S. SUMMARY OF THE INVENTION B. et al. 1997. Molecular phylogeny of Archaea from soil. One embodiment of the present invention is an isolated, Proc. Natl. Acad Sci. USA94, 277–282; Jurgens, G. et al. purified, or enriched nucleic acid comprising a Sequence 1997. Novel group within the kingdom Crenarchaeota from selected from the group consisting of SEQ ID NO: 1 and boreal forest soil. Appl. Env, Mircob. 63,803-80515, Kudo, 65 SEQ ID NO: 2, the sequences complementary to SEQ ID Y. et al. 1997. Peculiar archaea found in Japanese paddy NO: 1 and SEQ ID NO: 2, fragments comprising at least 10 Soils. BioSc. Biotech. Biochem. 61, 917-920, Ueda, et al. consecutive nucleotides of SEO ID NO: 1 and SEO ID NO: US 6,632,937 B1 3 4 2, and fragments comprising at least 10 consecutive nucle capable of hybridizing to the nucleic acid of this embodi otides of the sequences complementary to SEQ ID NO: 1 ment under conditions of low Stringency. Another aspect of and SEQ ID NO: 2. One aspect of the present invention is the present invention is an isolated, purified, or enriched an isolated, purified, or enriched nucleic acid capable of nucleic acid having at least 70% homology to the nucleic hybridizing to the nucleic acid of this embodiment under acid of this embodiment as determined by analysis with conditions of high Stringency. Another aspect of the present BLASTN version 2.0 with the default parameters. 
Another invention is an isolated, purified, or enriched nucleic acid aspect of the present invention is an isolated, purified, or capable of hybridizing to the nucleic acid of this embodi enriched nucleic acid having at least 99% homology to the ment under conditions of moderate Stringency. Another nucleic acid of this embodiment as determined by analysis aspect of the present invention is an isolated, purified, or with BLASTN version 2.0 with the default parameters. enriched nucleic acid capable of hybridizing to the nucleic Another embodiment of the present invention is an acid of this embodiment under conditions of low Stringency. isolated, purified, or enriched nucleic acid comprising at Another aspect of the present invention is an isolated, least 10 consecutive bases of a Sequence Selected from the purified, or enriched nucleic acid having at least 70% group consisting of SEQID NOs: 3, 7, 11, 15, 17, 19, 21, 23, homology to the nucleic acid of this embodiment as deter 15 35, 39, 43,47, 49, 51,53,55, 69, 73, 77 and the sequences mined by analysis with BLASTN version 2.0 with the complementary thereto. One aspect of the present invention default parameters. Another aspect of the present invention is an isolated, purified, or enriched nucleic acid having at is an isolated, purified, or enriched nucleic acid having at least 70% homology to the nucleic acid of this embodiment least 99% homology to the nucleic acid of this embodiment as determined by analysis with BLASTN version 2.0 with as determined by analysis with BLASTN version 2.0 with the default parameters. Another aspect of the present inven the default parameters. 
tion is an isolated, purified, or enriched nucleic acid having Another embodiment of the present invention is an at least 99% homology to the nucleic acid of this embodi isolated, purified, or enriched nucleic acid comprising a ment as determined by analysis with BLASTN version 2.0 Sequence Selected from the group consisting of SEQ ID with the default parameters. NOs: 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 25 65, 67,71, 75, 79 and the sequences complementary thereto. Another embodiment of the present invention is an One aspect of the present invention is an isolated, purified, isolated, purified, or enriched nucleic acid encoding a or enriched nucleic acid capable of hybridizing to the polypeptide having a Sequence Selected from the group nucleic acid of this embodiment under conditions of high consisting of SEQID NOS: 6, 10, 14, 26, 28, 30, 32, 34,38, Stringency. Another aspect of the present invention is an 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, and 80. isolated, purified, or enriched nucleic acid capable of hybrid Another embodiment of the present invention is an izing to the nucleic acid of this embodiment under condi isolated, purified, or enriched nucleic acid encoding a tions of moderate Stringency. Another aspect of the present polypeptide comprising at least 10 consecutive amino acids invention is an isolated, purified, or enriched nucleic acid of a polypeptide having a Sequence Selected from the group capable of hybridizing to the nucleic acid of this embodi 35 consisting of SEQID NOS: 6, 10, 14, 26, 28, 30, 32, 34,38, ment under conditions of low Stringency. Another aspect of 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, and 80. 
the present invention is an isolated, purified, or enriched Another embodiment of the present invention is an nucleic acid having at least 70% homology to the nucleic isolated, purified, or enriched nucleic acid encoding a acid of this embodiment as determined by analysis with polypeptide having a Sequence Selected from the group BLASTN version 2.0 with the default parameters. Another 40 consisting of SEQ ID NOs: 4, 8, 12, 16, 18, 20, 22, 24, 36, aspect of the present invention is an isolated, purified, or 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. enriched nucleic acid having at least 99% homology to the Another embodiment of the present invention is an nucleic acid of this embodiment as determined by analysis isolated, purified, or enriched nucleic acid encoding a with BLASTN version 2.0 with the default parameters. polypeptide comprising at least 10 consecutive amino acids Another embodiment of the present invention is an 45 of a polypeptide having a Sequence Selected from the group isolated, purified, or enriched nucleic acid comprising at consisting of SEQ ID NOs: 4, 8, 12, 16, 18, 20, 22, 24, 36, least 10 consecutive bases of a Sequence Selected from the 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. group consisting of SEQ ID NOs: 5, 9, 13, 25, 27, 29, 31, Another embodiment of the present invention is an iso 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79 and the lated or purified polypeptide comprising a sequence Selected Sequences complementary thereto. One aspect of the present 50 from the group consisting of SEQ ID NOS: 6, 10, 14, 26, 28, invention is an isolated, purified, or enriched nucleic acid 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68,72, 76, and 80. having at least 70% homology to the nucleic acid of this Another aspect of the present invention is an isolated or embodiment as determined by analysis with BLASTN ver purified polypeptide comprising at least 10 consecutive sion 2.0 with the default parameters. 
amino acids of the polypeptides of this embodiment. Another embodiment of the present invention is an 55 Another aspect of the present invention is an isolated or isolated, purified, or enriched nucleic acid comprising a purified polypeptide having at least 70% homology to the Sequence Selected from the group consisting of SEQ ID polypeptide of this embodiment as determined by analysis NOs: 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, with FASTA version 3.0t78 with the default parameters. 55, 69, 73, 77 and the sequences complementary thereto. Another aspect of the present invention is an isolated or One aspect of the present invention is an isolated, purified, 60 purified polypeptide having at least 99% homology to the or enriched nucleic acid capable of hybridizing to the polypeptide of this embodiment as determined by analysis nucleic acid of this embodiment under conditions of high with FASTA version 3.0t78 with the default parameters. Stringency. Another aspect of the present invention is an Another aspect of the present invention is an isolated or isolated, purified, or enriched nucleic acid capable of hybrid purified polypeptide having at least 70% homology to an izing to the nucleic acid of this embodiment under condi 65 isolated or purified polypeptide comprising at least 10 tions of moderate Stringency. Another aspect of the present consecutive amino acids of the polypeptides of this embodi invention is an isolated, purified, or enriched nucleic acid ment as determined by analysis with FASTA version 3.0t78 US 6,632,937 B1 S 6 with the default parameters. Another aspect of the present 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, and 80 comprising invention is an isolated or purified polypeptide having at introducing a nucleic acid encoding Said polypeptide, Said least 99% homology to the polypeptide of to an isolated or nucleic acid being operably linked to a promoter, into a host purified polypeptide comprising at least 10 consecutive cell. 
amino acids of the polypeptides of this embodiment as 5 Another embodiment of the present invention is a method determined by analysis with FASTA version 3.0t78 with the of making a polypeptide having a Sequence Selected from default parameters. the group consisting of SEQID NOs: 4, 8, 12, 16, 18, 20, 22, Another aspect of the present invention is an isolated or 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 comprising purified polypeptide comprising a Sequence Selected from introducing a nucleic acid encoding Said polypeptide, Said the group consisting of SEQID NOs: 4, 8, 12, 16, 18, 20, 22, nucleic acid being operably linked to a promoter, into a host 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. One aspect cell. of the present invention is an isolated or purified polypeptide Another embodiment of the present invention is a method comprising at least 10 consecutive amino acids of the of making a polypeptide comprising at least 10 amino acids polypeptides of this embodiment. Another aspect of the of a Sequence Selected from the group consisting of the present invention is an isolated or purified polypeptide 15 sequences of SEQ ID NOs: 4, 8, 12, 16, 18, 20, 22, 24, 36, having at least 70% homology to the polypeptides of this 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 comprising embodiment as determined by analysis with FASTA version introducing a nucleic acid encoding Said polypeptide, Said 3.0t78 with the default parameters. Another aspect of the nucleic acid being operably linked to a promoter, into a host present invention is an isolated or purified polypeptide cell. having at least 99% homology to the polypeptides of this Another embodiment of the present i method of generat embodiment as determined by analysis with FASTA version ing a variant comprising obtaining a nucleic acid comprising 3.0t78 with the default parameters. 
Another aspect of the a Sequence Selected from the group consisting of SEQ ID present invention is An isolated or purified polypeptide NOs. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45,57, 59, 61, having at least 70% homology to an isolated or purified 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, polypeptide comprising at least 10 consecutive amino acids 25 47, 49, 51, 53, 55, 69, 73 and 77, the sequences comple of the polypeptides of this embodiment as determined by mentary to the sequences of SEQ ID NOS. 1, 2, 5, 9, 13, 25, analysis with FASTA version 3.0t78 with the default param 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, eters. Another aspect of the present invention is an isolated 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43,47, 49, 51,53,55, 69, or purified polypeptide having at least 99% homology to an 73 and 77, fragments comprising at least 30 consecutive isolated or purified polypeptide comprising at least 10 nucleotides of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31,33, consecutive amino acids of the polypeptides of this embodi 37, 41, 45,57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, ment as determined by analysis with FASTA version 3.0t78 19, 21, 23, 35, 39, 43, 47,49, 51,53,55, 69, 73 and 77, and with the default parameters. fragments comprising at least 30 consecutive nucleotides of Another embodiment of the present invention is an iso 35 the sequences complementary to SEQID NOS. 1, 2, 5, 9, 13, lated or purified antibody capable of Specifically binding to 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, a polypeptide comprising a Sequence Selected from the 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, 47,49, 51,53,55, group consisting of SEQ ID NOS: 6, 10, 14, 26, 28, 30, 32, 69, 73 and 77 and changing one or more nucleotides in Said 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, and 80. 
Sequence to another nucleotide, deleting one or more nucle Another embodiment of the present invention is an iso otides in Said Sequence, or adding one or more nucleotides lated or purified antibody capable of Specifically binding to 40 to Said Sequence. In one aspect of the present invention, the a polypeptide comprising at least 10 consecutive amino method further comprises the Step of testing the enzymatic acids of one of the polypeptides of SEQ ID NOS: 6, 10, 14, properties of a translation product of Said variant. 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68,72, 76, and 80. Another embodiment of the present invention is a com 45 puter readable medium having Stored thereon a Sequence Another embodiment of the present invention is an iso Selected from the group consisting of a nucleic acid code of lated or purified antibody capable of Specifically binding to SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, a polypeptide having a Sequence Selected from the group 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, consisting of SEQ ID NOs: 4, 8, 12, 16, 18, 20, 22, 24, 36, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 and a 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. 50 polypeptide code of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, Another embodiment of the present invention is an iso 34, 38, 42, 46, 58, 60, 62, 64, 66, 68,72, 76,80, 4, 8, 12, 16, lated or purified antibody capable of Specifically binding to 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. a polypeptide comprising at least 10 consecutive amino Another embodiment of the present invention is a com acids of one of the polypeptides of SEQ ID NOs: 4, 8, 12, puter System comprising a processor and a data Storage 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 55 device wherein Said data Storage device has Stored thereon 78. 
a Sequence Selected from the group consisting of a nucleic Another embodiment of the present invention is a method acid code of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, of making a polypeptide having a Sequence Selected from 37, 41, 45,57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, the group consisting of SEQ ID NOS: 6, 10, 14, 26, 28, 30, 19, 21, 23, 35, 39, 43, 47,49, 51,53,55, 69, 73 and 77 and 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, and 80 60 a polypeptide code of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, comprising introducing a nucleic acid encoding Said 34, 38, 42, 46, 58, 60, 62, 64, 66, 68,72, 76,80, 4, 8, 12, 16, polypeptide, Said nucleic acid being operably linked to a 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. promoter, into a host cell. In one aspect of the present invention, the computer System Another embodiment of the present invention is a method further comprises a Sequence comparer and a data Storage of making a polypeptide comprising at least 10 amino acids 65 device having reference Sequences Stored thereon. For of a Sequence Selected from the group consisting of the example, the Sequence comparer may comprise a computer sequences of SEQID NOS: 6, 10, 14, 26, 28, 30, 32, 34,38, program which indicateS polymorphisms. In another aspect US 6,632,937 B1 7 8 of the present invention is the computer System of this tide present in a living animal is not isolated, but the same embodiment further comprises an identifier which identifies polynucleotide or polypeptide, Separated from Some or all of features in Said Sequence. the coexisting materials in the natural System, is isolated. 
Another embodiment of the present invention is a method Such polynucleotides could be part of a vector and/or Such for comparing a first Sequence to a reference Sequence polynucleotides or polypeptides could be part of a wherein Said first Sequence is Selected from the group composition, and Still be isolated in that Such vector or consisting of a nucleic acid code of SEQ ID NOS. 1, 2, 5, 9, composition is not part of its natural environment. 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, AS used herein, the term "purified” does not require 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, 47,49, 51, 53, absolute purity; rather, it is intended as a relative definition. 55, 69, 73 and 77 and a polypeptide code of SEQ ID NOs. Individual nucleic acids obtained from a library have been 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, conventionally purified to electrophoretic homogeneity. The 68, 72, 76,80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, Sequences obtained from these clones could not be obtained 52, 54, 56, 70, 74, and 78 comprising the steps of reading directly either from the library or from total human DNA. Said first Sequence and Said reference Sequence through use The purified nucleic acids of the present invention have been of a computer program which compares Sequences, and 15 purified from the remainder of the genomic DNA in the determining differences between Said first Sequence and Said organism by at least 10-10 fold. However, the term reference Sequence with Said computer program. In one “purified” also includes nucleic acids which have been aspect of the present invention, the Step of determining purified from the remainder of the genomic DNA or from differences between the first Sequence and the reference other Sequences in a library or other environment by at least Sequence comprises identifying polymorphisms. 
one order of magnitude, preferably two or three orders, and Another embodiment of the present invention is a method more preferably four or five orders of magnitude. for identifying a feature in a Sequence Selected from the AS used herein, the term “recombinant’ means that the group consisting of a nucleic acid code of SEQ ID NOS. 1, nucleic acid is adjacent to “backbone' nucleic acid to which 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45,57, 59, 61, 63, 65, it is not adjacent in its natural environment. Additionally, to 67,71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, 47,49, 25 be “enriched” the nucleic acids will represent 5% or more of 51,53,55, 69, 73 and 77 and a polypeptide code of SEQ ID the number of nucleic acid inserts in a population of nucleic NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, acid backbone molecules. Backbone molecules according to 66, 68,72, 76,80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, the present invention include nucleic acids Such as expres 50, 52,54, 56, 70, 74, and 78 comprising the steps of reading Sion vectors, Self-replicating nucleic acids, Viruses, integrat Said Sequence through the use of a computer program which ing nucleic acids, and other vectors or nucleic acids used to identifies features in Sequences and identifying features in maintain or manipulate a nucleic acid insert of interest. Said Sequence with Said computer program. Preferably, the enriched nucleic acids represent 15% or more of the number of nucleic acid inserts in the population of BRIEF DESCRIPTION OF THE DRAWINGS recombinant backbone molecules. More preferably, the 35 enriched nucleic acids represent 50% or more of the number FIG. 1 shows the locations of coding regions, the 96G-C. of nucleic acid inserts in the population of recombinant and the %DNA identity between the approximately 28 Kb of backbone molecules. In a highly preferred embodiment, the common sequence in fosmids 101 G10 and 60A5. 
enriched nucleic acids represent 90% or more of the number FIGS. 2A and 2B show(s) the sequences surrounding the of nucleic acid inserts in the population of recombinant TATA boxes of several promoters from Cenarchaeum Sym 40 backbone molecules. biosum and the distances from the TATA boxes to the A promoter Sequence is “operably linked to a coding initiation codons in these Sequences. Sequence when RNA polymerase which initiates transcrip FIG. 3 is a block diagram of an exemplary computer tion at the promoter will transcribe the coding Sequence into System. mRNA. FIG. 4 is a flow diagram illustrating one embodiment of 45 "Recombinant' polypeptides or proteins refer to polypep a proceSS 200 for comparing a new nucleotide or protein tides or proteins produced by recombinant DNA techniques, Sequence with a database of Sequences in order to determine i.e., produced from cells transformed by an exogenous DNA the homology levels between the new Sequence and the construct encoding the desired polypeptide or protein. “Syn Sequences in the database. thetic' polypeptides or protein are those prepared by chemi FIG. 5 is a flow diagram illustrating one embodiment of 50 cal Synthesis. a process 250 in a computer for determining whether two A DNA “coding Sequence' or a "nucleotide Sequence Sequences are homologous. encoding a particular polypeptide or protein, is a DNA FIG. 6 is a flow diagram illustrating one embodiment of Sequence which is transcribed and translated into a polypep an identifier proceSS for detecting the presence of a feature 55 tide or protein when placed under the control of appropriate in a Sequence. regulatory Sequences. “Plasmids” are designated by a lower case p preceded DEFINITIONS and/or followed by capital letters and/or numbers. 
The term “gene” means the segment of DNA involved in producing a polypeptide chain; it includes regions preceding and following the coding region (leader and trailer) as well as, where applicable, intervening sequences (introns) between individual coding segments (exons).

As used herein, the term “isolated” means that the material is removed from its original environment (e.g., the natural environment if it is naturally occurring). For example, a naturally-occurring polynucleotide or polypep-

The starting plasmids herein are either commercially available, publicly available on an unrestricted basis, or can be constructed from available plasmids in accord with published procedures. In addition, equivalent plasmids to those described herein are known in the art and will be apparent to the ordinarily skilled artisan.

“Digestion” of DNA refers to catalytic cleavage of the DNA with a restriction enzyme that acts only at certain sequences in the DNA. The various restriction enzymes used herein are commercially available and their reaction conditions, cofactors and other requirements were used as would be known to the ordinarily skilled artisan. For analytical purposes, typically 1 µg of plasmid or DNA fragment is used with about 2 units of enzyme in about 20 µl of buffer solution. For the purpose of isolating DNA fragments for plasmid construction, typically 5 to 50 µg of DNA are digested with 20 to 250 units of enzyme in a larger volume.

One aspect of the present invention is an isolated, purified, or enriched nucleic acid comprising one of the sequences of SEQ ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79, the sequences complementary thereto, or a fragment comprising at least 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, 150,
200, 300, 400, or 500 consecutive bases of one of the Appropriate buffers and Substrate amounts for particular sequences of SEQ ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, restriction enzymes are specified by the manufacturer. Incu 19, 21, 23, 25, 27, 29, 31, 33,35, 37, 39, 41, 43, 45, 47,49, bation times of about 1 hour at 37 C. are ordinarily used, 51,53,55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79 but may vary in accordance with the Supplier's instructions. or the Sequences complementary thereto. The isolated, puri After digestion the gel electrophoresis may be performed to fied or enriched nucleic acids may comprise DNA, including isolate the desired fragment. cDNA, genomic DNA, and synthetic DNA. The DNA may “Oligonucleotide” refers to either a single stranded 15 be double-Stranded or Single-Stranded, and if Single Stranded poly deoxynucleotide or two complementary poly deoxy may be the coding Strand or non-coding (anti-sense) Strand. nucleotide Strands which may be chemically Synthesized. Alternatively, the isolated, purified or enriched nucleic acids Such Synthetic oligonucleotides have no 5" phosphate and may comprise RNA. thus will not ligate to another oligonucleotide without add AS discussed in more detail below, the isolated, purified, ing a phosphate with an ATP in the presence of a kinase. A or enriched nucleic acids of one of SEQ ID NOS: 1, 2, 3, 5, Synthetic oligonucleotide will ligate to a fragment that has 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,35, 37, 39, not been dephosphorylated. 
41, 43, 45, 47,49, 51,53,55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79 may be used to prepare one of the DETAILED DESCRIPTION OF THE polypeptides of SEQID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, PREFERRED EMBODIMENT 25 22, 24, 26, 28, 30, 32, 34, 36,38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80 or In order to begin the characterization of Cenarchaeum fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, Symbiosum, a large region of the C. Symbiosum genome was 50, 75, 100, or 150 consecutive amino acids of one of the Sequenced. In particular, two overlapping C. Symbiosum polypeptides of SEQID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, derived fosmid inserts of approximately 42 kb and 33 kb 22, 24, 26, 28, 30, 32, 34, 36,38, 40, 42, 44, 48, 50, 52, 54, were Sequenced. The Sequences of the two foSmid inserts 56, 58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80. revealed that there are at least two major variants or Strains Accordingly, another aspect of the present invention is an of C. Symbiosum that coexist inside the Sponge tissues of a isolated, purified, or enriched nucleic acid which encodes Single Sponge. This complexity of the C. Symbiosum popu one of the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, lation was not detected in initial Studies based Solely on 35 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, direct sequencing of PCR amplified SSU genes. (Preston, C. 50, 52, 54, 56,58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and M. et al. 1996. A psychrophilic crenarchaeon inhabits a 80 or fragments comprising at least 5, 10, 15, 20, 25, 30, 35, marine Sponge: Cenarchaeum Symbiosum gen. nov, sp. nov: 40, 50, 75, 100, or 150 consecutive amino acids of one of the Proc. Natl. Acad. Sci. 
USA 93, 6241-6246) This natural polypeptides of SEQID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, variation would also have been lost upon isolation of a pure 40 22, 24, 26, 28, 30, 32, 34, 36,38, 40, 42, 44, 48, 50, 52, 54, culture. 56, 58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80. The The Cenarchaeum Symbiosum Sequences obtained from coding Sequences of these nucleic acids may be identical to the two foSmids containing overlapping genomic inserts are one of the coding Sequences of one of the nucleic acids of provided in the accompanying Sequence listing and are SEQ ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, identified as SEO ID NO: 1 and SEO ID NO: 2. The two 45 27, 29, 31, 33,35, 37, 39, 41, 43, 45, 47,49, 51,53,55, 57, foSmid Sequences were not entirely identical in their over 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79 or a fragment lapping portions but instead contained differences. Upon thereof or may be different coding Sequences which encode further investigation, it was discovered that the two foSmid one of the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, Sequences were derived from two different, but closely 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, related, strains of Cenarchaeum Symbiosum (called variant A 50 50, 52, 54, 56,58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and and variant B) which may simultaneously inhabit a single 80 or fragments comprising at least 5, 10, 15, 20, 25, 30, 35, Sponge. 40, 50, 75, 100, or 150 consecutive amino acids of one of the Within the Sequences of the foSmid inserts, numerous polypeptides of SEQID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, open reading frames encoding polypeptides having homol 22, 24, 26, 28, 30, 32, 34, 36,38, 40, 42, 44, 48, 50, 52, 54, ogy to known proteins, as well as open reading frames 55 56,58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80 as a result encoding proteins which do not exhibit homology to known of the redundancy or degeneracy of the genetic code. 
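As an illustration of the degeneracy described above, the following minimal Python sketch checks whether two different coding sequences encode the same polypeptide. It assumes the standard genetic code; the function names and the two short example sequences are hypothetical and are not taken from the sequence listing.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (bases in the order T, C, A, G; the third position varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence (length a multiple of 3) to one-letter amino acids."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def encode_same_polypeptide(seq_a, seq_b):
    """Return True if two coding sequences differ only by silent (synonymous) changes."""
    return translate(seq_a) == translate(seq_b)


# Hypothetical example sequences, for illustration only.
original = "ATGGCTAAAGGT"
variant = "ATGGCCAAGGGA"  # third-position changes in three of the four codons

print(translate(original), translate(variant))
print(encode_same_polypeptide(original, variant))
```

Both example sequences translate to the same four residues even though they differ at three nucleotide positions, which is the sense in which different coding sequences can encode one polypeptide.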
The proteins, were identified. Homology was determined using genetic code is well known to those of skill in the art and can the program FASTA with the default parameters. The be obtained, for example, on page 214 of B. Lewin, Genes polypeptides encoded by these Sequences are identified in VI, Oxford University Press, 1997, the disclosure of which the accompanying Sequence listing as SEQ ID NOS: 6, 10, 60 is incorporated herein by reference. 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, The isolated, purified, or enriched nucleic acid which 76 and 80 (polypeptides with homology to known proteins) encodes one of the polypeptides of SEQID NOs: 4, 6, 8, 10, and SEQ ID NOs: 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 50, 52, 54, 56, 70, 74 and 78 (polypeptides without homol 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 7476, ogy to known proteins). In addition, Sequences encoding the 65 78, and 80 may include, but is not limited to: only the coding 16S rRNA, the 23S rRNA and a tyrosine tRNAs were also sequence of one of SEQ ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, identified. 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, US 6,632,937 B1 11 12 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and and nucleic acids are obtained from the Sample. The nucleic 79; the coding sequences of SEQID NOs: 1, 2, 3, 5, 7, 9, 11, acids are contacted with the probe under conditions which 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,35, 37, 39, 41, 43, permit the probe to Specifically hybridize to any comple 45, 47,49, 51,53,55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, mentary Sequences from Cenarchaeum Symbiosum which 77 and 79 and additional coding Sequences, Such as leader are present therein. 
Sequences or proprotein Sequences, or the coding Sequences Where necessary, conditions which permit the probe to of SEQ ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, Specifically hybridize to complementary Sequences from 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47,49, 51,53,55, Cenarchaeum Symbiosum may be determined by placing the 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79 and probe in contact with complementary Sequences from Cen non-coding Sequences, Such as introns or non-coding archaeum Symbiosum as well as control Sequences which are Sequences 5' and/or 3' of the coding Sequence. Thus, as used not from Cenarchaeum Symbiosum. In Some analyses, the herein, the term “polynucleotide encoding a polypeptide' control Sequences may be from organisms related to Cen encompasses a polynucleotide which includes only coding archaeum Symbiosum. Alternatively, the control Sequences Sequence for the polypeptide as well as a polynucleotide may be from organisms which are not related to Cenar which includes additional coding and/or non-coding chaeum Symbiosum. Hybridization conditions, Such as the 15 salt concentration of the hybridization buffer, the formamide Sequence. concentration of the hybridization buffer, or the hybridiza Alternatively, the nucleic acid sequences of SEQID NOs: tion temperature, may be varied to identify conditions which 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, allow the probe to hybridize Specifically to nucleic acids 35, 37, 39, 41, 43, 45, 47,49, 51,53,55, 57, 59, 61, 63, 65, from Cenarchaeum Symbiosum. 67, 69, 71, 73, 75, 77 and 79 may be mutagenized using If the Sample contains nucleic acids from Cenarchaeum conventional techniques, Such as Site directed mutagenesis, Symbiosum, Specific hybridization of the probe to the nucleic or other techniques familiar to those skilled in the art, to acids from Cenarchaeum Symbiosum is then detected. 
introduce silent changes into the polynucleotides of SEQ ID Hybridization may be detected by labeling the probe with a NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, detectable agent Such as a radioactive isotope, a fluorescent 31, 33, 35, 37, 39, 41, 43, 45, 47,49, 51,53,55, 57, 59, 61, 25 dye or an enzyme capable of catalyzing the formation of a 63, 65, 67, 69, 71, 73, 75, 77 and 79. As used herein, “silent detectable product. changes” include, for example, changes which do not alter Many methods for using the labeled probes to detect the the amino acid Sequence encoded by the polynucleotide. presence of nucleic acids from Cenarchaeum Symbiosum in Such changes may be desirable in order to increase the level a Sample are familiar to those skilled in the art. These of the polypeptide produced by host cells containing a vector include Southern Blots, Northern Blots, colony hybridiza encoding the polypeptide by introducing codons or codon tion procedures, and dot blots. Protocols for each of these pairs which occur frequently in the host organism. procedures are provided in Ausubel et al. Current Protocols The present invention also relates to polynucleotides in Molecular Biology, John Wiley 503 Sons, Inc. 1997 and Sambrook et al., Molecular Cloning: A Laboratory Manual which have nucleotide changes which result in amino acid 2d Ed., Cold Spring Harbor Laboratory Press, 1989, the Substitutions, additions, deletions, fusions and truncations in 35 entire disclosures of which are incorporated herein by ref the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, CCCC. 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, Alternatively, more than one probe (at least one of which 54, 56,58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80. 
Such is capable of Specifically hybridizing to any complementary nucleotide changes may be introduced using techniques Such Sequences from Cenarchaeum Symbiosum which are present as Site directed mutagenesis, random chemical mutagenesis, 40 in the nucleic acid Sample), may be used in an amplification exonuclease III deletion, and other recombinant DNA tech reaction to determine whether the nucleic acid Sample niques. Alternatively, Such nucleotide changes may be natu contains nucleic acids from Cenarchaeum Symbiosum. rally occurring allelic variants which are isolated by iden Preferably, the probes comprise oligonucleotides. In one tifying nucleic acids which specifically hybridize to probes embodiment, the amplification reaction may comprise a comprising at least 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, 45 PCR reaction. PCR protocols are described in Ausubel and 150, 200, 300, 400, or 500 consecutive bases of one of the Sambrook, Supra. Alternatively, the amplification may com sequences of SEQ ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, prise a ligase chain reaction, 3SR, or Strand displacement 19, 21, 23, 25, 27, 29, 31, 33,35, 37, 39, 41, 43, 45, 47,49, reaction. (See Barany, F., “The Ligase Chain Reaction in a 51,53,55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79 PCR World”, PCR Methods and Applications 1:5-16 or the Sequences complementary thereto to nucleic acids 50 (1991); E. Fahy et al., “Self-sustained Sequence Replication from Cenarchaeum Symbiosum or related organisms under (3SR): An Isothermal Transcription-based Amplification conditions of high, moderate, or low Strigency as provided System Alternative to PCR, PCR Methods and Applica herein. tions 1:25-33 (1991); and Walker G. T. 
et al., “Strand The isolated, purified, or enriched nucleic acids of SEQ Displacement Amplification-an Isothermal in vitro DNA ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 55 Amplification Technique, Nucleic Acid Research 29, 31, 33,35, 37, 39, 41, 43, 45, 47,49, 51,53,55, 57, 59, 20:1691–1696 (1992) the disclosures of which are incorpo 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79, the sequences rated herein by reference in their entireties). In Such complementary thereto, or a fragment comprising at least procedures, the nucleic acids in the Sample are contacted 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, 150, 200, 300, 400, with the probes, the amplification reaction is performed, and or 500 consecutive bases of one of the sequences of SEQ ID 60 any resulting amplification product is detected. The ampli NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, fication product may be detected by performing gel electro 31, 33, 35, 37, 39, 41, 43, 45, 47,49, 51,53,55, 57, 59, 61, phoresis on the reaction products and Staining the gel with 63, 65, 67, 69, 71, 73, 75, 77 and 79 or the sequences an interculator Such as ethidium bromide. Alternatively, one complementary thereto may also be used as probes to or more of the probes may be labeled with a radioactive identify the presence of Cenarchaeum Symbiosum in a 65 isotope and the presence of a radioactive amplification biological Sample. In Such procedures, a biological Sample product may be detected by autoradiography after gel elec potentially harboring Cenarchaeum Symbiosum is obtained trophoresis. 
US 6,632,937 B1 13 14 Probes derived from sequences near the ends of the complementary thereto, or a fragment comprising at least sequences of SEQ ID Nos: 1 and 2 may also be used in 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, 150, 200, 300, 400, chromosome walking procedures to identify clones contain or 500 consecutive bases of one of the sequences of SEQ ID ing genomic Sequences located adjacent to the Sequences of NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, SEO ID Nos: 1 and 2. Such methods allow the isolation of 31, 33,35, 37, 39, 41, 43, 45, 47,49, 51,53,55, 57, 59, 61, genes which encode additional proteins expressed in Cen 63, 65, 67, 69, 71, 73, 75, 77 and 79 or the sequences archaeum Symbiosum and facilitate the further physiological complementary thereto may be used as probes to identify characterization of the organism. and isolate related nucleic acids. In Some embodiments, the Another aspect of the present invention is a method for related nucleic acids may be cDNAS or genomic DNAS from determining whether a Sample contains variant A and/or organisms other than Cenarchaeum Symbiosum. For variant B of Cenarchaeum Symbiosum. In Such procedures, example, the other organisms may be organisms which are a Sample potentially harboring variant A and/or variant B related to Cenarchaeum Symbiosum. In Such procedures, a Cenarchaeum Symbiosum is obtained and nucleic acids are nucleic acid Sample containing nucleic acids from the obtained from the Sample. The nucleic acids are contacted related organism, Such as a cDNA or genomic DNA library with the probe under conditions which permit the probe to 15 from the related organism, is contacted with the probe under Specifically hybridize to any complementary Sequences from conditions which permit the probe to specifically hybridize variant A or variant B of Cenarchaeum Symbiosum which are to related Sequences. Hybridization of the probe to nucleic present therein. 
acids from the related organism is then detected using any of the methods described above.

Hybridization may be carried out under conditions of low stringency, moderate stringency or high stringency. As an example of nucleic acid hybridization, a polymer membrane containing immobilized denatured nucleic acids is first prehybridized for 30 minutes at 45° C. in a solution consisting of 0.9 M NaCl, 50 mM NaH2PO4, pH 7.0, 5.0 mM Na-EDTA, 0.5% SDS, 10xDenhardt's, and 0.5 mg/ml polyriboadenylic acid. Approximately 2x10^7 cpm (specific activity 4-9x10^8 cpm/µg) of 32P end-labeled oligonucleotide probe are then added to the solution. After 12-16 hours of incubation, the membrane is washed for 30 minutes at room temperature in 1xSET (150 mM NaCl, 20 mM Tris hydrochloride, pH 7.8, 1 mM Na-EDTA) containing 0.5% SDS, followed by a 30 minute wash in fresh 1xSET at Tm-10° C. for the oligonucleotide probe.

Preferably, the probe comprises a sequence having one or more nucleotides which differ between variant A and variant B. Conditions in which the probe specifically hybridizes to nucleic acids from one of the variants but not to nucleic acids from the other variant may be determined by contacting the probe with its corresponding sequence from variant A and variant B and varying the hybridization conditions, such as the salt concentration of the hybridization buffer, the formamide concentration of the buffer, or the hybridization temperature, to identify conditions in which the probe hybridizes to the corresponding sequence from one variant but not to the corresponding sequence from the other variant. Hybridization of the probe to nucleic acids from the Cenarchaeum symbiosum variant is then detected using any of the procedures described above.

The isolated, purified, or enriched nucleic acids of SEQ ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27,
29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79, the sequences complementary thereto, or a fragment comprising at least 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, 150, 200, 300, 400, or 500 consecutive bases of one of the sequences of SEQ ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79 or the sequences complementary thereto may be used as probes to identify and isolate cDNAs encoding the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, and 80. In such procedures, a cDNA library is constructed from a sample containing Cenarchaeum symbiosum. The cDNA library is then contacted with a probe comprising a coding sequence, or a fragment of

The membrane is then exposed to auto-radiographic film for detection of hybridization signals.

By varying the stringency of the hybridization conditions used to identify nucleic acids, such as cDNAs or genomic DNAs, which hybridize to the detectable probe, nucleic acids having different levels of homology to the probe can be identified and isolated. Stringency may be varied by conducting the hybridization at varying temperatures below the melting temperatures of the probes. The melting temperature of the probe may be calculated using the following formulas:

For probes between 14 and 70 nucleotides in length the melting temperature (Tm) is calculated using the formula: Tm=81.5+16.6(log [Na+])+0.41(fraction G+C)-(600/N) where N is the length of the probe.
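The melting temperature formula above, together with the formamide-adjusted form given in the text for hybridization in formamide-containing solutions, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the logarithm is taken as base 10, [Na+] is in mol/L, the formamide term is read as 0.63 multiplied by the percent formamide, and the "fraction G+C" term is entered as a percentage (0 to 100), which is how this formula is usually applied. The function names and the example probe are hypothetical.

```python
import math


def gc_percent(seq):
    """Percent G+C of a nucleotide sequence (0 to 100)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(seq.count(base) for base in "GC") / len(seq)


def melting_temperature(na_molar, gc_pct, length, formamide_pct=0.0):
    """Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%G+C) - 0.63*(%formamide) - 600/N.

    Assumptions: base-10 logarithm, [Na+] in mol/L, G+C and formamide as
    percentages. With formamide_pct=0 this reduces to the first formula,
    which the text states for probes of 14 to 70 nucleotides.
    """
    if na_molar <= 0 or length <= 0:
        raise ValueError("Na+ concentration and probe length must be positive")
    return (81.5
            + 16.6 * math.log10(na_molar)
            + 0.41 * gc_pct
            - 0.63 * formamide_pct
            - 600.0 / length)


# Hypothetical 20-mer probe with 50% G+C, for illustration only.
probe = "ATGCATGCATGCATGCATGC"
tm_no_formamide = melting_temperature(1.0, gc_percent(probe), len(probe))
tm_formamide = melting_temperature(1.0, gc_percent(probe), len(probe), formamide_pct=50.0)
print(round(tm_no_formamide, 1), round(tm_formamide, 1))
```

For this example (1 M Na+, 50% G+C, N = 20) the terms are 81.5 + 0 + 20.5 - 30, so Tm is about 72° C.; 50% formamide lowers it by 31.5° C.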
If the hybridization is carried out in a solution containing formamide, the melting temperature may be calculated using the equation Tm=81.5+16.6(log [Na+])+0.41(fraction G+C)-(0.63% formamide)-(600/N) where N is the length of the probe.

Prehybridization may be carried out in 6xSSC, 5xDenhardt's reagent, 0.5% SDS, 100 µg denatured fragmented salmon sperm DNA or 6xSSC, 5xDenhardt's reagent, 0.5% SDS, 100 µg denatured fragmented salmon sperm DNA, 50% formamide. The formulas for SSC and Denhardt's solutions are listed in Sambrook et al., supra.

Hybridization is conducted by adding the detectable probe to the prehybridization solutions listed above. Where the probe comprises double stranded DNA, it is denatured before addition to the hybridization solution.

a coding sequence, encoding one of the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, and 80 or a fragment thereof under conditions which permit the probe to specifically hybridize to sequences complementary thereto. cDNAs which hybridize to the probe are then detected and isolated. Procedures for preparing and identifying cDNAs are disclosed in Ausubel et al. Current Protocols in Molecular Biology, John Wiley & Sons, Inc. 1997 and Sambrook et al., Molecular Cloning: A Laboratory Manual 2d Ed., Cold Spring Harbor Laboratory Press, 1989, the disclosures of which are incorporated herein by reference.

The isolated, purified, or enriched nucleic acids of SEQ ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27,
29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79, the sequences

ing sequence which is a naturally occurring allelic variant of one of the coding sequences described herein. Such allelic variants may have a substitution, deletion or addition of one or more nucleotides when compared to the nucleic acids of SEQ ID NOs: 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and 79 or the sequences complementary thereto.

Additionally, the above procedures may be used to isolate nucleic acids which encode polypeptides having at least 99%, 95%, at least 90%, at least 85%, at least 80%, or at least 70% homology to a polypeptide having the sequence of one of SEQ ID NOS: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, and 80 or fragments

The filter is contacted with the hybridization solution for a sufficient period of time to allow the probe to hybridize to cDNAs or genomic DNAs containing sequences complementary thereto or homologous thereto. For probes over 200 nucleotides in length, the hybridization may be carried out at 15-25° C. below the Tm. For shorter probes, such as oligonucleotide probes, the hybridization may be conducted at 5-10° C. below the Tm. Preferably, for hybridizations in 6xSSC, the hybridization is conducted at approximately 68° C. Preferably, for hybridizations in 50% formamide containing solutions, the hybridization is conducted at approximately 42° C.

All of the foregoing hybridizations would be considered to be under conditions of high stringency.

Following hybridization, the filter is washed in 2xSSC, 0.1% SDS at room temperature for 15 minutes. The filter is then washed with 0.1xSSC, 0.5% SDS at room temperature for 30 minutes to 1 hour.
Thereafter, the solution is washed at the hybridization temperature in 0.1xSSC, 0.5% SDS. A final wash is conducted in 0.1xSSC at room temperature.

Nucleic acids which have hybridized to the probe are identified by autoradiography or other conventional techniques.

The above procedure may be modified to identify nucleic acids having decreasing levels of homology to the probe sequence. For example, to obtain nucleic acids of decreasing homology to the detectable probe, less stringent conditions may be used. For example, the hybridization temperature may be decreased in increments of 5° C. from 68° C. to 42° C. in a hybridization buffer having a Na+ concentration of approximately 1 M. Following hybridization, the filter may be washed with 2xSSC, 0.5% SDS at the temperature of hybridization. These conditions are considered to be “moderate” conditions above 50° C. and “low” conditions below 50° C. A specific example of “moderate” hybridization

comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids thereof as determined using the FASTA version 3.0t78 algorithm with the default parameters.

Another aspect of the present invention is an isolated or purified polypeptide comprising the sequence of one of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, and 80 or fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids thereof. As discussed above, such polypeptides may be obtained by inserting a nucleic acid encoding the polypeptide into a vector such that the coding sequence is operably linked to a sequence capable of driving the expression of the encoded polypeptide in a suitable host cell. For example, the expression vector may comprise a promoter, a ribosome binding site for translation initiation and a transcription terminator.
The vector may also include appropriate sequences for amplifying expression.

Promoters suitable for expressing the polypeptide or fragment thereof in bacteria include the E. coli lac or trp promoters, the lacI promoter, the lacZ promoter, the T3 promoter, the trp promoter, the gpt promoter, the lambda PR promoter, the lambda PL promoter, the trp promoter, promoters from operons encoding glycolytic enzymes such as 3-phosphoglycerate kinase (PGK), and the acid phosphatase promoter. Fungal promoters include the α factor promoter. Eukaryotic promoters include the CMV immediate early promoter, the HSV thymidine kinase promoter, heat shock promoters, the early and late SV40 promoter, LTRs from retroviruses, and the mouse metallothionein-I promoter. Other promoters known to control expression of genes in prokaryotic or eukaryotic cells or their viruses may also be used.

Mammalian expression vectors may also comprise an

conditions is when the above hybridization is conducted at 55° C. A specific example of “low stringency” hybridization conditions is when the above hybridization is conducted at 45° C.

Alternatively, the hybridization may be carried out in buffers, such as 6xSSC, containing formamide at a temperature of 42° C. In this case, the concentration of formamide in the hybridization buffer may be reduced in 5% increments from 50% to 0% to identify clones having decreasing levels of homology to the probe. Following hybridization, the filter may be washed with 6xSSC, 0.5% SDS at 50° C. These conditions are considered to be “moderate” conditions above 25% formamide and “low” conditions below 25% formamide. A specific example of “moderate” hybridization conditions is when the above hybridization is conducted at 30% formamide. A specific example of “low stringency” hybridization conditions is when the above hybridization is conducted at 10% formamide.
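The stringency categories described above can be summarized in a short Python sketch. It is illustrative only and rests on stated assumptions: the temperature series applies to a buffer of approximately 1 M Na+ without formamide, the formamide series applies at 42° C. in 6xSSC, the starting points of each series (68° C. and 50% formamide) are labeled "high" because the text treats those conditions as high stringency, and the exact boundary values (50° C. and 25% formamide) are reported as unassigned because the text does not place them in either category. The function names are hypothetical.

```python
def stringency_by_temperature(temp_c):
    """Classify a hybridization temperature (degrees C) in ~1 M Na+ without formamide."""
    if temp_c >= 68:
        return "high"        # assumption: the 68 C starting point counts as high stringency
    if temp_c > 50:
        return "moderate"    # "moderate" conditions above 50 C
    if temp_c < 50:
        return "low"         # "low" conditions below 50 C
    return "unassigned"      # exactly 50 C is not categorized in the text


def stringency_by_formamide(formamide_pct):
    """Classify a formamide concentration (percent) for hybridization at 42 C in 6xSSC."""
    if formamide_pct >= 50:
        return "high"        # assumption: the 50% starting point counts as high stringency
    if formamide_pct > 25:
        return "moderate"    # "moderate" conditions above 25% formamide
    if formamide_pct < 25:
        return "low"         # "low" conditions below 25% formamide
    return "unassigned"      # exactly 25% is not categorized in the text


# Formamide series from the text: 50% down to 0% in 5% steps.
formamide_series = list(range(50, -1, -5))

# The specific examples given in the text.
print(stringency_by_temperature(55), stringency_by_temperature(45))
print(stringency_by_formamide(30), stringency_by_formamide(10))
```

The four printed values correspond to the specific examples in the text: 55° C. and 30% formamide fall in the moderate range, 45° C. and 10% formamide in the low range.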
origin of replication, any necessary ribosome binding sites, Nucleic acids which have hybridized to the probe are a polyadenylation Site, Splice donor and acceptor Sites, identified by autoradiography. transcriptional termination Sequences, and 5' flanking non For example, the preceding methods may be used to 55 transcribed Sequences. In Some embodiments, DNA isolate nucleic acids having a Sequence with at least 97%, at sequences derived from the SV40 splice and polyadenyla least 95%, at least 90%, at least 85%, at least 80%, or at least tion Sites may be used to provide the required nontranscribed 70% homology to a nucleic acid Sequence Selected from the genetic elements. group consisting of one of the sequences of SEQ ID NOS. Vectors for expressing the polypeptide or fragment 1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 60 thereof in eukaryotic cells may also contain enhancers to 35, 37, 39, 41, 43, 45, 47,49, 51,53,55, 57, 59, 61, 63, 65, increase expression levels. Enhancers are cis-acting ele 67, 69, 71, 73, 75, 77 and 79, fragments comprising at least ments of DNA, usually from about 10 to about 300 bp in 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, 150, 200, 300, 400, length that act on a promoter to increase its transcription. or 500 consecutive bases thereof, and the Sequences comple Examples include the SV40 enhancer on the late side of the mentary thereto. Homology may be measured using 65 replication origin bp 100 to 270, the cytomegalovirus early BLASTN version 2.0 with the default parameters. For promoter enhancer, the polyoma enhancer on the late side of example, the homologous polynucleotides may have a cod the replication origin, and the adenovirus enhancers. US 6,632,937 B1 17 18 In addition, the expression vectors preferably contain one representative examples of appropriate hosts, there may be or more Selectable marker genes to permit Selection of host mentioned: bacterial cells, Such as E. 
coli, Streptomyces, cells containing the vector. Such Selectable markers include Bacillus Subtilis, Salmonella typhimurium and various spe genes encoding dihydrofolate reductase or genes conferring cies within the genera Pseudomonas, Streptomyces, and neomycin resistance for eukaryotic cell culture, genes con 5 Staphylococcus, fungal cells, Such as yeast, insect cells Such ferring tetracycline or amplicillin resistance in E. coli, and as Drosophila S2 and Spodoptera Sf9, animal cells such as the S. cerevisiae TRP1 gene. CHO, COS or Bowes melanoma, and adenoviruses. The In Some embodiments, the nucleic acid encoding one of Selection of an appropriate host is within the abilities of the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, those skilled in the art. 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, The vector may be introduced into the host cells using any 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80 or of a variety of techniques, including transformation, fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, transfection, transduction, Viral infection, gene guns, or 50, 75, 100, or 150 consecutive amino acids thereof is Ti-mediated gene transfer. Particular methods include cal assembled in appropriate phase with a leader Sequence cium phosphate transfection, DEAE-Dextran mediated capable of directing Secretion of the translated polypeptide 15 transfection, lipofection, or electroporation (Davis, L., or fragment thereof. Optionally, the nucleic acid can encode Dibner, M., Battey, I., Basic Methods in Molecular Biology, a fusion polypeptide in which one of the polypeptides of (1986)). 
SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, and 80 or fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids thereof is fused to heterologous peptides or polypeptides, such as N-terminal identification peptides which impart desired characteristics, such as increased stability or simplified purification.

The appropriate DNA sequence may be inserted into the vector by a variety of procedures. In general, the DNA sequence is ligated to the desired position in the vector following digestion of the insert and the vector with appropriate restriction endonucleases. Alternatively, blunt ends in both the insert and the vector may be ligated. A variety of cloning techniques are disclosed in Ausubel et al. Current Protocols in Molecular Biology, John Wiley & Sons, Inc. 1997 and Sambrook et al., Molecular Cloning: A Laboratory Manual 2d Ed., Cold Spring Harbor Laboratory Press, 1989, the entire disclosures of which are incorporated herein by reference. Such procedures and others are deemed to be within the scope of those skilled in the art.

The vector may be, for example, in the form of a plasmid, a viral particle, or a phage. Other vectors include chromosomal, nonchromosomal and synthetic DNA sequences, derivatives of SV40; bacterial plasmids, phage DNA, baculovirus, yeast plasmids, vectors derived from combinations of plasmids and phage DNA, viral DNA such as vaccinia, adenovirus, fowlpox virus, and pseudorabies. A variety of cloning and expression vectors for use with prokaryotic and eukaryotic hosts are described by Sambrook, et al., Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor, N.Y., (1989), the disclosure of which is hereby incorporated by reference.

Particular bacterial vectors which may be used include the commercially available plasmids comprising genetic elements of the well known cloning vector pBR322 (ATCC 37017), pKK223-3 (Pharmacia Fine Chemicals, Uppsala, Sweden), GEM1 (Promega Biotec, Madison, Wis., USA) pQE70, pQE60, pQE-9 (Qiagen), pD10, psiX174 pBluescript II KS, pNH8A, pNH16a, pNH18A, pNH46A (Stratagene), ptrc99a, pKK223-3, pKK233-3, pDR540, pRIT5 (Pharmacia), pKK232-8 and pCM7. Particular eukaryotic vectors include pSV2CAT, pOG44, pXT1, pSG (Stratagene) pSVK3, pBPV, pMSG, and pSVL (Pharmacia). However, any other vector may be used as long as it is replicable and viable in the host cell.

The host cell may be any of the host cells familiar to those skilled in the art, including prokaryotic cells, eukaryotic cells, mammalian cells, insect cells, or plant cells. As

Where appropriate, the engineered host cells can be cultured in conventional nutrient media modified as appropriate for activating promoters, selecting transformants or amplifying the genes of the present invention. Following transformation of a suitable host strain and growth of the host strain to an appropriate cell density, the selected promoter may be induced by appropriate means (e.g., temperature shift or chemical induction) and the cells may be cultured for an additional period to allow them to produce the desired polypeptide or fragment thereof.

Cells are typically harvested by centrifugation, disrupted by physical or chemical means, and the resulting crude extract is retained for further purification. Microbial cells employed for expression of proteins can be disrupted by any convenient method, including freeze-thaw cycling, sonication, mechanical disruption, or use of cell lysing agents. Such methods are well known to those skilled in the art. The expressed polypeptide or fragment thereof can be recovered and purified from recombinant cell cultures by methods including ammonium sulfate or ethanol precipitation, acid extraction, anion or cation exchange chromatography, phosphocellulose chromatography, hydrophobic interaction chromatography, affinity chromatography, hydroxylapatite chromatography and lectin chromatography. Protein refolding steps can be used, as necessary, in completing configuration of the polypeptide. If desired, high performance liquid chromatography (HPLC) can be employed for final purification steps.

Various mammalian cell culture systems can also be employed to express recombinant protein. Examples of mammalian expression systems include the COS-7 lines of monkey kidney fibroblasts (described by Gluzman, Cell, 23:175 (1981)), and other cell lines capable of expressing proteins from a compatible vector, such as the C127, 3T3, CHO, HeLa and BHK cell lines.

The constructs in host cells can be used in a conventional manner to produce the gene product encoded by the recombinant sequence. Depending upon the host employed in a recombinant production procedure, the polypeptides produced by host cells containing the vector may be glycosylated or may be non-glycosylated. Polypeptides of the invention may or may not also include an initial methionine amino acid residue.

Alternatively, the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, and 80 or fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids thereof can be synthetically produced by conventional peptide synthesizers. In other embodiments, fragments or portions of the polypeptides may be employed for producing the corresponding full-length polypeptide by peptide synthesis; therefore, the fragments may be employed as intermediates for producing the full-length polypeptides.

Cell-free translation systems can also be employed to produce one of the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, and 80 or fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids thereof using mRNAs transcribed from a DNA construct comprising a promoter operably linked to a nucleic acid encoding the polypeptide or fragment thereof. In some embodiments, the DNA construct may be linearized prior to conducting an in vitro transcription reaction. The transcribed mRNA is then incubated with an appropriate cell-free translation extract, such as a rabbit reticulocyte extract, to produce the desired polypeptide or fragment thereof.

The present invention also relates to variants of the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, and 80 or fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids thereof. The term “variant” includes derivatives or analogs of these polypeptides. In particular, the variants may differ in amino acid sequence from the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, and 80 by one or more substitutions, additions, deletions, fusions and truncations, which may be present in any combination.

The variants may be naturally occurring or created in vitro. In particular, such variants may be created using genetic engineering techniques such as site directed

length of the PCR product. For example, the reaction may be performed using 20 fmoles of nucleic acid to be mutagenized, 30 pmole of each PCR primer, a reaction buffer comprising 50 mM KCl, 10 mM Tris HCl (pH 8.3) and 0.01% gelatin, 7 mM MgCl2, 0.5 mM MnCl2, 5 units of Taq polymerase, 0.2 mM dGTP, 0.2 mM dATP, 1 mM dCTP, and 1 mM dTTP. PCR may be performed for 30 cycles of 94° C. for 1 min, 45° C. for 1 min, and 72° C. for 1 min. However, it will be appreciated that these parameters may be varied as appropriate. The mutagenized nucleic acids are cloned into an appropriate vector and the activities of the polypeptides encoded by the mutagenized nucleic acids are evaluated.

Variants may also be created using oligonucleotide directed mutagenesis to generate site-specific mutations in any cloned DNA segment of interest. Oligonucleotide mutagenesis is described in Reidhaar-Olson, J. F. & Sauer, R. T., et al., Science, 241:53-57 (1988), the disclosure of which is incorporated herein by reference in its entirety. Briefly, in such procedures a plurality of double stranded oligonucleotides bearing one or more mutations to be introduced into the cloned DNA are synthesized and inserted into the cloned DNA to be mutagenized. Clones containing the mutagenized DNA are recovered and the activities of the polypeptides they encode are assessed.

Another method for generating variants is assembly PCR. Assembly PCR involves the assembly of a PCR product from a mixture of small DNA fragments. A large number of different PCR reactions occur in parallel in the same vial, with the products of one reaction priming the products of another reaction. Assembly PCR is described in U.S. patent application Ser. No. 08/677,112, filed Jul. 9, 1997 and U.S. patent application Ser. No. 08/942,504, filed Oct. 31, 1997, the disclosures of which are incorporated herein by reference in their entireties.

Still another method of generating variants is sexual PCR mutagenesis.
mutagenesis, random chemical mutagenesis, Exonuclease III deletion procedures, and standard cloning techniques. Alternatively, such variants, fragments, analogs, or derivatives may be created using chemical synthesis or modification procedures.

Other methods of making variants are also familiar to those skilled in the art. These include procedures in which nucleic acid sequences obtained from natural isolates are modified to generate nucleic acids which encode polypeptides having characteristics which enhance their value in industrial or laboratory applications. In such procedures, a large number of variant sequences having one or more nucleotide differences with respect to the sequence obtained from the natural isolate are generated and characterized. Preferably, these nucleotide differences result in amino acid changes with respect to the polypeptides encoded by the nucleic acids from the natural isolates.

For example, variants may be created using error prone PCR. In error prone PCR, PCR is performed under conditions where the copying fidelity of the DNA polymerase is low, such that a high rate of point mutations is obtained along the entire length of the PCR product. Error prone PCR is described in Leung, D. W., et al., Technique, 1:11-15 (1989) and Caldwell, R. C. & Joyce G. F., PCR Methods Applic., 2:28-33 (1992), the disclosure of which is incorporated herein by reference in its entirety. Briefly, in such procedures, nucleic acids to be mutagenized are mixed with PCR primers, reaction buffer, MgCl2, MnCl2, Taq polymerase and an appropriate concentration of dNTPs for achieving a high rate of point mutation along the entire

In sexual PCR mutagenesis, forced homologous recombination occurs between DNA molecules of different but highly related DNA sequence in vitro, as a result of random fragmentation of the DNA molecule based on sequence homology, followed by fixation of the crossover by primer extension in a PCR reaction. Sexual PCR mutagenesis is described in Stemmer, W. P., PNAS, USA, 91:10747–10751 (1994), the disclosure of which is incorporated herein by reference. Briefly, in such procedures a plurality of nucleic acids to be recombined are digested with DNase to generate fragments having an average size of 50-200 nucleotides. Fragments of the desired average size are purified and resuspended in a PCR mixture. PCR is conducted under conditions which facilitate recombination between the nucleic acid fragments. For example, PCR may be performed by resuspending the purified fragments at a concentration of 10-30 ng/µl in a solution of 0.2 mM of each dNTP, 2.2 mM MgCl2, 50 mM KCl, 10 mM Tris HCl, pH 9.0, and 0.1% Triton X-100. 2.5 units of Taq polymerase per 100 µl of reaction mixture is added and PCR is performed using the following regime: 94° C. for 60 seconds, 94° C. for 30 seconds, 50–55° C. for 30 seconds, 72° C. for 30 seconds (30–45 times) and 72° C. for 5 minutes. However, it will be appreciated that these parameters may be varied as appropriate. In some embodiments, oligonucleotides may be included in the PCR reactions. In other embodiments, the Klenow fragment of DNA polymerase I may be used in a first set of PCR reactions and Taq polymerase may be used in a subsequent set of PCR reactions. Recombinant sequences are isolated and the activities of the polypeptides they encode are assessed.

like characteristics. Typically seen as conservative substitutions are the following replacements: replacements of an aliphatic amino acid such as Ala, Val, Leu and Ile with another aliphatic amino acid; replacement of a Ser with a Thr or vice versa; replacement of an acidic residue such as Asp and Glu with another acidic residue; replacement of a residue bearing an amide group, such as Asn and Gln, with another residue bearing an amide group; exchange of a basic residue such as Lys and Arg with another basic residue; and replacement of an aromatic residue such as Phe, Tyr with another aromatic residue.

Other variants are those in which one or more of the amino acid residues of the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, and 80 includes a substituent group.

Still other variants are those in which the polypeptide is associated with another compound, such as a compound to increase the half-life of the polypeptide (for example, polyethylene glycol).

Variants may also be created by in vivo mutagenesis. In some embodiments, random mutations in a sequence of interest are generated by propagating the sequence of interest in a bacterial strain, such as an E. coli strain, which carries mutations in one or more of the DNA repair pathways. Such “mutator” strains have a higher random mutation rate than that of a wild-type parent. Propagating the DNA in one of these strains will eventually generate random mutations within the DNA. Mutator strains suitable for use for in vivo mutagenesis are described in PCT Published Application WO 91/16427, the disclosure of which is incorporated herein by reference in its entirety.

Variants may also be generated using cassette mutagenesis. In cassette mutagenesis a small region of a double stranded DNA molecule is replaced with a synthetic oligonucleotide “cassette” that differs from the native sequence. The oligonucleotide often contains completely and/or partially randomized native sequence.

Recursive ensemble mutagenesis may also be used to generate variants. Recursive ensemble mutagenesis is an algorithm for protein engineering (protein mutagenesis)
developed to produce diverse populations of phenotypically Additional variants are those in which additional amino related mutants whose members differ in amino acid acids are fused to the polypeptide, Such as a leader Sequence, Sequence. This method uses a feedback mechanism to con a Secretory Sequence, a proprotein Sequence or a sequence trol Successive rounds of combinatorial cassette mutagen which facilitates purification, enrichment, or Stabilization of esis. Recursive ensemble mutagenesis is described in Arkin, 25 the polypeptide. A. P. and Youvan, D. C., PNAS, USA, 89:7811–7815 In Some embodiments, the fragments, derivatives and (1992), the disclosure of which is incorporated herein by analogs retain the same biological function or activity as the reference in its entirety. polypeptides of SEQID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, In Some embodiments, variants are created using expo 22, 24, 26, 28, 30, 32, 34, 36,38, 40, 42, 44, 48, 50, 52, 54, nential ensemble mutagenesis. Exponential ensemble 56, 58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80. In other mutagenesis is a process for generating combinatorial librar embodiments, the fragment, derivative, or analog includes a ies with a high percentage of unique and functional mutants, proprotein, Such that the fragment, derivative, or analog can wherein Small groups of residues are randomized in parallel be activated by cleavage of the proprotein portion to produce to identify, at each altered position, amino acids which lead 35 an active polypeptide. to functional proteins. Exponential ensemble mutagenesis is Another aspect of the present invention are polypeptides described in Delegrave, S. and Youvan, D.C., Biotechnol or fragments thereof which have at least 70%, at least 80%, ogy Research, 11:1548–1552 (1993), the disclosure of at least 85%, at least 90%, at least 95%, or more than 95% which incorporated herein by reference in its entirety. 
Ran homology to one of the polypeptides of SEQ ID NOs: 4, 6, dom and Site-directed mutagenesis are described in Arnold, 40 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36,38, F. H., Current Opinion in Biotechnology, 4:450–455 (1993), 40, 42, 44, 48, 50, 52, 54, 56,58, 60, 62, 64, 66, 68, 70, 72, the disclosure of which is incorporated herein by reference 7476, 78, and 80 or a fragment comprising at least 5, 10, 15, in its entirety. 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino In Some embodiments, the variants are created using acids thereof. Homology may be determined using a Shuffling procedures wherein portions of a plurality of 45 program, such as FASTA version 3.0t78 with the default nucleic acids which encode distinct polypeptides are fused parameters, which aligns the polypeptides or fragments together to create chimeric nucleic acid Sequences which being compared and determines the extent of amino acid encode chimeric polypeptides. Shuffling procedures are identity or similarity between them. It will be appreciated described in U.S. patent application Ser. No. 08/677,112, that amino acid “homology includes conservative amino filed Jul. 9, 1996, U.S. patent application Ser. No. 08/942, 50 acid Substitutions Such as those described above. 504, filed Oct. 31, 1997, U.S. Pat. No. 5,939,250, issued The polypeptides or fragments having homology to one of Aug. 17, 1999, and U.S. patent application Ser. No. 09/375, the polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 605, filed Aug. 17, 1999, the disclosures of which are 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, incorporated herein by reference in their entireties. 
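The statement that amino acid homology includes conservative substitutions can be illustrated with a minimal Python sketch. It encodes only the conservative replacement groups listed in this section (aliphatic, Ser/Thr, acidic, amide, basic, aromatic) and scores two sequences that are assumed to be already aligned and of equal length. It is not FASTA and does not reproduce FASTA scoring; the function names and the example peptides are hypothetical.

```python
# Conservative replacement groups as listed in the text (one-letter codes).
CONSERVATIVE_GROUPS = (
    frozenset("AVLI"),  # aliphatic: Ala, Val, Leu, Ile
    frozenset("ST"),    # Ser <-> Thr
    frozenset("DE"),    # acidic: Asp, Glu
    frozenset("NQ"),    # amide-bearing: Asn, Gln
    frozenset("KR"),    # basic: Lys, Arg
    frozenset("FY"),    # aromatic: Phe, Tyr
)


def is_conservative(a, b):
    """True if replacing residue a with residue b stays within one listed group."""
    a, b = a.upper(), b.upper()
    if a == b:
        return False  # identical residues are not a substitution
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def percent_identity_and_similarity(seq1, seq2):
    """Score two pre-aligned, equal-length sequences.

    ASSUMPTION: the alignment has already been done elsewhere; gaps are
    not handled in this sketch.
    """
    seq1, seq2 = seq1.upper(), seq2.upper()
    if not seq1 or len(seq1) != len(seq2):
        raise ValueError("sequences must be non-empty and of equal length")
    identical = sum(x == y for x, y in zip(seq1, seq2))
    conservative = sum(is_conservative(x, y) for x, y in zip(seq1, seq2))
    n = len(seq1)
    return 100.0 * identical / n, 100.0 * (identical + conservative) / n


if __name__ == "__main__":
    # Hypothetical four-residue peptides used only for illustration.
    print(percent_identity_and_similarity("ALKD", "ILRE"))
```

Counting conservative replacements as matches raises the similarity figure above the strict identity figure, which is the distinction the text draws between identity and similarity.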
54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80 or The variants of the polypeptides of SEQ ID NOs: 4, 6, 8, 55 a fragment comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36,38, 40, 50, 75, 100, or 150 consecutive amino acids thereof may be 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74 obtained by isolating the nucleic acids encoding them using 76, 78, and 80 may be (i) variants in which one or more of the techniques described above. the amino acid residues of the polypeptides of SEQID NOS: Alternatively, the homologous polypeptides or fragments 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 60 may be obtained through biochemical enrichment or puri 38, 40, 42, 44, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, fication procedures. The Sequence of potentially homolo 72, 7476, 78, and 80 are Substituted with a conserved or gous polypeptides or fragments may be determined by non-conserved amino acid residue (preferably a conserved proteolytic digestion, gel electrophoresis and/or microse amino acid residue) and Such Substituted amino acid residue quencing. The Sequence of the prospective homologous may or may not be one encoded by the genetic code. 65 polypeptide or fragment can be compared to one of the Conservative Substitutions are those that Substitute a polypeptides of SEQID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, given amino acid in a polypeptide by another amino acid of 22, 24, 26, 28, 30, 32, 34, 36,38, 40, 42, 44, 48, 50, 52, 54, US 6,632,937 B1 23 24 56, 58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80 or a homology to RNA helicase, may be used to inhibit the fragment comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, growth of Cenarchaeum Symbiosis. Such agents may be 75, 100, or 150 consecutive amino acids thereof using a identified as described above. 
program such as FASTA version 3.0t78 with the default The polypeptides of SEQ ID NOs: 30 and 62, which have parameterS. homology to DNA polymerase I, or fragments thereof, may The polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, be used to insert a detectable label into a nucleic acid or to 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, generate blunt ends on nucleic acids which have been 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80 digested with a restriction endonuclease. or fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, The polypeptides of SEQ ID NOs: 42 and 76, which have 50, 75, 100, or 150 consecutive amino acids thereof inven homology to Site specific DNA methyltranseferases, or tion may be used in a variety of applications. For example, fragments thereof, may be used in procedures in which it is the polypeptides or fragments thereof may be used to desirable to protect nucleic acid Sequences from digestion catalyze biochemical reactions. In particular, the polypep with restriction endonucleases. For example, a nucleic acid tides of SEQ ID NOs: 14 and 46, which have homology to Sequence having one or more restriction Sites therein may be glutamate Semialdehyde amino transferase, or fragments 15 treated with the polypeptides of SEQID NOs: 42 or 76 prior thereof, may be used to catalyze the Synthesis of to the addition of linkers to the nucleic acid. Thereafter, the 5-aminolevulinate from S-4-amino-5-oxopentanoate. The linkers may be digested with the restriction enzyme, while polypeptides of SEQ ID NOS: 26 and 58, which have the Sites in the remainder of the nucleic acid are protected homology to triose phosphate isomerase, or fragments from digestion. thereof, may be used to catalyze the Synthesis of glycerone The polypeptides of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, phosphate from D-glyceraldehyde 3-phosphate. 
The 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, polypeptides of SEQ ID NOs: 32 and 64, which have 52, 54, 56,58, 60, 62, 64, 66, 68, 70, 72, 7476, 78, and 80, homology to dCMP deaminase, or fragments thereof, may or fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, be used to catalyze the reaction of deoxyctidine and water to 50, 75, 100, or 150 consecutive amino acids thereof, may produce deoxyuridine and ammonia. The polypeptides of 25 also be used to generate antibodies which bind Specifically SEQ ID NOS:38 and 72, which have homology to the MenA to the polypeptides or fragments. The resulting antibodies protein, or fragments thereof, may be used to catalyze the may be used to determine whether a biological Sample synthesis of menaquinone. The polypeptide of SEQ ID NO: contains Cenarchaeum Symbiosum. In Such procedures, a 80, which has homology to glucose-1-dehydrogenase, may biological Sample is contacted with an antibody capable of be used to catalyze the Synthesis of D-glucono-1,5-lacctone specifically binding to one of the polypeptides of SEQ ID from D-glucose. NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, The polypeptide of SEQID NO: 10, which has homology 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56,58, 60, 62, 64, 66, to lysyl tRNA synthetase, or fragments thereof, may be used 68, 70, 72, 7476, 78, and 80 or fragments comprising at least to identify compounds capable of Specifically inhibiting the 5, 10, 15, 20, 25, 30, 35,40.50, 75, 100, or 150 consecutive growth of Cenarchaeum Symbiosis, since tRNA synthetases 35 amino acids thereof. The ability of the biological sample to are attractive targets for agents which inhibit growth. bind to the antibody is then determined. 
For example, Agents which specifically inhibit the activity of the lysyl binding may be determined by labeling the antibody with a tRNA synthetase from Cenarchaeum Symbiosum may be detectable label Such as a fluorescent agent, an enzymatic identified using a variety of methods known to those skilled label, or a radioisotope. Alternatively, binding of the anti in the art. For example, a plurality of agents may be 40 body to the Sample may be detected using a Secondary generated using combinatorial chemistry or recombinant antibody having Such a detectable label thereon. A variety of DNA libraries encoding a large number of short peptides. assay protocols which may be used to detect the presence of The lysyl tRNA synthetases from Cenarchaeum Symbiosum Cenarchaeum Symbiosum in a Sample are familiar to those and control organisms are contacted with the agents and skilled in the art. Particular assays include ELISA assays, those agents which bind to the lysyl tRNA synthetase from 45 Sandwich assays, radioimmunoassays, and Western Blots. Cenarchaeum Symbiosum but not to the enzyme from the Polyclonal antibodies generated against the polypeptides control organisms are identified. Cenarchaeum Symbiosum of SEQ ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, is then contacted with the identified agents to determine 28, 30, 32, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56,58, 60, which agents inhibit the organism's growth. 50 62, 64, 66, 68, 70, 72, 7476, 78, and 80 or fragments The polypeptides of SEQ ID NOS: 28 and 60, which have comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, homology to the TATA box binding protein, or fragments or 150 consecutive amino acids thereof can be obtained by thereof, may be used to identify promoters in nucleic acids direct injection of the polypeptides into an animal or by from Cenarchaeum Symbiosis. 
In Such procedures, the administering the polypeptides to an animal, preferably a polypeptide or fragment thereof is allowed to contact the 55 nonhuman. The antibody so obtained will then bind the nucleic acid and binding of the polypeptide or fragment polypeptide itself. In this manner, even a Sequence encoding thereof to the nucleic acid is detected. Binding may be only a fragment of the polypeptide can be used to generate detected by performing a gel Shift analysis, a nuclease antibodies which may bind to the whole native polypeptide. protection analysis, or by detecting the retention of the Such antibodies can then be used to isolate the polypeptide nucleic acid on a column matrix having the TATA box 60 from cells expressing that polypeptide. binding protein, or a fragment thereof, affixed thereto. For preparation of monoclonal antibodies, any technique Compounds which specifically inhibit the binding of the which provides antibodies produced by continuous cell line TATA box binding protein of Cenarchaeum Symbiosis to cultures can be used. Examples include the hybridoma promoters may also be used to inhibit growth of the organ technique (Kohler and Milstein, 1975, Nature, 256:495-497, ism. Such compounds may be identified as described above. 65 the disclosure of which is incorporated herein by reference), Similarly, agents which specifically inhibit the activity of the trioma technique, the human B-cell hybridoma technique the polypeptides of SEQ ID NOS: 34 and 66, which have (Kozbor et al., 1983, Immunology Today 4:72, the disclo US 6,632,937 B1 25 26 Sure of which is incorporated herein by reference), and the uridines replace the thymines in the nucleic acid codes of EBV-hybridoma technique (Cole, et al., 1985, in Mono SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, clonal Antibodies and Cancer Therapy, Alan R. Liss, Inc., 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, pp. 
77-96, the disclosure of which is incorporated herein by 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77. The homolo reference). gous Sequences may be obtained using any of the procedures Techniques described for the production of Single chain described herein or may result from the correction of a antibodies (U.S. Pat. No. 4,946,778, the disclosure of which Sequencing error. It will be appreciated that the nucleic acid is incorporated herein by reference) can be adapted to codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, produce Single chain antibodies to the polypeptides of SEQ 41, 45,57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, ID NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 21, 23, 35, 39, 43, 47,49, 51, 53, 55, 69, 73 and 77 can be 32, 34, 36,38, 40, 42, 44, 48, 50, 52, 54, 56,58, 60, 62, 64, represented in the traditional Single character format (See 66, 68, 70, 72, 7476, 78, and 80 or fragments comprising at the inside back cover of Stryer, Lubert. Biochemistry, 3' least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 edition. W. H. Freeman & Co., New York.) or in any other consecutive amino acids thereof. Alternatively, transgenic format which records the identity of the nucleotides in a mice may be used to express humanized antibodies to these 15 Sequence. polypeptides or fragments thereof. As used herein the term “polypeptide codes of SEQ ID Antibodies generated against the polypeptides of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34,38, 42, 46,58, 60, 62, 64, NOs: 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 66, 68,72, 76,80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 34, 36, 38, 40, 42, 44, 48, 50, 52, 54, 56,58, 60, 62, 64, 66, 50, 52, 54, 56, 70, 74, and 78'encompasses the polypeptide 68, 70, 72, 7476, 78, and 80 or fragments comprising at least sequence of SEQ ID NOS. 
6, 10, 14, 26, 28, 30, 32, 34,38, 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive 42, 46, 58, 60, 62, 64, 66, 68,72, 76,80, 4, 8, 12, 16, 18, 20, amino acids thereof may be used in Screening for Similar 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 which polypeptides from other organisms and Samples. In Such are encoded by the extended cDNAs of SEQ ID NOS. 1, 2, techniques, polypeptides from the organism are contacted 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, with the antibody and those polypeptides which specifically 25 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, 47,49, 51, bind the antibody are detected. Any of the procedures 53, 55, 69, 73 and 77, polypeptide sequences homologous to described above may be used to detect antibody binding. the polypeptides of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, One such screening assay is described in “Methods for 34, 38, 42, 46, 58, 60, 62, 64, 66, 68,72, 76,80, 4, 8, 12, 16, Measuring Cellulase Activities”, Methods in Enzymology, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78, Vol 160, pp. 87-116, which is hereby incorporated by or fragments of any of the preceding Sequences. Homolo reference in its entirety. gous polypeptide Sequences refer to a polypeptide Sequence AS used herein the term “nucleic acid codes of SEQ ID having at least 99%, 98%, 97%, 96%, 95%, 90%, 85%, 80%, NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45,57, 59, 61, 75% or 70% homology to one of the polypeptide sequences 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34,38, 42,46,58, 47, 49, 51, 53, 55, 69, 73 and 77* encompasses the nucle 35 60, 62, 64, 66, 68,72, 76,80, 4, 8, 12, 16, 18, 20, 22, 24, 36, otide sequences of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 40, 44, 48,50, 52, 54,56, 70, 74, and 78. 
Homology may be 31,33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, determined using any of the computer programs and param 15, 17, 19, 21, 23, 35, 39, 43, 47,49, 51,53,55, 69, 73 and eters described herein, including FASTA version 3.0t78 with 77, fragments of SEQID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, the default parameters or with any modified parameters. The 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 40 homologous Sequences may be obtained using any of the 17, 19, 21, 23, 35, 39, 43, 47,49, 51,53,55, 69, 73 and 77, procedures described herein or may result from the correc nucleotide sequences homologous to SEQ ID NOS. 1, 2, 5, tion of a Sequencing error. The polypeptide fragments com 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, prise at least 5, 10, 15, 20, 25, 30, 35,40, 50, 75, 100, or 150 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, 47,49, 51, consecutive amino acids of the polypeptides of SEQ ID 53, 55, 69, 73 and 77 or homologous to fragments of SEQ 45 NOS. 6, 10, 14, 26.28, 30, 32, 34,38, 42, 46, 58, 60, 62, 64, ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45,57, 59, 66, 68, 72, 76, 80.4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 50, 52, 54, 56, 70, 74, and 78. Preferably, the fragments are 43, 47, 49, 51, 53, 55, 69, 73 and 77, and sequences novel fragments. It will be appreciated that the polypeptide complementary to all of the preceding Sequences. The codes of the SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34,38, fragments include portions of SEQID NOS. 
1, 2, 5, 9, 13, 25, 50 42, 46, 58, 60, 62, 64, 66, 68,72, 76,80, 4, 8, 12, 16, 18, 20, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 can be 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, 47,49, 51,53,55, 69, represented in the traditional Single character format or three 73 and 77 comprising at least 10, 15, 20, 25, 30, 35, 40, 50, letter format (See the inside back cover of Starrier, Lubert. 75, 100, 150, 200, 300, 400, or 500 consecutive nucleotides Biochemistry, 3 edition. W. H Freeman & Co., New York.) of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 55 or in any other format which relates the identity of the 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, polypeptides in a Sequence. 35, 39, 43, 47,49, 51,53,55, 69,73 and 77. Preferably, the It will be appreciated by those skilled in the art that the fragments are novel fragments. Homologous Sequences and nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, fragments of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 37, 41, 45,57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 60 15, 17, 19, 21, 23,35, 39, 43,47, 49, 51,53,55, 69, 73 and 19, 21, 23, 35, 39, 43, 47,49, 51,53,55, 69, 73 and 77 refer 77 and polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, to a sequence having at least 99%, 98%, 97%, 96%, 95%, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 90%, 85%, 80%, 75% or 70% homology to these sequences. 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, Homology may be determined using any of the computer 74, and 78 can be Stored, recorded, and manipulated on any programs and parameters described herein, including 65 medium which can be read and accessed by a computer. AS BLASTN version 2.0 with the default parameters. 
Homolo used herein, the words “recorded” and “stored” refer to a gous Sequences also include RNA sequences in which process for Storing information on a computer medium. A US 6,632,937 B1 27 28 skilled artisan can readily adopt any of the presently known the Sequence information described herein. One example of methods for recording information on a computer readable a computer system 100 is illustrated in block diagram form medium to generate manufactures comprising one or more in FIG. 3. As used herein, “a computer system” refers to the of the nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, hardware components, Software components, and data Stor 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, age components used to analyze the nucleotide Sequences of 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, 47,49, 51,53,55, 69, the nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 73 and 77, one or more of the polypeptide codes of SEQ ID 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, NOS. 6, 10, 14, 26, 28, 30, 32, 34,38, 42, 46,58, 60, 62, 64, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 66, 68,72, 76,80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. Another aspect of the present 73 and 77 or the sequences of the polypeptide codes of 6, 10, invention is a computer readable medium having recorded 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, thereon at least 2, 5, 10, 15, or 20 nucleic acid codes of SEQ 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45,57, 59, 56, 70, 74, and 78. The computer system 100 preferably 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, includes a processor for processing, accessing and manipu 43, 47,49, 51, 53, 55, 69, 73 and 77. lating the Sequence data. 
The processor 105 can be any Another aspect of the invention is a computer readable 15 well-known type of central processing unit, Such as the medium having recorded thereon one or more of the nucleic Pentium III from Intel Corporation, or similar processor acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, from Sun, Motorola, Compaq or International Business 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, and 79. Another Machines. aspect of the present invention is a computer readable Preferably, the computer system 100 is a general purpose medium having recorded thereon at least 2, 5, 10, or 15 of System that comprises the processor 105 and one or more SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, internal data Storage components 110 for Storing data, and 57, 59, 61, 63, 65, 67, 71, 75, and 79. one or more data retrieving devices for retrieving the data Stored on the data Storage components. A skilled artisan can Another aspect of the present invention is a computer readily appreciate that any one of the currently available readable medium having recorded thereon one or more of 25 the nucleic acid codes of SEQ ID NOS. 3, 7, 11, 15, 17, 19, computer System S are Suitable. 21, 23,35, 39, 43, 47,49, 51,53,55, 69, 73 and 77. Another In one particular embodiment, the computer system 100 aspect of the present invention is a computer readable includes a processor 105 connected to a bus which is medium having recorded thereon at least 2, 5, 10, or 15 of connected to a main memory 115 (preferably implemented SEQ ID NOS. 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47,49, as RAM) and one or more internal data storage devices 110, 51,53,55, 69, 73 and 77. Such as a hard drive and/or other computer readable media Another aspect of the present invention is a computer having data recorded thereon. 
In Some embodiments, the readable medium having recorded thereon one or more of computer system 100 further includes one or more data the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, retrieving device 118 for reading the data stored on the internal data Storage devices 110. 32, 34,38, 42, 46,58, 60, 62, 64, 66, 68,72, 76,80, 4, 8, 12, 35 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and The data retrieving device 118 may represent, for 78. Another aspect of the present invention is a computer example, a floppy disk drive, a compact disk drive, a readable medium having recorded thereon one or more of magnetic tape drive, etc. In Some embodiments, the internal the the polypeptide codes of SEQID NOS. 6, 10, 14, 26, 28, data Storage device 110 is a removable computer readable 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68,72, 76, and 80. 40 medium Such as a floppy disk, a compact disk, a magnetic Another aspect of the present invention is a computer tape, etc. containing control logic and/or data recorded readable medium having recorded thereon one or more of thereon. The computer system 100 may advantageously the the polypeptide codes of SEQ ID NOS. 4, 8, 12, 16, 18, include or be programmed by appropriate Software for 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. reading the control logic and/or the data from the data Another aspect of the present invention is a computer 45 Storage component once inserted in the data retrieving readable medium having recorded thereon at least 2, 5, 10, device. 15, or 20 polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, The computer system 100 includes a display 120 which is 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, used to display output to a computer user. It should also be 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, noted that the computer system 100 can be linked to other 74, and 78. 
Another aspect of the present invention is a 50 computer System S 125a–c in a network or wide area computer readable medium having recorded thereon at least network to provide centralized access to the computer 2, 5, 10, or 15 polypeptide codes of SEQ ID NOS. 6, 10, 14, system 100. 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68,72, 76, Software for accessing and processing the nucleotide and 80. Another aspect of the present invention is a com sequences of the nucleic acid codes of SEQ ID Nos.1, 2, 5, puter readable medium having recorded thereon at least 2, 5, 55 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 10, or 15 polypeptide codes of SEQ ID NOS. 4, 8, 12, 16, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, 47,49, 51, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID Computer readable media include magnetically readable NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, media, optically readable media, electronically readable 64,66, 68,72, 76,80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, media and magnetic/optical media. For example, the com 60 48, 50, 52, 54, 56, 70, 74, and 78 (such as search tools, puter readable media may be a hard disk, a floppy disk, a compare tools, and modeling tools etc.) may reside in main magnetic tape, CD-ROM, Digital Versatile Disk (DVD), memory 115 during execution. Random Access Memory (RAM), or Read Only Memory In some embodiments, the computer system 100 may (ROM) as well as other types of other media known to those further comprise a Sequence comparer for comparing the skilled in the art. 65 above-described nucleic acid codes of SEQ ID Nos. 
1, 2, 5, Embodiments of the present invention include Systems, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, particularly computer Systems which Store and manipulate 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, 47,49, 51, US 6,632,937 B1 29 30 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID 17:49–61). Less preferably, the PAM or PAM250 matrices NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, may also be used (see, e.g., Schwartz and Dayhoff, eds., 66, 68,72, 76,80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 1978, Matrices for Detecting Distance Relationships. Atlas 50, 52, 54, 56, 70, 74, and 78 stored on a computer readable of Protein Sequence and Structure, Washington: National medium to reference nucleotide or polypeptide Sequences 5 Biomedical Research Foundation). BLAST programs are Stored on a computer readable medium. A “Sequence com accessible through the U.S. National Library of Medicine, parer' refers to one or more programs which are imple e.g., at www.ncbi.nlm.nih.gov. mented on the computer System 100 to compare a nucleotide The BLAST programs evaluate the Statistical Significance Sequence with other nucleotide Sequences and/or com of all high-scoring Segment pairs identified, and preferably pounds Stored within the data Storage means. For example, Selects those Segments which Satisfy a user-specified thresh the Sequence comparer may compare the nucleotide old of Significance, Such as a user-specified percent homol sequences of the nucleic acid codes of SEQ ID Nos. 1, 2, 5, ogy. Preferably, the Statistical Significance of a high-scoring 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, Segment pair is evaluated using the Statistical significance 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23,35, 39, 43, 47,49, 51, formula of Karlin (see, e.g., Karlin and Altschul, 1990, Proc. 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID 15 Natl. Acad. Sci. USA 87:2267–2268). NOS. 
6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 stored on a computer readable medium to reference sequences stored on a computer readable medium to identify homologies or structural motifs. Various sequence comparer programs identified elsewhere in this patent specification are particularly contemplated for use in this aspect of the invention. Protein and/or nucleic acid sequence homologies may be evaluated using any of the variety of sequence comparison algorithms and programs known in the art. Such algorithms and programs include, but are by no means limited to, TBLASTN, BLASTN, BLASTP, FASTA, TFASTA, and CLUSTALW (Pearson and Lipman, 1988, Proc. Natl. Acad. Sci. USA 85(8):2444-2448; Altschul et al., 1990, J. Mol. Biol. 215(3):403-410; Thompson et al., 1994, Nucleic Acids Res. 22(2):4673-4680; Higgins et al., 1996, Methods Enzymol. 266:383-402; Altschul et al., 1990, J. Mol. Biol. 215(3):403-410; Altschul et al., 1993, Nature Genetics 3:266-272).

In one embodiment, protein and nucleic acid sequence homologies are evaluated using the Basic Local Alignment Search Tool ("BLAST"), which is well known in the art (see, e.g., Karlin and Altschul, 1990, Proc. Natl. Acad. Sci. USA 87:2267-2268; Altschul et al., 1990, J. Mol. Biol. 215:403-410; Altschul et al., 1993, Nature Genetics 3:266-272; Altschul et al., 1997, Nuc. Acids Res. 25:3389-3402). In particular, five specific BLAST programs are used to perform the following tasks:

(1) BLASTP and BLAST3 compare an amino acid query sequence against a protein sequence database;

(2) BLASTN compares a nucleotide query sequence against a nucleotide sequence database;

(3) BLASTX compares the six-frame conceptual translation products of a query nucleotide sequence (both strands) against a protein sequence database;

(4) TBLASTN compares a query protein sequence against a nucleotide sequence database translated in all six reading frames (both strands); and

(5) TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

The BLAST programs identify homologous sequences by identifying similar segments, which are referred to herein as "high-scoring segment pairs," between a query amino acid or nucleic acid sequence and a test sequence which is preferably obtained from a protein or nucleic acid sequence database. High-scoring segment pairs are preferably identified (i.e., aligned) by means of a scoring matrix, many of which are known in the art. Preferably, the scoring matrix used is the BLOSUM62 matrix (Gonnet et al., 1992, Science 256:1443-1445; Henikoff and Henikoff, 1993, Proteins 17:49-61).

The parameters used with the above algorithms may be adapted depending on the sequence length and degree of homology studied. In some embodiments, the parameters may be the default parameters used by the algorithms in the absence of instructions from the user.

FIG. 4 is a flow diagram illustrating one embodiment of a process 200 for comparing a new nucleotide or protein sequence with a database of sequences in order to determine the homology levels between the new sequence and the sequences in the database. The database of sequences can be a private database stored within the computer system 100, or a public database such as GENBANK that is available through the Internet.

The process 200 begins at a start state 201 and then moves to a state 202 wherein the new sequence to be compared is stored to a memory in a computer system 100. As discussed above, the memory could be any type of memory, including RAM or an internal storage device.

The process 200 then moves to a state 204 wherein a database of sequences is opened for analysis and comparison. The process 200 then moves to a state 206 wherein the first sequence stored in the database is read into a memory on the computer. A comparison is then performed at a state 210 to determine whether the first sequence is the same as the new sequence. It is important to note that this step is not limited to performing an exact comparison between the new sequence and the first sequence in the database. Methods for comparing two nucleotide or protein sequences, even if they are not identical, are well known to those of skill in the art. For example, gaps can be introduced into one sequence in order to raise the homology level between the two tested sequences. The parameters that control whether gaps or other features are introduced into a sequence during comparison are normally entered by the user of the computer system.

Once a comparison of the two sequences has been performed at the state 210, a determination is made at a decision state 212 whether the two sequences are the same. Of course, the term "same" is not limited to sequences that are absolutely identical. Sequences that are within the homology parameters entered by the user will be marked as "same" in the process 200.

If a determination is made that the two sequences are the same, the process 200 moves to a state 214 wherein the name of the sequence from the database is displayed to the user. This state notifies the user that the sequence with the displayed name fulfills the homology constraints that were entered. Once the name of the stored sequence is displayed to the user, the process 200 moves to a decision state 218 wherein a determination is made whether more sequences exist in the database. If no more sequences exist in the database, then the process 200 terminates at an end state 220.
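The database scan of process 200 (FIG. 4) can be illustrated with a minimal sketch. It is not the patented implementation: the function names `percent_identity` and `scan_database`, the in-memory list of (name, sequence) records, and the 70% default threshold are all assumptions made for illustration. The comparer is a deliberately simple ungapped identity count standing in for a real program such as BLASTN or FASTA.

```python
from typing import Callable, Iterable, List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Toy comparer (an assumption, not BLAST or FASTA): ungapped,
    position-by-position identity as a percentage of the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / max(len(a), len(b))


def scan_database(
    new_seq: str,
    database: Iterable[Tuple[str, str]],
    threshold: float = 70.0,  # illustrative user-entered homology constraint
    comparer: Callable[[str, str], float] = percent_identity,
) -> List[Tuple[str, float]]:
    """Compare new_seq with every (name, sequence) record, as in FIG. 4."""
    hits = []
    for name, db_seq in database:           # states 206 and 224: read first/next record
        score = comparer(new_seq, db_seq)   # state 210: perform the comparison
        if score >= threshold:              # decision: "same" within the entered parameters?
            hits.append((name, score))      # state 214: report the sequence name
    return hits                             # loop exhausted (decision state 218): end state 220


if __name__ == "__main__":
    db = [("seqA", "ATGCATGCAT"), ("seqB", "ATGCATGGAT"), ("seqC", "TTTTTTTTTT")]
    for name, score in scan_database("ATGCATGCAT", db, threshold=90.0):
        print(f"{name}: {score:.1f}%")
```

Passing the comparer as a parameter mirrors the text's point that the comparison step is not limited to exact matching: a gapped aligner could be substituted without changing the loop.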
However, if more sequences do exist in the database, then the process 200 moves to a state 224 wherein a pointer is moved to the next sequence in the database so that it can be compared to the new sequence. In this manner, the new sequence is aligned and compared with every sequence in the database.

It should be noted that if a determination had been made at the decision state 212 that the sequences were not homologous, then the process 200 would move immediately to the decision state 218 in order to determine if any other sequences were available in the database for comparison.

Accordingly, one aspect of the present invention is a computer system comprising a processor, a data storage device having stored thereon a nucleic acid code of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78, a data storage device having retrievably stored thereon reference nucleotide sequences or polypeptide sequences to be compared to the nucleic acid code of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 and a sequence comparer for conducting the comparison. The sequence comparer may indicate a homology level between the sequences compared or identify structural motifs in the above described nucleic acid code of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 or it may identify structural motifs in sequences which are compared to these nucleic acid codes and polypeptide codes. In some embodiments, the data storage device may have stored thereon the sequences of at least 2, 5, 10, 15, 20, 25, 30 or 40 or more of the nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78.

Another aspect of the present invention is a method for determining the level of homology between a nucleic acid code of SEQ ID NOS.

herein, including BLAST2N or BLASTN with the default parameters or with any modified parameters. The method may be implemented using the computer systems described above. The method may also be performed by reading at least 2, 5, 10, 15, 20, 25, 30 or 40 or more of the above described nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 through use of the computer program and determining homology between the nucleic acid codes or polypeptide codes and reference nucleotide sequences or polypeptide sequences.

FIG. 5 is a flow diagram illustrating one embodiment of a process 250 in a computer for determining whether two sequences are homologous. The process 250 begins at a start state 252 and then moves to a state 254 wherein a first sequence to be compared is stored to a memory. The second sequence to be compared is then stored to a memory at a state 256. The process 250 then moves to a state 260 wherein the first character in the first sequence is read and then to a state 262 wherein the first character of the second sequence is read. It should be understood that if the sequence is a nucleotide sequence, then the character would normally be either A, T, C, G or U. If the sequence is a protein sequence, then it is preferably in the single letter amino acid code so that the first and second sequences can be easily compared.

A determination is then made at a decision state 264 whether the two characters are the same. If they are the same, then the process 250 moves to a state 268 wherein the next characters in the first and second sequences are read. A determination is then made whether the next characters are the same. If they are, then the process 250 continues this loop until two characters are not the same. If a determination is made that the next two characters are not the same, the process 250 moves to a decision state 274 to determine whether there are any more characters in either sequence to read.

If there aren't any more characters to read, then the process 250 moves to a state 276 wherein the level of homology between the first and second sequences is displayed to the user. The level of homology is determined by calculating the proportion of characters between the sequences that were the same out of the total number of characters in the first sequence. Thus, if every character in a first 100 nucleotide sequence aligned with every character in a second sequence, the homology level would be 100%.
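The character-by-character comparison of process 250 (FIG. 5) can be sketched as follows. This is one reading of the flow diagram rather than a definitive implementation: the function name `homology_level` is hypothetical, and the sketch assumes that every aligned position is examined and the matches counted, with the result expressed as a percentage of the length of the first sequence.

```python
def homology_level(first: str, second: str) -> float:
    """Percentage of characters in `first` matched by `second` at the same position.

    Assumes ungapped, left-aligned sequences written in single-letter codes
    (A, T, C, G or U for nucleotides; one-letter codes for amino acids).
    """
    if not first:
        raise ValueError("first sequence must not be empty")
    same = 0
    # zip stops at the shorter sequence, i.e. when either one has no more
    # characters to read (decision state 274).
    for a, b in zip(first.upper(), second.upper()):  # states 260, 262 and 268
        if a == b:                                   # decision state 264
            same += 1
    return 100.0 * same / len(first)                 # state 276: level of homology


if __name__ == "__main__":
    print(homology_level("ATGC", "ATGC"))
    print(homology_level("ATGC", "ATGA"))
```

Because no gaps are introduced, a single insertion or deletion shifts every later position; the gapped comparers named elsewhere in the specification address that case.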
1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, Alternatively, the computer program may be a computer 41, 45,57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, program which compares the nucleotide Sequences of the 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the nucleic acid codes of the present invention, to reference polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 55 nucleotide Sequences in order to determine whether the 34, 38,42, 46, 58, 60, 62, 64, 66, 68,72, 76,80, 4, 8, 12, 16, nucleic acid code of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, and a reference nucleotide Sequence or polypeptide 15, 17, 19, 21, 23,35, 39, 43,47, 49, 51,53,55, 69, 73 and Sequence, comprising the Steps of reading the nucleic acid 77 differs from a reference nucleic acid Sequence at one or code or the polypeptide code and the reference nucleotide or 60 more positions. Optionally Such a program records the polypeptide Sequence through the use of a computer pro length and identity of inserted, deleted or Substituted nucle gram which determines homology levels and determining otides with respect to the Sequence of either the reference homology between the nucleic acid code or polypeptide polynucleotide or the nucleic acid code of SEQ ID NOS. 1, code and the reference nucleotide or polypeptide Sequence 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45,57, 59, 61, 63, 65, with the computer program. The computer program may be 65 67,71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47,49, any of a number of computer programs for determining 51,53,55, 69, 73 and 77. In one embodiment, the computer homology levels, including those Specifically enumerated program may be a program which determines whether the US 6,632,937 B1 33 34 nucleotide sequences of the nucleic acid codes of SEQ ID puter Group and can be accessed on the Worldwide web at NOs. 
1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 contain a single nucleotide polymorphism (SNP) with respect to a reference nucleotide sequence.

Accordingly, another aspect of the present invention is a method for determining whether a nucleic acid code of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 differs at one or more nucleotides from a reference nucleotide sequence comprising the steps of reading the nucleic acid code and the reference nucleotide sequence through use of a computer program which identifies differences between nucleic acid sequences and identifying differences between the nucleic acid code and the reference nucleotide sequence with the computer program. In some embodiments, the computer program is a program which identifies single nucleotide polymorphisms. The method may be implemented by the

the address gcg.com. Alternatively, the features may be structural polypeptide motifs such as alpha helices, beta sheets, or functional polypeptide motifs such as enzymatic active sites, helix-turn-helix motifs or other motifs known to those skilled in the art.

Once the database of features is opened at the state 306, the process 300 moves to a state 308 wherein the first feature is read from the database. A comparison of the attribute of the first feature with the first sequence is then made at a state 310. A determination is then made at a decision state 316 whether the attribute of the feature was found in the first sequence. If the attribute was found, then the process 300 moves to a state 318 wherein the name of the found feature is displayed to the user.

The process 300 then moves to a decision state 320 wherein a determination is made whether more features exist in the database. If no more features do exist, then the process 300 terminates at an end state 324.
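The identifier loop of process 300 (FIG. 7) can be sketched in a few lines. The sketch is illustrative only: `find_features` and `EXAMPLE_FEATURES` are hypothetical names, the feature database is reduced to an in-memory table, and each attribute is assumed to be a literal subsequence. The two entries reuse the feature names and attributes given as examples in the specification.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a feature database: feature name -> attribute.
EXAMPLE_FEATURES: Dict[str, str] = {
    "Initiation Codon": "ATG",
    "TAATAA Box": "TAATAA",
}


def find_features(sequence: str, features: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the names of all features whose attribute occurs in the sequence."""
    table = EXAMPLE_FEATURES if features is None else features
    seq = sequence.upper()
    found = []
    for name, attribute in table.items():  # state 308: read a feature from the database
        if attribute.upper() in seq:        # state 310 comparison, decision state 316
            found.append(name)              # state 318: report the name of the found feature
    return found                            # no more features (decision state 320): end state 324


if __name__ == "__main__":
    print(find_features("ccATGaaTAATAAgg"))
```

A literal substring test covers only fixed motifs; structural motifs such as alpha helices would need a predictive method in place of the `in` test.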
However, if more polymorphisms. The method may be implemented by the features do exist in the database, then the process 300 reads computer System S described above and the method illus the next Sequence feature at a State 326 and loops back to the trated in FIG. 6. The method may also be performed by state 310 wherein the attribute of the next feature is com reading at least 2, 5, 10, 15, 20, 25, 30, or 40 of the nucleic pared against the first Sequence. acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, It should be noted, that if the feature attribute is not found 37, 41, 45,57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 25 in the first sequence at the decision state 316, the process 300 19, 21, 23,35, 39, 43, 47,49, 51,53,55, 69, 73 and 77 and moves directly to the decision state 320 in order to determine the reference nucleotide Sequences through the use of the if any more features exist in the database. computer program and identifying differences between the Accordingly, another aspect of the present invention is a nucleic acid codes and the reference nucleotide Sequences method of identifying a feature within the nucleic acid codes with the computer program. of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, In other embodiments the computer based System may 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, further comprise an identifier for identifying features within 35.39, 43, 47, 49, 51,53,55, 69, 73 and 77 or the polypep the nucleotide Sequences of the nucleic acid codes of SEQ tide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34,38, ID NOS. 
1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78.

An “identifier” refers to one or more programs which identify certain features within the above-described nucleotide sequences of the nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. In one embodiment, the identifier may comprise a program which identifies an open reading frame in the nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77.

FIG. 7 is a flow diagram illustrating one embodiment of an identifier process 300 for detecting the presence of a feature in a sequence. The process 300 begins at a start state 302 and then moves to a state 304 wherein a first sequence that is to be checked for features is stored to a memory 115 in the computer system 100. The process 300 then moves to a state 306 wherein a database of sequence features is opened. Such a database would include a list of each feature's attributes along with the name of the feature. For example, a feature name could be “Initiation Codon” and the attribute would be “ATG”. Another example would be the feature name “TAATAA Box” and the feature attribute would be “TAATAA”. An example of such a database is produced by the University of Wisconsin Genetics Com

42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 comprising reading the nucleic acid code(s) or polypeptide code(s) through the use of a computer program which identifies features therein and identifying features within the nucleic acid code(s) with the computer program. In one embodiment, the computer program comprises a computer program which identifies open reading frames. The method may be performed by reading a single sequence or at least 2, 5, 10, 15, 20, 25, 30, or 40 of the nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 through the use of the computer program and identifying features within the nucleic acid codes or polypeptide codes with the computer program.

The nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 may be stored and manipulated in a variety of data processor programs in a variety of formats. For example, the nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78 may be stored as text in a word processing file, such as Microsoft WORD or WORDPERFECT, or as an ASCII file in a variety of database programs familiar to those of skill in the art, such as DB2, SYBASE, or ORACLE. In addition, many computer programs and databases may be used as sequence comparers, identifiers, or sources of reference nucleotide sequences or polypeptide sequences to be compared to the nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78. The following list is intended not to limit the invention but to provide guidance to programs and databases which are useful with the nucleic acid codes of SEQ ID NOS. 1, 2, 5, 9, 13, 25, 27, 29, 31, 33, 37, 41, 45, 57, 59, 61, 63, 65, 67, 71, 75, 79, 3, 7, 11, 15, 17, 19, 21, 23, 35, 39, 43, 47, 49, 51, 53, 55, 69, 73 and 77 or the polypeptide codes of SEQ ID NOS. 6, 10, 14, 26, 28, 30, 32, 34, 38, 42, 46, 58, 60, 62, 64, 66, 68, 72, 76, 80, 4, 8, 12, 16, 18, 20, 22, 24, 36, 40, 44, 48, 50, 52, 54, 56, 70, 74, and 78.

The programs and databases which may be used include, but are not limited to: MacPattern (EMBL), DiscoveryBase (Molecular Applications Group), GeneMine (Molecular Applications Group), Look (Molecular Applications Group), MacLook (Molecular Applications Group), BLAST and BLAST2 (NCBI), BLASTN and BLASTX (Altschul et al., J. Mol. Biol. 215:403 (1990)), FASTA (Pearson and Lipman, Proc. Natl. Acad. Sci. USA 85:2444 (1988)), FASTDB (Brutlag et al., Comp. App. Biosci. 6:237-245, 1990), Catalyst (Molecular Simulations Inc.), Catalyst/SHAPE (Molecular Simulations Inc.), Cerius2.DBAccess (Molecular Simulations Inc.), HypoGen (Molecular Simulations Inc.), Insight II (Molecular Simulations Inc.), Discover (Molecular Simulations Inc.), CHARMm (Molecular Simulations Inc.), Felix (Molecular Simulations Inc.), DelPhi (Molecular Simulations Inc.), QuanteMM (Molecular Simulations Inc.), Homology (Molecular Simulations Inc.), Modeler (Molecular Simulations Inc.), ISIS (Molecular Simulations Inc.), Quanta/Protein Design (Molecular Simulations Inc.), WebLab (Molecular Simulations Inc.), WebLab Diversity Explorer (Molecular Simulations Inc.), Gene Explorer (Molecular Simulations Inc.), SeqFold (Molecular Simulations Inc.), the MDL Available Chemicals Directory database, the MDL Drug Data Report database, the Comprehensive Medicinal Chemistry database, Derwent's World Drug Index database, the BioByte MasterFile database, the Genbank database, and the Genseqn database. Many other programs and databases would be apparent to one of skill in the art given the present disclosure.

Motifs which may be detected using the above programs include sequences encoding leucine zippers, helix-turn-helix motifs, glycosylation sites, ubiquitination sites, alpha helices, and beta sheets, signal sequences encoding signal peptides which direct the secretion of the encoded proteins, sequences implicated in transcription regulation such as homeoboxes, acidic stretches, enzymatic active sites, substrate binding sites, and enzymatic cleavage sites.

The present invention will be further described with reference to the following examples; however, it is to be understood that the present invention is not limited to such examples.

In order to begin the physiological characterization of Cenarchaeum symbiosum, it was necessary to obtain enriched preparations of Cenarchaeum symbiosum for use in the construction of genomic DNA libraries in fosmid based vectors. Genomic DNA libraries were constructed from two enriched preparations using the methods described in Example 1 below.

EXAMPLE 1

Enrichment of Cenarchaeum symbiosum Cells in Samples Obtained from Axinella mexicana

Enriched preparations of Cenarchaeum symbiosum for use in the preparation of the first fosmid genomic DNA library were obtained essentially as described in Preston, C. M. et al. 1996. A psychrophilic crenarchaeon inhabits a marine sponge: Cenarchaeum symbiosum gen. nov., sp. nov. Proc. Natl. Acad. Sci. USA 93, 6241-6246, the disclosure of which is incorporated herein by reference. Briefly, a small individual of A. mexicana was incubated in calcium- and magnesium-free artificial seawater (ASW) containing 0.25 mg/ml Pronase. The tissue was then homogenized and enriched for archaeal cells by differential centrifugation.

Enriched preparations of Cenarchaeum symbiosum for use in preparing the second fosmid genomic DNA library were obtained from a different sponge individual using the following improved enrichment procedure. A small individual of A. mexicana was incubated in calcium- and magnesium-free artificial seawater (460 mM NaCl, 11 mM KCl, 7 mM Na2SO4, 2 mM NaHCO3) containing 0.25 mg/ml Pronase at room temperature for one hour. The sponge tissue was rinsed in artificial seawater and homogenized in a blender. Large particles and spicules were removed by low-speed centrifugation (4000 rpm, Sorvall GSA rotor at 4° C.). The supernatant was next centrifuged at 5000 rpm for 5 min. at 4° C. to remove large sponge cells, and the resulting supernatant was centrifuged at 10,000 rpm in a GSA rotor at 4° C. for 20 min. to collect the Cenarchaeum symbiosum cells. Following centrifugation, the recovered cell fraction containing Cenarchaeum symbiosum was further incubated for 1 hr at 4° C. in 10 mM Tris/HCl pH 8 and 200 mM EDTA. The cells were then pelleted and subsequently purified on a 15% Percoll (Sigma) cushion in artificial seawater centrifuged at 2500 rpm in a Beckman SS34 rotor. Archaeal cells banded in the light, upper fraction after centrifugation. This cell fraction was washed in ASW and resuspended in TE buffer (10 mM Tris-HCl pH 8, 0.1 mM EDTA). The additional incubation step was found to increase the lysis of sponge cells, which resulted in an enhanced separation of archaeal and eukaryotic cells in the Percoll gradient.

Quantitative hybridization experiments were performed as described in DeLong, E. F. 1992. Archaea in coastal marine environments. Proc. Natl. Acad. Sci. 89, 5685-5689, the disclosure of which is incorporated herein by reference, using an oligonucleotide specific for archaea having the sequence GTGCTCCCCCGCCAATTCCT (SEQ ID NO: 115). These hybridization experiments indicated that 25% to 30% of the total rRNA from this fraction was derived from archaea.

The enriched cell preparations were then utilized to construct fosmid libraries as described in Example 2 below.

EXAMPLE 2

Construction of Fosmid Libraries

DNA was extracted from the enriched preparations of Example 1 and inserted into fosmids as described in Preston, C. M. et al. 1996. A psychrophilic crenarchaeon inhabits a marine sponge: Cenarchaeum symbiosum gen. nov., sp. nov. Proc. Natl. Acad. Sci. USA 93, 6241-6246 and Stein, J. L. et al. 1996. Characterization of uncultivated prokaryotes: isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon. J. Bacteriol. 178, 591-599, the disclosures of which are incorporated herein by reference. A vertical cross section of sponge (0.5 g) was mechanically dissociated in 0.22 µm filtered, autoclaved seawater using a tissue homogenizer. Cell lysis was accomplished by incubating the dissociated cells in 1 mg of lysozyme per ml for 30 min. at 37° C. followed by an incubation for 30 min. at 55° C. with 0.5 mg of proteinase K per ml and 1% SDS. The tubes were finally placed in a boiling water bath for 60 sec to complete lysis. The protein fraction was removed with two extractions with phenol:chloroform:isoamyl alcohol (50:49:1), pH 8.0, followed by a chloroform:isoamyl alcohol (24:1) extraction. Nucleic acids were ethanol-precipitated and resuspended in TE buffer (10 mM Tris-HCl/1 mM Na-EDTA, pH 8.0). Approximately 5 µg of DNA was purified by CsCl equilibrium density gradient ultracentrifugation on a Beckman Optima tabletop ultracentrifuge using a TLA100 rotor.

The genomic DNA obtained above was inserted into fosmids as follows. The genomic DNA was partially digested with Sau3AI (Promega) and treated with heat-labile phosphatase (HK phosphatase, Epicentre). The partially digested genomic DNA was ligated with pFOS (see U. J. Kim et al., Nucleic Acids Res. 20:1083-1085 (1992), the disclosure of which is incorporated herein by reference) which had previously been digested with Aat, phosphatase treated (HK phosphatase), and subsequently digested with BamHI. The ligation mixture was used for in vitro packaging with the Gigapack XL packaging system (Stratagene) selecting for DNA inserts of 35 to 45 kb. The phage particles were transfected into E. coli DH10B (Bethesda Research Laboratories) and the cells were spread onto LB plates supplemented with 12.5 µg/ml chloramphenicol.

EXAMPLE 3

Identification of Fosmids Containing the Cenarchaeum symbiosum rRNA Operon

The fosmid libraries constructed above were screened to identify clones containing the rRNA operon.

EXAMPLE 4

Fosmid Sequencing

Partial restriction enzyme digests were conducted on two purified fosmids, fosmid 101G10 (which contains the variant A sequence) and fosmid 60A5 (which contains the variant B sequence). The partially digested DNA was used to construct plasmid libraries containing inserts of 1-2 kb. The resulting plasmids were sequenced using Applied Biosystems (ABI, Foster City, Calif.) Prism Dye-terminator FS reaction mix. Direct sequencing from fosmids was used for gap filling and resequencing to ensure accuracy. Fosmid sequencing was performed by using DNA from a single 3 ml overnight culture purified on an Autogen 740 automated plasmid isolation system. Each reaction consisted of one preparation of DNA directly resuspended by the addition of 16 µl H2O, 8 µl oligonucleotide primer (1.4 pmol/µl) and 16 µl ABI Prism Dye-terminator FS reaction mix. Cycle sequencing was performed with a 96° C. 3 min. preincubation followed by 25 cycles of the sequence 96° C. 20 sec./50° C. 20 sec./60° C. 4 min. and a 5 min. post-cycling incubation at 60° C. Sequencing reaction products were analyzed on ABI 377 Prism Sequencers.

The complete sequences of the Cenarchaeum symbiosum derived inserts in the two fosmids are provided in the accompanying sequence listing as SEQ ID NO: 1 (fosmid 101G10) and SEQ ID NO: 2 (fosmid 60A5). The insert of fosmid 101G10 (SEQ ID NO: 1, designated variant A) was 32,998 bp and was syntenic over ca. 28 kbp with the 42,432 bp insert of fosmid 60A5 (SEQ ID NO: 2, designated variant B). Analysis of the common 28 kbp region is shown in FIG. 1.

Although the sequences of both fosmids could be aligned unambiguously over most of the overlapping region, four large insertion/deletions ranging in size from 142 bp to 1994 bp were identified between positions 20,500 and 25,800. The longest insertion contained a repetitive element of 1784 bp that was found in the sequence of SEQ ID NO: 1 between menA and ORF 05. It was composed of a 3-fold direct repeat of 575 bp (rep1 through 3 in FIG. 1), with repeats exhibiting only minor sequence variation (95.8% to 98.7% identity).
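The identity range quoted for the three copies of the direct repeat is a pairwise comparison between repeat units. The patent does not state how the 95.8% to 98.7% figures were computed, so the following is only a minimal sketch. It assumes the units have already been excised at equal length and can be compared position by position without gaps. The function names and the toy 20-nt sequences are hypothetical and are not taken from SEQ ID NO: 1.

```python
from itertools import combinations


def percent_identity(a, b):
    """Ungapped percent identity between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length for an ungapped comparison")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a.upper(), b.upper()) if x == y)
    return 100.0 * matches / len(a)


def pairwise_identities(units):
    """Percent identity for every pair of repeat units, keyed by 1-based unit numbers."""
    return {
        (i + 1, j + 1): percent_identity(units[i], units[j])
        for i, j in combinations(range(len(units)), 2)
    }


# Toy "repeat units" (assumed data for illustration only).
rep1 = "ATGCATGCATGCATGCATGC"
rep2 = "ATGCATGCATGCATGCATGA"  # differs from rep1 at the last position
rep3 = "ATGCTTGCATGCATGCATGC"  # differs from rep1 at the fifth position

for pair, identity in pairwise_identities([rep1, rep2, rep3]).items():
    print(pair, f"{identity:.1f}%")
```

A real comparison of 575 bp repeat copies would normally start from a gapped alignment, because small insertions or deletions shift every downstream position in an ungapped comparison.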
PCR reactions were conducted on the library using primers known to amplify the rRNA operon.

The first fosmid library yielded seven unique clones, out of a total of 10,236 recombinant fosmids, which contained the Cenarchaeum symbiosum rRNA operon. The second fosmid library yielded eight unique clones, out of a total of 2100 recombinant fosmids, which contained the Cenarchaeum symbiosum rRNA operon.

The sequences of the 16S rRNA genes in each of the 15 fosmids containing the Cenarchaeum symbiosum rRNA operon were determined. The sequences of the small subunit rRNA genes of these 15 fosmids exhibited variations with respect to one another. Ten of the fosmids contained a small subunit rRNA gene having the sequence of the 16S rRNA gene in the insert of SEQ ID NO: 1, while the remaining fosmids contained a small subunit rRNA gene having the sequence of the 16S rRNA gene in the insert of SEQ ID NO: 2. As discussed in more detail below, the differences in the sequences of the rRNA genes may be used to determine whether a sample contains Cenarchaeum symbiosum variant A or Cenarchaeum symbiosum variant B.

In addition to determining the sequences of the rRNA genes, the sequences adjacent to the rRNA genes were also determined.

A segment of 56 bp at the start of this repeat was also found adjacent to the 3' terminus of the third direct repeat. No obvious structural or sequence similarities to known repeats or mobile genetic elements from other organisms were identified within the repeat sequence. Its occurrence in only one variant and its relatively low G+C content relative to the rest of the fragment suggest that it may have been acquired by horizontal transfer from a different genetic context.

The sequenced regions contained several open reading frames or RNA encoding sequences. Some of the identified open reading frames encode proteins having homology to previously identified proteins. In particular, some of the open reading frames encode proteins involved in several metabolic pathways, providing insight into the physiology of Cenarchaeum symbiosum.

An open reading frame which encodes a protein having homology to glutamate semialdehyde aminotransferase (a protein involved in heme biosynthesis) was identified between nucleotides 7604-8908 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 23558-24682 of the insert from fosmid 60A5 (SEQ ID NO: 2). These open reading frames have been assigned SEQ ID NOs: 45 and 13 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 46 and 14 respectively in the accompanying sequence listing. A gene encoding glutamate semialdehyde aminotransferase has also been detected in a rRNA operon containing genomic fragment of a planktonic marine crenarchaeote (Stein, J. L. et al. 1996. Characterization of uncultivated prokaryotes: isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon. J. Bacteriol. 178, 591-599).

An open reading frame encoding a protein having homology to triose-phosphate isomerase was identified between nucleotides 13944-14612 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 29655-30491 of the insert from fosmid 60A5 (SEQ ID NO: 2). These open reading frames have been assigned SEQ ID NOs: 57 and 25 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 58 and 26 respectively in the accompanying sequence listing. This triosephosphate isomerase represents the first such protein sequence reported in a crenarchaeote, and shares known archaeal signature sequences and deletions which distinguish archaeal triosephosphate isomerase genes from their eucaryal and eubacterial homologues.

An open reading frame encoding a protein having homology to the TATA binding protein was identified between nucleotides 14616-15164 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 30501-31049 of the insert from fosmid 60A5 (SEQ ID NO: 2) on the strands complementary to the insert strands provided in SEQ ID NOs: 1 and 2. These open reading frames have been assigned SEQ ID NOs: 59 and 27 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 60 and 28 respectively in the accompanying sequence listing. This TATA box-binding protein (TBP) is similar to other known archaeal TBPs and is N-terminally truncated with respect to the eukaryal homologs. It shares 49% amino acid similarity with TBP from Pyrococcus woesei.

An open reading frame encoding a protein having homology to DNA polymerase (a protein involved in DNA replication and repair) was identified between nucleotides 15488-18025 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 31371-33905 of the insert from fosmid 60A5 (SEQ ID NO: 2) on the strands complementary to the insert strands provided in SEQ ID NOs: 1 and 2. These open reading frames have been assigned SEQ ID NOs: 61 and 29 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 62 and 30 respectively in the accompanying sequence listing.

The DNA polymerase of Cenarchaeum symbiosum has a high degree of similarity to the crenarchaeal homologs from the extreme thermophiles Sulfolobus acidocaldarius and Pyrodictium occultum (54% and 53% resp.) and exhibits all conserved motifs of B-(α-)type DNA polymerases and 3'-5'-exonuclease motifs, both indicative of archaeal polymerases. A more detailed phylogenetic analysis and biochemical characterization of the C. symbiosum polymerase has been published elsewhere (Schleper, C., et al. 1997. Characterization of a DNA polymerase from the uncultivated psychrophilic archaeon Cenarchaeum symbiosum. J. Bact. 179, 7803-7811).

NOs: 63 and 31 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 64 and 32 respectively in the accompanying sequence listing.

An open reading frame encoding a protein having homology to the ATP dependent RNA helicase (a protein involved in translation) was identified between nucleotides 18638-20149 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 34559-36067 of the insert from fosmid 60A5 (SEQ ID NO: 2). These open reading frames have been assigned SEQ ID NOs: 65 and 33 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 66 and 34 respectively in the accompanying sequence listing. The identified ATP dependent RNA helicase is highly similar in sequence to homologues found in the genomic sequences of three euryarchaeota (Bult, C., et al. 1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273, 1058-1073; Klenk, H. P. et al. 1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390, 364-370; Smith, D. R. et al. 1997. Complete genome sequence of Methanobacterium thermoautotrophicum delta H: functional analysis and comparative genomics. J. Bacteriol. 179, 7135-7155).

An open reading frame encoding a protein having homology to MenA (a protein involved in menaquinone biosynthesis) was identified between nucleotides 20956-21834 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 37404-38282 of the insert from fosmid 60A5 (SEQ ID NO: 2). These open reading frames have been assigned SEQ ID NOs: 71 and 37 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 72 and 38 respectively in the accompanying sequence listing.

An open reading frame encoding a protein having homology to the site specific DNA methyltransferase proteins involved in restriction/modification was identified between nucleotides 2637-27454 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 40563-41669 of the insert from fosmid 60A5 (SEQ ID NO: 2) on the strands complementary to the insert strands provided in SEQ ID NOs: 1 and 2. These open reading frames have been assigned SEQ ID NOs: 75 and 41 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 76 and 42 respectively in the accompanying sequence listing.

An open reading frame encoding a protein having homology to the histone H1 DNA binding protein was identified between nucleotides 10625-1134 of the insert from fosmid 60A5 (SEQ ID NO: 2). This open reading frame has been assigned SEQ ID NO: 5 in the accompanying sequence listing, while the polypeptide it encodes has been assigned SEQ ID NO: 6 in the accompanying sequence listing.

An open reading frame encoding a protein having homology to lysyl tRNA synthetase was identified between nucleotides 13046-14620 of the insert from fosmid 60A5 (SEQ ID NO: 2). This open reading frame has been assigned SEQ ID NO: 9 in the accompanying sequence listing, while the polypeptide it encodes has been assigned SEQ ID NO: 10 in the accompanying sequence listing.
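Several of the open reading frames above are reported as nucleotide ranges on the strand complementary to the one given in the sequence listing. As a rough illustration of what such coordinates mean, the sketch below assumes 1-based, inclusive positions on the listed strand and returns the reverse complement of that interval for complementary-strand features. The helper names and the toy sequence are hypothetical; the patent does not specify any particular software for this step.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(seq, start, end, complementary=False):
    """Extract positions start..end (1-based, inclusive) from seq.

    If complementary is True, the feature lies on the opposite strand,
    so the excised interval is reverse-complemented before it is returned.
    """
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("coordinates fall outside the sequence")
    region = seq[start - 1:end]
    return reverse_complement(region) if complementary else region


# Toy insert (assumed data). Positions 3-11 hold TTAGGGCAT, whose reverse
# complement reads as a start codon, one sense codon, and a stop codon.
insert = "GGTTAGGGCATCC"
print(extract_feature(insert, 3, 11, complementary=True))
```

Applied to the actual listing, the same call with `complementary=True` and, for example, the TATA binding protein range would be expected to yield the coding strand of that open reading frame, provided the coordinates follow the convention assumed here.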
An open reading frame which encodes a protein having homology to dCMP deaminase (a protein involved in pyrimidine synthesis) was identified between nucleotides 18022-18663 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 33902-34456 of the insert from fosmid 60A5 (SEQ ID NO: 2) on the strands complementary to the insert strands provided in SEQ ID NOs: 1 and 2. These open reading frames have been assigned SEQ ID

A hypothetical open reading frame was identified between nucleotides 11478-13046 of the insert from fosmid 60A5 (SEQ ID NO: 2). This open reading frame has been assigned SEQ ID NO: 7 in the accompanying sequence listing, while the polypeptide it encodes has been assigned SEQ ID NO: 8 in the accompanying sequence listing.

An open reading frame encoding a protein having homology to peptidylprolyl cis/trans isomerase (a chaperone) was identified between nucleotides 20156-20434 of the insert from fosmid 101G10 (SEQ ID NO: 1) on the strand complementary to that provided in the sequence listing. This open reading frame has been assigned SEQ ID NO: 67 in the accompanying sequence listing, while the polypeptide it encodes has been assigned SEQ ID NO: 68 in the accompanying sequence listing.

An open reading frame encoding a protein having homology to glucose-1-dehydrogenase was identified between nucleotides 28065-29843 of the insert from fosmid 101G10 (SEQ ID NO: 1). This open reading frame has been assigned SEQ ID NO: 79 in the accompanying sequence listing, while the polypeptide it encodes has been assigned SEQ ID NO: 80 in the accompanying sequence listing.

A hypothetical open reading frame designated Hypothetical 01 was identified between nucleotides 1358-2290 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 17329-18213 of the insert from fosmid 60A5 (SEQ ID NO: 2) on the strands complementary to the insert strands provided in SEQ ID NOs: 1 and 2. These open reading frames have been assigned SEQ ID NOs: 43 and 11 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 44 and 12 respectively in the accompanying sequence listing.

A hypothetical open reading frame designated Hypothetical 02 was identified between nucleotides 8961-9767 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 24913-25728 of the insert from fosmid 60A5 (SEQ ID NO: 2).

have been assigned SEQ ID NOs: 55 and 23 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 56 and 24 respectively in the accompanying sequence listing.

An open reading frame designated Hypothetical 03 was identified between nucleotides 20554-20955 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 37002-37403 of the insert from fosmid 60A5 (SEQ ID NO: 2). These open reading frames have been assigned SEQ ID NOs: 69 and 35 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 70 and 36 respectively in the accompanying sequence listing.

An open reading frame designated ORF 05 was identified between nucleotides 25151-26377 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 39454-40572 of the insert from fosmid 60A5 (SEQ ID NO: 2). These open reading frames have been assigned SEQ ID NOs: 73 and 39 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 74 and 40 respectively in the accompanying sequence listing.

An open reading frame encoding a protein with no homology to known proteins was identified between nucleotides 3-10421 of the insert from fosmid 60A5 (SEQ ID NO: 2). This open reading frame has been assigned SEQ ID NO: 3 in the accompanying sequence listing, while the polypeptide it encodes has been assigned SEQ ID NO: 4 in the accompanying sequence listing.
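A quick arithmetic check can be run on coordinates of this kind. With 1-based, inclusive positions, an open reading frame spanning start..end has end - start + 1 nucleotides, and a complete coding region should be a whole number of codons. The sketch below applies that check to a few ranges copied from this passage. The function name, and the choice to treat a length that is not a multiple of three as worth a second look, are illustrative assumptions rather than anything stated in the patent.

```python
def orf_length(start, end):
    """Length in nucleotides of a 1-based, inclusive interval."""
    if end < start:
        raise ValueError(f"end ({end}) precedes start ({start})")
    return end - start + 1


# (label, start, end) as quoted in the text for the fosmid named in the label.
ranges = [
    ("Hypothetical 01, 101G10", 1358, 2290),
    ("Hypothetical 01, 60A5", 17329, 18213),
    ("Hypothetical 03, 101G10", 20554, 20955),
    ("ORF 05, 60A5", 39454, 40572),
    ("no-homology ORF, 60A5", 3, 10421),
]

for label, start, end in ranges:
    n = orf_length(start, end)
    codons, remainder = divmod(n, 3)
    status = "whole codons" if remainder == 0 else "not a multiple of 3"
    print(f"{label}: {n} nt, {codons} codons ({status})")
```

A range whose end precedes its start, or whose length leaves a remainder, is more likely a transcription problem in the printed coordinates than a biological feature, so the check is mainly useful for flagging entries to compare against the sequence listing itself.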
These open reading frames have been assigned SEQ ID NOs: 47 and 15 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 48 and 16 respectively in the accompanying sequence listing.

An open reading frame designated ORF 01 was identified between nucleotides 9772-10479 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 25732-26427 of the insert from fosmid 60A5 (SEQ ID NO: 2) on the strands complementary to the insert strands provided in SEQ ID NOs: 1 and 2. These open reading frames have been assigned SEQ ID NOs: 49 and 17 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 50 and 18 respectively in the accompanying sequence listing.

An open reading frame designated ORF 02 was identified between nucleotides 10545-10922 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 26504-26881 of the insert from fosmid 60A5 (SEQ ID NO: 2). These open reading frames have been assigned SEQ ID NOs: 51 and 19 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 52 and 20 respectively in the accompanying sequence listing.

An open reading frame designated ORF 03 was identified between nucleotides 11382-11987 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 27337-27936 of the insert from fosmid 60A5 (SEQ ID NO: 2) on the strands complementary to the insert strands provided in SEQ ID NOs: 1 and 2. These open reading frames have been assigned SEQ ID NOs: 53 and 21 respectively in the accompanying sequence listing, while the polypeptides they encode have been assigned SEQ ID NOs: 54 and 22 respectively in the accompanying sequence listing.

An open reading frame designated ORF 04 was identified between nucleotides 12916-13737 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 28822-29631 of the insert from fosmid 60A5 (SEQ ID NO: 2) on the strands complementary to the insert strands provided in SEQ ID NOs: 1 and 2. These open reading frames

An open reading frame designated ORF 06 was identified between nucleotides 27535-28002 of the insert from fosmid 101G10 (SEQ ID NO: 1). This open reading frame has been assigned SEQ ID NO: 77 in the accompanying sequence listing, while the polypeptide it encodes has been assigned SEQ ID NO: 78 in the accompanying sequence listing.

A gene coding for a tRNA was identified between nucleotides 12129-12251 of the insert from fosmid 101G10 (SEQ ID NO: 1) and between nucleotides 28058-28180 of the insert from fosmid 60A5 (SEQ ID NO: 2). This tRNA contains a 45 bp intron in the vicinity of the anticodon loop.

Table 1 shows the level of homology between the open reading frames in the inserts from fosmid 101G10 and fosmid 60A5 at the nucleic acid level. Table 1 also shows the level of homology at the amino acid level between the polypeptides encoded by the insert from fosmid 101G10 and fosmid 60A5. Nucleic acid homology was calculated using BLASTN with the default parameters. Amino acid homology was calculated using FASTA with the parameters. As shown in Table 1 and FIG. 1, the protein coding regions were highly similar in both nucleic acid and deduced amino acid sequences.

Over the 28 kb common region in the 101G10 and 60A5 inserts, the inserts shared >99.2% identity in their ribosomal RNA genes, approximately 87.8% overall DNA identity, an average of 91.6% similarity in ORF amino acid sequence, and complete colinearity of protein encoding regions. As shown in Table 1, in protein coding regions the DNA identity of the two contigs ranged from 80.9% (triose phosphate isomerase) to 91.5% (Hypothetical 03). Within intergenic regions the identity dropped to 70-86%, and small insertions or deletions were found frequently. The high similarity in coding regions and upstream sequences aided in the identification of genes, start codons, and putative transcriptional promoter motifs (see below). Genes appear as densely packed in C. symbiosum as they are in other sequenced archaeal genomes (Bult, C., et al. 1996. Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273, 1058-1073; Klenk, H. P. et al. 1997. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390, 364-370; Smith, D. R., et al. 1997. Complete genome sequence of Methanobacterium thermoautotrophicum delta H: functional analysis and comparative genomics. J. Bacteriol. 179, 7135-7155).

The ribosomal RNA operon of Cenarchaeum symbiosum is composed of the genes for the 16S and 23S rRNAs separated by a spacer of 131 bp. This organization is typical of crenarchaeotes, and differs from rRNA operons of euryarchaeotes, which usually contain 5S RNA and tRNA genes (Garrett, R. A. et al. 1991. Archaeal rRNA operons. TIBS 16, 22-26). The large subunit rRNA genes are located between nucleotides 2680-5674 of SEQ ID NO: 1 (fosmid 101G10) and between nucleotides 18645-21639 of SEQ ID NO: 2 (fosmid 60A5). The small subunit rRNA genes are located between nucleotides 5806-7278 of SEQ ID NO: 1 (on the opposite strand from that shown in the Sequence Listing, as indicated in FIG. 1) and between nucleotides 21771-23243 of SEQ ID NO: 2. The large and small subunit rRNA genes in the two fosmids were 99.2% and 99.3% identical, respectively.

As mentioned above, the sequences of the Cenarchaeum symbiosum derived inserts in fosmids 101G10 and 60A5 had a high degree of homology but were not completely identical. The sequence of the insert in fosmid 101G10 was designated variant A, while the sequence of the insert in fosmid 60A5 was designated variant B. Such sequence differences could arise if the fosmid inserts were derived from two closely related but distinct strains of Cenarchaeum symbiosum or, alternatively, the sequence differences could be due to cloning or sequencing artifacts. To confirm that the fosmid inserts were in fact derived from two closely related strains, portions of the inserts in a plurality of different fosmids were sequenced to determine whether they were identical to either of the inserts in fosmids 101G10 and 60A5, as would be the case if there were in fact two closely related strains of Cenarchaeum symbiosum.

In particular, the ribosomal RNA spacer regions of variant A and variant B contained 10 distinguishing signature nucleotides and the 16S rRNA genes of variant A and variant B contained two distinguishing nucleotides. Example 5 provides the results of a PCR based analysis of the 16S rRNA gene and the 16S-23S spacer region in 13 different fosmid inserts.

EXAMPLE 5

PCR Based Analysis of Fosmid Inserts to Determine Whether They Contain the Variant A or Variant B Sequences

Primers 21F and 459R-LSU (CTTTCCCTCACGGTA, SEQ ID NO: 116) were used to amplify the 16S-23S spacer region from the fosmids. The amplification products were sequenced using primer SP23rev (CTATTGCCGTCTTTACACC, SEQ ID NO: 117).

PCR reactions with two archaea-specific 16S rDNA primers (21F and 958R (DeLong, E. F. 1992. Archaea in coastal marine environments. Proc. Natl. Acad. Sci. 89, 5685-5689, the disclosure of which is incorporated herein by reference)), one of which was biotinylated, were used to amplify a 950 base pair (bp) fragment from the fosmids. The PCR products were purified and sequenced as described in Preston, C. M. et al. 1996. A psychrophilic crenarchaeon inhabits a marine sponge: Cenarchaeum symbiosum gen. nov., sp. nov. Proc. Natl. Acad. Sci. USA 93, 6241-6246 with primer 519R 16S rDNA.

The above methods may also be used to determine whether a biological sample contains variant A and/or variant B. In such procedures, nucleic acids are obtained from the biological sample, amplified using the above primers, and sequenced using the above oligonucleotide to determine whether the sample contains the variant A and/or the variant B sequence.

Similarly, the amplification reaction may be conducted using any primers which generate amplification products having sequences which differ between variant A and variant B. The amplification products may then be sequenced to determine whether they have the sequence of variant A and/or variant B. In some embodiments, the amplification reaction may be conducted under conditions in which the amplification primers specifically hybridize to one of the variants.

RFLP analyses were also used to assess whether the fosmids contained the sequence of variant A or variant B as described in Example 6 below.

EXAMPLE 6

RFLP Based Analysis of Fosmids to Determine Whether They Contain the Variant A or Variant B Sequences

Primer set 21F (DeLong, E. F. 1992. Archaea in coastal marine environments. Proc. Natl. Acad. Sci. 89, 5685-5689) and 459R-LSU for the amplification of 2.2 kbp of the ribosomal operon, primer set GSAT810F (GAATCCGCCCCCGACTATCTT, SEQ ID NO: 118) and 16S37REV (CATGGCTTAGTATCAATC, SEQ ID NO: 119) for the amplification of the 16S RNA-GSAT region (2.2 kbp) and primer set Cenpol357F (ACITACAACGGIGACGAYTTTGA, SEQ ID NO: 120) and Cenpol735R (CACCCCGAARTAGTTYTTYTT, SEQ ID NO: 121) for an internal DNA polymerase fragment (of 1134 bp) were used in PCR reactions with 5 ng of purified fosmids. The PCR products were cut with TaqI and HpaII (16S-23S RNA), HaeIII and RsaI (GSAT-16S RNA) or HaeIII and Ava (polymerase) and analyzed on 2% agarose gels.

The results are shown in Table 2. If the pattern did not exactly match but closely resembled the RFLP of either type A or B, it was assigned as a lower case letter (a or b, Table 2), meaning that at least 3 out of 4 or 3 out of 5 bands created by restriction digest appear identical in size to the ones from either type A or B. As shown in Table 2, RFLP patterns of the 1150 bp fragment covering the 5'-end of the GSAT gene and 16S gene and the internal fragment of 1134 bp from the DNA polymerase gene revealed that all fosmids analyzed could again be assigned to either the A or B type, although slight variations were also detected (lower case letters in Table 2), suggesting that both variants exhibit further microheterogeneity which is detectable in protein coding and intergenic regions.

The above methods may also be used to determine whether a biological sample contains variant A and/or variant B. In such procedures, nucleic acids are obtained from the biological sample, amplified using the above primers, and digested as described above to determine whether the sample contains the variant A and/or the variant B sequence. Similar analyses may also be performed using other portions of the sequences of SEQ ID NOs: 1 and 2 which are different from one another.

To further confirm the existence of two closely related
Strains of Cenarchaeum Symbiosum, biological Samples The results of this analysis are shown in Table 2. As were obtained from Several individual and analyzed shown in Table 2, in Samples obtained from Several unique 65 to determine whether the Samples contained variant A and/or rRNA operon-containing foSmids, a Sequence identical to variant B. Example 7 below provides the results of a PCR either variant A (101G10) or variant B (60A5) was present. analysis of the Cenarchaeum Symbiosum 16S rRNA genes in US 6,632,937 B1 45 46 Samples obtained from Several individual Sponges in differ coding regions but also extended into adjacent upstream ent locations and at different times. Sequences. Due to this upstream Similarity, and also because the average G+C content of the Sequences was relatively EXAMPLE 7 high, it was possible to readily identify prospective tran Scriptional (A+T rich) promoter elements. A motif corre Analysis of Samples from Individual Sponges sponding to the consensus of the archaeal TATA-box-like The 16S rRNA genes of variant A and variant B differ at element (C/TTT-A-T/A-A) (Hain, J. et al. 1992. Elements positions 175 and 183.7 (E. coli numbering). PCR reactions of an archaeal promoter defined by mutational analysis. with two archaea-specific 16S rDNAprimers (21F and 958R Nucl. Acids. Res. 20, 5423-5428) was identified upstream of (DeLong, E. F. 1992. Archaea in coastal marine environ nearly all genes (FIG. 2). The exceptions were the genes ments. Proc. Natl. Acad Sci. 89, 5685–5689, the disclosure encoding MenA and DNA polymerase which are located of which is incorporated herein by reference), one of which immediately downstream of other ORFs and may therefore was biotinylated, were used to amplify a 950 base pair (bp) be transcribed as polycistronic mRNAS. In vivo and in vitro fragment from total nucleic acids derived from Several Studies in other archaea have shown that initiation of tran different sponge individuals. 
The PCR products were puri Scription occurs consistently 24 to 28 bp downstream from fied and sequenced as described in Preston, C. M. et al. 1996. 15 the central T of this motif (Hain, J et al. 1992. Elements of A psychrophilic crenarchaeon inhabits a marine Sponge: an archaeal promoter defined by mutational analysis. Nucl. Cenarchaeum Symbiosum gen. nov, sp. nov. Proc. Natl. Acids. Res. 20, 5423–5428; Palmer, J. R. and Daniels, C. J. Acad. Sci. USA 93, 6241-6246 with primer 519R, the 1995. In vivo definition of an archaeal promoter. J. Bacte disclosure of which is incorporated herein by reference. riol. 177 1844-1849). For twelve of the protein encoding The amplification products were Sequenced to determine genes, the promoter element was found 25 to 30 bp upstream whether they corresponded to variant A and/or variant B. of the ORF (FIG. 2), Suggesting that transcriptional initia The results are shown in Table 3. As shown in Table 3, in 15 tion occurs in close proximity to, or directly at, the trans out of 16 cases U/C ambiguities were found at the Signature lational Start codon. positions, indicating the presence of both variants in Samples A similar observation has been made for 30 of the obtained from a single Sponge (Table 3). Only one sponge 25 predicted 100 strong and medium promoters from 156 kbp (S4) yielded an unambiguous Sequence identical to variant sequence of Sulfolobus Solfataricus (Sensen, C. W. et al. 1996. Organizational characteristics and information content A, but variant B was detected in this individual by another of an archaeal genome: 156 kb of Sequence from SulfolobuS criterion (see below). Solfataricus P2. Molec. Microb. 22, 175-191). Transcription Hybridization analyses were also used to determine initiation at, or in close proximity to, the translational Start whether individual Sponges harbored variant A and/or vari codons has been mapped for Some genes in Halobacterium ant B. 
The results of these analyses are provided in Example Salinarium (Brown, J. W. et al. 1989. Gene structure, 8 below. organization, and expression in archaebacteria. CRC Crit. Rev. Microb. 16,287–337) and S. Solfataricus (Klenk, H. P., EXAMPLE 8 et al. 1993. Nucleotide Sequence, transcription and phylog Hybridization Based Analysis of Samples Obtained 35 eny of the gene encoding the Superoxide dismutase of from Axinella Mexicana to Determine Whether the Sulfolobus acidocaldarius. Biochim. Biophys. Acta 1174 Samples Contain Variant A and/or Variant B 95–98), and alternative mechanisms for initial mRNA ribosome contact in Archaea have been hypothesized Two oligonucleotides Specific for each variant type were (Brown, J. W. et al. 1989. Gene structure, organization, and designed from the 23S r)NA gene sequences of fosmids 40 expression in archaebacteria. CRC Crit. Rev: Microb. 16, 101G10 and 60A5. The probes differed in 3 positions and 287-337). have the sequences ACACTTCAACTATTTCCTG (SEQ ID The promoters listed in FIG. 2, or fragments thereof, may NO: 122 variant A) and ACACTTTGACTATTTCGTG be used in expression vectors or expression Systems. In one (SEQ ID NO: 123, variant B). Nucleic acid samples from embodiment, the promoters listed in FIG.2 may be operably individual sponges (300 ng) and controls (fosmids 101G10 45 linked to coding regions and introduced into archaebacteria, and 60A5, 50 ng each) were denatured, bound to nylon and in particular Cenarchaeum Symbiosum, to express the membranes (Hybond-N, Amersham), hybridized with the encoded gene product in the archaebacterial cells. labeled probes (Massana, R. et al. 1997. Vertical distribution Alternatively, the promoters listed in FIG. 2 may be and phylogenetic characterization of marine planktonic operably linked to coding regions and introduced into host Archaea in the Santa Barbara Channel. Appl. Env, Microb. 
cells which are not normally capable of directing transcrip 63, 50-56, the disclosure of which is incorporated herein by tion from archaebacterial promoters. In addition, genes reference in its entirety) and washed at 41.5 C. Hybridiza encoding the proteins required for transcription from these tion was analyzed by autoradiography. promoters are also introduced into the host cells. The genes The results are provided in Table 3. In the samples from encoding these transcription factors may be on the same the majority of host Sponges examined, the presence of both vector as the promoter from Cenarchaeum Symbiosum or on 23S rRNA variants was observed, confirming that the spe 55 a different vector. In Some embodiments, the genes encoding cific association of C. Symbiosum with its host typically these transcription factors are linked to an inducible pro involves the presence of both variants. moter. Expression of the transcription factorS is induced The data provide Strong evidence that these genomic when it is desired to express the proteins which are operably clones are derived from two very closely related, but distinct linked to the promoter from Cenarchaeum Symbiosum. Strains, as opposed to representing two ribosomal RNA 60 Although this invention has been described in terms of operon regions originating from the same organism. This certain preferred embodiments, other embodiments which conclusion is consistent with the observation that all cre will be apparent to those of ordinary skill in the art in view narchaeota characterized to date contain only one ribosomal of the disclosure herein are also within the Scope of this RNA operon (Garrett, R. A. et al. 1991. Archaeal rRNA invention. Accordingly, the Scope of the invention is operons. TIBS 16, 22–26). 65 intended to be defined only by reference to the appended The high conservation between the inserts in fosmid claims. 
All documents cited herein are incorporated herein 101G10 and fosmid 60A5 was not entirely confined to by reference in their entirety. US 6,632,937 B1 47 48
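The promoter search described above, a TATA-box-like motif located a short distance upstream of an ORF, can be sketched in a few lines of Python. This is only an illustration and not the procedure used for FIG. 2: the regular expression encodes one reading of the quoted consensus as (C/T)TTA(T/A)A, the distance is measured from the first base of the motif to the first base of the ORF (an assumption), and the example upstream sequence is invented.

```python
import re

# One reading of the consensus quoted in the text: (C/T)-T-T-A-(T/A)-A.
# The lookahead lets overlapping candidates be reported as well.
TATA_LIKE = re.compile(r"(?=([CT]TTA[TA]A))")


def find_tata_like(upstream):
    """Scan the sequence lying immediately 5' of an ORF (written 5'->3').

    Returns (distance, motif) pairs, where distance is counted from the
    first base of the motif to the first base of the ORF. That reference
    point is an assumption made for this sketch.
    """
    seq = upstream.upper().replace("U", "T")
    return [(len(seq) - m.start(1), m.group(1)) for m in TATA_LIKE.finditer(seq)]


def within_reported_window(distance, low=25, high=30):
    """True if a motif lies in the 25 to 30 bp window mentioned in the text."""
    return low <= distance <= high


# Hypothetical upstream region, invented for illustration only.
example_upstream = "GCGCGC" + "CTTATA" + "G" * 22
hits = find_tata_like(example_upstream)
```

Each hit can then be passed to `within_reported_window` to flag candidates whose spacing resembles the twelve protein encoding genes discussed above.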

TABLE 1

Comparison of Overlapping Coding Sequences from Fosmid 101G10 and Fosmid 60A5

                                                            % Identity
Gene Name*                    Functional Category           Nucleotide   Amino Acid

Hypothetical 01               unknown                       81.4         76.6
23S                           translation                   99.16
16S                           translation                   99.3
GSAT                          heme biosynthesis             83.2         83.8
Hypothetical 02               unknown                       83.4         81.4
ORF 01                        unknown                       83.3         85.7
ORF 02                        unknown                       89.9         95.2
ORF 03                        unknown                       87.9         86.7
tRNA(Tyr)                     translation                   99.2
ORF 04                        unknown                       87.8         88.1
TIM                           glycolysis                    80.9         83.3
TBP                           transcription                 83.4         86.3
DNA polymerase                replication/repair            89.0         93.9
dCMP deaminase                pyrimidine synthesis          85.7         89.8
RNA helicase (ATP dependent)  translation                   86.1         92.2
PPI                           chaperone                     88.4         92.5
Hypothetical 03               unknown                       91.5         92.4
MenA                          menaquinone biosynthesis      86           89.4
ORF 05                        unknown                       87.5         90.6
Methylase                     restriction/modification      86.4         87.5

*Hypothetical: open reading frame (ORF) with similarity to proteins of unknown function from the databases. ORF = open reading frame identified by similarity between both fosmids, including upstream promoter sequence; GSAT = glutamate semialdehyde aminotransferase; TIM = triose-phosphate isomerase; TBP = TATA box binding protein; PPI = peptidylprolyl cis/trans isomerase.

TABLE 2

Analysis of Polymorphism at Four Distinct Loci in Different Fosmids

Columns: Fosmid; 16S RNA*; 16S-23S spacer†; 16S-GSAT‡ (HaeIII, RsaI); DNA Pol‡ (HaeIII, Ava)

[Table rows not recoverable from the source text.]

*: partial sequence (101G10 through 87F4) or RFLP analysis (C1H5 through C20B5).
†: partial sequence.
‡: RFLP analysis of PCR products; A/B: identical pattern to either 101G10 (=A) or 60A5 (=B); a, b: similar pattern to either A or B (see materials and methods). Fosmids C1H5, C4H1, C15A3 and C20B5 did not yield PCR products with polymerase-specific primers. The first seven fosmids were isolated from a first library, the last 8 fosmids (prefix C) are from a second library.
- = not determined.

TABLE 3

Detection of C. symbiosum Variants in Natural Populations of A. mexicana

Columns: A. mexicana Individual* or Isolated DNA Source; Variation in 16S rDNA Positions** (175, 183.7); Variations in 23S rRNA Hybridization (Variant Type A, Variant Type B)

Legible rows, with cells in source order:

fosmid 101G10 from s12    U   U   --
fosmid 60A5 from s12      C   C   --
s12                       Y   Y   --   --
s1                        --  --
s2                        --  --
s3                        Y   Y   --   --
s4                        U   U   --   w
s8                        Y   Y   --   --

[Remaining rows not recoverable from the source text.]

*s = Naples Reef; hs = Haskle; hh = Hermit Hole; Aq = captive sponge.
**Y = direct sequence of PCR product yields C and U at the same position.
- = not determined; w = weakly positive.
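As a reading aid for the 16S columns of Table 3, the legend (U for variant A, C for variant B, Y for both bases at one position) can be written as a small lookup. This is a minimal sketch, not part of the disclosed method: the function name, the treatment of T as the DNA-level equivalent of U, and the rule that the two signature positions must agree are all assumptions made for illustration.

```python
# Signature bases at 16S rRNA positions 175 and 183.7 (E. coli numbering),
# following the Table 3 legend: U = variant A, C = variant B,
# Y = both C and U seen at the same position.
BASE_TO_VARIANTS = {
    "U": {"A"},
    "T": {"A"},  # DNA-level read of U (assumption of this sketch)
    "C": {"B"},
    "Y": {"A", "B"},
}


def call_variants(pos_175, pos_183_7):
    """Return the set of variants supported by both signature positions.

    An empty set means undetermined or conflicting calls; requiring the two
    positions to agree is a choice made for this sketch.
    """
    first = BASE_TO_VARIANTS.get(pos_175.upper(), set())
    second = BASE_TO_VARIANTS.get(pos_183_7.upper(), set())
    return first & second
```

For example, the control rows correspond to `call_variants("U", "U")` for fosmid 101G10 and `call_variants("C", "C")` for fosmid 60A5, while a sponge sample reading Y at both positions is called as carrying both variants.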


-continued Gly Glu Ile Arg Lieu Ala Gly Thr Phe Asn Ala Ser Asp Asn Val Glin 1635 1640 1645 tog cc g to g g g c att gag titt to a ggc gac ggc acg g g g atg titt gtt 4992 Ser Pro Ser Gly Ile Glu Phe Ser Gly Asp Gly Thr Gly Met Phe Val 1650 1655 1660 acc ggg titt ggg gcc gcg ggc gtgaat gala titc. tcc ctd to c gcc ccc 5040 Thr Gly Phe Gly Ala Ala Gly Val Asin Glu Phe Ser Leu Ser Ala Pro 1665 1670 1675 1680 titt gat aca acc citc ccg gtg cat gtg gala ttg cac gat at a ggc ggc 5088 Phe Asp Thr Thr Leu Pro Val His Val Glu Leu. His Asp Ile Gly Gly 1685 1690 1695 cag cog gCa gtt gat citg gcg titt gca gaa gat ggc agg acc ctic citg 51.36 Glin Pro Ala Val Asp Leu Ala Phe Ala Glu Asp Gly Arg Thr Lieu Lieu 17 OO 1705 1710 ttg citg gCC gog gat gga aca citg gat titc tac agc citt goc ggit gat 51.84 Leu Lieu Ala Ala Asp Gly. Thir Lieu. Asp Phe Tyr Ser Lieu Ala Gly Asp 1715 1720 1725 gcc tat gat at a ggg gaa goa toc cqt act titt caa gtg ccg titt gag 5232 Ala Tyr Asp Ile Gly Glu Ala Ser Arg Thr Phe Glin Val Pro Phe Glu 1730 1735 1740 gat goc gog ggit gct gtg ccc ggc gcc titt tac cag cct cog gat ggc 528 O Asp Ala Ala Gly Ala Val Pro Gly Ala Phe Tyr Glin Pro Pro Asp Gly 1745 175 O 755 1760 tog tot att att gcc gca titt gac ggc agg att gac cag tat gtg gtg 5328 Ser Ser Ile Ile Ala Ala Phe Asp Gly Arg Ile Asp Glin Tyr Val Val 1765 1770 1775 atc ccc titc gag titc gtg to a tat coa citg aca agg ccc ggc acg ccc 5376 Ile Pro Phe Glu Phe Val Ser Tyr Pro Leu Thr Arg Pro Gly. Thr Pro 1780 1785 1790 aca ggg att gac titt gcg cca gac ggg cqc togg atg titc ct g to c acc 5 424 Thr Gly Ile Asp Phe Ala Pro Asp Gly Arg Trp Met Phe Leu Ser Thr 1795 1800 1805 gag aac ggg at a gac cag tac citg citg to g atc ccc titt gac gitg cqc 54.72 Glu Asn Gly Ile Asp Glin Tyr Lieu Lleu Ser Ile Pro Phe Asp Val Arg 1810 1815 1820 agc citg acg tat acg gga acc att coa gta gac ggg gtg gag gga atg 552O Ser Leu Thr Tyr Thr Gly. 
Thir Ile Pro Val Asp Gly Val Glu Gly Met 1825 1830 1835 1840 cag titt gcg gac aac ggc agg gca citg ttt ttg gcg gac agt gala ggc 5568 Glin Phe Ala Asp Asn Gly Arg Ala Lieu Phe Leu Ala Asp Ser Glu Gly 1845 1850 1855 ttg att tac aat tat gac citg gag gac cog tat gct citg gat ggc aac 5 616 Lieu. Ile Tyr Asn Tyr Asp Leu Glu Asp Pro Tyr Ala Lieu. Asp Gly Asn 1860 1865 1870 aca att to c gtg gaa titc. tcg titt gac ggt agc gtg atg tat gtg citg 5 664 Thir Ile Ser Val Glu Phe Ser Phe Asp Gly Ser Val Met Tyr Val Leu 1875 1880 1885 gag tac gac aca aaa agg gtg gtc. tcg tac gag titg gag titt coc titt 5712 Glu Tyr Asp Thr Lys Arg Val Val Ser Tyr Glu Leu Glu Phe Pro Phe 1890 1895 1900 gac gta to g agc aga aca cqt gca gac acg citg gac ata coa caa att 576 O. Asp Wal Ser Ser Arg Thr Arg Ala Asp Thr Lieu. Asp Ile Pro Glin Ile 1905 1910 1915 1920 gac to a coa aga cac gtt gca gtc. tcg at g ccc ggc aac cac citg tac Asp Ser Pro Arg His Val Ala Val Ser Met Pro Gly Asn His Leu Tyr 1925 1930 1935 ata aca aac to g g to titt gag gala gat gac acc ata cac toc tat gga 585 6 Ile Thr Asn Ser Val Phe Gly Glu Asp Asp Thr Ile His Ser Tyr Gly 1940 1945 1950 US 6,632,937 B1 127 128

-continued alta tot aac aat gac ata tog tog gCa to a tac atc ggc gag gala ggc 5904 Ile Ser Asn. Asn Asp Ile Ser Ser Ala Ser Tyr Ile Gly Glu Glu Gly 1955 1960 1965 atc cc.g gaa cc c gtg ata aac ggg att gac ttt toc aac aac ggc cqC 5952 Ile Pro Glu Pro Wall Ile Asn Gly Ile Asp Phe Ser Asn Asn Gly Arg 1970 1975 1980 cgc atg titt citg att gog ggC a.a. C. ggg titc gac tac cag gtg at a cat 6 OOO Arg Met Phe Lieu. Ile Gly Gly Asn Gly Phe Asp Tyr Glin Wall Ile His 1985 1990 1995 2OOO gac tac atg cita ggc aca aga tac gac at a toc agc agg agc citg citt 6048 Asp Tyr Met Leu Gly. Thr Arg Asp Ile Ser Ser Arg Ser Telu Telu 2005 2010 2015 gat aCa tat gcc att coa ggg cc.g gtt gtt titt coc gCg ggc citt gat 609 6 Asp Thr Tyr Ala Ile Pro Gly Pro Wal Wall Phe Pro Ala Gly Lieu. Asp 2020 2025 2030 titc. tog titt gac agg Ctg to c atg titt ata ata agc acc gcc ggit tog 614 4 Phe Ser Phe Asp Arg Lieu Ser Met Phe Ile Ile Ser Thr Ala Gly Ser 20 40 2O45 gta tac agg tac ggc ctd gac gat ccg titc ata gtt gaa aca at g gac 61.92 Wall Tyr Arg Tyr Gly Lieu Asp Asp Pro Phe Ile Wall Glu Thr Met Asp 2O5 O 2O55 2060 tat cag gag tot titc cqg citg cc c gta coa to a gcg gct gat aat to a 624 O Tyr Glin Glu Ser Phe Arg Teu Pro Wall Pro Ser Ala Ala Asp Asn. Ser 2070 2O75 2080 alta tog gat citg gca titc ggC agc agc ggc ctd aat gcc gta at a tog 6288 Ile Ser Asp Leu Ala Phe Gly Ser Ser Gly Lieu. Asn Ala Wall Ile Ser 2O85 209 O 2095 cac gag ggg citc gac acc citg tac agc titt gta citg gac atc ccg tat 6336 His Glu Gly Leu Asp Thr Leu Ser Phe Wall Leu Asp Ile Pro Tyr 2100 2105 2110 ggg gcc gaa ttg gat att gac agg citt gag citt cog citg gtg ggg gtt 6,384 Gly Ala Glu Lieu. Asp Ile Asp Arg Leu Glu Lieu Pro Teu Val Gly Val 2115 2120 2125 cc.g acg gga titc gag titc tog gac aac ggg cqC cag tac att ggc 64.32 Pro Thr Gly Phe Glu Phe Ser Asp Asn Gly Arg Glin Teu 21.30 2135 214 O gCg titt cgt. gac tot caa to c tog cca ggc acc ctd cct gCg ggC Ctg 64.80 Ala Phe Arg Asp Ser Glin Ser Ser Pro Gly. 
Thir Leu Pro Ala Gly Lieu 2145 2150 215.5 216 O cag cgc tat gag Ctt ggC alta cca tat gac ctd gct tog gct gta titt 6528 Glin Arg Tyr Glu Lieu Gly Ile Pro Tyr Asp Leu Ala Ser Ala Wall Phe 21 65 217 O 21.75 gCg cag to c ct g g ga ata titc. gat titt cost coc titc. aac ggC at g cqg 6576 Ala Glin Ser Leu Gly Ile Phe Asp Phe Pro Pro Phe Asn Gly Met Arg 218O 21.85 21.90 gcc aat ggC agc titg gca gga tta cat gtg ccg ccc gat gga agc atc 6624 Ala Asn Gly Ser Lieu Ala Gly Telu His Val Pro Pro Asp Gly Ser Ile 21.95 22 OO 2205 citg titc. agg gcc gga aat gcc gaa aga acc gta atc agc tat gac atg 6672 Teu Phe Arg Ala Gly Asn Ala Glu Arg Thr Val Ile Ser 2210 2215 2220 gac agc cat gat ttg gat a Ca tta to a titc agg gala toa titc. aaa cca. Asp Ser His Asp Leu Asp Thr Telu Ser Phe Arg Glu Ser Phe Llys Pro 2225 22.30 2235 2240 gat gtc gga cag tog aca ccc. a.a. C. ata agg gac atg gac ata toc cog 6768 Asp Wall Gly Glin Ser Thr Pro Asn Ile Arg Asp Met Asp Ile Ser Pro 22 45 225 O 2255 gac ggC atg titc. citc tac citg citt caa ggc gat gtt citg gac atg tac 6816 Asp Gly Met Phe Leu Tyr Teu Telu Glin Gly Asp Wal Teu Asp Met Tyr 2260 2265 2270


Arg Tyr Thr Met Asn Pro Pro His Asp Ile Ala Ser Ala Ala Leu 35 40 45

Gly Ala Gln Ser Phe Ser Leu Pro Gly Gly Ile Ser Pro Ala Pro Gly 50 55 60

Ala Pro Thr Gly Leu Asp Ile Ser Asp Asp Gly Arg His Leu Tyr Val 65 70 75

Pro Asp Glu Asn Gly Val Val Arg Phe Asp Leu Glu Ser Pro Tyr 85 90 95

Arg Leu Asp Gly Gly Thr Phe Gly Ser Ser Val Val Gly Ser Asp 100 105 110

Val Ala Ala Pro Arg Gly Val Tyr Val Ala Pro Gly Gly Ser Leu Met 115 120 125

Leu Val Ser Asp Ser Ala Asp Gly Thr Ile His Arg Glu Leu Ala 130 135 140

Ser Pro Glu Pro Ala Gly Ala Ala Asn Arg Gly Ser Phe Val 145 150 155 160

Ser Asp Met Asp Gly Ser Pro Val Gly Ala Gly Phe Ala Gly Leu 165 170 175

His Met Val Ala Gly Asn Asp Thr Gly Arg Val Gln Tyr Pro 180 185 190

Ala Gly Thr His Gln Ile Gln Glu Ala Ala Ala Gly Pro Arg Leu Leu 195 200 205

Ser Ala Val Leu Asp Asp Gly Thr Leu Arg Ala Ala Phe Gly 210 215 220

Thr Val Asp Ala Gly Ser Val Gln Pro Gly Met Ile Thr Ile Arg Asp 225 230 235 240

Gly His Gly Ser Asn Thr Gly Ile Pro Leu Leu Leu Ala Gly Ala 245 250 255

Ala Asp Ser Asp Val Met Thr Phe Val Val Pro Glu Asp Arg Ala 260 265 270

Glu Ala Ala Ala Gly Asp Gln Ser Leu His Val Pro Ala Ala Ala 275 280 285

Leu Ala Gly Thr Gly Gly Gly Pro Phe Val Pro Asp Phe Ser Gly Gly 290 295

Ser Leu Leu Ala Ser Leu Arg His Glu Arg Pro Phe Gln Gly Glu 305 310 315 320

Glu Met Ala Arg Thr Glu Arg Ser Asp Arg Ala Leu Thr Val Thr 325 330 335

Ala Gly Gly Ser Gln Met His Val Gly Gly Ala Gly Gly Asn Ile Thr 340 345 350

Trp Asp Leu Gly Thr Pro His Asp Ile Thr Thr Gly Val Arg Ala 355 360 365

Gly Ser Asp Ile Leu Pro Ala Pro Ser Ala Gly Arg Asn Val Val 370 375

Pro Ser Ile Thr Gly Ile Ala Phe Ser Asp Asp Gly Met Arg Leu Phe 385 390 395 400

Ala Ala Asn Arg Gly Asp Arg Ile Pro Met Gln Leu Asp Ser Pro 405 410 415

Asp Ile Gly Ser Ala Ser Leu Glu Gly Thr Leu Phe Thr Gly Phe 420 425 430

Gln Ser Gly Ile Ala Phe Ser Asp Asp Gly Thr Arg Met Phe Ala Ala 435 440 445


865 870 875

Val Asp Val Gly Gly Ile Asp Pro Gly Gly Val Arg Ile Val Asp Ala 885 890 895

Ala Gly Pro Leu Pro Gly Val Val Ile Ser Asp Ala Val Ile Pro Gly 900 905 910

Glu Asp Pro Gly Val Ala Arg Phe Ser Leu Ser Asp Ala Glu Val Leu 915 920 925

Ala Val Ser Gly Tyr Ala Glu Pro Ser Leu Val Phe Gly Arg His Ala 930 935 940

Val Pro Gly Ala Ala Gly Gly Thr Phe Pro Ser Gln Ile Gly Asn Ala 945 950 955 960

Thr Glu Leu Val Gly Ser Ile Pro Asn Pro Thr Leu Asp Phe Gly Thr 965 970 975

Thr Leu Thr Gly Ala Ala Phe Ser Ala Asp Gly Thr Val Val Phe Leu 980 985 990

Ser Asp Gly Pro Thr Gly Arg Val Tyr Pro Ser Leu Asn Ile Pro 995 1000 1005

Phe Asp Ile Ser Ser Ala Ala Pro Gly Gly Phe Val Ile Val Pro Val 1010 1015 1020

Gly Val Ser Asp Ile Ala Phe Ser Ala Asp Gly Arg Asn Met Leu Val 1025 1030 1035 1040 Ala Asp Glu Thr Gly Gly Ile His Arg Tyr Leu Ala Arg Ser Pro Tyr 1045 105 O 1055 Glu Ile Gly Thr Asp Phe Ile Lys Ser Ser Leu Gly Glu Phe Val Glu 1060 1065 1070 Thr Phe Ser Ala Ala Pro Arg Val Glin Asp Leu Ala Gly Ile Ala Phe 1075 1080 1085 Ser His Asp Gly Met Ile Met Leu Ala Ala Gly Gly Ser Gly Ser Val 1090 1095 1100 His Arg Tyr Ser Leu Pro Ser Pro Tyr Ala Val Ser Gly Ala Lys Tyr 1105 1110 1115 1120 Glu Glu Thr Ala Met Ile Gly Gly Ser Pro Ser Gly Leu Glu Phe Ser 1125 1130 1135 Ser Asp Gly Lieu Arg Met Phe Val Pro Asp Ala Gly Ser Glu Thr Ala 1140 1145 1150 Ala Val Tyr Gly Lieu Ala Ala Pro Tyr Gly Ile Gly Glu Ala Glu Pro 1155 1160 1165 Leu Pro Pro Leu Phe Leu Gly Val Gly Ala Glu Glu Ala Thr Leu Ser 1170 1175 118O Pro Asp Gly Arg His Ile Leu Val Pro Gly Arg Pro Gly Lieu Ser Glin 1185 11.90 11.95 1200 Tyr Ser Lieu Phe Ser Thr Asn Lieu Glu Lieu. Cys Ala Glu Pro Arg Gly 1205 1210 1215 Ile Asp Gly Gly Ser Cys Glu Asp Gly Ile Tyr Ala Phe Glu Ser Pro 1220 1225 1230 Gly Arg Gly Glu Gly Val Ser Lieu Ala Ala Ser Ile Thr Ala Ala Asp 1235 1240 1245 Gly Pro Gly Ile Gly Glu Lieu. His Gly Phe Ala Gly Pro Pro Met Pro 1250 1255 1260 Ala Pro Val Met Glu Glin Val Thr Leu Asp Ser Arg Glu Gly Thr Leu 1265 1270 1275 1280 Arg Val Arg Lieu. Asp Arg Thr Val Asp Val Asp Thr Val Arg Pro Tyr 1285 1290 1295 US 6,632,937 B1 143 144


Lys Met Trp Val Glu Asp Ser Asp Gly Ser Glin Thr Thr Lieu Ala Asn 1300 1305 1310 Ser Thr Lieu Lieu. Asn Ala Glu Asn. Ser Asn. Ile Leu Lleu Phe Arg Lieu 1315 1320 1325 Asp Asp Ala Ala Ala Gly Lys Ile Ser Gly Tyr Thr Ser Pro Val Phe 1330 1335 1340 Arg Thr Trp Ser Ser Pro Phe Leu Gly Thr Asp Gly Ala Thr Arg Pro 1345 1350 1355 1360 His Thr Lieu Gly Phe Gly Asp Val Arg Lieu Ala Asp Ile Tyr Asp Ala 1365 1370 1375 Ser Gly Asp Val Pro Ser Pro Ser Gly Ile Glu Phe Ser Asp Asp Gly 1380 1385 1390 Met Arg Met Phe Val Thr Gly Ile Gly Thr Pro Gly Ile Asin Ile Phe 1395 14 OO 1405 Thr Leu Ser Ala Pro Phe Asp Ile Thr Leu Pro Lys His Ser Gly Ser 1410 1415 1420 Thr Asn. Ile Gly Gly Lieu Ser Val Ser Asp Leu Ala Phe Ala Asn. Asn 1425 1430 1435 1440 Gly Asn. Ser Lieu. Thr Val Lieu. Asp Wall Asp Gly Val Lieu Arg Val Tyr 1445 1450 1455 Ala Leu Gly Asp Asp Tyr Asn Val Val Thr Gly. Thir Thr Gln Lys Phe 1460 1465 1470 Arg Ile Thr Leu Asp Thr Thr Glin Gly Ile Pro Asn Ser Ile Tyr Thr 1475 1480 1485 Ser Pro Asp Gly Leu Ser Glin Phe Val Ala Tyr Asp Asp Arg Ile Asp 1490 1495 15 OO Leu Tyr Val Leu Gly Ser Pro Asn Asp Ile Ser Ser Thr Thr Glu Ile 1505 1510 1515 1520 Ile Pro Tyr Ser Leu Pro Arg Pro Asp Pro Pro Thr Gly Met Asp Phe 1525 1530 1535 Thr Pro Asp Gly Arg Arg Met Phe Leu Ser Thr Glu Asn Gly Ile Asp 1540 1545 1550 Gln Tyr Leu Leu Ser Glu Pro Phe Ala Val Thir Thr Ser Val Phe Leu 1555 15 60 1565 Arg Thr Ile Pro Ile Asp Gly Gly Ala Glu Gly Ile Arg Phe Val Asp 1570 1575 1580 Asn Gly Arg Gly Lieu Phe Val Pro Gly Ala Asp Gly Ile Ile Glin Arg 1585 159 O 1595 1600 His Glu Leu Ile Tyr Pro Tyr Gly Ala Ser Thr Ser Leu Leu Glu Thr 1605 1610 1615 Val Arg Asp Gly Val Thr Asp Gly Gly Pro Gly Glu Asn Pro Ala Ala 1620 1625 1630 Gly Glu Ile Arg Lieu Ala Gly Thr Phe Asn Ala Ser Asp Asn Val Glin 1635 1640 1645 Ser Pro Ser Gly Ile Glu Phe Ser Gly Asp Gly Thr Gly Met Phe Val 1650 1655 1660 Thr Gly Phe Gly Ala Ala Gly Val Asin Glu Phe Ser Leu Ser Ala Pro 1665 1670 1675 1680 Phe Asp Thr Thr Leu Pro Val His Val Glu 
Leu. His Asp Ile Gly Gly 1685 1690 1695 Glin Pro Ala Val Asp Leu Ala Phe Ala Glu Asp Gly Arg Thr Lieu Lieu 17 OO 1705 1710 US 6,632,937 B1 145 146

-continued Leu Lieu Ala Ala Asp Gly. Thir Lieu. Asp Phe Tyr Ser Lieu Ala Gly Asp 1715 1720 1725 Ala Tyr Asp Ile Gly Glu Ala Ser Arg Thr Phe Glin Val Pro Phe Glu 1730 1735 1740 Asp Ala Ala Gly Ala Val Pro Gly Ala Phe Tyr Glin Pro Pro Asp Gly 1745 175 O 755 1760 Ser Ser Ile Ile Ala Ala Phe Asp Gly Arg Ile Asp Glin Tyr Val Val 1765 1770 1775 Ile Pro Phe Glu Phe Val Ser Tyr Pro Leu Thr Arg Pro Gly. Thr Pro 1780 1785 1790 Thr Gly Ile Asp Phe Ala Pro Asp Gly Arg Trp Met Phe Leu Ser Thr 1795 1800 1805 Glu Asn Gly Ile Asp Glin Tyr Lieu Lleu Ser Ile Pro Phe Asp Val Arg 1810 1815 1820 Ser Leu Thr Tyr Thr Gly. Thir Ile Pro Val Asp Gly Val Glu Gly Met 1825 1830 835 1840 Glin Phe Ala Asp Asn Gly Arg Ala Lieu Phe Leu Ala Asp Ser Glu Gly 1845 1850 1855 Lieu. Ile Tyr Asn Tyr Asp Leu Glu Asp Pro Tyr Ala Lieu. Asp Gly Asn 1860 1865 1870 Thir Ile Ser Val Glu Phe Ser Phe Asp Gly Ser Val Met Tyr Val Leu 1875 1880 1885 Glu Tyr Asp Thr Lys Arg Val Val Ser Tyr Glu Leu Glu Phe Pro Phe 1890 1895 1900 Asp Wal Ser Ser Arg Thr Arg Ala Asp Thr Lieu. Asp Ile Pro Glin Ile 1905 1910 1915 1920 Asp Ser Pro Arg His Val Ala Val Ser Met Pro Gly Asn His Leu Tyr 1925 1930 1935 Ile Thr Asn Ser Val Phe Gly Glu Asp Asp Thr Ile His Ser Tyr Gly 1940 1945 1950 Ile Ser Asn. Asn Asp Ile Ser Ser Ala Ser Tyr Ile Gly Glu Glu Gly 1955 1960 1965 Ile Pro Glu Pro Wal Ile Asn Gly Ile Asp Phe Ser Asn. Asn Gly Arg 1970 1975 1980 Arg Met Phe Leu Ile Gly Gly Asn Gly Phe Asp Tyr Glin Val Ile His 1985 1990 1995 2OOO Asp Tyr Met Leu Gly. Thir Arg Tyr Asp Ile Ser Ser Arg Ser Lieu Lieu 2005 2010 2015 Asp Thr Tyr Ala Ile Pro Gly Pro Val Val Phe Pro Ala Gly Leu Asp 2020 2025 2030 Phe Ser Phe Asp Arg Leu Ser Met Phe Ile Ile Ser Thr Ala Gly Ser 2O35 20 40 2O45 Val Tyr Arg Tyr Gly Leu Asp Asp Pro Phe Ile Val Glu Thr Met Asp 2O5 O 2O55 2060 Tyr Glin Glu Ser Phe Arg Leu Pro Val Pro Ser Ala Ala Asp Asin Ser 2O65 2070 2O75 2080 Ile Ser Asp Leu Ala Phe Gly Ser Ser Gly Lieu. 
Asn Ala Wal Ile Ser 2O85 209 O 2095 His Glu Gly Leu Asp Thr Leu Tyr Ser Phe Val Leu Asp Ile Pro Tyr 2100 2105 2110 Gly Ala Glu Lieu. Asp Ile Asp Arg Lieu Glu Lieu Pro Leu Val Gly Val 2115 2120 2125 Pro Thr Gly Phe Glu Phe Ser Asp Asin Gly Arg Gln Leu Tyr Ile Gly US 6,632,937 B1 147 148


21.30 2135 214 O Ala Phe Arg Asp Ser Glin Ser Ser Pro Gly Thr Lieu Pro Ala Gly Lieu 2145 2150 215.5 216 O Glin Arg Tyr Glu Lieu Gly Ile Pro Tyr Asp Leu Ala Ser Ala Val Phe 21 65 217 O 21.75 Ala Glin Ser Leu Gly Ile Phe Asp Phe Pro Pro Phe Asn Gly Met Arg 218O 21.85 21.90 Ala Asn Gly Ser Leu Ala Gly Lieu. His Val Pro Pro Asp Gly Ser Ile 21.95 22 OO 2205 Leu Phe Arg Ala Gly Asn Ala Glu Arg Thr Val Ile Ser Tyr Asp Met 2210 2215 2220 Asp Ser His Asp Leu Asp Thr Lieu Ser Phe Arg Glu Ser Phe Lys Pro 2225 22.30 2235 2240 Asp Val Gly Glin Ser Thr Pro Asn. Ile Arg Asp Met Asp Ile Ser Pro 22 45 225 O 2255 Asp Gly Met Phe Leu Tyr Lieu Lieu Glin Gly Asp Wall Leu Asp Met Tyr 2260 2265 2270 Asn Lieu. Thir Asp Ser Tyr Ser Lieu. Asp Ala Pro Ala Tyr Ala Gly Thr 2275 228O 2285 Leu Asp Leu Glu Pro Glu Asp Val Ile Pro Arg Gly Ile Ser Phe Ser 2290 2295 2300 Arg Asp Gly Thr Ser Lieu Phe Met Thr Gly Glu Asp Wall Asp His Ile 2305 2310 2315 2320 His Glu Tyr Ala Lieu. Asn. Glu Pro Trp Asp Ile Arg Asn Ala Ile Leu 2325 2330 2335 Ala Gly Ser Lieu Ser Ile Ser Ala Val Asin Gly Ala Pro Arg Gly Lieu 234. O 2345 2350 Asp Ile Ser Glu Asp Gly Thr Thr Ala His Thr Met Arg Gly Arg Asp 2355 2360 2365 Phe Asp Thr Gly Pro Ala Ser Leu Val Asn His Ile Leu Pro Gly Glin 2370 2375 2380 Tyr Ser Leu Leu Thr Asp Ala Pro Ala Phe Ala Tyr Pro Val Glu Glu 2385 2390 2395 2400 Glu Gly Ala Pro Gly Asp Leu Ala Phe Ser Asp Asp Gly Met Arg Met 2405 2410 24.15 Phe Val Ala Gly Val Asn. 
Asn His Leu Arg Glin Tyr Asn Lieu Lleu Ser 2420 24.25 24.30 Pro Tyr Asp Thr Glu Asn Ala Glu His Phe Ile Ser Thr Asp Leu Lieu 2435 24 40 2445 Thr Ala Asp Arg Gly Pro Thr Gly Lieu Val Phe Ser Asp Glu Asn Asp 2450 2455 2460 Phe Phe Ser Thr Gly Ala Arg Ala Glin Phe Val Arg Glin Phe Thr Thr 2465 2470 24.75 24.80 Asn Arg Pro Tyr Asp Ala Ser Thr Ile Thr Lieu Ser Asp Asn Gly Lieu 2485 24.90 2495 Tyr Lys Val Ser Val Asp Gly Leu Pro Ser Gly Ile Arg Phe Thr Pro 25 OO 25 O5 25 10 Asp Gly Met Lys Met Phe Ile Ser Gly Glin Glu Thr Ala Met Ile Tyr 2515 252O 2525 Gln Tyr Ser Leu Pro Ser Pro Tyr Asp Thr Ser Gly Ala Val Arg Asp 25.30 2535 2540 Arg Val Glu Ile Val Ala Gly Lieu Phe Arg Asn Ala Gly Lieu Ser Val 25.45 255 O 2555 2560 US 6,632,937 B1 149 150


Gly Leu Asn Glu Pro Ser Pro Ser Gly Phe Asp Phe Ser Glu Asp Gly 2565 2570 2575

Met Glu Leu Tyr Val Thr Gly Ser Gly Leu Val His Arg Tyr Phe Leu 2580 2585 2590

Pro Ser Pro Tyr Gly Leu Glu Asp Ala Ala Tyr Gly Gly Ser Phe His 2595 2600 2605

Thr Phe Arg Glu Ser Thr Pro Leu Gly Val Val Val Arg Gly Asp Ala 2610 2615 2620

Met Phe Val Ala Gly Asp Ser Thr Asp Ser Ile Leu Lys Tyr Ser Leu 2625 2630 2635 2640

Asn Ala Gln Pro Val Gly Asn Ile Thr His Ala Asp Thr Arg Ala Gly 2645 2650 2655

Ile Ala Asp Arg Ala Glu Ile Val Phe Gly Ala Met Ala Asp Thr Arg 2660 2665 2670

Ala Glu Ile Leu Asp Gly Ala Asp Val Val His Lys Ser Val Lys Ile 2675 2680 2685

Asp Val Phe Pro Ile Ser Glu Gly Ile Thr Val Gly Arg Ala Leu Tyr 2690 2695 2700

Pro Glu Asp Ala Ala Ile Leu Asp Asp Gly Ala Asn Ala Thr His Asn 2705 2710 2715 2720

Arg Val Val Ile Ile Val His Asp Ile Thr Glu Gly Asp Ala Pro Ser 2725 2730 2735

Ile His Asp Glu Pro Ile Ala Val Gly Ile Tyr Ala Leu Gly Pro Met 2740 2745 2750

Asp Thr Ile Ala Val Val Asp Leu His Arg Leu Ala Val Ser Ala Ser 2755 2760 2765

Leu Ser Gly Gly Asp Ser Pro Ser Ala Ser Asp Ala Ser Gly Val Val 2770 2775 2780

Ala Glu Ser Arg Arg Asn Ala Val Asp Arg Pro Gly Val Glu Glu Arg 2785 2790 2795 2800

Ile Gly His Gly Val Ser Leu Glu Ala Ala Asp Arg Pro Ala Val Asp 2805 2810 2815

Asn Met Met Asp Thr Asp Ser Ala Gly Val Tyr Asp Arg Ser Pro Asp 2820 2825 2830

Asp Gly Pro Ala Val Ser Asp Arg Ser Ala Leu Gly Leu Ala Arg Met 2835 2840 2845

Ala Ala Asp Arg Pro Ala Val Asp Asp Met Met Asp Thr Asp Ser Ala 2850 2855 2860

Gly Val Tyr Asp Arg Ser Pro Asp Asp Gly Pro Ala Ile Ser Asp Arg 2865 2870 2875 2880

Ser Ala Leu Gly Leu Ala Arg Met Ala Ala Asp Arg Pro Ala Val Asp 2885 2890 2895

Asp Met Met Asp Thr Gly Ser Ala Gly Val Tyr Asp Arg Ser Pro Asp 2900 2905 2910

Asp Gly Pro Ala Ile Ser Asp Arg Ser Ala Leu Gly Leu Ala Arg Met 2915 2920 2925

Ala Ala Asp Arg Pro Ala Val Asp Asp Met Met Asp Thr Gly Ser Glu 2930 2935 2940

Ser Thr Ser Arg Leu Gly Pro Val Asp Arg Pro Glu Ile Val Glu Arg 2945 2950 2955 2960

His Ser Leu Ala Ala Ser Val Tyr Leu Ser Gly Gly Asp Ser Pro Ser 2965 2970 2975

Val Ala Asp Gly His Asp Val Glu Ser Glu Gly Arg Arg Asp Gly Gly 2980 2985 2990

Asp Arg Pro Gly Ile Asp Glu Arg Ile Val Ile Lys Ile Ser Tyr Ser 2995 3000 3005

Arg Gly Ala Ala Asp Ala Pro Arg Val Glu Asp Ala Met Glu Thr Ser 3010 3015 3020

Gly Val Thr Ala Tyr Ser Arg Gly Ala Ala Asp Ala Pro Arg Val Glu 3025 3030 3035 3040

Asp Ala Met Glu Thr Ser Gly Val Thr Val Pro Arg Arg Ser Thr Met 3045 3050 3055

Asp Ala Pro Thr Val Ala Asp Asp His Ser Leu Ala Arg Thr Ala Ser 3060 3065 3070

Ile Ser Glu Gly Asp Ser Pro Thr Phe Ala Glu Ala Arg Arg Ala Asp 3075 3080 3085

Thr Val Gly Asp Ile Asp Glu Val Asp Ala Pro Thr Val Ala Asp Asp 3090 3095 3100

His Ser Leu Ala Arg Ala Ala Ser Ile Ser Glu Gly Asp Ser Pro Thr 3105 3110 3115 3120

Phe Ala Glu Val Arg Arg Ala Asp Thr Val Gly Asp Ile Asp Glu Val 3125 3130 3135

Asp Ala Pro Ala Val Ala Glu Arg Leu Leu Ala Val Leu Gly Leu Gln 3140 3145 3150

Ala Pro Asp Ser Pro Gly Val Trp Asp Thr Val Gly Ile Asp His Ser 3155 3160 3165

Glu Ile Ser Gly Asp Pro Val Pro Glu Pro Arg Val Val Pro Arg Gly 3170 3175 3180

Gly Gly Gly Gly Gly Gly Gly Ser Ser Asn Arg Gly Leu Glu Pro His 3185 3190 3195 3200

Gly Gly Gly Tyr Glu Ile Asp Phe Glu Phe Arg Ile Asp Gly Arg Leu 3205 3210 3215

Val Leu Phe Asn Gly Thr Asp Val Leu Ala Glu Ser Gly Lys Asp Leu 3220 3225 3230

Leu Ile Arg Pro Val Phe Arg Pro Glu Gly Ser Phe Asn Ile Phe Asp 3235 3240 3245

Met Glu Val Leu Phe Thr Ala Pro Gly Gly Glu Ile Ser Thr Ala Tyr 3250 3255 3260

Tyr Asn Arg Ala Gly Ile Leu Met Gly Ile Asp Cys Gly Glu Leu Ile 3265 3270 3275 3280

Met Thr Asp Thr Thr Tyr Ser Cys Asp Met Leu Asp Ile Phe Gly Asp 3285 3290 3295

Glu Ile Tyr His Val Glu Arg Leu Asp Ala Phe Asn Gly Met Val Ile 3300 3305 3310

Ser Leu Asp Gly Pro Leu Asp Gly Thr Val Ser Val Ser Leu Arg Asp 3315 3320 3325

Asn His Gly Ile Pro Leu Ala Gln His Arg Leu His Lys Tyr Glu Ile 3330 3335 3340

Leu Ile Leu Asp Ala Ala Glu Asn Arg Pro Leu Ser Val Ser Thr Asp 3345 3350 3355 3360

Pro Lys Pro Val Glu Asp Pro Ser Pro Val Gln His Ile Glu Ser Leu 3365 3370 3375

Gln Met Asp Pro Glu Pro Val Glu Ser Glu Pro Leu Pro Met Asp Ser 3380 3385 3390

Glu Pro Val Glu Asp Leu Glu Pro Val Gln His Leu Glu Ser Leu Pro


210 215 220

aag agg acg gtg cac agg aag acc ggc aag aag gca gta gta cgc agg 720 Lys Arg Thr Val His Arg Lys Thr Gly Lys Lys Ala Val Val Arg Arg 225 230 235 240

aag agc aca gtc aag agg acg gca cgg agg gcc ggc aga aag acc 768 Lys Ser Thr Val Lys Arg Thr Ala Arg Arg Pro Ala Gly Arg Lys Thr 245 250 255

ccc gga agg gcc gcg cgc agg gcc ggc gca aag agg cgc tag 810 Pro Gly Arg Ala Ala Arg Arg Ala Gly Ala Lys Arg Arg 260 265

cctgctgat 819

<210> SEQ ID NO 6 <211> LENGTH: 269 <212> TYPE: PRT <213> ORGANISM: Cenarchaeum symbiosum

<400> SEQUENCE: 6

Met His Gly Ile Glu Gly Gly Gly Asp Met Ser Glu Asn Phe Val 1 5 10 15

Ala Phe Cys Val Ala Ala Arg Gly Val Thr Lys Asp Glu Met Lys 20 25 30

Val Asp Gly Val Phe His Glu His Ala Arg His Gly 35 40 45

Gly Gln Ile Arg Phe Pro Asn Pro Glu Val Glu Gln Arg Val Ala Glu 50 55 60

Leu Val Asp Leu Ile Gln Met Arg Asn Gln Leu Ala Glu Met Asn 65 70 75 80

Arg Ala Ser Gly Asp Gly Gly Val His Ser Ser Ala Thr Ser Ala Ala 85 90 95

Glu Ala Glu Gln His Arg Ala Glu Leu Val Gln Leu Val Gln Met 100 105 110

Arg Asn Gln Leu Ala Glu Met Asn Arg Ala Pro Gly Tys Pro Ala 115 120 125

Arg Lys Ala Ala Gly Lys Thr Ala Arg Arg Lys Ser Lys 130 135 140

Thr Val Arg Arg Thr Gly Lys Arg Thr Ala Gly Lys Ala Gly 145 150 155 160

Ala Arg Arg Thr Thr Val Arg Thr Ala Arg Thr Thr 165 170 175

Ala Ala Ala Gly Ala Gly Ala Arg Tys Ala Thr 180 185 190

Val Arg Thr Val His Lys Ile Gly Val Arg Arg Thr Thr 195 200 205

Ala Arg Arg Thr Ala Gly Lys Ser Thr Val Arg Arg Ser Thr Val 210 215 220

Lys Arg Thr Val His Arg Lys Thr Gly Lys Lys Ala Val Val Arg Arg 225 230 235 240

Lys Ser Thr Val Lys Arg Thr Ala Arg Arg Pro Ala Gly Arg Lys Thr 245 250 255

Pro Gly Arg Ala Ala Arg Arg Ala Gly Ala Lys Arg Arg 260 265

<210> SEQ ID NO 7 <211> LENGTH: 1569 <212> TYPE: DNA


Pro Leu Leu Leu Lys Pro Val Thr Ala Ser Gly Val Ala Val 65 70 75 80

Ile Ala Val Met Pro Met Pro Ala Cys Pro His Gly Arg Cys Thr 85 90 95

Pro Gly Gly Ala Ser Asn Thr Pro Asn Ser Tyr Thr Gly 100 105 110

Gly Pro Ile Ala Gly Ala Met Asn Ser Gly Tyr Asp Pro Glu 115 120 125

Glu Val Arg Ala Leu Ala Arg Leu Arg Ala His His Asp 130 135 140

Val Leu Glu Val Ile Val Gly Gly Thr Phe Leu Phe Met 145 150 155 160

Pro Glu Gln Trp Phe Val Lys Ser Asp Ala Leu 165 170 175

Asn Ser Ala Ser Gly Met Glu Glu Ala His Arg Asn Glu 180 185 190

Thr Val His Arg Asn Val Gly Leu Thr Ile Glu Thr Pro Asp 195 200 205

Arg Thr Glu His Val Asp Ala Met Leu Gly Phe Ala Thr 210 215 220

Arg Glu Ile Gly Val Gln Ser Leu Arg Glu Glu Val Leu Arg 225 230 235 240

Val Asn Arg Gly His Gly Gln Asp Val Thr Glu Ser Phe Ala Ala 245 250 255

Ala Arg Asp Ala Gly Val Ala Ala His Met Met Pro Gly Leu 260 265 270

Pro Gly Ala Thr Pro Glu Gly Asp Ile Glu Asp Leu Arg Met Leu Phe 275 280 285

Glu Asp Pro Ala Leu Arg Pro Asp Met Leu Val Pro Ala Leu 290 295 300

Val Val Arg Gly Thr Pro Met Glu Glu Tyr Ser Arg Glu Tyr 305 310 315 320

Ser Pro Thr Glu Glu Glu Val Ile Arg Val Leu Ser Glu Ala Lys 325 330 335

Ala Arg Val Pro Arg Trp Ala Arg Ile Met Arg Val Gln Arg Glu Ile 340 345 350

His Pro Asp Glu Ile Val Ala Gly Pro Arg Ser Gly Asn Leu Gln 355 360 365

Leu Val His Arg Leu Gln Glu Gln Gly Arg Arg Arg Cys Ile 370 375 380

Arg Arg Glu Ala Gly Leu Ala Gly Arg Thr Val Pro Gln Lys Leu 385 390 395 400

Arg Ile Asp Arg Ala Asp Ser Ala Ser Gly Gly Arg Glu Ser Phe 405 410 415

Ile Ser Leu Val Asp Gly Asp Asp Ala Ile Gly Phe Val Arg Leu 420 425 430

Arg Pro Ser Gly Ala Ala His Arg Pro Glu Val Thr Pro Glu Ser 435 440 445

Ile Ile Arg Glu Leu His Val Gly Arg Ser Leu Leu Gly 450 455 460

Glu Arg Gly Gly Ile Gln His Ser Gly Leu Gly Arg Leu Val Ser 465 470 475 480


<400> SEQUENCE: 10

Met Glu Thr Ile Gly Arg Gly Thr Trp Ile Asp Leu Ala His Glu 1 5 10 15

Leu Val Glu Arg Glu Glu Ala Leu Gly Arg Asp Thr Glu Met Ile Asn 20 25 30

Val Glu Ser Gly Leu Gly Ala Ser Gly Ile Pro His Met Ser Leu 35 40 45

Gly Asp Ala Val Arg Ala Tyr Gly Val Gly Leu Ala Val Asp Met 50 55 60

Gly His Ser Phe Arg Leu Ile Ala Phe Asp Asp Leu Asp Gly Leu 65 70 75 80

Arg Val Pro Glu Gly Met Pro Ser Ser Leu Glu Glu His Ile Ala 85 90 95

Arg Pro Val Ser Ala Ile Pro Asp Pro Gly His Asp Ser 100 105 110

Gly Met His Met Ser Gly Leu Leu Leu Glu Gly Leu Asp Ala Leu Gly 115 120 125

Ile Glu Asp Phe Arg Arg Ala Arg Asp Thr Tyr Arg Asp Gly Leu 130 135 140

Leu Ala Glu Gln Ile His Arg Ile Leu Ser Asn Ser Ser Val Ile Gly 145 150 155 160

Glu Ile Ala Glu Met Val Gly Gln Glu Phe Arg Ser Ser Leu 165 170 175

Pro Phe Ala Val Glu Gln Cys Gly Met Thr Ala 180 185 190

Ser Val Glu Leu Ala Asp Ser Arg Val Arg Tyr Arg Cys 195 200 205

Asp Ala Glu Val Gly Gly Arg Ile Gly Cys Gly His Glu 210 215 220

Glu Ala Asp Thr Gly Gly Ala Gly Gly Leu Ala Trp Val 225 230 235 240

Phe Ala Ala Arg Trp Gln Ala Phe Asp Val Arg Phe Glu Ala Tyr 245 250 255

Asp Ile Met Asp Ser Val Arg Ile Asn Asp Trp Val Ser Asp 260 265 270

Ile Leu Ser Ser Pro His Pro His His Thr Arg Glu Met Phe Leu 275 280 285

Asp Lys Gly Gly Ile Ser Ser Ser Gly Asn Val Val Thr 290 295 300

Pro Gln Trp Leu Arg Thr Pro Gln Ser Ile Leu Leu Leu 305 310 315 320

Met Arg Ile Thr Arg Glu Leu Gly Leu Glu Asp Val 325 330 335

Pro Ser Leu Met Asp Glu Asp Leu Gln Arg Glu Tyr Phe Ala 340 345 350

Gly Gly Gly Arg Gly Gly Arg Glu Ala Asn Arg Gly Leu 355 360 365

Phe Glu Thr Asn Leu Leu Ala Gln Glu Gly Pro Arg Pro His 370 375 380

Ala Gly Arg Leu Leu Val Leu Ser Arg Leu Phe Arg Glu Asn 385 390 395 400

Arg Thr Glu Arg Val Thr Leu Val Glu Gly Val Ile Asp 405 410 415


Gly Pro Ser Pro Gly Ile Glu Arg Leu Ile Ala Leu Ala Gly Asn Tyr 420 425 430

Ala Asp Asp Met Ser Ala Glu Arg Thr Glu Val Glu Leu Asp Gly 435 440 445

Ala Thr Arg Gly Ala Leu Ser Glu Leu Ala Glu Met Leu Gly Ser Ala 450 455 460

Pro Glu Gly Gly Leu Gln Asp Val Ile Gly Val Ala Lys Ser His 465 470 475 480

Gly Val Pro Pro Arg Asp Phe Phe Lys Ala Leu Arg Ile Ile Leu 485 490 495

Asp Ala Ser Ser Gly Pro Arg Ile Gly Pro Phe Ile Glu Asp Ile Gly 500 505 510

Arg Glu Lys Val Ala Gly Met Ile Arg Gly Arg Leu 515 520

<210> SEQ ID NO 11 <211> LENGTH: 885 <212> TYPE: DNA <213> ORGANISM: Cenarchaeum symbiosum <220> FEATURE: <221> NAME/KEY: CDS <222> LOCATION: (1)...(885) <400> SEQUENCE: 11

atg gag tca gcc ggt gag cag gca cct ggt gtg gta ctt cac gac tat 48 Met Glu Ser Ala Gly Glu Gln Ala Pro Gly Val Val Leu His Asp Tyr 1 5 10 15

ctt tca aaa ttg caa cag tat tcg ggg agg gac aca att cta tat gcg 96 Leu Ser Lys Leu Gln Gln Tyr Ser Gly Arg Asp Thr Ile Leu Tyr Ala 20 25 30

acc aac tgg atg acg gac gaa ccg cat acg cct aat gaa gct ctc ata 144 Thr Asn Trp Met Thr Asp Glu Pro His Thr Pro Asn Glu Ala Leu Ile 35 40 45

aca aat ggt gac ctg tat gga ttt atg agg atg atg cgt gat tta aag 192 Thr Asn Gly Asp Leu Tyr Gly Phe Met Arg Met Met Arg Asp Leu Lys 50 55 60

act aaa aaa ttg gat ctg ata ctc cac agt cct gga ggt tct gcc gag 240 Thr Lys Lys Leu Asp Leu Ile Leu His Ser Pro Gly Gly Ser Ala Glu 65 70 75 80

tct gca gaa tcg att gtc aca tac ctt cat gcg aaa tat gat gat att 288 Ser Ala Glu Ser Ile Val Thr Tyr Leu His Ala Lys Tyr Asp Asp Ile 85 90 95

cgg gtc atc ata ccg tat gcc gca atg tca gca gcc tcg atg ctt gct 336 Arg Val Ile Ile Pro Tyr Ala Ala Met Ser Ala Ala Ser Met Leu Ala 100 105 110

tgc gca tca aat tcc ctg gta atg ggc aaa cac tcg tct ata gga ccc 384 Cys Ala Ser Asn Ser Leu Val Met Gly Lys His Ser Ser Ile Gly Pro 115 120 125

gct gat ccc caa ttt att ttc cca acc aag att ggc atg caa ata atg 432 Ala Asp Pro Gln Phe Ile Phe Pro Thr Lys Ile Gly Met Gln Ile Met 130 135 140

tct gca cag ctt cta att gac gag ttg caa gaa gtg cag gtg gta tct 480 Ser Ala Gln Leu Leu Ile Asp Glu Leu Gln Glu Val Gln Val Val Ser 145 150 155 160

gaa aaa cat ccg ggc agg ctt ggc gca tgg ctt cca ttg tta gga caa 528 Glu Lys His Pro Gly Arg Leu Gly Ala Trp Leu Pro Leu Leu Gly Gln 165 170 175

tat cct cct gga ctg gtt caa aaa tgc att agc agc cag aaa cta gct 576 Tyr Pro Pro Gly Leu Val Gln Lys Cys Ile Ser Ser Gln Lys Leu Ala


180 185 190

gaa gtg ctt gta caa tgg ctg gaa gac cac atg ttt gct ggc gag 624 Glu Val Leu Val Gln Trp Leu Glu Asp His Met Phe Ala Gly Glu 195 200 205

tct gat gcg gca gaa tca aaa aaa ata tct gga atg tta gct tct 672 Ser Asp Ala Ala Glu Ser Lys Lys Ile Ser Gly Met Leu Ala Ser 210 215 220

cct gga aaa tat tac agt cat ggg aga tac ata tcg cga gag gag 720 Pro Gly Lys Tyr Tyr Ser His Gly Arg Tyr Ile Ser Arg Glu Glu Cys 225 230 235 240

agg ggc atc ggt ttg aaa ata act gat cta gaa gcc gac caa gaa ttt 768 Arg Gly Ile Gly Leu Lys Ile Thr Asp Leu Glu Ala Asp Gln Glu Phe 245 250 255

cag gat ctg aca ttg tcg gta tct cat gca gcg gat atc ctg tct caa 816 Gln Asp Leu Thr Leu Ser Val Ser His Ala Ala Asp Ile Leu Ser Gln 260 265 270

ttt act cca atc aac aaa atc atc gcg aat cac ctc ggt aat tca gtt 864 Phe Thr Pro Ile Asn Lys Ile Ile Ala Asn His Leu Gly Asn Ser Val 275 280 285

atc agc aaa cca tca aca tag 885 Ile Ser Lys Pro Ser Thr 290

<210> SEQ ID NO 12 <211> LENGTH: 294 <212> TYPE: PRT <213> ORGANISM: Cenarchaeum symbiosum

<400> SEQUENCE: 12

Met Glu Ser Ala Gly Glu Gln Ala Pro Gly Val Val Leu His Asp Tyr 1 5 10 15

Leu Ser Lys Leu Gln Gln Tyr Ser Gly Arg Asp Thr Ile Leu Tyr Ala 20 25 30

Thr Asn Trp Met Thr Asp Glu Pro His Thr Pro Asn Glu Ala Leu Ile 35 40 45

Thr Asn Gly Asp Leu Tyr Gly Phe Met Arg Met Met Arg Asp Leu Lys 50 55 60

Thr Lys Lys Leu Asp Leu Ile Leu His Ser Pro Gly Gly Ser Ala Glu 65 70 75 80

Ser Ala Glu Ser Ile Val Thr Tyr Leu His Ala Lys Tyr Asp Asp Ile 85 90 95

Arg Val Ile Ile Pro Tyr Ala Ala Met Ser Ala Ala Ser Met Leu Ala 100 105 110

Cys Ala Ser Asn Ser Leu Val Met Gly Lys His Ser Ser Ile Gly Pro 115 120 125

Ala Asp Pro Gln Phe Ile Phe Pro Thr Lys Ile Gly Met Gln Ile Met 130 135 140

Ser Ala Gln Leu Leu Ile Asp Glu Leu Gln Glu Val Gln Val Val Ser 145 150 155 160

Glu Lys His Pro Gly Arg Leu Gly Ala Trp Leu Pro Leu Leu Gly Gln 165 170 175

Tyr Pro Pro Gly Leu Val Gln Lys Cys Ile Ser Ser Gln Lys Leu Ala 180 185 190

Glu Val Leu Val Gln Trp Leu Glu Asp His Met Phe Ala Gly Glu 195 200 205

Ser Asp Ala Ala Glu Ser Lys Lys Ile Ser Gly Met Leu Ala Ser 210 215 220


Pro Gly Lys Tyr Tyr Ser His Gly Arg Tyr Ile Ser Arg Glu Glu Cys 225 230 235 240

Arg Gly Ile Gly Leu Lys Ile Thr Asp Leu Glu Ala Asp Gln Glu Phe 245 250 255

Gln Asp Leu Thr Leu Ser Val Ser His Ala Ala Asp Ile Leu Ser Gln 260 265 270

Phe Thr Pro Ile Asn Lys Ile Ile Ala Asn His Leu Gly Asn Ser Val 275 280 285

Ile Ser Lys Pro Ser Thr 290

<210> SEQ ID NO 13 <211> LENGTH: 1305 <212> TYPE: DNA <213> ORGANISM: Cenarchaeum symbiosum <220> FEATURE: <221> NAME/KEY: CDS <222> LOCATION: (1)...(1305) <400> SEQUENCE: 13

gtg gat cta gag cgc gag tac agg gca aag acc agg ggc tcg gcg ggg 48 Met Asp Leu Glu Arg Glu Tyr Arg Ala Lys Thr Arg Gly Ser Ala Gly 1 5 10 15

ata ttt gcc cgg tcg aga agg tac cat gta ggg ggg gtc agc cac aac 96 Ile Phe Ala Arg Ser Arg Arg Tyr His Val Gly Gly Val Ser His Asn 20 25 30

ata agg tac tat gag ccg tac ccg ttt gtt aca agg tcg gcg cgc ggc 144 Ile Arg Tyr Tyr Glu Pro Tyr Pro Phe Val Thr Arg Ser Ala Arg Gly 35 40 45

aag cac ctt gtg gac gtc gac ggg aac aag tat acc gac tat tgg atg 192 Lys His Leu Val Asp Val Asp Gly Asn Lys Tyr Thr Asp Tyr Trp Met 50 55 60

cac tgg agc ctg ata ctc ggc cac gcg ccg gcg caa gta agg tcg 240 Gly His Trp Ser Leu Ile Leu Gly His Ala Pro Ala Gln Val Arg Ser 65 70 75 80

gtg gag ggg cag ctg cgc cgc ggc tgg ata cac ggg acc gca aac 288 Ala Val Glu Gly Gln Leu Arg Arg Gly Trp Ile His Gly Thr Ala Asn 85 90 95

ccc acc atg cgg ctc tcg gag atc ata cgc ggg gcg gta aag gcg 336 Glu Pro Thr Met Arg Leu Ser Glu Ile Ile Arg Gly Ala Val Lys Ala 100 105 110

gag aag ata agg tat gtt aca tcc ggc acg gag gcc gtc atg tat 384 Ala Glu Lys Ile Arg Tyr Val Thr Ser Gly Thr Glu Ala Val Met Tyr 115 120 125

gca agg atg gcg cgc gca cgc acg gga aaa aaa gtg ata gca aag 432 Ala Ala Arg Met Ala Arg Ala Arg Thr Gly Lys Lys Val Ile Ala Lys 130 135 140

gtc gac ggc ggc tgg cac gga tac gcg tcg ggg ctg cta aag tcg gtc 480 Val Asp Gly Gly Trp His Gly Tyr Ala Ser Gly Leu Leu Lys Ser Val 145 150 155 160

aac tgg ccg tac gat gtg ccc gag agc ggg ggg ctc gtc gac gag gag 528 Asn Trp Pro Tyr Asp Val Pro Glu Ser Gly Gly Leu Val Asp Glu Glu 165 170 175

cac acc gtg tcc atc ccg tac aac aat ctg gag gga tcc ctg gag gcg 576 His Thr Val Ser Ile Pro Tyr Asn Asn Leu Glu Gly Ser Leu Glu Ala 180 185 190

cta agg cgc gca ggg ggc gac ctt gca gtc ata gtc gag ccg atg 624 Leu Arg Arg Ala Gly Gly Asp Leu Ala Cys Val Ile Val Glu Pro Met 195 200 205

ctt ggc ggc ggc ggc tgc ata ccg gca gaa ccg gac tat ctc cgc ggc 672 Leu Gly Gly Gly Gly Cys Ile Pro Ala Glu Pro Asp Tyr Leu Arg Gly 210 215 220

ata cag gag ttt gtg cat tcg aag ggt gca ctg ttc att ctc gac gag 720 Ile Gln Glu Phe Val His Ser Lys Gly Ala Leu Phe Ile Leu Asp Glu 225 230 235 240

ata gtc acg ggg ttc cgg ttc gac ttt ggc tgc gcg tac aag aaa atg 768 Ile Val Thr Gly Phe Arg Phe Asp Phe Gly Cys Ala Tyr Lys Lys Met 245 250 255

ggg ctg gac ccc gac gtg gtg gcg ctg gga aag ata gtc ggg ggc gga 816 Gly Leu Asp Pro Asp Val Val Ala Leu Gly Lys Ile Val Gly Gly Gly 260 265 270

ttc ccc ata ggt gtg gtg tgc ggc aag gac gag gtg atg tgc atc tcc 864 Phe Pro Ile Gly Val Val Cys Gly Lys Asp Glu Val Met Cys Ile Ser 275 280 285

gat acc ggc gcg cat gca aga acc gag agg gcg tac att ggc ggc ggc 912 Asp Thr Gly Ala His Ala Arg Thr Glu Arg Ala Tyr Ile Gly Gly Gly 290 295 300

acc ttt tct gca aac ccc gcg acg atg act gcg ggt gcc gcg gca ctc 960 Thr Phe Ser Ala Asn Pro Ala Thr Met Thr Ala Gly Ala Ala Ala Leu 305 310 315 320

ggt gca ctc agg gag aga agg ggc aca cta tac ccc aga ata aac tcc 1008 Gly Ala Leu Arg Glu Arg Arg Gly Thr Leu Tyr Pro Arg Ile Asn Ser 325 330 335

atg ggg gac gac gca agg gcg cgg ctc tcg agg ata ttc gac ggc agg 1056 Met Gly Asp Asp Ala Arg Ala Arg Leu Ser Arg Ile Phe Asp Gly Arg 340 345 350

gtt gca gtg acc ggc agg ggc tcg ctg ttc atg acg cac ttt aca ccg 1104 Val Ala Val Thr Gly Arg Gly Ser Leu Phe Met Thr His Phe Thr Pro 355 360 365

gat ggg gcc cgc agg ata tcc agc gcg gca gat gct gcc gcc tgc gat 1152 Asp Gly Ala Arg Arg Ile Ser Ser Ala Ala Asp Ala Ala Ala Cys Asp 370 375 380

gtg cat ctg ctg cac agg cac ctg gac atg att aca agg gac ggc 1200 Val His Leu Leu His Arg His Leu Asp Met Ile Thr Arg Asp Gly 385 390 395 400

ata ttc ttt ctg cca ggc ctg ggg gcc ata tct gcc gcc cac tca 1248 Ile Phe Phe Leu Pro Gly Lys Leu Gly Ala Ile Ser Ala Ala His Ser 405 410 415

agg gcg gac ctt ggg gcc atg tat tcg gcg tct gag cgc ttt gcg ggg 1296 Arg Ala Asp Leu Gly Ala Met Tyr Ser Ala Ser Glu Arg Phe Ala Gly 420 425 430

gga ctg 1305 Gly Leu

<210> SEQ ID NO 14 <211> LENGTH: 434 <212> TYPE: PRT <213> ORGANISM: Cenarchaeum symbiosum

<400> SEQUENCE: 14

Met Asp Leu Glu Arg Glu Tyr Arg Ala Lys Thr Arg Gly Ser Ala Gly 1 5 10 15

Ile Phe Ala Arg Ser Arg Arg Tyr His Val Gly Gly Val Ser His Asn 20 25 30

Ile Arg Tyr Tyr Glu Pro Tyr Pro Phe Val Thr Arg Ser Ala Arg Gly 35 40 45

Lys His Leu Val Asp Val Asp Gly Asn Lys Tyr Thr Asp Tyr Trp Met 50 55 60

Gly His Trp Ser Leu Ile Leu Gly His Ala Pro Ala Gln Val Arg Ser 65 70 75 80

Ala Val Glu Gly Gln Leu Arg Arg Gly Trp Ile His Gly Thr Ala Asn 85 90 95

Glu Pro Thr Met Arg Leu Ser Glu Ile Ile Arg Gly Ala Val Lys Ala 100 105 110

Ala Glu Lys Ile Arg Tyr Val Thr Ser Gly Thr Glu Ala Val Met Tyr 115 120 125

Ala Ala Arg Met Ala Arg Ala Arg Thr Gly Lys Lys Val Ile Ala Lys 130 135 140

Val Asp Gly Gly Trp His Gly Tyr Ala Ser Gly Leu Leu Lys Ser Val 145 150 155 160

Asn Trp Pro Tyr Asp Val Pro Glu Ser Gly Gly Leu Val Asp Glu Glu 165 170 175

His Thr Val Ser Ile Pro Tyr Asn Asn Leu Glu Gly Ser Leu Glu Ala 180 185 190

Leu Arg Arg Ala Gly Gly Asp Leu Ala Cys Val Ile Val Glu Pro Met 195 200 205

Leu Gly Gly Gly Gly Cys Ile Pro Ala Glu Pro Asp Tyr Leu Arg Gly 210 215 220

Ile Gln Glu Phe Val His Ser Lys Gly Ala Leu Phe Ile Leu Asp Glu 225 230 235 240

Ile Val Thr Gly Phe Arg Phe Asp Phe Gly Cys Ala Tyr Lys Lys Met 245 250 255

Gly Leu Asp Pro Asp Val Val Ala Leu Gly Lys Ile Val Gly Gly Gly 260 265 270

Phe Pro Ile Gly Val Val Cys Gly Lys Asp Glu Val Met Cys Ile Ser 275 280 285

Asp Thr Gly Ala His Ala Arg Thr Glu Arg Ala Tyr Ile Gly Gly Gly 290 295 300

Thr Phe Ser Ala Asn Pro Ala Thr Met Thr Ala Gly Ala Ala Ala Leu 305 310 315 320

Gly Ala Leu Arg Glu Arg Arg Gly Thr Leu Tyr Pro Arg Ile Asn Ser 325 330 335

Met Gly Asp Asp Ala Arg Ala Arg Leu Ser Arg Ile Phe Asp Gly Arg 340 345 350

Val Ala Val Thr Gly Arg Gly Ser Leu Phe Met Thr His Phe Thr Pro 355 360 365

Asp Gly Ala Arg Arg Ile Ser Ser Ala Ala Asp Ala Ala Ala Cys Asp 370 375 380

Val His Leu Leu His Arg His Leu Asp Met Ile Thr Arg Asp Gly 385 390 395 400

Ile Phe Phe Leu Pro Gly Lys Leu Gly Ala Ile Ser Ala Ala His Ser 405 410 415

Arg Ala Asp Leu Gly Ala Met Tyr Ser Ala Ser Glu Arg Phe Ala Gly 420 425 430

Gly Leu

<210> SEQ ID NO 15 <211> LENGTH: 816 <212> TYPE: DNA <213> ORGANISM: Cenarchaeum symbiosum <220> FEATURE: <221> NAME/KEY: CDS <222> LOCATION: (1)...(816)


Met Ile Leu Phe Gly Lys Ser Asp Pro Ser Asp Leu Leu Arg Gln Ala 1 5 10 15

Asp Leu Leu Cys Ser Gly Asn Ala Ala Val Gly Leu Tyr 20 25 30

Ser Arg Ile Leu Asp Asp Pro Gln Asn Arg Met Val Leu Gln Arg 35 40 45

Leu Ala Leu Asn Arg Ile Arg Arg Tyr Ser Asp Ala Ile Thr 50 55 60

Cys Phe Asp Leu Leu Leu Glu Leu Asp Asp Gly Asp Ala Pro Ala Tyr 65 70 75 80

Asn Asn Ala Ile Ala Gln Ala Glu Leu Gly Asp Thr Ala Ser Ala 85 90 95

Leu Glu Asn Tyr Gly Ala Ile Glu Ala Ser Pro Arg Tyr Ala Pro 100 105 110

Ala Tyr Phe Asn Arg Ala Val Leu Leu Asp Arg Leu Gly Glu His Glu 115 120 125

Asp Ala Leu Pro Asp Leu Asp Lys Ala Thr Arg Leu Asp Arg Asp Lys 130 135 140

Ala Asn Pro Phe Tyr Gly Ile Val Leu Gly Lys Met Gly Arg 145 150 155 160

His Ala Glu Ala Leu Ser Phe Lys Glu Val Cys Arg Ala Asp His 165 170 175

Gly His Ala Asp Ser Gln Phe His Val Ala Ile Glu Val Ala Glu Leu 180 185 190

Gly Lys His Ala Glu Ala Leu Gly Glu Leu Ala Ala Leu Pro Ala Glu 195 200 205

Glu Asn Ala Asn Val Leu Tyr Ala Arg Ala Arg Ser Leu Ala 210 215 220

Gly Leu Asp Asp Glu Ser Ile Ala His Leu Gln Lys Ala Ala 225 230 235 240

Arg Lys Asp Ser Lys Thr Ile Ala Arg Ala Glu Lys Ala 245 250 255

Phe Asp His Ile Arg Asp Asp Pro Arg Phe Lys Lys Ile Ala Gly 260 265 270

<210> SEQ ID NO 17 <211> LENGTH: 696 <212> TYPE: DNA <213> ORGANISM: Cenarchaeum symbiosum <220> FEATURE: <221> NAME/KEY: CDS <222> LOCATION: (1)...(696) <400> SEQUENCE: 17

gtg act gac aag aca agg atc atc gtc ctg cgc aac gcc atg act gaa 48 Met Thr Asp Lys Thr Arg Ile Ile Val Leu Arg Asn Ala Met Thr Glu 1 5 10 15

cag tcc gcc cgg gcc atg atc gag gca aaa aag acg ggg cca ttc agg 96 Gln Ser Ala Arg Ala Met Ile Glu Ala Lys Lys Thr Gly Pro Phe Arg 20 25 30

gcc atg atg agg gcg ccc cca aag gag gac gtc cat gta cat tcc gta 144 Ala Met Met Arg Ala Pro Pro Lys Glu Asp Val His Val His Ser Val 35 40 45

agg ctc gtc cac gag gcg ctc atc cgc gtc tcc gcc cgg tac tcg gcc 192 Arg Leu Val His Glu Ala Leu Ile Arg Val Ser Ala Arg Tyr Ser Ala 50 55 60

gac ttt ttc aga agg gcc cac ccg atc aag gtg gat cag aac gtg 240

Asp Phe Phe Arg Arg Ala Val His Pro Ile Lys Val Asp Gln Asn Val 65 70 75 80

atc gag gtg gtg ctg ggc gac ggc gtc ttc ccg ata agg tca aag tcg 288 Ile Glu Val Val Leu Gly Asp Gly Val Phe Pro Ile Arg Ser Lys Ser 85 90 95

cgc ata cgc aag acc ctg tcc gcc ggg cgc ggc aag aac agg gtc gat 336 Arg Ile Arg Lys Thr Leu Ser Ala Gly Arg Gly Lys Asn Arg Val Asp 100 105 110

ctg gaa ctc gag gag cac gta tac gcg gaa tca gag ggc gtg atg tgc 384 Leu Glu Leu Glu Glu His Val Tyr Ala Glu Ser Glu Gly Val Met Cys 115 120 125

ctt gac cgg cac ggc ggg gag acc ggc ttt ccc tac aag acg ggg acc 432 Leu Asp Arg His Gly Gly Glu Thr Gly Phe Pro Tyr Lys Thr Gly Thr 130 135 140

ggc gcg gtc gag ccg tac ccg cgg cgc atg ctt gat tcg tcg gag aat 480 Gly Ala Val Glu Pro Tyr Pro Arg Arg Met Leu Asp Ser Ser Glu Asn 145 150 155 160

gtg cgg cgc ccg gag ata gac acc ggg gtg gcg ctg gaa aaa ctc cgg 528 Val Arg Arg Pro Glu Ile Asp Thr Gly Val Ala Leu Glu Lys Leu Arg 165 170 175

gta aag ctc cgc ggg ccc ccg cct gac ggc atg cgc gac ctc cgg gag 576 Val Lys Leu Arg Gly Pro Pro Pro Asp Gly Met Arg Asp Leu Arg Glu 180 185 190

gag ttt gca gtc aga tcg gtc gaa gaa gtg tat gcc cct gtc tac gag 624 Glu Phe Ala Val Arg Ser Val Glu Glu Val Tyr Ala Pro Val Tyr Glu 195 200 205

tcg cgg ctt gtg ggg ccc aaa aaa aag gtc cgg ata atg cgg ata gac 672 Ser Arg Leu Val Gly Pro Lys Lys Lys Val Arg Ile Met Arg Ile Asp 210 215 220

gcg gca aga aaa aag atg ctg tag 696 Ala Ala Arg Lys Lys Met Leu 225 230

<210> SEQ ID NO 18 <211> LENGTH: 231 <212> TYPE: PRT <213> ORGANISM: Cenarchaeum symbiosum

<400> SEQUENCE: 18

Met Thr Asp Lys Thr Arg Ile Ile Val Leu Arg Asn Ala Met Thr Glu 1 5 10 15

Gln Ser Ala Arg Ala Met Ile Glu Ala Lys Lys Thr Gly Pro Phe Arg 20 25 30

Ala Met Met Arg Ala Pro Pro Lys Glu Asp Val His Val His Ser Val 35 40 45

Arg Leu Val His Glu Ala Leu Ile Arg Val Ser Ala Arg Tyr Ser Ala 50 55 60

Asp Phe Phe Arg Arg Ala Val His Pro Ile Lys Val Asp Gln Asn Val 65 70 75 80

Ile Glu Val Val Leu Gly Asp Gly Val Phe Pro Ile Arg Ser Lys Ser 85 90 95

Arg Ile Arg Lys Thr Leu Ser Ala Gly Arg Gly Lys Asn Arg Val Asp 100 105 110

Leu Glu Leu Glu Glu His Val Tyr Ala Glu Ser Glu Gly Val Met Cys 115 120 125

Leu Asp Arg His Gly Gly Glu Thr Gly Phe Pro Tyr Lys Thr Gly Thr 130 135 140

Gly Ala Val Glu Pro Tyr Pro Arg Arg Met Leu Asp Ser Ser Glu Asn