USOO8168205B2

(12) United States Patent (10) Patent No.: US 8,168.205 B2 Kunsch et al. (45) Date of Patent: May 1, 2012

(54) STREPTOCOCCUS PNEUMONIAE W WEA's 88: POLYPEPTIDES WO WO-96,05859 2, 1996 WO WO-96,08582 3, 1996 (75) Inventors: Charles A. Kunsch, Norcross, GA (US); WO WO-96, 16082 5, 1996 Gil H. Choi, Rockville, MD (US); WO WO-96,33276 10, 1996 Patrick J. Dillon, Carlsbad, CA (US); W. W85. 1 . E. Craig A. Rosen, Laytonsville, MD (US); WO WO-98/26O72 6, 1998 Steven C. Barash, Rockville, MD (US); WO WO-99/15675 4f1999 Michael R. Fannon, Silver Spring, MD (US); Brian A. Dougherty, OTHER PUBLICATIONS Killingworth, CT (US) Passarge, Eberhard. Color Atlas of Genetics. 1995 Thieme Medical (73) Assignee: Human Genome Sciences, Inc., Publishers, In, New York, p. 48-49.* Abbas et al. Cellular and Molecular Immunology 4th edition, 2000 Rockville, MD (US) Chapter 15 p. 360-362.* Ellis, R.W. (Chapter 29 of “Vaccines' Plotkin, S.A. et al. (eds) (*) Notice: Subject to any disclaimer, the term of this published by W. B. Saunders company (Philadelphia) in 1988, p. patent is extended or adjusted under 35 571* U.S.C. 154(b) by 0 days. (McDaniel et al. Infection and Immunity, Jan. 1991, p. 222-228.* Poolman et al. Expert Rev. Vaccines 3(5), 597-604 (2004).* (21) Appl. No.: 12/635,794 Adamou et al., “Identification and characterization of a novel family of Pneumococcal proteins that are protective against sepsis.” Infec (22) Filed: Dec. 11, 2009 tion and Immunity, 69(2):949-958 (2001). Altschul, et al., “Basic local alignment search tool.” J. Mol. Biol. (65) Prior Publication Data 215:403-410 (1990). US 2010/0221287 A1 Sep. 2, 2010 Gagnon, et al., “Cloning, sequencing and expression in Escherichia coli of the pts gene encoding I of the phosphoenolpyruvate: Related U.S. Application Data Sugar phosphotransferase transport system from Streptoccus salivarius,” Gene, 121:71-78 (1992). (60) Continuation of application No. 11/563,552, filed on Gasc, et al., J. Bacteriology, 173:7361-7367 (1991). Nov. 27, 2006, now abandoned, which is a division of Hillier, et al., GenBank Accession No. H38607 (1995). application No. 10/158,844, filed on Jun. 3, 2002, now Hillier, et al., GenBank Accession No. R54118 (1995). Pat. No. 7,141,418, which is a division of application Hillier, et al., GenBank Accession No. R15155 (1995). No. 08/961,527, filed on Oct. 30, 1997, now Pat. No. Hillier, et al., GenBank Accession No. R01821 (1995). 6,420,135. Hillier, et al., GenBank Accession No. H04394 (1995). Hillier, et al., GenBank Accession No. R66534 (1995). (60) Provisional application No. 60/029.960, filed on Oct. Hillier, et al., GenBank Accession No. H65672 (1995). 31, 1996. Hillier, et al., GenBank Accession No. HO 1173 (1995). Kaneko, et al (DNA Research 2 pp. 153-166), 1995. (51) Int. Cl. Martin, et al., “Relatedness of penicillin-binding protein la genes C07K L/00 (2006.01) from different clones of penicillin-resistant Streptococcus C07K I4/00 (2006.01) pneumoniae isolated in South Africa and Spain.” The EMBOJournal, C07K 17/00 (2006.01) 11(11):3831-3836 (1992). A6 IK39/00 (2006.01) Merriam Webster's Collegiate Dictionary, 10th edition, p. 545 A6 IK39/02 (2006.01) (1997). A 6LX39/09 (2006.01) (Continued) A6 IK39/38 (2006.01) (52) U.S. Cl...... 424/244.1; 424/184.1; 424/234.1; 530/350 Primary Examiner — Oluwatosin Ogunbiyi (58) Field of Classification Search ...... None (74) Attorney, Agent, or Firm — Human Genome Sciences, See application file for complete search history. Inc. (56) References Cited (57) ABSTRACT U.S. PATENT DOCUMENTS The present invention provides polynucleotide sequences of 5,302,527 A 4, 1994 Birkett et al. the genome of Streptococcus pneumoniae, polypeptide 6,251,581 B1 6, 2001 Ullman et al. sequences encoded by the polynucleotide sequences, corre 6,251,588 B1 6/2001 Shannon et al. sponding polynucleotides and polypeptides, vectors and 6,573,082 B1 6, 2003 Choi et al. hosts comprising the polynucleotides, and assays and other 6,699,703 B1 3/2004 Doucette-Stamm et al. uses thereof. The present invention further provides poly FOREIGN PATENT DOCUMENTS nucleotide and polypeptide sequence information stored on EP O 622 081 A2 11/1994 computer readable media, and computer-based systems and EP O 687 688 A1 12/1995 methods which facilitate its use. JP 8-116973 5, 1996 WO WO-93, 10238 5, 1993 WO WO-95/06732 3, 1995 6 Claims, 2 Drawing Sheets US 8,168.205 B2 Page 2

OTHER PUBLICATIONS Rudinger, Peptide Hormones “Characteristics of the amino acids as & 8 components of a peptide hormone sequence. J.A. Parsons (Ed.) Pearce, et al., “Genetic identification of exported proteins in University Park Press, Baltimore, pp. 5-7 (1996). Streptococcus pneumoniae.” Molecular Microbiology, 9(5): 1037- Tan, et al., GenBank Accession No. M87840 (1995). 1050 (1993). Zhang, et al., J. Clin. Microbiol. 33(3):596-601, 1995. Pearson, et al., “Improved tools for biological sequence comparison.” Proc. Natl. Acad. Sci., 85:2444-2448 (1988). * cited by examiner U.S. Patent US 8,168.205 B2

U.S. Patent May 1, 2012 US 8,168.205 B2 1. 2 STREPTOCOCCUS PNEUMONIAE The antibiotic generally prescribed to treat S. pneumoniae POLYPEPTIDES is benzylpenicillin, although resistance to this and to other antibiotics is found occasionally. Pneumococcal resistance to CROSS REFERENCE TO RELATED penicillin results from mutations in its penicillin-binding pro APPLICATIONS teins. In uncomplicated pneumococcal pneumonia caused by a sensitive strain, treatment with penicillin is usually Success This application is a continuation of U.S. application Ser. ful unless started too late. Erythromycin or clindamycin can No. 11/563,552, filed Nov. 27, 2006 (now abandoned), which be used to treat pneumonia in patients hypersensitive to peni is a divisional of U.S. application Ser. No. 10/158,844, filed cillin, but resistant strains to these drugs exist. Broad spec Jun. 3, 2002 (now U.S. Pat. No. 7,141,418, issued Nov. 28, 10 trum antibiotics (e.g., the tetracyclines) may also be effective, 2006), which is a divisional of U.S. application Ser. No. although tetracycline-resistant strains are not rare. In spite of 08/961,527, filed Oct. 30, 1997 (now U.S. Pat. No. 6,420,135, the availability of antibiotics, the mortality of pneumococcal issued Jul. 16, 2002), which is a non-provisional of and bacteremia in the last four decades has remained stable claims benefit under 35 U.S.C. S 119(e) of U.S. Provisional between 25 and 29%. (Gillespie, S. H., et al., J. Med. Micro Application No. 60/029,960, filed Oct. 31, 1996, each of 15 biol. 28:237-248 (1989). which is incorporated by reference herein in its entirety. S. pneumoniae is carried in the upper respiratory tract by many healthy individuals. It has been suggested that attach STATEMENT UNDER 37 C.F.R.S1.77(b)(5) ment of pneumococci is mediated by a disaccharide receptor on fibronectin, present on human pharyngeal epithelial cells. This application refers to a “Sequence Listing listed (Anderson, B. J., et al., J. Immunol. 142:2464-2468 (1989). below, which is provided as a text document. The document is The mechanisms by which pneumococci translocate from the entitled “PB340P1D2C1 SeqList.txt” (2,796,867 bytes, cre nasopharynx to the lung, thereby causing pneumonia, or ated Dec. 10, 2009), and is hereby incorporated by reference migrate to the blood, giving rise to bacteremia or septicemia, in its entirety herein. are poorly understood. (Johnston, R. B., et al., Rev. Infect. 25 Dis. 13(Suppl. 6):S509-517 (1991). FIELD OF THE INVENTION Various proteins have been suggested to be involved in the pathogenicity of S. pneumoniae, however, only a few of them The present invention relates to the field of molecular biol have actually been confirmed as virulence factors. Pneumo ogy. In particular, it relates to, among other things, nucleotide cocci produce an IgA1 protease that might interfere with host sequences of Streptococcus pneumoniae, contigs, ORFs, 30 defense at mucosal surfaces. (Kornfield, S.J., et al., Rev. Inf fragments, probes, primers and related polynucleotides Dis. 3:521-534 (1981). S. pneumoniae also produces thereof, peptides and polypeptides encoded by the sequences. , an enzyme that may facilitate attachment to and uses of the polynucleotides and sequences thereof. Such epithelial cells by cleaving sialic acid from the host glycolip as in fermentation, polypeptide production, assays and phar ids and gangliosides. Partially purified neuraminidase was maceutical development, among others. 35 observed to induce meningitis-like symptoms in mice; how ever, the reliability of this finding has been questioned BACKGROUND OF THE INVENTION because the neuraminidase preparations used were probably contaminated with cell wall products. Other pneumococcal Streptococcus pneumoniae has been one of the most exten proteins besides neuraminidase are involved in the adhesion sively studied microorganisms since its first isolation in 1881. 40 of pneumococci to epithelial and endothelial cells. These It was the object of many investigations that led to important pneumococcal proteins have as yet not been identified. scientific discoveries. In 1928, Griffith observed that when Recently, Cundell et al., reported that peptide permeases can heat-killed encapsulated pneumococci and live Strains con modulate pneumococcal adherence to epithelial and endot stitutively lacking any capsule were concomitantly injected helial cells. It was, however, unclear whether these permeases into mice, the nonencapsulated could be converted into 45 function directly as adhesions or whether they enhance adher encapsulated pneumococci with the same capsular type as the ence by modulating the expression of pneumococcal adhe heat-killed strain. Years later, the nature of this “transforming sions. (DeVelasco, E. A., et al., Micro. Rev. 59:591-603 principle.” or carrier of genetic information, was shown to be (1995). A better understanding of the virulence factors deter DNA. (Avery, O.T., et al., J. Exp. Med., 79:137-157 (1944)). mining its pathogenicity will need to be developed to cope In spite of the vast number of publications on S. pneumo 50 with the devastating effects of pneumococcal disease in niae many questions about its virulence are still unanswered, humans. and this pathogen remains a major causative agent of serious Ironically, despite the prominent role of S. pneumoniae in human disease, especially community-acquired pneumonia. the discovery of DNA, little is known about the molecular (Johnston, R. B., et al., Rev. Infect. Dis. 13(Suppl. 6):S509 genetics of the organism. The S. pneumoniae genome consists 517 (1991)). In addition, in developing countries, the pneu 55 of one circular, covalently closed, double-stranded DNA and mococcus is responsible for the death of a large number of a collection of so-called variable accessory elements, such as children under the age of 5 years from pneumococcal pneu prophages, plasmids, transposons and the like. Most physical monia. The incidence of pneumococcal disease is highest in characteristics and almost all of the genes of S. pneumoniae infants under 2 years of age and in people over 60years of age. are unknown. Among the few that have been identified, most Pneumococci are the second most frequent cause (after Hae 60 have not been physically mapped or characterized in detail. mophilus influenzae type b) of bacterial meningitis and otitis Only a few genes of this organism have been sequenced. (See, media in children. With the recent introduction of conjugate for instance current versions of GENBANK and other nucleic vaccines for H. influenzae type b, pneumococcal meningitis is acid databases, and references that relate to the genome of S. likely to become increasingly prominent. S. pneumoniae is pneumoniae such as those set out elsewhere herein.) the most important etiologic agent of community-acquired 65 It is clear that the etiology of diseases mediated or exacer pneumonia in adults and is the second most common cause of bated by S. pneumoniae, infection involves the programmed bacterial meningitis behind Neisseria meningitidis. expression of S. pneumoniae genes, and that characterizing US 8,168.205 B2 3 4 the genes and their patterns of expression would add dramati ORFs of the present invention, particular those polypeptides cally to our understanding of the organism and its host inter that have a pharmacological activity. actions. Knowledge of S. pneumoniae genes and genomic The present invention further includes recombinant con organization would improve our understanding of disease structs comprising one or more fragments of the Streptococ etiology and lead to improved and new ways of preventing, 5 cus pneumoniae genome of the present invention. The recom ameliorating, arresting and reversing diseases. Moreover, binant constructs of the present invention comprise vectors, characterized genes and genomic fragments of S. pneumo Such as a plasmid or viral vector, into which a fragment of the niae would provide reagents for, among other things, detect Streptococcus pneumoniae has been inserted. ing, characterizing and controlling S. pneumoniae infections. The present invention further provides host cells contain There is a need to characterize the genome of S. pneumoniae 10 ing any of the isolated fragments of the Streptococcus pneu and for polynucleotides of this organism. moniae genome of the present invention. The host cells can be a higher eukaryotic host cell. Such as a mammalian cell, a SUMMARY OF THE INVENTION lower eukaryotic cell. Such as a yeast cell, or a procaryotic cell Such as a bacterial cell. The present invention is based on the sequencing of frag 15 The present invention is further directed to isolated ments of the Streptococcus pneumoniae genome. The pri polypeptides and proteins encoded by ORFs of the present mary nucleotide sequences which were generated are pro invention. A variety of methods, well known to those of skill vided in SEQID NOS:1-391. in the art, routinely may be utilized to obtain any of the The present invention provides the nucleotide sequence of polypeptides and proteins of the present invention. For several hundred contigs of the Streptococcus pneumoniae instance, polypeptides and proteins of the present invention genome, which are listed in tables below and set out in the having relatively short, simple sequences readily Sequence Listing Submitted herewith, and representative can be synthesized using commercially available automated fragments thereof, in a form which can be readily used, ana peptide synthesizers. Polypeptides and proteins of the present lyzed, and interpreted by a skilled artisan. In one embodi invention also may be purified from bacterial cells which ment, the present invention is provided as contiguous strings 25 naturally produce the protein. Yet another alternative is to of primary sequence information corresponding to the nucle purify polypeptide and proteins of the present invention from otide sequences depicted in SEQID NOS:1-391. cells which have been altered to express them. The present invention further provides nucleotide The invention further provides methods of obtaining sequences which are at least 95% identical to the nucleotide homologs of the fragments of the Streptococcus pneumoniae sequences of SEQID NOS:1-391. 30 genome of the present invention and homologs of the proteins The nucleotide sequence of SEQID NOS:1-391, a repre encoded by the ORFs of the present invention. Specifically, sentative fragment thereof, or a nucleotide sequence which is by using the nucleotide and amino acid sequences disclosed at least 95% identical to the nucleotide sequence of SEQID herein as a probe or as primers, and techniques such as PCR NOS:1-391 may be provided in a variety of mediums to cloning and colony/plaque hybridization, one skilled in the facilitate its use. In one application of this embodiment, the 35 art can obtain homologs. sequences of the present invention are recorded on computer The invention further provides antibodies which selec readable media. Such media includes, but is not limited to: tively bind polypeptides and proteins of the present invention. magnetic storage media, such as floppy discs, hard disc Stor Such antibodies include both monoclonal and polyclonal age medium, and magnetic tape; optical storage media Such antibodies. as CD-ROM; electrical storage media such as RAM and 40 The invention further provides hybridomas which produce ROM; and hybrids of these categories such as magnetic/ the above-described antibodies. A hybridoma is an immortal optical storage media. ized cell line which is capable of secreting a specific mono The present invention further provides systems, particu clonal antibody. larly computer-based systems which contain the sequence The present invention further provides methods of identi information herein described stored in a data storage means. 45 fying test samples derived from cells which express one of the Such systems are designed to identify commercially impor ORFs of the present invention, or a homolog thereof. Such tant fragments of the Streptococcus pneumoniae genome. methods comprise incubating a test sample with one or more Another embodiment of the present invention is directed to of the antibodies of the present invention, or one or more of fragments of the Streptococcus pneumoniae genome having the DFs of the present invention, under conditions which particular structural or functional attributes. Such fragments 50 allow a skilled artisan to determine if the sample contains the of the Streptococcus pneumoniae genome of the present ORF or product produced therefrom. invention include, but are not limited to, fragments which In another embodiment of the present invention, kits are encode peptides, hereinafter referred to as open reading provided which contain the necessary reagents to carry out frames or ORFs, fragments which modulate the expression of the above-described assays. an operably linked ORF, hereinafter referred to as expression 55 Specifically, the invention provides a compartmentalized modulating fragments or EMFs, and fragments which can be kit to receive, in close confinement, one or more containers used to diagnose the presence of Streptococcus pneumoniae which comprises: (a) a first container comprising one of the in a sample, hereinafter referred to as diagnostic fragments or antibodies, or one of the DFs of the present invention; and (b) DFS. one or more other containers comprising one or more of the Each of the ORFs in fragments of the Streptococcus pneu 60 following: wash reagents, reagents capable of detecting pres moniae genome disclosed in Tables 1-3, and the EMFs found ence of bound antibodies or hybridized DFs. 5' to the ORFs, can be used in numerous ways as polynucle Using the isolated proteins of the present invention, the otide reagents. For instance, the sequences can be used as present invention further provides methods of obtaining and diagnostic probes or amplification primers for detecting or identifying agents capable of binding to a polypeptide or determining the presence of a specific microbe in a sample, to 65 protein encoded by one of the ORFs of the present invention. selectively control gene expression in a host and in the pro Specifically, such agents include, as further described below, duction of polypeptides, such as polypeptides encoded by antibodies, peptides, carbohydrates, pharmaceutical agents US 8,168.205 B2 5 6 and the like. Such methods comprise steps of: (a) contacting of the sequences. The primary nucleotide sequences gener an agent with an isolated protein encoded by one of the ORFs ated by sequencing the fragments are provided in SEQ ID of the present invention; and (b) determining whether the NOS:1-391. (As used herein, the “primary sequence” refers agent binds to said protein. to the nucleotide sequence represented by the IUPAC nomen The present genomic sequences of Streptococcus pneumo clature system.) niae will be of great value to all laboratories working with this In addition to the aforementioned Streptococcus pneumo organism and for a variety of commercial purposes. Many niae polynucleotide and polynucleotide sequences, the fragments of the Streptococcus pneumoniae genome will be immediately identified by similarity searches against Gen present invention provides the nucleotide sequences of SEQ Bank or protein databases and will be of immediate value to ID NOS: 1-391, or representative fragments thereof, in a form Streptococcus pneumoniae researchers and for immediate 10 which can be readily used, analyzed, and interpreted by a commercial value for the production of proteins or to control skilled artisan. gene expression. As used herein, a “representative fragment of the nucle The methodology and technology for elucidating extensive otide sequence depicted in SEQID NOS:1-391” refers to any genomic sequences of bacterial and other genomes has and portion of the SEQ ID NOS:1-391 which is not presently will greatly enhance the ability to analyze and understand 15 represented within a publicly available database. Preferred chromosomal organization. In particular, sequenced contigs representative fragments of the present invention are Strep and genomes will provide the models for developing tools for tococcus pneumoniae open reading frames (ORFs), expres the analysis of chromosome structure and function, including sion modulating fragment (EMFs) and fragments which can the ability to identify genes within large segments of genomic be used to diagnose the presence of Streptococcus pneumo DNA, the structure, position, and spacing of regulatory ele niae in Sample (DFs). A non-limiting identification of pre ments, the identification of genes with potential industrial ferred representative fragments is provided in Tables 1-3. As applications, and the ability to do comparative genomic and discussed in detail below, the information provided in SEQID molecular phylogeny. NOS:1-391 and in Tables 1-3 together with routine cloning, synthesis, sequencing and assay methods will enable those DESCRIPTION OF THE FIGURES 25 skilled in the art to clone and sequence all “representative fragments' of interest, including open reading frames encod FIG. 1 is a block diagram of a computer system (102) that can be used to implement computer-based systems of present ing a large variety of Streptococcus pneumoniae proteins. invention. While the presently disclosed sequences of SEQID NOS: FIG. 2 is a schematic diagram depicting the data flow and 1-391 are highly accurate, sequencing techniques are not computer programs used to collect, assemble, edit and anno 30 perfect and, in relatively rare instances, further investigation tate the contigs of the Streptococcus pneumoniae genome of of a fragment or sequence of the invention may reveal a the present invention. Both Macintosh and Unix platforms are nucleotide sequence error present in a nucleotide sequence used to handle the AB373 and 377 sequence data files, largely disclosed in SEQID NOS:1-391. However, once the present as described in Kerlavage et al., Proceedings of the Twenty invention is made available (i.e., once the information in SEQ Sixth Annual Hawaii International Conference on System 35 ID NOS:1-391 and Tables 1-3 has been made available), Sciences, 585, IEEE Computer Society Press, Washington resolving a rare sequencing error in SEQID NOS:1-391 will D.C. (1993). Factura (AB) is a Macintosh program designed be well within the skill of the art. The present disclosure for automatic vector sequence removal and end-trimming of makes available Sufficient sequence information to allow any sequence files. The program Loadis runs on a Macintosh of the described contigs or portions thereof to be obtained platform and parses the feature data extracted from the 40 readily by Straightforward application of routine techniques. sequence files by Factura to the Unix based Streptococcus Further sequencing of Such polynucleotide may proceed in pneumoniae relational database. Assembly of contigs (and like manner using manual and automated sequencing meth whole genome sequences) is accomplished by retrieving a ods which are employed ubiquitous in the art. Nucleotide specific set of sequence files and their associated features sequence editing Software is publicly available. For example, using ExtrSeq, a Unix utility for retrieving sequences from an SQL database. The resulting sequence file is processed by 45 Applied Biosystems (AB) AutoAssembler can be used as an seq filter to trim portions of the sequences with more than 2% aid during visual inspection of nucleotide sequences. By ambiguous nucleotides. The sequence files were assembled employing Such routine techniques potential errors readily using TIGR Assembler, an assembly engine designed at The may be identified and the correct sequence then may be ascer Institute for Genomic Research (TIGR) for rapid and accurate tained by targeting further sequencing effort, also of a routine assembly of thousands of sequence fragments. The collection 50 nature, to the region containing the potential error. of contigs generated by the assembly step is loaded into the Even if all of the very rare sequencing errors in SEQ ID database with the lassie program. Identification of open read NOS:1-391 were corrected, the resulting nucleotide ing frames (ORFs) is accomplished by processing contigs sequences would still be at least 95% identical, nearly all with Zorf or GenMark. The ORFs are searched against S. would be at least 99% identical, and the great majority would pneumoniae sequences from GenBank and againstall protein 55 beat least 99.9% identical to the nucleotide sequences of SEQ sequences using the BLASTN and BLASTP programs, ID NOS:1-391. described in Altschul et al., J. Mol. Biol. 215: 403-410 As discussed elsewhere herein, polynucleotides of the (1990)). Results of the ORF determination and similarity present invention readily may be obtained by routine appli searching steps were loaded into the database. As described cation of well known and Standard procedures for cloning and below, some results of the determination and the searches are 60 sequencing DNA. Detailed methods for obtaining libraries set out in Tables 1-3. and for sequencing are provided below, for instance. A wide variety of Streptococcus pneumoniae strains that can be used DETAILED DESCRIPTION OF ILLUSTRATIVE to prepare S. pneumoniae genomic DNA for cloning and for EMBODIMENTS obtaining polynucleotides of the present invention are avail 65 able to the public from recognized depository institutions, The present invention is based on the sequencing of frag such as the American Type Culture Collection (ATCCTM). ments of the Streptococcus pneumoniae genome and analysis While the present invention is enabled by the sequences and US 8,168.205 B2 7 8 other information herein disclosed, the S. pneumoniae strain magnetic/optical storage media. A skilled artisan can readily that provided the DNA of the present Sequence Listing, Strain appreciate how any of the presently known computer read 7/87 14.8.91, has been deposited in the ATCCTM, as a conve able mediums can be used to create a manufacture comprising nience to those of skill in the art. As a further convenience, a computer readable medium having recorded thereon a nucle library of S. pneumoniae genomic DNA, derived from the 5 otide sequence of the present invention. Likewise, it will be same strain, also has been deposited in the ATCCTM. The S. clear to those of skill how additional computer readable pneumoniae strain was deposited on Oct. 10, 1996, and was media that may be developed also can be used to create given Deposit No. 55.840, and the cDNA library was depos analogous manufactures having recorded thereon a nucle ited on Oct. 11, 1996 and was given Deposit No. 97755. The otide sequence of the present invention. genomic fragments in the library are 15 to 20 kb fragments 10 As used herein, “recorded refers to a process for storing generated by partial Sau3A1 digestion and they are inserted information on computer readable medium. A skilled artisan into the BamHI site in the well-known lambda-derived vector can readily adopt any of the presently know methods for lambda DASH II (Stratagene, La Jolla, Calif.). The provision recording information on computer readable medium togen of the deposits is not a waiver of any rights of the inventors or erate manufactures comprising the nucleotide sequence their assignees in the present Subject matter. 15 information of the present invention. The nucleotide sequences of the genomes from different A variety of data storage structures are available to a skilled strains of Streptococcus pneumoniae differ somewhat. How artisan for creating a computer readable medium having ever, the nucleotide sequences of the genomes of all Strepto recorded thereon a nucleotide sequence of the present inven coccus pneumoniae strains will be at least 95% identical, in tion. The choice of the data storage structure will generally be corresponding part, to the nucleotide sequences provided in based on the means chosen to access the stored information. SEQID NOS:1-391. Nearly all will be at least 99% identical In addition, a variety of data processor programs and formats and the great majority will be 99.9% identical. can be used to store the nucleotide sequence information of Thus, the present invention further provides nucleotide the present invention on computer readable medium. The sequences which are at least 95%, preferably 99% and most sequence information can be represented in a word process preferably 99.9% identical to the nucleotide sequences of 25 ing text file, formatted in commercially-available software SEQ ID NOS:1-391, in a form which can be readily used, such as WordPerfect and MicroSoft Word, or represented in analyzed and interpreted by the skilled artisan. the form of an ASCII file, stored in a database application, Methods for determining whether a nucleotide sequence is such as DB2, Sybase, Oracle, or the like. A skilled artisan can at least 95%, at least 99% or at least 99.9% identical to the readily adapt any number of data-processor structuring for nucleotide sequences of SEQID NOS:1-391 are routine and 30 mats (e.g., text file or database) in order to obtain computer readily available to the skilled artisan. For example, the well readable medium having recorded thereon the nucleotide known fasta algorithm described in Pearson and Lipman, sequence information of the present invention. Proc. Natl. Acad. Sci. USA 85: 2444 (1988) can be used to Computer software is publicly available which allows a generate the percent identity of nucleotide sequences. The skilled artisan to access sequence information provided in a BLASTN program also can be used to generate an identity 35 computer readable medium. Thus, by providing in computer score of polynucleotides compared to one another. readable form the nucleotide sequences of SEQID NOS:1- 391, a representative fragment thereof, or a nucleotide Computer Related Embodiments sequence at least 95%, preferably at least 99% and most preferably at least 99.9% identical to a sequence of SEQID The nucleotide sequences provided in SEQID NOS:1-391, 40 NOS:1-391 the present invention enables the skilled artisan a representative fragment thereof, or a nucleotide sequence at routinely to access the provided sequence information for a least 95%, preferably at least 99% and most preferably at least wide variety of purposes. 99.9% identical to a polynucleotide sequence of SEQ ID The examples which follow demonstrate how software NOS:1-391 may be “provided in a variety of mediums to which implements the BLAST (Altschul et al., J. Mol. Biol. facilitate use thereof. As used herein, provided refers to a 45 215:403-410 (1990)) and BLAZE (Brutlag et al., Comp. manufacture, other than an isolated nucleic acid molecule, Chem. 17:203-207 (1993)) search algorithms on a Sybase which contains a nucleotide sequence of the present inven system was used to identify open reading frames (ORFs) tion; i.e., a nucleotide sequence provided in SEQID NOS:1- within the Streptococcus pneumoniae genome which contain 391, a representative fragment thereof, or a nucleotide homology to ORFs or proteins from both Streptococcus pneu sequence at least 95%, preferably at least 99% and most 50 moniae and from other organisms. Among the ORFs dis preferably at least 99.9% identical to a polynucleotide of SEQ cussed herein are protein encoding fragments of the Strepto ID NOS:1-391. Such a manufacture provides a large portion coccus pneumoniae genome useful in producing of the Streptococcus pneumoniae genome and parts thereof commercially important proteins, such as used in (e.g., a Streptococcus pneumoniae open reading frame fermentation reactions and in the production of commercially (ORF)) in a form which allows a skilled artisan to examine the 55 useful metabolites. manufacture using means not directly applicable to examin The present invention further provides systems, particu ing the Streptococcus pneumoniae genome ora Subset thereof larly computer-based systems, which contain the sequence as it exists in nature or in purified form. information described herein. Such systems are designed to In one application of this embodiment, a nucleotide identify, among other things, commercially important frag sequence of the present invention can be recorded on com 60 ments of the Streptococcus pneumoniae genome. puter readable media. As used herein, "computer readable As used herein, “a computer-based system” refers to the media' refers to any medium which can be read and accessed hardware means, Software means, and data storage means directly by a computer. Such media include, but are not lim used to analyze the nucleotide sequence information of the ited to: magnetic storage media, such as floppy discs, hard present invention. The minimum hardware means of the com disc storage medium, and magnetic tape; optical storage 65 puter-based systems of the present invention comprises a media such as CD-ROM; electrical storage media such as central processing unit (CPU), input means, output means, RAM and ROM; and hybrids of these categories, such as and data storage means. A skilled artisan can readily appre US 8,168.205 B2 10 ciate that any one of the currently available computer-based nize that any one of the publicly available homology search systems are Suitable for use in the present invention. programs can be used as the search means for the computer As Stated above, the computer-based systems of the present based systems of the present invention. Of course, suitable invention comprise a data storage means having stored proprietary systems that may be known to those of skill also therein a nucleotide sequence of the present invention and the may be employed in this regard. necessary hardware means and software means for Support FIG. 1 provides a block diagram of a computer system ing and implementing a search means. illustrative of embodiments of this aspect of present inven As used herein, "data storage means’ refers to memory tion. The computer system 102 includes a processor 106 which can store nucleotide sequence information of the connected to a bus 104. Also connected to the bus 104 are a present invention, or a memory access means which can 10 main memory 108 (preferably implemented as random access access manufactures having recorded thereon the nucleotide memory, RAM) and a variety of secondary storage devices sequence information of the present invention. 110, such as a hard drive 112 and a removable medium stor As used herein, "search means’ refers to one or more age device 114. The removable medium storage device 114 programs which are implemented on the computer-based sys may represent, for example, a floppy disk drive, a CD-ROM tem to compare a target sequence or target structural motif 15 drive, a magnetic tape drive, etc. A removable storage with the sequence information stored within the data storage medium 116 (such as a floppy disk, a compact disk, a mag means. Search means are used to identify fragments or netic tape, etc.) containing control logic and/or data recorded regions of the present genomic sequences which match a therein may be inserted into the removable medium storage particular target sequence or target motif. A variety of known device 114. The computer system 102 includes appropriate algorithms are disclosed publicly and a variety of commer software for reading the control logic and/or the data from the cially available Software for conducting search means are and removable medium storage device 114, once it is inserted into can be used in the computer-based systems of the present the removable medium storage device 114. invention. Examples of Such software includes, but is not A nucleotide sequence of the present invention may be limited to, MacPattem (EMBL), BLASTN and BLASTX stored in a well known manner in the main memory 108, any (NCBIA). A skilled artisan can readily recognize that any one 25 of the secondary storage devices 110, and/or a removable of the available algorithms or implementing Software pack storage medium 116. During execution, Software for access ages for conducting homology searches can be adapted for ing and processing the genomic sequence (Such as search use in the present computer-based systems. tools, comparing tools, etc.) reside in main memory 108, in As used herein, a “target sequence' can be any DNA or accordance with the requirements and operating parameters amino acid sequence of six or more nucleotides or two or 30 of the operating system, the hardware system and the Soft more amino acids. A skilled artisan can readily recognize that Ware program or programs. the longer a target sequence is, the less likely a target sequence will be present as a random occurrence in the data Biochemical Embodiments base. The most preferred sequence length of a target sequence is from about 10 to 100 amino acids or from about 30 to 300 35 Other embodiments of the present invention are directed to nucleotide residues. However, it is well recognized that isolated fragments of the Streptococcus pneumoniae genome. searches for commercially important fragments, such as The fragments of the Streptococcus pneumoniae genome of sequence fragments involved in gene expression and protein the present invention include, but are not limited to fragments processing, may be of shorter length. which encode peptides and polypeptides, hereinafter open As used herein, “a target structural motif or “target 40 reading frames (ORFs), fragments which modulate the motif’ refers to any rationally selected sequence or combi expression of an operably linked ORF, hereinafter expression nation of sequences in which the sequence(s) are chosen modulating fragments (EMFs) and fragments which can be based on a three-dimensional configuration which is formed used to diagnose the presence of Streptococcus pneumoniae upon the folding of the target motif. There are a variety of in a sample, hereinafter diagnostic fragments (DFs). target motifs known in the art. Protein target motifs include, 45 As used herein, an "isolated nucleic acid molecule' or an but are not limited to, enzymic active sites and signal "isolated fragment of the Streptococcus pneumoniae sequences. Nucleic acid target motifs include, but are not genome' refers to a nucleic acid molecule possessing a spe limited to, promoter sequences, hairpin structures and induc cific nucleotide sequence which has been Subjected to puri ible expression elements (protein binding sequences). fication means to reduce, from the composition, the number A variety of structural formats for the input and output 50 of compounds which are normally associated with the com means can be used to input and output the information in the position. Particularly, the term refers to the nucleic acid mol computer-based systems of the present invention. A preferred ecules having the sequences set out in SEQID NOS:1-391, to format for an output means ranks fragments of the Strepto representative fragments thereofas described above, to poly COccus pneumoniae genomic sequences possessing varying nucleotides at least 95%, preferably at least 99% and espe degrees of homology to the target sequence or target motif. 55 cially preferably at least 99.9% identical in sequence thereto, Such presentation provides a skilled artisan with a ranking of also as set out above. sequences which contain various amounts of the target A variety of purification means can be used to generate the sequence or target motif and identifies the degree of homol isolated fragments of the present invention. These include, ogy contained in the identified fragment. but are not limited to methods which separate constituents of A variety of comparing means can be used to compare a 60 a solution based on charge, Solubility, or size. target sequence or target motif with the data storage means to In one embodiment, Streptococcus pneumoniae DNA can identify sequence fragments of the Streptococcus pneumo be enzymatically sheared to produce fragments of 15-20kb in niae genome. In the present examples, implementing soft length. These fragments can then be used to generate a Strep ware which implement the BLAST and BLAZE algorithms, tococcus pneumoniae library by inserting them into lambda described in Altschulet al., J. Mol. Biol. 215:403-410 (1990), 65 clones as described in the Examples below. Primers flanking, is used to identify open reading frames within the Streptococ for example, an ORF, such as those enumerated in Tables 1-3 cus pneumoniae genome. A skilled artisan can readily recog can then be generated using nucleotide sequence information US 8,168.205 B2 11 12 provided in SEQ ID NOS:1-391. Well known and routine encoding the next in-frame stop codon (i.e., the stop codon at techniques of PCR cloning then can be used to isolate the the 3' end of the ORF). Those of ordinary skill in the art ORF from the lambda DNA library or Streptococcus pneu appreciate that preferred fragments within each ORF moniae genomic DNA. Thus, given the availability of SEQID described in the table include fragments of each ORF which NOS:1-391, the information in Tables 1, 2 and 3, and the 5 include the entire sequence from the delineated “start” and information that may be obtained readily by analysis of the “stop' positions excepting the first and last three nucleotides sequences of SEQ ID NOS:1-391 using methods set out since these encode stop codons. Thus, polynucleotides set out above, those of skill will be enabled by the present disclosure as ORFs in the tables but lacking the three (3) 5' nucleotides to isolate any ORF-containing or other nucleic acid fragment and the three (3) 3 nucleotides are encompassed by the of the present invention. 10 The isolated nucleic acid molecules of the present inven present invention. Those of skill also appreciate that particu tion include, but are not limited to single stranded and double larly preferred are fragments within each ORF that are poly stranded DNA, and single stranded RNA. nucleotide fragments comprising polypeptide coding As used herein, an “open reading frame.” ORF, means a sequence. As defined herein, "coding sequence' includes the series of triplets coding for amino acids without any termina- 15 fragment within an ORF beginning at the first in-frame ATG tion codons and is a sequence translatable into protein. (triplet encoding methionine) and ending with the last nucle Tables 1, 2, and 3 list ORFs in the Streptococcus pneumo otide prior to the triplet encoding the 3' stop codon. Preferred niae genomic contigs of the present invention that were iden are fragments comprising the entire coding sequence and tified as putative coding regions by the GeneMark software fragments comprising the entire coding sequence, excepting using organism-specific second-order Markov probability 20 the coding sequence for the N-terminal methionine. Those of transition matrices. It will be appreciated that other criteria skill appreciate that the N-terminal methionine is often can be used, in accordance with well known analytical meth removed during post-translational processing and that poly ods, Such as those discussed herein, to generate more inclu nucleotides lacking the ATG can be used to facilitate produc sive, more restrictive, or more selective lists. tion of N-terminal fusion proteins which may be beneficial in Table 1 sets out ORFs in the Streptococcus pneumoniae 25 the production or use of genetically engineered proteins. Of contigs of the present invention that over a continuous region course, due to the degeneracy of the genetic code many poly of at least 50 bases are 95% or more identical (by BLAST nucleotides can encode a given polypeptide. Thus, the inven analysis) to a nucleotide sequence available through Gen tion further includes polynucleotides comprising a nucleotide Bank in October, 1997. sequence encoding a polypeptide sequence itself encoded by Table 2 sets out ORFs in the Streptococcus pneumoniae 30 the coding sequence within an ORF described in Tables 1-3 contigs of the present invention that are not in Table 1 and herein. Further, polynucleotides at least 95%, preferably at match, with a BLASTP probability score of 0.01 or less, a least 99% and especially preferably at least 99.9% identical in polypeptide sequence available through GenBank in October, sequence to the foregoing polynucleotides, are contemplated 1997. by the present invention. Table 3 sets out ORFs in the Streptococcus pneumoniae 35 Polypeptides encoded by polynucleotides described above contigs of the present invention that do not match signifi and elsewhere herein are also provided by the present inven cantly, by BLASTP analysis, a polypeptide sequence avail tion as are polypeptide comprising aan amino acid sequence able through GenBank in October, 1997. at least about 95%, preferably at least 97% and even more In each table, the first and second columns identify the preferably 99% identical to the amino acid sequence of a ORF by, respectively, contig number and ORF number within 40 polypeptide encoded by an ORF shown in Tables 1-3. These the contig; the third column indicates the first nucleotide of polypeptides may or may not comprise an N-terminal the ORF (actually the first nucleotide of the stop codon imme methionine. diately preceding the ORF), counting from the 5' end of the The concepts of percent identity and percent similarity of contig strand; and the fourth column, “stop (nt) indicates the two polypeptide sequences is well understood in the art. For last nucleotide of the stop codon defining the 3'end of the 45 example, two polypeptides 10 amino acids in length which ORF. differ at three amino acid positions (e.g., at positions 1,3 and In Tables 1 and 2, column five, lists the Reference for the 5) are said to have a percent identity of 70%. However, the closest matching sequence available through GenBank. same two polypeptides would be deemed to have a percent These reference numbers are the databases entry numbers similarity of 80% if, for example at position 5, the amino commonly used by those of skill in the art, who will be 50 acids moieties, although not identical, were 'similar” (i.e., familiar with their denominators. Descriptions of the nomen possessed similar biochemical characteristics). Many pro clature are available from the National Center for Biotech grams for analysis of nucleotide or amino acid sequence nology Information. Column six in Tables 1 and 2 provides similarity, such as fasta and BLAST specifically list percent the gene name of the matching sequence; column seven pro identity of a matching region as an output parameter. Thus, vides the BLAST identity score and column eight the BLAST 55 for instance, Tables 1 and 2 herein enumerate the percent similarity score from the comparison of the ORF and the identity of the highest scoring segment pair in each ORF and homologous gene; and column nine indicates the length in its listed relative. Further details concerning the algorithms nucleotides of the highest scoring segment pair identified by and criteria used for homology searches are provided below the BLAST identity analysis. and are described in the pertinent literature highlighted by the Each ORF described in the tables is defined by “start (nt)' 60 citations provided below. (5') and “stop (nt) (3) nucleotide numbers. These position It will be appreciated that other criteria can be used to numbers refer to the boundaries of each ORF and provide generate more inclusive and more exclusive listings of the orientation with respect to whether the forward or reverse types set out in the tables. As those of skill will appreciate, Strand is the coding strand and which reading frame the cod narrow and broad searches both are useful. Thus, a skilled ing sequence is contained. The “start position is the first 65 artisan can readily identify ORFs in contigs of the Strepto nucleotide of the triplet encoding a stop codon just 5' to the coccus pneumoniae genome other than those listed in Tables ORF and the “stop' position is the last nucleotide of the triplet 1-3, such as ORFs which are overlapping or encoded by the US 8,168.205 B2 13 14 opposite strand of an identified ORF in addition to those other words, in the coding region of an ORF, substitution of ascertainable using the computer-based systems of the one codon for another which encodes the same amino acid is present invention. expressly contemplated. Any specific sequence disclosed As used herein, an “expression modulating fragment,99 herein can be readily screened for errors by resequencing a EMF, means a series of nucleotide molecules which modu particular fragment, Such as an ORF, in both directions (i.e., lates the expression of an operably linked ORF or EMF. sequence both Strands). Alternatively, error Screening can be As used herein, a sequence is said to “modulate the expres performed by sequencing corresponding polynucleotides of sion of an operably linked sequence' when the expression of Streptococcus pneumoniae origin isolated by using part or all the sequence is altered by the presence of the EMF. EMFs of the fragments in question as a probe or primer. include, but are not limited to, promoters, and promoter 10 Preferred DFs of the present invention comprise at least modulating sequences (inducible elements). One class of about 17, preferably at least about 20, and more preferably at EMFs are fragments which induce the expression oran oper least about 50 contiguous nucleotides within an ORF set out ably linked ORF in response to a specific regulatory factor or in Tables 1-3. Most highly preferred DFs specifically hybrid physiological event. ize to a polynucleotide containing the sequence of the ORF EMF sequences can be identified within the contigs of the 15 from which they are derived. Specific hybridization occurs Streptococcus pneumoniae genome by their proximity to the even under stringent conditions defined elsewhere herein. ORFs provided in Tables 1-3. An intergenic segment, or a Each of the ORFs of the Streptococcus pneumoniae fragment of the intergenic segment, from about 10 to 200 genome disclosed in Tables 1, 2 and 3, and the EMFs found 5' nucleotides in length, taken from any one of the ORFs of to the ORFs, can be used as polynucleotide reagents in numer Tables 1-3 will modulate the expression of an operably linked ous ways. For example, the sequences can be used as diag ORF in a fashion similar to that found with the naturally nostic probes or diagnostic amplification primers to detect the linked ORF sequence. As used herein, an “intergenic seg presence of a specific microbe in a sample, particularly Strep ment” refers to fragments of the Streptococcus pneumoniae tococcus pneumoniae. Especially preferred in this regard are genome which are between two ORF(s) herein described. ORFs such as those of Table 3, which do not match previously EMFs also can be identified using known EMFs as a target 25 characterized sequences from other organisms and thus are sequence or target motif in the computer-based systems of the most likely to be highly selective for Streptococcus pneumo present invention. Further, the two methods can be combined niae. Also particularly preferred are ORFs that can be used to and used together. distinguish between strains of Streptococcus pneumoniae, The presence and activity of an EMF can be confirmed particularly those that distinguish medically important strain, using an EMF trap vector. An EMF trap vector contains a 30 Such as drug-resistant strains. cloning site linked to a marker sequence. A marker sequence In addition, the fragments of the present invention, as encodes an identifiable phenotype, such as antibiotic resis broadly described, can be used to control gene expression tance or a complementing nutrition auxotrophic factor, which through triple helix formation or antisense DNA or RNA, can be identified or assayed when the EMF trap vector is both of which methods are based on the binding of a poly placed within an appropriate host under appropriate condi 35 nucleotide sequence to DNA or RNA. Triple helix-formation tions. As described above, a EMF will modulate the expres optimally results in a shut-off of RNA transcription from sion of an operably linked marker sequence. A more detailed DNA, while antisense RNA hybridization blocks translation discussion of various marker sequences is provided below. A ofan mRNA molecule into polypeptide. Information from the sequence which is Suspected as being an EMF is cloned in all sequences of the present invention can be used to design three reading frames in one or more restriction sites upstream 40 antisense and triple helix-forming oligonucleotides. Poly from the marker sequence in the EMF trap vector. The vector nucleotides suitable for use in these methods are usually 20 to is then transformed into an appropriate host using known 40 bases in length and are designed to be complementary to a procedures and the phenotype of the transformed host in region of the gene involved in transcription, for triple-helix examined under appropriate conditions. As described above, formation, or to the mRNA itself, for antisense inhibition. an EMF will modulate the expression of an operably linked 45 Both techniques have been demonstrated to be effective in marker sequence. model systems, and the requisite techniques are well known As used herein, a "diagnostic fragment. DF, means a series and involve routine procedures. Triple helix techniques are of nucleotide molecules which selectively hybridize to Strep discussed in, for example, Lee et al., Nucl. Acids Res. 6:3073 tococcus pneumoniae sequences. DFs can be readily identi (1979); Cooney et al., Science 241:456 (1988); and Dervanet fied by identifying unique sequences within contigs of the 50 al., Science 251: 1360 (1991). Antisense techniques in general Streptococcus pneumoniae genome, such as by using well are discussed in, for instance, Okano, J. Neurochem. 56:560 known computer analysis software, and by generating and (1991) and Oligodeoxynucleotides as Antisense Inhibitors of testing probes or amplification primers consisting of the DF Gene Expression, CRC Press, Boca Raton, Fla. (1988)). sequence in an appropriate diagnostic format which deter The present invention further provides recombinant con mines amplification or hybridization selectivity. 55 structs comprising one or more fragments of the Streptococ The sequences falling within the scope of the present cus pneumoniae genomic fragments and contigs of the invention are not limited to the specific sequences herein present invention. Certain preferred recombinant constructs described, but also include allelic and species variations of the present invention comprise a vector, Such as a plasmid thereof. Allelic and species variations can be routinely deter or viral vector, into which a fragment of the Streptococcus mined by comparing the sequences provided in SEQID NOS: 60 pneumoniae genome has been inserted, in a forward or 1-391, a representative fragment thereof, or a nucleotide reverse orientation. In the case of a vector comprising one of sequence at least 95%, preferably at least 99% and most at the ORFs of the present invention, the vector may further least preferably 99.9% identical to SEQID NOS:1-391, with comprise regulatory sequences, including for example, a pro a sequence from another isolate of the same species. Further moter, operably linked to the ORF. For vectors comprising the more, to accommodate codon variability, the invention 65 EMFs of the present invention, the vector may further com includes nucleic acid molecules coding for the same amino prise a marker sequence or heterologous ORF operably acid sequences as do the specific ORFs disclosed herein. In linked to the EMF. US 8,168.205 B2 15 16 Large numbers of Suitable vectors and promoters are present invention produced naturally by a bacterial strain, or known to those of skill in the art and are commercially avail by other methods. Methods for isolation and purification that able for generating the recombinant constructs of the present can be employed in this regard include, but are not limited to, invention. The following vectors are provided by way of immunochromatography, HPLC, size-exclusion chromatog example. Useful bacterial vectors include phagescript, 5 raphy, ion-exchange chromatography, and immuno-affinity PsiX174, pBluescript SK, pBS KS, pNH8a, pNH16a, chromatography. pNH18a, pNH46a (available from Stratagene); pTrc99A, The polypeptides and proteins of the present invention also pKK223-3, pKK 233-3, pIDR540, pRIT5 (available from can be purified from cells which have been altered to express Pharmacia). Useful eukaryotic vectors include pWLineo, the desired polypeptide or protein. As used herein, a cell is pSV2cat, pCG44, pXT1, pSG (available from Stratagene) 10 pSVK3, pBPV, pMSG, pSVL (available from Pharmacia). said to be altered to express a desired polypeptide or protein Promoter regions can be selected from any desired gene when the cell, through genetic manipulation, is made to pro using CAT (chloramphenicol ) vectors or other duce a polypeptide or protein which it normally does not vectors with selectable markers. Two appropriate vectors are produce or which the cell normally produces at a lower level. pKK232-8 and pCM7. Particular named bacterial promoters 15 Those skilled in the art can readily adapt procedures for include lacI, lacz, T3, T7, gpt, lambda PR, and trc. Eukaryotic introducing and expressing either recombinant or synthetic promoters include CMV immediate early, HSV thymidine sequences into eukaryotic or prokaryotic cells in order to kinase, early and late SV40, LTRs from retrovirus, and mouse generate a cell which produces one of the polypeptides or metallothionein-I. Selection of the appropriate vector and proteins of the present invention. promoter is well within the level of ordinary skill in the art. Any host/vector system can be used to express one or more The present invention further provides host cells contain of the ORFs of the present invention. These include, but are ing any one of the isolated fragments of the Streptococcus not limited to, eukaryotic hosts such as HeLa cells, CV-1 cell, pneumoniae genomic fragments and contigs of the present COS cells, and Sf9 cells, as well as prokaryotic host such as invention, wherein the fragment has been introduced into the E. coli and B. subtilis. The most preferred cells are those host cell using known methods. The host cell can be a higher 25 which do not normally express the particular polypeptide or eukaryotic host cell. Such as a mammalian cell, a lower protein or which expresses the polypeptide or protein at low eukaryotic host cell. Such as a yeast cell, or a procaryotic cell, natural level. Such as a bacterial cell. "Recombinant as used herein, means that a polypeptide A polynucleotide of the present invention, such as a recom or protein is derived from recombinant (e.g., microbial or binant construct comprising an ORF of the present invention, 30 mammalian) expression systems. “Microbial refers to may be introduced into the host by a variety of well estab recombinant polypeptides or proteins made in bacterial or lished techniques that are standard in the art, such as calcium fungal (e.g., yeast) expression systems. As a product, "recom phosphate transfection, DEAE, dextran mediated transfec binant microbial defines a polypeptide or protein essentially tion and electroporation, which are described in, for instance, free of native endogenous Substances and unaccompanied by Davis, L. et al., BASIC METHODS IN MOLECULAR 35 associated native glycosylation. Polypeptides or proteins BIOLOGY (1986). expressed in most bacterial cultures, e.g., E. coli, will be free A host cell containing one of the fragments of the Strepto of glycosylation modifications; polypeptides or proteins coccus pneumoniae genomic fragments and contigs of the expressed in yeast will have a glycosylation pattern different present invention, can be used in conventional manners to from that expressed in mammalian cells. produce the gene product encoded by the isolated fragment 40 “Nucleotide sequence” refers to a heteropolymer of deox (in the case of an ORF) or can be used to produce a heterolo yribonucleotides. Generally, DNA segments encoding the gous protein under the control of the EMF. polypeptides and proteins provided by this invention are The present invention further provides isolated polypep assembled from fragments of the Streptococcus pneumoniae tides encoded by the nucleic acid fragments of the present genome and short oligonucleotide linkers, or from a series of invention or by degenerate variants of the nucleic acid frag 45 oligonucleotides, to provide a synthetic gene which is capable ments of the present invention. By “degenerate variant' is of being expressed in a recombinant transcriptional unit com intended nucleotide fragments which differ from a nucleic prising regulatory elements derived from a microbial or viral acid fragment of the present invention (e.g., an ORF) by operon. nucleotide sequence but, due to the degeneracy of the Genetic Recombinant expression vehicle or vector” refers to a plas Code, encode an identical polypeptide sequence. 50 mid or phage or virus or vector, for expressing a polypeptide Preferred nucleic acid fragments of the present invention from a DNA (RNA) sequence. The expression vehicle can are the ORFs and subfragments thereof depicted in Tables 2 comprise a transcriptional unit comprising an assembly of (1) and 3 which encode proteins. a genetic regulatory elements necessary for gene expression A variety of methodologies known in the art can be utilized in the host, including elements required to initiate and main to obtain any one of the isolated polypeptides or proteins of 55 tain transcription at a level sufficient for suitable expression the present invention. At the simplest level, the amino acid of the desired polypeptide, including, for example, promoters sequence can be synthesized using commercially available and, where necessary, an enhancer and a polyadenylation peptide synthesizers. This is particularly useful in producing signal; (2) a structural or coding sequence which is tran Small peptides and fragments of larger polypeptides. Such scribed into mRNA and translated into protein, and (3) appro short fragments as may be obtained most readily by synthesis 60 priate signals to initiate translation at the beginning of the are useful, for example, in generating antibodies against the desired coding region and terminate translation at its end. native polypeptide, as discussed further below. Structural units intended for use in yeast or eukaryotic expres In an alternative method, the polypeptide or protein is sion systems preferably include a leader sequence enabling purified from bacterial cells which naturally produce the extracellular secretion of translated protein by a host cell. polypeptide or protein. One skilled in the art can readily 65 Alternatively, where recombinant protein is expressed with employ well-known methods for isolating polypeptides and out a leader or transport sequence, it may include an N-ter proteins to isolate and purify polypeptides or proteins of the minal methionine residue. This residue may or may not be US 8,168.205 B2 17 18 Subsequently cleaved from the expressed recombinant pro Thereafter cells are typically harvested, generally by centrifu tein to provide a final product. gation, disrupted to release expressed protein, generally by “Recombinant expression system” means host cells which physical or chemical means, and the resulting crude extract is have stably integrated a recombinant transcriptional unit into retained for further purification. chromosomal DNA or carry the recombinant transcriptional Various mammalian cell culture systems can also be unit extra chromosomally. The cells can be prokaryotic or employed to express recombinant protein. Examples of mam eukaryotic. Recombinant expression systems as defined malian expression systems include the COS-7 lines of mon herein will express heterologous polypeptides or proteins key kidney fibroblasts, described in Gluzman, Cell 23:175 upon induction of the regulatory elements linked to the DNA (1981), and other cell lines capable of expressing a compat segment or synthetic gene to be expressed. 10 ible vector, for example, the C127, 3T3, CHO, HeLa and Mature proteins can be expressed in mammalian cells, BHK cell lines. yeast, bacteria, or other cells under the control of appropriate Mammalian expression vectors will comprise an origin of promoters. Cell-free translation systems can also be replication, a suitable promoter and enhancer, and also any employed to produce such proteins using RNAs derived from necessary ribosome binding sites, polyadenylation site, splice the DNA constructs of the present invention. Appropriate 15 donor and acceptor sites, transcriptional termination cloning and expression vectors for use with prokaryotic and sequences, and 5' flanking nontranscribed sequences. DNA eukaryotic hosts are described in Sambrook et al., Molecular sequences derived from the SV40 viral genome, for example, Cloning: A Laboratory Manual, 2nd Edition, Cold Spring SV40 origin, early promoter, enhancer, splice, and polyade Harbor Laboratory Press, Cold Spring Harbor, N.Y. (1989), nylation sites may be used to provide the required nontran the disclosure of which is hereby incorporated by reference in scribed genetic elements. its entirety. Recombinant polypeptides and proteins produced in bac Generally, recombinant expression vectors will include terial culture is usually isolated by initial extraction from cell origins of replication and selectable markers permitting trans pellets, followed by one or more salting-out, aqueous ion formation of the host cell, e.g., the amplicillin resistance gene exchange or size exclusion chromatography steps. Microbial of E. coli and S. cerevisiae TRP1 gene, and a promoter 25 cells employed in expression of proteins can be disrupted by derived from a highly expressed gene to direct transcription any convenient method, including freeze-thaw cycling, Soni of a downstream structural sequence. Such promoters can be cation, mechanical disruption, or use of cell lysing agents. derived from operons encoding glycolytic enzymes such as Protein refolding steps can be used, as necessary, in complet 3-phosphoglycerate kinase (PGK), alpha-factor, acid phos ing configuration of the mature protein. Finally, high perfor phatase, or heat shock proteins, among others. The heterolo 30 mance liquid chromatography (HPLC) can be employed for gous structural sequence is assembled in appropriate phase final purification steps. with translation initiation and termination sequences, and The present invention further includes isolated polypep preferably, a leader sequence capable of directing secretion of tides, proteins and nucleic acid molecules which are substan translated protein into the periplasmic space or extracellular tially equivalent to those herein described. As used herein, medium. Optionally, the heterologous sequence can encode a 35 Substantially equivalent can refer both to nucleic acid and fusion protein including an N-terminal identification peptide amino acid sequences, for example a mutant sequence, that imparting desired characteristics, e.g., stabilization or simpli varies from a reference sequence by one or more substitu fied purification of expressed recombinant product. tions, deletions, or additions, the net effect of which does not Useful expression vectors for bacterial use are constructed result in an adverse functional dissimilarity between refer by inserting a structural DNA sequence encoding a desired 40 ence and Subject sequences. For purposes of the present protein together with Suitable translation initiation and termi invention, sequences having equivalent biological activity, nation signals in operable reading phase with a functional and equivalent expression characteristics are considered Sub promoter. The vector will comprise one or more phenotypic stantially equivalent. For purposes of determining equiva selectable markers and an origin of replication to ensure lence, truncation of the mature sequence should be disre maintenance of the vector and, when desirable, provide 45 garded. amplification within the host. The invention further provides methods of obtaining Suitable prokaryotic hosts for transformation include homologs from other strains of Streptococcus pneumoniae, of strains of E. coli, B. subtilis, Salmonella typhimurium and the fragments of the Streptococcus pneumoniae genome of various species within the genera Pseudomonas and Strepto the present invention and homologs of the proteins encoded myces. Others may, also be employed as a matter of choice. 50 by the ORFs of the present invention. As used herein, a As a representative but non-limiting example, useful sequence or protein of Streptococcus pneumoniae is defined expression vectors for bacterial use can comprise a selectable as a homolog of a fragment of the Streptococcus pneumoniae marker and bacterial origin of replication derived from com fragments or contigs or a protein encoded by one of the ORFs mercially available plasmids comprising genetic elements of of the present invention, if it shares significant homology to the well known cloning vector pBR322 (ATCCTM 37017). 55 one of the fragments of the Streptococcus pneumoniae Such commercial vectors include, for example, pKK223-3 genome of the present invention or a protein encoded by one (available form Pharmacia Fine Chemicals, Uppsala, Swe of the ORFs of the present invention. Specifically, by using den) and GEM 1 (available from Promega Biotec, Madison, the sequence disclosed herein as a probe or as primers, and Wis., USA). These pBR322 “backbone' sections are com techniques such as PCR cloning and colony/plaque hybrid bined with an appropriate promoter and the structural 60 ization, one skilled in the art can obtain homologs. sequence to be expressed. As used herein, two nucleic acid molecules or proteins are Following transformation of a suitable host strain and said to “share significant homology’ if the two contain growth of the host strain to an appropriate cell density, the regions which possess greater than 85% sequence (amino selected promoter, where it is inducible, is derepressed or acid or nucleic acid) homology. Preferred homologs in this induced by appropriate means (e.g., temperature shift or 65 regard are those with more than 90% homology. Especially chemical induction) and cells are cultured for an additional preferred are those with 93% or more homology. Among period to provide for expression of the induced gene product. especially preferred homologs those with 95% or more US 8,168.205 B2 19 20 homology are particularly preferred. Very particularly pre 1. Biosynthetic Enzymes ferred among these are those with 97% and even more par Open reading frames encoding proteins involved in medi ticularly preferred among those are homologs with 99% or ating the catalytic reactions involved in intermediary and more homology. The most preferred homologs among these macromolecular metabolism, the biosynthesis of Small mol are those with 99.9% homology or more. It will be understood 5 ecules, cellular processes and other functions includes that, among measures of homology, identity is particularly enzymes involved in the degradation of the intermediary preferred in this regard. products of metabolism, enzymes involved in central inter Region specific primers or probes derived from the nucle mediary metabolism, enzymes involved in respiration, both otide sequence provided in SEQ ID NOS:1-391 or from a aerobic and anaerobic, enzymes involved in fermentation, nucleotide sequence at least 95%, particularly at least 99%, 10 enzymes involved in ATP proton motor force conversion, especially at least 99.5% identical to a sequence of SEQID enzymes involved in broad regulatory function, enzymes NOS:1-391 can be used to prime DNA synthesis and PCR involved in amino acid synthesis, enzymes involved in nucle amplification, as well as to identify colonies containing otide synthesis, enzymes involved in and vitamin cloned DNA encoding a homolog. Methods suitable to this synthesis, can be used for industrial biosynthesis. aspect of the present invention are well known and have been 15 The various metabolic pathways present in Streptococcus described in great detail in many publications such as, for pneumoniae can be identified based on absolute nutritional example, Innis et al., PCR Protocols, Academic Press, San requirements as well as by examining the various enzymes Diego, Calif. (1990)). identified in Table 1-3 and SEQID NOS:1-391. When using primers derived from SEQID NOS:1-391 or Of particular interest are polypeptides involved in the deg from a nucleotide sequence having an aforementioned iden radation of intermediary metabolites as well as non-macro tity to a sequence of SEQID NOS:1-391, one skilled in the art molecular metabolism. Such enzymes include , glu will recognize that by employing high Stringency conditions cose oxidases, and catalase. (e.g., annealing at 50-60° C. in 6xSSC and 50% formamide, Proteolytic enzymes are another class of commercially and washing at 50-65°C. in 0.5xSSC) only sequences which important enzymes. Proteolytic enzymes find use in a number are greater than 75% homologous to the primer will be ampli 25 of industrial processes including the processing of flax and fied. By employing lower stringency conditions (e.g., hybrid other vegetable fibers, in the extraction, clarification and izing at 35-37° C. in 5xSSC and 40-45% formamide, and depectinization of fruit juices, in the extraction of vegetables washing at 42°C. in 0.5xSSC), sequences which are greater oil and in the maceration of fruits and vegetables to give than 40-50% homologous to the primer will also be amplified. unicellular fruits. A detailed review of the proteolytic When using DNA probes derived from SEQ ID NOS:1- 30 enzymes used in the food industry is provided in Romboutset 391, or from a nucleotide sequence having anaforementioned al., Symbiosis 21:79 (1986) and Voragen et al. in Biocatalysts identity to a sequence of SEQ ID NOS:1-391, for colony/ In Agricultural Biotechnology, Whitaker et al., Eds. Ameri plaque hybridization, one skilled in the art will recognize that can Chemical Society Symposium Series 389:93 (1989). by employing high Stringency conditions (e.g., hybridizing at The metabolism of Sugars is an important aspect of the 50-65° C. in 5xSSC and 50% formamide, and washing at 35 primary metabolism of Streptococcus pneumoniae. Enzymes 50-65°C. in 0.5xSSC), sequences having regions which are involved in the degradation of Sugars, such as, particularly, greater than 90% homologous to the probe can be obtained, , galactose, fructose and Xylose, can be used in indus and that by employing lower stringency conditions (e.g., trial fermentation. Some of the important Sugar transforming hybridizing at 35–37° C. in 5xSSC and 40-45% formamide, enzymes, from a commercial viewpoint, include Sugar and washing at 42°C. in 0.5xSSC), sequences having regions 40 such as glucose . Other metabolic which are greater than 35-45% homologous to the probe will enzymes have found commercial use Such as glucose oxi be obtained. dases which produces ketogulonic acid (KGA). KGA is an Any organism can be used as the source for homologs of intermediate in the commercial production of ascorbic acid the present invention so long as the organism naturally using the Reichstein's procedure, as described in Krueger et expresses Sucha protein or contains genes encoding the same. 45 al., Biotechnology 6(A), Rhine et al., Eds. Verlag Press, The most preferred organism for isolating homologs are bac Weinheim, Germany (1984). teria which are closely related to Streptococcus pneumoniae. Glucose oxidase (GOD) is commercially available and has Illustrative Uses of Compositions of the Invention been used in purified form as well as in an immobilized form Each ORF provided in Tables 1 and 2 is identified with a for the deoxygenation of beer. See, for instance, Hartmeir et function by homology to a known gene or polypeptide. As a 50 al., Biotechnology Letters 1:21 (1979). The most important result, one skilled in the art can use the polypeptides of the application of GOD is the industrial scale fermentation of present invention for commercial, therapeutic and industrial gluconic acid. Market for gluconic acids which are used in the purposes consistent with the type of putative identification of detergent, textile, leather, photographic, pharmaceutical, the polypeptide. Such identifications permit one skilled in the food, feed and concrete industry, as described, for example, in art to use the Streptococcus pneumoniae ORFs in a manner 55 Bigelis et al., beginning on page 357 in GENE MANIPULA similar to the known type of sequences for which the identi TIONS AND FUNGI; Benett et al., Eds. Academic Press, fication is made; for example, to ferment a particular Sugar New York (1985). In addition to industrial applications, GOD Source or to produce a particular metabolite. A variety of has found applications in medicine for quantitative determi reviews illustrative of this aspect of the invention are avail nation of glucose in body fluids recently in biotechnology for able, including the following reviews on the industrial use of 60 analyzing syrups from starch and cellulose hydrosylates. This enzymes, for example, BIOCHEMICAL ENGINEERING application is described in Owusu et al., Biochem. et Bio AND BIOTECHNOLOGY HANDBOOK 2nd Ed., Mac physica. Acta. 872:83 (1986), for instance. Millan Publications, Ltd. NY (1991) and BIOCATALYSTS The main Sweetener used in the world today is sugar which IN ORGANIC SYNTHESES, Tramperet al., Eds. Elsevier comes from Sugar beets and Sugar cane. In the field of indus Science Publishers, Amsterdam, The Netherlands (1985). A 65 trial enzymes, the glucose isomerase process shows the larg variety of exemplary uses that illustrate this and similar est expansion in the market today. Initially, soluble enzymes aspects of the present invention are discussed below. were used and later immobilized enzymes were developed US 8,168.205 B2 21 22 (Krueger et al., Biotechnology, The Textbook of Industrial 2. Generation of Antibodies Microbiology, Sinauer Associated Incorporated, Sunderland, As described here, the proteins of the present invention, as Mass. (1990)). Today, the use of glucose-produced high fruc well as homologs thereof, can be used in a variety of proce tose syrups is by far the largest industrial business using dures and methods known in the art which are currently immobilized enzymes. A review of the industrial use of these 5 applied to other proteins. The proteins of the present inven enzymes is provided by Jorgensen, Starch 40.307 (1988). tion can further be used to generate an antibody which selec Proteinases, such as alkaline serine proteinases, are used as tively binds the protein. Such antibodies can be either mono detergent additives and thus represent one of the largest Vol clonal or polyclonal antibodies, as well fragments of these umes of microbial enzymes used in the industrial sector. antibodies, and humanized forms. 10 The invention further provides antibodies which selec Because of their industrial importance, there is a large body of tively bind to one of the proteins of the present invention and published and unpublished information regarding the use of hybridomas which produce these antibodies. A hybridoma is these enzymes in industrial processes. (See Faultman et al., an immortalized cell line which is capable of secreting a Acid Proteases Structure Function and Biology, Tang, J., ed., specific monoclonal antibody. Plenum Press, New York (1977) and Godfrey et al., Industrial In general, techniques for preparing polyclonal and mono Enzymes, MacMillan Publishers, Surrey, UK (1983) and clonal antibodies as well as hybridomas capable of producing Hepner et al., Report Industrial Enzymes by 1990, Hel Hep the desired antibody are well known in the art (Campbell, A. ner & Associates, London (1986)). M., Monoclonal Antibody Technology: Laboratory Tech Another class of commercially usable proteins of the niques. In Biochemistry And Molecular Biology, Elsevier Sci present invention are the microbial lipases, described by, for 20 ence Publishers, Amsterdam, The Netherlands (1984); St. instance, Macrae et al., Philosophical Transactions of the Grothet al., J. Immunol. Methods 35: 1-21 (1980), Kohler and Chiral Society of London 310:227 (1985) and Poserke, Jour Milstein, Nature 256:495-497 (1975)), the trioma technique, nal of the American Oil Chemist Society 61:1758 (1984). A the human B-cell hybridoma technique (Kozboret al., Immu major use of lipases is in the fat and oil industry for the nology Today 4:72 (1983), pgs. 77-96 of Cole et al., in Mono production of neutral glycerides using lipase catalyzed inter- 25 clonal Antibodies And Cancer Therapy, Alan R. Liss, Inc. esterification of readily available triglycerides. Application of (1985)). Any animal (mouse, rabbit, etc.) which is known to lipases include the use as a detergent additive to facilitate the produce antibodies can be immunized with the pseudogene removal of fats from fabrics in the course of the washing polypeptide. Methods for immunization are well known in procedures. the art. Such methods include subcutaneous or interperitoneal The use of enzymes, and in particular microbial enzymes, 30 injection of the polypeptide. One skilled in the art will rec as catalyst for key steps in the synthesis of complex organic ognize that the amount of the protein encoded by the ORF of molecules is gaining popularity at a great rate. One area of the present invention used for immunization will vary based great interest is the preparation of chiral intermediates. Prepa on the animal which is immunized, the antigenicity of the ration of chiral intermediates is of interest to a wide range of peptide and the site of injection. synthetic chemists particularly those scientists involved with 35 The protein which is used as an immunogen may be modi the preparation of new pharmaceuticals, agrochemicals, fra fied or administered in an adjuvant in order to increase the grances and flavors. (See Davies et al., Recent Advances in the protein's antigenicity. Methods of increasing the antigenicity Generation of Chiral Intermediates Using Enzymes, CRC of a protein are well known in the art and include, but are not Press, Boca Raton, Fla. (1990)). The following reactions cata limited to coupling the antigen with a heterologous protein lyzed by enzymes are of interest to organic chemists: hydroly- 40 (such as globulin orgalactosidase) or through the inclusion of sis of carboxylic acid esters, phosphate esters, amides and an adjuvant during immunization. nitriles, esterification reactions, trans-esterification reactions, For monoclonal antibodies, spleen cells from the immu synthesis of amides, reduction of alkanones and oxoalkan nized animals are removed, fused with myeloma cells, such as ates, oxidation of alcohols to carbonyl compounds, oxidation SP2/0-Ag14 myeloma cells, and allowed to become mono of Sulfides to Sulfoxides, and carbon bond forming reactions 45 clonal antibody producing hybridoma cells. Such as the aldol reaction. Any one of a number of methods well known in the art can When considering the use of an enzyme encoded by one of be used to identify the hybridoma cell which produces an the ORFs of the present invention for biotransformation and antibody with the desired characteristics. These include organic synthesis it is sometimes necessary to consider t