<<

US 20070O83334A1 (19) United States (12) Patent Application Publication (10) Pub. No.: US 2007/0083334 A1 Mintz et al. (43) Pub. Date: Apr. 12, 2007

(54) METHODS AND SYSTEMS FOR (60) Provisional application No. 60/322,285, filed on Sep. ANNOTATING BOMOLECULAR 14, 2001. Provisional application No. 60/322,359, SEQUENCES filed on Sep. 14, 2001. Provisional application No. 60/322,506, filed on Sep. 14, 2001. Provisional appli (75) Inventors: Liat Mintz, Kendall-Park, NJ (US); cation No. 60/324,524, filed on Sep. 26, 2001. Pro Hanging Xie, Plainsboro, NJ (US); visional application No. 60/354.242, filed on Feb. 6, Dvir Dahari, Tel Aviv (IL); Erez 2002. Provisional application No. 60/371,494, filed Levanon, Petach Tikva (IL); Shiri on Apr. 11, 2002. Provisional application No. 60/384, Freilich, Haifa (IL); Nili Beck, Kfar 096, filed on May 31, 2002. Provisional application Saba (IL); Wei-Yong Zhu, Plainsboro, No. 60/397,784, filed on Jul. 24, 2002. NJ (US); Alon Wasserman, Tel Aviv (IL); Chen Hermesh, Mishmar Hashiva Publication Classification (IL); Idit Azar, Tel Aviv (IL); Jeanne Bernstein, Kfar Yona (IL) (51) Int. Cl. G06F 9/00 (2006.01) Correspondence Address: (52) U.S. Cl...... 702/19, 702/20 Martin D. Moynihan PRTSI, Inc. (57) ABSTRACT P.O. Box 16446 Arlington, VA 22215 (US) A method of annotating biomolecular sequences. The method comprises (a) computationally clustering the bio (73) Assignee: Compugen Ltd. molecular sequences according to a progressive homology range, to thereby generate a plurality of clusters each being (21) Appl. No.: 11/443,428 of a predetermined homology of the homology range; and (22) Filed: May 31, 2006 (b) assigning at least one ontology to each cluster of the plurality of clusters, the at least one ontology being: (i) Related U.S. Application Data derived from an annotation preassociated with at least one biomolecular sequence of each cluster, and/or (ii) generated (63) Continuation of application No. 10/242,799, filed on from analysis of the at least one biomolecular sequence of Sep. 13, 2002, now abandoned. each cluster thereby annotating biomolecular sequences. Patent Application Publication Apr. 12, 2007 Sheet 1 of 25 US 2007/0083334 A1

CO N y

CS V g) sew s O) ou Patent Application Publication Apr. 12, 2007 Sheet 2 of 25 US 2007/0083334 A1

r CO wa

i

s c Patent Application Publication Apr. 12, 2007 Sheet 3 of 25 US 2007/0083334 A1 Figure 2

gastrontestinal systern

gastrointestinal glands -. Ngastrointestinal tract / Y. -16 A- issuire tieyer gal \ -4y Visetia esophageal gastric cal ecta NN.bowel issaepitheliurn di Sadler liver ar sets of earra carcer N largethers Y sudder tepa T toy recret juriura

4. er Patent Application Publication Apr. 12, 2007 Sheet 4 of 25 US 2007/0083334 A1

N

ef CD s 3) ept

.2

C ---s Patent Application Publication Apr. 12, 2007 Sheet 5 of 25 US 2007/0083334 A1 Figure 4 (page 1/9) 1 tumor 1.1 epithelial cell tumors 1.1.1 carcinoma 1.1.1.1 adenocarcinoma 1.1.1.2 lobular carcinoma 1.2 Mesenchinal cell tumors 1.2.1 sarconna .2.1.1 liposarcoma 1.2.1.2 rhabdomyosarcoma 1.2.1.3 pnet 1.2.1.4 ewing sarcoma 1.3 blood funnors 1.3.1 lymphoma 1.3.2 leukemia 1.3.3 myeloma 1.4 endocrine tumors 1.4.1 pheocromocytoma 1.4.2 carcinoid Patent Application Publication Apr. 12, 2007 Sheet 6 of 25 US 2007/0083334 A1 Figure 4 (page 2/9) 2 endocrine System 2.1 adrenal 2.1.1 pheocromocytoma 2.2 pancreas 2.2.1 islets of Langerhans 2.3 neuroendocrine 2.3.1 hypothalamus 2.3.2 carcinoid 2.4 thyroid

3 vascular tissue 3.1 arteries 3.1.1 aorta 3.2 vein Patent Application Publication Apr. 12, 2007 Sheet 7 of 25 US 2007/0083334 A1 Figure 4 (page3/9) 4 genitourinary system 4.1 urinary system 4.1.1 bladder 4.1.2 kidney 4.2 genital system 4.2.1 women genital System 4.2.1.1 cervix 4.2.É.2 ovary 4.2.1.3 uterus 4.2.1.3.1 endometriun 4.2.2 men gentile system 4.2.2.1 prostate 4.2.2.2 testis 4.2.2.2.1 epididymis

5 muscles 5.1 rhabdomyosarcoma 5.2 tongue 5.3 bladder 5.4 heart 5.5 uterus Patent Application Publication Apr. 12, 2007 Sheet 8 of 25 US 2007/0083334 A1 Figure 4 (page 4/9) 6 Blood 6.1 peripheral blood 6.1.1 erythroid line 6.1.2 leukocyte 6.1.2.1 lymphoid system 6.1.2.1.1 lymphoma 6.1.2.1.2 spleen 6.1.2.1.3 thalamus 6.2 stern cells 6.2.1 myeloid 6.2.2 myeloma 6.3 Bone narroW 6.4 leukemia Patent Application Publication Apr. 12, 2007 Sheet 9 of 25 US 2007/0083334 A1 Figure 4 (page 5/9) 7 nerve system 7.1 CNS, central nervous system 7.1.1 brain 7.1.1.1 cerebrum 7.1.1.2 cerebellum 7.1.1.3 pitituary gland 7.1.1.4 hypothalamus 7.1.1.5 thalamus 7.1.1.6 olfactory 7.1.1.7 Hippocampus 7.1.1.8 amygdala 7.1.1.9 frontal lobe 7.1.10 pnet 7.2 Embryonal nerve system 7.2.1 primitive neuroectoderm 7.3 retina 7.3.1 retinoblastoma Patent Application Publication Apr. 12, 2007 Sheet 10 of 25 US 2007/0083334 A1 Figure 4 (page 6/9) 8 breast 8.1 ductal breast 8.1.1 ductal carcinoma 8.2 lobular carcinoma 8.3 mammary

9 Skeleton 9.1 bore 9.1.1 eving sarcoma 9.1.2 craniofacial 9.1.2.1 calvarium 9.2 connective tissue 9.2.1 trabeculae 9.2.2 cartilage

10 embryo 10.1 annion 10.2 chorion 10.3primitive neuroectoderm 10.4 placenta Patent Application Publication Apr. 12, 2007 Sheet 11 of 25 US 2007/0083334 A1 Figure 4 (page 7/9) 11 exocrine system 11.1 pancreas 11.1.1 islets of Langerhans 11.2prostate 11.3salivary gland Patent Application Publication Apr. 12, 2007 Sheet 12 of 25 US 2007/0083334 A1 Figure 4 (Page 8/9) 12 face organs 12.1 nose 12.2 ear 12.2.1 cochlea 12.3eye 12.3.1 retina 12.3.1.1 retinobastonia 12.3.2 lenis f2.4 mouth 12.5 tongue

13 gastrointestinal system 13.1 UICOSa 13.2stomach 13.3 intestine 13.3.1 colorectal 13.3.1.1 colon 13.4 hepatobiliary system 13.4.1 liver 13.4.2 biliary system 13.4.2.1 gall bladder 13.5 pancreas 13.5.1 islets of Langerhans Patent Application Publication Apr. 12, 2007 Sheet 13 of 25 US 2007/0083334 A1 Figure 4 (page 9/9) 14 respiratory system 14.1 nasopharynx 14.2 lung 14.2.1 small cell lung carcinona

15 Skin 15.1 dernis 15.1.1 melanocyte

16 fat tissue 16.1 iposarcoma Patent Application Publication Apr. 12, 2007 Sheet 14 of 25 US 2007/0083334 A1

120

s 100 gif AARNTSEEstegger":Ey. s s i.A. A.5 80 ' C : .S. 60 l U A. dy 40 w) s 3 220

O O 10 20 30 40 50 60 70 80 90 100 LOD Score Figure 5 Patent Application Publication Apr. 12, 2007 Sheet 15 of 25 US 2007/0083334 A1

ce o d d e e s e as ed go t s f ad n sugaolfsäsuoo Joaquin N Patent Application Publication Apr. 12, 2007 Sheet 16 of 25 US 2007/0083334 A1 i

s suiao i?suolo Joaquit N Patent Application Publication Apr. 12, 2007 Sheet 17 of 25 US 2007/0083334 A1

Subolji/suo joaqun N Patent Application Publication Apr. 12, 2007 Sheet 18 of 25 US 2007/0083334 A1

brain stomach placenta kidney spleen thymus ovary heart lung testis liver breast pancreas i muscle adenocarcinoma colon normal Duke A Duke B Duke C Duke D B C s D genomic DNA Patent Application Publication Apr. 12, 2007 Sheet 19 of 25 US 2007/0083334 A1

89_InáH Patent Application Publication Apr. 12, 2007 Sheet 20 of 25 US 2007/0083334 A1

6a.m.??, Patent Application Publication Apr. 12, 2007 Sheet 21 of 25 US 2007/0083334 A1

Brain Stomach Placenta Kidney Spleen Thymus Adenocarcinoma Colon Normal i Colon Duke’s A Colon Duke's B Colon Duke's C Colon Duke's D . Patent Application Publication Apr. 12, 2007 Sheet 22 of 25 US 2007/0083334 A1

2u Ouoleo fun SN #

Ouseboa At ON 9ff

t E. e

SeAe I uOSSeIdx peZeu.ION eA)be Patent Application Publication Apr. 12, 2007 Sheet 23 of 25 US 2007/0083334 A1

euouble fun SN is 9 oup23-OaAyyppo. 9i Gi (ouapW) 0900 gif vil (ouapy) oz.90 vil coodog)-euoup Jelouapf Eli oup JeouenbSylplay Zi Ou)Je3-uenbS gppoA is eu Ougojeo souenbS oil euoup Jeo souenbS 6: i (lood-og) ouple3-uenbS gy Og) euoupoleo SoueabS is (uo) snouenbS fun 9ty eulouse ZFS)). Gil NiOZ-9) reuo, fun. Eit ood peup og 'ulo fundi

in a r n r are N N a e SeAe UOISSeIdx peZIIeu.ION 9Agee Patent Application Publication Apr. 12, 2007 Sheet 24 of 25 US 2007/0083334 A1

A 2ULuoup J23 ft. In SNY 9 OuJeboat p-po, 9 gli (ouapw) 0990 Gli ity (ouapW) OZ-9) will to Oro-buloupe3Ouapy Ei ougeo-tuenbS gpa ety ouble3-uenbSuppo Li eu Ougoue) souenbS Oy euouplje3 SoulenbS 6 i too dog) ouple-uebS 9 O) 2uoupe) SoulenbS is (uo) soughbS fung euoupleo ZSO is NiO29O euo fun ey lood use upogu ON fun Ziy

in to to r n - o v SeAe UOISSeIdx peZeu.ION eAgee

US 2007/0083334 A1 Apr. 12, 2007

METHODS AND SYSTEMIS FOR ANNOTATING from a large number of clones. Each sequencing reaction BIOMOLECULAR SEQUENCES generates 300 base pairs or so of sequence that represents a unique sequence tag for a particular transcript. An EST RELATED APPLICATIONS sequencing project is technically simple to execute since it 0001. This application is a continuation of pending U.S. requires only a cDNA library, automated DNA sequencing patent application Ser. No. 10/242,799, filed on Sep. 13, capabilities and Standard bioinformatics protocols. 2002, which claims the benefit of U.S. Provisional Patent 0010. To generate meaningful amounts of data, however, Application Nos. 60/322,285, filed on Sep. 14, 2001; high throughput template preparation, sequencing and 60/322,359, filed on Sep. 14, 2001: 60/322,506, filed on Sep. analysis protocols must be applied. As such, the number of 14, 2001: 60/324,524, filed on Sep. 26, 2001; 6f(0/354,242, new genes identified as well as the statistical significance of filed on Feb. 6, 2002: 60/371,494, filed on Apr. 11, 2002: the data is proportional to the number of clones sequenced 60/384,096, filed on May 31, 2002; and 60/397,784, filed on as well as the complexity of the tissue being analyzed Jul. 24, 2002. The contents of the above Applications are all Adams et al. (1995) Nature 377:3-173; Hillier et al. (1996) incorporated herein by reference. Genome Res. 6:807-828). FIELD AND BACKGROUND OF THE 0011 Subtractive cloning Subtractive cloning offers an INVENTION inexpensive and flexible alternative to EST sequencing and cDNA array hybridization. In this approach, double 0002 The present invention relates to systems and meth stranded cDNA is created from the two-cell or tissue popu ods useful for annotating biomolecular sequences. More lations of interest, linkers are ligated to the ends of the cDNA particularly, the present invention relates to computational fragments and the cDNA pools are then amplified by PCR. approaches, which enable systemic characterization of bio The cDNA pool from which unique clones are desired is molecular sequences and identification of differentially designated the “tester, and the cDNA pool that is used to expressed biomolecular sequences such as sequences asso Subtract away shared sequences is designated the “driver. ciated with a pathology. Following initial PCR amplification, the linkers are removed 0003. In the postgenomic era, data analysis rather than from both cDNA pools and unique linkers are ligated to the data collection presents the biggest challenge to biologists. tester sample. The tester is then hybridized to a vast excess Efforts to ascribe biological meaning to genomic data, of driver DNA and sequences that are unique to the tester whether by identification of function, structure or expression cDNA pool are amplified by PCR. pattern are lagging behind sequencing efforts Boguski MS 0012. The primary limitation of subtractive methods is (1999) Science 286:453-455). that they are not always comprehensive. The cDNAs iden 0004. It is well recognized that elucidation of spatial and tified are typically those, which differ significantly in expres temporal patterns of gene expression in healthy and diseased sion level between cell-populations and Subtle quantitative states may contribute immensely to further understanding of differences are often missed. In addition each experiment is disease mechanisms. a pair wise comparison and since Subtractions are based on a series of sensitive biochemical reactions it is difficult to 0005 Therefore, any observational method that can rap directly compare a series of RNA samples. idly, accurately and economically observe and measure the pattern of expression of selected individual genes or of 0013 Differential display—Differential display is whole genomes is of great value to Scientists. another PCR-based differential cloning method Liang and Pardee (1992) Science 257:967-70; Welsh et al. (1992) 0006. In recent years, a variety of techniques have been Nucleic Acids Res. 20:4965-70). In classical differential developed to analyze differential gene expression. However, display, reverse transcription is primed with either oligo-dT current observation and measurement methods are inaccu or an arbitrary primer. Thereafter an arbitrary primer is used rate, time consuming, labor intensive or expensive, often in conjunction with the reverse transcription primer to times requiring complex molecular and biochemical analy amplify cloNA fragments and the cDNA fragments are sis of numerous gene sequences. separated on a polyacrylamide gel. Differences in gene 0007 For example, observation methods for individual expression are visualized by the presence or absence of mRNA or cDNA molecules such as Northern blot analysis, bands on the gel and quantitative differences in gene expres RNase protection, or selective hybridization to arrayed sion are identified by differences in the intensity of bands. cDNA libraries see Sambrook et al. (1989) Molecular Adaptation of differential display methods for fluorescent cloning. A laboratory manual, Cold Spring Harbor press, DNA sequencing machines has enhanced the ability to NY depend on specific hybridization of a single oligonucle quantify differences in gene expression Kato (1995) otide probe complementary to the known sequence of an Nucleic Acids Res. 18:3685-90). individual molecule. Since a single human cell is estimated 0014) A limitation of the classical differential display to express 10,000-30,000 genes Liang et al. (1992) Science approach is that false positive results are often generated 257:967-971), single probe methods to identify all during PCR or in the process of cloning the differentially sequences in a complex sample are ineffective and laborious. expressed PCR products. Although a variety of methods 0008. Other approaches for high throughput analysis of have been developed to discriminate true from false posi differential gene expression are summarized infra. tives, these typically rely on the availability of relatively 0009 EST sequencing The basic idea is to create cDNA large amounts of RNA. libraries from tissues of interest, pick clones randomly from 0015 Serial analysis of gene expression (SAGE)—this these libraries and then perform a single sequencing reaction DNA sequence based method is essentially an accelerated US 2007/0083334 A1 Apr. 12, 2007 version of EST sequencing Valculescu et al. (1995) Science 0023 There is thus a widely recognized need for, and it 270:484-8). In this method a digestible unique sequence tag would be highly advantageous to have, systems and methods of 13 or more bases is generated for each transcript in the which can be used for efficient retrieval and processing of cell or tissue of interest, thereby generating a SAGE library. data from biological databases thereby enabling annotation 0016 Sequencing each SAGE library creates transcript of previously un-annotated biomolecular sequences. profiles. Since each sequencing reaction yields information for twenty or more genes, it is possible to generate data SUMMARY OF THE INVENTION points for tens of thousands of transcripts in modest 0024. According to one aspect of the present invention sequencing efforts. The relative abundance of each gene is there is provided a method of annotating biomolecular determined by counting or clustering sequence tags. The sequences according to a hierarchy of interest, the method advantages of SAGE over many other methods include the comprising: (a) computationally constructing a dendrogram high throughput that can be achieved and the ability to having multiple nodes, the dendrogram representing the accumulate and compare SAGE tag data from a variety of hierarchy of interest, wherein each node of the multiple samples, however the technical difficulties concerning the nodes of the dendrogram is annotated by at least one generation of good SAGE libraries and data analysis are keyword; (b) computationally assigning each biomolecular significant. sequence of the biomolecular sequences to a specific node of the multiple nodes of the dendrogram to thereby generate 0017 Altogether, it is clear from the above that labora assigned biomolecular sequences; and (c) computationally tory bench approaches are ineffective, time consuming, classifying each of the assigned biomolecular sequences to expensive and often times inaccurate in handling and pro nodes hierarchically higher than the specific node, thereby cessing the vast amount of genomic information which is annotating biomolecular sequences according to the hierar now available. chy of interest. 0018. It is appreciated, that much of the analysis can be 0025. According to another aspect of the present inven effected by developing computational algorithms, which can tion there is provided a method of identifying differentially be applied to mining data from existing databases, thereby expressed biomolecular sequences, the method comprising: retrieving and integrating valuable biological information. (a) computationally constructing a dendrogram having mul 0019. To date, there are more than a hundred major tiple nodes, the dendrogram representing the hierarchy of biomolecule databases and application servers on the Inter interest, wherein each node of the multiple nodes of the net and new sites are being introduced at an ever-increasing dendrogram is annotated by at least one keyword; (b) rates Ashburner and Goodman (1997) Curr. Opin. Genet. computationally assigning each biomolecular sequence of Dev. 7:750-756; Karp (1998) Trends Biochem. Sci. 23:114 the biomolecular sequences to a specific node of the multiple 116). nodes of the dendrogram to thereby generate assigned biomolecular sequences; (c) computationally classifying 0020. However, these databases are organized in each of the assigned biomolecular sequences to nodes hier extremely heterogeneous formats. These reflect the inherent archically higher than the specific node, to thereby generate complexity of biological data, ranging from plain-text annotated biomolecular sequences; and (d) identifying anno nucleic acid and protein sequences, through the three dimen tated biomolecular sequences assigned to a portion of the sional structures of therapeutic drugs and macromolecules multiple nodes, thereby identifying differentially expressed and high resolution images of cells and tissues, to microar biomolecular sequences. ray-chip outputs. Moreover data structures are constantly evolving to reflect new research and technology develop 0026. According to yet another aspect of the present ment. invention there is provided a computer readable storage medium comprising a database stored in a retrievable man 0021. The heterogeneous and dynamic nature of these ner, the database including files each containing data of a biological databases present major obstacles in mining data specific node of a dendrogram, the data including biomo relevant to specific biological queries. Clearly, simple lecular sequence information and biomolecular sequence retrieval of data is not sufficient for data mining; efficient annotations, wherein the biomolecular sequence annotations data retrieval requires flexible data manipulation and Sophis are selected from the group consisting of contig description, ticated data integration. Efficient data retreival requires the tissue specific expression, pathological specific expression, use of complex queries across multiple heterogeneous data functional features, parameters for ontological annotation Sources; data warehousing by merging data derived from assignment, cellular localization, database sequence Source multiple public sources and local (i.e., private) sources; and and functional alterations. multiple data-analysis procedures that require feeding Sub sets of data derived from different sources into various 0027 According to still another aspect of the present application programs for gene finding, protein-structure invention there is provided a system for generating a data prediction, functional domain or motif identification, phy base of annotated biomolecular sequences, the system com logenetic tree construction, graphic presentation and so prising a processing unit, the processing unit executing a forth. Software application configured for: (a) constructing a den drogram having multiple nodes, the dendrogram represent 0022. Current biological data retrieval systems are not ing a hierarchy of interest, wherein each node of the multiple fully up to the demand of smooth and flexible data integra nodes of the dendrogram is annotated by at least one tion Etzold et al. (1996) Methods Enzymol 266:t14-t28; keyword; (b) assigning each biomolecular sequence of the Schuler et al. (1996) Methods Enzymol. 266:141-162: biomolecular sequences to a specific node of the multiple Chung and Wong (1999) Trends Biotech. 17:351-355). nodes of the dendrogram to thereby generate assigned US 2007/0083334 A1 Apr. 12, 2007 biomolecular sequences; (c) classifying each of the assigned step of confirming annotations of the assigned biomolecular biomolecular sequences to nodes hierarchically higher than sequence in-vivo and/or in-vitro prior to or following step the specific node, to thereby generate annotated biomolecu (c). lar sequences; and (d) storing sequence annotations and 0040 According to an additional aspect of the present sequence information of the annotated biomolecular invention there is provided a method of identifying sequence sequences, thereby generating the database of annotated features unique to differentially expressed mRNA splice biomolecular sequences. variants, the method comprising: (a) computationally iden 0028. According to further features in preferred embodi tifying unique sequence features in each splice variant of an ments of the invention described below, the biomolecular alternatively spliced expressed sequences; and (b) identify sequences are selected from the group consisting of ing differentially expressed splice variants of the alterna polypeptide sequences and polynucleotide sequences. tively spliced expressed sequences, thereby identifying sequence features unique to differentially expressed mRNA 0029. According to still further features in the described splice variants. preferred embodiments the polynucleotides are selected from the group consisting of genomic sequences, expressed 0041 According to yet an additional aspect of the present sequence tags, contigs, complementary DNA (cDNA) invention there is provided a computer readable storage sequences, pre-messenger RNA (mRNA) sequences, and medium comprising data stored in a retrievable manner, the mRNA sequences. data including sequence information of sequence features unique to differentially expressed mRNA splice variants as 0030. According to still further features in the described set forth in files: preferred embodiments the biomolecular sequences are selected from the group consisting of annotated biomolecu 0042 “Transcripts nucleotide seqs part1. lar sequences, unannotated biomolecular sequences and 0043 “Transcripts nucleotide seqs part2” partially annotated biomolecular sequences. 0044 “Transcripts nucleotide seqs part3.new’ and/or 0031. According to still further features in the described preferred embodiments the method further comprising 0045 “Protein.seqs' homology clustering of the biomolecular sequences prior to 0046) provided in CD-ROMs 1 and/or 2 enclosed here step (b). with, and sequence annotations as set forth in annotation 0032. According to still further features in the described categories "HTAA CD and “HTAA TIS, in the file "Sum preferred embodiments the dendrogram is selected from the mary table.new” of CD-ROM3 enclosed herewith. group consisting of a graph, a list, a map and a matrix. 0047 According to still an additional aspect of the present invention there is provided a system for generating 0033 According to still further features in the described a database of sequence features unique to differentially preferred embodiments the hierarchy of interest is selected expressed mRNA splice variants, the system comprising a from the group consisting of a tissue expression hierarchy, processing unit, the processing unit executing a software a developmental expression hierarchy, a pathological application configured for: (a) identifying unique sequence expression hierarchy, a cellular expression hierarchy, an features in each splice variant of an alternatively spliced intracellular expression hierarchy, a taxonomical hierarchy expressed sequences; and (b) identifying differentially and a functional hierarchy. expressed splice variants of the alternatively spliced expressed sequences, thereby identifying sequence features 0034. According to still further features in the described unique to differentially expressed mRNA splice variants. (c) preferred embodiments each node of the multiple nodes is a storing the sequence features unique to the differentially parental node in an additional hierarchy of interest. expressed mRNA splice variants, thereby generating the 0035. According to still further features in the described database of sequence features unique to differentially preferred embodiments the method further comprising clas expressed mRNA splice variants. Sifying the biomolecular sequences of the parental node 0048. According to still further features in the described according to the additional hierarchy of interest. preferred embodiments step (b) is effected by qualifying 0036). According to still further features in the described annotations associated with the alternatively spliced preferred embodiments the system further comprising clas expressed sequences. Sifying the biomolecular sequences of the parental node 0049 According to still further features in the described according to the additional hierarchy of interest. preferred embodiments the method further comprising scor 0037 According to still further features in the described ing the annotations associated with the alternatively spliced preferred embodiments each of the biomolecular sequences expressed sequences according to: (i) prevalence of the is a member of a sequence contig. alternatively spliced expressed sequences in normal tissues; (ii) prevalence of the alternatively spliced expressed 0038 According to still further features in the described sequences in pathological tissues; (iii) prevalence of the preferred embodiments the method further comprising the alternatively spliced expressed sequence in total tissues; and step of confirming annotations of the assigned biomolecular (iv) number of tissues and/or tissue types expressing the sequence in-vivo and/or in-vitro prior to or following step alternatively spliced expressed sequences; (c). 0050. According to still further features in the described 0039. According to still further features in the described preferred embodiments the system further comprising scor preferred embodiments the system further comprising the ing the annotations associated with the alternatively spliced US 2007/0083334 A1 Apr. 12, 2007 expressed sequences according to: (i) prevalence of the 0068 According to still further features in the described alternatively spliced expressed sequences in normal tissues; preferred embodiments the at least one oligonucleotide is (ii) prevalence of the alternatively spliced expressed designed and configured for RNA hybridization. sequences in pathological tissues; (iii) prevalence of the alternatively spliced expressed sequence in total tissues; and 0069. According to yet a further aspect of the present (iv) number of tissues and/or tissue types expressing the invention there is provided a method of annotating biomo alternatively spliced expressed sequences; lecular sequences, the method comprising: (a) computation ally clustering the biomolecular sequences according to a 0051. According to still further features in the described progressive homology range, to thereby generate a plurality preferred embodiments step (b) is effected by identifying the of clusters each being of a predetermined homology of the unique sequence feature. homology range; and (b) assigning at least one ontology to each cluster of the plurality of clusters, the at least one 0.052 According to still further features in the described ontology being: (i) derived from an annotation preassociated preferred embodiments the unique sequence feature is with at least one biomolecular sequence of each cluster; selected from the group consisting of a donor-acceptor and/or (ii) generated from analysis of the at least one concatenation, an alternative exon, an exon and a retained biomolecular sequence of each cluster thereby annotating intron. biomolecular sequences. 0053 According to still further features in the described 0070 According to still a further aspect of the present preferred embodiments identifying unique sequence features invention there is provided a system for generating a data in each splice variant of an alternatively spliced expressed base of annotated biomolecular sequences, the system com sequence is effected by expressed sequence alignment. prising a processing unit, the processing unit executing a 0054 According to a further aspect of the present inven Software application configured for: (a) clustering the bio tion there is provided a kit useful for detecting differentially molecular sequences according to a progressive homology expressed polynucleotide sequences, the kit comprising at range, to thereby generate a plurality of clusters each being least one oligonucleotide being designed and configured to of a predetermined homology of the homology range; and be specifically hybridizable with a polynucleotide sequence (b) assigning at least one ontology to each cluster of the selected from the group consisting of sequence files: plurality of clusters, the at least one ontology being: (i) derived from an annotation preassociated with at least one 0.055 “Transcripts nucleotide seqs part 1 biomolecular sequence of each cluster; and/or (ii) generated 0056 “Transcripts nucleotide seqs part2 and from analysis of the at least one biomolecular sequence of each cluster, to thereby annotate the biomolecular 0057. “Transcripts nucleotide seqs part3.new’ sequences; and (c) storing sequence annotations and sequence information of the annotated biomolecular 0.058 provided in CD-ROMs 1 and/or 2 enclosed here sequences, thereby generating the database of annotated with, under moderate to stringent hybridization conditions. biomolecular sequences. 0059) According to still further features in the described preferred embodiments the at least one oligonucleotide is 0071 According to still a further aspect of the present labeled. invention there is provided a computer readable storage medium comprising a database stored in a retrievable man 0060 According to still further features in the described ner, the database including sequence information as set forth preferred embodiments the at least one oligonucleotide is in files: attached to a solid . 0072 “Transcripts nucleotide seqs part 1 0061 According to still further features in the described preferred embodiments the solid substrate is configured as a 0073. “Transcripts nucleotide seqs part2” microarray and whereas the at least one oligonucleotide includes a plurality of oligonucleotides each being capable 0074 “Transcripts nucleotide seqs part3.new’ and/or of hybridizing with a specific polynucleotide sequence of the 0075) “Proteinseqs” polynucleotide sequences set forth in the files: 0.076 provided in CD-ROMs 1 and/or 2 enclosed here 0062 “Transcripts nucleotide seqs part 1 with, and sequence ontological annotations in #GO P. iGO F, HGO C annotation categories in file "Summa 0063 “Transcripts nucleotide seqs part2 and/or ry table.new” of CD-ROM3 enclosed herewith. 0064 “Transcripts nucleotide seqs part3.new’ 0077 According to still further features in the described 0065 provided in CD-ROMs 1 and/or 2 enclosed here preferred embodiments the biomolecular sequences are with. selected from the group consisting of polynucleotide sequences and polypeptide sequences. 0.066 According to still further features in the described preferred embodiments each of the plurality of oligonucle 0078. According to still further features in the described otides is being attached to the microarray in a regio-specific preferred embodiments the homology range is between a. 99%-35%. 0067. According to still further features in the described 0079 According to still further features in the described preferred embodiments the at least one oligonucleotide is preferred embodiments the analysis of the at least one designed and configured for DNA hybridization. biomolecular sequence includes literature text mining. US 2007/0083334 A1 Apr. 12, 2007

0080 According to still further features in the described 24-28, 35-38, 12 and 29-31 wherein presence of the biomo preferred embodiments the analysis of the at least one lecular sequence indicates colon cancer in the Subject. biomolecular sequence includes cellular localization predic 0095 According to still a further aspect of the present tion. invention there is provided method of diagnosing lung 0081. According to still further features in the described cancer in a subject, the method comprising identifying in the preferred embodiments the analysis of the at least one Subject the presence or absence of a biomolecular sequence biomolecular sequence includes homology analysis. selected from the group consisting of SEQID NOs: 15, 18, 0082) According to still further features in the described 21 and 32 wherein presence of the biomolecular sequence preferred embodiments the at least one ontology is selected indicates lung cancer in the Subject. from the group consisting of molecular biology, microbiol 0096. According to still a further aspect of the present ogy, developmental biology, immunology, Virology, bio invention there is provided a method of diagnosing Ewing chemistry, physiology, pharmacology, medicine, bioinfor sarcoma in a Subject, the method comprising identifying in matics, cell biology, endocrinology, structural biology, the subject the presence or absence of a biomolecular mathematics, chemistry, medicine, plant Sciences, neurol sequence as set forth in SEQID NO: 7, wherein presence of ogy, genetics, Zoology, ecology, genomics, cheminformat the biomolecular sequence indicates Ewing sarcoma in the ics, computer Sciences, statistics, physics and artificial intel Subject. ligence. 0097 According to still a further aspect of the present 0083. According to still further features in the described invention there is provided a computer readable storage preferred embodiments the ontology includes a Subontology. medium comprising data stored in a retrievable manner, the data including sequence information of differentially 0084. According to still further features in the described preferred embodiments the method further comprising scor expressed biomolecular sequences as set forth in files: ing the at least one ontology assigned to a cluster of the 0098 “Transcripts nucleotide seqs part 1 plurality of clusters according to: (i) a degree of homology 0099 “Transcripts nucleotide seqs part2” characterizing the cluster; and (ii) relevance of annotation to information obtained from literature text mining. 0.100 “Transcripts nucleotide seqs part3.new’ and 0085. According to still further features in the described 0101) “Protein.seqs' preferred embodiments the system further comprising scor 0102 provided in CD-ROMs 1 and/or 2 enclosed here ing the at least one ontology assigned to a cluster of the with, and sequence annotations as set forth in annotation plurality of clusters according to: (i) a degree of homology categories “SA’’’ and “R.A. in the file "Summary table.new’ characterizing the cluster; and (ii) relevance of annotation to of CD-ROM3 enclosed herewith. information obtained from literature text mining. 0103). According to still a further aspect of the present 0.086 According to still further features in the described invention there is provided a computer readable storage preferred embodiments the method further comprising gen medium comprising data stored in a retrievable manner, the erating a sequence profile to each cluster of the plurality of data including sequence information of biomolecular clusters following step (b). sequences exhibiting gain of function or loss of function as 0087. According to still further features in the described set forth in files: preferred embodiments the system further comprising gen 0.104) “Transcripts nucleotide seqs partl” erating a sequence profile to each cluster of the plurality of clusters following step (b). 01.05 “Transcripts nucleotide seqs part2” 0088 According to still a further aspect of the present 0106) “Transcripts nucleotide seqs part3.new' and invention there is provided a computer readable storage 0.107) “Protein.seqs' medium, comprising a database stored in a retrievable manner, the database including biomolecular sequence 0108) provided in CD-ROMs 1 and/or 2 enclosed here information as set forth in files: with, and sequence annotations as set forth in annotation category “DN, in the file "Summary table.new’ of CD 0089. “Transcripts nucleotide seqs part 1 ROM3 enclosed herewith. 0090 “Transcripts nucleotide seqs part2” 0.109 According to still further features in the described preferred embodiments the database further includes infor 0.091 “Transcripts nucleotide seqs part3.new’ and/or mation pertaining to generation of the data and potential 0092 “Protein.seqs' uses of the data. 0093 provided in CD-ROMs 1 and/or 2 enclosed here 0110. According to still further features in the described with, and biomolecular sequence annotations as set forth in preferred embodiments the medium is selected from the file “Summary table.new” of CD-ROM 3 enclosed here group consisting of a magnetic storage medium, an optical with. storage medium and an optico-magnetic storage medium. 0094. According to still a further aspect of the present 0111. According to still further features in the described invention there is provided a method of diagnosing colon preferred embodiments the database further includes infor cancer in a subject, the method comprising identifying in the mation pertaining to gain and/or loss of function of the Subject the presence or absence of a biomolecular sequence differentially expressed mRNA splice variants or polypep selected from the group consisting of SEQ ID NOs: 4, 39. tides encoded thereby. US 2007/0083334 A1 Apr. 12, 2007

0112 The present invention successfully addresses the self-validation studies. Only predictions made with LOD shortcomings of the presently known configurations by scores above 2 were evaluated and used for GO annotation providing methods and systems useful for systematically process. annotating biomolecular sequences. 0.122 FIGS. 6a-care histograms showing the distribution 0113. Unless otherwise defined, all technical and scien of proteins (closed squares) and contigs (opened squares) tific terms used herein have the same meaning as commonly from Ensembl version 1.0.0 in the major nodes of three GO understood by one of ordinary skill in the art to which this categories—cellular component (FIG. 6a), molecular func invention belongs. Although methods and materials similar tion (FIG. 6b), and biological process (FIG. 6c). or equivalent to those described herein can be used in the practice or testing of the present invention, Suitable methods 0123 FIG. 7 illustrates results from RT-PCR analysis of and materials are described below. In case of conflict, the the expression pattern of the AA535072 (SEQ ID NO:39) patent specification, including definitions, will control. In colorectal cancer-specific transcript. The following cell and addition, the materials, methods, and examples are illustra tissue samples were tested: B-colon carcinoma cell line tive only and not intended to be limiting. SW480 (ATCC-228): C-colon carcinoma cell line SW620 (ATCC-227); D-colon carcinoma cell line colo-205 (ATCC BRIEF DESCRIPTION OF THE DRAWINGS 222). Colon normal tissue indicates a pool of 10 different samples, (Biochain, cat no A406029). The adenocarcinoma 0114. The invention is herein described, by way of sample represents a pool of spleen, lung, stomach and example only, with reference to the accompanying drawings. kidney adenocarcinomas, obtained from patients. Each of With specific reference now to the drawings in detail, it is the tissues (i.e., colon carcinoma samples Duke's A-D; and stressed that the particulars shown are by way of example normal muscle, pancreas, breast, liver, testis, lung, heart, and for purposes of illustrative discussion of the preferred ovary, thymus, spleen kidney, placenta, stomach, brain) embodiments of the present invention only, and are pre were obtained from 3-6 patients and pooled. sented in the cause of providing what is believed to be the most useful and readily understood description of the prin 0.124 FIG. 8 illustrates results from RT-PCR analysis of ciples and conceptual aspects of the invention. In this regard, the expression pattern of the AA513157 (SEQ ID NO: 7) no attempt is made to show structural details of the invention Ewing sarcoma specific transcript. The (+) or (-) symbols, in more detail than is necessary for a fundamental under indicate presence or absence of in the standing of the invention, the description taken with the reaction mixture. A molecular weight standard is indicated drawings making apparent to those skilled in the art how the by M. Tissue samples (i.e., Ewing sarcoma samples, spleen several forms of the invention may be embodied in practice. adenocarcinoma, brain, prostate and thymus) were obtained from patients. The Ln-CAP human prostatic adenocarci 0115) In the drawings: noma cell line was obtained from the ATCC (Manassas, Va.). 0116 FIG. 1a illustrates a system designed and config 0.125 FIG. 9 is an autoradiogram of a northern blot ured for generating a database of annotated biomolecular analysis depicting tissue distribution and expression levels sequences according to the teachings of the present inven of AA513157 (SEQ ID NO: 7) Ewing sarcoma specific tion. transcript. Arrows indicate the molecular weight of 28S and 0117 FIG. 1b illustrates a remote configuration of the 18S ribosomal RNA subunits. The indicated tissue samples system described in FIG. 1a. were obtained from patients and SK-ES-1-Ewing sarcoma 0118 FIG. 2 illustrates a gastrointestinal tissue hierarchy cell-line was obtained from the ATCC (CRL-1427). dendogram generated according to the teachings of the 0.126 FIG. 10 illustrates results from semi quantitative present invention. RT-PCR analysis of the expression pattern of the AA469088 (SEQ ID NO: 40) colorectal specific transcript. Colon nor 0119 FIG. 3 is a scheme illustrating multiple alignment mal was obtained from Biochain, cat no: A406029. The of alternatively spliced expressed sequences with a genomic adenocarcinoma sample represents a pool of spleen, lung, sequence including 3 exons (A, B and C) and two introns. stomach and kidney adenocarcinomas, obtained from Two alternative splicing events are described: One from the patients. Each of all other tissues (i.e., colon carcinoma donor site, which involves an AB junction, between donor samples Duke’s A-D; and normal thymus, spleen, kidney, and proximal acceptor and an AC junction, between donor placenta, stomach, brain) were obtained from 3-6 patients and distal acceptor; A Second alternative splicing event is and pooled. described from the acceptor site, which involves AC junc tion, between distal donor and acceptor and BC junction, 0.127 FIG. 11 is a histogram depicting Real-Time RT between proximal donor and acceptor. PCR quantification of copy number, of a lung specific transcript, (SEQ ID NO: 15). Amplification products 0120 FIG. 4 is a tissue hierarchy dendogram generated obtained from the following tissues were quantified; normal according to the teachings of the present invention. The salivary gland from total RNA (Clontech, cat no: 64110-1); higher annotation levels are marked with a single number, lung normal from pooled adult total RNA (BioChain, cat no: i.e., 1-16. The lower annotation levels are marked within the A409363): lung tumor squamos cell carcinoma (Clontech, relevant category as one-four numbers after the point (e.g. 4. cat no: 64013-1); lung tumor squamos cell carcinoma (Bio genitourinary system; 4.2 genital system; 4.2.1 women Chain, cat no: A409017); pooled lung tumor squamos cell genital system; 4.2.1.1 cervix). carcinoma (BioChain, cat no: A411075); moderately differ 0121 FIG. 5 is a graph illustrating a correlation between entiated squamos cell carcinoma (BioChain, cat no: LOD scores of textual information analysis and accuracy of A409091); well differentiated squamos cell carcinoma (Bio ontological annotation prediction. Results are based on Chain, cat no: A408175); pooled adenocarcinoma (Bio US 2007/0083334 A1 Apr. 12, 2007

Chain, cat no: A411076); moderately differentiated alveolus obtained from the following tissues and cell-lines were cell carcinoma (BioChain, cat no: A409089); non-small cell quantified; Samples 1-6 are commercial normal lung lung carcinoma cell line H1299; The following normal and samples (BioChain, CDP-061010; A503205, A503384, tumor samples were obtained from patients: normal lung A503385, A503204, A503206, A409363). Sample 7 is lung (internal number-CG-207N), lung carcinoma (internal num well differentiated adenocarcinoma (BioChain, CDP ber-CG-72), squamos cell carcinoma (internal number-CG 064004A. A504117). Sample 8 is lung moderately differen 196), squamos cell carcinoma (internal number-CG-207), tiated adenocarcinoma (BioChain, CDP-064004A. lung adenocarcinoma (internal number-CG-120), lung A504119). Sample 9 is lung moderately to poorly differen adenocarcinoma (internal number-CG-160). Copy number tiated adenocarcinoma (BioChain, CDP-064004A. was normalized to the levels of expression of the house A504116). Sample 10 is lung well differentiated adenocar keeping genes Proteasome 26S Subunit (dark columns) and cinoma (BioChain, CDP-064004A. A504118). Samples GADPH (bright columns). 11-16 are lung adenocarcinoma samples obtained from patients. Sample 17 is lung moderately differentiated squa 0128 FIG. 12 is a histogram depicting Real-Time RT mous cell carcinoma (BioChain, CDP-064004B: A503187). PCR quantification of copy number, of the lung specific Sample 18 is lung Squamous cell carcinoma (BioChain, transcript (SEQ ID NO: 32). Amplification products CDP-064004B: A503386). Samples 20-21 are lung moder obtained from the following tissues and cell-lines were ately differentiated squamous cell carcinoma (BioChain, quantified: lung normal from pooled adult total RNA (Bio CDP-064004B; A503387, A503383). Sample 22 is lung Chain, cat no: A409363); lung tumor squamos cell carci squamous cell carcinoma pooled (BioChain, CDP-064004B; noma (Clontech, cat no: 64013-1); lung tumor squamos cell A411075). Samples 23-26 and sample31 are lung squamous carcinoma (BioChain, cat no: A409017); pooled lung tumor cell carcinoma obtained from patients. Sample 27 is lung squamos cell carcinoma (BioChain, cat no: A411075); mod squamous cell carcinoma (Clontech, 64013-1). Sample 28 is erately differentiated Squamos cell carcinoma (BioChain, cat lung squamous cell carcinoma (BioChain, A409017). no: A409091); well differentiated squamos cell carcinoma Sample 29 is lung moderately differentiated squamous cell (BioChain, cat no: A408175); pooled adenocarcinoma (Bio carcinoma (BioChain, CDP-064004B: A409091). Sample Chain, cat no: A411076); moderately differentiated alveolus 30 is lung well differentiated Squamous cell carcinoma cell carcinoma (BioChain, cat no: A409089); non-small cell (BioChain, CDP-064004B: A408175). Samples 32-35 are lung carcinoma cell line H1299; The following normal and lung small cell carcinoma (BioChain, CDP-064004D: tumor samples were obtained from patients: normal lung A504115, A501390, A501389, A501391). Sample 36-37 are (internal number-CG-207N), lung carcinoma (internal num lung large cell carcinoma (BioChain, CDP-064004C: ber-CG-72), squamos cell carcinoma (internal number-CG A504113, A504114). Sample 38 is lung moderately differ 196), squamos cell carcinoma (internal number-CG-207), entiated alveolus cell carcinoma (BioChain, A409089). lung adenocarcinoma (internal number-CG-120), lung Sample 39 is lung carcinoma obtained from patient. Sample adenocarcinoma (internal number-CG-160). Copy number 40 is lung H1299 non-small cell carcinoma cell line. Sample was normalized to the levels of expression of the house 41 is normal salivary gland sample (Clontech, 64110-1). keeping genes Proteasome 26S Subunit (dark columns) and Copy number was normalized to the levels of expression of GADPH (bright columns). the housekeeping genes Proteasome 26S Subunit (dark col 0129 FIG. 13 is a histogram depicting Real-Time RT umns) and GADPH (bright columns). PCR quantification of copy number, of the lung specific DESCRIPTION OF THE PREFERRED transcript (SEQ ID NO: 18). Amplification products EMBODIMENTS obtained from the following tissues and cell-lines were quantified: lung normal from pooled adult total RNA (Bio 0131 The present invention is of methods and systems, Chain, cat no: A409363); lung tumor squamos cell carci which can be used for annotating biomolecular sequences. noma (Clontech, cat no: 64013-1); lung tumor squamos cell Specifically, the present invention can be used to identify carcinoma (BioChain, cat no: A409017); pooled lung tumor and annotate differentially expressed biomolecular squamos cell carcinoma (BioChain, cat no: A411075); mod sequences, such as differentially expressed alternatively erately differentiated Squamos cell carcinoma (BioChain, cat spliced sequences. no: A409091); well differentiated squamos cell carcinoma 0.132. The principles and operation of the present inven (BioChain, cat no: A408175); pooled adenocarcinoma (Bio tion may be better understood with reference to the drawings Chain, cat no: A411076); moderately differentiated alveolus and accompanying descriptions. cell carcinoma (BioChain, cat no: A409089); non-small cell lung carcinoma cell line H1299; The following normal and 0.133 Before explaining at least one embodiment of the tumor samples were obtained from patients: normal lung invention in detail, it is to be understood that the invention (internal number-CG-207N), lung carcinoma (internal num is not limited in its application to the details of construction ber-CG-72), squamos cell carcinoma (internal number-CG and the arrangement of the components set forth in the 196), squamos cell carcinoma (internal number-CG-207), following description or illustrated in the drawings. The lung adenocarcinoma (internal number-CG-120), lung invention is capable of other embodiments or of being adenocarcinoma (internal number-CG-160). Copy number practiced or carried out in various ways. Also, it is to be was normalized to the levels of expression of the house understood that the phraseology and terminology employed keeping genes Proteasome 26S Subunit (dark columns) and herein is for the purpose of description and should not be GADPH (bright columns). regarded as limiting. 0130 FIG. 14 is a histogram depicting Real-Time RT 0.134 Terminology PCR quantification of copy number, of a lung specific 0.135). As used herein, the term "oligonucleotide' refers to transcript (SEQ ID NO: 21). Amplification products a single stranded or double Stranded oligomer or polymer of US 2007/0083334 A1 Apr. 12, 2007 ribonucleic acid (RNA) or deoxyribonucleic acid (DNA) or 0145 The term “annotation” refers to a functional or mimetics thereof. This term includes oligonucleotides com structural description of a sequence, which may include posed of naturally-occurring bases, Sugars and covalent identifying attributes such as locus name, keywords, Med internucleoside linkages (e.g., backbone) as well as oligo line references, cloning data, information of coding region, nucleotides having non-naturally-occurring portions which regulatory regions, catalytic regions, name of encoded pro function similarly. Such modified or substituted oligonucle tein, Subcellular localization of the encoded protein, protein otides are often preferred over native forms because of hydrophobicity, protein function, mechanism of protein desirable properties such as, for example, enhanced cellular function, information on metabolic pathways, regulatory uptake, enhanced affinity for nucleic acid target and pathways, protein-protein interactions and tissue expression increased stability in the presence of nucleases. profile. 0136. The phrase “complementary DNA (cDNA) refers to the double stranded or single stranded DNA molecule, The Ontological Annotation Approach which is synthesized from a messenger RNA template. 0146 An ontology refers to the body of knowledge in a 0137 The term “contig” refers to a series of overlapping specific knowledge domain or discipline Such as molecular sequences with Sufficient identity to create a longer contigu biology, microbiology, immunology, Virology, plant sci ous sequence. A plurality of contigs may form a cluster. ences, pharmaceutical chemistry, medicine, neurology, Clusters are generally formed based upon a specified degree endocrinology, genetics, ecology, genomics, proteomics, of homology and overlap (e.g., a stringency). The different cheminformatics, pharmacogenomics, bioinformatics, com contigs in a cluster do not typically represent the entire puter Sciences, statistics, mathematics, chemistry, physics sequence of the gene, rather the gene may comprise one or and artificial intelligence. more unknown intervening sequences between the defined contigs. 0147 An ontology includes domain-specific concepts— referred to herein as Sub-ontologies. A Sub-ontology may be 0138. The term “cluster” refers to a nucleic acid sequence cluster or a protein sequence cluster. The former refers to a classified into Smaller and narrower categories. group of nucleic acid sequences which share a requisite level 0.148. The ontological annotation approach of the present of homology and or other similar traits according to a given invention is effected as follows. clustering criterion; and the latter refers to a group of protein 0.149 First, biomolecular sequences are computationally sequences which share a requisite level of homology and/or clustered according to a progressive homology range, other similar traits according to a given clustering criterion. thereby generating a plurality of clusters each being of a 0.139. A process and/or method to group nucleic acid or predetermined homology of the homology range. protein sequences as Such is referred to as clustering, which is typically performed by a clustering (i.e., alignment) 0.150 Progressive homology according to this aspect of application program implementing a cluster algorithm. the present invention is used to identify meaningful homolo gies among biomolecular sequences and thereby assign new 0140. As used herein the phrase “biomolecular ontological annotations to sequences, which share requisite sequences’ refers to amino acid sequences (i.e., peptides, levels of homologies. Essentially, a biomolecular sequence polypeptides) and nucleic acid sequences, which include but is assigned to a specific cluster if displays a predetermined are not limited to genomic sequences, expressed sequence homology to at least one member of the cluster (i.e., single tags, contigs, complementary DNA (cDNA) sequences, pre linkage). As used herein “progressive homology range' messenger RNA (mRNA) sequences, and mRNA sequences. refers to a range of homology thresholds, which progress via 0141. With the presentation of the human genome work predetermined increments from a low homology level (e.g. ing draft, data analysis rather than data collection presents 35%) to a high homology level (e.g. 99%). Further descrip the biggest challenge to biologists. Efforts to ascribe bio tion of a progressive homology range is provided in the logical meaning to genomic data, include the development Examples section which follows. of advanced wet laboratorial techniques as well as comput erized algorithms. While the former are limited due to 0151. Following generation of clusters, one or more inaccuracy, time consumption, labor intensiveness and costs ontologies are assigned to each cluster. Ontologies are the latter are still unfeasible due to the poor organization of derived from an annotation preassociated with at least one on hand sequence databases as well as the composite nature biomolecular sequence of each cluster, and/or generated by of biological data. analyzing (e.g., text-mining) at least one biomolecular sequence of each cluster thereby annotating biomolecular 0142. As is further described hereinbelow, the present inventors have developed a computer-based approach for the Sequences. functional, spatial and temporal analysis of biological data. 0152 Any annotational information identified and/or The present methodology generates comprehensive data generated according to the teachings of the present invention bases which greatly facilitate the use of available genetic can be stored in a database which can be generated by a information in both research and commercial applications. Suitable computing platform. 0143. As is further described hereinunder, the present 0153. Thus, the method according to this aspect of the invention encompasses several novel approaches for anno present invention provides a novel approach for annotating tating biomolecular sequences. biomolecular sequences even on a scale of a genome, a 0144) “Annotating refers to the act of discovering and/or transcriptom (i.e., the repertoire of all messenger RNA assigning an annotation (i.e., critical or explanatory notes or molecules transcribed from a genome) or a proteom (i.e., the comment) to a biomolecular sequence of the present inven repertoire of all proteins translated from messenger RNA tion. molecules). This enables transcriptome-wise comparative US 2007/0083334 A1 Apr. 12, 2007

analyses (e.g., analyzing chromosomal distribution of FIT, FASTA, and TFASTA in the Wisconsin Genetics Soft human genes) and cross-transcriptome comparative studies ware Package Release 7.0, Genetics Computer Group, 575 (e.g., comparing expressed data across species) both of Science Dr., Madison, Wis. which may involve various Subontologies such as molecular 0.161 Another example of an algorithm which is suitable function, biological process and cellular localization. for sequence alignment is the BLAST algorithm, which is 0154 Biomolecular sequences which can be used as described in Altschul et al., J. Mol. Biol. 215:403-410 working material for the annotating process according to this (1990). Software for performing BLAST analyses is pub aspect of the present invention can be obtained from a licly available through the National Center for Biotechnol biomolecular sequence database. Such a database can ogy Information (http://www.ncbi.nlm.nih.gov/). include protein sequences and/or nucleic acid sequences derived from libraries of expressed messenger RNA i.e., 0162 Since the present invention requires processing of expressed sequence tags (EST). cDNA clones, contigs, large amounts of data, sequence alignment is preferably pre-mRNA, which are prepared from specific tissues or effected using assembly Software. cell-lines or from whole organisms. 0.163 A number of commonly used computer software 0155 This database can be a pre-existing publicly avail fragment read assemblers capable of forming clusters of able database i.e., GenBank database maintained by the expressed sequences, and aligning members of the cluster National Center for Biotechnology Information (NCBI), part (individually or as an assembled contig) with other of the National Library of Medicine, and the TIGR database sequences (e.g., genomic database) are now available. These maintained by The Institute for Genomic Research, Blocks packages include but are not limited to. The TIGR Assem database maintained by the Fred Hutchinson Cancer bler Sutton G. et al. (1995) Genome Science and Technol Research Center, Swiss-Prot site maintained by the Univer ogy 1:9-19), GAP Bonfield J. K. et al. (1995) Nucleic Acids sity of Geneva and GenPept maintained by NCBI and Res. 23:4992-4999), CAP2 Huang X. et al. (1996) Genom including public protein-sequence database which contains ics 33:21-31), the Genome Construction Manager Laurence all the protein databases from GenBank, or private data C B. Et al. (1994) Genomics 23:192-2011, Bio Image bases (i.e., the LifeSeq.TM and PathoSeq.TM databases avail Sequence Assembly Manager, SeqMan Swindell S R. and able from Incyte Pharmaceuticals, Inc. of Palo Alto, Calif.). Plasterer J. N. (1997) Methods Mol. Biol. 70:75-89), and Optionally, biomolecular sequences of the present invention LEADS and GenCarta (Compugen Ltd. Israel). can be assembled from a number of pre-existing databases 0164. It will be appreciated that since applying sequence as described in Example 5 of the Examples section. homology analysis on large number of sequences is com putationally intensive, local alignment (i.e., the alignment of 0156 Alternatively, the database can be generated from portions of protein sequences) is preferably effected prior to sequence libraries including, but not limited to, cDNA global alignment (alignment of protein sequences along their libraries, EST libraries, mRNA libraries and the like. entire length), as described in Example 6 of the Examples 0157 Construction and sequencing of a cDNA library is section. one approach for generating a database of expressed mRNA sequences. cDNA library construction is typically effected 0.165. Once progressive clusters are formed, one or more by tissue or cell sample preparation, RNA isolation, cDNA ontological annotations (i.e., assigning an ontology) are sequence construction and sequencing. assigned to each cluster. 0158. It will be appreciated that such cDNA libraries can 0166 Systematic and standardized ontological nomen be constructed from RNA isolated from whole organisms, clature is preferably used. Such nomenclature (i.e., key tissues, tissue sections, or cell populations. Libraries can words) can be obtained from several sources. For example, also be constructed from a tissue reflecting a particular ontological annotations derived from three main ontologies: pathological or physiological state. molecular function, biological process and cellular compo nent are available from the Gene Ontology Consortium 0159. Once raw sequence data is obtained, biomolecular (www.geneontology.org). sequences are computationally clustered according to a 0.167 Alternatively a list of homogenized ontological progressive homology range using one or more clustering nomenclature can be obtained from AcroMed—a computer algorithms. To obtain progressive clusters, the biomolecular generated database of biomedical acronyms and the associ sequences are clustered through single linkage. Namely, a ated long forms extracted from the recent Medline abstracts biomolecular sequence belongs to a cluster if this sequence (http://www.expasy.org/tools/). shares a sequence homology above a certain threshold to one member of the cluster. The threshold increments from a high 0.168. Optionally, various conversion tables which link homology level to a low homology level with a predeter Commission number, InterPro protein motifs and mined resolution. Preferably the homology range is selected SwissProt keywords to gene ontology nodes are also avail from 99% -35%. able from www.geneontology.org and can be used with the 0160 Computational clustering can be effected using any present method. commercially available alignment Software including the 0.169 Ontologies, sub ontologies, and their ontological local homology algorithm of Smith & Waterman, Adv. Appl. relations (i.e., inherent relation—the sub-ontology “IS THE” Math. 2:482 (1981), using the homology alignment algo ontology or composite relation—the ontology “HAS’ the rithm of Needleman & Wunsch, J. Mol. Biol. 48:443 (1970), Sub ontology) can be organized into various computer data using the search for similarity method of Pearson & Lipman, structures such as a tree, a map, a graph, a stack or a list. Proc. Natl. Acad. Sci. USA 85:2444 (1988), or using These may also be presented in various data format Such as, computerized implementations of algorithms GAP, BEST text, table, html, or extensible markup language (XML) US 2007/0083334 A1 Apr. 12, 2007

0170 Ontologies and/or subontologies assigned to a spe 0.178 Computer-dedicated software for biological text cific biomolecular sequence can be derived from an anno analysis is available from http://www.expasy.org/tools/. tation, which is preassociated with at least one biomolecular Examples include, but are not limited to, MedMiner—A sequence in a cluster generated as described hereinabove. Software system which extracts and organizes relevant sen 0171 For example, biomolecular sequences obtained tences in the literature based on a gene, gene-gene or from an annotated database are typically preassociated with gene-drug query; Protein Annotators Assistant—A Software an annotation. An "annotated database' refers to a database system which assists protein annotators in the task of biomolecular sequences, which are at least partially charac assigning functions to newly sequenced proteins; and terized with respect to functional or structural aspects of the XplorMed—A software system which explores a set of sequence. Examples of annotated databases include but are abstracts derived from a bibliographic search in MEDLINE. not limited to: GenBank (www.ncbi.nlm.nih.gov/GenBank/ 0.179 Alternatively, assignment of ontological annota ), Swiss-Prot (www.expasy.ch/sprot/sprot-top.html), GDB tions may be effected by analyzing molecular, cellular (www.gdb.org/), PIR (www.mips.biochem.mpg.de/proj/ and/or functional traits of the biomolecular sequences. Pre prostseqdb/), YDB (www.mips.biochem.mpg.de/proj/yeast/ diction of cellular localization may be done using any ), MIPS (www.mips.biochem.mpg.de/proj/human), HGI computer dedicated software. For example prediction of (www.tigr.org/tdb/hgi/), Celera Assembled Human Genome cellular localization can be done using the ProLoc (Einat (www.celera.com/products/human ann.cfm and LifeSeq Hazkani-Covo, Erez, Levanon, Galit Rotman, Dan Graurand Gold (http://lifeseq gold.incyte.com). Additional specialized Amit Novik, a manuscript submitted for publication) com annotated databases include annotative information on putational platform. This software is capable of predicting metabolic (http://www.genome.ad.jp/kegg/metabolism the cellular localization of polypeptide sequences based on .html) and regulatory pathways (http://www.genome.ad.ip/ inherent features, including specific localization signatures, /regulation.html), and protein-protein interactions protein domains, amino acid composition, pI and protein (http://dip.doe-mbi.ucla.edu/), etc. length. Other examples for cellular localization prediction softwares include PSORT Prediction of protein sorting 0172 Alternatively, ontologies can be generated from an signals and localization sites and TargetP Prediction of analysis of at least one biomolecular sequence in each of the subcellular location, both available from http://www.ex clusters of the present invention. pasy.org/tools/. 0173 Preferably, analysis of the biomolecular sequence 0180 Prediction of functional annotations may be is effected by literature text mining. Since manual review of effected by motif analysis of the biomolecular sequences of related-literature may be a daunting task, computational the present invention. Thus for example, by implementing extraction of text information is preferably effected. any motif analysis software, which is based on protein 0174 Thus, the method of the present invention can also homology (see for example, http://motif.genome.ad.jp? and process literature and other textual information and utilize http://www.accelrys.com/products/grailprofindex.html) it is processed textual data for generating additional ontological possible to predict functional motifs of DNA sequences annotations. For example, text information contained in the including repeats, promoter sequences and CpG islands and sequence-related publications and definition lines in of encoded proteins such as Zinc finger and leucine Zipper. sequence records of sequence databases can be extracted and 0181. Due to the progressive nature of the clusters of the processed. Ontological annotations derived from processed present invention, ontology assignment starts at the highest text data are then assigned to the sequences in the corre level of homology. Any biomolecular sequence in the clus sponding clusters. ter, which shares identical level of homology compared to an 0175 Ontological annotations can also be extracted from ontologically annotated protein in the cluster is assigned the sequence associated Medical subject heading (MeSH) terms same ontological annotation. This procedure progresses which are assigned to published papers. from the highest level of homology to a lower threshold 0176). Additional information on text mining is provided level with a predetermined increment resolution. Newly in Example 7 of the Examples section and is disclosed in discovered homologies enable assignment of existing onto “Mining Text Using Keyword Distributions. Ronen Feld logical annotations to biomolecular sequences sharing man, Ido Dagan, and Haym Hirsh, Proceedings of the 1995 homologous sequences and being previously unannotated or Workshop on Knowledge Discovery in Databases, “Finding partially annotated (see Examples 5-9 of the Examples Associations in Collections of Text, Ronen Feldman and section). Haym Hirsh, Machine Learning and Data Mining: Methods 0182 Once assignment of an annotation is effected, and Applications, edited by R. S. Michalski, I. Bratko, and annotated clusters are disassembled resulting in annotation M. Kubat, John Wiley & Sons, Ltd., 1997 "Technology Text of each biomolecular sequence of the cluster. Mining, Turning Information Into Knowledge: A White 0183 Such annotated biomolecular sequences are then Paper from IBM,” edited by Daniel Tkach, Feb. 17, 1998, tested for false annotation. This is effected using the fol each of which is fully incorporated herein by reference. lowing scoring parameters: 0177. It will be appreciated that text mining may be 0.184 (i) A degree of homology characterizing the pro performed, in this and other embodiments of the present gressive cluster—accuracy of the annotation directly corre invention, for the text terms extracted from the definitions of lates with the homology level used for the annotation gene or protein sequence records, retrievable from databases process (see Examples 7-9 of the Examples section). such as GenBank and Swiss-Prot and title line, abstract of Scientific papers, retrievable from Medline database (e.g., 0185 (ii) Relevance of annotation to information http://www.ncbi.nlm.nih.gov/PubMed/). obtained from literature text mining—each assigned onto US 2007/0083334 A1 Apr. 12, 2007

logical annotation which results from literature text mining 12 may also take input from database 18 in performing or functional or cellular prediction is assessed using scoring annotations and iterative annotations. Further, user interface parameters such as LOD score (For further details see 14 is linked directly to database 18, such a user may dispatch Example 7 of the Examples section). queries to database 18 and retrieve information stored 0186 The present invention also enables the use of the therein. As such, user interface 14 allows a user to compile homologies identified according to the teachings of the queries, send instructions, view querying results and per present invention to annotate more sensitively and rapidly a forming specific analyses on the results as needed. query sequence. Essentially this involves building a 0197). In performing ontological annotations, processing sequence profile for each annotated cluster. A profile enables unit 12 may take input from one or more application scoring of a biomolecular sequence according to functional modules 16. Application module 16 performs a specific domains along a sequence and generally makes searches operation and produced a relevant annotative input for more sensitive. Essentially, clustered sequences are also processing unit 12. For example, application module 16 may tested for relevance to the cluster based upon shared func perform cellular localization analysis on a biomolecular tional domains and other characteristic sequence features. sequence query, thereby determining the cellular localiza 0187 Ontologically annotated biomolecular sequences tion of the encoded protein. Such a functional annotation is are stored in a database for further use. Additional informa then input to and used by processing unit 12. Examples for tion on generation and contents of Such databases is pro application Software for cellular localization prediction are vided hereinunder. provided hereinabove. 0188 Such a database can be used to query functional 0198 System 10 of the present invention may also be domains and sequences comprising thereof. Alternatively, connected to one or more external databases 20. External the database can be used to query a sequence, and retrieve database 20 is linked to processing unit 12 in a bi-directional manner, similar to the connection between database 18 and the compatible annotations. processing unit 12. External database 20 may include any 0189 Although the present methodology can be effected background information and/or sequence information that using prior art systems modified for Such purposes, due to pertains to the biomolecular sequence query. External data the large amounts of data processed and the vast amounts of base 20 may be a proprietary database or a publicly available processing needed, the present methodology is preferably database which is accessible through a public network Such effected using a dedicated computational system. as the Internet. External database 20 may feed relevant 0190. Thus, according to another aspect of the present information to processing unit 12 as it effects iterative invention and as illustrated in FIGS. 1a-b, there is provided ontological annotation. External database 20 may also a system for generating a database of annotated biomolecu receive and store ontological annotations generated by pro lar sequences. cessing unit 12. In this case external database 20 may interact with other components of system 10 like database 0191 System 10 includes a processing unit 12, which 18. executes a Software application designed and configured for annotating biomolecular sequences, as described herein 0199. It will be appreciated that the databases and appli above. System 10 further serves for storing biomolecular cation modules of system 10 can be directly connected with sequence information and annotations in a retrievable/ processing unit 12 and/or user interface 14 as is illustrated searchable database 18. Database 18 further includes infor in FIG. 1a, or Such a connection can be achieved via a mation pertaining to database generation. network 22, as is illustrated in FIG. 1b. 0192 System 10 may also include a user interface 14 0200 Network 22 may be a private network (e.g., a local (e.g., a keyboard and/or a mouse, monitor) for inputting area network), a secured network, or a public network (Such database or database related information, and for providing as the Internet), or a combination of public and private database information to a user. and/or secured networks. 0193 System 10 of the present invention may be any 0201 Thus, the present invention provides a well char computing platform known in the art including but not acterized approach for the systemic annotation of biomo limited to a personal computer, a work station, a mainframe lecular sequences. The use of text information analysis, and the like. annotation scoring system and robust sequence clustering procedure enables for the first time the creation of the best 0194 Preferably, database 18 is stored on a computer possible annotations and assignment thereof to a vast num readable media Such as a magnetic optico-magnetic or ber of biomolecular sequences sharing homologous optical disk. sequences. The availability of ontological annotations for a 0.195 System 10 of the present invention may be used by significant number of biomolecular sequences from different a user to query the stored database of annotations and species can provide a comprehensive account of sequence, sequence information to retrieve biomolecular sequences structural and functional information pertaining to the bio stored therein according to inputted annotations or to molecular sequences of interest. retrieve annotations according to a biomolecular sequence query. The Hierarchical Annotation Approach 0196. It will be appreciated that the connection between 0202 "Hierarchical annotation” refers to any ontology user interface 14 and processing unit 12 is bi-directional. and Subontology, which can be hierarchically ordered. Likewise, processing unit 12 and database 18 also share a Examples include but are not limited to a tissue expression two-way communication channel, wherein processing unit hierarchy, a developmental expression hierarchy, a patho US 2007/0083334 A1 Apr. 12, 2007 logical expression hierarchy, a cellular expression hierarchy, 0212 Finally, each of the assigned biomolecular an intracellular expression hierarchy, a taxonomical hierar sequences is recursively classified to nodes hierarchically chy, a functional hierarchy and so forth. higher than the specific nodes, such that the root node of the 0203 According to another aspect of the present inven dendrogram encompasses the full biomolecular sequence tion there is provided a method of annotating biomolecular set, which can be classified according to a certain hierarchy, sequences according to a hierarchy of interest. The method while the offspring of any node represent a partitioning of is effected as follows. the parent set. 0204 First, a dendrogram representing the hierarchy of 0213 For example, a biomolecular sequence found to be interest is computationally constructed. As used herein a specifically expressed in "rhabdomyosarcoma', will be clas “dendrogram' refers to a branching diagram containing sified also to a higher hierarchy level, which is "sarcoma', multiple nodes and representing a hierarchy of categories and then to "Mesenchimal cell tumors” and finally to a based on degree of similarity or number of shared charac highest hierarchy level “Tumor. In another example, a teristics. sequence found to be differentially expressed in endometrium cells, will be classified also to a higher hier 0205 Each of the multiple nodes of the dendrogram is archy level, which is “uterus', and then to “women genital annotated by at least one keyword describing the node, and system’’ and to 'genital system” and finally to a highest enabling literature and database text mining, as is further hierarchy level “genitourinary system'. The retrieval can be described hereinunder. A list of keywords can be obtained performed according to each one of the requested levels. from the GO Consortium (www.geneontlogy.org); measures are taken to include as many keywords, and to include 0214 Since annotation of publicly available databases is keywords which might be out of date. For example, for at times unreliable, newly annotated biomolecular sequences tissue annotation (see FIG. 4), a hierarchy was built using all are confirmed using computational or laboratory approaches available tissue/libraries sources available in the GenBank, as is further described hereinbelow. while considering the following parameters: ignoring Gen 0215. It will be appreciated that once temporal or spatial Bank synonyms, building anatomical hierarchies, enabling annotations of sequences are established using the teachings flexible distinction between tissue types (normal versus of the present invention, it is possible to identify those pathology) and tissue classification levels (organs, systems, sequences, which are differentially expressed (i.e., exhibit cell types, etc.). spatial or temporal pattern of expression in diverse cells or 0206. It will be appreciated that the dendrogram of the tissues). Such sequences are assigned to only a portion of the present invention can be illustrated as a graph, a list, a map nodes, which constitute the hierarchical dendrogram. or a matrix or any other graphic or textual organization, which can describe a dendrogram. An example of a dendro 0216 Changes in gene expression are important deter gram illustrating the gastrointestinal tissue hierarchy is minants of normal cellular physiology, including cell cycle provided in FIG. 2. regulation, differentiation and development, and they directly contribute to abnormal cellular physiology, includ 0207. In a second step, each of the biomolecular ing developmental anomalies, aberrant programs of differ sequences is assigned to at least one specific node of the entiation and cancer. Accordingly, the identification, cloning dendrogram. and characterization of differentially expressed genes can provide relevant and important insights into the molecular 0208. The biomolecular sequences according to this determinants of processes such as growth, development, aspect of the present invention can be annotated biomolecu aging, differentiation and cancer. Additionally, identification lar sequences, unannotated biomolecular sequences or par of Such genes can be useful in development of new drugs tially annotated biomolecular sequences. and diagnostic methods for treating or preventing the occur 0209 Annotated biomolecular sequences can be rence of Such diseases. retrieved from pre-existing annotated databases as described hereinabove. 0217 Newly annotated sequences identified according to the present invention are tested under physiological condi 0210 For example, in GenBank, relevant annotational tions (i.e., temperature, pH, ionic strength, Viscosity, and like information is provided in the definition and keyword fields. biochemical parameters which are compatible with a viable In this case, classification of the annotated biomolecular organism, and/or which typically exist intracellularly in a sequences to the dendrogram nodes is directly effected. A viable cultured yeast cell or mammalian cell). This can be search for Suitable annotated biomolecular sequences is effected using various laboratory approaches Such as, for performed using a set of keywords which are designed to example, FISH analysis, PCR, RT-PCR, southern blotting, classify the biomolecular sequences to the hierarchy (i.e., northern blotting, electrophoresis and the like (see Examples same keywords that populate the dendrogram) 13-20 of the Examples section) or more elaborate 0211. In cases where the biomolecular sequences are approaches which are detailed in the Background section. unannotated or partially annotated, extraction of additional 0218. It will be appreciated that true involvement of annotational information is effected prior to classification to differentially expressed genes in a biological process is dendrogram nodes. This can be effected by sequence align better confirmed using an appropriate cell or animal model, ment, as described hereinabove. Alternatively, annotational as further described hereinunder. information can be predicted from structural studies. Where needed, nucleic acid sequences can be transformed to amino 0219. Although the present methodology can be effected acid sequences to thereby enable more accurate annotational using prior art systems modified for Such purposes, due to prediction. the large amounts of data processed and the vast amounts of US 2007/0083334 A1 Apr. 12, 2007 processing needed, the present methodology is preferably exon junctions), intron sequences, alternative exon effected using a dedicated computational system. sequences and alternative sequences. 0220 Such a system is described hereinabove. The sys 0228. Once a unique sequence feature is identified, the tem includes a processing unit which executes a software expression pattern of the splice variant is determined. If the application designed and configured for hierarchically anno splice variant is differentially expressed then the unique tating biomolecular sequences as described hereinabove. feature thereof is annotated accordingly. The system further serves for storing biomolecular sequence 0229. Alternatively spliced expressed sequences of this information and annotations in a retrievable/searchable data aspect of the present invention, can be retrieved from base. numerous publicly available databases. Examples include 0221) The hierarchical annotation approach enables to but are not limited to ASDB an alternative splicing data assign an appropriate annotation level even in cases where base generated using GenBank and Swiss-Prot annotations expression is not restricted to a specific tissue type or cell (http://cbc.g.nersc.gov/asdb, ASMaml)B-a database of type. For example, different expressed sequences of a single alternative splices in human, mouse and rat (http:// contig which are annotated as being expressed in several 166.111.30.65/ASMAMDB.html). Alternative splicing data different tissue types of a single specific organ or a specific base—a database of alternative splices from literature system, are also annotated by the present invention to a (http://cgsigm.cshl.org/new alt exon db2/), Yeast intron higher hierarchy level thus denoting association with the database-Database of intron in yeast (http://www.cse.uc specific organ or system. In Such cases using keywords alone sc.edu/research/compbio/yeast introns.html). The Intron would not efficiently identify differentially expressed erator—alternative splicing in C. elegans based on analysis sequences. Thus for example, a sequence found to be of EST data (http://www.cse.ucsc.edu/~kent/intronerator), expressed in sarcoma, Ewing sarcoma tumors, pnet, rhab ISIS Intron Sequence Information System including a sec domyosarcoma, liposarcoma and mesenchymal cell tumors, tion of human alternative splices (http://isis.bit.uq.edu.au/), can not be assigned to specific sarcomas, but still can be TAP Transcript Assembly Program result of alternative annotated as mesenchymal cell tumor specific. Using this splicing (http://stl.wustl.edu/-zkan/TAP/) and HASDB da hierarchical annotation approach in combination with tabase of alternative splices detected in human EST data. advanced sequence clustering and assembly algorithms, 0230. Additionally, alternative splicing sequence data capable of predicting alternative splicing, may facilitate a utilized by this aspect of the present invention can be simple and rapid identification of gene expression patterns. obtained by any of the following bioinformatical Annotation of Differentially Expressed approaches. Alternatively Spliced Sequences 0231 (i) Genomically aligned ESTs the method iden 0222 Although numerous methods have been developed tifies ESTs which come from the same gene and looks for to identify differentially expressed genes, none of these differences between them that are consistent with alternative addressed splice variants, which occur in over 50% of splicing, such as large insertion or deletion in one EST. Each human genes. Given the common sequence features of candidate splice variant can be further assessed by aligning splice variants it is very difficult to identify splice variants the ESTs with respective genomic sequence. This reveals which expression is differential, using prior art methodolo candidate exons (i.e., matches to the genomic sequence) gies. Therefore assigning unique sequence features to dif separated by candidate splices (i.e., large gaps in the EST ferentially expressed splice variants may have an important genomic alignment). Since intronic sequences at splice impact to the understanding of disease development and junctions (i.e., donor/acceptor concatenations) are highly may serve as valuable markers to various pathologies. conserved (essentially 99.24% of introns have a GT-AG at their 5' and 3' ends, respectively) sequence data can be used 0223 Thus, according to another aspect of the present to verify candidate splices Burset et al. (2000) Nucleic invention there is provided a method of identifying sequence Acids Res. 28:4364-75 LEADS module Shoshan, et al. features unique to differentially expressed mRNA splice Proceeding of SPIE (eds. M. L. Bittner, Y. Chen, A. N. variants. The method is effected as follows. Dorsel, E. D. Dougherty) Vol. 4266, pp. 86-95 (2001).; R. 0224 First, unique sequence features are computation Sorek, G. Ast, D. Graur, Genome Res. In press; Compugen ally identified in identified splice variants of alternatively Ltd. U.S. patent application Ser. No. 09/133,987). spliced expressed sequences. 0232 (ii) Identification based on intron information— 0225. As used herein the phrase “splice variants' refers to The method creates a database of individual intron naturally occurring nucleic acid sequences and proteins sequences annotated in GenBank and utilizes Such encoded therefrom which are products of alternative splic sequences to search for EST sequences which include the ing. Alternative splicing refers to intron inclusion, exon intronic sequences Croft et al. (2000) Nat. Genet. 24:340 exclusion, or any addition or deletion of terminal sequences, 1. which results in sequence dissimilarities between the splice 0233 (iii) EST alignment to expressed sequences—looks variant sequence and the wild-type sequence. for insertions and deletions in ESTs relative to a set of 0226. Although most alternatively spliced variants result known mRNAs. Such a method enables to uncover alterna from alternative exon usage. Some result from the retention tively spliced variants with having to align ESTs with of introns not spliced-out in the intermediate stage of RNA genomic sequence Brett et al. (2000) FEBS Lett. 474-83 transcript processing. 86). 0227. As used herein the phrase “unique sequence fea 0234. It will be appreciated that in order to avoid false tures' refers to donor/acceptor concatenations (i.e., exon positive identification of novel splice isoforms, a set of US 2007/0083334 A1 Apr. 12, 2007

filters is applied. For example, sequences are filtered to flank and/or extend across the unique sequence features of exclude EST having sequence deviations, such as chimer the alternatively spliced expressed sequences of the present ism, random variation in which a given EST sequence or invention are generated. potential vector contamination at the ends of an EST. 0242 Preferably, oligonucleotides which are capable of 0235 Filtering can be effected by aligning ESTs with hybridizing under Stringent, moderate or mild conditions, as corresponding genomic sequences. Chimeric ESTs can be used in any polynucleotide hybridization assay are utilized. easily excluded by requiring that each EST aligns com Further description of hybridization conditions is provided pletely to a single genomic locus. Genomic location found hereinunder. by homology search and alignment can often be checked against radiation hybrid mapping data Muneer et al (2002) 0243 Oligonucleotides generated by the teachings of the Genomic 79:344-8). Furthermore, since the genomic regions present invention may be used in any modification of nucleic which align with an EST sequence correspond to exon acid hybridization based techniques, which are further sequences and alignment gaps correspond to introns, the detailed hereinunder. General features of oligonucleotide putative splice sites at exon/intron boundaries can be con synthesis and modifications are also provided hereinunder. firmed. Because splice donor and acceptor sites primarily 0244 Aside from being useful in identifying specific reside within the intron sequence, this methodology can splice variants, oligonucleotides generated according to the provide validation which is independent of the EST evi teachings of the present invention may also be widely used dence. Reverse transcriptase artifacts or other cDNA syn as diagnostic, prognostic and therapeutic agents in a variety thesis errors may also be filtered out using this approach. of disorders which are associated with specific splice vari Improper inclusion of genomic sequence in ESTs can also be antS. excluded by requiring pairs of mutually exclusive splices in 0245 Regulation of splicing is involved in 15% of different ESTs. genetic diseases Krawzczak et al. (1992) Hum. Genet. 0236 Additionally, it will be appreciated that observing 90:41-54) and may contribute for example to cancer mis a given splice variant in one EST but not in a second EST splicing of exon 18 in BRCA1, which is caused by a may be insufficient, as the latter can be an un-spliced EST polymorphism in an exonic enhancer Liu et al. (2001) rather than a biological significant intron inclusion. There Nature Genet. 27:55-58). fore measures are taken to focus on mutually exclusive 0246 Thus, oligonucleotides generated according to the splice variants, two different splice variants observed in teachings of the present invention can be included in diag different ESTs, which overlap in a genomic sequence. A nostic kits. For example, oligonucleotides sets pertaining to more stringent filtering may be applied by requiring two a specific disease associated with differential expression of splice variants to share one splice site but differ in another. an alternatively spliced transcript can be packaged in a one 0237. Once splice variants are identified, identification of or more containers with appropriate buffers and preserva unique sequence features therewithin can be effected com tives along with Suitable instructions for use and used for putationally by identifying insertions, deletions and donor diagnosis or for directing therapeutic treatment. Additional acceptor concatenations in ESTs relative to mRNA and information on Such diagnostic kits is provided hereinunder. preferably genomic sequences. 0247. It will be appreciated that an ability to identify 0238. As mentioned hereinabove, once alternatively alternatively spliced sequences, also facilitates identification spliced sequences (having unique sequence features) are of the various products of alternative splicing. identified, determination of their expression patterns is 0248 Recent studies indicate that most alternative splic effected in order to assign an annotation to the unique ing events result in an altered protein International sequence feature thereof. human genome sequencing consortium (2001) Nature 0239 Expression pattern identification may be effected 409:860-921; Modrek et al. (2001) Nucleic Acids Res. by qualifying annotations which are preassociated with the 29:2850-2859). The majority of these changes appear to alternatively spliced expressed sequences, as described here have a functional relevance (i.e., up-regulating or down inabove. This can be accomplished by scoring the annota regulating activity). Such as the replacement of the amino or tions. For example scoring pathological expression annota carboxyl terminus, or in-frame addition and removal of a tions can be effected according to: (i) prevalence of the functional domain. For example, alternative splicing can alternatively spliced expressed sequences in normal tissues; lead to the use of a different site for translation initiation (ii) prevalence of the alternatively spliced expressed (i.e., alternative initiation), a different translation termina sequences in pathological tissues; (iii) prevalence of the tion site due to a frameshift (i.e., truncation or extension), or alternatively spliced expressed sequence in total tissues; and the addition or removal of a stop codon in the alternative (iv) number of tissues and/or tissue types expressing the coding sequence (i.e., alternative termination). Additionally, alternatively spliced expressed sequences. alternative splicing can change an internal sequence region 0240 Alternatively, identifying the expression pattern of due to an in-frame insertion or deletion. One example of the the alternatively spliced expressed sequences of the present latter is the new FC receptor B-like protein, whose C-ter minal transmembrane domain and cytoplasmic tail, which is invention, is accomplished by identifying the unique important for signal transduction in this class of receptors, is sequence feature thereof. This can be effected by any hybrid replaced with a new transmembrane domain and tail by ization-based technique known in the art, such as northern alternative polyadenylation. Another example is the trun blot, dot blot, RNase protection assay, RT-PCR and the like. cated Growth Hormone Receptor which lacks most of its 0241 To this end oligonucleotides probes, which are intracellular domain and has been shown to heterodimerize Substantially homologous to nucleic acid sequences that with the full-length receptor, thus causing inhibition of US 2007/0083334 A1 Apr. 12, 2007 signaling by Growth Hormone Ross, R. J. M., Growth PROSITE (allows a graphical output); at EBI; PROSITE hormone & IGF Research, 9:42-46, (1999). scan Scans a sequence against PROSITE (allows mis 0249 Thus, assigning a unique sequence feature to a matches); at PBIL: PATTINPROT Scans a protein functionally altered splice variant enables identification of sequence or a protein database for one or several pattern(s): such variants. As used herein the phrase “functionally at PBIL: SMART Simple Modular Architecture Research altered splice variants' refers to alternatively spliced Tool; at EMBL: TEIRESIAS Generate patterns from a expressed sequences, which protein products exhibit gain of collection of unaligned protein or DNA sequences; at IBM, function or loss of function or modification of the original all available from http://www.expasy.org/tools/. function. 0256. It will be appreciated that functionally altered 0250) As used herein the phrase “gain of function” refers splice variants may also include a sequence alteration at a to any alternative splicing product, which exhibits increased post-translation modification consensus site. Such as, for functionality as compared to the wild type gene product. example, a tyrosine Sulfation site, a glycosylation site, etc. Examples of post-translational modification prediction soft 0251 As used herein the phrase “loss of function” refers wares include but are not limited to: SignalP Prediction of to any alternative splicing product, which exhibits reduced signal peptide cleavage sites; ChloroP Prediction of chlo function as compared to the wild type gene product includ roplast transit peptides: MITOPROT Prediction of mito ing any reduction in function, total absence of function or chondrial targeting sequences; Predotar—Prediction of dominant negative function. mitochondrial and plastid targeting sequences; NetOGlyc— 0252. As used herein the phrase “dominant negative' Prediction of type O-glycosylation sites in mammalian pro refers to the dominant effect of a splice variant on the teins; DictyCGlyc Prediction of GlcNAc O-glycosylation activity of wild type mRNA. For example, a protein product sites in Dictyostelium; YinOYang O-beta-GlcNAc attach of an altered splice variant may bind a wild type target ment sites in eukaryotic protein sequences; big-PI Predic protein without enzymatically activating it (e.g., receptor tor GPI Modification Site Prediction; DGPI Prediction dimmers), thus blocking and preventing the active of GPI-anchor and cleavage sites (Mirror site); NetPhos— from binding and activating the target protein. Prediction of Serine. Threonine and Tyrosine phosphoryla tion sites in eukaryotic proteins; NetPicoRNA Prediction 0253) As used herein the phrase “functional domain of protease cleavage sites in picornaviral proteins; NMT - refers to a region of a polypeptide, which displays a par Prediction of N-terminal N-myristoylation; Sulfinator Pre ticular function. This function may give rise to a biological, diction of tyrosine sulfation sites all available from http:// chemical, or physiological consequence which may be www.expasy.org/tools/. reversible or irreversible and which may include protein protein interactions (e.g., binding interactions) involving the 0257. Once putative functionally altered splice variants functional domain, a change in the conformation or a are identified, they are validated by experimental verifica transformation into a different chemical state of the func tion and functional studies, using methodologies well known tional domain or of molecules acted upon by the functional in the art. domain, the transduction of an intracellular or intercellular 0258. The Examples section which follows illustrates signal, the regulation of gene or protein expression, the identification and annotation of splice variants. Identified regulation of cell growth or death, or the activation or and annotated sequences are contained within the enclosed inhibition of an immune response. CD-ROMs 1-3. Some of these sequences represent (i.e., are 0254 Identification of putative functionally altered splice transcribed from) entirely new splice variants, while others variants, according to this aspect of the present invention, represent new splice variants of known sequences. In any can be effected by identifying sequence deviations from case, the sequences contained in the enclosed CD-ROM are functional domains of wild-type gene products. novel in that they include previously undisclosed sequence regions in the context of a known gene or an entirely new 0255 Identification of functional domains can be effected by comparing a wild-type gene product with a series of sequence in the context of an unknown gene. profiles prepared by alignment of well characterized proteins 0259. The nucleic acids of the invention can be “isolated” from a number of different species. This generates a con or “purified.” In the event the nucleic acid is genomic DNA, sensus profile, which can then be matched with the query it is considered "isolated when it does not include coding sequence. Examples of programs suitable for Such identifi sequence(s) of a gene or genes immediately adjacent thereto cation include, but are not limited to, InterPro Scan— in the naturally occurring genome of an organism; although Integrated search in PROSITE. Pfam, PRINTS and other Some or all of the 5' or 3' non-coding sequence of an adjacent family and domain databases; ScanProsite—Scans a gene can be included. For example, an isolated nucleic acid sequence against PROSITE or a pattern against SWISS (DNA or RNA) can include some or all of the 5' or 3 PROT and TrEMBL: MotifScan Scans a sequence against non-coding sequence that flanks the coding sequence (e.g., protein profile databases (including PROSITE); Frame-Pro the DNA sequence that is transcribed into, or the RNA fileScan Scans a short DNA sequence against protein sequence that gives rise to, the promoter or an enhancer in profile databases (including PROSITE); Pfam HMM the mRNA). For example, an isolated nucleic acid can search—Scans a sequence against the Pfam protein families contain less than about 5 kb (e.g., less than about 4 kb, 3 kb, database: FingerPRINTScan Scans a protein sequence 2 kb, 1 kb, 0.5 kb, or 0.1 kb) of the 5' and/or 3' sequence that against the PRINTS Protein Fingerprint Database: FPAT naturally flanks the nucleic acid molecule in a cell in which Regular expression searches in protein databases; PRATT the nucleic acid naturally occurs. In the event the nucleic Interactively generates conserved patterns from a series of acid is RNA or mRNA, it is “isolated” or “purified” from a unaligned proteins; PPSEARCH-Scans a sequence against natural Source (e.g., a tissue) or a cell culture when it is US 2007/0083334 A1 Apr. 12, 2007 substantially free of the cellular components with which it 0264. This aspect of the present invention includes natu naturally associates in the cell and, if the cell was cultured, rally occurring sequences of the nucleic acid sequences the cellular components and medium in which the cell was described above, allelic variants (same locus; functional or cultured (e.g., when the RNA or mRNA is in a form that non-functional), homologs (different locus), and orthologs contains less than about 20%, 10%, 5%, 1%, or less, of other (different organism) as well as degenerate variants of those cellular components or culture medium). When chemically sequences and fragments thereof. The degeneracy of the synthesized, a nucleic acid (DNA or RNA) is "isolated' or genetic code is well known, and one of ordinary skill in the “purified when it is substantially free of the chemical art will be able to make nucleotide sequences that differ from precursors or other chemicals used in its synthesis (e.g., the nucleic acid sequences of the present invention but when the nucleic acid is in a form that contains less than nevertheless encode the same proteins as those encoded by about 20%, 10%, 5%, 1%, or less, of the chemical precursors the nucleic acid sequences of the present invention. The or other chemicals). variant sequences (e.g., degenerate variants) can be used in the same manner as naturally occurring sequences. For 0260 Variants, fragments, and other mutant nucleic acids example, the variant DNA sequences of the invention can be are also envisaged by the present invention. As noted above, incorporated into a vector, into the genomic DNA of a where a given biomolecular sequence represents a new gene prokaryote or eukaryote, or made part of a hybrid gene. (rather than a new splice variant of a known gene), the Moreover, variants (or, where appropriate, the proteins they nucleic acids of the invention include the corresponding encode) can be used in the diagnostic assays and therapeutic genomic DNA and RNA. Accordingly, where a given SEQ regimes described below. ID represents a new gene, variations or mutations can occur not only in that nucleic acid sequence, but in the coding 0265. The sequence of nucleic acids of the invention can regions, the non-coding regions, or both, of the genomic also be varied to maximize expression in a particular expres DNA or RNA from which it was made. sion system. For example, as few as one and as many as about 20% of the codons in a given sequence can be altered 0261) The nucleic acids of the invention can be double to optimize expression in bacterial cells (e.g., E. coli), yeast, Stranded or single-stranded and can, therefore, either be a human, insect, or other cell types (e.g., CHO cells). sense strand, an antisense strand, or a portion (i.e., a frag ment) of either the sense or the antisense strand. The nucleic 0266 The nucleic acids of the invention can also be acids of the invention can be synthesized using standard shorter or longer than those disclosed on CD-ROMs 1 and nucleotides or nucleotide analogs or derivatives (e.g., 2. Where the nucleic acids of the invention encode proteins, inosine, phosphorothioate, or acridine Substituted nucle the protein-encoding sequences can differ from those rep otides), which can alter the nucleic acids ability to pair with resented by specific sequences of file “Protein.seqs' in complementary sequences or to resist nucleases. Indeed, the CD-ROM 2. For example, the encoded proteins can be stability or solubility of a nucleic acid can be altered (e.g., shorter or longer than those encoded by one of the nucleic improved) by modifying the nucleic acids base moiety, acid sequences of the present invention. Nucleotides can be Sugar moiety, or phosphate backbone. For example, the deleted from, or added to, either or both ends of the nucleic nucleic acids of the invention can be modified as taught by acid sequences of the present invention or the novel portions ToulméNature Biotech. 19:17, (2001) or Faria et al. Na of the sequences that represent new splice variants. Alter ture Biotech. 19:40-44, (2001), and the deoxyribose phos natively, the nucleic acids can encode proteins in which one phate backbone of nucleic acids can be modified to generate or more amino acid residues have been added to, or deleted peptide nucleic acids PNAs; see Hyrup et al., (1996) from, one or more sequence positions within the nucleic acid Bioorganic & Medicinal Chemistry 4:5-23). Sequences. 0262 PNAS are nucleic acid “mimics': the molecule's 0267 The nucleic acid fragments can be short (e.g., natural backbone is replaced by a pseudopeptide backbone 15-30 nucleotides). For example, in cases where peptides are and only the four nucleotide bases are retained. This allows to be expressed therefrom such polynucleotides need only specific hybridization to DNA and RNA under conditions of contain a Sufficient number of nucleotides to encode novel low ionic strength. PNAS can be synthesized using standard antigenic epitopes. In cases where nucleic acid fragments Solid phase peptide synthesis protocols as described, for serve as DNA or RNA probes or PCR primers, fragments are example by Hyrup et al. (supra) and Perry-O'Keefe et al. selected of a length sufficient for specific binding to one of Proc. Natl. Acad. Sci. USA (1996) 93:14670-675). PNAs of the sequences representing a novel gene or a unique portion the nucleic acids described herein can be used in therapeutic of a novel splice variant. and diagnostic applications. 0268 Nucleic acids used as probes or primers are often 0263 Moreover, the nucleic acids of the invention referred to as oligonucleotides, and they can hybridize with include not only protein-encoding nucleic acids perse (e.g., a sense or antisense strand of DNA or RNA. Nucleic acids coding sequences produced by the chain reac that hybridize to a sense Strand (i.e., a nucleic acid sequence tion (PCR) or following treatment of DNA with an endo that encodes protein, e.g., the coding strand of a double nuclease), but also, for example, recombinant DNA that is: stranded cDNA molecule) or to an mRNA sequence are (a) incorporated into a vector (e.g., an autonomously repli referred to as antisense oligonucleotides. Antisense oligo cating plasmid or virus), (b) incorporated into the genomic nucleotides can be used to specifically inhibit transcription DNA of a prokaryote or eukaryote, or (c) part of a hybrid of any of the nucleic acid sequences of the present invention. gene that encodes an additional polypeptide sequence (i.e., 0269. Design of antisense molecules must be effected a sequence that is heterologous to the nucleic acid sequences while considering two aspects important to the antisense of the present invention or fragments, other mutants, or approach. The first aspect is delivery of the oligonucleotide variants thereof). into the cytoplasm of the appropriate cells, while the second US 2007/0083334 A1 Apr. 12, 2007 aspect is design of an oligonucleotide which specifically with complementary RNA in which, contrary to the usual binds the designated mRNA within cells in a way which b-units, the strands run parallel to each other Gaultier et al., inhibits translation thereof. Nucleic Acids Res. 15:6625-6641, (1987). Alternatively, 0270. The prior art teaches of a number of delivery antisense nucleic acids can comprise a 2'-O-methylribo strategies which can be used to efficiently deliver oligo nucleotide Inoue et al., Nucleic Acids Res. 15:6131-6148, nucleotides into a wide variety of cell types see, for (1987) or a chimeric RNA-DNA analogue Inoue et al., example, Luft (1998) J Mol Med 76 (2): 75-6; Kronenwett FEBS Lett. 215:327-330, (1987). et al. (1998) Blood 91 (3): 852-62; Rajur et al. (1997) 0278. The nucleic acid sequences described above can Bioconjug Chem 8 (6): 935-40; Lavigne et al. (1997) also include ribozymes catalytic sequences. Such a Biochem Biophys Res Commun 237 (3): 566-71 and Aoki et ribozyme will have specificity for a protein encoded by the al. (1997) Biochem Biophys Res Commun 231 (3): 540-5). novel nucleic acids described herein (by virtue of having one or more sequences that are complementary to the cDNAS 0271 In addition, algorithms for identifying those that represent novel genes or the novel portions (i.e., the sequences with the highest predicted binding affinity for portions not found in related splice variants) of the their target mRNA based on a thermodynamic cycle that sequences that represent new splice variants. These accounts for the energetics of structural alterations in both ribozymes can include a catalytic sequence encoding a the target mRNA and the oligonucleotide are also available protein that cleaves mRNA see U.S. Pat. No. 5,093,246 or see, for example, Walton et al. (1999) Biotechnol Bioeng 65 Haselhoff and Gerlach, Nature 334:585-591, (1988). For (1): 1-9). example, a derivative of a tetrahymena L-19 IVS RNA can 0272 Such algorithms have been successfully used to be constructed in which the nucleotide sequence of the implement an antisense approach in cells. For example, the is complementary to the nucleotide sequence to algorithm developed by Walton et al. enabled scientists to be cleaved in an mRNA of the invention (e.g., one of the Successfully design antisense oligonucleotides for rabbit nucleic acid sequences of the present invention; see, U.S. beta-globin (RBG) and mouse tumor necrosis factor-alpha Pat. Nos. 4,987,071 and 5,116,742). Alternatively, the (TNF alpha) transcripts. The same research group has more mRNA sequences of the present invention can be used to recently reported that the antisense activity of rationally select a catalytic RNA having a specific ribonuclease activ selected oligonucleotides against three model target mRNAS ity from a pool of RNA molecules see, e.g., Bartel and (human lactate dehydrogenase A and B and rat gp130) in cell Szostak, Science 261:1411–1418, (1993); see also Krolet al., culture as evaluated by a kinetic PCR technique proved Bio-Techniques 6:958-976, (1988). effective in almost all cases, including tests against three 0279 Fragments having as few as 9-10 nucleotides (e.g., different targets in two cell types with phosphodiester and 12-14, 15-17, 18-20, 21-23, or 24-27 nucleotides) can be phosphorothioate oligonucleotide chemistries. useful as probes or expression templates and are within the 0273. In addition, several approaches for designing and Scope of the present invention. Indeed, fragments that con predicting efficiency of specific oligonucleotides using an in tain about 15-20 nucleotides can be used in Southern blot vitro system were also published (Matveeva et al. (1998) ting, Northern blotting, dot or slot blotting, PCR amplifica Nature Biotechnology 16, 1374-1375). tion methods (where naturally occurring or mutant nucleic acids are amplified), colony hybridization methods, in situ 0274 Several clinical trials have demonstrated safety, hybridization, and the like. feasibility and activity of antisense oligonucleotides. For example, antisense oligonucleotides suitable for the treat 0280 The present invention also encompasses pairs of ment of cancer have been successfully used (Holmund et al. oligonucleotides (these can be used, for example, to amplify (1999) Curr Opin Mol Ther 1 (3):372-85), while treatment the new genes, or portions thereof, or the novel portions of of hematological malignancies via antisense oligonucle the splice variant in, for example, potentially diseased otides targeting c-myb gene, p53 and Bcl-2 had entered tissue) and groups of oligonucleotides (e.g., groups that clinical trials and had been shown to be tolerated by patients exhibit a certain degree of homology (e.g., nucleic acids that Gerwitz (1999) Curr Opin Mol Ther 1 (3):297-306). are 90% identical to one another) or that share one or more functional attributes). 0275 More recently, antisense-mediated suppression of human heparanase gene expression has been reported to 0281. When used, for example, as probes, the nucleic inhibit pleural dissemination of human cancer cells in a acids of the invention can be labeled with a radioactive mouse model Uno et al. (2001) Cancer Res 61 (21):7855 isotope (e.g., using polynucleotide to add P-labeled 60). ATP to the oligonucleotide used as the probe) or an enzyme. Other labels, such as chemiluminescent, fluorescent, or 0276 Thus, the current consensus is that recent develop calorimetric, labels can be used. ments in the field of antisense technology which, as described above, have led to the generation of highly accu 0282. As noted above, the invention features nucleic rate antisense design algorithms and a wide variety of acids that are complementary to those represented by the oligonucleotide delivery systems, enable an ordinarily nucleic acid sequences of the present invention or novel portions thereof (i.e., novel fragments) and as such are skilled artisan to design and implement antisense approaches capable of hybridizing therewith. In many cases, nucleic Suitable for downregulating expression of known sequences acids that are used as probes or primers are absolutely or without having to resort to undue trial and error experimen completely complementary to all, or a portion of the target tation. sequence. However, this is not always necessary. The 0277 Antisense oligonucleotides can also be a-anomeric sequence of a useful probe or primer can differ from that of nucleic acids, which form specific double-stranded hybrids a target sequence so long as it hybridizes with the target US 2007/0083334 A1 Apr. 12, 2007

under the Stringency conditions described herein (or the have been defined in the art. These families include amino conditions routinely used to amplify sequences by PCR) to acids with basic side chains (e.g., lysine, arginine, histidine), form a stable duplex. acidic side chains (e.g., aspartic acid, glutamic acid), 0283 Hybridization of a nucleic acid probe to sequences uncharged polar side chains (e.g., glycine, asparagine, in a library or other sample of nucleic acids is typically glutamine, serine, threonine, tyrosine, cysteine), nonpolar performed under moderate to high Stringency conditions. side chains (e.g., alanine, Valine, leucine, isoleucine, proline, Nucleic acid duplex or hybrid stability is expressed as the phenylalanine, methionine, tryptophan), beta-branched side melting temperature (Tim), which is the temperature at which chains (e.g., threonine, Valine, isoleucine) and aromatic side a probe dissociates from a target DNA and, therefore, helps chains (e.g., tyrosine, phenylalanine, tryptophan, histidine). define the required stringency conditions. To identify The invention includes polypeptides that include one, two, sequences that are related or Substantially identical to that of three, five, or more conservative amino acid Substitutions, a probe, it is useful to first establish the lowest temperature where the resulting mutant polypeptide has at least one at which only homologous hybridization occurs with a biological activity that is the same, or Substantially the same, particular concentration of salt (e.g., SSC or SSPE). (The as a biological activity of the wild type polypeptide. terms “identity” or “identical as used herein are equated 0288 Fragments or other mutant nucleic acids can be with the terms “homology” or “homologous'). Then, assum made by mutagenesis techniques well known in the art, ing a 1% mismatch requires a 1° C. decrease in the Tm, the including those applied to polynucleotides, cells, or organ temperature of the wash (e.g., the final wash) following the isms (e.g., mutations can be introduced randomly along all hybridization reaction is reduced accordingly. For example, or part of the nucleic acid sequences of the present invention if sequences having at least 95% identity with the probe are by Saturation mutagenesis). The resultant mutant proteins sought, the final wash temperature is decreased by 5° C. In can be screened for biological activity to identify those that practice, the change in Tm can be between 0.5° C. and 1.5° retain activity or exhibit altered activity. C. per 1% mismatch 0289. In certain embodiments, nucleic acids of the inven 0284. The hybridization conditions described here can be tion differ from the nucleic acid sequences provided in files employed when the nucleic acids of the invention are used “Transcripts nucleotide seqs part1'. “Transcripts nucle in, for example, diagnostic assays, or when one wishes to otide seqs part2. identify, for example, the homologous genes that fall within “Transcripts nucleotide seqs part3.new' and “ProDG the scope of the invention (as stated elsewhere, the invention seqs” (provided in CD-ROM1 and CD-ROM2) by at least encompasses allelic variants, homologues and orthologues one, but less than 10, 20, 30, 40, 50, 100, or 200 nucleotides of the sequences that represent new genes). Homologous or, alternatively, at less than 1%. 5%, 10% or 20% of the genes will hybridize with the sequences that represent new nucleotides in the Subject nucleic acid (excluding, of course, genes under a stringency condition described herein. splice variants known in the art). Similarly, in certain 0285) A hybridization reaction is carried out at “high embodiments, proteins of the invention can differ from those stringency’ if hybridization (between the probe and a poten encoded by those included in File “Protein.seqs' (provided in CD-ROM2) by at least one, but less than 10, 20, 30, 40, tial target sequence) is carried out at 68°C. in (a) 5xSSC/5x 50, 100, or 200 amino acid residues or, alternatively, at less Denhardt’s solution/1.0% SDS, (b) 0.5 M NaHPO (pH than 1%. 5%, 10% or 20% of the amino acid residues in a 7.2)/1 mM EDTA/7% SDS, or (c) 50% formamide/0.25 M Subject protein (excluding, of course, proteins encoded by NaHPO (pH 7.2)/0.25 MNaC1/1 mM EDTA/7% SDS, and splice variants known in the art (proteins of the invention are washing is carried out with (a) 0.2xSSC/0.1% SDS at room described in more detail below)). If necessary for this temperature or at 42°C., (b) 0.1xSSC/0.1% SDS at 68°C., analysis (or any other test for homology or substantial or (c) 40 mM NaHPO (pH 7.2)/1 mM EDTA and either 1% identity described herein), the sequences should be aligned or 5% SDS at 50° C. for maximum homology, as described elsewhere here. 0286 “Moderately stringent” conditions constitute the hybridization conditions described above and one or more 0290 The present invention also encompasses mutants washes in 3xSSC at 42°C. Of course, salt concentration and e.g., nucleic acids that are 80% (or more) identical to one temperature can be varied to achieve the optimal level of of the nucleic acid sequences disclosed in CD-ROMs 1 and identity between the probe and the target nucleic acid. This 2), which encode proteins that retain substantially at least is well known in the art, and additional guidance is available one, or preferably substantially all of the biological activities in, for example, Sambrook et al., 1989, Molecular Cloning, of the referenced protein. What constitutes “substantially A Laboratory Manual, Cold Spring Harbor Laboratory all may vary considerably. For example, in Some instances, a variant or mutant protein may be about 5% as effective as Press, Cold Spring Harbor, N.Y., and Ausubel et al. (eds.), the protein from which it was derived. But if that level of 1995, Current Protocols in Molecular Biology, John Wiley activity is sufficient to achieve a biologically significant & Sons, New York, N.Y. result (e.g., transport of a Sufficient number of ions across a 0287. As mentioned hereinabove, the nucleic acid cell membrane), the variant or mutant protein is one that sequences of the present invention can be modified to retains substantially all of at least one of the biological encode substitution mutants of the wild type forms. Substi activities of the protein from which it was derived. A tution mutants can include amino acid residues that repre “biologically active' variant or mutant (e.g., fragment) of a sent either a conservative or non-conservative change (or, protein can participate in an intra- or inter-molecular inter where more than one residue is varied, possibly both). A action that can be characterized by specific binding between “conservative' substitution is one in which one amino acid molecules two or more identical molecules (in which case, residue is replaced with another having a similar side chain. homodimerization could occur) or two or more different Families of amino acid residues having similar side chains molecules (in which case, heterodimerization could occur). US 2007/0083334 A1 Apr. 12, 2007

Often, a biologically active fragment will be recognizable by a promoter, enhancer, or other expression control sequence, virtue of a recognizable domain or motif, and one can Such as a polyadenylation signal) that facilitates expression confirm biological activity experimentally. More specifi of the nucleic acid. The vector can replicate autonomously cally, for example, one can make (by synthesis or recombi or integrate into a host genome, and can be a viral vector, nant techniques) a nucleic acid fragment that encodes a Such as a replication defective retrovirus, an adenovirus, or potentially biologically active portion of a protein of the an adeno-associated virus. present invention by inserting the active fragment into an 0297 When present, the regulatory sequence can direct expression vector, and expressing the protein (genetic con constitutive or tissue-specific expression of the nucleic acid. structs and expression systems are described further below), Tissue-specific promoters include, for example, the liver and finally assessing the ability of the protein to function. specific albumin promoter (Pinkert et al., Genes Dev. 1:268 0291. The present invention also encompasses chimeric 277, 1987), lymphoid-specific promoters (Calame and nucleic acid sequences that encode fusion proteins. For Eaton, Adv. Immunol. 43:235-275, 1988), such as those of example, a nucleic acid sequence of the invention can T cell receptors (Winoto and Baltimore, EMBO J. 8:729 include a sequence that encodes a hexa-histidine tag (to 733, 1989) and immunoglobulins (Banerji et al., Cell facilitate purification of bacterially-expressed proteins) or a 33:729-740, 1982; Queen and Baltimore, Cell 33:741-748, hemagglutinin tag (to facilitate purification of proteins 1983), the neuron-specific neurofilament promoter (Byrne expressed in eukaryotic cells). and Ruddle, Proc. Natl. Acad. Sci. USA 86:5473-5477, 1989), pancreas-specific promoters (Edlund et al., Science 0292. The fused heterologous sequence can also encode 230:912-916, 1985), and mammary gland-specific promot a portion of an immunoglobulin (e.g., the constant region ers (e.g., milk whey promoter; see U.S. Pat. No. 4,873.316 (Fc) of an IgG molecule), a detectable marker, or a signal and European Application Publication No. 264,166). Devel sequence (e.g., a sequence that is recognized and is cleaved opmentally-regulated promoters can also be used. Examples by a signal peptidase in the host cell in which the fusion of such promoters include the murine hoX promoters (Kessel protein is expressed). Fusion proteins containing an Fc and Gruss, Science 249:374-379, 1990) and the fetoprotein region can be purified using a protein A column, and they promoter (Campes and Tilghman, Genes Dev. 3:537-546, have increased stability (e.g., a greater circulating half-life) 1989). Moreover, the promoter can be an inducible pro in vivo. moter. For example, the promoter can be regulated by a 0293 Detectable markers are well known in the art and steroid hormone, a polypeptide hormone, or some other can be used in the context of the present invention. For polypeptide (e.g., that used in the tetracycline-inducible example, the expression vector puR278 (Ruther et al., system, "Tet-On” and "Tet-Off: see, e.g., Clontech Inc. EMBO J., 2:1791, 1983) can be used to fuse a nucleic acid (Palo Alto, Calif.), Gossen and Bujard Proc. Natl. Acad. Sci. of the invention to the lacZ gene (which encodes B-galac USA 89:5547, 1992, and Paillard, Human Gene Therapy tosidase). 9:983, 1989). 0298 The expression vector will be selected or designed 0294. A nucleic acid sequence of the invention can also depending on, for example, the type of host cell to be be fused to a sequence that, when expressed, improves the transformed and the level of protein expression desired. For quantity or quality (e.g., solubility) of the fusion protein. For example, when the host cells are mammalian cells, the example, pGEX vectors can be used to express the proteins expression vector can include viral regulatory elements, of the invention fused to glutathione S- (GST). In Such as promoters derived from polyoma, Adenovirus 2, general. Such fusion proteins are soluble and can be easily cytomegalovirus and Simian Virus 40. The nucleic acid purified from lysed cells by adsorption to glutathione inserted (i.e., the sequence to be expressed) can also be agarose beads followed by elution in the presence of free modified to encode residues that are preferentially utilized in glutathione. The pGEX vectors (Pharmacia Biotech Inc: E. coli (Wada et al., Nucleic Acids Res. 20:2111-2118, Smith and Johnson, Gene 67:31-40, 1988) are designed to 1992). These modifications can be achieved by standard include thrombin or factor Xa protease cleavage sites so that DNA synthesis techniques. the cloned target gene product can be released from the GST moiety. Other useful vectors include pMAL (New England 0299| Expression vectors can be used to produce the Biolabs, Beverly, Mass.) and pRIT5 (Pharmacia, Piscat proteins encoded by the nucleic acid sequences of the away, N.J.), which fuse maltose E binding protein and invention ex vivo (e.g., the expressed proteins can be protein A, respectively, to a protein of the invention. purified from expression systems such as those described herein) or in vivo (in, for example, whole organisms). 0295) A signal sequence, when present, can facilitate Proteins can be expressed in vivo in a way that restores secretion of the fusion protein from a cell, and can be expression to within normal limits and/or restores the tem cleaved off by the host cell. The nucleic acid sequences of poral or spatial patterns of expression normally observed. the present invention can also be fused to “inactivating Alternatively, proteins can be aberrantly expressed in vivo sequences, which render the fusion protein encoded, as a (i.e., at a time or place, or to an extent, that does not normally whole, inactive. Such proteins can be referred to as “pre occur in vivo). For example, proteins can be over expressed proteins,” and they can be converted into an active form of or under expressed with respect to expression in a wild-type the protein by removal of the inactivating sequence. state; expressed at a different developmental stage; 0296. The present invention also encompasses genetic expressed at a different time during the cell cycle; or constructs (e.g., plasmids, cosmids, and other vectors that expressed in a tissue or cell type where expression does not transport nucleic acids) that include a nucleic acid of the normally occur. invention in a sense or antisense orientation. The nucleic 0300. The present invention also encompasses various acids can be operably linked to a regulatory sequence (e.g., engineered cells, including cells that have been engineered US 2007/0083334 A1 Apr. 12, 2007 20 to express or over-express a nucleic acid sequence described expression vectors (e.g., Tiplasmid) are suitable. These cells herein. Accordingly, the cells can be transformed with a and other types are available from a wide range of Sources genetic construct, such as those described above. A “trans e.g., the American Type Culture Collection, Manassas, Va.; formed cell is a cell into which (or into an ancestor of see also, e.g., Ausubel et al., Current Protocols in Molecular which) one has introduced a nucleic acid that encodes a Biology, John Wiley & Sons, New York, (1994). The protein of the invention. The nucleic acid can be introduced optimal methods of transformation (by, for example, trans by any of the art-recognized techniques for introducing fection) and, as noted above, the choice of expression nucleic acids into a host cell (e.g., calcium phosphate or vehicle will depend on the host system selected. Transfor calcium chloride co-precipitation, DEAE-dextran-mediated mation and transfection methods are described in, for transfection, lipofection, or electroporation). example, Ausubel et al., Supra; expression vehicles can be chosen from those provided in, for example, Pouwels et al., 0301 The terms “transformed cell” or “host cell” refer Cloning Vectors: A Laboratory Manual, (1985), Supp. not only to the particular Subject cell, but also to the progeny (1987). The host cells harboring the expression vehicle can or potential progeny of Such cells. Mutations or environ be cultured in conventional nutrient media, adapted as mental influences may modify the cells in Succeeding gen needed for activation of a chosen nucleic acid, repression of erations and, even though Such progeny may not be identical a chosen nucleic acid, selection of transformants, or ampli to the parent cell, they are nevertheless within the scope of fication of a chosen nucleic acid. the invention. The cells of the invention can be "isolated cells or “purified preparations' of cells (e.g., an in vitro 0305 Expression systems can be selected based on their preparation of cells), either of which can be obtained from ability to produce proteins that are modified (e.g., by phos multicellular organisms such as plants and animals (in which phorylation, glycosylation, or cleavage) in Substantially the case the purified preparation would constitute a subset of the same way they would be in a cell in which they are naturally cells from the organism). In the case of unicellular micro expressed. Alternatively, the system can be one in which organisms (e.g., microbial cells), the preparation is purified naturally occurring modifications do not occur, or occur in when at least 10% (e.g., 25%, 50%, 75%, 80%, 90%, 95% a different position, or to a different extent, than they or more) of the cells within it are the cells of interest (e.g., otherwise would. the cells that express a protein of the invention). 0306 If desired, the host cells can be those of a stably 0302) The expression vectors of the invention can be transfected cell line. Vectors suitable for stable transfection designed to express proteins in prokaryotic or eukaryotic of mammalian cells are available to the public (see, e.g., cells. For example, polypeptides of the invention can be Pouwels et al. (Supra) as are methods for constructing them expressed in bacterial cells (e.g., E. coli), fungi, yeast, or (see, e.g., Ausubel et al. (Supra). In one example, a nucleic insect cells (e.g., using baculovirus expression vectors). For acid of the invention is cloned into an expression vector that example, a baculovirus such as Autographa Californica includes the dihydrofolate reductase (DHFR) gene. Integra nuclear polyhedrosis virus (AcNPV), which grows in tion of the plasmid and, therefore, the nucleic acid it Spodoptera frugiperda cells, can be used as a vector to contains, into the host cell chromosome is selected for by express foreign genes. A nucleic acid of the invention can be including 0.01-300 mM methotrexate in the cell culture cloned into a non-essential region (for example the polyhe medium (as described in Ausubel et al., Supra). This domi drin gene) of the viral genome and placed under control of nant selection can be accomplished in most cell types. a promoter (e.g., the polyhedrin promoter). Successful inser 0307 Moreover, recombinant protein expression can be tion of the nucleic acid results in inactivation of the poly increased by DHFR-mediated amplification of the trans hedrin gene and production of non-occluded recombinant fected gene. Methods for selecting cell lines bearing gene virus (i.e., virus lacking the proteinaceous coat encoded by amplifications are described in Ausubel et al. (Supra) and the polyhedrin gene). These recombinant viruses are then generally involve extended culture in medium containing typically used to infect insect cells (e.g., Spodoptera fru gradually increasing levels of methotrexate. DHFR-contain giperda cells) in which the inserted gene is expressed (see, ing expression vectors commonly used for this purpose e.g., Smith et al., J. Virol. 46:584, 1983 and U.S. Pat. No. include pCVSEII-DHFR and p AdD26SV(A) (which are 4.215,051). If desired, mammalian cells can be used in lieu also described in Ausubel et al., Supra). of insect cells, provided the virus is engineered so that the nucleic acid is placed under the control of a promoter that is 0308) A number of other selection systems can be used. active in mammalian cells. These include those based on herpes simplex virus thymi dine kinase, hypoxanthine-guanine phosphoribosyl-trans 0303 Useful mammalian cells include rodent cells, such ferase, and adenine phosphoribosyltransferase genes, which as Chinese hamster ovary cells (CHO) or COS cells, primate can be employed in th, hgprt, or aprt cells, respectively. In cells, such as African green monkey kidney cells, rabbit addition, gpt, which confers resistance to mycophenolic acid cells, or pig cells). The mammalian cells can also be human (Mulligan et al., Proc. Natl. Acad. Sci. USA, 78:2072, cells (e.g., a hematopoietic cell, a fibroblast, or a tumor cell). 1981); neo, which confers resistance to the aminoglycoside For example, HeLa cells, 293 cells, 3T3 cells, and WI38 G-418 (Colberre-Garapin et al., J. Mol. Biol. 150:1, 1981): cells are useful. Other suitable host cells are known to those and hygro, which confers resistance to hygromycin (San skilled in the art and are discussed further in Goeddel Gene terre et al., Gene 30:147, 1981), can be used. Expression Technology: Methods in Enzymology 185, Aca 0309. In view of the foregoing, it is clear that one can demic Press, San Diego, Calif., (1990). synthesize proteins encoded by the nucleic acid sequences of 0304 Proteins can also be produced in plant cells, if the present invention (i.e., recombinant proteins). Methods desired. For plant cells, viral expression vectors (e.g., cau of generating and recombinant proteins are well known in liflower mosaic virus and tobacco mosaic virus) and plasmid the art. Recombinant protein purification can be effected by US 2007/0083334 A1 Apr. 12, 2007

affinity. Where a protein of the invention has been fused to 0316 Protein expression can also be regulated in cells a heterologous protein (e.g., a maltose binding protein, a without using the genetic constructs described above. B-galactosidase protein, or a trpE protein), antibodies or Instead, one can modify the expression of an endogenous other agents that specifically bind to the latter can facilitate gene within a cell (e.g., a cell line or microorganism) by purification. The recombinant protein can, if desired, be inserting a heterologous DNA regulatory element into the further purified (e.g., by high performance liquid chroma genome of the cell such that the element is operably linked tography or other standard techniques see, Fisher, Labora to the endogenous gene. For example, an endogenous gene tory Techniques. In Biochemistry And Molecular Biology, that is “transcriptionally silent.” (i.e., not expressed at Eds. Work and Burdon, Elsevier, (1980)). detectable levels) can be activated by inserting a regulatory 0310. Other purification schemes are known as well. For element that promotes the expression of a normally example, non-denatured fusion proteins can be purified from expressed gene product in that cell. Techniques such as human cell lines as described by Janknecht et al. (Proc. Natl. targeted homologous recombination can be used to insert the Acad. Sci. USA, 88:8972, 1981). In this system, a nucleic heterologous DNA (see, e.g., U.S. Pat. No. 5.272,071 and acid is Subcloned into a vaccinia recombination plasmid WO 91/06667). Such that it is translated, in frame, with a sequence encoding an N-terminal tag consisting of six histidine residues. 0317. The polypeptides of the present invention include Extracts of cells infected with the recombinant vaccinia the protein sequences contained in the File “Protein.seqs' of virus are loaded onto Ni" nitriloacetic acid-agarose col CD-ROM 2 and those encoded by the nucleic acids umns, and histidine-tagged proteins are selectively eluted described herein (so long as those nucleic acids contain with imidazole-containing buffers. coding sequence and are not wholly limited to an untrans 0311 Alternatively, Chemical synthesis can also be uti lated region of a nucleic acid sequence), regardless of lized to generate the proteins of the present invention e.g., whether they are recombinantly produced (e.g., produced in proteins can be synthesized by the methods described in and isolated from cultured cells), otherwise manufactured Solid Phase Peptide Synthesis, 2nd Ed., The Pierce Chemi (by, for example, chemical synthesis), or isolated from a cal Co., Rockford, Ill., (1984). natural biological Source (e.g., a cell or tissue) using stan 0312 The invention also features expression vectors that dard protein purification techniques. can be transcribed and translated in vitro using, for example, 0318. The terms “peptide,”“polypeptide,” and “protein' a T7 promoter and T7 polymerase. Thus, the invention are used herein interchangeably to refer to a chain of amino encompasses methods of making the proteins described acid residues, regardless of length or post-translational herein in vitro. modification (e.g., glycosylation or phosphorylation). Pro 0313 Sufficiently purified proteins can be used as teins (including antibodies that specifically bind to the described herein. For example, one can administer the products of those nucleic acid sequences that encode protein protein to a patient, use it in diagnostic or screening assays, or fragments thereof) and other compounds can be “iso or use it to generate antibodies (these methods are described lated' or “purified.” The proteins and compounds of the further below). present invention are "isolated or “purified' when they 0314. The cells perse can also be administered to patients exist as a composition that is at least 60% (e.g., 70%, 75%, in the context of replacement therapies. For example, a 80%, 85%, 90%, 95%, or 99% or more) by weight the nucleic acid of the present invention can be operably linked protein or compound of interest. Thus, the proteins of the to an inducible promoter (e.g., a steroid hormone receptor invention are substantially free from the cellular material (or regulated promoter) and introduced into a human or non other biological or cell culture material) with which they human (e.g., porcine) cell and then into a patient. Optionally, may have, at one time, been associated (naturally or other the cell can be cultivated for a time or encapsulated in a wise). Purity can be measured by any appropriate standard biocompatible material. Such as poly-lysine alginate. See, method (e.g., column chromatography, polyacrylamide gel e.g., Lanza, Nature Biotechnol. 14:1107, (1996); Joki et al. electrophoresis, or HPLC analysis Nature Biotechnol. 19:35, 2001; and U.S. Pat. No. 5,876, 0319. The proteins of the invention also include those 742. When a steroid hormone receptor-regulated promoter is encoded by novel fragments or other mutants or variants of used, protein production can be regulated in the Subject by the protein-encoding sequences of the present invention. administering a steroid hormone to the Subject. Implanted These proteins can retain substantially all (e.g., 70%, 80%, recombinant cells can also express and secrete an antibody 90%. 95%, or 99%) of the biological activity of the full that specifically binds to one of the proteins encoded by the length protein from which they were derived and can, nucleic acid sequences of the present invention. The anti therefore, be used as agonists or mimetics of the proteins body can be any antibody or any antibody derivative from which they were derived. The manner in which bio described herein. An antibody “specifically binds to a logical activity can be determined is described generally particular antigen when it binds to that antigen but not, to a herein, and specific assays (e.g., assays of enzymatic activity detectable level, to other molecules in a sample (e.g., a tissue or ligand-binding ability) are known to those of ordinary or cell culture) that naturally includes the antigen. skill in the art. In some instances, retention of biological 0315) While the host cells described above express activity is not necessary or desirable. For example, frag recombinant proteins, the invention also encompasses cells ments that retain little, if any, of the biological activity of a in which gene expression is disrupted (e.g., cells in which a full-length protein can be used as immunogens, which, in gene has been knocked out). These cells can serve as models turn, can be used as therapeutic agents (e.g., to generate an of disorders that are related to mutated or mis-expressed immune response in a patient), diagnostic agents (e.g., to alleles and are also useful in drug screening. detect the presence of antibodies or other proteins in a tissue US 2007/0083334 A1 Apr. 12, 2007 22 sample obtained from a patient), or to generate or test proteins or genes having a particular property are known in antibodies that specifically bind the proteins of the inven the art. These methods can be adapted for rapid screening. tion. Recursive ensemble mutagenesis (REM), a new technique that enhances the frequency of functional mutants in librar 0320 In other instances, the proteins encoded by nucleic ies, can be used in combination with screening assays to acids of the invention can be modified (e.g., fragmented or identify useful variants of the proteins of the present inven otherwise mutated) so their activities oppose those of the tion Arkin and Yourvan, Proc. Natl. Acad. Sci. USA naturally occurring protein (i.e., the invention encompasses 89:7811-7815, (1992); Delgrave et al., Protein Engineering variants of the proteins encoded by nucleic acids of the invention that are antagonistic to a biological process). One 6:327-331, (1993)). of ordinary skill in the art will recognize that the more 0324 Cell-based assays can be exploited to analyze var extensive the mutation, the more likely it is to affect the iegated libraries constructed from one or more of the pro biological activity of the protein (this is not to say that minor teins of the invention. For example, cells in a cell line (e.g., modifications cannot do so as well). Thus, it is likely that a cell line that ordinarily responds to the protein(s) of mutant proteins that are agonists of those encoded by wild interest in a Substrate-dependent manner) can be transfected type proteins will differ from those wild type proteins only with a library of expression vectors. The transfected cells are at non-essential residues or will contain only conservative then contacted with the protein and the effect of the expres Substitutions. Conversely, antagonists are likely to differ at sion of the mutant on signaling by the protein (Substrate) can an essential residue or to contain non-conservative Substi be detected (e.g., by measuring redox activity or protein tutions. Moreover, those of ordinary skill in the art can folding). Plasmid DNA can then be recovered from the cells engineer proteins so that they retain desirable traits (i.e., that score for inhibition, or alternatively, potentiation of those that make them efficacious in a particular therapeutic, signaling by the protein (Substrate). Individual clones are diagnostic, or screening regime) and lose undesirable traits then further characterized. (i.e., those that produce side effects, or produce false 0325 The invention also contemplates antibodies (i.e., positive results through non-specific binding). immunoglobulin molecules) that specifically bind (see the definition above) to the proteins described herein and anti 0321) In the event a protein of the invention is encoded by body fragments (e.g., antigen-binding fragments or other a new gene, the invention encompasses proteins that arise immunologically active portions of the antibody). Antibod following alternative transcription, RNA splicing, transla ies are proteins, and those of the invention can have at least tional- or post-translational events (e.g., the invention one or two heavy chain variable regions (VH), and at least encompasses splice variants of the new genes). In the event one or two light chain variable regions (VL). The VH and a protein of the invention is encoded by a novel splice VL regions can be further subdivided into regions of hyper variant, the invention encompasses proteins that arise fol variability, termed "complementarity determining regions' lowing alternative translational- or post-translational events (CDR), which are interspersed with more highly conserved (i.e., the invention does not encompass proteins encoded by “framework regions’ (FR). These regions have been pre known splice variants, but does encompass other variants of cisely defined see, Kabat et al., Sequences of Proteins of the novel splice variant). Post-translational modifications are Immunological Interest, Fifth Edition, U.S. Department of discussed above in the context of expression systems. Health and Human Services, NIH Publication No. 91-3242, 0322 The fragmented or otherwise mutant proteins of the (1991) and Chothia.et al., J. Mol. Biol. 196:901-917, invention can differ from those encoded by the nucleic acids (1987), and antibodies or antibody fragments containing of the invention to a limited extent (e.g., by at least one but one or more of them are within the scope of the invention. less than 5, 10 or 15 amino acid residues). As with other, 0326. The antibodies of the invention can also include a more extensive mutations, the differences can be introduced heavy and/or light chain constant region constant regions by adding, deleting, and/or substituting one or more amino typically mediate binding between the antibody and host acid residues. Alternatively, the mutant proteins can differ tissues or factors, including effector cells of the immune from the wild type proteins from which they were derived by system and the first component (C1q) of the classical at least one residue but less than 5%, 10%, 15% or 20% of complement system, and can therefore form heavy and light the residues when analyzed as described herein. If the immunoglobulin chains, respectively. For example, the anti mutant and wild type proteins are different lengths, they can body can be a tetramer (two heavy and two light immuno be aligned and analyzed using the algorithms described globulin chains, which can be connected by, for example, above. disulfide bonds). The heavy chain constant region contains 0323 Useful variants, fragments, and other mutants of three domains (CH1, CH2 and CH3), whereas the light chain the proteins encoded by the nucleic acids of the invention constant region has one (CL). can be identified by screening combinatorial libraries of 0327. An antigen-binding fragment of the invention can these variants, fragments, and other mutants for agonist or be: (i) a Fab fragment (i.e., a monovalent fragment consist antagonist activity. For example, libraries of fragments (e.g., ing of the VL, VH, CL and CH1 domains); (ii) a F(ab') N-terminal, C-terminal, or internal fragments) of one or fragment (i.e., a bivalent fragment containing two Fab more of the proteins of the invention can be used to generate fragments linked by a disulfide bond at the hinge region); populations of fragments that can be screened and, once (iii) a Fd fragment consisting of the VH and CH1 domains: identified, isolated. The proteins can include those in which (iv) a Fv fragment consisting of the VL and VH domains of one or more cysteine residues are added or deleted, or in a single arm of an antibody, (v) a dAb fragment Ward et al., which a glycosylated residue is added or deleted. Methods Nature 341:544-546, (1989), which consists of a VH for screening libraries (e.g., combinatorial libraries of pro domain; and (vi) an isolated complementarity determining teins made from point mutants or cDNA libraries) for region (CDR). US 2007/0083334 A1 Apr. 12, 2007

0328 F(ab') fragments can be produced by pepsin diges mational epitopes can sometimes be identified by identifying tion of the antibody molecule, and Fab fragments can be antibodies that bind to a protein in its native form, but not generated by reducing the disulfide bridges of F(ab')2 frag in a denatured form. ments. Alternatively, Fab expression libraries can be con 0331 The host animal (e.g., a rabbit, mouse, guinea pig, structed Huse et al., Science 246:1275, (1989) to allow or rat) can be immunized with the antigen, optionally linked rapid and easy identification of monoclonal Fab fragments to a carrier (i.e., a Substance that stabilizes or otherwise with the desired specificity. Methods of making other anti improves the immunogenicity of an associated molecule). bodies and antibody fragments are known in the art. For and optionally administered with an adjuvant (see, e.g., example, although the two domains of the Fv fragment, VL Ausubel et al., Supra). An exemplary carrier is keyhole and VH, are coded for by separate genes, they can be joined, limpet hemocyanin (KLH) and exemplary adjuvants, which using recombinant methods or a synthetic linker that enables will be selected in view of the host animals species, include them to be made as a single protein chain in which the VL Freund's adjuvant (complete or incomplete), adjuvant min and VH regions pair to form monovalent molecules known eral gels (e.g., aluminum hydroxide), Surface active Sub as single chain FV (ScPV); see e.g., Bird et al., Science stances such as lysolecithin, pluronic polyols, polyanions, 242:423-426, (1988); Huston et al., Proc. Natl. Acad. Sci. USA 85:5879-5883, (1988); Colcher et al., Ann. NY Acad. peptides, oil emulsions, dinitrophenol, BCG (bacille Cal Sci. 880:263-80, (1999); and Reiter, Clin. Cancer Res. mette-Guerin), and Corynebacterium parvun. KLH is also 2:245-52, (1996). Techniques for producing single chain Sometimes referred to as an adjuvant. The antibodies gen antibodies are also described in U.S. Pat. Nos. 4,946,778 and erated in the host can be purified by, for example, affinity 4,704,692. Such single chain antibodies are encompassed chromatography methods in which the polypeptide antigen within the term “antigen-binding fragment of an antibody. is immobilized on a resin. These antibody fragments are obtained using conventional 0332 Epitopes encompassed by an antigenic peptide may techniques known to those of ordinary skill in the art, and the be located on the Surface of the protein (e.g., in hydrophilic fragments are screened for utility in the same manner that regions), or in regions that are highly antigenic (Such regions intact antibodies are screened. Moreover, a single chain can be selected, initially, by virtue of containing many antibody can form dimers or multimers and, thereby, charged residues). An Emini Surface probability analysis of become a multivalent antibody having specificities for dif human protein sequences can be used to indicate the regions ferent epitopes of the same target protein. that have a particularly high probability of being localized to 0329. The antibody can be a polyclonal (i.e., part of a the surface of the protein. heterogeneous population of antibody molecules derived 0333. The antibody can be a fully human antibody (e.g., from the Sera of the immunized animals) or a monoclonal an antibody made in a mouse that has been genetically antibody (i.e., part of a homogeneous population of anti engineered to produce an antibody from a human immuno bodies to a particular antigen), either of which can be globulin sequence, such as that of a human immunoglobulin recombinantly produced (e.g., produced by phage display or gene (the kappa, lambda, alpha (IgA1 and IgA2), gamma by combinatorial methods, as described in, e.g., U.S. Pat. (IgG1, IgG2, IgG3, IgG4), delta, epsilon and mu constant No. 5,223,409, WO 92/18619; WO 91/17271; WO region genes or the myriad immunoglobulin variable region 92/20791; WO 92/15679; WO 93/01288; WO 92/01047; genes). Alternatively, the antibody can be a non-human WO92/09690: WO 90/02809; Fuchs et al., Bio/Technology antibody (e.g., a rodent (e.g., a mouse or rat), goat, or 9:1370-1372, (1991); Hay et al. Human Antibody Hybrido non-human primate (e.g., monkey) antibody). mas 3:81–85, (1992); Huse et al. Science 246:1275-1281, (1989); Griffiths et al. EMBO J. 12:725-734, (1993); Hawk 0334 Methods of producing antibodies are well known in ins et al., J. Mol. Biol 226:889-896, (1992); Clackson et al. the art. For example, as noted above, human monoclonal Nature 352:624–628, (1991); Gram et al., Proc. Natl. Acad. antibodies can be generated in transgenic mice carrying the Sci. USA89:3576-3580, (1992); Garrad et al., Bio/Technol human immunoglobulin genes rather than those of the ogy 9:1373-1377, (1991); Hoogenboom et al. Nucl. Acids mouse. Splenocytes obtained from these mice (after immu Res. 19:4133-4137, (1991); and Barbas et al., Proc. Natl. nization with an antigen of interest) can be used to produce Acad. Sci. USA 88:7978-7982, (1991). In one embodiment, hybridomas that secrete human mAbs with specific affinities an antibody is made by immunizing an animal with a protein for epitopes from a human protein (see, e.g., WO 91/00906, encoded by a nucleic acid of the invention (one, of course, WO 91/10741; WO92/03918: WO92/03917; Lonberg et that contains coding sequence) or a mutant or fragment (e.g., al., Nature 368:856-859, 1994; Green et al., Nature Genet. an antigenic peptide fragment) thereof. Alternatively, an 7:13-21, 1994; Morrison et al. Proc. Natl. Acad. Sci. USA animal can be immunized with a tissue sample (e.g., a crude 81:6851-6855, 1994; Bruggeman et al., Immunol. 7:33-40, tissue preparation, a whole cell (living, lysed, or fraction 1993; Tuaillon et al., Proc. Natl. Acad. Sci. USA 90:3720 ated) or a membrane fraction). Thus, antibodies of the 3724, 1993; and Bruggeman et al., Eur. J. Immunol 21:1323 invention can specifically bind to a purified antigen or a 1326, 1991). tissue (e.g., a tissue section, a whole cell (living, lysed, or 0335 The antibody can also be one in which the variable fractionated) or a membrane fraction). region, or a portion thereof (e.g., a CDR), is generated in a 0330. In the event an antigenic peptide is used, it can non-human organism (e.g., a rat or mouse). Thus, the include at least eight (e.g., 10, 15, 20, or 30) consecutive invention encompasses chimeric, CDR-grafted, and human amino acid residues found in a protein of the invention. The ized antibodies and antibodies that are generated in a non antibodies generated can specifically bind to one of the human organism and then modified (in, e.g., the variable proteins in their native form (thus, antibodies with linear or framework or constant region) to decrease antigenicity in a conformational epitopes are within the invention), in a human. Chimeric antibodies (i.e., antibodies in which dif denatured or otherwise non-native form, or both. Confor ferent portions are derived from different animal species US 2007/0083334 A1 Apr. 12, 2007 24

(e.g., the variable region of a murine mAb and the constant humanized antibodies in which specific amino acid residues region of a human immunoglobulin) can be produced by have been substituted, deleted or added (in, e.g., in the recombinant techniques known in the art. For example, a framework region to improve antigen binding). For gene encoding the Fc constant region of a murine (or other example, a humanized antibody will have framework resi species) monoclonal antibody molecule can be digested with dues identical to those of the donor or to amino acid residues restriction enzymes to remove the region encoding the other than those of the recipient framework residue. To murine Fc, and the equivalent portion of a gene encoding a generate such antibodies, a selected, Small number of accep human Fc constant region can be substituted therefore see tor framework residues of the humanized immunoglobulin European Patent Application Nos. 125,023; 184, 187; 171, chain are replaced by the corresponding donoramino acids. 496; and 173,494; see also WO 86/01533: U.S. Pat. No. The substitutions can occur adjacent to the CDR or in 4,816,567: Better et al., Science 240:1041-1043, (1988); Liu regions that interact with a CDR (U.S. Pat. No. 5,585,089, et al., Proc. Natl. Acad. Sci. USA84:3439-3443, (1987); Liu see especially columns 12-16). Other techniques for human et al., J. Immunol. 139:3521-3526, (1987); Sun et al., Proc. izing antibodies are described in EP 519596 A1. Natl. Acad. Sci. USA 84:214-218, (1987); Nishimura et al., 0339. In certain embodiments, the antibody has an effec Cancer Res. 47:999-1005, (1987); Wood et al., Nature tor function and can fix complement, while in others it can 314:446-449, (1985); Shaw et al., J. Natl. Cancer Inst. neither recruit effector cells nor fix complement. The anti 80:1553-1559, (1988); Morrison et al., Proc. Natl. Acad. Sci. body can also have little or no ability to bind an Fc receptor. USA 81:6851, (1984); Neuberger et al., Nature 312:604, For example, it can be an isotype or subtype, or a fragment (1984); and Takeda et al., Nature 314:452, (1984)). or other mutant that cannot bind to an Fc receptor (e.g., the 0336. In a humanized or CDR-grafted antibody, at least antibody can have a mutant (e.g., a deleted) Fc receptor one or two, but generally all three of the recipient CDRs (of binding region). The antibody may or may not alter (e.g., heavy and or light immuoglobulin chains) will be replaced increase or decrease) the activity of a protein to which it with a donor CDR. One need only replace the number of binds. CDRs required for binding of the humanized antibody to a protein described herein or a fragment thereof. The donor 0340. In other embodiments, the antibody can be coupled can be a rodent antibody, and the recipient can be a human to a heterologous Substance, Such as a toxin (e.g., ricin, framework or a human consensus framework. Typically, the diphtheria toxin, or active fragments thereof), another type immunoglobulin providing the CDRs is called the “donor of therapeutic agent (e.g., an antibiotic), or a detectable (and is often that of a rodent) and the immunoglobulin label. A detectable label can include an enzyme (e.g., providing the framework is called the “acceptor.” The accep horseradish peroxidase, alkaline phosphatase, B-galactosi tor framework can be a naturally occurring (e.g., a human) dase, or acetylcholinesterase), a prosthetic group (e.g., framework, a consensus framework or sequence, or a streptavidin/biotin and avidin/biotin), or a fluorescent, lumi sequence that is at least 85% (e.g., 90%. 95%, 99%) iden nescent, bioluminescent, or radioactive material. (e.g., tical thereto. A “consensus sequence' is one formed from the umbelliferone, fluorescein, fluorescein isothiocyanate, most frequently occurring amino acids (or nucleotides) in a rhodamine, dichlorotriazinylamine fluorescein, dansyl chlo family of related sequences (see, e.g., Winnaker, From ride or phycoerythrin (which are fluorescent), luminol Genes to Clones, Verlagsgesellschaft, Weinheim, Germany, (which is luminescent), luciferase, luciferin, and aequorin 1987). Each position in the consensus sequence is occupied (which are bioluminescent), and 'I, I, S or H (which by the amino acid residue that occurs most frequently at that are radioactive)). position in the family (where two occur equally frequently, 0341 The antibodies of the invention (e.g., a monoclonal either can be included). A "consensus framework” refers to antibody) can be used to isolate the proteins of the invention the framework region in the consensus immunoglobulin (by, for example, affinity chromatography or immunopre Sequence. cipitation) or to detect them in, for example, a cell lysate or 0337. An antibody can be humanized by methods known supernatant (by Western blotting, ELISAs, radioimmune in the art. For example, humanized antibodies can be gen assays, and the like) or a histological section. One can erated by replacing sequences of the Fv Variable region that therefore determine the abundance and pattern of expression are not directly involved in antigen binding with equivalent of a particular protein. This information can be useful in sequences from human Fv Variable regions. General meth making a diagnosis or in evaluating the efficacy of a clinical ods for generating humanized antibodies are provided by teSt. Morrison Science 229:1202-1207, (1985)), Oi et al. Bio 0342. The invention also includes the nucleic acids that Techniques 4:214, (1986), and Queen et al. (U.S. Pat. Nos. encode the antibodies described above and vectors and cells 5,585,089; 5,693,761 and 5,693,762). Those nucleic acid (e.g., mammalian cells Such as CHO cells or lymphatic cells) sequences required by these methods can be obtained from that contain them. Similarly, the invention includes cell lines a hybridoma producing an antibody the polypeptides of the (e.g., hybridomas) that make the antibodies of the invention present invention, or fragments thereof. The recombinant and methods of making those cell lines. DNA encoding the humanized antibody, or fragment thereof, 0343 Non-human transgenic animals are also within the can then be cloned into an appropriate expression vector. Scope of the invention. These animals can be used to study 0338 Humanized or CDR-grafted antibodies can be pro the function or activity of proteins of the invention and to duced such that one, two, or all CDRs of an immunoglobulin identify or evaluate agents that modulate their activity. A chain can be replaced see, e.g., U.S. Pat. No. 5.225,539; “transgenic animal' can be a mammal (e.g., a mouse, rat, Jones et al., Nature 321:552-525, (1986); Verhoeyan et al., dog, pig, cow, sheep, goat, or non-human primate), an avian Science 239:1534, (1988); and Beidler et al., J. Immunol. (e.g., a chicken), or an amphibian (e.g. a frog) having one or 141:4053-4060, (1988). Thus, the invention features more cells that include a transgene (e.g., an exogenous DNA US 2007/0083334 A1 Apr. 12, 2007 molecule or a rearrangement (e.g., deletion of) endogenous 0352 Pharmaceutical compositions including such pro chromosomal DNA). The transgene can be integrated into or teins or protein encoding sequences, antibodies directed can occur within the genome of the cells of the animal, and against Such proteins or polynucleotides capable of altering it can direct the expression of an encoded gene product in expression of Such proteins may be used to treat diseases one or more types of cells or tissues. Alternatively, a involving transcription factors binding proteins, for example transgene can "knock out' or reduce gene expression. This diseases where there is non-normal replication or transcrip can occur when an endogenous gene has been altered by tion of DNA and RNA respectively; while probe sequences homologous recombination, which occurs between it and an or antibodies may be used for diagnosis of Such diseases. exogenous DNA molecule that was introduced into a cell of the animal (e.g., an embryonic cell) at a very early stage in 0353 Small GTPase Regulatory/Interacting Protein: the animals development. 0354) This category contains proteins such as escort 0344 Intronic sequences and polyadenylation signals can protein, guanyl-nucleotide exchange factor, guanyl-nucle be included in the transgene and, when present, can increase otide exchange factor adaptor, GDP-dissociation inhibitor, expression. One or more tissue-specific regulatory GTPase inhibitor, GTPase activator, guanyl-nucleotide sequences can also be operably linked to a transgene of the releasing factor, GDP-dissociation stimulator, regulator of invention to direct expression of protein to particular cells G-protein signaling, RAS interactor, RHO interactor, RAB (exemplary regulatory sequences are described above, and interactor, RAL interactor. many others are known to those of ordinary skill in the art). 0355 Pharmaceutical compositions including such pro teins or protein encoding sequences, antibodies directed 0345. A “founder animal is one that carries a transgene against Such proteins or polynucleotides capable of altering of the invention in its genome or expresses mRNA from the expression of Such proteins may be used to treat diseases in transgene in its cells or tissues. Founders can be bred to which the signal-transduction, typically involving G-pro produce a line of transgenic animals carrying the founders transgene or bred with founders carrying other transgenes teases is non-normal, either as a cause, or as a result of the (in which case the progeny would bear the transgenes borne disease; while probe sequences or antibodies may be used by both founders). Accordingly, the invention features for diagnosis of Such diseases. founder animals, their progeny, cells or populations of cells 0356) Calcium Binding: obtained therefrom, and proteins obtained therefrom. For 0357 This category contains calcium binding proteins, example, a nucleic acid of the invention can be placed under ligand binding or carriers, Such as , the control of a promoter that directs expression of the Calpain, calcium-dependent protein serine/threonine phos encoded protein in the milk or eggs of the transgenic animal. phatase, calcium sensing proteins, calcium storage proteins. The protein can then be purified or recovered from the animal’s milk or eggs. Animals suitable for Such purpose 0358. : include pigs, cows, goats, sheep, and chickens. 0359. This category contains enzymes that catalyze oxi 0346) The biomolecular sequences of the present inven dation-reduction reactions, such as acting tion can be divided to functional groups, according to GO on the following groups of donors: CH-OH, CH-CH, classification (www.geneontology.org), defined by the activ CH-NH2, CH NH; oxidoreductases acting on NADH or ity of the original sequences from which the new variants NADPH, nitrogenous compounds, Sulfur group of donors, have been identified or to which the novel genes are homolo heme group, hydrogen group, diphenols and related Sub gous. Based on this classification it is possible to identify stances as donors: oxidoreductases acting on peroxide as diseases and conditions which can be diagnosed and treated acceptor, Superoxide radicals as acceptor, oxidizing metal using novel sequence information and annotations such as ions, CH2 groups: oxidoreductases acting on reduced ferre those uncovered by the present invention. doxin as donor, oxidoreductases acting on reduced fla Vodoxin as donor; and oxidoreductases acting on the alde 0347 Immunoglobulin: hyde or oxo group of donors. 0348. This category contains proteins that are involved in 0360 Pharmaceutical compositions including such pro the immune and complement systems such as antigens and teins or protein encoding sequences, antibodies directed autoantigens, immunoglobulins, MHC and HLA proteins against Such proteins or polynucleotides capable of altering and their associated proteins. expression of Such proteins may be used to treat diseases 0349 Pharmaceutical compositions including such pro caused by non-normal activity of oxidoreductases; while teins or protein encoding sequences, antibodies directed probe sequences or antibodies may be used for diagnosis of against Such proteins or polynucleotides capable of altering Such diseases. expression of Such proteins may be used to treat diseases 0361 Receptors: involving the immunological system including inflamma tion, autoimmune diseases, infectious diseases, as well as 0362. This category contains various receptors, such as cancerous processes; while probe sequences or antibodies signal transducers, complement receptors, ligand-dependent may be used for diagnosis of Such diseases. nuclear receptors, transmembrane receptors, GPI-anchored membrane-bound receptors, various coreceptors, internal 0350 Transcription Factor Binding: ization receptors, receptors to neurotransmitters, hormones 0351. This category contains proteins involved in tran and various other effectors and ligands. scription factors binding, RNA and DNA binding, such as 0363 Pharmaceutical compositions including such pro transcription factors, RNA and DNA binding proteins, Zinc teins or protein encoding sequences, antibodies directed fingers, helicase, , histones, nucleases. against Such proteins or polynucleotides capable of altering US 2007/0083334 A1 Apr. 12, 2007 26 expression of Such proteins may be used to treat diseases leading to various pathologies; while probe sequences or caused by non-normal activity of oxidoreductases diseases antibodies may be used for diagnosis of Such diseases. involving various receptors, including receptors to neu rotransmitters, hormones and various other effectors and 0371 , Acting on Acid Anhydrides: ligands; while probe sequences or antibodies may be used 0372 This category contains hydrolytic enzymes that are for diagnosis of Such diseases. acting on acid anhydrides, such as hydrolases acting on acid 03.64 Examples of these diseases include, but are not anhydrides, in phosphorus-containing anhydrides, in Sulfo limited to, chronic myelomonocytic leukemia caused by nyl-containing anhydrides; and hydrolases catalysing trans growth factor beta receptor deficiency Rao DS, Chang J C, membrane movement of Substances, and involved in cellular Kumar P D, Mizukami I, Smithson G. M. Bradley S V. and Subcellular movement. Parlow A F, Ross T S (2001) Mol Cell Biol, 21 (22):7796 0373 Pharmaceutical compositions including such pro 806), thrombosis associated with protease-activated receptor teins or protein encoding sequences, antibodies directed deficiency Sambrano G R. Weiss E J, Zheng Y W. Huang against Such proteins or polynucleotides capable of altering W. Coughlin S R (2001) Nature, 413 (6851):26-7), hyper expression of Such proteins may be used to treat diseases in cholesterolemia associated with low density lipoprotein which the -related activities are non-normal receptor deficiency Koivisto U. M. Hubbard AL, Mellman (increased or decreased); while probe sequences or antibod I (2001) Cell, 105 (5):575-85), familial Hibernian fever ies may be used for diagnosis of Such diseases. associated with tumour necrosis factor receptor deficiency Simon A, Drenth J. P. van der Meer J W (2001) Ned Tijdschr 0374 , Transferring Phosphorus-Containing Geneeskd, 145 (2):77-8), colitis associated with immuno Groups: globulin E receptor expression Dombrowicz, D, Nutten S. Desreumaux P. Neut C, Torpier G, Peeters M, Colombel J F. 0375. This category contains various enzymes that cata Capron M (2001) J Exp Med, 193 (1):25-34), and alagille lyze the transfer of phosphate from one molecule to another, syndrome associated with Jagged 1 Stankiewicz, P, Ruijner J. Such as using the following groups as Loffler C. Kruger A, Nimmakayalu M. Pilacik B, Krajew acceptors: alcohol group, carboxyl group, nitrogenous ska-Walasek M. Gutkowska A, Hansmann I, Giannakudis I group, phosphate; phosphotransferases with regeneration of donors catalysing intramolecular transfers; diphosphotrans (2001) Am J Med Genet, 103 (2):166-71). ferases; ; and phosphotransferases for 0365 Protein Serine/Threonine : other substituted phosphate groups. 0366. This category contains kinases which phosphori 0376 Pharmaceutical compositions including such pro late serine/threonine residues, mainly involved in signal teins or protein encoding sequences, antibodies directed transduction, such as transmembrane receptor protein serine/ against Such proteins or polynucleotides capable of altering threonine kinase, 3-phosphoinositide-dependent protein expression of Such proteins may be used to treat diseases in kinase, DNA-dependent , G-protein-coupled which the transfer of functional group to a modulated moiety receptor phosphorylating protein kinase, SNF1A/AMP-ac is not normal so that a beneficial effect may be achieved by tivated protein kinase, casein kinase, calmodulin regulated modulation of Such transfer, while probe sequences or protein kinase, cyclic-nucleotide dependent protein kinase, antibodies may be used for diagnosis of Such diseases. cyclin-dependent protein kinase, eukaryotic translation ini tiation factor 2alpha kinase, galactosyltransferase-associated 0377 Phosphoric Monoester Hydrolases: kinase, glycogen synthase kinase 3, protein kinase C, recep 0378. This category contains hydrolytic enzymes that are tor signaling protein serine/threonine kinase, ribosomal pro acting on ester bonds, such as: nuclease, Sulfuric ester tein S6 kinase, IkB kinase. hydrolase, carboxylic ester hydrolase, thiolester hydrolase, 0367 Pharmaceutical compositions including such pro phosphoric monoester hydrolase, phosphoric diester hydro teins or protein encoding sequences, antibodies directed lase, triphosphoric monoester hydrolase, diphosphoric against Such proteins or polynucleotides capable of altering monoester hydrolase, and phosphoric triester hydrolase. expression of Such proteins may be used to treat diseases which may be ameliorated by a modulating kinase activity, 0379 Pharmaceutical compositions including such pro which is one of the main signaling pathways inside cell; teins or protein encoding sequences, antibodies directed while probe sequences or antibodies may be used for diag against Such proteins or polynucleotides capable of altering nosis of Such diseases. expression of Such proteins may be used to treat diseases in which the hydrolytic cleavage of a covalent bond with 0368 Channel/Pore Class Transporters: accompanying addition of water, —H being added to one product of the cleavage and —OH to the other, is not normal 0369. This category contains proteins that mediate the so that a beneficial effect may be achieved by modulation of transport of molecules and macromolecules across mem Such reaction; while probe sequences or antibodies may be branes, such as alpha-type channels, porins, pore-forming used for diagnosis of Such diseases. toxins. 0380 Enzyme Inhibitors: 0370 Pharmaceutical compositions including such pro teins or protein encoding sequences, antibodies directed 0381. This category contains inhibitors and suppressors against Such proteins or polynucleotides capable of altering of other proteins and enzymes, such as inhibitors of kinases, expression of Such proteins may be used to treat diseases in phosphatases, chaperones, guanylate cyclase, DNA gyrase, which the transport of molecules and macromolecules Such ribonuclease, proteasome inhibitors, diazepam-binding as neurotransmitters, hormones, Sugar etc. is non-normal inhibitor, ornithine decarboxylase inhibitor, GTPase inhibi US 2007/0083334 A1 Apr. 12, 2007 27 tors, dOTP pyrophosphatase inhibitor, phospholipase inhibi may be achieved by modulation of such reaction; while tor, proteinase inhibitor, protein biosynthesis inhibitors, probe sequences or antibodies may be used for diagnosis of alpha-amylase inhibitors. Such diseases. 0382 Pharmaceutical compositions including such pro 0394 Hydrolases, Acting on Glycosyl Bonds: teins or protein encoding sequences, antibodies directed against Such proteins or polynucleotides capable of altering 0395. This category contains hydrolytic enzymes that are expression of Such proteins may be used to treat diseases in acting on glycosyl bonds, such as hydrolases hydrolyzing which beneficial effect may be achieved by modulating the N-glycosyl compounds, S-glycosyl compounds, and O-gly activity of inhibitors and Suppressors of proteins and cosyl compounds. enzymes; while probe sequences or antibodies may be used 0396 Pharmaceutical compositions including such pro for diagnosis of Such diseases. teins or protein encoding sequences, antibodies directed 0383 Electron Transporters: against Such proteins or polynucleotides capable of altering expression of Such proteins may be used to treat diseases in 0384. This category contains ligand binding or carrier which the hydrolase-related activities are non-normal proteins involved in electron transport, Such as: flavin (increased or decreased); while probe sequences or antibod containing electron transporter, cytochromes, electron ies may be used for diagnosis of Such diseases. donors, electron acceptors, electron carriers, and cyto chrome-c oxidases. 0397 Kinases: 0385 Transferases, Transferring Glycosyl Groups: 0398. This category contains kinases, which phosphori late serine/threonine or tyrosine residues, mainly involved in 0386 This category contains various enzymes that cata signal transduction. It covers enzymes Such as 2-amino-4- lyze the transfer of a chemical group. Such as a glycosyl, hydroxy-6-hydroxymethyldihydropteridine pyrophosphoki from one molecule to another. It covers enzymes Such as nase, NAD(+) kinase, acetylglutamate kinase, adenosine murein lytic endotransglycosylase E, and Sialyltransferase. kinase, , adenylsulfate kinase, arginine 0387 Pharmaceutical compositions including such pro kinase, , choline kinase, , teins or protein encoding sequences, antibodies directed cytidylate kinase, deoxyadenosine kinase, deoxycytidine against Such proteins or polynucleotides capable of altering kinase, deoxyguanosine kinase, dephospho-CoAkinase, dia expression of such proteins may be used to treat diseases in cylglycerol kinase, dolichol kinase, ethanolamine kinase, which the transfer of a glycosyl chemical group from one , , glutamate 5-kinase, glycerol molecule to another is not normal so that a beneficial effect kinase, glycerone kinase, , , may be achieved by modulation of such reaction; while homoserine kinase, hydroxyethylthiazole kinase, inositol/ probe sequences or antibodies may be used for diagnosis of phosphatidylinositol kinase, ketohexokinase, , nucleoside-diphosphate kinase, , Such diseases. phosphoenolpyruvate carboxykinase, phosphoglycerate 0388 . Forming Carbon-Oxygen Bonds: kinase, , protein kinase, pyruvate 0389. This category contains enzymes that catalyze the dehydrogenase (lipoamide) kinase, , riboki linkage between carbon and oxygen, Such as forming nase, ribose-phosphate pyrophosphokinase, selenide, water aminoacyl-tRNA and related compounds. dikinase, , thiamine pyrophosphokinase, , thymidylate kinase, uridine kinase, Xylu 0390 Pharmaceutical compositions including such pro lokinase, 1D-myo-inositol-trisphosphate 3-kinase, phospho teins or protein encoding sequences, antibodies directed , pyridoxal kinase, sphinganine kinase, ribofla against Such proteins or polynucleotides capable of altering vin kinase, 2-dehydro-3-deoxygalactonokinase, 2-dehydro expression of Such proteins may be used to treat diseases in 3-deoxygluconokinase, 4-diphosphocytidyl-2C-methyl-D- which the linkage between carbon and oxygen in an energy erythritol kinase, GTP pyrophosphokinase, L-fuculokinase, dependent process is not normal so that a beneficial effect L-ribulokinase, L-xylulokinase, isocitrate dehydrogenase may be achieved by modulation of such reaction; while (NADP+) kinase, acetate kinase, allose kinase, carbamate probe sequences or antibodies may be used for diagnosis of kinase, cobinamide kinase, diphosphate-purine nucleoside Such diseases. kinase, fructokinase, glycerate kinase, hydroxymethylpyri midine kinase, hygromycin-B kinase, inosine kinase, kana 0391) Ligases: mycin kinase, phosphomethylpyrimidine kinase, phosphor 0392 This category contains enzymes that catalyze the ibulokinase, polyphosphate kinase, propionate kinase, linkage of two molecules, generally utilizing ATP as the pyruvate, water dikinase, rhamnulokinase, tagatose-6-phos energy donor, also called synthetase. It covers enzymes Such phate kinase, tetraacyldisaccharide 4'-kinase, thiamine as beta-alanyl-dopamine hydrolase, carbon-oxygen bonds phosphate kinase, undecaprenol kinase, uridylate kinase, forming ligase, carbon-sulfur bonds forming ligase, carbon N-acylmannosamine kinase, D-erythro-. bonds forming ligase, carbon-carbon bonds form 0399 Pharmaceutical compositions including such pro ing ligase, and phosphoric ester bonds forming ligase. teins or protein encoding sequences, antibodies directed 0393 Pharmaceutical compositions including such pro against Such proteins or polynucleotides capable of altering teins or protein encoding sequences, antibodies directed expression of Such proteins may be used to treat diseases against Such proteins or polynucleotides capable of altering which may be ameliorated by a modulating kinase activity, expression of Such proteins may be used to treat diseases in which is one of the main signaling pathways inside cell; which the joining together of two molecules in an energy while probe sequences or antibodies may be used for diag dependent process is not normal so that a beneficial effect nosis of Such diseases. US 2007/0083334 A1 Apr. 12, 2007 28

04.00 Examples of these diseases include, but are not transducer, transmembrane receptor protein limited to, acute lymphoblastic leukemia associated with signaling protein, transmembrane receptor protein serine/ spleen tyrosine kinase deficiency Goodman PA, Wood C threonine kinase signaling protein, receptor signaling pro M. Vassilev A. Mao C, Uckun F M (2001) Oncogene, 20 tein serine/threonine kinase signaling protein, receptor Sig (30):3969-78), ataxia telangiectasia associated with ATM naling protein serine/threonine phosphatase signaling kinase deficiency (Boultwood J (2001) J Clin Pathol, 54 protein, Small GTPase regulatory/interacting protein, recep (7):51 2-6), congenital haemolytic anaemia associated with tor signaling protein tyrosine kinase signaling protein, erythrocyte pyruvate kinase deficiency Zanella A, Bianchi receptor signaling protein serine/threonine phosphatase. P. Fermo E, Iurlo A. Zappa M, Vercellati C, Boschetti C, Baronciani L, Cotton F (2001) BrJ Haematol, 113 (1):43-8), 04.09 Pharmaceutical compositions including such pro mevalonic aciduria caused by mevalonate kinase deficiency teins or protein encoding sequences, antibodies directed Houten SM, Koster J, Romeijn GJ, Frenkel J. Di Rocco M, against Such proteins or polynucleotides capable of altering Caruso U, Landrieu P. Kelley R I, Kuis W. Poll-The BT, expression of Such proteins may be used to treat diseases in Gibson KM, Wanders RJ, Waterham H R (2001) Eur J. Hum which the signal-transduction is non-normal, either as a Genet, 9 (4): 253-9, and acute myelogenous leukemia asso cause, or as a result of the disease; while probe sequences or ciated with over-expressed death-associated protein kinase antibodies may be used for diagnosis of Such diseases. Guzman M. L., Upchurch D. Grimes B. Howard D S. 0410 Examples of these diseases include, but are not Rizzieri DA, Luger S M, Phillips GL, Jordan CT (2001) limited to, complete hypogonadotropic hypogonadism asso Blood, 97 (7):2177-9). ciated with GnRH receptor deficiency Kottler ML, Chau vin S, Lahlou N. Harris C E. Johnston C J, Lagarde J P. 0401) Nucleotide Binding: Bouchard P. Farid N. R. Counis R (2000) J Clin Endocrinol 0402. This category contains ligand binding or carrier Metab. 85 (9):3002-8), severe combined immunodeficiency proteins, involved in physical interaction with a nucle disease associated with IL-7 receptor deficiency (Puel A. otide-any compound consisting of a nucleoside that is Leonard W J (2000) Curr Opin Immunol, 12 (4):468-73), esterified with orthophosphate or an oligophosphate at any Schizophrenia associated N-methyl-D-aspartate receptor hydroxyl group on the glycose moiety, Such as purine deficiency (Mohn AR, Gainetdinov R R. Caron MG; Koller nucleotide binding proteins. B H (1999) Cell, 98 (4):427-36), Yersinia-associated arthri tis associated with tumor necrosis factor receptor p55 defi 0403) Binding: ciency Zhao Y X. Zhang H. Chiu B, Payne U, Inman RD 04.04 This category contains binding proteins that bind (1999) Arthritis Rheum, 42 (8):1662-72), and Dwarfism of tubulin, Such as microtubule binding proteins. Sindh caused by growth hormone-releasing hormone recep tor deficiency aheshwari H G, Silverman B L. Dupuis J. 04.05 Pharmaceutical compositions including such pro teins or protein encoding sequences, antibodies directed Baumann G (1998) J Clin Endocrinol Metab, 83 (11):4065 against Such proteins or polynucleotides capable of altering 74). expression of Such proteins may be used to treat diseases 0411 Molecular Function Unknown: which are associated with non-normal tubulin activity or structure. Binding of the products of the genes of this family, 0412. This category contains various proteins with or antibodies reactive therewith, can modulate a plurality of unknown molecular function, Such as cell Surface antigens. tubulin activities as well as change microtubulin structure; 0413 Pharmaceutical compositions including such pro while probe sequences or antibodies may be used for diag teins or protein encoding sequences, antibodies directed nosis of Such diseases. against Such proteins or polynucleotides capable of altering expression of Such proteins may be used to treat diseases in 04.06 Examples of these diseases include, but are not which regulation of the recognition, or participation or bind limited to, Alzheimer's disease associated with t-complex of cell Surface antigens to other moieties may improve the polypeptide 1 deficiency Schuller E, Gulesserian T. Seidl R. disease. These diseases include autoimmune diseases, vari Cairns N, Lube G. (2001) Life Sci, 69 (3):263-70), neuro ous infectious diseases, cancer diseases which involve non degeneration associated with apoE deficiency Masliah E. cell Surface antigens recognition and activity, etc., while Mallory M, Ge N, Alford M, Veinbergs I, Roses AD (1995) Exp Neurol, 136 (2): 107-22), progressive axonopathy asso probe sequences or antibodies may be used for diagnosis of ciated with dysfunctional neurofilaments Griffiths I R, Such diseases. Kyriakides E. Barrie J (1989) Neuropathol Appl Neurobiol, 0414 Enzyme Activators: 15 (1):63-74 familial frontotemporal dementia associated 0415. This category contains enzyme regulators, such as with tau deficiency astor P. Pastor E. Carnero C. Vela R. activators of kinases, phosphatases, sphingolipids, chaper Garcia T, Amer G, Tolosa E, Oliva R (2001) Ann Neurol, 49 ones, guanylate cyclase, tryptophan hydroxylase, proteases, (2):263-7), and colon cancer suppressed by APC White R phospholipases, caspases, proprotein convertase 2 activator, L (1997) Pathol Biol (Paris), 45 (3):240-4). cyclin-dependent protein kinase 5 activator, Superoxide 04.07 Receptor Signaling Proteins: generating NADPH oxidase activator, sphingomyelin phos phodiesterase activator, monophenol monooxygenase acti 0408. This category contains receptor proteins involved in signal transduction, such as receptor signaling protein vator, proteasome activator, GTPase activator. serine/threonine kinase, receptor signaling protein tyrosine 0416) Pharmaceutical compositions including such pro kinase, receptor signaling protein tyrosine phosphatase, aryl teins or protein encoding sequences, antibodies directed hydrocarbon receptor nuclear translocator, hematopoeitin/ against Such proteins or polynucleotides capable of altering interferon-class (D200-domain) cytokine receptor signal expression of Such proteins may be used to treat diseases in US 2007/0083334 A1 Apr. 12, 2007 29 which beneficial effect may be achieved by modulating the 0426 Cell Adhesion Molecule: activity of activators of proteins and enzymes; while probe 0427. This category contains proteins that serve as adhe sequences or antibodies may be used for diagnosis of Such sion molecules between adjoining cells. Such as: membrane diseases. associated protein with guanylate kinase activity, cell adhe sion receptor, neuroligin, calcium-dependent cell adhesion 0417 Transferases, Transferring One-Carbon Groups: molecule, selectin, calcium-independent cell adhesion mol 0418. This category contains various enzymes that cata ecule, extracellular matrix protein. lyze the transfer of a chemical group. Such as a one-carbon, 0428 Pharmaceutical compositions including such pro from one molecule to another. The category covers enzymes teins or protein encoding sequences, antibodies directed Such as methyltransferase, amidinotransferase, hydroxym against Such proteins or polynucleotides capable of altering ethyl-, formyl- and related transferase, carboxyl- and car expression of Such proteins may be used to treat diseases in bamoyltransferase. which adhesion between adjoining cells is involved, typi cally conditions in which the adhesion is non-normal; while 0419 Pharmaceutical compositions including such pro probe sequences or antibodies may be used for diagnosis of teins or protein encoding sequences, antibodies directed Such diseases. Typical examples of Such conditions are against Such proteins or polynucleotides capable of altering cancer conditions in which non-normal adhesion may cause expression of Such proteins may be used to treat diseases in and enhance the process of metastasis. Other examples of which the transfer of a one-carbon chemical group from one Such conditions include conditions of non-normal growth molecule to another is not normal so that a beneficial effect and development of various tissues in which modulation may be achieved by modulation of such reaction; while adhesion among adjoining cells can improve the condition. probe sequences or antibodies may be used for diagnosis of Such diseases. 0429 Examples of theses diseases include, but are not limited to, Wiskott-Aldrich syndrome associated with WAS 0420 Transferases: deficiency Westerberg L. Greicius G. Snapper SB, Aspen strom P. Severinson E (2001) Blood, 98 (4): 1086-94), 0421. This category contains various enzymes that cata asthma associated with intercellular adhesion molecule-1 lyze the transfer of a chemical group, Such as a phosphate or deficiency Tang M L. Fiscus LC (2001) Pulm Pharmacol amine, from one molecule to another. It covers enzymes Ther, 14(3):203-10), intra-atrial thrombogenesis associated Such as: transferases, transferring one-carbon groups, alde with increased von Willebrand factor activity Fukuchi M. hyde or ketonic groups, acyl groups, glycosyl groups, alkyl Watanabe J. Kumagai K. Katori Y. Baba S, Fukuda K. Yagi oraryl (other than methyl) groups, nitrogenous, phosphorus T, Iguchi A, Yokoyama H. Miura M. Kagaya Y. Sato S, containing groups, Sulfur-containing groups, lipoyltrans Tabayashi K, Shirato K (2001) J Am Coll Cardiol, 37 ferase, deoxycytidyl transferases. (5):1436-42, junctional epidermolysis bullosa associated 0422 Pharmaceutical compositions including such pro with laminin 5 beta3 deficiency Robbins P B, Lin Q. teins or protein encoding sequences, antibodies directed Goodnough J B, Tian H, Chen X, Khavari PA (2001) Proc against Such proteins or polynucleotides capable of altering Natl Acad Sci USA, 98 (9):5193-8), and hydrocephalus expression of Such proteins may be used to treat diseases in caused by neural adhesion molecule L1 deficiency Rolf B. which the transfer of a chemical group from one molecule to Kutsche M. Bartsch U (2001) Brain Res, 891 (1-2):247-52). another is not normal so that a beneficial effect may be 0430 Motor Proteins: achieved by modulation of such reaction; while probe 0431. This category contains proteins that are held to sequences or antibodies may be used for diagnosis of Such generate force or energy by the hydrolysis of ATP and that diseases. functions in the production of intracellular movement or transportation. It covers proteins such as: microfilament 0423 Chaperone: motor, axonemal motor, microtubule motor, and kinetochore 0424. This category contains functional classes of unre motor (like , , or ). lated families of proteins that assist the correct non-covalent 0432 Defense/Immunity Proteins: assembly of other polypeptide-containing structures in Vivo, 0433. This category contains proteins that are involved in but are not components of these assembled structures when the immune and complement systems, such as acute-phase they a performing their normal biological function. The response proteins, antimicrobial peptides, antiviral response category covers proteins such as: ribosomal chaperone, proteins, blood coagulation factors, complement compo peptidylprolyl isomerase, lectin-binding chaperone, nucleo nents, immunoglobulins, major histocompatibility complex Some assembly chaperone, ATPase, cochaper antigens, opsonins. one, heat shock protein, HSP70/HSP90 organizing protein, fimbrial chaperone, metallochaperone, tubulin folding, 0434 Pharmaceutical compositions including such pro HSC70-interacting protein. teins or protein encoding sequences, antibodies directed against Such proteins or polynucleotides capable of altering 0425 Pharmaceutical compositions including such pro expression of Such proteins may be used to treat diseases teins or protein encoding sequences, antibodies directed involving the immunological system including inflamma against Such proteins or polynucleotides capable of altering tion, autoimmune diseases, infectious diseases, as well as expression of Such proteins may be used to treat diseases cancerous processes or diseases which are manifested by which are associated with non-normal protein activity or non-normal coagulation processes, which may include structure or abnormal degradation of Such proteins; while abnormal bleeding or excessive coagulation., while probe probe sequences or antibodies may be used for diagnosis of sequences or antibodies may be used for diagnosis of Such Such diseases. diseases. US 2007/0083334 A1 Apr. 12, 2007 30

0435 Examples of these diseases include, but are not associated with ATP-binding cassette transporter-1 defi limited to, late (C5-9) complement component deficiency ciency (McNeish J. Aiello RJ, Guyot D, Turi T. Gabel C, associated with opsonin receptor allotypes Fijen C A, Aldinger C. Hoppe K L. Roach ML, Royer LJ, de Wet J. Bredius R C, Kuijper E. J. Out TA, De Haas M, De Wit A Broccardo C, Chimini G, Francone O L (2000) Proc Natl P. Daha MR, De Winkel J G (2000) Clin Exp Immunol, 120 Acad Sci USA, 97 (8): 4245-50), systemic primary carnitine (2):338-45), combined immunodeficiency associated with deficiency associated with organic cation transporter defi defective expression of MHC class II genes Griscelli C. ciency (Tang NL, Ganaphthy V. Wu X, Hui J. Seth P. Yuen Lisowska-Grospierre B. Mach B (1989) Immunodefic Rev. 1 PM, Wanders RJ, Fok T F, Hjelm N M (1999) Hum Mol (2):135-53, loss of antiviral activity of CD4 T cells caused Genet, 8 (4):655-60), Wilson disease associated with cop by neutralization of endogenous TNF alpha Pavic I, Polic per-transporting deficiency (Payne AS, Kelly E. J. B, Crnkovic I, Lucin P. Jonjic S. Koszinowski U H (1993) Gitlin J D (1998) Proc Natl Acad Sci USA, 95 (18): 10854 J Gen Virol. 74 (Pt 10):2215-23), autoimmune diseases 9), and atelosteogenesis associated with diastrophic dyspla associated with natural resistance-associated macrophage sia sulphate transporter deficiency (Newbury-Ecob R (1998) protein deficiency Evans CA, Harbuz, MS, Ostenfeld T. J Med Genet, 35 (1):49-53). Norrish A, Blackwell J M (2001) Neurogenetics, 3 (2):69 78), and Epstein-Barr virus-associated lymphoproliferative 0443) : disease inhibited by combined GM-CSF and IL-2 therapy 0444 This category contains enzymes that catalyze the Baiocchi R A, Ward J S. Carrodeguas L. Eisenbeis C F. formation of double bonds by removing chemical groups Peng R, Roychowdhury S. Vourganti S. Sekula T. O’Brien from a substrate without hydrolysis or catalyze the addition M, Moeschberger M, Caligiuri MA (2001).J. Clin Invest, 108 of chemical groups to double bonds. It covers enzymes Such (6):887–94). as carbon-carbon , carbon-oxygen lyase, carbon-nitro gen lyase, carbon-sulfur lyase, carbon-halide lyase, phos 0436 Intracellular Transporters: phorus-oxygen lyase, and other lyases. 0437. This category contains proteins that mediate the transport of molecules and macromolecules inside the cell, 0445 Actin Binding Proteins: Such as: intracellular nucleoside transporter, vacuolar 0446. This category contains actin binding proteins, such assembly proteins, vesicle transporters, vesicle fusion pro as actin cross-linking, actin bundling, F-actin capping, actin teins, type II protein secretors. monomer binding, actin lateral binding, actin depolymeriZ 0438 Pharmaceutical compositions including such pro ing, actin monomer sequestering, actin filament severing, teins or protein encoding sequences, antibodies directed actin modulating, membrane associated actin binding, actin against Such proteins or polynucleotides capable of altering thin filament length regulation, and actin polymerizing pro expression of Such proteins may be used to treat diseases in teins. which the transport of molecules and macromolecules is 0447 Protein Binding Proteins: non-normal leading to various pathologies; while probe 0448. This category contains various proteins, involved sequences or antibodies may be used for diagnosis of Such in diverse biological functions, such as: intermediate fila diseases. ment binding, LIM-domain binding, LLR-domain binding, 0439 Transporters: clathrin binding, ARF binding, Vinculin binding, KU70 binding, troponin C binding PDZ-domain binding, SH3 0440 This category contains proteins that mediate the domain binding, fibroblast growth factor binding, mem transport of molecules and macromolecules, such as chan brane-associated protein with guanylate kinase activity nels, exchangers, pumps. The category covers proteins such interacting, Wnt-protein binding, DEAD/H-box RNA heli as: amine/polyamine transporter, lipid transporter, neu case binding, beta-amyloid binding, myosin binding, TATA rotransmitter transporter, organic acid transporter, oxygen binding protein binding DNA topoisomerase I binding, transporter, water transporter, carriers, intracellular trans polypeptide hormone binding, RHO binding, FH1-domain portes, protein transporters, ion transporters, carbohydrate binding, syntaxin-1 binding, HSC70-interacting, transcrip transporter, polyol transporter, amino acid transporters, Vita tion factor binding, metarhodopsin binding, tubulin binding, min/ transporters, siderophore transporter, drug JUN kinase binding, protein binding, protein signal transporter, channel/pore class transporter, group transloca sequence binding, importin alpha export receptor, poly tor, auxiliary transport proteins, permeases, murein trans glutamine tract binding, protein carrier, beta-catenin bind porter, organic alcohol transporter, nucleobase, nucleoside, ing, protein C-terminus binding, lipoprotein binding, cytosk nucleotide and nucleic acid transporters. eletal protein binding protein, nuclear localization sequence 0441 Pharmaceutical compositions including such pro binding, protein phosphatase 1 binding, adenylate cyclase teins or protein encoding sequences, antibodies directed binding, eukaryotic initiation factor 4E binding, calmodulin against Such proteins or polynucleotides capable of altering binding, collagen binding, insulin-like growth factor bind expression of Such proteins may be used to treat diseases in ing, lamin binding, profilin binding, tropomyosin binding, which the transport of molecules and macromolecules Such actin binding, peroxisome targeting sequence binding, as neurotransmitters, hormones, Sugar etc. is non-normal SNARE binding, cyclin binding. leading to various pathologies; while probe sequences or 0449 Pharmaceutical compositions including such pro antibodies may be used for diagnosis of Such diseases. teins or protein encoding sequences, antibodies directed 0442. Examples of these diseases include, but are not against Such proteins or polynucleotides capable of altering limited to, glycogen storage disease caused by glucose-6- expression of Such proteins may be used to treat diseases phosphate transporter deficiency (Hiraiwa H. Chou J Y which are associated with non-normal protein activity or (2001) DNA Cell Biol, 20 (8):447-53), tangier disease structure. Binding of the products of the variants of this US 2007/0083334 A1 Apr. 12, 2007

family, or antibodies reactive therewith, can modulate a so that a beneficial effect may be achieved by modulation of plurality of protein activities as well as change protein Such reaction; while probe sequences or antibodies may be structure; while probe sequences or antibodies may be used used for diagnosis of Such diseases. for diagnosis of Such diseases. 0460 Hydrolases: 0450 Ligand Binding or Carrier Proteins: 0461 This category contains hydrolytic enzymes, such as 0451. This category contains various proteins, involved GPI-anchor transamidase, peptidases, hydrolases, acting on in diverse biological functions, such as: pyridoxal phosphate ester bonds, glycosyl bonds, ether bonds, carbon-nitrogen binding, carbohydrate binding, magnesium binding, amino (but not peptide) bonds, acid anhydrides, acid carbon-carbon acid binding, cyclosporin A binding, nickel binding, chlo bonds, acid halide bonds, acid phosphorus-nitrogen bonds, rophyll binding, biotin binding, penicillin binding, selenium acid Sulfur-nitrogen bonds, acid carbon-phosphorus bonds, binding, tocopherol binding, lipid binding, drug binding, acid sulfur-sulfur bonds. oxygen transporter, electron transporter, Steroid binding, juvenile hormone binding, retinoid binding, heavy metal 0462 Pharmaceutical compositions including such pro binding, calcium binding, protein binding, glycosaminogly teins or protein encoding sequences, antibodies directed can binding, folate binding, odorant binding, lipopolysac against Such proteins or polynucleotides capable of altering charide binding, nucleotide binding. expression of Such proteins may be used to treat diseases in which the hydrolytic cleavage of a covalent bond with 0452 ATPases: accompanying addition of water, —H being added to one 0453 This category contains enzymes that catalyze the product of the cleavage and —OH to the other, is not normal hydrolysis of ATP to ADP releasing energy that is used in so that a beneficial effect may be achieved by modulation of the cell; adenosine triphosphatase. It covers enzymes Such as Such reaction; while probe sequences or antibodies may be plasma membrane cation-transporting ATPase, ATP-binding used for diagnosis of Such diseases. cassette (ABC) transporter, magnesium-ATPase, hydrogen-f 0463) Enzymes: Sodium-translocating ATPase, arsenite-transporting ATPase, protein-transporting ATPase, DNA , P-type 0464) This category contains naturally occurring or syn ATPase, hydrolase, acting on acid anhydrides, involved in thetic macromolecular Substance composed wholly or cellular and subcellular movement. largely of protein, that catalyzes, more or less specifically, one or more (bio)chemical reactions at relatively low tem 0454 Carboxylic Ester Hydrolases: peratures. The action of RNA that has catalytic activity 0455 This category contains hydrolytic enzymes, acting (ribozyme) is often also regarded as enzymic. Nevertheless, on carboxylic ester bonds, such as N-acetylglucosami enzymes are mainly proteinaceous and are often easily nylphosphatidylinositol deacetylase, 2-acetyl-1-alkylglyc inactivated by heating or by protein-denaturing agents. The erophosphocholine esterase, aminoacyl-tRNA hydrolase, Substances upon which they act are known as Substrates, for arylesterase, carboxylesterase, cholinesterase, gluconolacto which the enzyme possesses a specific binding or active site. nase, sterol esterase, acetylesterase, carboxymethyleneb 0465. This category covers various proteins possessing utenolidase, protein-glutamate methylesterase, lipase, enzymatic activities, such as mannosylphosphate trans 6-phosphogluconolactonase. ferase, para-hydroxybenzoate:polyprenyltransferase, Rieske 0456 Pharmaceutical compositions including such pro iron-sulfur protein, imidazoleglycerol-phosphate synthase, teins or protein encoding sequences, antibodies directed sphingosine hydroxylase, tRNA 2'-, Ste against Such proteins or polynucleotides capable of altering rol C-24 (28) reductase, C-8 sterol isomerase, C-22 sterol expression of Such proteins may be used to treat diseases in desaturase, C-14 sterol reductase, C-3 sterol dehydrogenase which the hydrolytic cleavage of a covalent bond with (C-4 sterol decarboxylase), 3-keto sterol reductase, C-4 accompanying addition of water, —H being added to one methyl sterol oxidase, dihydronicotinamide riboside product of the cleavage and —OH to the other, is not normal quinone reductase, glutamate phosphate reductase, DNA so that a beneficial effect may be achieved by modulation of repair enzyme, , alpha-ketoacid dehydrogenase, Such reaction; while probe sequences or antibodies may be beta-alanyl-dopamine synthase, RNA editase, aldo-keto used for diagnosis of Such diseases. reductase, alkylbase DNA glycosidase, glycogen debranch ing enzyme, dihydropterin deaminase, dihydropterin oxi 0457 Hydrolase, Acting on Ester Bonds: dase, dimethylnitrosamine demethylase, ecdysteroid UDP 0458. This category contains hydrolytic enzymes, acting glucosyl/UDPglucuronosyl transferase, glycine cleavage on ester bonds, such as nucleases, Sulfuric ester hydrolase, system, helicase, histone deacetylase, mevaldate reductase, carboxylic ester hydrolases, thiolester hydrolase, phosphoric monooxygenase, poly(ADP-ribose) glycohydrolase, pyru monoester hydrolase, phosphoric diester hydrolase, triphos vate dehydrogenase, serine esterase, sterol carrier protein phoric monoester hydrolase, diphosphoric monoester hydro X-related thiolase, , tyramine-beta hydroxylase, lase, phosphoric triester hydrolase. para-aminobenzoic acid (PABA) synthase, glu-tRNA(gln) amidotransferase, molybdopterin cofactor Sulfurase, lanos 0459 Pharmaceutical compositions including such pro terol 14-alpha-demethylase, aromatase, 4-hydroxybenzoate teins or protein encoding sequences, antibodies directed octaprenyltransferase, 7,8-dihydro-8-oxoguanine-triphos against Such proteins or polynucleotides capable of altering phatase, CDP-alcohol phosphotransferase, 2,5-diamino-6- expression of Such proteins may be used to treat diseases in (ribosylamino)-4 (3H)-pyrimidonone 5'-phosphate deami which the hydrolytic cleavage of a covalent bond with nase, diphosphoinositol polyphosphate phosphohydrolase, accompanying addition of water, —H being added to one gamma-glutamyl carboxylase, Small protein conjugating product of the cleavage and —OH to the other, is not normal enzyme, Small protein activating enzyme, 1-deoxyxylulose US 2007/0083334 A1 Apr. 12, 2007 32

5-phosphate synthase, 2'-phosphotransferase, 2-octoprenyl 0472 Pharmaceutical compositions including such pro 3-methyl-6-methoxy-1,4-benzoquinone hydroxylase, teins or protein encoding sequences, antibodies directed 2C-methyl-D-erythritol 2.4-cyclodiphosphate synthase, 3.4 against Such proteins or polynucleotides capable of altering dihydroxy-2-butanone-4-phosphate synthase, 4-amino-4- expression of Such proteins may be used to treat diseases deoxychorismate lyase, 4-diphosphocytidyl-2C-methyl-D- which are caused or due to abnormalities in cytoskelaton, erythritol synthase, ADP-L-glycero-D-manno-heptose Syn including cancerous cells, and diseased cells including those thase, D-erythro-7,8-dihydroneopterin triphosphate which do not propagate, grow or function normally; while 2-epimerase, N-ethylmaleimide reductase, O-antigen ligase, probe sequences or antibodies may be used for diagnosis of O-antigen polymerase, UDP-2,3-diacylglucosamine hydro Such diseases. lase, arsenate reductase, carnitine racemase, cobalamin 5'- 0473 Ligands: phosphate synthase, cobinamide phosphate guanylyltrans ferase, enterobactin synthetase, enterochelin esterase, 0474. This category contains proteins that bind to another enterochelin synthetase, glycolate oxidase, , lau chemical entity to form a larger complex, involved in royl transferase, peptidoglycan synthetase, phosphopanteth various biological processes, such as signal trunsduction, einyltransferase, phosphoglucosamine mutase, phosphohep metabolism, growth and differentiation, etc. The category tose isomerase, quinolinate synthase, siroheme synthase, covers ligands such as: opioid peptides, baboon receptor N-acylmannosamine-6-phosphate 2-epimerase, N-acetyl ligand, branchless receptor ligand, breathless receptor anhydromuramoyl-L-alanine amidase, carbon-phosphorous ligand, ephrin, frizzled receptor ligand, frizzled-2 receptor lyase, heme-copper terminal oxidase, disulfide oxidoreduc ligand, heartless receptor ligand, Notch receptor ligand, tase, phthalate dioxygenase reductase, sphingosine-1-phos patched receptor ligand, punt receptor ligand, Ror receptor phate lyase, molybdopterin oxidoreductase, dehydrogenase, ligand, saxophone receptor ligand, SE20 receptor ligand, NADPH oxidase, naringenin-chalcone synthase, N-ethy sevenless receptor ligand, Smooth receptor ligand, thick lammeline chlorohydrolase, polyketide synthase, aldolase, veins receptor ligand, Toll receptor ligand, Torso receptor kinase, phosphatase, CoA-ligase, oxidoreductase, trans ligand, death receptor ligand, Scavenger receptor ligand, ferase, hydrolase, lyase, isomerase, ligase, ATPase, Sulfhy neuroligin, integrin ligand, hormones, pheromones, growth dryl oxidase, lipoate-protein ligase, delta-1-pyrroline-5-car factors, Sulfonylurea receptor ligand. boxyate synthetase, lipoic acid synthase, and tRNA 0475 Pharmaceutical compositions including such pro dihydrouridine synthase. teins or protein encoding sequences, antibodies directed against Such proteins or polynucleotides capable of altering 0466 Pharmaceutical compositions including such pro expression of Such proteins may be used to treat diseases teins or protein encoding sequences, antibodies directed which involve non-normal Secretion of proteins which may against Such proteins or polynucleotides capable of altering be due to non-normal presence, absence or non-normal expression of Such proteins may be used to treat diseases response to normal levels of Secreted proteins including which can be ameliorated by modulating the activity of hormones, neurotransmitters, and various other proteins various enzymes which are involved both in enzymatic secreted by cells to the extracellular environment or diseases processes inside cells as well as in cell signaling; while which are endocrine in nature (cause or are a result of probe sequences or antibodies may be used for diagnosis of hormones); while probe sequences or antibodies may be Such diseases. used for diagnosis of Such diseases. 0467 Cytoskeletal Proteins: 0476 Examples of these diseases include, but are not 0468. This category contains proteins involved in the limited to, analgesia inhibited by orphanin FQ/nociceptin structure formation of the cytoskeleton. Shane R, Lazar DA, Rossi G. C. Pasternak G. W. Bodnar R J (2001) Brain Res, 907 (1-2):109-16), stroke protected by 0469 Pharmaceutical compositions including such pro estrogen Alkayed N J. Goto S, Sugo N. Joh H D. Klaus J. teins or protein encoding sequences, antibodies directed Crain BJ, Bernard O, Traystman RJ, Hum P D (2001) J against Such proteins or polynucleotides capable of altering Neurosci, 21 (19):7543-50, atherosclerosis associated with expression of Such proteins may be used to treat diseases growth hormone deficiency Elhadd TA, AbduTA, Oxtoby which are caused or due to abnormalities in cytoskeleton, J. Kennedy G, McLaren M. Neary R. Belch JJ. Clayton R including cancerous cells, and diseased cells including those N (2001).J. Clin Endocrinol Metab, 86 (9):4223-32), diabetes which do not propagate, grow or function normally; while inhibited by alpha-galactosylceramide Hong S, Wilson M probe sequences or antibodies may be used for diagnosis of T. Serizawa I, Wu L, Singh N, Naidenko OV. Miura T. Haba Such diseases. T. Scherer DC, Wei J. Kronenberg M. Koezuka Y. Van Kaer 0470 Structural Proteins: L (2001) Nat Med, 7 (9): 1052-6), and Huntington's disease 0471. This category contains proteins involved in the associated with huntingtin deficiency Rao DS, Chang J C, structure formation of the cell. Such as: structural proteins of Kumar P D, Mizukami I, Smithson G. M. Bradley S V. ribosome, cell wall structural proteins, structural proteins of Parlow A F, Ross T S (2001) Mol Cell Biol, 21 (22):7796 cytoskeleton, extracellular matrix structural proteins, extra 806). cellular matrix glycoproteins, amyloid proteins, plasma pro 0477 Signal Transducer: teins, structural proteins of eye lens, structural protein of 0478. This category contains various signal transducers, chorion (sensu Insecta), structural protein of cuticle (sensu Such as: activin inhibitors, receptor-associated proteins, Insecta), puparial glue protein (sensu Diptera), structural alpha-2 macroglobulin receptors, morphogens, quorum proteins of bone, yolk proteins, structural proteins of sensing signal generators, quorum sensing response regula muscle, structural protein of Vitelline membrane (sensu tors, receptor signaling proteins, ligands, receptors, two Insecta), structural proteins of peritrophic membrane (sensu component sensor molecules, two-component response Insecta), structural proteins of nuclear pores. regulators. US 2007/0083334 A1 Apr. 12, 2007

0479. Pharmaceutical compositions including such pro 0487. Nucleic Acid Binding Proteins: teins or protein encoding sequences, antibodies directed against Such proteins or polynucleotides capable of altering 0488. This category contains proteins involved in RNA expression of Such proteins may be used to treat diseases in and DNA synthesis and expression regulation, such as which the signal-transduction is non-normal, either as a transcription factors, RNA and DNA binding proteins, Zinc cause, or as a result of the disease; while probe sequences or fingers, helicase, isomerase, histones, nucleases, ribonucle antibodies may be used for diagnosis of Such diseases. oproteins, transcription and translation factors and other. 0489 Pharmaceutical compositions including such pro 0480 Examples of these diseases include, but are not teins or protein encoding sequences, antibodies directed limited to, altered sexual dimorphism associated with signal against Such proteins or polynucleotides capable of altering transducer and activator of transcription 5b Udy G. B. expression of Such proteins may be used to treat diseases Towers R P. Snell R Q Wilkins RJ, Park S H, Ram PA, involving DNA or RNA binding proteins such as: helicases, Waxman DJ, Davey HW (1997) Proc Natl AcadSci USA, , histones and nucleases, for example diseases 94 (14):7239-44), multiple sclerosis associated with sgp130 where there is non-normal replication or transcription of deficiency Padberg F, Feneberg W. Schmidt S, Schwarz M DNA and RNA respectively; while probe sequences or J. Korschenhausen D. Greenberg B D, Nolde T. Muller N. Trapmann H. Konig N, Moller H J. Hampel H (1999) J antibodies may be used for diagnosis of Such diseases. Neuroimmunol. 99 (2):218-23), intestinal inflammation 0490 Proteins Involved in Metabolism: associated with elevated signal transducer and activator of transcription 3 activity Suzuki A, Hanada T. Mitsuyama K. 0491. The totality of the chemical reactions and physical Yoshida T. Kamizono S. Hoshino T, Kubo M, Yamashita A, changes that occur in living organisms, comprising anabo Okabe M. Takeda K. Akira S. Matsumoto S, Toyonaga A, lism and catabolism; may be qualified to mean the chemical Sata M, Yoshimura A (2001) J Exp Med, 193 (4):471-81), reactions and physical processes undergone by a particular carcinoid tumor inhibited by increased signal transducer and Substance, or class of Substances, in a living organism. activators of transcription 1 and 2 Zhou Y. Wang S. Gobl A. 0492. This category covers proteins involved in the reac Oberg K (2001) Oncology, 60 (4):330-8), and esophageal tions of cell growth and maintenance, such as: metabolism cancer associated with loss of EGF-STAT1 pathway Wa resulting in cell growth, carbohydrate metabolism, energy tanabe Q Kaganoi J., Imamura M. Shimada Y. Itami A, pathways, electron transport, nucleobase, nucleoside, nucle Uchida S, Sato F, Kitagawa M (2001) Cancer J, 7 (2):132-9). otide and nucleic acid metabolism, protein metabolism and modification, amino acid and derivative metabolism, protein 0481) RNA Polymerase II Transcription Factors: targeting, lipid metabolism, aromatic compound metabo 0482. This category contains proteins, such as specific lism, one-carbon compound metabolism, coenzymes and and non-specific RNA polymerase II transcription factors, prosthetic group metabolism, Sulfur metabolism, phospho enhancer binding, ligand-regulated transcription factor, gen rus metabolism, phosphate metabolism, oxygen and radical eral RNA polymerase II transcription factors. metabolism, xenobiotic metabolism, nitrogen metabolism, fat body metabolism (sensu Insecta), protein localization, 0483 Pharmaceutical compositions including such pro catabolism, biosynthesis, toxin metabolism, methylglyoxal teins or protein encoding sequences, antibodies directed metabolism, cyanate metabolism, glycolate metabolism, car against Such proteins or polynucleotides capable of altering bon utilization, antibiotic metabolism. expression of Such proteins may be used to treat diseases involving RNA polymerase II transcription factors, for 0493 Examples of metabolism-related diseases include, example diseases where there is non-normal transcription of but are not limited to, multisystem mitochondrial disorder RNA; while probe sequences or antibodies may be used for caused by mitochondrial DNA cytochrome C oxidase II diagnosis of Such diseases. deficiency Campos Y. Garcia-Redondo A, Fernandez Moreno MA, Martinez-Pardo M, Goda G, Rubio J C, Martin 0484 RNA Binding Proteins: MA, del Hoyo P. Cabello A, Bornstein B. Garesse R. Arenas 0485 This category contains RNA binding proteins J (2001) Ann Neurol Sep.: 50 (3):409-13), conduction involved in splicing and translation regulation, Such as defects and Ventricular dysfunction in the heart associated tRNA binding proteins, RNA helicases, double-stranded with heterogeneous connexin43 expression Gutstein D. E. RNA and single-stranded RNA binding proteins, mRNA Morley G. E. Vaidya D, Liu F, Chen F L. Stuhlmann H, binding proteins, snRNA cap binding proteins, 5S RNA and Fishman GI (2001) Circulation, 104 (10): 1194-9), athero 7S RNA binding proteins, poly-pyrimidine tract binding Sclerosis associated with growth Suppressor p27 deficiency proteins, snRNA binding proteins, and AU-specific RNA Diez-Juan A, Andres V (2001) FASEBJ, 15 (11):1989-95). binding proteins. colitis associated with glutathione peroxidase deficiency Esworthy R S. Aranda R, Martin MG, Doroshow J H. 0486 Pharmaceutical compositions including such pro Binder SW, Chu FF (2001) Am J Physiol Gastrointest Liver teins or protein encoding sequences, antibodies directed Physiol. 281 (3):G848-55), and systemic lupus erythemato against Such proteins or polynucleotides capable of altering sus associated with deoxyribonuclease I deficiency Yasu expression of Such proteins may be used to treat diseases tomo K, Horiuchi T, Kagami S, Tsukamoto H. Hashimura C, involving transcription and translation factors such as: heli Urushihara M, Kuroda Y (2001) Nat Genet, 28 (4):313-4). cases, isomerases, histones and nucleases, for example dis eases where there is non-normal transcription, splicing, 0494 Cell Growth and/or Maintenance Proteins: post-transcriptional processing, translation or stability of the 0495. This category contains proteins involved in any RNA; while probe sequences or antibodies may be used for biological process required for cell Survival, growth and diagnosis of Such diseases. maintenance. It covers proteins involved in biological pro US 2007/0083334 A1 Apr. 12, 2007 34 cesses Such as: cell organization and biogenesis, cell growth, pounds” that bind to or otherwise modulate (i.e., stimulate or cell proliferation, metabolism, cell cycle, budding, cell inhibit) the expression or activity of a nucleic acid of the shape and cell size control, sporulation (sensu Saccharomy present invention or the protein it encodes. An agent may be, ces), transport, ion homeostasis, autophagy, cell motility, for example, a small molecule Such as a peptide, peptido chemi-mechanical coupling, membrane fusion, cell-cell mimetic (e.g., a peptoid), an amino acid or an analog thereof, fusion, stress response. a polynucleotide or an analog thereof, a nucleotide or an 0496 Pharmaceutical compositions including such pro analog thereof, or an organic or inorganic compound (e.g., teins or protein encoding sequences, antibodies directed a heteroorganic or organometallic compound) having a against Such proteins or polynucleotides capable of altering molecular weight less than about 10,000 (e.g., about 5,000, expression of Such proteins may be used to treat or prevent 1,000, or 500) grams per mole and salts, esters, and other diseases such as cancer, degenerative diseases, for example pharmaceutically acceptable forms of Such compounds. neurodegenerative diseases or conditions associated with 0501 Agents identified in the screening assays can be aging, or alternatively, diseases wherein apoptosis which used, for example, to modulate the expression or activity of should have taken place, does not take place; while probe the nucleic acids or proteins of the invention in a therapeutic sequences or antibodies may be used for diagnosis of Such protocol, or to discover more about the biological functions diseases. Detection of pre-disposition to a disease, as well as of the proteins. for determination of the stage of the disease can also be 0502. The assays can be constructed to screen for agents effected Examples of these diseases include, but are not that modulate the expression or activity of a protein of the limited to, ataxia-telangiectasia associated with ataxia-te invention or another cellular component with which it langiectasia mutated deficiency Hande et al (2001) Hum interacts. For example, where the protein of the invention is Mol Genet, 10 (5):519-28), osteoporosis associated with an enzyme, the screening assay can be constructed to detect osteonectin deficiency Delany et al (2000) J Clin Invest, agents that modulate either the enzyme’s expression or 105 (7):915-23), arthritis caused by membrane-bound activity or that of its substrate. The agents tested can be those matrix metalloproteinase deficiency Holmbeck et al (1999) obtained from combinatorial libraries. Methods known in Cell, 99 (1): 81-92), defective stratum corneum and early the art allow the production and screening of biological neonatal death associated with transglutaminase 1 defi libraries; peptoid libraries i.e., libraries of molecules that ciency Matsuki et al (1998) Proc Natl Acad Sci USA, 95 function as peptides even though they have a non-peptide (3):1044-9), and Alzheimer's disease associated with estro backbone that confers resistance to enzymatic degradation; gen Simpkins et al (1997) Am J Med, 103 (3A): 19S-25S). see, e.g., Zuckermann et al., J. Med. Chem. 37:2678-85, 0497 Thus, the nucleic acid sequences of the present (1994); spatially addressable parallel solid phase or solution invention and the proteins encoded thereby and the cells and phase libraries; synthetic libraries requiring deconvolution; antibodies described hereinabove can be used in, for “one-bead one-compound libraries; and synthetic libraries. example, screening assays, therapeutic or prophylactic The biological and peptoid libraries can be used to test only methods of treatment, or predictive medicine (e.g., diagnos peptides, but the other four are applicable to testing peptides, tic and prognostic assays, including those used to monitor non-peptide oligomers or libraries of small molecules Lam, clinical trials, and pharmacogenetics). Anticancer Drug Des. 12:145, (1997). Molecular libraries 0498 More specifically, the nucleic acids of the invention can be synthesized as described by DeWitt et al. Proc. Natl. can be used to: (i) express a protein of the invention in a host Acad. Sci. USA 90:6909, (1993) Erb et al. Proc. Natl. cell (in culture or in an intact multicellular organism fol Acad. Sci. USA 91:11422, (1994). Zuckermann et al. J. lowing, e.g., gene therapy, given, of course, that the tran Med. Chem. 37:2678, (1994) Cho et al. Science 261: 1303, Script in question contains more than untranslated (1993) and Gallop et al. J. Med. Chem. 37:1233, (1994). sequence); (ii) detect an mRNA, or (iii) detect an alteration 0503 Libraries of compounds may be presented in solu in a gene to which a nucleic acid of the invention specifically tion see, e.g., Houghten, Biotechniques 13:412-421. binds; or to modulate Such a gene’s activity. (1992), or on beads Lam, Nature 354:82-84, (1991), chips 0499. The nucleic acids and proteins of the invention can Fodor, Nature 364:555-556, (1993)), bacteria or spores also be used to treat disorders characterized by either insuf (U.S. Pat. No. 5,223,409), plasmids Cullet al., Proc Natl ficient or excessive production of those nucleic acids or Acad Sci USA 89:1865-1869, (1992) or on phage Scott proteins, a failure in a biochemical pathway in which they and Smith, Science 249:386-390, (1990); Devlin, Science normally participate in a cell, or other aberrant or unwanted 249:404–406, (1990); Cwirla et al., Proc. Natl. Acad. Sci. activity relative to the wild type protein (e.g., inappropriate USA 87:6378-6382, (1990); Felici, J. Mol. Biol. 222:301 enzymatic activity or unproductive protein folding). The 310, (1991); and U.S. Pat. No. 5,223.409). proteins of the invention are especially useful in screening 0504 The screening assay can be a cell-based assay, in for naturally occurring protein Substrates or other com which case the screening method includes contacting a cell pounds (e.g., drugs) that modulate protein activity. The that expresses a protein of the invention with a test com antibodies of the invention can also be used to detect and pound and determining the ability of the test compound to isolate the proteins of the invention, to regulate their bio modulate the protein's activity. The cell used can be a availability, or otherwise modulate their activity. These uses, mammalian cell, including a cell obtained from a human or and the methods by which they can be achieved, are from a human cell line. described in detail below. 0505 Alternatively, or in addition to examining the abil Screening Assays ity of an agent to modulate expression or activity generally, 0500 The present invention provides methods (or one can examine the ability of an agent to interact with, for 'screening assays”) for identifying agents (or “test com example, to specifically bind to, a nucleic acid or protein of US 2007/0083334 A1 Apr. 12, 2007 the invention. For example, one can couple an agent (e.g., a phase is detected at the end of the period of exposure. For substrate) to a label (those described above, including radio example, a protein of the present invention can be anchored active or enzymatically active Substances, are suitable), to a solid Surface, and the test compound (which is not contact the nucleic acid or protein of the invention with the anchored and can be labeled, directly or indirectly) is added labeled agent, and determine whether they bind one another to the Surface bearing the anchored protein. Un-reacted (e.g., (by detecting, for example, a complex containing the nucleic unbound) components can be removed (by, e.g., washing) acid or protein and the labeled agent). Labels are not, under conditions that allow any complexes formed to remain however, always required. For example, one can use a immobilized on the solid surface, where they can be detected microphysiometer to detect interaction between an agent and (e.g., by virtue of a label attached to the protein or the agent a protein of the invention, neither of which were previously or with a labeled antibody that specifically binds an immo labeled McConnell et al., Science 257: 1906-1912, (1992). bilized component and may, itself, be directly or indirectly A microphysiometer (also known as a cytosensor) is an labeled). analytical instrument that measures the rate at which a cell 0509. One can immobilize either a protein of the present acidifies its environment. The instrument uses a light-ad invention or an antibody to which it specifically binds to dressable potentiometric sensor (LAPS), and changes in the facilitate separation of complexed (or bound) protein from acidification rate indicate interaction between an agent and uncomplexed (or unbound) protein. Such immobilization a protein of the invention. Molecular interactions can also be can also make it easier to automate the assay, and fusing the detected using fluorescence energy transfer (FET, see, e.g., proteins of the invention to heterologous proteins can facili U.S. Pat. Nos. 5,631,169 and 4,868,103). An FET binding tate their immobilization. For example, proteins fused to event can be conveniently measured through fluorometric glutathione-S-transferase can be adsorbed onto glutathione detection means well known in the art (e.g., by means of a sepharose beads (Sigma Chemical Co., St. Louis, Mo.) or fluorimeter). Where analysis in real time is desirable, one glutathione derivatized microtiter plates, then combined can examine the interaction (e.g., binding) between an agent with the agent and incubated under conditions conducive to and a protein of the invention with Biomolecular Interaction complex formation (e.g., conditions in which the salt and pH Analysis BIA, see, e.g., Sjolander and Urbaniczky Anal. levels are within physiological levels). Following incuba Chem. 63:2338-2345, (1991) and Szabo et al., Curr. Opin. tion, the Solid phase is washed to remove any unbound Struct. Biol. 5:699-705, (1995). BIA allows one to detect components (where the solid phase includes beads, the biospecific interactions in real time without labeling any of matrix can be immobilized), the presence or absence of a the interactants (e.g., BIAcore). complex is determined. Alternatively, complexes can be 0506 The screening assays can also be cell-free assays dissociated from a matrix, and the level of protein binding (i.e., soluble or membrane-bound forms of the proteins of or activity can be determined using standard techniques. the invention, including the variants, mutants, and other 0510 Immobilization can be achieved with methods fragments described above, can be used to identify agents known in the art. For example, biotinylated protein can be that bind those proteins or otherwise modulate their expres prepared from biotin-NHS(N-hydroxy-succinimide) using sion or activity). The basic protocol is the same as that for techniques known in the art (e.g., the biotinylation kit from a cell-based assay in that, in either case, one must contact the Pierce Chemicals, Rockford, Ill.) and immobilized in the protein of the invention with an agent of interest for a wells of streptavidin-coated tissue culture plates (also from Sufficient time and under appropriate (e.g., physiological) Pierce Chemical). conditions to allow any potential interaction to occur and then determine whether the agent binds the protein or 0511. The screening assays of the invention can employ otherwise modulates its expression or activity. antibodies that react with the proteins of the invention but do not interfere with their activity. These antibodies can be 0507 Those of ordinary skill in the art will, however, derivatized to a solid surface, where they will trap a protein appreciate that there are differences between cell-based and of the invention. Any interaction between a protein of the cell-free assays. For example, when membrane-bound forms invention and an agent can then be detected using a second of the protein are used, it may be desirable to utilize a antibody that specifically binds the complex formed between solubilizing agent (e.g., non-ionic detergents such as n-oc the protein of the invention and the agent to which it is tylglucoside, n-dodecylglucoside, n-dodecylmaltoside, bound. octanoyl-N-methylglucamide, decanoyl-N-methylglucam ide, Triton R. X-100, Triton R. X-114, ThesitR, Isotridecy 0512 Cell-free assays can also be conducted in a liquid poly(ethylene glycol ether), 3-(3-cholamidopropyl)dim phase, in which case any reaction product can be separated ethylamminio-1-propane sulfonate (CHAPS), 3-(3- (and thereby detected) by, for example: differential centrifu cholamidopropyl)dimethylamminio-2-hydroxy-1-propane gation (Rivas and Minton, Trends Biochem Sci 18:284-7, sulfonate (CHAPSO), or N-dodecyl=N,N-dimethyl-3-am 1993); chromatography (e.g., gel filtration or ion-exchange monio-1-propane Sulfonate). chromatography); electrophoresis see, e.g., Ausubel et al., Eds. Current Protocols in Molecular Biology, J. Wiley & 0508. In the assays of the invention, any of the proteins Sons, New York, N.Y., (1999); or immunoprecipitation described herein or the agents being tested can be anchored see, e.g., Ausubel et al. (Supra); see also Heegaard, J. Mol. to a solid phase or otherwise immobilized (assays in which Recognit. 11:141-148, (1998) and Hage and Tweed, J. one of two substances that interact with one another are Chromatogr. Biomed. Sci. Appl. 699:499-525, (1997)). anchored to a Solid phase are sometimes referred to as Fluorescence energy transfer (see above) can also be used, "heterogeneous' assays). For example, a protein of the and is convenient because binding can be detected without present invention can be anchored to a microtiter plate, a test purifying the complex from Solution. Assays in which the tube, a microcentrifuge tube, a column, or the like before it entire reaction of interest is carried out in a liquid phase are is exposed to an agent. Any complex that forms on the Solid Sometimes referred to as homogeneous assays. US 2007/0083334 A1 Apr. 12, 2007 36

0513. The screening methods of the invention can also be expression or activity of a biomolecular sequence of the designed as competition assays in which an agent and a present invention can be useful in clinical trials (a desired substance that is known to bind a protein of the present extension of the screening assays described above). For invention compete to bind that protein. Depending upon the example, agents that exert an effect by, in part, altering the order of addition of reaction components and the reaction expression or activity of a protein of the invention ex vivo conditions (e.g., whether the reaction is allowed to reach can be tested for their ability to do so as the treatment equilibrium), agents that inhibit complex formation can be progresses in a subject. Moreover, in animal or clinical trials, distinguished from those that disrupt preformed complexes. the expression or activity of a nucleic acid can be used, 0514. In either approach, the order in which reactants are optionally in conjunction with that of other genes, as a “read added can be varied to obtain different information about the out' or marker of the phenotype of a particular cell. agents being tested. For example, agents that interfere with the interaction between a gene product and one or more of Detection Assays its binding partners (by, e.g., competing with the binding 0520. The nucleic acid sequences of the invention can partner), can be identified by adding the binding partner and serve as polynucleotide reagents that are useful in detecting the agent to the reaction at about the same time. Agents that a specific nucleic acid sequence. For example, one can use disrupt preformed complexes (by, e.g., displacing one of the the nucleic acid sequences of the present invention to map components from the complex), can be added after a com the corresponding genes on a chromosome (and thereby plex containing the gene product and its binding partner has discover which proteins of the invention are associated with formed. genetic disease) or to identify an individual from a biologi 0515. The proteins of the invention can also be used as cal sample (i.e., to carry out tissue typing, which is useful in “bait proteins’ in a two- or three-hybrid assaysee, e.g., U.S. criminal investigations and forensic science). The novel Pat. No. 5.283,317; Zervos et al., Cell 72:223-232, (1993); transcripts of the present invention can be used to identify Madura et al., J. Biol. Chem. 268: 12046-12054, (1993); those tissues or cells affected by a disease (e.g., the nucleic Bartel et al. Biotechniques 14:920-924, (1993); Iwabuchi et acids of the invention can be used as markers to identify al., Oncogene 8:1693-1696, (1993); and WO 94/10300 to cells, tissues, and specific pathologies, such as cancer), and identify other proteins that bind to (e.g., specifically bind to) to identify individuals who may have or be at risk for a or otherwise interact with a protein of the invention. Such particular cancer. Specific methods of detection are binding proteins can activate or inhibit the proteins of the described herein and are known to those of ordinary skill in invention (and thereby influence the biochemical pathways the art. and events in which those proteins are active). 0521. The nucleic acids of the present invention can be 0516. As noted above, the screening assays of the inven used to determine whether a particular individual is the tion can be used to identify an agent that inhibits the Source of a biological sample (e.g., a blood sample). This is expression of a protein of the invention by, for example, presently achieved by examining restriction fragment length inhibiting the transcription or translation of a nucleic acid polymorphisms (RFLPs; U.S. Pat. No. 5,272,057), and the that encodes it. In these assays, one can contact a cell or cell sequences disclosed here are useful as additional DNA free mixture with the agent and then evaluate mRNA or markers for RFLP. For example, one can digest a sample of protein expression relative to the levels that are observed in an individual’s genomic DNA, separate the fragments (e.g. the absence of the agent (a statistically significant increase by Southern blotting), and expose the fragments to probes in expression indicating that the agent stimulates mRNA or generated from the nucleic acids of the present invention protein expression and a decrease (again, one that is statis (methods employing restriction endonucleases are discussed tically significant) indicating tat the agent inhibits mRNA or further below). If the pattern of binding matches that protein expression). Methods for determining levels of obtained from a tissue of an unknown source, then the mRNA or protein expression are known in the art and, here, individual is the source of the tissue. would employ the nucleic acids, proteins, and antibodies of the present invention. 0522 The nucleic acids of the present invention can also be used to determine the sequence of selected portions of an 0517. It should be noted that if desired, two or more of the individual’s genome. For example, the sequences that rep methods described herein can be practiced together. For resent new genes can be used to prepare primers that can be example, one can evaluate an agent that was first identified used to amplify an individual's DNA and subsequently in a cell-based assay in a cell free assay. Similarly, and the sequence it. Panels of DNA sequences (each amplified with ability of the agent to modulate the activity of a protein of a different set of primers) can uniquely identify individuals the invention can be confirmed in Vivo (e.g., in a transgenic (as every person will have unique sequences due to allelic animal). differences). 0518. The screening methods of the present invention can 0523 Allelic variation occurs to some degree in the also be used to identify proteins (in the event transcripts of coding regions of these sequences, and to a greater degree in the present invention encode proteins) that are associated the noncoding regions. Each of the sequences described (e.g., causally) with drug resistance. One can then block the herein can, to some degree, be used as a standard against activity of these proteins (with, e.g., an antibody of the which DNA from an individual can be compared for iden invention) and thereby improve the ability of a therapeutic tification purposes. Because greater numbers of polymor agent to exert a desirable effect on a cell or tissue in a subject phisms occur in the noncoding regions, fewer sequences are (e.g., a human patient). necessary to differentiate individuals. The noncoding 0519 Monitoring the influence of therapeutic agents sequences disclosed herein can provide positive individual (e.g., drugs) or other events (e.g., radiation therapy) on the identification with a panel of perhaps 10 to 1,000 primers US 2007/0083334 A1 Apr. 12, 2007 37 which each yield a noncoding amplified sequence of 100 0530. The methods related to predictive medicine can bases. If predicted coding sequences are used, a more also be carried out by using a nucleic acid of the invention appropriate number of primers for positive individual iden to, for example detect, in a tissue of a Subject: (i) the tification would be 500-2,000. presence or absence of a mutation that affects the expression of the corresponding gene (e.g., a mutation in the 5' regu 0524. If a panel of reagents from the nucleic acids latory region of the gene); (ii) the presence or absence of a described herein is used to generate a unique identification mutation that alters the structure of the corresponding gene; database for an individual, those same reagents can later be (iii) an altered level (i.e., a non-wild type level) of mRNA of used to identify tissue from that individual. Using the the corresponding gene (the proteins of the invention can be database, the individual, whether still living or dead, can similarly used to detect an altered level of protein expres Subsequently be linked to even very small tissue samples. sion); (iv) a deletion or addition of one or more nucleotides 0525) DNA-based identification techniques, including from the nucleic acid sequences of the present invention; (v) those in which small samples of DNA are amplified (e.g., by a Substitution of one or more nucleotides in the nucleic acid PCR) can also be used in forensic biology. Sequences sequences of the present invention (e.g., a point mutation); amplified from tissues (such as hair or skin) or body fluids (vi) a gross chromosomal rearrangement (e.g., a transloca (such as blood, saliva, or semen) found at a crime scene can tion, inversion, or deletion); or (vii) aberrant modification of be compared to a standard (e.g., sequences obtained and a gene corresponding to the nucleic acid sequences of the amplified from a Suspect), thereby allowing one to deter present invention (e.g., modification of the methylation mine whether the suspect is the source of the tissue or bodily pattern of the genomic DNA). Similarly, one can test for fluid. inappropriate post-translational modification of any protein 0526. The nucleic acids of the invention, when used as encoded. Abnormal expression or abnormal gene or protein probes or primers, can target specific loci in the human structures indicate that the subject is at risk for the associated genome. This will improve the reliability of DNA-based disorder. forensic identifications because the more identifying mark 0531. A genetic lesion can be detected by, for example, ers examined, the less likely it is that one individual will be providing an oligonucleotide probe or primer having a mistaken for another. Moreover, tests that rely on obtaining sequence that hybridizes to a sense or antisense Strand of a actual genomic sequence (which is possible here) are more nucleic acid sequence of the present invention, a naturally accurate than those in which identification is based on the occurring mutant thereof, or the 5' or 3' sequences that are patterns formed by restriction enzyme generated fragments. naturally associated with the corresponding gene, and exposing the probe or primer to a nucleic acid within a tissue 0527 The nucleic acids of the invention can also be used of interest (e.g., a tumor). One can detect hybridization to study the expression of the mRNAs in histological between the probe or primer and the nucleic acid of the sections (i.e., they can be used in in situ hybridization). This tissue by standard methods (e.g., in situ hybridization) and approach can be useful when forensic pathologists are thereby detect the presence or absence of the genetic lesion. presented with tissues of unknown origin or when the purity Where the probe or primer specifically hybridizes with a of a population of cells (e.g., a cell line) is in question. The new splice variant, the probe or primer can be used to detect nucleic acids can also be used in diagnosing a particular a non-wild type splicing pattern of the mRNA. The anti condition and in monitoring a treatment regime. bodies of the invention can be similarly used to detect the Predictive Medicine presence or absence of a protein encoded by a mutant, mis-expressed, or otherwise deficient gene. Diagnostic and 0528. The nucleic acids, proteins, antibodies, and cells prognostic assays are described further below. described hereinabove are generally useful in the field of predictive medicine and, more specifically, are useful in 0532 Qualitative or quantitative analyses (which reveal diagnostic and prognostic assays and in monitoring clinical the presence or absence of a substance or its level of trials. For example, one can determine whether a subject is expression or activity, respectively) can be carried out for at risk of developing a disorder associated with a lesion in, any one of the nucleic acid sequences of the present inven or the misexpression of a nucleic acid of the invention (e.g., tion, or (where the nucleic acid encodes a protein) the a cancer Such as pancreatic cancer, breast cancer, or a cancer proteins they encode, by obtaining a biological sample from within the urinary system). In addition, the nucleic acids a subject and contacting the sample with an agent capable of expressed in tumor tissues and not in normal tissues are specifically binding a nucleic acid represented by the nucleic markers that can be used to determine whether a subject has acid sequences of the present invention or a protein those or is likely to develop a particular type of cancer. nucleic acids encode. The conditions in which contacting is 0529) The “subject” referred to in the context of any of performed should allow for specific binding. Suitable con the methods of the present invention, is a vertebrate animal ditions are known to those of ordinary skill in the art. The (e.g., a mammal Such as an animal commonly used in biological sample can be a tissue, a cell, or a bodily fluid experimental studies (e.g. rats, mice, rabbits and guinea (e.g., blood or serum), which may or may not be extracted pigs); a domesticated animal (e.g., a dog or cat); an animal from the Subject (i.e., expression can be monitored in vivo). kept as livestock (e.g., a pig, cow, sheep, goat, or horse); a 0533 More specifically, the expression of a nucleic acid non-human primate (e.g. an ape, monkey, or chimpanzee); a sequence can be examined by, for example, Southern or human primate; an avian (e.g., a chicken); an amphibian Northern analyses, polymerase chain reaction analyses, or (e.g., a frog); or a reptile. The animal can be an unborn with probe arrays. For example, one can diagnose a condi animal (accordingly, the methods of the invention can be tion associated with expression or mis-expression of a gene used to carry out genetic screening or to make prenatal by isolating mRNA from a cell and contacting the mRNA diagnoses). The Subject can also be a human. with a nucleic acid probe with which it can hybridize under US 2007/0083334 A1 Apr. 12, 2007

stringent conditions (the characteristics of useful probes are ribozymes (see, for example, U.S. Pat. No. 5,498.531) can known to those of ordinary skill in the art and are discussed be used to detect specific mutations by development or loss elsewhere herein). The mRNA can be immobilized on a of a ribozyme cleavage site. Surface (e.g., a membrane, such as nitrocellulose or other 0537. Any sequencing reaction known in the art (includ commercially available membrane) following gel electro ing those that are automated) can also be used to determine phoresis. whether there is a mutation, and, if so, how the mutant differs from the wild type sequence. Mutations can also be 0534 Alternatively, one or more nucleic acids (the target identified by using cleavage agents to detect mismatched sequence or the probe) can be distributed on a two-dimen bases in RNA/RNA or RNA/DNA duplexes Myers et al., sional array (e.g., a gene chip). Arrays are useful in detecting Science 230:1242, (1985); Cotton et al., Proc. Natl. Acad. mutations because a probe positioned on the array can have Sci. USA 85.4397, (1988); Saleeba et al., Methods Enzymol. one or more mismatches to a nucleic acid of the invention 217:286-295, (1992). Mismatch cleavage reactions employ (e.g., a destabilizing mismatch). For example, genetic muta one or more proteins that recognize mismatched base pairs tions in any of nucleic acid sequences of the present inven in double-stranded DNA (so called “DNA mismatch repair tion can be identified in two-dimensional arrays containing enzymes; e.g., the mutY enzyme of E. coli cleaves A at G/A light-generated DNA probes Croninet al., Human Mutation mismatches and the thymidine DNA glycosylase from HeLa 7:244-255, (1996). Briefly, when a light-generated DNA cells cleaves T at G/T mismatches see Hsu et al., Carcino probe is used, a first array of probes is used to scan through genesis 15:1657-1662, (1994) and U.S. Pat. No. 5.459,039). long stretches of DNA in a sample and a control to identify base changes between the sequences by making linear arrays 0538 Alterations in electrophoretic mobility can also be of sequential overlapping probes. This step allows the iden used to identify mutations. For example, single strand con tification of point mutations, and it can be followed by use formation polymorphism (SSCP) can be used to detect of a second array that allows the characterization of specific differences in electrophoretic mobility between mutant and mutations by using Smaller, specialized probe arrays wild type nucleic acids Orita et al., Proc. Natl. Acad. Sci. complementary to all variants or mutations detected. Each USA 86:2766, (1989); see also Cotton Mutat. Res. 285:125 mutation array is composed of parallel probe sets, one 144, (1993); and Hayashi, Genet. Anal. Tech. Appl. 9:73-79, complementary to the wild-type gene and the other comple (1992). Single-stranded DNA fragments of sample and mentary to the mutant gene. Arrays are discussed further control nucleic acids are denatured and allowed to renature. The secondary structure of single-stranded nucleic acids below; see also: Kozal et al. Nature Medicine 2:753-759, varies according to sequence, and the resulting alteration in (1996). electrophoretic mobility enables the detection of even a 0535 The level of an mRNA in a sample can also be single base change. The sensitivity of the assay is enhanced evaluated with a nucleic acid amplification technique e.g., when RNA (rather than DNA) is used because RNAs RT-PCR (U.S. Pat. No. 4,683.202), ligase chain reaction secondary structure is more sensitive to a change in LCR; Barany, Proc. Natl. Acad. Sci. USA 88:189-193, sequence. See also Keen et al., Trends Genet. 7:5, (1991). (1991); LCR can be particularly useful for detecting point The movement of mutant or wild-type fragments through mutations), self sustained sequence replication Guatelli et gels containing a gradient of denaturant is also informative. al., Proc. Natl. Acad. Sci. USA 87: 1874-1878, (1990)), transcriptional amplification system Kwoh et al., Proc. Natl. 0539 When denaturing gradient gel electrophoresis Acad. Sci. USA 86:1173-1177, (1989), Q-Beta Replicase DGGE: Myers et al., Nature 313:495, (1985) is used, DNA Lizardi et al., Bio/Technology 6:1197. (1988), or rolling can be modified so it will not completely denature (this can circle replication (U.S. Pat. No. 5,854,033). Following be done by, for example by adding a GC clamp of approxi amplification, the nucleic acid can be detected using tech mately 40 bp of high-melting GC-rich DNA by PCR). A niques known in the art. Amplification primers are a pair of temperature gradient can be used in place of a denaturing nucleic acids that anneal to 5' or 3' regions of a gene (plus gradient to identify differences in the mobility of control and and minus Strands, respectively, or Vice-versa) at Some sample DNA Rosenbaum and Reissner, Biophys. Chem. distance (possibly a short distance) from one another. For 265:12753, (1987). example, each primer can consist of about 10 to 30 nucle 0540 Point mutations can also be detected by selective otides and bind to sequences that are about 50 to 200 oligonucleotide hybridization, selective amplification, or nucleotides apart. Serial analysis of gene expression can be selective primer extension Point et al., Nature 324:163, used to detect transcript levels (U.S. Pat. No. 5,695.937). (1986); Saiki et al., Proc. Natl. Acad. Sci. USA 86:6230, Other useful amplification techniques (useful in, for (1989) or by chemical ligation of oligonucleotides as example, detecting an alteration in a gene) include anchor described in Xu et al., Nature Biotechnol. 19:148, (2001). PCR or RACE PCR. Allele specific amplification technology can also be used see, e.g., Gibbs et al., Nucleic Acids Res. 17:2437-2448, 0536 Mutations in the gene sequences of the invention (1989); Prossner, Tibtech. 11:238, (1993); and Barany, Proc. can also be identified by examining alterations in restriction Natl. Acad. Sci. USA 88:189, (1991). enzyme cleavage patterns. For example, one can isolate DNA from a sample cell or tissue and a control, amplify it 0541. When analysis of a gene or protein is carried out in (if necessary), digest it with one or more restriction endo a cell or tissue sample, the cell or tissue can be immobilized nucleases, and determine the length(s) of the fragment(s) on a Support, typically a glass slide, and then contacted with produced (e.g., by gel electrophoresis). If the size of the a probe that can hybridize to the nucleic acid or protein of fragment obtained from the sample is different from the size interest. of the fragment obtained from the control, there is a muta 0542. The detection methods of the invention can be tion in the DNA in the sample tissue. Sequence specific carried out with appropriate controls (e.g., analyses can be US 2007/0083334 A1 Apr. 12, 2007 39 conducted in parallel with a sample known to contain the the sample that includes the level of expression of one or target sequence and a target known to lack it). more of biomolecular sequences of the present invention. 0543. Various approaches can be used to determine pro The sample's profile can be compared with that of a refer tein expression or activity. For example, one can evaluate the ence profile, either of which can be obtained by the methods amount of protein in a sample by exposing the sample to an described herein (e.g., by obtaining a nucleic acid from the antibody that specifically binds the protein of interest. The sample and contacting the nucleic acid with those on an antibodies described above (e.g., monoclonal antibodies, array). As with other detection methods, profile-based assays detectably labeled antibodies, intact antibodies and frag can be performed prior to the onset of symptoms (in which ments thereof) can be used. The methods can be carried out case they can be diagnostic), prior to treatment (in which in-vitro (e.g., one can perform an enzyme linked immun case they can be predictive) or during the course of treatment osorbent assay (ELISA), an immunoprecipitation, an immu (in which case they serve as monitors) see, e.g., Golub et nofluorescence analysis, an enzyme immunoassay (EIA), a al., Science 286:531, (1999). radioimmunoassay (RIA), or a Western blot analysis) or in 0547 As described hereinabove, the screening methods Vivo (e.g., one can introduce a labelled antibody that spe of the invention can be used to identify candidate therapeutic cifically binds to a protein of the present invention into a agents, and those agents can be evaluated further by exam Subject and then detect it by a standard imaging technique). ining their ability to alter the expression of one or more of Alternatively, the sample can be labeled and then contacted the proteins of the invention. For example, one can obtain a with an antibody. For example, one can biotinylate the cell from a Subject, contact the cell with the agent, and sample, contact it with an antibody (e.g., an antibody Subsequently examine the cells expression profile with positioned on an antibody array) and then detect the bound respect to a reference profile (which can be, for example, the sample (e.g., with avidin coupled to a fluorescent label). As profile of a normal cell or that of a cell in a physiologically with methods to detect nucleic acids, appropriate control acceptable condition). The agent is evaluated favorably if studies can be performed in parallel with those designed to the expression profile in the subject’s cell is, following detect protein expression. exposure to the agent, more similar to that of a normal cell 0544 The diagnostic molecules disclosed herein can be or a cell in a physiologically acceptable condition. A control assembled as kits. Accordingly, the invention features kits assay can be performed with, for example, a cell that is not for detecting the presence of the biomolecular sequences of exposed to the agent. the present invention in a biological sample. The kit can 0548 Expression profiles (obtained by evaluating either include a probe (e.g., a nucleic acid sequence or an anti nucleic acid or protein expression) are also useful in evalu body), a standard and, optionally, instructions for use. More ating Subjects. One can obtain a sample from a subject specifically, antibody-based kits can include a first antibody (either directly or indirectly from a caregiver), create an (e.g., in Solution or attached to a solid Support) that specifi expression profile, and, optionally, compare the Subjects cally binds a protein of the present invention and, optionally, expression profile to one or more reference profiles and/or a second, different antibody that specifically binds to the first select a reference profile most similar to that of the subject. antibody and is conjugated to a detectable agent. Oligo A variety of routine statistical measures can be used to nucleotide-based kits can include an oligonucleotide (e.g., a compare two reference profiles. One possible metric is the labeled oligonucleotide) that hybridizes with one of the length of the distance vector that is the difference between nucleic acids of the present invention under stringent con the two profiles. Each of the subject and reference profile is ditions or a pair of oligonucleotides that can be used to represented as a multi-dimensional vector, wherein each amplify a nucleic acid sequence of the present invention. dimension is a value in the profile. The kits can also include a buffering agent, a preservative, a protein-Stabilizing agent, or a component necessary for 0549. The result, which can be communicated to the detecting any included label (e.g., an enzyme or Substrate). Subject, a caregiver, or another interested party, can be the The kits can also contain a control sample or a series of Subject’s expression profile per se, a result of a comparison control samples that can be assayed and compared to the test of the subject’s expression profile with another profile, a sample contained. Each component of the kit can be most similar reference profile, or a descriptor of any of these. enclosed within an individual container, and all of the Communication can be mediated by a computer network various containers can be within a single package. (e.g., in the form of a computer transmission Such as a computer data signal embedded in a carrier wave). 0545. The detection methods described herein can iden tify a Subject who has, or is at risk of developing, a disease, 0550 Accordingly, the invention also features a com disorder, condition, or syndrome (the term “disease' is used puter medium having executable code for effecting the to encompass all deviations from a normal state) associated following steps: receive a subject expression profile; access with aberrant or unwanted expression or activity of a bio a database of reference expression profiles; and either i) molecular sequence of the present invention. The detection select a matching reference profile most similar to the methods also have prognostic value (e.g., they can be used Subject expression profile, or ii) determine at least one to determine whether or not it is likely that a subject will comparison score for the similarity of the Subject expression respond positively (i.e., be effectively treated with) to an profile to at least one reference profile. The subject expres agent (e.g., a nucleic acid, protein, Small molecule or other sion profile and the reference expression profile each include drug)). Samples can also be obtained from a Subject during a value representing the level of expression of one or more the course of treatment to monitor the treatment’s efficacy at of the biomolecular sequences of the present invention. a cellular level. Arrays and Uses Thereof 0546) The present invention also features methods of 0551. The present invention also encompasses arrays that evaluating a sample by creating a gene expression profile for include a substrate having a plurality of addresses, at least US 2007/0083334 A1 Apr. 12, 2007 40 one of which includes a capture probe that specifically binds acids are analyzed, one can amplify the nucleic acids or hybridizes to a nucleic acid represented by any one of the obtained from a sample prior to their application to the array. biomolecular sequences of the present invention. The array The array can also be used to examine tissue-specific gene can have a density of at least 10, 50, 100, 200, 500, 1,000, expression. For example, the nucleic acids or proteins of the 2,000, or 10,000 or more addresses/cm, or densities invention (all or a subset thereof) can be distributed on an between these. In some embodiments, the plurality of array that is then exposed to nucleic acids or proteins addresses includes at least 10, 100, 500, 1,000, 5,000, obtained from a particular tissue, tumor, or cell type. If a 10,000, or 50,000 addresses, while in other embodiments, Sufficient number of diverse samples are analyzed, cluster the plurality of addresses can be equal to, or less than, those ing (e.g., hierarchical clustering, k-means clustering, Baye numbers. sian clustering and the like) can be used to identify other genes that are co-regulated with those of the invention. The 0552) Regardless of whether the array contains nucleic array can be used not only to determine tissue specific acids (as probes or targets) or proteins (as probes or targets), expression, but also to ascertain the level of expression of a the Substrate can be two-dimensional (formed, e.g., by a battery of genes. glass slide, a wafer (e.g., silica or plastic), or a mass spectroscopy plate) or three-dimensional (formed, e.g., by a 0556 Array analysis of the nucleic acids or proteins of gel or pad). Addresses in addition to the addresses of the the invention can be used to study the effect of cell-cell plurality can be disposed on the array. interactions or therapeutic agents on the expression of those nucleic acids or proteins. For example, nucleic acid or 0553 At least one address of the plurality can include a protein that has been obtained from a cell that has been nucleic acid capture probe that hybridizes specifically to one placed in the vicinity of a tissue that has been perturbed in or more of the nucleic acid sequences of the present inven Some way can be obtained and exposed to the probes of an tion. In certain embodiments, a Subset of addresses of the array. Thus, one can use the methods of the invention to plurality will be occupied by a nucleic acid capture probe for determine the effect of one cell type on another (i.e., the one of the nucleic acid sequences of the present invention; response (e.g., a change in the type or quantity of nucleic each address in the Subset can bear a capture probe that acids or proteins expressed) to a biological stimulus can be hybridizes to a different region of a selected nucleic acid. In determined). Similarly, nucleic acid or protein that has been other embodiments, the probe at each address is unique, obtained from a cell that has been treated with an agent can overlapping, and complementary to a different variant of a be obtained and exposed to the probes of an array. In this selected nucleic acid (e.g., an allelic variant, or all possible scenario, one can determine how the therapeutic agent hypothetical variants). If desired, the array can be used to affects the expression of any of the biomolecular sequences sequence the selected nucleic acid by hybridization (see, of the present invention. Appropriate controls (e.g., assays e.g., U.S. Pat. No. 5,695,940). Alternatively, the capture using cells that have not received a biological stimulus or a probe can be a protein that specifically binds to a protein of potentially therapeutic treatment) can be performed in par the present invention or a fragment thereof (e.g., a naturally allel. Moreover, desirable and undesirable responses can be occurring interaction partners of a protein of the invention or detected. If an event (e.g., exposure to a biological stimulus an antibody described herein). In some instances (e.g., in the or therapeutic compound) has an undesirable effect on a cell, event of an autoimmune disease), it is significant that a one can either avoid the event (by, e.g., prescribing an Subject produces antibodies, and the arrays described herein alternative therapy) or take steps to counteract or neutralize can be used to detect those antibodies. More generally, an it. array that contains some or all of the proteins of the present invention can be used to detect any Substance to which one 0557. In more straightforward assays, the arrays or more those proteins bind (e.g., a natural binding partner, described here can be used to monitor the expression of one an antibody, or a synthetic molecule). or more of the biomolecular sequences of the present invention, with respect to time. Such analyses allow one to 0554 An array can be generated by methods known to characterize a disease process associated with the examined those of ordinary skill in the art. For example, an array can Sequence. be generated by photolithographic methods (see, e.g., U.S. Pat. Nos. 5,143,854: 5,510,270; and 5,527,681), mechanical 0558. The arrays are also useful for ascertaining the effect methods (e.g., directed-flow methods as described in U.S. of the expression of a gene on the expression of other genes Pat. No. 5,384.261), pin-based methods (e.g., as described in in the same cell or in different cells (e.g., ascertaining the U.S. Pat. No. 5.288,514), and bead-based techniques (e.g., effect of the expression of any one of the biomolecular as described in PCT US/93/04145). Methods of producing sequences of the present invention on the expression of other protein-based arrays are described in, for example, De Wildt genes). If altering the expression of one gene has a delete et al. Nature Biotech. 18:89-994, (2000), Lueking et al. rious effect on the cell (due to its effect on the expression of Anal. Biochem. 270:103-111, (1999), Ge Nucleic Acids other genes) one can, again, avoid that effect (by, e.g., Res. 28:e3, I-VII, (2000), MacBeath and Schreiber Science selecting an alternate molecular target or counteracting or 289: 1760-1763, (2000), and WO99/51773A1. Addresses in neutralizing the effect). addition to the address of the plurality can be disposed on the Markers array. 0559 The molecules of the present invention are also 0555. The arrays described above can be used to analyze useful as markers of: (i) a cell or tissue type; (ii) disease; (iii) the expression of any of the biomolecular sequences of the a pre-disease state; (iv) drug activity, and (v) predisposition present invention. For example, one can contact an array for disease. with a sample and detect binding between a component of 0560 Using the methods described herein, the presence the sample and a component of the array. In the event nucleic or amount of the biomolecular sequences of the present US 2007/0083334 A1 Apr. 12, 2007

invention, can be detected and correlated with one or more physician would consider the results of pharmacogenomic biological states (e.g., a disease state or a developmental studies when determining whether to administer a compo state). When used in this way, the compositions of the sition of the present invention and how to tailor a therapeutic invention serve as Surrogate markers; they provide an objec regimen for the Subject. tive indicia of the presence or extent of a disease (e.g., 0564 Pharmacogenomics deals with clinically signifi cancer). Surrogate markers are particularly useful when a cant hereditary variations in the response to drugs due to disease is difficult to assess with standard methods (e.g., altered drug disposition and abnormal action in affected when a Subject has a small tumor or when pre-cancerous persons. See, e.g., Eichelbaum et al., Clin. Exp. Pharmacol. cells are present). It follows that Surrogate markers can be Physiol. 23:983-985, (1996), and Linder et al., Clin. Chem. used to assess a disease before a potentially dangerous 43:254-266, (1997). In general, two types of pharmacoge clinical endpoint is reached. Other examples of Surrogate netic conditions can be differentiated. Genetic conditions markers are known in the art (see, e.g., Koomen et al., J. transmitted as a single factor can: (i) alter the way drugs act Mass Spectrom. 35:258-264, 2000, and James, AIDS Treat on the body (altered drug action) or (ii) the way the body acts ment News Archive 209, 1994). on drugs (altered drug metabolism). These pharmacogenetic 0561. The biomolecular sequences of the present inven conditions can occur either as rare genetic defects or as tion, can also serve as pharmacodynamic markers, which naturally-occurring polymorphisms. provide an indicia of a therapeutic result. As pharmacody 0565. One approach that can be used to identify genes namic markers are not directly related to the disease for that predict drug response, known as "a genome-wide asso which the drug is being administered, their presence (or ciation, relies primarily on a high-resolution map of the levels of expression) indicates the presence or activity of a human genome consisting of already known gene-related drug in a Subject (i.e., the pharmacodynamic marker may markers (e.g., a “bi-allelic' gene marker map that consists of indicate the concentration of a drug in a biological tissue, as 60,000-100,000 polymorphic or variable sites on the human the gene or protein serving as the marker is either expressed genome, each of which has two variants.) Such a high or transcribed (or not) in the body in relationship to the level resolution genetic map can be compared to a map of the or activity of the drug). One can also monitor the distribution genome of each of a statistically significant number of of a drug with a pharmacodynamic marker (e.g., these patients taking part in a Phase II/III drug trial to identify markers can be used to determine whether a drug is taken up markers associated with a particular observed drug response by a particular cell type). The presence or amount of or side effect. Alternatively, a high resolution map can be pharmacodynamic markers can be related to the drug perse generated from a combination of some ten-million known or to a metabolite produced from the drug. Thus, these single nucleotide polymorphisms (SNPs; a common alter markers can indicate the rate at which a drug is broken down ation that occurs in a single nucleotide base in a stretch of in vivo. Pharmacodynamic markers can be particularly sen DNA) in the human genome. For example, a SNP may occur sitive (e.g., even a small amount of a drug may activate once per every 1000 bases of DNA. While a SNP may be Substantial transcription or translation of a marker), and they involved in a disease process, the vast majority may not be are therefore useful in assessing drugs that are administered disease-associated. Given a genetic map based on the occur at low doses. Examples regarding the use of pharmacody rence of such SNPs, individuals can be grouped into genetic namic markers are known in the art and include: U.S. Pat. categories depending on a particular pattern of SNPs in their No. 6,033,862; Hattis et al. Env. Health Perspect. 90: 229 individual genome. In such a manner, treatment regimens 238, (1991); Schentag, Am. J. Health-Syst. Pharm. 56 Suppl. can be tailored to groups of genetically similar individuals, 3:S21-S24, (1999); and Nicolau, Am. J. Health-Syst. Pharm. taking into account traits that may be common among Such 56 Suppl. 3: S16-S20, (1991). genetically similar individuals. 0562. The biomolecular sequences of the present inven tion, are also useful as pharmacogenomic markers, which 0566. Two alternative methods, the “candidate gene can provide an objective correlate to a specific clinical drug approach” and "gene expression profiling.” can be used to response or Susceptibility in a particular Subject or class of identify pharmacogenomic markers. According to the first subjects see, e.g., McLeod et al., Eur. J. Cancer 35:1650 method, if a gene that encodes a drugs target is known, all 1652, (1999). The presence or amount of the pharmacoge common variants of that gene can be fairly easily identified nomic marker is related to the predicted response of a in the population, and one can determine whether having one Subject to a specific drug (or type of drug) prior to admin version of the gene versus another is associated with a istration of the drug. By assessing one or more pharmaco particular drug response. In the second approach, the gene genomic markers in a subject, the drug therapy that is most expression of an animal dosed with a drug (e.g., a compo appropriate for the subject, or which is predicted to have a sition of the invention) can reveal whether gene pathways greater likelihood of Success, can be selected. For example, related to toxicity have been activated. based on the presence or amount of RNA or protein asso 0567 Information generated using one or more of the ciated with a specific tumor marker in a subject, an optimal approaches described above can be used in designing thera drug or treatment regime can be prescribed for the Subject. peutic or prophylactic treatments that are less likely to fail or to produce adverse side effects when a subject is treated 0563 More generally, pharmacogenomics addresses the with a therapeutic composition. relationship between an individuals genotype and that indi vidual’s response to a foreign compound or drug. Differ ences in the way individual Subjects metabolize therapeutics Informatics can lead to severe toxicity or therapeutic failure because 0568. The biomolecular sequences of the present inven metabolism alters the relation between dose and blood tion can be provided in a variety of media to facilitate their concentration of the pharmacologically active drug. Thus, a use. For example, one or more of the sequences (e.g., Subsets US 2007/0083334 A1 Apr. 12, 2007 42 of the sequences expressed in a defined tissue type) can be Examples for annotation to nucleic acid sequences and provided as a manufacture (e.g., a computer-readable stor amino acid sequences are provided in Examples 10 and age medium such as a magnetic, optical, optico-magnetic, 14-20 of the Examples section. chemical or mechanical information storage device). The manufacture can provide a nucleic acid or amino acid Pharmaceutical Compositions sequence in a form that will allow examination of the manufacture in ways that are not applicable to a sequence 0574. The nucleic acids, fragments thereof, hybrid that exists in nature or in purified form. The sequence sequences of which they are a part, and gene constructs information can include full-length sequences, fragments containing them; proteins, fragments thereof, chimeras, and thereof, polymorphic sequences including single nucleotide antibodies that specifically bind thereto; and cells, including polymorphisms (SNPs), epitope sequence, and the like. those that are engineered to express the nucleic acids or proteins of the invention) can be incorporated into pharma 0569. The computer readable storage medium further ceutical compositions. These compositions typically also includes sequence annotations (as described in Example 10 include a solvent, a dispersion medium, a coating, an anti of the Examples section). microbial (e.g., an antibacterial or antifungal) agent, an 0570. The computer readable storage medium can further absorption delaying agent (when desired. Such as aluminum include information pertaining to generation of the data monostearate and gelatin), or the like, compatible with pharmaceutical administration (see below). Active com and/or potential uses thereof. pounds, in addition to those of the present invention, can 0571 As used herein, a “computer-readable medium’ also be included in the composition and may enhance or refers to any medium that can be read and accessed directly Supplement the activity of the present agents. by a machine e.g., a digital or analog computer; e.g., a desktop PC, laptop, mainframe, server (e.g., a web server, 0575. The composition will be formulated in accordance network server, or server farm), a handheld digital assistant, with their intended route of administration. Acceptable pager, mobile telephone, or the like. Computer-readable routes include oral or parenteral routes (e.g., intravenous, media include: magnetic storage media, Such as floppy discs, intradermal, transdermal (e.g., Subcutaneous or topical), or hard disc storage medium, and magnetic tape; optical Stor transmucosal (i.e., across a membrane that lines the respi age media Such as CD-ROM; electrical storage media Such ratory or anogenital tract). The compositions can be formu as RAM, ROM, EPROM, EEPROM, flash memory, and the lated as a solution or Suspension and, thus, can include a like; and hybrids of these categories such as magnetic/ sterile diluent (e.g., water, saline solution, a fixed oil. optical storage media. polyethylene glycol, glycerine, propylene glycol or another synthetic solvent); an antimicrobial agent (e.g., benzyl alco 0572 A variety of data storage structures are available to hol or methyl parabens; chlorobutanol, phenol, ascorbic those of ordinary skill in the art and can be used to create a acid, thimerosal, and the like); an antioxidant (e.g., ascorbic computer-readablemedium that has recorded one or more (or acid or Sodium bisulfite); a chelating agent (e.g., ethylene all) of the nucleic acids and/or amino acid sequences of the diaminetetraacetic acid); or a buffer (e.g., an acetate-, cit present invention. The data storage structure will generally rate-, or phosphate-based buffer). When necessary, the pH of depend on the means chosen to access the stored informa the solution or Suspension can be adjusted with an acid (e.g., tion. In addition, a variety of data processor programs and hydrochloric acid) or a base (e.g., Sodium hydroxide). formats can be used to store the sequence information of the Proper fluidity (which can ease passage through a needle) present invention on machine or computer-readable can be maintained by a coating such as lecithin, by main medium. The sequence information can be represented in a taining the required particle size (in the case of a dispersion), word processing text file, formatted in commercially-avail or by the use of surfactants. able software such as WordPerfect and Microsoft Word, or represented in the form of an ASCII file, stored in a database 0576. The compositions of the invention can be prepared application, such as DB2, Sybase, Oracle, or the like. One of as sterile powders (by, e.g., vacuum drying or freeze ordinary skill in the art can readily adapt any number of data drying), which can contain the active ingredient plus any processor structuring formats (e.g., text file or database) to additional desired ingredient from a previously sterile-fil obtain machine or computer-readable medium having tered solution. recorded thereon the sequence information of the present 0577) Oral compositions generally include an inert dilu invention. ent or an edible carrier. For example, the active compound can be incorporated with excipients and used in the form of 0573 The sequence information and annotations are tablets, troches, or capsules (e.g., gelatin capsules). Oral stored in a relational database (such as Sybase or Oracle) compositions can be prepared using fluid carries and used as that can have a first table for storing sequence (nucleic acid mouthwashes. The tablets etc. can also contain a binder and/or amino acid sequence) information. The sequence (e.g., microcrystalline cellulose, gum tragacanth, or gelatin); information can be stored in one field (e.g., a first column) of a table row and an identifier for the sequence can be stored an excipient (e.g., starch or lactose), a disintegrating agent in another field (e.g., a second column) of the table row. The (e.g., alginic acid, Primogel, or corn starch); a lubricang database can have a second table (to, for example, store (e.g., magnesium Stearate or Sterotes); a glidant (e.g., col annotations). The second table can have a field for the loidal silicon dioxide); a Sweetening agent (e.g., Sucrose or sequence identifier, a field for a descriptor or annotation text saccharine); or a flavoring agent (e.g., peppermint, methyl (e.g., the descriptor can refer to a functionality of the salicylate, or orange flavoring. sequence), a field for the initial position in the sequence to 0578 For administration by way of the respiratory sys which the annotation refers, and a field for the ultimate tem, the compositions can be formulated as aerosol sprays position in the sequence to which the annotation refers. (e.g., from a pressured container or dispenser that contains US 2007/0083334 A1 Apr. 12, 2007 a suitable propellant (e.g., a gas Such as carbon dioxide), or 10-20 mg/kg). If the antibody is to act in the brain, a dosage a nebulizer. The ability of a composition to cross a biological of 50 mg/kg to 100 mg/kg is usually appropriate. Generally, barrier can be enhanced by agents known in the art. For partially human antibodies and fully human antibodies have example, detergents, bile salts, and fusidic acid derivatives a longer half-life within the human body than other anti can facilitate transport across the mucosa (and therefore, be bodies. Accordingly, lower dosages and less frequent admin included in nasal sprays or Suppositories). istration are often possible with these types of antibodies. 0579. For topical administration, the active compounds Modifications such as lipidation can be used to stabilize are formulated into ointments, Salves, gels, or creams antibodies and to enhance uptake and tissue penetration according to methods known in the art. e.g., into the brain; see Cruikshank et al., J. Acquired Immune Deficiency Syndromes and Human Retrovirology 0580 Controlled release can also be achieved by using implants and microencapsulated delivery systems (see, e.g., 14:193, (1997)). the materials commercially available from Alza Corporation 0586. As noted above, the present invention encompasses and Nova Pharmaceuticals, Inc.; see also U.S. Pat. No. agents (e.g., Small molecules) that modulate expression or 4.522,811 for the use of liposome-based suspensions). activity of a nucleic acid represented by any of biomolecular 0581. The pharmaceutical compositions of the invention sequences of the present invention. Exemplary doses of can be formulated in dosage units (i.e., physically discrete these agents include milligram or microgram amounts of the units containing a predetermined quantity of the active Small molecule per kilogram of Subject or sample weight compound) for uniformity and ease of administration. (e.g., about 1-500 mg/kg; about 100 mg/kg, about 5 mg/kg: about 1 mg/kg, or about 50 g/kg). Appropriate doses of a 0582 The toxicity and therapeutic efficacy of any given Small molecule depend upon the potency of the Small compound can be determined by standard pharmaceutical molecule with respect to the expression or activity to be procedures carried out in cell culture or in experimental modulated. When one or more of these small molecules is to animals. For example, one of ordinary skill in the art can be administered to an animal (e.g., a human) to modulate routinely determine the LD50 (the dose lethal to 50% of the expression or activity of nucleic acid or protein of the population) and the ED50 (the dose therapeutically effective invention, a physician, Veterinarian, or researcher may pre in 50% of the population). The dose ratio between toxic and scribe a relatively low dose at first, Subsequently increasing therapeutic effects is the therapeutic index. Compounds that the dose until an appropriate response is obtained. In addi exhibit high therapeutic indices are preferred. While com tion, it is understood that the specific dose level for any pounds that exhibit toxic side effects may be used, care particular animal Subject will depend upon a variety of should be taken to design a delivery system that targets Such factors including the activity of the specific compound compounds to the site of affected tissue in order to minimize employed, the age, body weight, general health, gender, and potential damage to uninfected cells and, thereby, reduce diet of the subject, the time of administration, the route of side effects. administration, the rate of excretion, any drug combination, 0583. The data obtained from the cell culture assays and and the degree of expression or activity to be modulated. animal studies described hereinabove can be used to formu late a range of dosage for use in humans (preferably a dosage 0587 Pharmaceutical compositions of the present inven within a range of circulating concentrations that include the tion may also include a therapeutic moiety Such as a cyto ED50 with little or no toxicity). The dosage may vary within toxin (i.e., an agent that is detrimental to a cell), a thera this range depending upon the formulation and the route of peutic agent, or a radioactive ion can be conjugated to the administration. For any compound used in the method of the biomolecular sequences of the present invention or related invention, the therapeutically effective dose can be esti compositions, described hereinabove (e.g., antibodies, anti mated initially from cell culture assays. A dose may be sense molecules, ribozymes etc.). The cytotoxin can be, for formulated in animal models to achieve a circulating plasma example, taxol, cytochalasin B, gramicidin D, ethidium concentration range that includes the IC50 (i.e., the concen bromide, emetine, mitomycin, etoposide, tenoposide, Vinc tration of the test compound which achieves a half-maximal ristine, vinblastine, colchicin, doxorubicin, daunorubicin, inhibition of symptoms) as determined in cell culture. Such dihydroxy anthracin dione, mitoxantrone, mithramycin, information can be used to more accurately determine useful actinomycin D, 1-dehydrotestosterone, glucocorticoids, doses in humans. Levels in plasma may be measured, for procaine, tetracaine, lidocaine, propranolol, puromycin, example, by high performance liquid chromatography. maytansinoids (e.g., maytansinol; see U.S. Pat. No. 5,208, 020), CC-1065 (see U.S. Pat. Nos. 5,475,092, 5,585,499, 0584) A therapeutically effective amount of a protein of and 5,846,545) and analogs or homologs thereof. Therapeu the present invention can range from about 0.001 to 30 tic agents include antimetabolites (e.g., methotrexate, 6-mer mg/kg body weight (e.g., about 0.01 to 25 mg/kg, about 0.1 captopurine, 6-thioguanine, cytarabine, 5-fluorouracil decar to 20 mg/kg, or about 1 to 10 (e.g., 2-9, 3-8, 4-7, or 5-6) bazine), alkylating agents (e.g., mechlorethamine, thioepa mg/kg). The protein can be administered one time per week chlorambucil, CC-1065, melphalan, carmustine (BSNU) for between about 1 to 10 weeks (e.g., 2 to 8 weeks, 3 to 7 and lomustine (CCNU), cyclothosphamide, busulfan, dibro weeks, or about 4, 5, or 6 weeks). However, a single momannitol, Streptozotocin, mitomycin C, and cis-dichlo administration can also be efficacious. Certain factors can rodiamine platinum (II) (DDP) cisplatin), anthracyclines influence the dosage and timing required to effectively treat (e.g., daunorubicin (formerly daunomycin) and doxorubi a subject. These factors include the severity of the disease, cin), antibiotics (e.g., dactinomycin (formerly actinomycin), previous treatments, and the general health or age of the bleomycin, mithramycin, and anthramycin (AMC)), and Subject. anti-mitotic agents (e.g., Vincristine, vinblastine, taxol and 0585 When the active ingredient is an antibody, the maytansinoids). Radioactive ions include, but are not limited dosage can be about 0.1 mg/kg of body weight (generally to iodine, yttrium and praseodymium. US 2007/0083334 A1 Apr. 12, 2007 44

0588 Other therapeutic moieties include, but are not egorized as pathologic (i.e., characterizing or constituting a limited to, toxins such as abrin, ricin A, pseudomonas disease state), or can be categorized as non-pathologic (i.e., exotoxin, or diphtheria toxin; a protein such as tumor deviating from normal but not associated with a disease necrosis factor, Y-interferon, B-interferon, nerve growth fac state). The term is meant to include all types of cancerous tor, platelet derived growth factor, tissue plasminogen acti growths or oncogenic processes, metastatic tissues or malig vator; or, biological response modifiers such as, for example, nantly transformed cells, tissues, or organs, irrespective of lymphokines, interleukin-1 (IL-1), interleukin-2 (IL-2), histopathologic type or stage of invasiveness. “Pathologic interleukin-6 (IL-6), granulocyte macrophase colony stimu hyperproliferative” cells occur in disease states character lating factor (GM-CSF), granulocyte colony stimulating ized by malignant tumor growth. Examples of non-patho factor (G-CSF), or other growth factors. logic hyperproliferative cells include proliferation of cells 0589 The nucleic acid molecules of the invention can be associated with wound repair. inserted into Vectors and used as gene therapy vectors. Gene 0595. The terms “cancer or “neoplasms include malig therapy vectors can be delivered to a subject by, for example, nancies of the various organ systems, such as affecting lung, intravenous injection, local administration (see U.S. Pat. No. breast, thyroid, lymphoid, gastrointestinal, and genito-uri 5.328,470) or by stereotactic injection (see e.g., Chen et al., nary tract, as well as adenocarcinomas, which include malig Proc. Natl. Acad. Sci. USA 91:3054-3057, 1994). The nancies Such as most colon cancers, renal-cell carcinoma, pharmaceutical preparation of the gene therapy vector can prostate cancer and/or testicular tumors, non-small cell include the gene therapy vector in an acceptable diluent, or carcinoma of the lung, cancer of the Small intestine and can comprise a slow release matrix in which the gene cancer of the esophagus. delivery vehicle is imbedded. Alternatively, where the com 0596) The term “carcinoma’ refers to malignancies of plete gene delivery vector can be produced intact from epithelial or endocrine tissues including respiratory system recombinant cells (e.g. retroviral vectors), the pharmaceu carcinomas, gastrointestinal system carcinomas, genitouri tical preparation can include one or more cells which nary system carcinomas, testicular carcinomas, breast car produce the gene delivery system. The pharmaceutical com cinomas, prostatic carcinomas, endocrine system carcino positions of the invention can be included in a container, mas, and melanomas. Exemplary carcinomas include those pack, or dispenser together with instructions for administra forming from tissue of the cervix, lung, prostate, breast, tion. head and neck, colon and ovary. The term also includes carcinosarcomas (e.g., which include malignant tumors Methods of Treatment composed of carcinomatous and sarcomatous tissues). An 0590 The present invention provides for both prophy "adenocarcinoma’ refers to a carcinoma derived from glan lactic and therapeutic methods of treating a Subject at risk of dular tissue or in which the tumor cells form recognizable (or Susceptible to) a disorder or having a disorder associated glandular structures. The term "sarcoma' is art recognized with aberrant or unwanted expression or activity of a nucleic and refers to malignant tumors of mesenchymal derivation. acid or protein of the invention. “Treatment encompasses As used herein, the term "hematopoietic neoplastic disor the application or administration of a therapeutic agent to a der(s)' includes diseases involving hyperplastic/neoplastic patient, or to an isolated tissue or cell line (e.g., one obtained cells of hematopoietic origin. A hematopoietic neoplastic from the patient to be treated), with the purpose of curing or disorder can arise from myeloid, lymphoid or erythroid lessening the severity of the disease or a symptom associated lineages, or precursor cells thereof. Preferably, the diseases with the disease. arise from poorly differentiated acute leukemias (e.g., eryth roblastic leukemia and acute megakaryoblastic leukemia). 0591. Whether carried out prophylactically or therapeu Additional exemplary myeloid disorders include, but are not tically, the methods of the invention can be specifically limited to, acute promyeloid leukemia (APML), acute myel tailored or modified, based on knowledge obtained from the ogenous leukemia (AML) and chronic myelogenous leuke field of pharmacogenomics (see above). mia (CML) (see Vaickus, Crit Rev. in Oncol./Hemotol. 11:267-97, 1991); lymphoid malignancies include, but are 0592. Thus, the invention provides a method for prevent not limited to acute lymphoblastic leukemia (ALL) which ing in a Subject, a disease associated with mis-expression of includes B-lineage ALL and T-lineage ALL, chronic lym a nucleic acid or protein of the present invention. Such phocytic leukemia (CLL), prolymphocytic leukemia (PLL), diseases include cellular proliferative and/or differentiative hairy cell leukemia (HLL) and Waldenstrom's macroglobu disorders, disorders associated with bone metabolism, linemia (WM). Additional forms of malignant lymphomas immune disorders, cardiovascular disorders, liver disorders, include, but are not limited to non-Hodgkin lymphoma and viral diseases, pain or metabolic disorders. variants thereof, peripheral T cell lymphomas, adult T cell 0593 Examples of cellular proliferative and/or differen leukemia/lymphoma (ATL), cutaneous T-cell lymphoma tiative disorders include cancer (e.g., carcinoma, sarcoma, (CTCL), large granular lymphocytic leukemia (LGF), metastatic disorders or hematopoietic neoplastic disorders Hodgkin’s disease and Reed-Sternberg disease. Such as leukemias and lymphomas). A metastatic tumor can 0597. The leukemias, including B-lymphoid leukemias, arise from a multitude of primary tumor types, including but T-lymphoid leukemias, undifferentiated leukemias, erythro not limited to those of prostate, colon, lung, breast or liver. leukemia, megakaryoblastic leukemia, and monocytic leu 0594. The terms “cancer.”“hyperproliferative,” and “neo kemias are encompassed with and without differentiation; plastic.” are used in reference to cells that have exhibited a chronic and acute lymphoblastic leukemia, chronic and capacity for autonomous growth (i.e., an abnormal state or acute lymphocytic leukemia, chronic and acute myelog condition characterized by rapid cellular proliferation). enous leukemia, lymphoma, myelo dysplastic syndrome, Hyperproliferative and neoplastic disease states can be cat chronic and acute myeloid leukemia, myelomonocytic leu US 2007/0083334 A1 Apr. 12, 2007 kemia; chronic and acute myeloblastic leukemia, chronic 1:32–46, (1997) are also useful therapeutics. Since nucleic and acute myelogenous leukemia, chronic and acute promy acid molecules can usually be more conveniently introduced elocytic leukemia, chronic and acute myelocytic leukemia, into target cells than therapeutic proteins may be, aptamers hematologic malignancies of monocyte-macrophage lin offer a method by which protein activity can be specifically eage. Such as juvenile chronic myelogenous leukemia; sec decreased without the introduction of drugs or other mol ondary AML, antecedent hematological disorder, refractory ecules that may have pluripotent effects. anemia; aplastic anemia; reactive cutaneous angioendothe liomatosis; fibrosing disorders involving altered expression 0602. As noted above, the nucleic acids of the invention in dendritic cells, disorders including systemic sclerosis, and the proteins they encode can be used as immunothera E-M syndrome, epidemic toxic oil syndrome, eosinophilic peutic agents (to, e.g., elicit an immune response against a fasciitis localized forms of scleroderma, keloid, and fibros protein of interest). However, in Some circumstances, unde ing colonopathy; angiomatoid malignant fibrous histiocy sirable effects occur when a subject is injected with a protein toma; carcinoma, including primary head and neck squa or an epitope that stimulate antibody production. In those mous cell carcinoma; sarcoma, including kaposi's sarcoma; circumstances, one can instead generate an immune fibroadanoma and phyllodes tumors, including mammary response with an anti-idiotypic antibody see, e.g., Herlyn, fibroadenoma; Stromal tumors; phyllodes tumors, including Ann. Med. 31:66-78, 1991 and Bhattacharya-Chatterjee and histiocytoma; erythroblastosis; and neurofibromatosis. Foon, Cancer Treat. Res. 94:51-68, (1998)). Effective anti idiotypic antibodies stimulate the production of anti-anti 0598. Examples of disorders involving the heart or “car idiotypic antibodies, which specifically bind the protein in diovascular disorders' include, but are not limited to, a question. Vaccines directed to a disease characterized by disease, disorder, or state involving the cardiovascular sys expression of the nucleic acids of the present invention can tem, e.g., the heart, the blood vessels, and/or the blood. A also be generated in this fashion. In other circumstances, the cardiovascular disorder can be caused by an imbalance in target antigen is intracellular. In these circumstances, anti arterial pressure, a malfunction of the heart, or an occlusion bodies (including fragments, single chain antibodies, or of a blood vessel, e.g., by a thrombus. Examples of Such other types of antibodies described above) can be internal disorders include hypertension, atherosclerosis, coronary ized within a cell by delivering them with, for example, a artery spasm, congestive heart failure, coronary artery dis lipid-based delivery system (e.g., LipofectinTM or lipo ease, valvular disease, arrhythmias, and cardiomyopathies. Somes). Single chain antibodies can also be administered by delivering nucleotide sequences that encode them to the 0599 AS discussed, diseases associated (e.g., causally target cell population (see, e.g., Marasco et al., Proc. Natl. associated) with over expression of a nucleic acid of the Acad. Sci. USA 90:7889-7893, 1993). invention (as determined, for example, by the in vivo or ex vivo analyses described above), can be treated with tech 0603 Additional objects, advantages, and novel features niques in which one inhibits the expression or activity of the of the present invention will become apparent to one ordi nucleic acid or its gene products. For example, a compound narily skilled in the art upon examination of the following (e.g., an agent identified using an assay described above) examples, which are not intended to be limiting. Addition that exhibits negative modulatory activity with respect to a ally, each of the various embodiments and aspects of the nucleic acid of the invention (the expression or over expres present invention as delineated hereinabove and as claimed sion of which is causally associated with a disease) can be in the claims section below finds experimental Support in the used to prevent and/or ameliorate that disease or one or more following examples. of the symptoms associated with it. The compound can be a peptide, phosphopeptide, Small organic or inorganic mol EXAMPLES ecule, or antibody (e.g., a polyclonal, monoclonal, human ized, anti-idiotypic, chimeric or single chain antibodies, and 0604 Reference is now made to the following examples, Fab, F(ab')2 and Fab expression library fragments, scfV which together with the above descriptions, illustrate the molecules, and epitope-binding fragments thereof). invention in a non limiting fashion. 0600 Further, antisense, ribozyme, and triple-helix mol 0605 Generally, the nomenclature used herein and the ecules (see above) that inhibit expression of the target gene laboratory procedures utilized in the present invention (e.g., a gene of the invention) can also be used to reduce the include molecular, biochemical, microbiological and recom level of target gene expression, thus effectively reducing the binant DNA techniques. Such techniques are thoroughly level of target gene activity. If necessary, to achieve a explained in the literature. See, for example, “Molecular desirable level of gene expression, molecules that inhibit Cloning: A laboratory Manual Sambrook et al., (1989); gene expression can be administered with nucleic acid “Current Protocols in Molecular Biology” Volumes I-III molecules that encode and express target gene polypeptides Ausubel, R. M., ed. (1994); Ausubel et al., “Current Proto exhibiting normal target gene activity. Of course, where the cols in Molecular Biology”, John Wiley and Sons, Balti assays of the invention indicate that expression or over more, Md. (1989); Perbal, “A Practical Guide to Molecular expression is desirable, the nucleic acid can be introduced Cloning, John Wiley & Sons, New York (1988); Watson et into cells via gene therapy methods with little or no treat al., “Recombinant DNA, Scientific American Books, New ment with inhibitory agents (this can be done to combat not York; Birren et al. (eds) "Genome Analysis: A Laboratory only under expression, but over secretion of a gene product). Manual Series”, Vols. 1-4, Cold Spring Harbor Laboratory Press, New York (1998); methodologies as set forth in U.S. 0601 Aptamer molecules (nucleic acid molecules having Pat. Nos. 4,666,828; 4,683,202; 4,801,531; 5,192,659 and a tertiary structure that permits them to specifically bind to 5,272,057: “Cell Biology: A Laboratory Handbook'', Vol protein ligands; see, e.g., Osborne et al., Curr. Opin. Chem. umes I-III Cellis, J. E., ed. (1994): “Current Protocols in Biol. 1: 5-9, (1997) and Patel Curr. Opin. Chem. Biol. Immunology” Volumes I-III Coligan J. E., ed. (1994); Stites US 2007/0083334 A1 Apr. 12, 2007 46 et al. (eds), “Basic and Clinical Immunology” (8th Edition), National Center for Biotechnology Information (NCBI) was Appleton & Lange, Norwalk, Conn. (1994); Mishell and used as an input to the LEADS platform as described Shigi (eds), “Selected Methods in Cellular Immunology’. Shoshan et al. (2001) Proc. SPIE Microarrays: Optical W. H. Freeman and Co., New York (1980); available immu Technologies and Informatics 4266:86-95; Matloubian noassays are extensively described in the patent and Scien (2000) Nat. Immunol. 1:298-304; David et al. (2002) J. Biol. tific literature, see, for example, U.S. Pat. Nos. 3,791,932: Chem. 277:18084-18090; Sorek et al. (2002) Genome Res. 3,839,153; 3,850,752; 3,850,578; 3,853.987; 3,867,517; 12:1060-7). UniGene Build #146 and libraryQuest.txt were 3,879,262; 3,901,654; 3,935,074; 3,984,533; 3,996.345; obtained from NCBI and Cancer Genome Anatomy Project 4,034,074; 4,098,876; 4,879,219; 5,011,771 and 5,281,521; (CGAP) in National Cancer Institute (NCI), respectively. “Oligonucleotide Synthesis” Gait, M. J., ed. (1984); “Nucleic Acid Hybridization’ Hames, B. D., and Higgins S. 0611 EST tissue information EST information was J., eds. (1985): “Transcription and Translation’ Hames, B. available in web form from Library Browser or Library D., and Higgins S. J., eds. (1984); "Animal Cell Culture' Finder in NCBI or in the flat file libraryQuest.txt. The file Freshney, R. I., ed. (1986): “Immobilized Cells and listed 53 tissue sources, 5 histological states (cancer, mul Enzymes' IRL Press, (1986): “A Practical Guide to Molecu tiple histology, normal, pre-cancer, and uncharacterized his lar Cloning Perbal, B., (1984) and “Methods in Enzymol tology), 6 types of tissue preparations (bulk, cell line, ogy” Vol. 1-317, Academic Press: “PCR Protocols: A Guide flow-sorted, microdissected, multiple preparation, and To Methods And Applications'. Academic Press, San Diego, uncharacterized), and brief descriptions on each library. Calif. (1990); Marshak et al., “Strategies for Protein Puri 5318 libraries were from bulk tissue preparation including fication and Characterization—A Laboratory Course 5000 ORESTES libraries Camargo et al. (2001) Proc. Natl. Manual” CSHL Press (1996); all of which are incorporated Acad. Sci. USA 98: 12103-12108}, 329 from cell lines, 37 by reference as if fully set forth herein. Other general flow-sorted, 66 microdissected, 5 multiple preparation, and references are provided throughout this document. The 1121 were from uncharacterized preparations. Excluding procedures therein are believed to be well known in the art ORESTES libraries, 507 libraries were designated as “non and are provided for the convenience of the reader. All the normalized and 100 were designated normalized or sub information contained therein is incorporated herein by tracted indicating the pretreatment of mRNA before cDNA reference. library construction. A small number of libraries were derived from the same original sample. These were not Example 1 considered separately. Library counts of ESTs rather than direct EST counts were used to provide semi-quantitative Identification of Alternatively Spliced Expressed measurements of expression level, since EST counts in some Sequences Background cases reflect the prevalence of ESTs in one or a few particular libraries, and library counts provide better indi 0606. The etiology of many kinds of cancers, especially cations across different tissue types when both normalized those involving multiple genes or sporadic mutations, is yet and non-normalized libraries were analyzed. Such tissue to be elucidated. Accumulative EST information coming information analyses are limited to those tissues with a from heterogeneous tissues and cell-types, can be used as a sufficient number of libraries. The inclusion of normalized considerable source to understanding some of the events cDNA libraries allowed the examination of genes expressed inherent to carcinogenesis. at low levels. 0607 Although a large number of current bioinformatics 0612 The ESTs from pooled tissue or uncharacterized tools are used to predict tissue specific genes in general and tissue were considered as non-conforming in order to main cancer specific genes in particular, all fail to consider tain the robustness of the results. In addition, 139.243 ESTs alternatively spliced variants Boguski and Schuler (1995) that had no library information were considered non-con Nat. Genet. 10:369-71, Audic and Claverie (1997) Genome forming in investigating tissue- or cancer-specific alternative Res. 7: 986-995: Huminiecki and Bicknell (2000) Genome splicing events. Res. 10:1796-1806; Kawamoto et al. (2000) Genome Res. 10:1817-1827). Alternative splicing is also overlooked by 0613 Results—Human EST and mRNA sequences wet laboratory methods such as SAGE and microarray aligned against genomic sequences and clustered through experiments which have been widely used to study gene Compugen’s LEADS platform were used to identify intron expression, however remain to be linked to alternative boundaries and alternative splicing sites Shoshan et al. splicing modeling see Background section and Valculescu (2001) Proc. SPIE Microarrays: Optical Technologies and et al. (1995) Science 270:484-487; Caron et al. (2001) Informatics 4266:86-95; Matloubian (2000) Nat. Immunol. Science 291: 1289-1292 and Schena et al. (1995) Science 1:298-304: David et al. (2002) J. Biol. Chem. 277:18084 270:467-470). 18090; Sorek et al. (2002) Genome Res. 12:1060-7). 0608. A computational-based approach was developed to 0614) 20,301 clusters with 2.0 million ESTs contained at identify alternatively spliced transcripts, which are least one mRNA sequence, in general agreement with Uni expressed in a temporal and/or spatial pattern. Examples 1-4 Gene build #148 with 20.876 such clusters. The remaining below describe the identification of cancer specific alterna EST sequences, which were clustered to unknown regions of tively spliced isoforms, which were identified according to known genes or to unknown genes were not analyzed. Table the teachings of the present invention. 1 below provides some statistics about EST and mRNA clustering. 125,115 introns, and 213,483 exons were aligned 0609 Experimental Procedures and Reagents either with an mRNA, or with ESTs from at least two 0610 DATA and LEADS alternative splicing model libraries if there was no RNA aligned to the gene segment. ing GenBank version 125 with genomic build # 25 from This was effected to exclude possible genomic contamina US 2007/0083334 A1 Apr. 12, 2007 47 tion in expressed sequences, or other EST technology asso where shown to be included in 7570 clusters, while 6111 ciated faults. clusters comprised both alternatively spliced donor and acceptor sites. TABLE 1. Example 3 EST Cluster RNA Cluster Tissue Distribution of ESTs and Libraries 1 963 1 6527 Following LEADS Alternative Splicing Modeling 2-3 1457 2-3 6372 4 7 1532 4. 7 6204 0.617 Cluster analysis performed to identify alternatively 8-1S 1655 8-1S 1915 spliced ESTs (see Example 2) was further used for tissue 16-31 1879 16-31 226 32-63 2SOO 32-63 40 information extraction. Table 3 below lists ten tissue types 64 127 3481 64 and above 17 with the largest numbers of ESTs along with those from pooled or uncharacterized tissues. 128-2SS 3240 Total 2O3O1 256-511 1406 S12 1023 422 TABLE 3 1024-above 1766 Number of ESTS Number of Libraries Total 2O3O1 Tissue Normal Cancer Total Normal Cancer Total

Brain 93O24 878O3 18O827 30 25 55 Lung 35455 85596 121051 92 1S6 248 Example 2 Placenta 86571. 27291. 11386.2 259 3 262 Uterus 30.052 71521 1 O1573 99 107 2O6 Colon 23796 74998 98.794 274 445 719 Cluster Distribution of Alternatively Spliced Donor Kidney 42628 46811 89439 9 S4 63 and Acceptor Sites Skin 32436 43O85 75521 8 10 18 Prostate 4.0312 2.7963 68275 131 135 266 0615. Alternative splice events include exon skipping, Mammary gland 26509 36638 63147 30S 665 970 alternative 5' or 3' splicing, and intron retention, which can Head 12354 SO167 62S21 62 800 862 and neck be described by the following simplification: a single exon Pooled 1786.18 992 17961O 15 1 16 connects to at least two other exons in either the 3'end (donor Uncharacterized 761.93 9721 85914. 778 106 884 site) or the 5' end (acceptor site), as shown in FIG. 3. Table 2 below lists some statistics of alternative splicing events based on this simplification. 0618. Evidently, ESTs derived from lung, uterus, colon, kidney, mammary gland, head and neck were obtained TABLE 2 aminly from cancerous libraries. The distribution of ESTs in normal and cancer libraries in each case was taken into a Alternative donor site Cluster Alternative acceptor site Cluster consideration and used as a parameter for scoring the 1 3690 1 3751 differential expression annotation. 2 2269 2 2388 3 1348 3 1511 Example 4 4 760 4 799 5 435 5 SO8 Identification of Putative Cancer Specific 6 and above 566 6 and above 710 Alternatively Spliced Transcripts Total 9068 Total 9667 0619. Alternative splicing events restricted to cancertis Sues were identified by looking for any donor-acceptor concatenations exclusively supported by ESTs from cancer 0616 Distribution analysis—As described hereinabove a tissues. Table 4 below lists six examples for such. An valid donor-acceptor concatenation must be supported by at interesting example is the NONO gene (GenBank Accession least one mRNA or by ESTs from at least two different No: BC003129), represented by 1496 ESTs. The NONO libraries. 8254 clusters were found to have both alternatively gene has been previously suggested to code for a possible spliced donor and acceptor sites. When the lower bound on splicing factor Dong B, Horowitz, D S. Kobayashi R, the number of EST libraries supporting each donor-acceptor Krainer A. R. Nucleic Acids Res (1993) 25: 21 (17):4085 concatenation was increased to three, 13.402 alternatively 92). It’s newly discovered restricted expression to cancer spliced donor sites were shown to be included in 6892 tissues Suggests that alternative splicing of multiple genes clusters and 15,015 alternatively spliced acceptor sites may be regulated during carcinogenesis. TABLE 4

Non Uni Spe- Specific

mRNA Gene Total cific E Possible

EST ID PoS. E R Type E R C N R function BCOO3129 17220 7 123, 237 1496 8 d-- 15 1 46 20 3 Splicing factor candidate US 2007/0083334 A1 Apr. 12, 2007

TABLE 4-continued

Non Uni Spe- Specific mRNA Gene Total cific E Possible EST ID PoS. E R Type E R C N R function NM 018035 279851 220, 301 S84 2 d- 7 O 21 9 2 No known function ALS1936S 21938 474, 513 162 3 8- 8 3 6 1 O Oxysterol binding BF341,144 155596 507,542 148. 1 8-H 6 O 7 4 1 BCL2 fadenovirus E1B interacting ABOO9357 7510 1372, 1452 205 6 8-H 7 2 4 2 MAPKKK 7 NM 002382 42712 57, 84 165 7 8- 8 1 7 3 6 MAX protein One mRNA/EST containing both splicing junctions identifies the cluster. Type - indicates the type of transcript, which was shown to be cancer specific. The following symbols were used, (d) donor site; (a) acceptor site: (+) proximal exon; (-) distal exon. Total - indicates the number of ESTs or mRNAs which were used for analysis. Specific/non-specific - indicates total library number which was used for analysis. All mRNA sequences under specific were from cancer tissues. Position - identifies splicing boundaries on the sequence. E EST: R-RNA: C—Cancer; N Normal.

Example 5 of EC (enzyme nomenclature) to GO node. MGI has assigned 5984 SwissProt proteins with GO nodes (http:// Ontological Annotation of Proteins—Data www.mgi.org). 31869 SwissProt proteins were assigned a Collection GO node using SwissProt keyword correspondence and 33048 SwissProt proteins were assigned GO node by Inter BACKGROUND Pro scanning (http://www.ebi.ac.uk/interpro/). The nonre 0620 Recent progress in genomic sequencing, computa dundant protein database was constructed from GenPep file tional biology and ontology development has presented an from NCBI, along with proteins collected from the Saccha opportunity to investigate broad biological systems romyces genome database (SGD) Dwight et al. (2002) 0621. A gene ontology system was developed and spe Nucleic Acids Res. 30:69-72 and the Drosophila genome cifically used to annotate human proteins. Examples 5-9 database (Flybase) The Flybase consortium 2002 Nucleic below describe the development of an ontology engine, a Acids Res. 30:106-108), with a total number of 670130. computational platform for annotation and resultant anno tations of human proteins. Example 6 0622 Gene Ontology (GO) and gene association files were obtained from the Gene Ontology Consortium http:// Generation of Progressive Sequence Clusters www.geneontology.org/. InterPro Scan from http://www.e- 0623) A two-stage strategy was used to build a detailed bi.ac.uk/interprof, and enzyme database from http://ex homology map between all proteins in the comprehensive pasy.proteome.org.au/enzyme?. The following databases and protein database (Example 5). In a first stage, all protein versions were used, GenBank release 122.0, SwissProt release 39.0, Enzyme database Release 26.0, InterPro data pairs with an E score lower than 0.01 using Blastp with base as of Apr. 6, 2001, NCBI LocusLink data as of Mar. 6, default parameters were cataloged. Table 5 lists the distri 2001, MEDLINE databases as of Apr. 6, 2001, and the bution of Blastp results. following files from Gene Ontology Consortium: gene as sociation.fb (version 1.26, Feb. 19, 2001), gene association TABLE 5 .mgi (version 1.19, Mar. 1, 2001), gene association.sgd E escore Percentage (version 1.251, Mar. 13, 2001), gene association.pombase 10-10 10-2 17.58 (version 1.2, Jul. 22, 2000), ec2go (version 1.2, Oct. 23, 10-20 10-10 13.81 2000), and swp2go (version 1.4, Nov. 15, 2000). 58118 10–30 10-20 11.02 SWISS-Prot proteins have been assigned with at least one 10-40 10-30 12.91 GO node by the following sources: 15534 proteins were 10-50 10-40 10.24 assigned with at least a functional GO node by conversion US 2007/0083334 A1 Apr. 12, 2007 49

abstracts. In addition, an upper limit of word frequency TABLE 5-continued (common words Such as and, gene, protein) and a lower limit of word frequency were defined through repeated E escore Percentage training process and manual review. The words within the 10-60 10-50 5.81 upper and the lower limits were considered as predictive. 10–70 10-60 3.64 Since the correlation between the GO nodes and specific 10-80 10-70 2.65 10–90 10-80 2.86 words is positive by nature, negative sentences with words 10-100 10-90 2.53 Such as not and its variants, such as unlikely or unre 10-110 10-100 2.18 sponsive were excluded from consideration. 10-120 10-110 1.58 10-130 10-120 1...SO 0627 Genes with GO annotation from other sources such 10-140 10-130 1.13 as GO consortium, InterPro Scanning or keyword mappings 10-150 10-140 1.01 10-160 10-150 1.01 were used as training data to obtain the correlation between 10-170 10-160 O.92 specific words and specific GO nodes. 10-178 10-170 O.90 0628. The following formula was used. S=log(P(m.g)/ O.OO 6.72 P(m)P(g)), wherein S is the LOD score for word m-GOg combination, wherein P(mg) is the frequency of term mand 0624. In the second stage, all homologous protein pairs GO node g co-occurrence among all word and GO combi were aligned through Needlman-Wunsch algorithm with a nations, P(m) is the frequency of occurrence of term m global alignment to obtain the percentage of identical amino among all word occurrences, and P(g) is the frequency of acids between the two proteins. BLOSUM62 was used as the occurrence of GO node gamong all GO occurrences. substitution matrix. The percentage of identity was defined 0629. In order to predict GO node for any specific gene as the number of amino acids aligned with nonnegative which is linked to one to a few dozen words, the sums of scores divided by the number of amino acids in both aligned LOD scores from all these words for each possible GO were and unaligned length of two proteins in the global alignment. calculated and sorted, and used for further GO annotation. Table 6 shows a percent identity distribution of protein pairs Multiple MeSH terms-GO correlations were tested and were following global alignment. Evidently, the majority of pro found to be no more informative than the single MeSH tein pairs (i.e., 68.5%) exhibited identity levels in the range term-GO correlation, and therefore they were not used. Of 10-50%. 0630 Results Table 7 below, lists general statistics of TABLE 6 text information from publicly available sequence databases. Identity Level Percentage TABLE 7

O-10% 5.67 MeSH Definition 10-20% 24.66 tel Title Abstract line 20-30% 1994 30-40% 10.94 Number of proteins 110608 106190 113073 516952 40-50% 7.31 Number of articles 71703 773.14 82654 na SO-60% 7.09 Number of unique 40011 18175 26630 25915 60-70% 7.24 words* 70-80% 6.70 Average number of 19. OS 2.70 11.65 6.56 80-90% S.98 words per article or per 90-100% 4.47 definition line

Example 7 0631 A predictive probabilistic model was then applied to create possible GO annotations based on the associated Text Mining text information. Definition lines of sequence records, MeSH term annotations, titles and abstracts from sequence 0625 Correlations between presence of specific MeSH related publications were modeled separately. terms, or specific English words in available text informa 0632. The frequency of association of a specific term tion and Gene Ontology assignments in the training data with a specific GO node in the training data was examined. were obtained. The correlations were then used to predict Parameters such as boundaries of the frequency of MeSH Gene Ontology for unassigned genes. terms and other words were optimized through the training 0626 Method Non-characters in titles and abstracts, process, using self-validation and cross validation methods. and in definition line of gene records were eliminated and LOD (logarithm of odds) scores, defined as the logarithm of words were stemmed through the Lingua::stem module from the ratio between the association frequency of any term-GO www.cpan.org. Due to the standardized and curated nature pair and the calculated frequency of the random combination of MeSH terms, MeSH terms were not parsed or stemmed. of this pair, were used to indicate the relatedness of certain The frequency of each word in all the available text infor terms with certain GO node. These LOD scores were found mation was calculated. Words that occurred at least 5 times to be correlative with the accuracy of GO prediction, as over the whole text information space were retained for shown in FIG. S. Text information from titles of MEDLINE further studies. This cutoff threshold was used to eliminate records appeared to have more predictive power, in particu rare words, wrong spellings, and sometimes even the base lar at lower LOD scores, than text information from other pair sequence present in either the definition lines or categories. This Suggests that the title tended to Summarize US 2007/0083334 A1 Apr. 12, 2007 50 the gist of an article in a straightforward manner. MeSH terms had similar predictive capabilities as the abstracts, TABLE 9 possibly because the MeSH terms were derived from the abstracts, and thus had similar information contents. Input 0633 Based on text information, a significant number of GO Consortium annotation, Enzyme proteins were predicted to be associated with one or more conversion, InterPro Text mining GO nodes. Table 8 below, lists the number of proteins with mapping, etc. ProLoc Output predicted GO node from four types of text information in the Cellular Component 44702 522179 574607 three categories of GO. These predicted GO annotations Molecular Function 85626 S26083 580767 were incorporated in GO process to increase the accuracy of Biological Process 69726 S2S842 578.636 homology-based GO annotation and to generate de novo annotations. 0638 Over 85% of proteins were annotated with one or TABLE 8 more GO nodes in each of three GO categories. Table 10 MeSH Definition below, analyses the number of proteins annotated at different tel Title Abstract Line Total homology levels, showing that GO annotations were achieved throughout the homology spectra. Cellular Component 57845 52094 57597 S14191 S21396 Molecular Function 57845 54152 57632 S16319 S23384 Biological Process 57845 53970) 57631 S16402 S2338S TABLE 10 Cellular Molecular 0634) To further enhance the accuracy and coverage of Component Function Biological Process GO annotation process, a computational platform for pre Text 32257 341.37 3O149 10-2 10-10 87967 71717 74277 dicting cellular localization, ProLoc (Einat Hazkani-Covo, 10-10 10-50 122992 70O88 793.18 Erez, Levanon, Galit Rotman, Dan Graur and Amit Novik, a 10-50 O.O 98.059 55132 59051 manuscript Submitted for publication), was used to predict 35% 75% 111130 97209 108334 the cellular localization of individual proteins based on their 75%-90% 38509 68.282 67429 inherent features such as specific localization signatures, 90% 99% 3.8991 98.576 90352 protein domains, amino acid composition, pI, and protein Input GO 447O2 85626 69726 length. Only protein sequences that begin with methionine underwent ProLoc analysis. Thus, 88997 out of 93110 proteins in SwissProt version 39 were analyzed, and 78111 Example 9 proteins have one to three GO predictions in cellular com ponent category. Statistical Validation of Ontological Annotations Example 8 0639 Gene ontology annotations, which were assigned according to the teachings of the present invention, were Gene Ontology Assignment assessed by automatic cross-validation. One fifth of input of input GO annotations were withheld during the GO anno 0635 Progressive single-linkage clusters with 1% reso tation process and the resultant annotations were compared lution were generated to assign GO annotations (i.e., nodes) with the withheld GO nodes. For each protein, the GO node to proteins (see Example 6). Protein clustering and annota with the lowest error score was examined. Table 11 below, tion assignment were effected at each level of homology. lists the coverage and accuracy of Such representative test. The resolution was 1% for global alignment identity (i.e., clustering was first effected at 98%, then at 97% and so TABLE 11 forth). The resolution was 10 fold for the Escore of a BlastP homology pair. For example, clustering was performed at Total Predicted GO Accurate GO 10, then at 10' and so forth. Cellular Component 7431 7186 4642 0636. To examine clustering efficiency and homology Molecular Function 12999 12864 101.38 transitivity, all homology pairs clustered with at least 90% Biological Process 10811 10690 808O identity were examined. At this level, there were a total of 57,004 clusters containing 263,259 protein members. 0.640. Evidently, sample coverage ranged from 96% to Among these clusters, 23.231 clusters contained at least 99% and the reproducibility was between 65% and 80%. three protein members (see FIGS. 6a-c). The lowest homol The lower reproducibility of GO annotations in the “cellular ogy pairs had an identity of 46% while being clustered at component' category, as compared with that in the other two 90% or higher identity levels. GO categories was consistent with the notion that a short 0637 Clusters containing proteins with preassociated or amino acid segment such as a signal peptide affects signifi predicted ontological annotations were analyzed and best cantly protein localization. The presence or absence of Such annotations for individual proteins in the clusters were Small amino acid segments could not be completely captured selected through an error weight calculation. Table 9 below, in sequence similarity comparisons. Detailed analysis of the provides statistics on the number of input gene ontology validation of data indicated that the accuracy of the anno annotations and the number of output annotations following tations correlated with the homology levels (data not processing. shown). Manual validation of assigned annotations was US 2007/0083334 A1 Apr. 12, 2007

performed on a total of 500 annotations and about 85%-93% 0653. Additional lines of the file contain the following of annotations were found to be correct. The higher percent information: age of accuracy in the manual examination over the auto matic cross-validation resulted from the incomplete anno 0654) “*” indicates optional fields: “*” indicates repeat tation of input GO. able features. 0655) “HEST’ represents a list of GenBank accession Example 10 numbers of all expressed sequences (ESTs and RNAs) clustered to a contig, from which a respective transcript is Description of Data derived. The GenBank accession numbers of these expressed sequences are listed only for the first transcript in 0641 Example 10a-e below describe the data table in the contig, e.g. “HEST BC006216.BE674469,BE798748, “Summary table' file, on the attached CD-ROM3. The data NM032716” in Example 10b. The rest of the transcripts table shows a collection of annotations of differentially derived from the same contig, are indicated by an iEST field expressed nucleic acid sequences, which were identified marked with “the same'. according to the teachings of the present invention. 0656 Expressed Sequences, marked with 0642 Each feature in the data table is identified by “if”. “ProDGyXXX', e.g., “ProDGy933” in Example 10d, and expressed sequences, marked with “GeneID XXX', e.g., 0643. Each transcript in the data table is identified by: “GeneID1007Forward” in Example 10e, are proprietary 0644 (i) A Serial number, e.g. “251470 in Example 10a, sequences which do not appear in GenBank database. These “445259'-'445262 in Example 10b. I sequences are deposited in the nucleotide sequence file “ProDG seqs” in the attached CD-ROM2. 0645 (ii) An internal arbitrary transcript accession num ber, e.g. “N62228 4” in Example 10a, “BE674469 0, 0657 Data pertaining to differentially expressed alterna “BE674469 0 124”, “BE674469 1, “BE674469 1 tively spliced sequences is presented in the following for 124” in Example 10b. mat: 0646) The first number of the internal transcript accession 0658) *, * “#TAA CD represents the coordinates of the number is shared by all transcripts which belong to the same differentially expressed sequence segment. A single number contig, and represent alternatively spliced variants of each represents a differentially expressed edge, corresponding to other, e.g. “BE674469 in “BE674469 0”, “BE674469 the specific junction between 2 exons. “TAA CD repre 0 124”, “BE674469 1, “BE674469 1 124 in sented by a pair of numbers represents the start and end Example 10b. positions of a differentially expressed sequence node. For example, “HTAA CD 269 296” in Example 1.0a indicates 0647. The second number of the internal transcript acces that the transcript identified as N62228 4 contains a dif sion number is an internal serial transcript number of a ferentially expressed segment, located between the nucle specific contig, e.g. * 0” or “ 1” in “BE674469 0. otides at positions 269 and 296. “BE674469 0 124”, “BE674469 1, “BE674469 1 0659 *, * “#TAA TIS contains information pertaining 124” in Example 10b. to specific tissue(s), in which the respective transcript is 0648. The third number of the internal transcript acces predicted to be expressed differentially. Tumor tissues are sion number is optional, and represents the GenBank data indicated accordingly. For example, "#TAA TIS lung base version used for clustering, assembly and annotation Tumor indicates that transcript BE674469 0 in Example processes. Unless otherwise mentioned, GenBank database 10b is predicted to be differentially expressed in lung tumor version 126 was used. “124' indicates the use of GenBank tissues. version 124, as in “BE674469 1 124” of Example 10b. 0660) *, ** “#DN' represents information pertaining 0649) “ProDG' following the internal accession number transcripts, which contain altered functional domains, pre indicates an EST sequence data from a proprietary source, dicted to act in a dominant negative manner. This field lists e.g., Examples 3d and 3e. the description of the functional domain(s), which is altered in the respective splice variants e.g., “HDN EGF-like 0650 “han” represents the use of GenBank version 125. domain in Example 10a. This version was used in the annotation of lung and colon cancer specific expressed sequences. 0661 Functional annotations of transcripts based on Gene Ontology (GO) are indicated by the following format. 0651) “lab’ indicates expressed sequences which differ ential pattern of expression has been confirmed in the 0662) *, * “HGO P', annotations related to Biological laboratory. Process, 0652 Transcript accession number identifies each 0663 *, ** “#GO F', annotations related to Molecular sequence in the nucleotide sequence data files “Transcripts Function, and nucleotide seqs part1. “Transcripts nucleotide seqs 0664) *, ** “#GO C. annotations related to Cellular part2 and “Transcripts nucleotide seqs part3 on CD Component. ROMs 1 and 2, and in the respective amino acid sequences data file “Protein.seqs' on CD-ROM2. Of note, some nucle 0665 For each category the following features are otide sequence data files of the above, do not have respective optionally addressed: amino acid sequences in the amino acid sequence file 0.666 “HGOPR represents internal arbitrary accession “Protein.seqs' attached on CD-ROM2. number of the predicted protein corresponding to the func US 2007/0083334 A1 Apr. 12, 2007 52 tionally annotated transcript. This internal accession number ovary Tumor, #TAA CD 269 296 #TAA TIS skin Tumor, identifies the protein in the amino acid sequence file “Pro #TAA CD 59 269 iFTAA TIS ovary, iTAA CD 59 269 tein. seqs' in the attached CD-ROM2, together with the #TAA TIS ovary Tumor, #TAA CD 59 269 iFTAA TIS internal arbitrary transcript accession number. For example, skin Tumor HDN EGF-like domain HGO F iGOPR “#GOPR human 281192 in Example 10a, is a protein human 281192 #GO Acc 3823 #GO Desc antibody iCL2 sequence encoded by transcript N62228 4, which appears #DB sp in the amino acid sequence file “Protein.seqs' in the attached 0676 #EN NRG2 HUMAN #GO P #GOPR human CD-ROM2 and is identified by both numbers, “N62228 4” 281192 #GO Acc 7165 #GO Desc signal transduction #CL and “human 281192. 2 #DB sp#EN NRG2 HUMAN 0667 "iGO Acc’ represents the accession number of the assigned GO entry, corresponding to the following Example 10b “HGO Desc” field. 0677 445259 BE674469 0 #EST BC006216, 0668) “HGO Desc” represents the description of the BE674469,BE798748,NM032716 iTAA CD O 2537 assigned GO entry, corresponding to the mentioned #TAA TIS lung, HTAA CD 0.2537 #TAA TIS lung Tumor “#GO Acc” field. For example, “HGO Acc 7165 iGO Desc signal transduction” in Example 10a, means that 0678) 445260 BE674469 0 124 #124EST BC006216, the respective transcript is assigned to GO entry number BE674469,BE798748.NM 032716 #SA Lung Tumor #RA 7165, corresponding to signal transduction pathway. lung cancer 0669) “HCL represents the confidence level of the GO 0679 445261 BE674469 1 #EST the same #TAA CD assignment, when #CL1 is the highest and iCL5 is the 0 2537 iTAA TIS lung, #TAA CD 0.2537 iTAA TIS lung lowest possible confidence level. Tumor 0670) “HDB' marks the database on which the GO assignment relies on. The 'sp', as in Example 10a, relates to 0680 445262 BE674469 1 124 #124EST BC006216, SwissProt Protein knowledgebase, available from http:// BE674469,BE798748.NM 032716 #SA Lung Tumor #RA www.expasy.ch/sprot/. “InterPro’, as in Example 10c, refers lung cancer to the InterPro combined database, available from http:// www.ebi.ac.uk/interprof, which contains information Example 10c regarding protein families, collected from the following databases: SwissProt (http://www.ebi.ac.uk/swissprot/). 0681 314251 HUMM7BA 0 #EST BF804381, Prosite (http://www.expasy.ch/prosite?), Pfam (http://ww BF805793.BF805830BG978076, HUMM7BA #GO C w.sanger.ac.uk/Software/Pfam?), Prints (http://www.bioinf #GOPR human 313276 #GO Acc 16459 #GO Desc myo man.ac.uk/dbbrowser/PRINTS/), Prodom (http:// sin #CL 2 #DB interpro #EN IPR001609 #GO F #GOPR prodes.toulouse.inra.fr/prodom/). Smart (http://smart.embl human 313281 HGO Acc 3774 HGO Desc motor HCL 1 heidelberg.de/) and Tigrfams (http://www.tigr.org/ #DB sp #EN Q14786 #GO F #GOPR human 313281 TIGRFAMs/). #GO Acc 5524 #GO DescATPbinding #CL 1 #DB sp#EN 0671) “HEN represents the accession of the entity in the Q14786 #GO P #GOPR human 313281 #GO Acc 5983 database (HDB), corresponding to the best hit of the pre #GO Desc starch catabolism #CL 4 HDB sp #EN Q14786 dicted protein. For example, "#DB sp #EN #SA colon, colonic, gut HRA colon normal NRG2 HUMAN” in Example 10a means that the GO assignment in this case was based on SwissProt database, Example 10d while the closest homologue to the assigned protein is depicted in SwissProt entry “NRG2 HUMAN” correspond 0682) 723873 AA157684. TO ProDG #EST ing to protein named “Pro-neuregulin-2 (http)://www.ex 0683 AA157684AA157764AK057980, BF355351, pasy.org/cgi-bin/niceprot.pl?O14511). “HDB interpro iEN ProDGy933 IPR001609 in Example 10c means that GO assignment in this case was based on InterPro database, while the best hit 0684) #GO C #GO Acc 0016021 #GO Desc “integral of the assigned protein is to depicted in membrane protein'iGO F SwissProt accession number “IPR001609, corresponding to “Myosin head (motor domain)” protein family (http:// 0685) #GO Acc 0005978 #GO Desc “glycogen www.ebi.ac.uk/interpro/IEntry?ac=IPR001609). biosynthesis'#GO P #GO Acc 0003707 #GO Desc “ste 0672. The following two fields correspond to the hierar roid hormone receptor chical assignment of the differentially expressed sequences to a specific tissue(s), based on the EST content and EST Example 10e libraries origin within the contig. 0686) 723928 GeneID1007Forward TO ProDG #EST 0673) *, ** “#SA indicates that tissue assignment requires a contig, containing at least 3 ESTs, where at least 0687] AC018755CDS1AC018755mRNA1AW403840, 80% thereof are assigned to a selected tissue. AY040820CDS0, BF3 59557, BF896787, BF898989, BF899932BF900235, BF905509,BI518761,BIT56629,BI 0674) *, ** “HRA' indicates that tissue assignment 822428,BI906477, BI906754, BM550096. BM922784, requires a contig derived from at least two different EST GeneID1007Forward.Gen eID285Forward, ProDGy1006 libraries, originally constructed from a specific tissue. #GO C #GO Acc 0005887 #GO Desc “integral plasma Example 10a membrane protein'#GO F #GO Acc 0007267 #GO Desc 0675 251470 N62228 4 #EST the same #TAA CD “cell-cell signaling'#GO P #GO Acc 0005530 HGO Desc 269 296 #TAA TIS ovary, ifTAA CD 269 296 iTAA TIS “lectin US 2007/0083334 A1 Apr. 12, 2007

Example 11 Example 12 Description of the Sequence Files on the Enclosed Description of the CD-ROM Content CD-ROM 0692 The CD-ROMs enclosed herewith contain the fol 0688. The sequences in the CD-ROM sequence files are lowing files: in FastA text format. Each transcript sequence starts with 0693 CD-ROM1 (1 file): >' mark, followed by the transcript internal accession 0694) 1. “Transcripts nucleotide seqs part1, contain number. The proprietary ProDG EST sequences starts with ing nucleotide sequences of all the transcripts based on >' mark, followed by the internal sequence accession. An genomic production of GenBank version 126. example of the sequence file is presented below. 0695 CD-ROM2 (4 files): Example 11a 0696 1. “Transcripts nucleotide seqs part2', contain ing nucleotide sequences of all the transcripts based on 0689) expressed production of GenBank version 126 (in cases where no genomic data Support was available). >R422780 (SEQ ID NO: 41) 0697 2. “Transcripts nucleotide seqs part3.new, con TGTTTTAGAAATCTCATGATTCCCAGGAAAAAAATTTTAAATTGTGATAC taining nucleotide sequences of all the transcripts based on GenBank versions 124, 125, and transcripts containing AGGTTTGACAGCCTTTTAGTCAAATAAGTTAAAACACACACGCAAACTCA ProDG proprietary sequences. TTTACT CACTTTGCCATTATAATTCAATCACAAAGAAATTTTGGCCAGGC 0.698 3. “Protein. seqs, containing all the amino acid GTGGTGGTTACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCAGGTGG sequences encoded by the transcripts of the invention. 0699 4. “ProDG seqs', containing the proprietary EST ATCACGAGGTCAGGGGATCAAGATCATCCTGGCTAACATGTGAAACCCCG Sequences. TCTCTATTAAAAATAAAAAATTAGCCTGGTGTGGTGGCGGGTGCCTGTAG 0700 CD-ROM3 (1 file): TCCCAGCTACTCGGGAGGCTGAGGCAGCAGAATGGCGTGAACTCAGGAGG 0701 1. “Summary table.new, containing all the anno tation information, as described in Example 10. CGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGATGA 0702 CD-ROM4 (1 file): CAGAGCGAGACTCCATCTCAAAAAAAAAA 0703 1. “Pat 11 US8 Patentin-part1, containing part 1 of the PatentIn format Sequence Listing. Example 11b 0704) CD-ROM5 (1 file): 0690) 0705 1. “Pat 11 US8 Patentin-part2”, containing part 2 of the PatentIn format Sequence Listing. >GeneIDReverse #TY RNA #DE ProDGy sequence #DT 18-JUN-1000 iDR 5 #LN 348 0706 CD-ROM6 (1 file): (SEQ ID NO: 1) 0707 1. “Pat 11 US8 Patentin-part3”, containing part 3 GTGGTTATTACAGCATGGTTCCCAGCCTTACAGTGTCTAAGGCTCTCT of the PatentIn format Sequence Listing. TGTGTCCTGTAGATGTTGTGAAAAAGAAAAAAACAAAAAATACACCACAC Example 13 TGTACTTTTTCCCCCTGCCCCCGTTACTGCCGGTGATTATATAAAAA

TAGTTTTTTTCACATCATTATATCGGCTTCCTATAAACAACAGCCAA In-vitro confirmation of Differentially Expressed Transcripts TTCAGTCAAGACTCCCTTTGGGGAATTCATTTTATTAAAAATTGGTGTCT

GGATACTTCCCTGTACATGCATAAATATGCATGCATGTACAGAAAGACTG Experimental Procedures and Reagents

TATGTGTGTGCCTTGCACACACACCCATACCTCTCAGAAAAAGTGTTT 0708. In-vitro confirmation of in-silico obtained differ entially expressed polynucleotide sequences was effected utilizing laboratorial methodologies, based on nucleotide Example 11c hybridization including northern analysis, RT-PCR and real time PCR. 0691) 0709 RNA preparation Total RNA was isolated from the indicated cell lines or tumor tissues using the Tri >ProDGy1339 #OS Homo sapiens #DE ProDGy sequence Reagent (Molecular Research Center Inc.) following the #DT 26-JUL-2002 #TY EST #DR 5 #AC ProDGy1339 manufacturer's recommendations. Poly(A) RNA was puri #LN 132 fied from total RNA using oligo(dT)s Dynabeads (Dynal). (SEQ ID NO: 2) CAGAAAGCCCAGAGTAGTCCCTGTAAGAAGCTGAGGGGCGCATACCTCTG 0710 Northern blotting —20 g of total RNA or 2 ug of poly(A) RNA were electrophoresed on 1% agarose gels GGGTTTGGGTTCCCTTCAGGGAAGCGAAGGGAGATGACCTCTTTCCAGGC containing formaldehyde, and blotted onto Nytran Super Charge membranes (Shcleicher & Schuell). Hybridization TGGGGACCAAGAGGGCTCCCTAGAAGATATTA. was carried out using a DNA probe (SEQ ID NO: 3) in EZ-Hybridization Solution (Biological Industries, Beit Hae US 2007/0083334 A1 Apr. 12, 2007 54 mek, Israel) at 68°C. for 18 hrs. The membranes were rinsed tured DNA to the nylon membrane was performed overnight twice with 2xSSC, 0.1% SDS at room temperature, followed with 20xSSC. DNA was UV crosslinked (Stratalinker) to a by two washes with 0.1xSSC, 0.1% SDS at 50° C. Autora nylon membrane prior to prehybridization step. Prehybrid diograms were obtained by exposing the membranes to ization was performed using EZ-hybridization Solution X-ray films. (Biological Industries, Cat no: 01-889-1B) at 68° C. for 1 0711 RT-PCR analysis Prior to RT reactions, total hour. The DNA blot was subjected to Southern hybridization RNA was digested with DNase (DNA-freeTM, Ambion) in using specific oligonucleotides end-labeled with adenosine the presence of RNasin. Reverse transcription was carried 5'-y-‘Ptriphosphate (>5000 Ci/mmol, Amersham Bio out on 2 Jug of total RNA, in a 20 Jul reaction, using 2.5 units sciences, Inc.). Hybridization step was effected at 68°C. for of Superscript II Reverse Transcriptase (Bibco/BRL) in the 16 hours. buffer supplied by the manufacturer, with 10 pmol of oli 0716 Following hybridization the membrane was go(dT)s (Promega), and 30 units of Rnasin (Promega). RT washed at gradually increasing stringent conditions: twice in reactions, were standardized by PCR with GAPDH-specific 2xSSC, 0.1% SDS, for 15 min. at room temperature and primers, for 20 cycles. The calibrated reverse transcriptase twice in 0.1XSSC, 0.1% SDS, for 15 min, at 6o° C. Radio samples were then analyzed with gene-specific primers active signal was visualized by autoradiography. either at 35 cycles, or at lower cycles (15 and 20 cycles). PCR products of lower number of cycles were visualized by Example 14 southern blotting, followed by hybridization with the appro priate probe (the same PCR product). Colorectal Cancer Specific Expression of AAS35072 0712 Real-Time RT-PCR Total RNA samples were treated with DnaseI (Ambion) and purified with Rineasy 0717 AA535072 (SEQ ID NO: 39) is a common columns (Qiagen). 2 g of treated RNA samples were added sequence feature to a series of overlapping sequences (SEQ into 20 ul RT-reaction mixture including. RT-PCR end ID NOs: 4, 24-28) with predicted amino acid sequences product 200 units SuperscriptiI (Invitrogen), 40 units RNa provided in SEQ ID NOS: 35-38. sin, and 500 pmol oligo dT. All components were incubated 0718 The indicated tissues and cell lines were examined for 1 hr at 50° C. and then inactivated by incubation for 15 for AA535072 (SEQ ID NOs: 39) expression by RT-PCR min at 70° C. Amplification products were diluted, 1:20, in analysis. Primers for AA535072 were GTGACAGCCAG water. 5 ul of diluted products were used as templates in TAGCTGCCATCTC (SEQ ID NO: 5) and Real-Time PCR reactions using specific primers and the TCCGTTTCTAGCGGCCAGACCTTT (SEQ ID NO: 6). intercalating dye Sybr Green. PCR reactions were denatured at 94° C. for 2 minutes 0713 The amplification stage was effected as follows, followed by 35 cycles at 94° C. for 30 sec, 64° C. for 30 sec 95° C. for 15 sec, 64° C. for 7 sec, 78° C. for 5 sec and 72° and 72° C. for 60 sec. All PCR products were separated on C. for 14 sec. Detection was effected using Roch light cycler an ethidium bromide stained gel. detector. The cycle in which the reactions achieved a thresh 0719. As shown in FIG. 7 amplification yielded a major old level of fluorescence was registered and served to PCR product of 1000 bp. Evidently, AA535072 expression calculate the initial transcript copy number in the RT reac was limited to colorectal cancer tissues; adenocarcinoma, tion. The copy number was calculated using a standard curve colon carcinoma cell line and colon carcinoma Duke A cells. created using serial dilutions of a purified amplicon product. Since colon carcinoma Duke A cells represent an early stage To minimize inherent differences in the RT reaction, the of colon cancer progression, differentially expressed resulting copy number was normalized to the levels of AA535072 can be used as a putative marker of polyps and expression of the housekeeping genes Proteasome 26S Sub benign stages of colon cancer. Furthermore, corresponding unit (GenBank Accession number D78151) or GADPH protein products (SEQ ID NOs: 35-38) may be utilized as (GenBank Accession number: AF261085). important colon cancer specific diagnostic and prognostic 0714 Semiquantitative PCR RT-PCR reaction was per tools. formed with sample specific primers, for 16 cycles. PCR products were used as probes. Labeling procedure was Example 15 carried out using “Random primer DNA labeling mix” according to manufacturers instructions (Cat. No. 20-101 Bone Tumor Ewing's Sarcoma Specific expression 25). Briefly, 25 ng of template DNA were denatured by of AA513157 (SEQ ID NO: 7) heating to 100° C. for 5 minutes, and then chilled on ice for 0720. The indicated tissues and cell lines were examined 5 minutes. Labeling solution contained 11 Jul of denatured for AA513157 (SEQ ID NO: 7) expression by RT-PCR DNA, 4 ul of labeling mix solution (Biological industries), analysis. Primers for SEQ ID NO: 7 were GAAGGCAG 5ul of (p)dCTP (Amersham, Pharmacia, AAO005). Label GCGGATGCTACC (SEQ ID NO: 8) and AGCCTTC ing was effected for 10 minutes in 37° C. Removal of CACGCTGTACACGCCA (SEQID NO: 9). PCR reactions unincorporated nucleotides was effected using Sephadex were denatured at 94° C. for 2 minutes followed by 35 G-50 columns. Prior to hybridization, labeled DNA was cycles at 94° C. for 30 sec, 64° C. for 30 sec and 72° C. for denatured by heating to 100° C. for 5 minutes and then 45 sec. All PCR products were separated on an ethidium rapidly cooled on ice. bromide stained gel. 0715 Southern blotting PCR products were separated 0721. As shown in FIG. 8, amplification reaction yielded on 1.5% agarose gel and size separated. The gel was a specific PCR product of 600 bp. As shown in FIG. 8, in the denatured by two consecutive washes for 20 min in 1x presence of reverse transcriptase (indicated by +) high denaturation buffer, containing 1.5M NaCl, 0.5M NaOH. expression of AA513157 was evident in both samples of Thereafter a neutralization procedure was effected by wash Ewing sarcoma, while only residual expression of ing twice for 20 min in 1x neutralization buffer, containing AA513157 was seen in Ln-Cap cells, brain and splenic 1.5M Nacl, 0.5m Tris/HCL pH=7.0. Blotting of the dena adenocarcinoma. US 2007/0083334 A1 Apr. 12, 2007

0722. To substantiate these, Northern blot analysis of 0730) Real-time PCR analysis indicates that SEQID NO: AA513157 was effected. The following primers were used, 18 is specifically expressed in lung adenocarcinoma samples GAAGGCAGGCTGGATGCTACC (SEQ ID NO: 10), and in lung alveolus cell carcinoma (FIG. 13). GGTAGTATAACCGGGCTCTGT (SEQ ID NO: 11). FIG. 9 illustrates RNA expression of AA513157 in various tis Example 19 sues. Several transcripts were evident upon Northern analy sis: two major transcripts of 800 bp and 1800 bp from ployA RNA preparation and total RNA preparation, respectively. SEQ ID NO: 21—A Lung Cancer Specific Expression of both transcripts was limited to the Ewing Transcript sarcoma cell line. Low expression of the 1800 bp transcript 0731 Real-time quantitative RT-PCR was used to mea was evident in Bone Ewing sarcoma tissue as well. sure the mRNA steady state levels of SEQID NO: 21. The 0723) These results corroborate AA513157 as a putative following primers were used GCTTCGACCGGCTTA Ewing sarcoma marker and a putative pharmaceutical target. GAACT (SEQ ID NO: 22) and GGTGAGCAC GATACGGGC (SEQ ID NO. 23). Example 16 0732 Real-time PCR analysis indicates that SEQID NO: Colorectal Cancer Specific Expression of 21 is specifically expressed in Small lung cell carcinoma and AA469088 in adenocarcinoma (FIG. 14). 0724 AA469088 (SEQ ID NO: 40) is a common Example 20 sequence feature to a series of overlapping sequences (SEQ ID NOs: 12 and 29-31). HSGPGI—A Lung Cancer Specific Transcript 0725. The indicated tissues and cell lines were examined for AA469088 (SEQ ID NO: 40) expression by semi quan 0733 Real-time quantitative RT-PCR was used to mea titative RT-PCR analysis. Primers for AA469088 were sure the mRNA steady state levels of HSGPGI (SEQID NO: CATATTTCACTCTGTTCTCTCACC (SEQ ID NO: 13) 32). The following primers were used GAGCCCTGT and CAGAATGGGATTATGGTAGTCTATCT (SEQ ID GCGCCGCTCAGATGTG (SEQ ID NO: 33) and AGC NO: 14). PCR reactions were effected as follows: 14 cycles CCAAGTTGAATCACCAACCAG (SEQ ID NO:34). at 92°C. for 20 sec, 59° C. for 30 sec and 68°C. for 45 sec. The PCR products were size separated on agarose 1.5% gel, 0734. As shown in FIG. 12, real-time PCR analysis and undergone Southern blot analysis using the PCR prod exhibited specific expression of SEQ ID NO: 32 in lung ucts as specific probe, as described in details in Example 13. adenocarcinoma and lung Squamos cell carcinoma, as com The visualization of the hybridization signal of the PCR pared to the expression in normal lung tissue (2-25 fold). products was performed by autoradiogram exposure to 0735. It is appreciated that certain features of the inven X-ray film. tion, which are, for clarity, described in the context of 0726. As shown in FIG. 10 amplification reaction yielded separate embodiments, may also be provided in combination a major PCR product of 484 bp. Evidently, AA469088 in a single embodiment. Conversely, various features of the expression was limited to colorectal tumor tissues, normal invention, which are, for brevity, described in the context of colon and adenocarcinoma with only minor expression in a single embodiment, may also be provided separately or in the spleen and kidney. any suitable Subcombination. Example 17 0736. Although the invention has been described in con junction with specific embodiments thereof, it is evident that HUMMCDR A Lung Cancer Specific Marker many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is 0727. Real-time quantitative RT-PCR was used to mea intended to embrace all Such alternatives, modifications and sure the mRNA steady state levels of HUMMCDR (SEQID variations that fall within the spirit and broad scope of the NO: 15). The following primers were used CTTCAATTG appended claims. All publications, patents, patent applica GATTATGTTGACCTCTAC (SEQ ID NO: 16) and CAC tions and sequences identified by their accession numbers TATAGGCAACCAGAACAATGTC (SEQ ID NO: 17). mentioned in this specification are herein incorporated in 0728 Real-time PCR analysis (FIG. 11) indicates that their entirety by reference into the specification, to the same SEQ ID NO: 15 is specifically expressed in lung squamous extent as if each individual publication, patent, patent appli cell carcinoma with an evident 2-10 fold higher expression cation or sequence identified by their accession number was than in normal lung samples. specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of Example 18 any reference in this application shall not be construed as an admission that Such reference is available as prior art to the SEQ ID NO: 18 A Lung Cancer Specific present invention. Transcript CD-ROM Content 0729) Real-time quantitative RT-PCR was used to mea sure the mRNA steady state levels of SEQ ID NO: 18. The 0737. The following lists the file content of the six following primers were used GCGAGGACCGGGTATAA CD-ROMs which are enclosed herewith and filed with the GAAGC (SEQ ID NO: 19) and TCGGCTCAGCCAAA application. File information is provided as: File name/bite CACTGTCAG (SEQ ID NO: 20). size/date of creation/operating system/machine format. US 2007/0083334 A1 Apr. 12, 2007 56

LENGTHY TABLES FILED ON CD The patent application contains a lengthy table section. A copy of the table is available in electronic form from the USPTO web site (http://seqdata.uspto.gov/?pageRequest=docDetail&DocID=US20070083334A1). An electronic copy of the table will also be available from the USPTO upon request and payment of the fee set forth in 37 CFR 1.19(b)(3).

CD-ROM1 (1 file): CD-ROM3 (1 file): 1. Transcripts nucleotide seqs part1/594,305.024 bytes/ 1. Summary table/583,723,008 bytes/Sep. 11, 2002/ASCII Sep. 4, 2002/ASCII file/PC. file/PC. CD-ROM2 (4 files): CD-ROM4 (1 file): 1. Pat11 US8 Patentin-part1/626,038,784 bytes/Oct. 24, 1. ProDG seqs/405,504 bytes/Sep. 9, 2002/ASCII file/PC. 2006/text file/PC. 2. Protein. seqs/97,839,104 bytes/Sep. 9, 2002/ASCII file/ CD-ROM5 (1 file): PC. 1. Pat11 US8 Patentin-part2/622,206,976 bytes/Oct. 24, 3. Transcripts nucleotide seqs part2/132,372.480 bytes/ 2006/text file/PC. Sep. 9, 2002/ASCII file/PC. CD-ROM6 (1 file): 4. Transcripts nucleotide seqs part3/27.709,440 bytes/Sep. 1. Pat11 US8 Patentin-part3/604,655,616 bytes/Oct. 24, 11, 2002/ASCII file/PC. 2006/text file/PC.

SEQUENCE LISTING The patent application contains a lengthy “Sequence Listing section. A copy of the “Sequence Listing is available in electronic form from the USPTO web site (http://seqdata.uspto.gov/?pageRequest=docDetail&DocID=US20070083334A1). An electronic copy of the “Sequence Listing will also be available from the USPTO upon request and payment of the fee set forth in 37 CFR 1.19(b)(3).

What is claimed is: 1-98. (canceled) 99. An isolated protein, having a sequence according to the amino acid sequence of SEQ ID NO: 836875. 100. An isolated polynucleotide, having a nucleic acid sequence capable of coding for the amino acid sequence of claim 99. 101. A pharmaceutical composition, comprising the pro tein of claim 99, and a pharmaceutically acceptable carrier.

k k k k k