US007728118B2

(12) United States Patent
     Wood et al.
(10) Patent No.: US 7,728,118 B2
(45) Date of Patent: Jun. 1, 2010

(54) SYNTHETIC NUCLEIC ACID MOLECULE COMPOSITIONS AND METHODS OF PREPARATION

(75) Inventors: Keith V. Wood, Mt. Horeb, WI (US); Monika G. Wood, Mt. Horeb, WI (US); Brian Almond, Fitchburg, WI (US); Aileen Paguio, Madison, WI (US); Frank Fan, Madison, WI (US)

(73) Assignee: Promega Corporation, Madison, WI (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 538 days.

(21) Appl. No.: 10/943,508

(22) Filed: Sep. 17, 2004

(65) Prior Publication Data
     US 2006/0068395 A1   Mar. 30, 2006

(51) Int. Cl.
     C07H 21/02 (2006.01)
     C07H 21/04 (2006.01)
     C12N 15/00 (2006.01)
     C12Q 1/68 (2006.01)
     C12P 19/34 (2006.01)

(52) U.S. Cl.: 536/23.1; 536/23.2; 536/23.4; 536/23.7; 435/320.1

(58) Field of Classification Search: None
     See application file for complete search history.

Primary Examiner: Teresa E Strzelecka

(74) Attorney, Agent, or Firm: Michael Best & Friedrich LLP

(57) ABSTRACT

A method to prepare synthetic nucleic acid molecules having reduced inappropriate or unintended transcriptional characteristics when expressed in a particular host cell.

36 Claims, 2 Drawing Sheets

U.S. Patent    Jun. 1, 2010    Sheet 1 of 2    US 7,728,118 B2

Figure 1

Amino Acid   Codon
Phe          UUU, UUC
Ser          UCU, UCC, UCA, UCG, AGU, AGC
Tyr          UAU, UAC
Cys          UGU, UGC
Leu          UUA, UUG, CUU, CUC, CUA, CUG
Trp          UGG
Pro          CCU, CCC, CCA, CCG
His          CAU, CAC
Arg          CGU, CGC, CGA, CGG, AGA, AGG
Gln          CAA, CAG
Ile          AUU, AUC, AUA
Thr          ACU, ACC, ACA, ACG
Asn          AAU, AAC
Lys          AAA, AAG
Met          AUG
Val          GUU, GUC, GUA, GUG
Ala          GCU, GCC, GCA, GCG
Asp          GAU, GAC
Gly          GGU, GGC, GGA, GGG
Glu          GAA, GAG

U.S. Patent Jun. 1, 2010 Sheet 2 of 2 US 7,728,118 B2

Figure 2 [drawing not reproduced; legible labels: pGL4B-4NN3.1 sequence, SpeI-NcoI-Ver2, MCS-4 sequence, bla-5]

A 10, 2 US 7,728,118 B2 1. 2 SYNTHETIC NUCLECACID MOLECULE result in elevated levels of DNA transcription in the absence COMPOSITIONS AND METHODS OF of a promoter sequence or for the presence of transcription PREPARATION regulatory sequences to increase the basal levels of gene expression in the absence of a promoter sequence. BACKGROUND Thus, what is needed is a method for making synthetic nucleic acid molecules with altered codon usage without also Transcription, the synthesis of an RNA molecule from a introducing inappropriate or unintended transcription regula sequence of DNA is the first step in gene expression. tory sequences for expression in a particular host cell. Sequences which regulate DNA transcription include pro moter sequences, polyadenylation signals, transcription fac 10 SUMMARY OF THE INVENTION torbinding sites and enhancer elements. A promoter is a DNA sequence capable of specific initiation of transcription and The invention provides an isolated nucleic acid molecule (a consists of three general regions. The core promoter is the polynucleotide) comprising a synthetic nucleotide sequence sequence where the RNA polymerase and its cofactors bind to having reduced, for instance, 90% or less, e.g., 80%, 78%, the DNA. Immediately upstream of the core promoter is the 15 75%, or 70% or less, nucleic acid sequence identity relative to proximal promoter which contains several transcription fac a parent nucleic acid sequence, e.g., a wild-type nucleic acid tor binding sites that are responsible for the assembly of an sequence, and having fewer regulatory sequences such as activation complex that in turn recruits the polymerase com transcription regulatory sequences. In one embodiment, the plex. The distal promoter, located further upstream of the synthetic nucleotide sequence has fewer regulatory proximal promoter also contains transcription factor binding sequences than would result if the sequence differences sites. 
Transcription termination and polyadenylation, like between the synthetic nucleotide sequence and the parent transcription initiation, are site specific and encoded by nucleic acid sequence, e.g., optionally the result of differing defined sequences. Enhancers are regulatory regions, con codons, were randomly selected. In one embodiment, the taining multiple transcription factor binding sites, that can synthetic nucleotide sequence encodes a polypeptide that has significantly increase the level of transcription from a respon 25 an amino acid sequence that is at least 85%, 90%. 95%, or sive promoter regardless of the enhancer's orientation and 99%, or 100%, identical to the amino acid sequence of a distance with respect to the promoter as long as the enhancer naturally-occurring (native or wild-type) corresponding and promoter are located within the same DNA molecule. The polypeptide (protein). Thus, it is recognized that Some spe amount of transcript produced from a gene may also be regu cific amino acid changes may also be desirable to alter a lated by a post-transcriptional mechanism, the most impor 30 particular phenotypic characteristic of a polypeptide encoded tant being RNA splicing that removes intervening sequences by the synthetic nucleotide sequence. Preferably, the amino (introns) from a primary transcript between splice donor and acid sequence identity is over at least 100 contiguous amino splice acceptor sequences. acid residues. In one embodiment of the invention, the codons Natural selection is the hypothesis that genotype-environ in the synthetic nucleotide sequence that differ preferably ment interactions occurring at the phenotypic level lead to 35 encode the same amino acids as the corresponding codons in differential reproductive success of individuals and therefore the parent nucleic acid sequence. to modification of the gene pool of a population. 
Some prop Hence, in one embodiment, the invention provides an iso erties of nucleic acid molecules that are acted upon by natural lated nucleic acid molecule comprising a synthetic nucleotide selection include codon usage frequency, RNA secondary sequence having a coding region for a selectable or screen structure, the efficiency of intron splicing, and interactions 40 able polypeptide, wherein the synthetic nucleotide sequence with transcription factors or other nucleic acid binding pro has 90%, e.g., 80%, or less nucleic acid sequence identity to teins. Because of the degenerate nature of the genetic code, a parent nucleic acid sequence encoding a corresponding these properties can be optimized by natural selection without selectable or screenable polypeptide, and wherein the Syn altering the corresponding amino acid sequence. thetic nucleotide sequence encodes a selectable or screenable Under some conditions, it is useful to synthetically alter the 45 polypeptide with at least 85% amino acid sequence identity to natural nucleotide sequence encoding a polypeptide to better the corresponding selectable or screenable polypeptide adapt the polypeptide for alternative applications. A common encoded by the parent nucleic acid sequence. The decreased example is to alter the codon usage frequency of a gene when nucleotide sequence identity may be a result of different it is expressed in a foreign host cell. Although redundancy in codons in the synthetic nucleotide sequence relative to the the genetic code allows amino acids to be encoded by mul 50 codons in the parent nucleic acid sequence. The synthetic tiple codons, different organisms favor Some codons over nucleotide sequence of the invention has a reduced number of others. 
It has been found that the efficiency of protein trans regulatory sequences relative to the parent nucleic acid lation in a non-native host cell can be substantially increased sequence, for example, relative to the average number of by adjusting the codon usage frequency but maintaining the regulatory sequences resulting from random selections of same gene product (U.S. Pat. Nos. 5,096.825, 5,670,356, and 55 codons or nucleotides at the sequences which differ between 5,874,304). the synthetic nucleotide sequence and the parent nucleic acid However, altering codon usage may, in turn, result in the sequence. In one embodiment, a nucleic acid molecule may unintentional introduction into a synthetic nucleic acid mol include a synthetic nucleotide sequence which together with ecule of inappropriate transcription regulatory sequences. other sequences encodes a selectable or screenable polypep This may adversely effect transcription, resulting in anoma 60 tide. For instance, a synthetic nucleotide sequence which lous expression of the synthetic DNA. Anomalous expression forms part of an open reading frame for a selectable or screen is defined as departure from normal or expected levels of able polypeptide may include at least 100, 150, 200,250,300 expression. For example, transcription factor binding sites or more nucleotides of the open reading, which nucleotides located downstream from a promoter have been demonstrated have reduced nucleic acid sequence identity relative to cor to effect promoter activity (Michael et al., 1990; Lamb et al., 65 responding sequences in a parent nucleic acid sequence. In 1998: Johnson et al., 1998; Jones et al., 1997). Additionally, it one embodiment, the parent nucleic acid sequence is SEQID is not uncommon for an enhancer element to exertactivity and NO:1, SEQID NO:6, SEQID NO:15 or SEQID NO:41, the US 7,728,118 B2 3 4 complement thereof, or a sequence that has 90%. 
95% or 99% NO:23, the complement thereof, or a fragment thereof that nucleic acid sequence identity thereto. encodes a polypeptide with Substantially the same activity as In one embodiment, the nucleic acid molecule of the inven the corresponding full-length and optionally wild-type (func tion comprises sequences which have been optimized for tional) polypeptide, e.g., a polypeptide encoded by SEQID expression in mammalian cells, and more preferably, in NO:14 or SEQID NO:43, or a portion thereof which together human cells (see, e.g., WO 02/16944 which discloses meth with other sequences encodes a firefly luciferase. For ods to optimize sequences for expression in a cell of interest). instance, a synthetic nucleotide sequence which forms part of For instance, nucleic acid molecules may be optimized for an open reading frame for a firefly luciferase may include at expression in eukaryotic cells by introducing a Kozak least 100, 150, 200, 250, 300 or more nucleotides of the open sequence and/or one or more introns or decreasing the num 10 reading, which nucleotides have reduced nucleic acid ber of other regulatory sequences, and/or altering codon sequence identity relative to corresponding sequences in a usage to codons employed more frequently in one or more parent nucleic acid sequence. eukaryotic organisms, e.g., codons employed more fre In another embodiment, the invention provides an isolated quently in an eukaryotic host cell to be transformed with the nucleic acid molecule comprising a synthetic nucleotide nucleic acid molecule. 15 sequence which does not include an open reading frame In one embodiment, the synthetic nucleotide sequence is encoding a peptide or polypeptide of interest, e.g., the Syn present in a vector, e.g., a plasmid, and Such a vector may thetic nucleotide sequence may have an open reading frame include other optimized sequences. 
In one embodiment, the but it does not include sequences that encode a functional or synthetic nucleotide sequence encodes a polypeptide com desirable peptide or polypeptide, but may include one or more prising a selectable polypeptide, which synthetic nucleotide stop codons in one or more reading frames, one or more sequence has at least 90% or more nucleic acid sequence poly(A) adenylation sites, and/or a contiguous sequence for identity to an open reading frame in a sequence comprising, two or more restriction endonucleases (restriction enzymes), for example, SEQID NO:5, SEQID NO:9, SEQID NO:10, i.e., a multiple cloning region (also referred to as a multiple SEQ ID NO:11, SEQ ID NO:30, SEQ ID NO:38, SEQ ID cloning site, “MCS”), and which is generally at least 20, e.g., NO:39, SEQ ID NO:42, SEQ ID NO:44, SEQ ID NO:70, 25 at least 30, nucleotides in length and up to 1000 or more SEQ ID NO:71, SEQ ID NO:72, SEQ ID NO:73, SEQ ID nucleotides, e.g., up to 10,000 nucleotides, which synthetic NO:74, SEQ ID NO:80, SEQ ID NO:81, SEQ ID NO:82, nucleotide sequence has fewer regulatory sequences Such as SEQID NO:83, SEQID NO:84, the complement thereof, or transcription regulatory sequences relative to a corresponding a fragment thereof that encodes a polypeptide with Substan parent nucleic acid sequence. 
In one embodiment, the Syn tially the same activity as the corresponding full-length and 30 thetic nucleotide sequence which does not encode a peptide optionally wild-type (functional) polypeptide, e.g., a or polypeptide has 90% or less, e.g., 80%, or less nucleic acid polypeptide encoded by SEQID NO:1, SEQID NO:6, SEQ sequence identity to a parent nucleic acid sequence, wherein ID NO:15 or SEQ ID NO:41, or a portion thereof which the decreased sequence identity is a result of a reduced num together with other parent or wild-type sequences encodes a ber of regulatory sequences in the synthetic nucleotide polypeptide with Substantially the same activity as the corre 35 sequence relative to the parent nucleic acid sequence. sponding full-length and optionally wild-type polypeptide. The regulatory sequences which are reduced in the Syn As used herein, “substantially the same activity” is at least thetic nucleotide sequence include, but are not limited to, any about 70%, e.g., 80%, 90% or more, the activity of a corre combination of transcription factor binding sequences, intron sponding full-length and optionally wild-type (functional) splice sites, poly(A) adenylation sites (poly(A) sequences or polypeptide. In one embodiment, an isolated nucleic acid 40 poly(A) sites hereinafter), enhancer sequences, promoter molecule encodes a fusion polypeptide comprising a select modules, and/or promoter sequences, e.g., prokaryotic pro able polypeptide. moter sequences. Generally, a synthetic nucleic acid mol Also provided is an isolated nucleic acid molecule com ecule lacks at least 10%. 
20%, 50% or more of the regulatory prising a synthetic nucleotide sequence having a coding sequences, for instance lacks Substantially all of the regula region for a firefly luciferase, wherein the nucleic acid 45 tory sequences, e.g., 80%, 90% or more, for instance, 95% or sequence identity of the synthetic nucleic acid molecule is more, of the regulatory sequences, present in a corresponding 90% or less, e.g., 80%, 78%, 75% or less, compared to a parent or wild-type nucleotide sequence. Regulatory parent nucleic acid sequence encoding a firefly luciferase, sequences, e.g., transcription regulatory sequences, are well e.g., a parent nucleic acid sequence having SEQID NO:14 or known in the art. The synthetic nucleotide sequence may also SEQ ID NO:43, which synthetic nucleotide sequence has 50 have a reduced number of restriction enzyme recognition fewer regulatory sequences including transcription regula sites, and may be modified to include selected sequences, e.g., tory sequences than would result if the sequence differences, sequences at or near the 5' and/or 3' ends of the synthetic e.g., differing codons, were randomly selected. Preferably, nucleotide sequence Such as Kozak sequences and/or desir the synthetic nucleotide sequence encodes a polypeptide that able restriction enzyme recognition sites, for instance, restric has an amino acid sequence that is at least 85%, preferably 55 tion enzyme recognition sites useful to introduce a synthetic 90%, and most preferably 95% or 99% identical to the amino nucleotide sequence to a specified location, e.g., in a multiple acid sequence of a naturally-occurring or parent polypeptide. cloning region 5' and/or 3' to a nucleic acid sequence of Thus, it is recognized that Some specific amino acid changes interest. 
may be desirable to alter a particular phenotypic characteris In one embodiment, the synthetic nucleotide sequence of tic of the luciferase encoded by the synthetic nucleotide 60 the invention has a codon composition that differs from that of sequence. Preferably, the amino acid sequence identity is over the parent or wild-type nucleic acid sequence. Preferred at least 100 contiguous amino acid residues. In one embodi codons for use in the invention are those which are employed ment, the synthetic nucleotide sequence encodes a polypep more frequently than at least one other codon for the same tide comprising a firefly luciferase, which synthetic nucle amino acid in a particular organism and/or those that are not otide sequence has at least 90% or more nucleic acid sequence 65 low-usage codons in that organism and/or those that are not identity to an open reading frame in a sequence comprising, low-usage codons in the organism used to clone or screen for for example, SEQ ID NO:21, SEQ ID NO:22, SEQ ID the expression of the synthetic nucleotide sequence (for US 7,728,118 B2 5 6 example, E. coli). Moreover, codons for certain amino acids Substitutions such as those resulting in a silent nucleotide (i.e., those amino acids that have three or more codons), may Substitution (encodes the same amino acid) and/or decreased include two or more codons that are employed more fre number of regulatory sequences. Under some circumstances quently than the other (non-preferred) codon(s). The presence (e.g., to permit removal of a transcription factor binding site) of codons in a synthetic nucleotide sequence that are 5 it may be desirable to replace a non-preferred codon with a employed more frequently in one organism than in another codon other than a preferred codon or a codon other than the organism results in a synthetic nucleotide sequence which, preferred codon in order to decrease the number of regulatory when introduced into the cells of the organism that employs Sequences. 
those codons more frequently, has a reduced risk of aberrant The invention also provides an expression cassette or vec expression and/or is expressed in those cells at a level that 10 tor. The expression cassette or vector of the invention com may be greater than the expression of the wild type (unmodi prises a synthetic nucleotide sequence of the invention opera fied) nucleic acid sequence in those cells under some condi tively linked to a promoter that is functional in a cell or tions. For example, a synthetic nucleic acid molecule of the comprises a synthetic nucleotide sequence, respectively. Pre invention which encodes a selectable or screenable polypep ferred promoters are those functional in mammaliancells and tide may be expressed at a level that is greater, e.g., at least 15 those functional in plant cells. Optionally, the expression about 2, 3, 4, 5, 10-fold or more relative to that of the parent cassette may include other sequences, e.g., one or more or wild-type (unmodified) nucleic acid sequence in a cell or restriction enzyme recognition sequences 5' and/or 3' to an cell extract under identical conditions (such as cell culture open reading frame for a selectable polypeptide or luciferase conditions, vector backbone, and the like). In one embodi and/or a Kozak sequence, and be a part of a larger polynucle ment, the synthetic nucleotide sequence of the invention has 20 otide molecule Such as a plasmid, cosmid, artificial chromo a codon composition that differs from that of the parent or Some or vector, e.g., a viral vector, which may include a wild-type nucleic acid sequence at more than 10%, 20% or multiple cloning region for other sequences, e.g., promoters, more, e.g., 30%, 35%, 40% or more than 45%, e.g., 50%, enhancers, other open reading frames and/or poly(A) sites. In 55%, 60% or more of the codons. 
one embodiment, a vector of the invention includes SEQID In one embodiment of the invention, the codons that are 25 NO:88, SEQ ID NO:89, SEQ ID NO:90, the complement different are those employed more frequently in a mammal, thereof, or a sequence which has at least 80% nucleic acid while in another embodiment the codons that are different are sequence identity thereto and encodes a selectable and/or those employed more frequently in a plant. A particular type screenable polypeptide. of mammal, e.g., human, may have a different set of preferred In one embodiment, the synthetic nucleotide sequence codons than another type of mammal. Likewise, a particular 30 encoding a selectable or screenable polypeptide is introduced type of plant may have a different set of preferred codons than into a vector backbone, e.g., one which optionally has a another type of plant. In one embodiment of the invention, the poly(A) site 3' to the synthetic nucleotide sequence, a gene majority of the codons which differ are ones that are preferred useful for selecting transformed prokaryotic cells which codons in a desired host cell and/or are not low usage codons optionally is a synthetic sequence, a gene useful for selecting in a particular host cell. Preferred codons for mammals (e.g., 35 transformed eukaryotic cells which optionally is a synthetic humans) and plants are known to the art (e.g., Wada et al., sequence, a noncoding region for decreasing transcription 1990). 
For example, preferred human codons include, but are and/or translation into adjacent linked desirable open reading not limited to, CGC (Arg), CTG (Leu), AGC (Ser), ACC frames, and/or a multiple cloning region 5' and/or 3' to the (Thr), CCC (Pro), GCC (Ala), GGC (Gly), GTG (Val), ACT synthetic nucleotide sequence encoding a selectable or (Ile), AAG (Lys), AAC (Asn), CAG (Gln), CAC (His), GAG 40 screenable polypeptide which optionally includes one or (Glu), GAC (Asp), TAC (Tyr), TGC (Cys) and TTC (Phe) more protein destabilization sequences (see U.S. application (Wada et al., 1990). Thus, synthetic nucleotide sequences of Ser. No. 10/664,341, filed Sep. 16, 2003, the disclosure of the invention have a codon composition which differs from a which is incorporated by reference herein). In one embodi wild type nucleic acid sequence by having an increased num ment, the vector having a synthetic nucleotide sequence ber of preferred human codons, e.g. CGC, CTG, TCT, AGC, 45 encoding a selectable or screenable polypeptide may lack a ACC, CCC, GCC, GGC, GTG, ACT. AAG, AAC, CAG, promoter and/or enhancer which is operably linked to that CAC, GAG, GAC, TAC, TGC, TTC, or any combination synthetic sequence. In another embodiment, the invention thereof. For example, the synthetic nucleotide sequence of the provides a vector comprising a promoter, e.g., a prokaryotic invention may have an increased number of AGC serine or eukaryotic promoter, operably linked to a synthetic nucle encoding codons, CCC proline-encoding codons, and/or 50 otide sequence encoding a selectable or screenable polypep ACC threonine-encoding codons, or any combination tide. Such vectors optionally include one or more multiple thereof, relative to the parent or wild-type nucleic acid cloning regions, such as ones that are useful to introduce an sequence. 
Similarly, synthetic nucleotide sequences having additional open reading frame and/or a promoter for expres an increased number of codons that are employed more fre sion of the open reading frame which promoter optionally is quently in plants, have a codon composition which differs 55 different than the promoter for the selectable or screenable from a wild-type nucleic acid sequence by having an polypeptide, and/or a prokaryotic origin of replication. A increased number of the plant codons including, but not lim “vector backbone' as used herein may include sequences ited to, CGC (Arg), CTT (Leu), TCT (Ser), TCC (Ser), ACC (open reading frames) useful to identify cells with those (Thr), CCA (Pro), CCT (Pro), GCT (Ser), GGA (Gly), GTG sequences, e.g., in prokaryotic cells, their promoters, an ori (Val), ATC (Ile), ATT (Ile), AAG (Lys), AAC (Asn), CAA 60 gin of replication for vector maintenance, e.g., in prokaryotic (Gln), CAC (His), GAG (Glu), GAC (Asp), TAC (Tyr), TGC cells, and optionally one or more other sequences including (CyS), TTC (Phe), or any combination thereof (Murray et al., multiple cloning regions e.g., for insertion of a promoter 1989). Preferred codons may differ for different types of and/or open reading frame of interest, and sequences which plants (Wada et al., 1990). inhibit transcription and/or translation. 
The nucleotide substitutions in the synthetic nucleic acid 65 Also provided is a host cell comprising the synthetic nucle sequence may be influenced by many factors such as, for otide sequence of the invention, an isolated polypeptide (e.g., example, the desire to have an increased number of nucleotide a fusion polypeptide encoded by the synthetic nucleotide US 7,728,118 B2 7 8 sequence of the invention), and compositions and kits com in a synthetic nucleotide sequence which encodes a selectable prising the synthetic nucleotide sequence of the invention, a or screenable polypeptide is altered to reflect that of the host polypeptide encoded thereby, or an expression cassette or organism desired for expression of that nucleotide sequence vector comprising the synthetic nucleotide sequence in Suit while also decreasing the number of potential regulatory able container means and, optionally, instruction means. The sequences relative to the parent nucleic acid molecule. host cell may be an eukaryotic cell Such as a plant or verte Also provided is a method to prepare a synthetic nucleic brate cell, e.g., a mammaliancell, including but not limited to acid molecule which does not code for a peptide or polypep a human, non-human primate, canine, feline, bovine, equine, tide. The method includes altering the nucleotides in a parent ovine or rodent (e.g., rabbit, rat, ferret, hamster, or mouse) nucleic acid sequence having at least 20 nucleotides which cell or a prokaryotic cell. 10 optionally does not code for a functional or desirable peptide The invention also provides a method to prepare a synthetic or polypeptide and which optionally may include sequences nucleotide sequence of the invention by genetically altering a which inhibit transcription and/or translation, to yield a syn parent, e.g., a wild-type or synthetic, nucleic acid sequence. 
thetic nucleotide sequence which does not include an open The method comprises altering (e.g., decreasing or eliminat reading frame encoding a peptide or polypeptide of interest, ing) a plurality of regulatory sequences in a parent nucleic 15 e.g., the synthetic nucleotide sequence may have an open acid sequence, e.g., one which encodes a selectable or screen reading frame but it does not include sequences that encode a able polypeptide or one which does not encode a peptide or functional or desirable peptide or polypeptide, but may polypeptide, to yield a synthetic nucleotide sequence which include one or more stop codons in one or more reading has a decreased number of regulatory sequences and, if the frames, one or more poly(A) adenylation sites, and/or a con synthetic nucleotide sequence encodes a polypeptide, it pref tiguous sequence for two or more restriction endonucleases, erably encodes the same amino acids as the parent nucleic i.e., a multiple cloning region. The synthetic nucleotide acid molecule. The transcription regulatory sequences which sequence is generally at least 20, e.g., at least 30, nucleotides are reduced include but are not limited to any of transcription in length and up to 1000 or more nucleotides, e.g., up to factor binding sequences, intron splice sites, poly(A) sites, 10,000 nucleotides, and has fewer regulatory sequences such enhancer sequences, promoter modules, and/or promoter 25 as transcription regulatory sequences relative to a corre sequences. Preferably, the alteration of sequences in the Syn sponding parent nucleic acid sequence which does not code thetic nucleotide sequence does not result in an increase in for a peptide or polypeptide, e.g., a parent nucleic acid regulatory sequences. In one embodiment, the synthetic sequence which optionally includes sequences which inhibit nucleotide sequence encodes a polypeptide that has at least transcription and/or translation. 
The nucleotides are altered to 85%, 90%, 95% or 99%, or 100%, contiguous amino acid 30 reduce one or more regulatory sequences, e.g., transcription sequence identity to the amino acid sequence of the polypep factor binding sequences, intron splice sites, poly(A) sites, tide encoded by the parent nucleic acid sequence. enhancer sequences, promoter modules, and/or promoter Thus, in one embodiment, a method to prepare a synthetic sequences, in the parent nucleic acid sequence. nucleic acid molecule comprising an open reading frame is provided. The method includes altering the codons and/or 35 The invention also provides a method to prepare an expres regulatory sequences in a parent nucleic acid sequence which sion vector. The method includes providing a linearized plas encodes a reporter protein such, as a firefly luciferase or a mid having a nucleic molecule including a synthetic nucle selectable polypeptide Such as one encoding resistance to otide sequence of the invention which encodes a selectable or ampicillin, puromycin, hygromycin or neomycin, to yield a screenable polypeptide which is flanked at the 5' and/or 3' end synthetic nucleotide sequence which encodes a correspond 40 by a multiple cloning region. The plasmid is linearized by ing reporter polypeptide and which has for instance at least contacting the plasmid with at least one restriction endonu 10% or more, e.g., 20%, 30%, 40%, 50% or more, fewer clease which cleaves in the multiple cloning region. The regulatory sequences relative to the parent nucleic acid linearized plasmid and an expression cassette having ends sequence. The synthetic nucleotide sequence has 90%, e.g., compatible with the ends in the linearized plasmid are 85%. 80%, or 78%, or less nucleic acid sequence identity to 45 annealed, yielding an expression vector. 
In one embodiment, the parent nucleic acid sequence and encodes a polypeptide the plasmid is linearized by cleavage by at least two restric with at least 85% amino acid sequence identity to the tion endonucleases, only one of which cleaves in the multiple polypeptide encoded by the parent nucleic acid sequence. The cloning region. regulatory sequences which are altered include transcription Also provided is a method to clone a promoter or open factor binding sequences, intron splice sites, poly(A) sites, 50 reading frame. The method includes comprising providing a promoter modules, and/or promoter sequences. In one linearized plasmid having a multiple cloning region and a embodiment, the synthetic nucleic acid sequence hybridizes synthetic sequence of the invention which encodes a select under medium stringency hybridization but not stringent con able or screenable polypeptide and/or a synthetic sequence of ditions to the parent nucleic acid sequence or the complement the invention which does not encode a peptide or polypeptide, thereof. In one embodiment, the codons which differ encode 55 which is plasmid is linearized by contacting the plasmid with the same amino acids as the corresponding codons in the at least two restriction endonucleases at least one of which parent nucleic acid sequence. cleaves in the multiple cloning region; and annealing the Also provided is a synthetic (including a further synthetic) linearized plasmid with DNA having a promoter or an open nucleotide sequence prepared by the methods of the inven reading frame with ends compatible with the ends of the tion, e.g., a further synthetic nucleotide sequence in which 60 linearized plasmid. introduced regulatory sequences or restriction endonuclease Exemplary methods to prepare synthetic sequences for recognition sequences are optionally removed. 
Thus, the firefly luciferase and a number of selectable polypeptide method of the invention may be employed to alter the codon nucleic acid sequences, as well as non-coding regions present usage frequency and/or decrease the number of regulatory in a vector backbone, are described hereinbelow. For sequences in any open reading frame or to decrease the num 65 instance, the methods may produce synthetic selectable ber of regulatory sequences in any nucleic acid sequence, e.g., polypeptide nucleic acid molecules which exhibit similar or a noncoding sequence. Preferably, the codon usage frequency significantly enhanced levels of mammalian expression with US 7,728,118 B2 10 out negatively effecting other desirable physical or biochemi molecule, discrete elements are referred to as being cal properties and which were also largely devoid of regula “upstream” or 5' of the “downstream” or 3' elements. This tory elements. terminology reflects the fact that transcription proceeds in a 5' Clearly, the present invention has applications with many to 3' fashion along the DNA strand. Typically, promoter and genes and across many fields of Science including, but not enhancer elements that direct transcription of a linked gene limited to, life Science research, agrigenetics, genetic therapy, (e.g., open reading frame or coding region) are generally developmental science and pharmaceutical development. located 5' or upstream of the coding region. However, enhancer elements can exert their effect even when located 3' BRIEF DESCRIPTION OF THE FIGURES of the promoter element and the coding region. Transcription 10 termination and polyadenylation signals are located 3' or FIG.1. Codons and their corresponding amino acids. downstream of the coding region. FIG. 2. Design scheme for the pGL4 vector. 
DETAILED DESCRIPTION OF THE INVENTION

Definitions

The term "nucleic acid molecule" or "nucleic acid sequence" as used herein, refers to nucleic acid, DNA or RNA, that comprises noncoding or coding sequences. Coding sequences are necessary for the production of a polypeptide or protein precursor. The polypeptide can be encoded by a full-length coding sequence or by any portion of the coding sequence, as long as the desired protein activity is retained. Noncoding sequences refer to nucleic acids which do not code for a polypeptide or protein precursor, and may include regulatory elements such as transcription factor binding sites, poly(A) sites, restriction endonuclease sites, stop codons and/or promoter sequences.

The term "codon" as used herein, is a basic genetic coding unit, consisting of a sequence of three nucleotides that specify a particular amino acid to be incorporated into a polypeptide chain, or a start or stop signal. The term "coding region" when used in reference to structural genes refers to the nucleotide sequences that encode the amino acids found in the nascent polypeptide as a result of translation of a mRNA molecule. Typically, the coding region is bounded on the 5' side by the nucleotide triplet "ATG" which encodes the initiator methionine and on the 3' side by a stop codon (e.g., TAA, TAG, TGA). In some cases the coding region is also known to initiate by a nucleotide triplet "TTG".

By "protein", "polypeptide" or "peptide" is meant any chain of amino acids, regardless of length or post-translational modification (e.g., glycosylation or phosphorylation). The nucleic acid molecules of the invention may also encode a variant of a naturally-occurring protein or a fragment thereof.
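As a rough illustration of the codon and coding-region definitions above, the following sketch scans a DNA string for the first start codon and then reads triplets in that frame until it meets a stop codon (TAA, TAG or TGA). The function names `find_coding_region` and `codons`, and the example sequence, are hypothetical and are not taken from the patent; the alternative "TTG" initiator is supported only through an optional parameter. It is a minimal sketch under those assumptions, not a gene-finding method.

```python
# Hypothetical helper names; illustrative only.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_coding_region(dna, start_codons=("ATG",)):
    """Return (start, end) of the first start-to-stop reading frame, or None.

    `end` is exclusive and includes the stop codon. Only the given strand
    is scanned; the complementary strand is ignored in this sketch.
    """
    dna = dna.upper()
    for i in range(len(dna) - 2):
        if dna[i:i + 3] in start_codons:
            # Walk codon by codon in the frame fixed by the start codon.
            for j in range(i + 3, len(dna) - 2, 3):
                if dna[j:j + 3] in STOP_CODONS:
                    return i, j + 3
    return None


def codons(dna, start, end):
    """Split dna[start:end] into consecutive three-nucleotide codons."""
    dna = dna.upper()
    return [dna[k:k + 3] for k in range(start, end, 3)]


example = "ccATGGCTTGGTAAgg"  # made-up sequence for illustration
region = find_coding_region(example)
print(region)
print(codons(example, *region))
```

Passing `start_codons=("ATG", "TTG")` would also accept the less common "TTG" initiator mentioned in the definition.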
Preferably, such a variant protein has an amino acid sequence that is at least 85%, preferably 90%, and most preferably 95% or 99% identical to the amino acid sequence of the naturally-occurring (native or wild-type) protein from which it is derived.

A "synthetic" nucleic acid sequence is one which is not found in nature, i.e., it has been derived using molecular biological, chemical and/or informatic techniques.

A "nucleic acid", as used herein, is a covalently linked sequence of nucleotides in which the 3' position of the pentose of one nucleotide is joined by a phosphodiester group to the 5' position of the pentose of the next, and in which the nucleotide residues (bases) are linked in specific sequence, i.e., a linear order of nucleotides. A "polynucleotide", as used herein, is a nucleic acid containing a sequence that is greater than about 100 nucleotides in length. An "oligonucleotide" or "primer", as used herein, is a short polynucleotide or a portion of a polynucleotide. An oligonucleotide typically contains a sequence of about two to about one hundred bases. The word "oligo" is sometimes used in place of the word "oligonucleotide".

Polypeptide molecules are said to have an "amino terminus" (N-terminus) and a "carboxy terminus" (C-terminus) because peptide linkages occur between the backbone amino group of a first amino acid residue and the backbone carboxyl group of a second amino acid residue. The terms "N-terminal" and "C-terminal" in reference to polypeptide sequences refer to regions of polypeptides including portions of the N-terminal and C-terminal regions of the polypeptide, respectively. A sequence that includes a portion of the N-terminal region of a polypeptide includes amino acids predominantly from the N-terminal half of the polypeptide chain, but is not limited to such sequences.
For example, an N-terminal sequence may include an interior portion of the polypeptide sequence including bases from both the N-terminal and C-terminal halves of the polypeptide. The same applies to C-terminal regions. N-terminal and C-terminal regions may, but need not, include the amino acid defining the ultimate N-terminus and C-terminus of the polypeptide, respectively.

Nucleic acid molecules are said to have a "5'-terminus" (5' end) and a "3'-terminus" (3' end) because nucleic acid phosphodiester linkages occur to the 5' carbon and 3' carbon of the pentose ring of the substituent mononucleotides. The end of a polynucleotide at which a new linkage would be to a 5' carbon is its 5' terminal nucleotide. The end of a polynucleotide at which a new linkage would be to a 3' carbon is its 3' terminal nucleotide. A terminal nucleotide, as used herein, is the nucleotide at the end position of the 3'- or 5'-terminus.

DNA molecules are said to have "5' ends" and "3' ends" because mononucleotides are reacted to make oligonucleotides in a manner such that the 5' phosphate of one mononucleotide pentose ring is attached to the 3' oxygen of its neighbor in one direction via a phosphodiester linkage. Therefore, an end of an oligonucleotide is referred to as the "5' end" if its 5' phosphate is not linked to the 3' oxygen of a mononucleotide pentose ring and as the "3' end" if its 3' oxygen is not linked to a 5' phosphate of a subsequent mononucleotide pentose ring.

As used herein, a nucleic acid sequence, even if internal to a larger oligonucleotide or polynucleotide, also may be said to have 5' and 3' ends. In either a linear or circular DNA

The term "wild-type" as used herein, refers to a gene or gene product that has the characteristics of that gene or gene product isolated from a naturally occurring source. A wild-type gene is that which is most frequently observed in a population and is thus arbitrarily designated the "wild-type" form of the gene. In contrast, the term "mutant" refers to a gene or gene product that displays modifications in sequence and/or functional properties (i.e., altered characteristics) when compared to the wild-type gene or gene product. It is noted that naturally-occurring mutants can be isolated; these are identified by the fact that they have altered characteristics when compared to the wild-type gene or gene product.

The term "recombinant protein" or "recombinant polypeptide" as used herein refers to a protein molecule expressed from a recombinant DNA molecule. In contrast, the term "native protein" is used herein to indicate a protein isolated from a naturally occurring (i.e., a nonrecombinant) source. Molecular biological techniques may be used to produce a recombinant form of a protein with identical properties as compared to the native form of the protein.

The term "fusion polypeptide" refers to a chimeric protein containing a protein of interest (e.g., luciferase) joined to a heterologous sequence (e.g., a non-luciferase amino acid or protein).

The terms "cell," "cell line," "host cell" as used herein, are used interchangeably, and all such designations include progeny or potential progeny of these designations. By "transformed cell" is meant a cell into which (or into an ancestor of which) has been introduced a nucleic acid molecule of the invention, e.g., via transient transfection. Optionally, a nucleic acid molecule synthetic gene of the invention may be introduced into a suitable cell line so as to create a stably transfected cell line capable of producing the protein or polypeptide encoded by the synthetic gene. Vectors, cells, and methods for constructing such cell lines are well known in the art. The words "transformants" or "transformed cells" include the primary transformed cells derived from the originally transformed cell without regard to the number of transfers. All progeny may not be precisely identical in DNA content, due to deliberate or inadvertent mutations. Nonetheless, mutant progeny that have the same functionality as screened for in the originally transformed cell are included in the definition of transformants.

Nucleic acids are known to contain different types of mutations. A "point mutation" refers to an alteration in the sequence of a nucleotide at a single base position from the wild type sequence. Mutations may also refer to insertion or deletion of one or more bases, so that the nucleic acid sequence differs from the wild-type sequence.

The term "homology" refers to a degree of complementarity between two or more sequences. There may be partial homology or complete homology (i.e., identity). Homology is often measured using sequence analysis software (e.g., EMBOSS, the European Molecular Biology Open Software Suite URL is available at www.hgmp.mrc.ac.uk/Software/EMBOSS/overview/html). Such software matches similar sequences by assigning degrees of homology to various substitutions, deletions, insertions, and other modifications. Conservative substitutions typically include substitutions within the following groups: glycine, alanine; valine, isoleucine, leucine; aspartic acid, glutamic acid, asparagine, glutamine; serine, threonine; lysine, arginine; and phenylalanine, tyrosine.

The term "isolated" when used in relation to a nucleic acid, as in "isolated oligonucleotide" or "isolated polynucleotide" refers to a nucleic acid sequence that is identified and separated from at least one contaminant with which it is ordinarily associated in its source. Thus, an isolated nucleic acid is present in a form or setting that is different from that in which it is found in nature. In contrast, non-isolated nucleic acids (e.g., DNA and RNA) are found in the state they exist in nature. For example, a given DNA sequence (e.g., a gene) is found on the host cell chromosome in proximity to neighboring genes; RNA sequences (e.g., a specific mRNA sequence encoding a specific protein), are found in the cell as a mixture with numerous other mRNAs that encode a multitude of proteins. However, isolated nucleic acid includes, by way of example, such nucleic acid in cells ordinarily expressing that nucleic acid where the nucleic acid is in a chromosomal location different from that of natural cells, or is otherwise flanked by a different nucleic acid sequence than that found in nature. The isolated nucleic acid or oligonucleotide may be present in single-stranded or double-stranded form. When an isolated nucleic acid or oligonucleotide is to be utilized to express a protein, the oligonucleotide contains at a minimum, the sense or coding strand (i.e., the oligonucleotide may be single-stranded), but may contain both the sense and anti-sense strands (i.e., the oligonucleotide may be double-stranded).

The term "isolated" when used in relation to a polypeptide, as in "isolated protein" or "isolated polypeptide" refers to a polypeptide that is identified and separated from at least one contaminant with which it is ordinarily associated in its source. Thus, an isolated polypeptide is present in a form or setting that is different from that in which it is found in nature. In contrast, non-isolated polypeptides (e.g., proteins and enzymes) are found in the state they exist in nature.

The term "purified" or "to purify" means the result of any process that removes some of a contaminant from the component of interest, such as a protein or nucleic acid. The percent of a purified component is thereby increased in the sample.

The term "operably linked" as used herein refers to the linkage of nucleic acid sequences in such a manner that a nucleic acid molecule capable of directing the transcription of a given gene and/or the synthesis of a desired protein molecule is produced. The term also refers to the linkage of sequences encoding amino acids in such a manner that a functional (e.g., enzymatically active, capable of binding to a binding partner, capable of inhibiting, etc.) protein or polypeptide is produced.

The term "recombinant DNA molecule" means a hybrid DNA sequence comprising at least two nucleotide sequences not normally found together in nature.

The term "vector" is used in reference to nucleic acid molecules into which fragments of DNA may be inserted or cloned and can be used to transfer DNA segment(s) into a cell and capable of replication in a cell. Vectors may be derived from plasmids, bacteriophages, viruses, cosmids, and the like.

The terms "recombinant vector" and "expression vector" as used herein refer to DNA or RNA sequences containing a desired coding sequence and appropriate DNA or RNA sequences necessary for the expression of the operably linked coding sequence in a particular host organism. Prokaryotic expression vectors include a promoter, a ribosome binding site, an origin of replication for autonomous replication in a host cell and possibly other sequences, e.g. an optional operator sequence, optional restriction enzyme sites. A promoter is defined as a DNA sequence that directs RNA polymerase to bind to DNA and to initiate RNA synthesis. Eukaryotic expression vectors include a promoter, optionally a polyadenylation signal and optionally an enhancer sequence.

A polynucleotide having a nucleotide sequence encoding a protein or polypeptide means a nucleic acid sequence comprising the coding region of a gene, or in other words the nucleic acid sequence encodes a gene product. The coding region may be present in either a cDNA, genomic DNA or RNA form. When present in a DNA form, the oligonucleotide may be single-stranded (i.e., the sense strand) or double-stranded. Suitable control elements such as enhancers/promoters, splice junctions, polyadenylation signals, etc. may be placed in close proximity to the coding region of the gene if needed to permit proper initiation of transcription and/or correct processing of the primary RNA transcript. Alternatively, the coding region utilized in the expression vectors of the present invention may contain endogenous enhancers/promoters, splice junctions, intervening sequences, polyadenylation signals, etc. In further embodiments, the coding region may contain a combination of both endogenous and exogenous control elements.

The term "regulatory element" or "regulatory sequence" refers to a genetic element or sequence that controls some aspect of the expression of nucleic acid sequence(s). For example, a promoter is a regulatory element that facilitates the initiation of transcription of an operably linked coding region. Other regulatory elements include, but are not limited to, transcription factor binding sites, splicing signals, polyadenylation signals, termination signals and enhancer elements.

Transcriptional control signals in eukaryotes comprise "promoter" and "enhancer" elements. Promoters and enhancers consist of short arrays of DNA sequences that interact specifically with cellular proteins involved in transcription. Promoter and enhancer elements have been isolated from a variety of eukaryotic sources including genes in yeast, insect and mammalian cells. Promoter and enhancer elements have also been isolated from viruses and analogous control elements, such as promoters, are also found in prokaryotes. The selection of a particular promoter and enhancer depends on the cell type used to express the protein of interest. Some eukaryotic promoters and enhancers have a broad host range while others are functional in a limited subset of cell types. For example, the SV40 early gene enhancer is very active in a wide variety of cell types from many mammalian species and has been widely used for the expression of proteins in mammalian cells. Two other examples of promoter/enhancer elements active in a broad range of mammalian cell types are those from the human elongation factor 1 gene (Uetsuki et al., 1989; Kim et al., 1990; and Mizushima and Nagata, 1990) and the long terminal repeats of the Rous sarcoma virus (Gorman et al., 1982); and the human cytomegalovirus (Boshart et al., 1985).

The term "promoter/enhancer" denotes a segment of DNA containing sequences capable of providing both promoter and enhancer functions (i.e., the functions provided by a promoter element and an enhancer element as described above). For example, the long terminal repeats of retroviruses contain both promoter and enhancer functions. The enhancer/promoter may be "endogenous" or "exogenous" or "heterologous." An "endogenous" enhancer/promoter is one that is naturally linked with a given gene in the genome. An "exogenous" or "heterologous" enhancer/promoter is one that is placed in juxtaposition to a gene by means of genetic manipulation (i.e., molecular biological techniques) such that transcription of the gene is directed by the linked enhancer/promoter.

The presence of "splicing signals" on an expression vector often results in higher levels of expression of the recombinant transcript in eukaryotic host cells. Splicing signals mediate the removal of introns from the primary RNA transcript and consist of a splice donor and acceptor site (Sambrook et al., 1989). A commonly used splice donor and acceptor site is the splice junction from the 16S RNA of SV40.

Efficient expression of recombinant DNA sequences in eukaryotic cells requires expression of signals directing the efficient termination and polyadenylation of the resulting transcript. Transcription termination signals are generally found downstream of the polyadenylation signal and are a few hundred nucleotides in length. The term "poly(A) site" or "poly(A) sequence" as used herein denotes a DNA sequence which directs both the termination and polyadenylation of the nascent RNA transcript. Efficient polyadenylation of the recombinant transcript is desirable, as transcripts lacking a poly(A) tail are unstable and are rapidly degraded. The poly(A) signal utilized in an expression vector may be "heterologous" or "endogenous." An endogenous poly(A) signal is one that is found naturally at the 3' end of the coding region of a given gene in the genome. A heterologous poly(A) signal is one which has been isolated from one gene and positioned 3' to another gene. A commonly used heterologous poly(A) signal is the SV40 poly(A) signal. The SV40 poly(A) signal is contained on a 237 bp BamHI/Bcl I restriction fragment and directs both termination and polyadenylation (Sambrook et al., 1989).

Eukaryotic expression vectors may also contain "viral replicons" or "viral origins of replication." Viral replicons are viral DNA sequences which allow for the extrachromosomal replication of a vector in a host cell expressing the appropriate replication factors. Vectors containing either the SV40 or polyoma virus origin of replication replicate to high copy number (up to 10^4 copies/cell) in cells that express the appropriate viral T antigen. In contrast, vectors containing the replicons from bovine papillomavirus or Epstein-Barr virus replicate extrachromosomally at low copy number (about 100 copies/cell).

The term "in vitro" refers to an artificial environment and to processes or reactions that occur within an artificial environment. In vitro environments include, but are not limited to, test tubes and cell lysates. The term "in vivo" refers to the natural environment (e.g., an animal or a cell) and to processes or reactions that occur within a natural environment.

The term "expression system" refers to any assay or system for determining (e.g., detecting) the expression of a gene of interest. Those skilled in the field of molecular biology will understand that any of a wide variety of expression systems may be used. A wide range of suitable mammalian cells are available from a wide range of sources (e.g., the American Type Culture Collection, Rockland, Md.). The method of transformation or transfection and the choice of expression vehicle will depend on the host system selected. Transformation and transfection methods are described, e.g., in Ausubel et al., 1992. Expression systems include in vitro gene expression assays where a gene of interest (e.g., a reporter gene) is linked to a regulatory sequence and the expression of the gene is monitored following treatment with an agent that inhibits or induces expression of the gene. Detection of gene expression can be through any suitable means including, but not limited to, detection of expressed mRNA or protein (e.g., a detectable product of a reporter gene) or through a detectable change in the phenotype of a cell expressing the gene of interest. Expression systems may also comprise assays where a cleavage event or other nucleic acid or cellular change is detected.

All amino acid residues identified herein are in the natural L-configuration. In keeping with standard polypeptide nomenclature, abbreviations for amino acid residues are as shown in the following Table of Correspondence.

TABLE OF CORRESPONDENCE

1-Letter   3-Letter   AMINO ACID
Y          Tyr        L-tyrosine
G          Gly        L-glycine
F          Phe        L-phenylalanine
M          Met        L-methionine
A          Ala        L-alanine
S          Ser        L-serine
I          Ile        L-isoleucine
L          Leu        L-leucine
T          Thr        L-threonine
V          Val        L-valine
P          Pro        L-proline
K          Lys        L-lysine
H          His        L-histidine
Q          Gln        L-glutamine
E          Glu        L-glutamic acid
W          Trp        L-tryptophan
R          Arg        L-arginine
D          Asp        L-aspartic acid
N          Asn        L-asparagine
C          Cys        L-cysteine

The terms "complementary" or "complementarity" are used in reference to a sequence of nucleotides related by the base-pairing rules. For example, the sequence 5' "A-G-T" 3' is complementary to the sequence 3' "T-C-A" 5'. Complementarity may be "partial," in which only some of the nucleic acids' bases are matched according to the base pairing rules. Or, there may be "complete" or "total" complementarity between the nucleic acids. The degree of complementarity between nucleic acid strands has significant effects on the efficiency and strength of hybridization between nucleic acid strands. This is of particular importance in amplification reactions, as well as detection methods which depend upon hybridization of nucleic acids.

When used in reference to a double-stranded nucleic acid sequence such as a cDNA or a genomic clone, the term "substantially homologous" refers to any probe which can hybridize to either or both strands of the double-stranded nucleic acid sequence under conditions of low stringency as described herein.

"Probe" refers to an oligonucleotide designed to be sufficiently complementary to a sequence in a denatured nucleic acid to be probed (in relation to its length) and is bound under selected stringency conditions.

"Hybridization" and "binding" in the context of probes and denatured nucleic acids are used interchangeably. Probes that are hybridized or bound to denatured nucleic acids are base paired to complementary sequences in the polynucleotide. Whether or not a particular probe remains base paired with the polynucleotide depends on the degree of complementarity, the length of the probe, and the stringency of the binding conditions. The higher the stringency, the higher must be the degree of complementarity and/or the longer the probe.

The term "hybridization" is used in reference to the pairing of complementary nucleic acid strands. Hybridization and the strength of hybridization (i.e., the strength of the association between nucleic acid strands) is impacted by many factors well known in the art including the degree of complementarity between the nucleic acids, stringency of the conditions involved such as the concentration of salts, the Tm (melting temperature) of the formed hybrid, the presence of other components (e.g., the presence or absence of polyethylene glycol), the molarity of the hybridizing strands and the G:C content of the nucleic acid strands.

The term "stringency" is used in reference to the conditions of temperature, ionic strength, and the presence of other compounds, under which nucleic acid hybridizations are conducted. With "high stringency" conditions, nucleic acid base pairing will occur only between nucleic acid fragments that have a high frequency of complementary base sequences. Thus, conditions of "medium" or "low" stringency are often required when it is desired that nucleic acids that are not completely complementary to one another be hybridized or annealed together. The art knows well that numerous equivalent conditions can be employed to comprise medium or low stringency conditions. The choice of hybridization conditions is generally evident to one skilled in the art and is usually guided by the purpose of the hybridization, the type of hybridization (DNA-DNA or DNA-RNA), and the level of desired relatedness between the sequences (e.g., Sambrook et al., 1989; Nucleic Acid Hybridization, A Practical Approach, IRL Press, Washington D.C., 1985, for a general discussion of the methods).

The stability of nucleic acid duplexes is known to decrease with increasing numbers of mismatched bases, and further to be decreased to a greater or lesser degree depending on the relative positions of mismatches in the hybrid duplexes. Thus, the stringency of hybridization can be used to maximize or minimize stability of such duplexes. Hybridization stringency can be altered by: adjusting the temperature of hybridization; adjusting the percentage of helix destabilizing agents, such as formamide, in the hybridization mix; and adjusting the temperature and/or salt concentration of the wash solutions. For filter hybridizations, the final stringency of hybridizations often is determined by the salt concentration and/or temperature used for the post-hybridization washes.

"High stringency conditions" when used in reference to nucleic acid hybridization comprise conditions equivalent to binding or hybridization at 42° C. in a solution consisting of 5xSSPE (43.8 g/l NaCl, 6.9 g/l NaH2PO4·H2O and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 0.5% SDS, 5xDenhardt's reagent and 100 µg/ml denatured salmon sperm DNA followed by washing in a solution comprising 0.1xSSPE, 1.0% SDS at 42° C. when a probe of about 500 nucleotides in length is employed.

"Medium stringency conditions" when used in reference to nucleic acid hybridization comprise conditions equivalent to binding or hybridization at 42° C. in a solution consisting of 5xSSPE (43.8 g/l NaCl, 6.9 g/l NaH2PO4·H2O and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 0.5% SDS, 5xDenhardt's reagent and 100 µg/ml denatured salmon sperm DNA followed by washing in a solution comprising 1.0xSSPE, 1.0% SDS at 42° C. when a probe of about 500 nucleotides in length is employed.

"Low stringency conditions" comprise conditions equivalent to binding or hybridization at 42° C. in a solution consisting of 5xSSPE (43.8 g/l NaCl, 6.9 g/l NaH2PO4·H2O and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 0.1% SDS, 5xDenhardt's reagent [50xDenhardt's contains per 500 ml: 5 g Ficoll (Type 400, Pharmacia), 5 g BSA (Fraction V; Sigma)] and 100 µg/ml denatured salmon sperm DNA followed by washing in a solution comprising 5xSSPE, 0.1% SDS at 42° C. when a probe of about 500 nucleotides in length is employed.

The term "Tm" is used in reference to the "melting temperature". The melting temperature is the temperature at which 50% of a population of double-stranded nucleic acid molecules becomes dissociated into single strands. The equation for calculating the Tm of nucleic acids is well-known in the art. The Tm of a hybrid nucleic acid is often estimated using a formula adopted from hybridization assays in 1 M salt, and commonly used for calculating Tm for PCR primers: [(number of A+T) x 2° C. + (number of G+C) x 4° C.]. (C. R. Newton et al., PCR, 2nd Ed., Springer-Verlag (New York, 1997), p. 24). This formula was found to be inaccurate for primers longer than 20 nucleotides. (Id.) Another simple estimate of the Tm value may be calculated by the equation: Tm = 81.5 + 0.41(% G+C), when a nucleic acid is in aqueous solution at 1 M NaCl. (e.g., Anderson and Young, Quantitative Filter Hybridization, in Nucleic Acid Hybridization, 1985). Other more sophisticated computations exist in the art which take structural as well as sequence characteristics into account for the calculation of Tm. A calculated Tm is merely an estimate; the optimum temperature is commonly determined empirically.

The term "promoter/enhancer" denotes a segment of DNA containing sequences capable of providing both promoter and enhancer functions (i.e., the functions provided by a promoter element and an enhancer element as described above). For example, the long terminal repeats of retroviruses contain both promoter and enhancer functions. The enhancer/promoter may be "endogenous" or "exogenous" or "heterologous." An "endogenous" enhancer/promoter is one that is naturally linked with a given gene in the genome. An "exogenous" or "heterologous" enhancer/promoter is one that is placed in juxtaposition to a gene by means of genetic manipulation (i.e., molecular biological techniques) such that transcription of the gene is directed by the linked enhancer/promoter.

length, frequently at least 25 nucleotides in length, and often at least 50 or 100 nucleotides in length. Since two polynucleotides may each (1) comprise a sequence (i.e., a portion of the complete polynucleotide sequence) that is similar between the two polynucleotides, and (2) may further comprise a sequence that is divergent between the two polynucleotides, sequence comparisons between two (or more) polynucleotides are typically performed by comparing sequences of the two polynucleotides over a "comparison window" to identify and compare local regions of sequence similarity.

A "comparison window", as used herein, refers to a conceptual segment of at least 20 contiguous nucleotides and wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) of 20 percent or less as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences.

Methods of alignment of sequences for comparison are well known in the art. Thus, the determination of percent identity between any two sequences can be accomplished using a mathematical algorithm.
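A minimal sketch of how percent identity might be computed once two sequences have already been aligned, for instance by one of the alignment methods referred to above. The function name `percent_identity`, the use of "-" as the gap character, and the example strings are assumptions made for illustration; the sketch only counts identical positions over the aligned length and does not perform any alignment itself.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned positions at which both sequences carry the same base.

    Both inputs must already be aligned and of equal length. Positions where
    either sequence has a gap count toward the length but never as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# Made-up 20-base example with a single mismatch at the last position.
print(percent_identity("AGTCAGTCAGTCAGTCAGTC", "AGTCAGTCAGTCAGTCAGTA"))
```

How gapped positions are treated is a design choice; here they are kept in the denominator so that an insertion or deletion lowers the reported identity.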
Preferred, non-limiting examples of such mathematical algorithms are the algorithm of Myers and Miller (1988); the local homology algorithm of Smith and Waterman (1981); the homology alignment algorithm of Needleman and Wunsch (1970); the search-for-similarity-method of Pearson and Lipman (1988); the algorithm of Karlin and Altschul (1990), modified as in Karlin and Altschul (1993).

Computer implementations of these mathematical algorithms can be utilized for comparison of sequences to determine sequence identity. Such implementations include, but are not limited to: ClustalW (see the URL available at www.ebi.ac.uk/clustalw/); the ALIGN program (Version 2.0) and GAP, BESTFIT, BLAST, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Version 8. Alignments using these programs can be performed using the default parameters. The CLUSTAL program is well described by Higgins et al. (1988); Higgins et al. (1989); Corpet et al. (1988); Huang et al. (1992); and Pearson et al. (1994). The ALIGN program is based on the algorithm of Myers and Miller, supra. The BLAST programs of Altschul et al. (1990), are based on the algorithm of Karlin and Altschul supra. To obtain gapped alignments for comparison purposes, Gapped BLAST (in BLAST 2.0) can be utilized as described in Altschul et al. (1997). Alternatively, PSI-BLAST (in BLAST 2.0) can be used to perform an iterated search that detects distant relationships between molecules. See Altschul et al., supra. When utilizing BLAST, Gapped BLAST, PSI-BLAST, the default parameters of the respective programs (e.g. BLASTN for nucleotide sequences, BLASTX for proteins) can be used. See the URL at www.ncbi.nlm.nih.gov. Alignment may also be performed manually by inspection.

The term "sequence identity" means that two polynucleotide sequences are identical (i.e., on a nucleotide-by-nucleotide basis) over the window of comparison. The term "percentage of sequence identity" means that two polynucleotide sequences are identical (i.e., on a nucleotide-by-nucleotide basis) for the stated proportion of nucleotides over the window of comparison. The term "percentage of sequence identity" is calculated by comparing two optimally aligned sequences over the window of comparison, determining the number of positions at which the identical nucleic acid base (e.g., A, T, C, G, U, or I) occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison (i.e., the window size), and multiplying the result by 100 to yield the percentage of sequence identity.

The term "sequence homology" means the proportion of base matches between two nucleic acid sequences or the proportion of amino acid matches between two amino acid sequences. When sequence homology is expressed as a percentage, e.g., 50%, the percentage denotes the proportion of matches over the length of sequence from one sequence that is compared to some other sequence. Gaps (in either of the two sequences) are permitted to maximize matching; gap lengths of 15 bases or less are usually used, 6 bases or less are preferred with 2 bases or less more preferred. When using oligonucleotides as probes or treatments, the sequence homology between the target nucleic acid and the oligonucleotide sequence is generally not less than 17 target base matches out of 20 possible oligonucleotide matches (85%); preferably not less than 9 matches out of 10 possible base pair matches (90%), and more preferably not less than 19 matches out of 20 possible base pair matches (95%).

Two amino acid sequences are homologous if there is a partial or complete identity between their sequences. For example, 85% homology means that 85% of the amino acids are identical when the two sequences are aligned for maximum matching. Gaps (in either of the two sequences being matched) are allowed in maximizing matching; gap lengths of 5 or less are preferred with 2 or less being more preferred. Alternatively and preferably, two protein sequences (or polypeptide sequences derived from them of at least 100 amino acids in length) are homologous, as this term is used herein, if they have an alignment score of more than 5 (in standard deviation units) using the program ALIGN with the mutation data matrix and a gap penalty of 6 or greater. See Dayhoff, M. O., in Atlas of Protein Sequence and Structure, 1972, volume 5, National Biomedical Research Foundation, pp. 101-110, and Supplement 2 to this volume, pp. 1-10. The two sequences or parts thereof are more preferably homologous if their amino acids are greater than or equal to 85% identical when optimally aligned using the ALIGN program.

The following terms are used to describe the sequence relationships between two or more polynucleotides: "reference sequence", "comparison window", "sequence identity", "percentage of sequence identity", and "substantial identity". A "reference sequence" is a defined sequence used as a basis for a sequence comparison; a reference sequence may be a subset of a larger sequence, for example, as a segment of a full-length cDNA or gene sequence given in a sequence listing, or may comprise a complete cDNA or gene sequence. Generally, a reference sequence is at least 20 nucleotides in

phenotype. Because of the redundant nature of the genetic code, these other attributes can be optimized by natural selection without altering the corresponding amino acid sequence.

The terms "substantial identity" as used herein denote a characteristic of a polynucleotide sequence, wherein the polynucleotide comprises a sequence that has at least 60%, pref
erably at least 65%, more preferably at least 70%, up to about Under some conditions, it is useful to synthetically alter the 85%, and even more preferably at least 90 to 95%, more natural nucleotide sequence encoding a protein to better adapt usually at least 99%, Sequence identity as compared to a the protein for alternative applications. A common example is reference sequence over a comparison window of at least 20 to alter the codon usage frequency of a gene when it is nucleotide positions, frequently over a window of at least expressed in a foreign host. Although redundancy in the 20-50 nucleotides, and preferably at least 300 nucleotides, genetic code allows amino acids to be encoded by multiple wherein the percentage of sequence identity is calculated by 10 codons, different organisms favor Some codons over others. comparing the reference sequence to the polynucleotide The codon usage frequencies tend to differ most for organ sequence which may include deletions or additions which isms with widely separated evolutionary-histories. It has been total 20 percent or less of the reference sequence over the found that when transferring genes between evolutionarily window of comparison. The reference sequence may be a distant organisms, the efficiency of protein translation can be Subset of a larger sequence. 15 Substantially increased by adjusting the codon usage fre As applied to polypeptides, the term “substantial identity” quency (see U.S. Pat. Nos. 5,096,825, 5,670,356 and 5,874, means that two peptide sequences, when optimally aligned, 304). such as by the programs GAP or BESTFIT using default gap In one embodiment, the sequence of a reporter gene is weights, share at least about 85% sequence identity, prefer modified as the codon usage of reporter genes often does not ably at least about 90% sequence identity, more preferably at correspond to the optimal codon usage of the experimental least about 95% sequence identity, and most preferably at cells. 
In another embodiment, the sequence of a reporter gene least about 99% sequence identity. is modified to remove regulatory sequences such as those which may alter expression of the reporter gene or a linked Synthetic Nucleotide Sequences and Methods of the Inven gene. Examples include B-galactosidase (B-gal) and chloram tion 25 phenicol acetyltransferase (cat) reporter genes that are The invention provides compositions comprising synthetic derived from E. coli and are commonly used in mammalian nucleotide sequences, as well as methods for preparing those cells; the B-glucuronidase (gus) reporter gene that is derived sequences which yield synthetic nucleotide sequences that from E. coli and commonly used in plant cells; the firefly are efficiently expressed as a polypeptide or protein with luciferase (luc) reporter gene that is derived from an insect desirable characteristics including reduced inappropriate or 30 and commonly used in plant and mammalian cells; and the unintended transcription characteristics, or do not result in Renilla luciferase, and green fluorescent protein (gfp) inappropriate or unintended transcription characteristics, reporter genes which are derived from coelenterates and are when present in a particular cell type. commonly used in plant and mammalian cells. To achieve Natural selection is the hypothesis that genotype-environ sensitive quantitation of reporter gene expression, the activity ment interactions occurring at the phenotypic level lead to 35 of the gene product must not be endogenous to the experi differential reproductive success of individuals and hence to mental host cells. Thus, reporter genes are usually selected modification of the gene pool of a population. It is generally from organisms having unique and distinctive phenotypes. accepted that the amino acid sequence of a protein found in Consequently, these organisms often have widely separated nature has undergone optimization by natural selection. 
How evolutionary histories from the experimental host cells. ever, amino acids exist within the sequence of a protein that 40 Previously, to create genes having a more optimal codon do not contribute significantly to the activity of the protein usage frequency but still encoding the same gene product, a and these amino acids can be changed to other amino acids synthetic nucleic acid sequence was made by replacing exist with little or no consequence. Furthermore, a protein may be ing codons with codons that were generally more favorable to useful outside its natural environment or for purposes that the experimental host cell (see U.S. Pat. Nos. 5,096.825, differ from the conditions of its natural selection. In these 45 5,670,356 and 5,874,304.) The result was a net improvement circumstances, the amino acid sequence can be synthetically in codon usage frequency of the synthetic gene. However, the altered to better adapt the protein for its utility in various optimization of other attributes was not considered and so applications. these synthetic genes likely did not reflect genes optimized by Likewise, the nucleic acid sequence that encodes a protein natural selection. is also optimized by natural selection. The relationship 50 In particular, improvements in codon usage frequency are between coding DNA and its transcribed RNA is such that intended only for optimization of a RNA sequence based on any change to the DNA affects the resulting RNA. Thus, its role in translation into a protein. Thus, previously natural selection works on both molecules simultaneously. described methods did not address how the sequence of a However, this relationship does not exist between nucleic synthetic gene affects the role of DNA in transcription into acids and proteins. Because multiple codons encode the same 55 RNA. 
Most notably, consideration had not been given as to amino acid, many different nucleotide sequences can encode how transcription factors may interact with the synthetic an identical protein. A specific protein composed of 500 DNA and consequently modulate or otherwise influence gene amino acids can theoretically be encoded by more than 10' transcription. For genes found in nature, the DNA would be different nucleic acid sequences. optimally transcribed by the native host cell and would yield Natural selection acts on nucleic acids to achieve proper 60 an RNA that encodes a properly folded gene product. In encoding of the corresponding protein. Presumably, other contrast, synthetic genes have previously not been optimized properties of nucleic acid molecules are also acted upon by for transcriptional characteristics. Rather, this property has natural selection. These properties include codon usage fre been ignored or left to chance. quency, RNA secondary structure, the efficiency of intron This concern is important for all genes, but particularly splicing, and interactions with transcription factors or other 65 important for reporter genes, which are most commonly used nucleic acid binding proteins. These other properties may to quantitate transcriptional behavior in the experimental host alter the efficiency of protein translation and the resulting cells, and vector backbone sequences for genes. Hundreds of US 7,728,118 B2 21 22 transcription factors have been identified in different cell sites and/or vector backbone sequences with a reduced occur types under different physiological conditions, and likely rence of regulatory sequences. The invention also provides a more exist but have not yet been identified. All of these method of preparing synthetic genes containing improved transcription factors can influence the transcription of an codon usage frequencies with a reduced occurrence of tran introduced gene or sequences linked thereto. 
A useful Syn Scription factor binding sites and additional beneficial-struc thetic reporter gene or vector backbone of the invention has a tural attributes. Such additional attributes include the absence minimal risk of influencing or perturbing intrinsic transcrip of inappropriate RNA splicing junctions, poly(A) addition tional characteristics of the host cell because the structure of signals, undesirable restriction enzyme recognition sites, that gene or vector backbone has been altered. A particularly ribosomal binding sites, and/or secondary structural motifs useful synthetic reporter gene or vector backbone will have 10 Such as hairpin loops. desirable characteristics under a new set and/or a wide variety of experimental conditions. To best achieve these character In one embodiment, a parent nucleic acid sequence encod istics, the structure of the synthetic gene or synthetic vector ing a polypeptide is optimized for expression in a particular backbone should have minimal potential for interacting with cell. For example, the nucleic acid sequence is optimized by transcription factors within a broad range of host cells and 15 replacing codons in the wild-type sequence with codons physiological conditions. Minimizing potential interactions which are preferentially employed in a particular (selected) between a reporter gene or vector backbone and a host cells cell, which codon replacement also reduces the number of endogenous transcription factors increases the value of a regulatory sequences. 
Preferred codons have a relatively high reporter gene or vector backbone by reducing the risk of codon usage frequency in a selected cell, and preferably their inappropriate transcriptional characteristics of the gene or introduction results in the introduction of relatively few regu vector backbone within a particular experiment, increasing latory sequences such as transcription factor binding sites, applicability of the gene or vector backbone in various envi and relatively few other undesirable structural attributes. ronments, and increasing the acceptance of the resulting Thus, the optimized nucleotide sequence may have an experimental data. improved level of expression due to improved codon usage In contrast, a reporter gene comprising a native nucleotide 25 frequency, and a reduced risk of inappropriate transcriptional sequence, based on a genomic or cDNA clone from the origi behavior due to a reduced number of undesirable transcrip nal host organism, or a vector backbone comprising native tion regulatory sequences. In another embodiment, a parent sequences found in one or a variety of different organisms, vector backbone sequence is altered to remove regulatory may interact with transcription factors when present in an sequences and optionally restriction endonuclease sites, and exogenous host. This risk stems from two circumstances. 30 optionally retain or add other desirable characteristics, e.g., First, the native nucleotide sequence contains sequences that the presence of one or more stop codons in one or more were optimized through natural selection to influence gene reading frames, one or more poly(A) sites, and/or restriction transcription within the native host organism. However, these endonuclease sites. 
sequences might also influence transcription when the The invention may be employed with any nucleic acid sequences are present in exogenous hosts, i.e., out of context, 35 sequence, e.g., a native sequence Such as a cDNA or one that thus interfering with its performance as a reporter gene or has been manipulated in vitro. Exemplary genes include, but vector backbone. Second, the nucleotide sequence may inad are not limited to, those encoding lactamase (B-gal), neomy vertently interact with transcription factors that were not cin resistance (Neo), hygromycin resistance (Hyg), puromy present in the native host organism, and thus did not partici cin resistance (Puro), amplicillin resistance (Amp), CAT, pate in its natural selection. The probability of such inadvert 40 GUS. galactopyranoside, GFP, Xylosidase, thymidine kinase, ent interactions increases with greater evolutionary separa arabinosidase, luciferase and the like. As used herein, a tion between the experimental cells and the native organism “reporter gene' is a gene that imparts a distinct phenotype to of the reporter gene or vector backbone. cells expressing the gene and thus permits cells having the These potential interactions with transcription factors gene to be distinguished from cells that do not have the gene. would likely be disrupted when using a synthetic reporter 45 Such genes may encode either a selectable or screenable gene having alterations in codon usage frequency. 
However, a polypeptide, depending on whether the marker confers a trait synthetic reporter gene sequence, designed by choosing which one can select for by chemical means, i.e., through codons based only on codon usage frequency, or randomly the use of a selective agent (e.g., a herbicide, antibiotic, or the replacing sequences or randomly juxtaposing sequences in a like), or whether it is simply a “reporter trait that one can vector backbone, is likely to contain other unintended tran 50 identify through observation or testing, i.e., by screening. Scription factor binding sites since the resulting sequence has Included within the terms selectable or screenable marker not been subjected to the benefit of natural selection to correct genes are also genes which encode a 'secretable marker” inappropriate transcriptional activities. Inadvertent interac whose secretion can be detected as a means of identifying or tions with transcription factors could also occur wheneveran selecting for transformed cells. Examples include markers encoded amino acid sequence is artificially altered, e.g., to 55 that encode a secretable antigen that can be identified by introduce amino acid Substitutions. Similarly, these changes antibody interaction, or even-secretable enzymes which can have not been subjected to natural selection, and thus may be detected by their catalytic activity. Secretable proteins fall exhibit undesired characteristics. into a number of classes, including Small, diffusible proteins Thus, the invention provides a method for preparing syn detectable, e.g., by ELISA, and proteins that are inserted or thetic nucleotide sequences that reduce the risk of undesirable 60 trapped in the cell membrane. 
interactions of the nucleotide sequence with transcription Elements of the present disclosure are exemplified in detail factors and other trans-acting factors when expressed in a through the use of particular genes and vector backbone particular host cell, thereby reducing inappropriate or unin sequences. Of course, many examples of Suitable genes and tended characteristics. Preferably, the method yields syn vector backbones are known to the art and can be employed in thetic genes containing improved codon usage frequencies 65 the practice of the invention. Therefore, it will be understood for a particular host cell and with a reduced occurrence of that the following discussion is exemplary rather than exhaus regulatory sequences such as transcription factor binding tive. In light of the techniques disclosed herein and the gen US 7,728,118 B2 23 24 eral recombinant techniques that are known in the art, the (DNASTAR, see the URL available at www.dnastar.com), present invention renders possible the alteration of any gene Vector NTITM (Invitrogen, see the URL available at www.in or vector backbone sequence. vitrogen.com), and Sequence Manipulation Suite (see the Exemplary genes include, but are not limited to, a neogene, URL available at www.bioinformatics.org/SMS/index.html). a purogene, an amp gene, a f-gal gene, agus gene, a cat gene, 5 Links to other databases and sequence analysis software are agpt gene, a hyggene, a hisD gene, a ble gene, a mprt gene, listed at see the URL available at www.expasy.org/alinks.h- a bar gene, a nitrilase gene, a mutant acetolactate synthase tml. After one or more sequences are identified, the modifi gene (ALS) or acetoacid synthase gene (AAS), a methotrex cation(s) may be introduced. 
Once a desired synthetic nucle ate-resistant dhfr gene, a dalapon dehalogenase gene, a otide sequence is obtained, it can be prepared by methods mutated anthranilate synthase gene that confers resistance to 10 well known to the art (such as nucleic acid amplification 5-methyl tryptophan (WO 97/26366), an R-locus gene, a reactions with overlapping primers), and its structural and B-lactamase gene, a XylE gene, an O-amylase gene, a tyrosi functional properties compared to the target nucleic acid nase gene, a luciferase (luc) gene (e.g., a Renilla reniformis sequence, including, but not limited to, percent homology, luciferase gene, a firefly luciferase gene, or a click beetle presence or absence of certain sequences, for example, luciferase (Pyrophorus plagiophthalamus gene), an aequorin 15 restriction sites, percent of codons changed (such as an gene, or a fluorescent protein gene. increased or decreased usage of certain codons) and/or The method of the invention can be performed by, although expression rates. it is not limited to, a recursive process. The process includes As described below, the method was used to create syn assigning preferred codons to each amino acid in a target thetic reporter genes encoding firefly luciferases and select molecule, e.g., a native nucleotide sequence, based on codon 20 able polypeptides, and synthetic sequences for vector back usage in a particular species, identifying potential transcrip bones. Synthetic sequences may support greater levels of tion regulatory sequences such as transcription factor binding expression and/or reduced aberrant expression than the cor sites in the nucleic acid sequence having preferred codons, responding native or parent sequences for the protein. 
The e.g., using a database of Such binding sites, optionally iden native and parent sequences may demonstrate anomalous tifying other undesirable sequences, and Substituting an alter- 25 transcription characteristics when expressed in mammalian native codon (i.e., encoding the same amino acid) at positions cells, which are likely not evident in the synthetic sequences. where undesirable transcription factor binding sites or other sequences occur. For codon distinct versions, alternative pre Exemplary Uses of the Synthetic Nucleotide Sequences ferred codons are substituted in each version. If necessary, the The synthetic genes of the invention preferably encode the identification and elimination of potential transcription factor 30 same proteins as their native counterpart (or nearly so), but or other undesirable sequences can be repeated until a nucle have improved codon usage while being largely devoid of otide sequence is achieved containing a maximum number of regulatory elements in the coding (it is recognized that a small preferred codons and a minimum number of undesired number of amino acid changes may be desired to enhance a sequences including transcription regulatory sequences or property of the native counterpart protein, e.g. to enhance other undesirable sequences. Also, optionally, desired 35 luminescence of a luciferase) and noncoding regions. This sequences, e.g., restriction enzyme recognition sites, can be increases the level of expression of the protein the synthetic introduced. After a synthetic nucleotide sequence is designed gene encodes and reduces the risk of anomalous expression of and constructed, its properties relative to the parent nucleic the protein. For example, studies of many important events of acid sequence can be determined by methods well known to gene regulation, which may be mediated by weak promoters, the art. 
For example, the expression of the synthetic and target 40 are limited by insufficient reporter signals from inadequate nucleic acids in a series of vectors in a particular cell can be expression of the reporter proteins. Also, the use of some compared. selectable markers may be limited by the expression of that Thus, generally, the method of the invention comprises marker in an exogenous cell. Thus, synthetic selectable identifying a target nucleic acid sequence, and a host cell of marker genes which have improved codon usage for that cell, interest, for example, a plant (dicot or monocot), fungus, 45 and have a decrease in other undesirable sequences, (e.g., yeast or mammaliancell. Preferred host cells are mammalian transcription factor binding sites), can permit the use of those host cells such as CHO, COS, 293, Hela, CV-1 and NIH3T3 markers in cells that otherwise were undesirable as hosts for cells. Based on preferred codon usage in the host cell(s) and, those markers. optionally, low codon usage in the host cell(s), e.g., high Promoter crosstalk is another concern when a co-reporter usage mammalian codons and low usage E. coli and mam- 50 gene is used to normalize transfection efficiencies. With the malian codons, codons to be replaced are determined. Con enhanced expression of synthetic genes, the amount of DNA current, Subsequent or prior to selecting codons to be containing strong promoters can be reduced, or DNA con replaced, desired and undesired sequences, such as undesired taining weaker promoters can be employed, to drive the transcriptional regulatory sequences, in the target sequence expression of the co-reporter. In addition, there may be a are identified. These sequences, including transcriptional 55 reduction in the background expression from the synthetic regulatory sequences and restriction endonuclease sites, can reporter genes of the invention. 
This characteristic makes be identified using databases and software such as TRANS synthetic reporter genes more desirable by minimizing the FAC(R) (Transcription Factor Database, see the URL available sporadic expression from the genes and reducing the interfer at www.gene-regulation.com), MatchTM (see the URL avail ence resulting from other regulatory pathways. able at www.gene-regulation.com), Matinspector (Genom- 60 The use of reporter genes in imaging systems, which can be atix, see the URL available at www.genomatix.de), EPD (Eu used for in vivo biological studies or drug screening, is karyotic Promoter Database, see the URL available at another use for the synthetic genes of the invention. Due to www.epd.isb-sib.ch), REBASE(R) (Restriction Enzyme Data their increased level of expression, the protein encoded by a base, NEB, see the URL available at rebase.neb.com), TESS synthetic gene is more readily detectable by an imaging sys (Transcription Element Search System, see the URL avail- 65 tem. In fact, using a synthetic Renilla luciferase gene, lumi able at www.cbil.upenn.edu/tess/). MAR-Wiz (Futuresoft, nescence in transfected CHO cells was detected visually see the URL available at www.futuresoft.org), Lasergene(R) without the aid of instrumentation. US 7,728,118 B2 25 26 In addition, the synthetic genes may be used to express Notall design criteria could be met equally well at the same fusion proteins, for example fusions with secretion leader time. The following priority was established for reduction of sequences or cellular localization sequences, to study tran transcriptional regulatory sites: elimination of transcription Scription in difficult-to-transfect cells Such as primary cells, factor (TF) binding sites received the highest priority, fol and/or to improve the analysis of regulatory pathways and lowed by elimination of splice sites and poly(A) sites, and genetic elements. 
Other uses include, but are not limited to, finally prokaryotic regulatory sites. When removing regula the detection of rare events that require extreme sensitivity tory sites, the strategy was to work from the lesser important (e.g., studying RNA recoding), use with IRES, to improve the to the most important to ensure that the most important efficiency of in vitro translation or in vitro transcription changes were made last. Then the sequence was rechecked for translation coupled systems such as TnT (Promega Corp., the appearance of new lower priority sites and additional Madison, Wis.), study of reporters optimized to different host 10 changes made as needed. Thus, the process for designing the organisms (e.g., plants, fungus, and the like), use of multiple synthetic GR and RD gene sequences, using computer pro genes as co-reporters to monitor drug toxicity, as reporter grams described herein, involved 5 optionally iterative steps molecules in multiwell assays, and as reporter molecules in that are detailed below drug screening with the advantage of minimizing possible 1. Optimized codon usage and changed A224V to create interference of reporter signal by different signal transduction 15 GRVerl, separately changed A224H, S247H, H348Q pathways and other regulatory mechanisms. and N3461 to create RDver1. These particular amino Additionally, uses for the synthetic nucleotide sequences acid changes were maintained throughout all Subse of the invention include fluorescence activated cell sorting quent manipulations to the sequence. (FACS), fluorescent microscopy, to detect and/or measure the 2. Removed undesired restriction sites, prokaryotic regu level of gene expression in vitro and in vivo, (e.g., to deter latory sites, splice sites, poly(A) sites thereby creating mine promoter strength), Subcellular localization or targeting GRver2 and RDver2. (fusion protein), as a marker, in calibration, in a kit (e.g., for 3. 
Removed transcription factor binding sites (first pass) and removed any newly created undesired sites as listed in step 2 above, thereby creating GRver3 and RDver3.
4. Removed transcription factor binding sites created by step 3 above (second pass) and removed any newly created undesired sites as listed in step 2 above, thereby creating GRver4 and RDver4.
5. Removed transcription factor binding sites created by step 4 above (third pass) and confirmed absence of sites listed in step 2 above, thereby creating GRver5 and RDver5.
6. Constructed the actual genes by PCR using synthetic oligonucleotides corresponding to fragments of the GRver5 and RDver5 designed sequences, thereby creating GR6 and RD7. GR6, upon sequencing, was found to have the serine residue at amino acid position 49 mutated to an asparagine and the proline at amino acid position 230 mutated to a serine (S49N, P230S). RD7, upon sequencing, was found to have the histidine at amino acid position 36 mutated to a tyrosine (H36Y). These changes occurred during the PCR process.
7. The mutations described in step 6 above (S49N, P230S for GR6 and H36Y for RD7) were reversed to create GRver5.1 and RDver5.1.
8. RDver5.1 was further modified by changing the arginine codon at position 351 to a glycine codon (R351G), thereby creating RDver5.2 with improved spectral properties compared to RDver5.1.
9. RDver5.2 was further mutated to increase luminescence intensity, thereby creating RD156-1H9, which encodes four additional amino acid changes (M2I, S349T, K488T, E538V) and three silent single base changes (see U.S. application Ser. No. 09/645,706, filed Aug. 24, 2000, the disclosure of which is incorporated by reference herein).

dual assays), for in vivo imaging, to analyze regulatory pathways and genetic elements, and in multi-well formats.

Further, although reporter genes are widely used to measure transcription events, their utility can be limited by the fidelity and efficiency of reporter expression. For example, in U.S. Pat. No. 5,670,356, a firefly luciferase gene (referred to as luc+) was modified to improve the level of luciferase expression. While a higher level of expression was observed, it was not determined that higher expression had improved regulatory control.

The invention will be further described by the following nonlimiting examples. In particular, the synthetic nucleic acid molecules of the invention may be derived by other methods as well as by variations on the methods described herein.

EXAMPLE 1

Synthetic Click Beetle (RD and GR) Luciferase Nucleic Acid Molecules

LucPplYG is a wild-type click beetle luciferase that emits yellow-green luminescence (Wood, 1989). A mutant of LucPplYG named YG#81-6G01 was envisioned. YG#81-6G01 lacks a peroxisome targeting signal, has a lower KM for luciferin and ATP, and has increased signal stability and increased temperature stability when compared to the wild type (PCT/WO9914336). YG#81-6G01 was mutated to emit green luminescence by changing Ala at position 224 to Val (A224V is a green-shifting mutation), or to emit red luminescence by simultaneously introducing the amino acid substitutions A224H, S247H, N346I, and H348Q (red-shifting mutation set) (PCT/WO9518853).

Using YG#81-6G01 as a parent gene, two synthetic gene sequences were designed. One codes for a luciferase emitting green luminescence (GR) and one for a luciferase emitting red luminescence (RD). Both genes were designed to 1) have optimized codon usage for expression in mammalian cells, 2) have a reduced number of transcriptional regulatory sites including mammalian transcription factor binding sites, splice sites, poly(A) sites and promoters, as well as prokaryotic (E. coli) regulatory sites, 3) be devoid of unwanted restriction sites, e.g., those which are likely to interfere with standard cloning procedures, and 4) have a low DNA sequence identity compared to each other in order to minimize genetic rearrangements when both are present inside the same cell. In addition, desired sequences, e.g., a Kozak sequence or restriction enzyme recognition sites, may be identified and introduced.

1. Optimize Codon Usage and Introduce Mutations Determining Luminescence Color

The starting gene sequence for this design step was YG#81-6G01.

a) Optimize Codon Usage:

The strategy was to adapt the codon usage for optimal expression in human cells and at the same time to avoid E. coli low-usage codons. Based on these requirements, the best two codons for expression in human cells for all amino acids with more than two codons were selected (see Wada et al., 1990). In the selection of codon pairs for amino acids with six codons, the selection was biased towards pairs that have the largest number of mismatched bases to allow design of GR and RD genes with minimum sequence identity (codon distinction):

Arg: CGC/CGT    Leu: CTG/TTG    Ser: TCT/AGC
Thr: ACC/ACT    Pro: CCA/CCT    Ala: GCC/GCT
Gly: GGC/GGT    Val: GTC/GTG    Ile: ATC/ATT

Based on this selection of codons, two gene sequences encoding the YG#81-6G01 luciferase protein sequence were computer generated. The two genes were designed to have minimum DNA sequence identity and at the same time closely similar codon usage. To achieve this, each codon in the two genes was replaced by a codon from the limited list described above in an alternating fashion (e.g., Arg(n) is CGC in gene 1 and CGT in gene 2, Arg(n+1) is CGT in gene 1 and CGC in gene 2).

For subsequent steps in the design process it was anticipated that changes had to be made to this limited optimal codon selection in order to meet other design criteria; however, the following low-usage codons in mammalian cells were not used unless needed to meet criteria of higher priority:

Arg: CGA    Leu: CTA    Ser: TCG
Pro: CCG    Val: GTA    Ile: ATA

Also, the following low-usage codons in E. coli were avoided when reasonable (note that 3 of these match the low-usage list for mammalian cells):

Arg: CGA/CGG/AGA/AGG
Leu: CTA    Pro: CCC    Ile: ATA

b) Introduce Mutations Determining Luminescence Color:

Into one of the two codon-optimized gene sequences was introduced the single green-shifting mutation and into the other were introduced the 4 red-shifting mutations as described above.

The two output sequences from this first design step were named GRver1 (version 1 GR) and RDver1 (version 1 RD). Their DNA sequences are 63% identical (594 mismatches), while the proteins they encode differ only by the 4 amino acids that determine luminescence color (see FIGS. 2 and 3 for an alignment of the DNA and protein sequences).

Tables 1 and 2 show, as an example, the codon usage for valine and leucine in human genes, the parent gene YG#81-6G01, the codon-optimized synthetic genes GRver1 and RDver1, as well as the final versions of the synthetic genes after completion of step 5 in the design process (GRver5 and RDver5).

TABLE 1
Valine

Codon  Human  Parent  GRver1  RDver1  GRver5  RDver5
GTA        4      13       0       0       1       1
GTC       13       4      25      24      21      26
GTG       24      12      25      25      25      17
GTT        9      20       0       0       3       5

TABLE 2
Leucine

Codon  Human  Parent  GRver1  RDver1  GRver5  RDver5
CTA        3       5       0       0       0       0
CTC       12       4       0       1      12      11
CTG       24       4      28      27      19      18
CTT        6      12       0       0       1       1
TTA        3      17       0       0       0       0
TTG        6      13      27      27      23      25

2. Remove Undesired Restriction Sites, Prokaryotic Regulatory Sites, Splice Sites and Poly(A) Sites

The starting gene sequences for this design step were GRver1 and RDver1.

a) Remove Undesired Restriction Sites:

To check for the presence and location of undesired restriction sites, the sequences of both synthetic genes were compared against a database of restriction enzyme recognition sequences (REBASE ver. 712, see the URL available at www.neb.com/rebase) using standard sequence analysis software (GenePro ver 6.10, Riverside Scientific Ent.).

Specifically, the following restriction enzymes were classified as undesired:
BamH I, Xho I, Sfi I, Kpn I, Sac I, Mlu I, Nhe I, Sma I, Xho I, Bgl II, Hind III, Nco I, Nar I, Xba I, Hpa I, Sal I,
other cloning sites commonly used: EcoR I, EcoR V, Cla I,
eight-base cutters (commonly used for complex constructs),
BstE II (to allow N-terminal fusions),
Xcm I (can generate A/T overhang used for T-vector cloning).

To eliminate undesired restriction sites when found in a synthetic gene, one or more codons of the synthetic gene sequence were altered in accordance with the codon optimization guidelines described in 1a above.

b) Remove Prokaryotic (E. coli) Regulatory Sequences:

To check for the presence and location of prokaryotic regulatory sequences, the sequences of both synthetic genes were searched for the presence of the following consensus sequences using standard sequence analysis software (GenePro):
TATAAT (-10 Pribnow box of promoter)
AGGA or GGAG (ribosome binding site; only considered if paired with a methionine codon 12 or fewer bases downstream).

To eliminate such regulatory sequences when found in a synthetic gene, one or more codons of the synthetic gene sequence were altered in accordance with the codon optimization guidelines described in 1a above.
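The consensus search for prokaryotic regulatory sequences can be illustrated with a short Python sketch. This is not the GenePro software named in the text; the function names are hypothetical, only the given strand is scanned, and "12 or fewer bases downstream" is assumed to mean that at most 12 bases separate the end of the AGGA or GGAG motif from the first base of the ATG codon.

```python
import re

PRIBNOW_BOX = "TATAAT"          # -10 promoter consensus from the text
RBS_MOTIFS = ("AGGA", "GGAG")   # ribosome binding site consensus from the text
MAX_GAP = 12                    # assumed reading of "12 or fewer bases downstream"


def find_all(seq, motif):
    """Return 0-based start positions of every match, including overlapping ones."""
    return [m.start() for m in re.finditer(f"(?={motif})", seq)]


def find_pribnow_boxes(seq):
    """Locate every TATAAT occurrence on the given strand."""
    return find_all(seq.upper(), PRIBNOW_BOX)


def find_rbs_sites(seq, max_gap=MAX_GAP):
    """Report (motif, motif_start, atg_start) for each AGGA/GGAG that has an
    ATG beginning no more than max_gap bases after the end of the motif."""
    seq = seq.upper()
    hits = []
    for motif in RBS_MOTIFS:
        for start in find_all(seq, motif):
            end = start + len(motif)
            # Window long enough to hold an ATG starting at gap 0..max_gap.
            window = seq[end:end + max_gap + 3]
            offset = window.find("ATG")
            if offset != -1:
                hits.append((motif, start, end + offset))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    demo = "GCTATAATCCAGGAGCTTCATGGCC"  # made-up sequence, not from the patent
    print(find_pribnow_boxes(demo))
    print(find_rbs_sites(demo))
```

In a design loop of the kind described, each reported position would mark a codon neighborhood to re-code using the preferred codon list from step 1a, after which the scan is repeated.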
c) Remove Splice Sites:

To check for the presence and location of splice sites, the DNA strand corresponding to the primary RNA transcript of each synthetic gene was searched for the presence of the following consensus sequences (see Watson et al., 1983) using standard sequence analysis software (GenePro):
splice donor site: AG|GTRAGT (exon|intron), the search was performed for AGGTRAG and the lower stringency GGTRAGT;
splice acceptor site: (Y)nNCAG|G (intron|exon), the search was performed with n=1.

To eliminate splice sites found in a synthetic gene, one or more codons of the synthetic gene sequence were altered in accordance with the codon optimization guidelines described in 1a above. Splice acceptor sites were generally difficult to eliminate in one gene without introducing them into the other gene because they tended to contain one of the only two Gln codons (CAG); they were removed by placing the Gln codon CAA in both genes at the expense of a slightly increased sequence identity between the two genes.

d) Remove Poly(A) Sites:

To check for the presence and location of poly(A) sites, the sequences of both synthetic genes were searched for the presence of the following consensus sequence using standard sequence analysis software (GenePro):
AATAAA.

To eliminate each poly(A) addition site found in a synthetic gene, one or more codons of the synthetic gene sequence were altered in accordance with the codon optimization guidelines described in 1a above. The two output sequences from this second design step were named GRver2 and RDver2. Their DNA sequences are 63% identical (590 mismatches).

3. Remove Transcription Factor (TF) Binding Sites, then Repeat Steps 2 a-d

The starting gene sequences for this design step were GRver2 and RDver2.

To check for the presence, location and identity of potential TF binding sites, the sequences of both synthetic genes were used as query sequences to search a database of transcription factor binding sites (TRANSFAC v3.2). The TRANSFAC database (see the URL available at transfac.gbf.de/TRANSFAC/index.html) holds information on gene regulatory DNA sequences (TF binding sites) and proteins (TFs) that bind to and act through them. The SITE table of TRANSFAC Release 3.2 contains 4,401 entries of individual (putative) TF binding sites (including TF binding sites in eukaryotic genes, in artificial sequences resulting from mutagenesis studies and in vitro selection procedures based on random oligonucleotide mixtures or specific theoretical considerations, and consensus binding sequences (from Faisst and Meyer, 1992)).

The software tool used to locate and display these TF binding sites in the synthetic gene sequences was TESS (Transcription Element Search Software, http://agave.humgen.upenn.edu/tess/index.html). The filtered string-based search option was used with the following user-defined search parameters:
Factor Selection Attribute: Organism Classification
Search Pattern: Mammalia
Max. Allowable Mismatch %: 0
Min. element length: 5
Min. log-likelihood: 10

This parameter selection specifies that only mammalian TF binding sites (approximately 1,400 of the 4,401 entries in the database) that are at least 5 bases long will be included in the search. It further specifies that only TF binding sites that have a perfect match in the query sequence and a minimum log likelihood (LLH) score of 10 will be reported. The LLH scoring method assigns 2 to an unambiguous match, 1 to a partially ambiguous match (e.g., A or T match W) and 0 to a match against N. For example, a search with the parameters specified above would result in a "hit" (positive result or match) for TATAA (SEQ ID NO:50) (LLH=10), STRATG (SEQ ID NO:51) (LLH=10), and MTTNCNNMA (SEQ ID NO:52) (LLH=10) but not for TRATG (SEQ ID NO:53) (LLH=9) if these four TF binding sites were present in the query sequence. A lower stringency test was performed at the end of the design process to re-evaluate the search parameters.

When TESS was tested with a mock query sequence containing known TF binding sites it was found that the program was unable to report matches to sites ending with the 3' end of the query sequence. Thus, an extra nucleotide was added to the 3' end of all query sequences to eliminate this problem.

The first search for TF binding sites using the parameters described above found about 100 transcription factor binding sites (hits) for each of the two synthetic genes (GRver2 and RDver2). All sites were eliminated by changing one or more codons of the synthetic gene sequences in accordance with the codon optimization guidelines described in 1a above. However, it was expected that some of these changes created new TF binding sites, other regulatory sites, and new restriction sites. Thus, steps 2 a-d were repeated as described, and 4 new restriction sites and 2 new splice sites were removed. The two output sequences from this third design step were named GRver3 and RDver3. Their DNA sequences are 66% identical (541 mismatches).

4. Remove New Transcription Factor (TF) Binding Sites, then Repeat Steps 2 a-d

The starting gene sequences for this design step were GRver3 and RDver3.

This fourth step is an iteration of the process described in step 3. The search for newly introduced TF binding sites yielded about 50 hits for each of the two synthetic genes. All sites were eliminated by changing one or more codons of the synthetic gene sequences in general accordance with the codon optimization guidelines described in 1a above. However, more high to medium usage codons were used to allow elimination of all TF binding sites. The lowest priority was placed on maintaining low sequence identity between the GR and RD genes. Then steps 2 a-d were repeated as described. The two output sequences from this fourth design step were named GRver4 and RDver4. Their DNA sequences are 68% identical (506 mismatches).

5. Remove New Transcription Factor (TF) Binding Sites, then Repeat Steps 2 a-d

The starting gene sequences for this design step were GRver4 and RDver4.

This fifth step is another iteration of the process described in step 3 above. The search for new TF binding sites introduced in step 4 yielded about 20 hits for each of the two synthetic genes. All sites were eliminated by changing one or more codons of the synthetic gene sequences in general accordance with the codon optimization guidelines described in 1a above. However, more high to medium usage codons were used (these are all considered "preferred") to allow elimination of all TF binding sites. The lowest priority was placed on maintaining low sequence identity between the GR and RD genes. Then steps 2 a-d were repeated as described. Only one acceptor splice site could not be eliminated. As a final step the absence of all TF binding sites in both genes as specified in step 3 was confirmed. The two output sequences from this fifth and last design step were named GRver5 and RDver5. Their DNA sequences are 69% identical (504 mismatches).

Additional Evaluation of GRver5 and RDver5

a) Use Lower Stringency Parameters for TESS:

The search for TF binding sites was repeated as described in step 3 above, but with even less stringent user-defined parameters:
setting LLH to 9 instead of 10 did not result in new hits;
setting LLH to 0 through 8 (incl.) resulted in hits for two additional sites, MAMAG (22 hits) and CTKTK (24 hits);
setting LLH to 8 and the minimum element length to 4, the search yielded (in addition to the two sites above) different 4-base sites for AP-1, NF-1, and c-Myb that are shortened versions of their longer respective consensus sites which were eliminated in steps 3-5 above.

It was not realistic to attempt complete elimination of these sites without introduction of new sites, so no further changes were made.

b) Search Different Database:

The Eukaryotic Promoter Database (release 45) contains information about reliably mapped transcription start sites (1253 sequences) of eukaryotic genes. This database was searched using BLASTN 1.4.11 with default parameters (optimized to find nearly identical sequences rapidly; see Altschul et al., 1990) at the National Center for Biotechnology Information site (see the URL available at www.ncbi.nlm.nih.gov/cgi-bin/BLAST). To test this approach, a portion of pGL3-Control vector sequence containing the SV40 promoter and enhancer was used as a query sequence, yielding the expected hits to SV40 sequences. No hits were found when using the two synthetic genes as query sequences.

Summary of GRver5 and RDver5 Synthetic Gene Properties

Both genes, which at this stage were still only "virtual" sequences in the computer, have a codon usage that strongly favors mammalian high-usage codons and minimizes mammalian and E. coli low-usage codons.

Both genes are also completely devoid of eukaryotic TF binding sites consisting of more than four unambiguous bases, donor and acceptor splice sites (one exception: GRver5 contains one splice acceptor site), poly(A) sites, specific prokaryotic (E. coli) regulatory sequences, and undesired restriction sites.

The gene sequence identity between GRver5 and RDver5 is only 69% (504 base mismatches) while their encoded proteins are 99% identical (4 amino acid mismatches). Their identity with the parent sequence YG#81-6G01 is 74% (GRver5) and 73% (RDver5). Their base composition is 49.9% GC (GRver5) and 49.5% GC (RDver5), compared to 40.2% GC for the parent YG#81-6G01.

Construction of Synthetic Genes

The two synthetic genes were constructed by assembly from synthetic oligonucleotides in a thermocycler followed by PCR amplification of the full-length genes (similar to Stemmer et al. (1995) Gene, 164, pp. 49-53). Unintended mutations that interfered with the design goals of the synthetic genes were corrected.

a) Design of Synthetic Oligonucleotides:

The synthetic oligonucleotides were mostly 40mers that collectively code for both complete strands of each designed gene (1,626 bp) plus flanking regions needed for cloning (1950 bp total for each gene). The 5' and 3' boundaries of all oligonucleotides specifying one strand were generally placed in a manner to give an average offset/overlap of 20 bases relative to the boundaries of the oligonucleotides specifying the opposite strand.

The ends of the flanking regions of both genes matched the ends of the amplification primers (pRAMtailup: 5'-gtactgagacgacgccagcccaagcttaggcctgagtg, SEQ ID NO:54, and pRAMtaildn: 5'-ggcatgagcgtgaactgactgaactagcggccgccgag, SEQ ID NO:55) to allow cloning of the genes into our E. coli expression vector pRAM (WO99/14336).

A total of 183 oligonucleotides were designed: fifteen oligonucleotides that collectively encode the upstream and downstream flanking sequences and 168 oligonucleotides (4x42) that encode both strands of the two genes.

All 183 oligonucleotides were run through the hairpin analysis of the OLIGO software (OLIGO 4.0 Primer Analysis Software, © 1989-1991 by Wojciech Rychlik) to identify potentially detrimental intra-molecular loop formation. The guidelines for evaluating the analysis results were set according to recommendations of Dr. Sims (Sigma-Genosys Custom Gene Synthesis Department): oligos forming hairpins with ΔG < -10 have to be avoided, those forming hairpins with ΔG ≤ -7 involving the 3' end of the oligonucleotide should also be avoided, while those with an overall ΔG ≥ -5 should not pose a problem for this application. The analysis identified 23 oligonucleotides able to form hairpins with a ΔG between -7.1 and -4.9. Of these, 5 had blocked or nearly blocked 3' ends (0-3 free bases) and were re-designed by removing 1-4 bases at their 3' end and adding them to the adjacent oligonucleotide.

The 40mer oligonucleotide covering the sequence complementary to the poly(A) tail had a very low complexity 3' end (13 consecutive T bases). An additional 40mer was designed with a high complexity 3' end but a consequently reduced overlap with one of its complementary oligonucleotides (11 instead of 20 bases) on the opposite strand.

Even though the oligonucleotides were designed for use in a thermocycler-based assembly reaction, they could also be used in a ligation-based protocol for gene construction. In this approach, the oligonucleotides are annealed in a pairwise fashion and the resulting short double-stranded fragments are ligated using the sticky overhangs. However, this would require that all oligonucleotides be phosphorylated.

b) Gene Assembly and Amplification

In a first step, each of the two synthetic genes was assembled in a separate reaction from 98 oligonucleotides. The total volume for each reaction was 50 µl:
0.5 µM oligonucleotides (0.25 pmoles of each oligo)
1.0 U Taq DNA polymerase
0.02 U Pfu DNA polymerase
2 mM MgCl2
0.2 mM dNTPs (each)
0.1% gelatin
Cycling conditions: (94° C. for 30 seconds, 52° C. for 30 seconds, and 72° C. for 30 seconds) x 55 cycles.

In a second step, each assembled synthetic gene was amplified in a separate reaction. The total volume for each reaction was 50 µl:
2.5 µl assembly reaction
5.0 U Taq DNA polymerase
0.1 U Pfu DNA polymerase
1 µM each primer (pRAMtailup, pRAMtaildn)
2 mM MgCl2
0.2 mM dNTPs (each)
Cycling conditions: (94° C. for 20 seconds, 65° C. for 60 seconds, 72° C. for 3 minutes) x 30 cycles.

The assembled and amplified genes were subcloned into the pRAM vector and expressed in E. coli, yielding 1-2% luminescent GR or RD clones. Five GR and five RD clones were isolated and analyzed further. Of the five GR clones, three had the correct insert size, of which one was weakly luminescent and one had an altered restriction pattern. Of the five RD clones, two had the correct size insert with an altered restriction pattern and one of those was weakly luminescent. Overall, the analysis indicated the presence of a large number of mutations in the genes, most likely the result of errors introduced in the assembly and amplification reactions.

c) Corrective Assembly and Amplification

To remove the large number of mutations present in the full-length synthetic genes we performed an additional assembly and amplification reaction for each gene using the proof-reading DNA polymerase Tli. The assembly reaction contained, in addition to the 98 GR or RD oligonucleotides, a small amount of DNA from the corresponding full-length clones with mutations described above. This allows the oligos to correct mutations present in the templates.

The following assembly reaction was performed for each of the synthetic genes. The total volume for each reaction was 50 µl:
0.5 µM oligonucleotides (0.25 pmoles of each oligo)
0.016 pmol plasmid (mix of clones with correct insert size)
2.5 U Tli DNA polymerase
2 mM MgCl2
0.2 mM dNTPs (each)
0.1% gelatin
Cycling conditions: 94° C. for 30 seconds, then (94° C. for 30 seconds, 52° C. for 30 seconds, 72° C. for 30 seconds) for 55 cycles, then 72° C. for 5 minutes.

The following amplification reaction was performed on each of the assembly reactions. The total volume for each amplification reaction was 50 µl:
1-5 µl of assembly reaction
40 pmol each primer (pRAMtailup, pRAMtaildn)
2.5 U Tli DNA polymerase
2 mM MgCl2
0.2 mM dNTPs (each)
Cycling conditions: 94° C. for 30 seconds, then (94° C. for 20 seconds, 65° C. for 60 seconds and 72° C. for 3 minutes) for 30 cycles, then 72° C. for 5 minutes.

The genes obtained from the corrective assembly and amplification step were subcloned into the pRAM vector and expressed in E. coli, yielding 75% luminescent GR or RD clones. Forty-four GR and 44 RD clones were analyzed with the screening robot described in WO99/14336. The six best GR and RD clones were manually analyzed and one best GR and RD clone was selected (GR6 and RD7). Sequence analysis of GR6 revealed two point mutations in the coding region, both of which resulted in an amino acid substitution (S49N and P230S). Sequence analysis of RD7 revealed three point mutations in the coding region, one of which resulted in an amino acid substitution (H36Y). It was confirmed that none of the silent point mutations introduced any regulatory or restriction sites conflicting with the overall design criteria for the synthetic genes.

d) Reversal of Unintended Amino Acid Substitutions

The unintended amino acid substitutions present in the GR6 and RD7 synthetic genes were reversed by site-directed mutagenesis to match the GRver5 and RDver5 designed sequences, thereby creating GRver5.1 and RDver5.1. The DNA sequences of the mutated regions were confirmed by sequence analysis.

e) Improve Spectral Properties

The RDver5.1 gene was further modified to improve its spectral properties by introducing an amino acid change (R351G), thereby creating RDver5.2.

pGL3 Vectors with RD and GR Genes

The parent click beetle luciferase YG#81-6G01 ("YG"), and the synthetic click beetle luciferase genes GRver5.1 ("GR"), RDver5.2 ("RD"), and RD156-1H9 were cloned into the four pGL3 reporter vectors (Promega Corp.):
pGL3-Basic = no promoter, no enhancer
pGL3-Control = SV40 promoter, SV40 enhancer
pGL3-Enhancer = SV40 enhancer (3' to luciferase coding sequences)
pGL3-Promoter = SV40 promoter.

The primers employed in the assembly of GR and RD synthetic genes facilitated the cloning of those genes into pRAM vectors. To introduce the genes into pGL3 vectors (Promega Corp., Madison, Wis.) for analysis in mammalian cells, each gene in a pRAM vector (pRAM RDver5.1, pRAM GRver5.1, and pRAM RD156-1H9) was amplified to introduce an Nco I site at the 5' end and an Xba I site at the 3' end of the gene. The primers for pRAM RDver5.1 and pRAM GRver5.1 were:

GR: 5' GGA TCC CAT GGT GAA GCG TGA GAA 3' (SEQ ID NO:56)
or
RD: 5' GGA TCC CAT GGT GAA ACG CGA 3' (SEQ ID NO:57)
and
5' CTA GCT TTT TTT TCT AGA TAA TCA TGA AGA C 3' (SEQ ID NO:58)

The primers for pRAM RD156-1H9 were:

5' GCG TAG CCA TGG TAA AGC GTG AGA AAA ATG TC 3' (SEQ ID NO:59)
and
5' CCG ACT CTA GAT TAG TAA CCG CCG CCC TTC ACC 3' (SEQ ID NO:60)

The PCR included:
100 ng DNA plasmid
1 µM primer upstream
1 µM primer downstream
0.2 mM dNTPs
1x buffer (Promega Corp.)
5 units Pfu DNA polymerase (Promega Corp.)
Sterile nanopure H2O to 50 µl

The cycling parameters were: 94° C. for 5 minutes; (94° C. for 30 seconds; 55° C. for 1 minute; and 72° C. for 3 minutes) x 15 cycles. The purified PCR product was digested with Nco I and Xba I, ligated with pGL3-control that was also digested with Nco I and Xba I, and the ligated products introduced to E. coli. To insert the luciferase genes into the other pGL3 reporter vectors (basic, promoter and enhancer), the pGL3-control vectors containing each of the luciferase genes were digested with Nco I and Xba I, ligated with other pGL3 vectors that also were digested with Nco I and Xba I, and the ligated products introduced to E. coli. Note that the polypeptide encoded by GRver5.1 and RDver5.1 (and RD156-1H9, see below) nucleic acid sequences in pGL3 vectors has an amino acid substitution at position 2 to valine as a result of the Nco I site at the initiation codon in the oligonucleotide.

Because of internal Nco I and Xba I sites, the native gene in YG#81-6G01 was amplified from a Hind III site upstream to a Hpa I site downstream of the coding region and which included flanking sequences found in the GR and RD clones. The upstream primer (5'-CAA AAA GCT TGG CAT TCC GGT ACT GTT GGT AAA GCC ACC ATG GTG AAG CGA GAG-3'; SEQ ID NO:61) and a downstream primer (5'-CAA TTG TTG TTG TTA ACT TGT TTA TT-3'; SEQ ID NO:62) were mixed with YG#81-6G01 and amplified using the PCR conditions above. The purified PCR product was digested with Nco I and Xba I, ligated with pGL3-control that was also digested with Hind III and Hpa I, and the ligated products introduced into E. coli. To insert YG#81-6G01 into the other pGL3 reporter vectors (basic, promoter and enhancer), the pGL3-control vectors containing YG#81-6G01 were digested with Nco I and Xba I, ligated with the other pGL3 vectors that also were digested with Nco I and Xba I, and the ligated products introduced to E. coli. Note that the clone of YG#81-6G01 in the pGL3 vectors has a C instead of an A at base 786, which yields a change in the amino acid sequence at residue 262 from Phe to Leu. To determine whether the altered amino acid at position 262 affected the enzyme biochemistry, the clone of YG#81-6G01 was mutated to resemble the original sequence. Both clones were then tested for expression in E. coli, physical stability, substrate binding, and luminescence output kinetics. No significant differences were found.

Partially purified enzymes expressed from the synthetic genes and the parent gene were employed to determine Km for luciferin and ATP (see Table 3).

TABLE 3

Enzyme      Km (LH2)   Km (ATP)
YG parent   2 µM       17 µM
GR          1.3 µM     25 µM
RD          24.5 µM    46 µM

In vitro eukaryotic transcription/translation reactions were also conducted using Promega's TNT T7 Quick system according to the manufacturer's instructions. Luminescence levels were 1 to 37-fold and 1 to 77-fold higher (depending on the reaction time) for the synthetic GR and RD genes, respectively, compared to the parent gene (corrected for luminometer spectral sensitivity).

To test whether the synthetic click beetle luciferase genes and the wild type click beetle gene have improved expression in mammalian cells, each of the synthetic genes and the parent gene was cloned into a series of pGL3 vectors and introduced into CHO cells (Table 8). In all cases, the synthetic click beetle genes exhibited a higher expression than the native gene. Specifically, expression of the synthetic GR and RD genes was 1900-fold and 40-fold higher, respectively, than that of the parent (transfection efficiency normalized by comparison to native Renilla luciferase gene). Moreover, the data (basic versus control vector) show that the synthetic genes have reduced basal level transcription.

Further, in experiments with the enhancer vector where the percentage of activity in reference to the control is compared between the native and synthetic gene, the data showed that the synthetic genes have reduced risk of anomalous transcription characteristics. In particular, the parent gene appeared to contain one or more internal transcriptional regulatory sequences that are activated by the enhancer in the vector, and thus is not suitable as a reporter gene, while the synthetic GR and RD genes showed a clean reporter response (transfection efficiency normalized by comparison to native Renilla luciferase gene). See Table 8.

EXAMPLE 2

Synthetic Renilla Luciferase Nucleic Acid Molecule

The synthetic Renilla luciferase genes prepared include 1) an introduced Kozak sequence, 2) codon usage optimized for mammalian (human) expression, 3) a reduction or elimination of unwanted restriction sites, 4) removal of prokaryotic regulatory sites (ribosome binding site and TATA box), 5) removal of splice sites and poly(A) sites, and 6) a reduction or elimination of mammalian transcriptional factor binding sequences.

The process of computer-assisted design of synthetic Renilla luciferase genes by iterative rounds of codon optimization and removal of transcription factor binding sites and other regulatory sites as well as restriction sites can be described in three steps:
1. Using the wild type Renilla luciferase gene as the parent gene, codon usage was optimized, one amino acid was changed (T→A) to generate a Kozak consensus sequence, and undesired restriction sites were eliminated, thereby creating synthetic gene Rlucver1.
2. Remove prokaryotic regulatory sites, splice sites, poly(A) sites and transcription factor (TF) binding sites (first pass). Then remove newly created TF binding sites. Then remove newly created undesired restriction enzyme sites, prokaryotic regulatory sites, splice sites, and poly(A) sites without introducing new TF binding sites. This thereby created Rlucver2.
3. Change 3 bases of Rlucver2, thereby creating Rluc-final.
4. The actual gene was then constructed from synthetic oligonucleotides corresponding to the Rluc-final designed sequence. All mutations resulting from the assembly or PCR process were corrected. This gene is Rluc-final.

Codon Selection

Starting with the Renilla reniformis luciferase sequence in Genbank (Accession No. M63501), codons were selected based on codon usage for optimal expression in human cells and to avoid E. coli low-usage codons. The best codon for expression in human cells (or the best two codons if found at a similar frequency) was chosen for all amino acids with more than one codon (Wada et al., 1990):

Arg: CGC        Lys: AAG
Leu: CTG        Asn: AAC
Ser: TCT/AGC    Gln: CAG
Thr: ACC        His: CAC
Pro: CCA/CCT    Glu: GAG
Ala: GCC        Asp: GAC
Gly: GGC        Tyr: TAC
Val: GTG        Cys: TGC
Ile: ATC/ATT    Phe: TTC

In cases where two codons were selected for one amino acid, they were used in an alternating fashion. To meet other criteria for the synthetic gene, the initial optimal codon selection was modified to some extent later. For example, introduction of a Kozak sequence required the use of GCT for Ala at amino acid position 2 (see below).

The following low-usage codons in mammalian cells were not used unless needed: Arg: CGA, CGU; Leu: CTA, UUA; Ser: TCG; Pro: CCG; Val: GTA; and Ile: ATA. The following low-usage codons in E. coli were also avoided when reasonable (note that 3 of these match the low-usage list for mammalian cells): Arg: CGA/CGG/AGA/AGG; Leu: CTA; Pro: CCC; Ile: ATA.

Introduction of Kozak Sequences

The Kozak sequence: 5' aaccATGGCT 3' (SEQ ID NO:63) (the Nco I site is underlined, the coding region is shown in capital letters) was introduced to the synthetic Renilla luciferase gene. The introduction of the Kozak sequence changes the second amino acid from Thr to Ala (GCT).

Removal of Undesired Restriction Sites

REBASE ver. 808 (updated Aug. 1, 1998; Restriction Enzyme Database; www.neb.com/rebase) was employed to identify undesirable restriction sites as described in Example 1. The following undesired restriction sites (in addition to those described in Example 1) were removed according to the process described in Example 1: EcoICR I, Nde I, Nsi I, Sph I, Spe I, Xma I, Pst I.

The version of Renilla luciferase (Rluc) which incorporates all these changes is Rlucver1.

Removal of Prokaryotic (E. coli) Regulatory Sequences, Splice Sites, and Poly(A) Sites

The priority and process for eliminating transcription regulation sites was as described in Example 1.

Removal of TF Binding Sites

The same process, tools, and criteria were used as described in Example 1; however, the newer version 3.3 of the TRANSFAC database was employed.

After removing prokaryotic regulatory sequences, splice sites and poly(A) sites from Rlucver1, the first search for TF binding sites identified about 60 hits. All sites were eliminated with the exception of three that could not be removed without altering the amino acid sequence of the synthetic Renilla gene:
1. site at position 63 composed of two codons for W (TGGTGG), for CAC-binding protein T00076;
2. site at position 522 composed of codons for KMV (AANATGGTN), for myc-DF1 T00517;
3. site at position 885 composed of codons for EMG (GARATGGGN), for myc-DF1 T00517.

The subsequent second search for (newly introduced) TF binding sites yielded about 20 hits. All new sites were eliminated, leaving only the three sites described above. Finally, any newly introduced restriction sites, prokaryotic regulatory sequences, splice sites and poly(A) sites were removed without introducing new TF binding sites if possible.

Rlucver2 was obtained.

As in Example 1, lower stringency search parameters were specified for the TESS filtered string search to further evaluate the synthetic Renilla gene.

With the LLH reduced from 10 to 9 and the minimum element length reduced from 5 to 4, the TESS filtered string search did not show any new hits. When, in addition to the parameter changes listed above, the organism classification was expanded from "mammalia" to "chordata", the search yielded only four more TF binding sites. When the Min LLH was further reduced to between 8 and 0, the search showed two additional 5-base sites (MAMAG and CTKTK) which combined had four matches in Rlucver2, as well as several 4-base sites. Also as in Example 1, Rlucver2 was checked for hits to entries in the EPD (Eukaryotic Promoter Database, Release 45). Three hits were determined: one to Mus musculus promoter H-2Ld (Cell, 44, 261 (1986)), one to Herpes Simplex Virus type 1 promoter b'g'2.7 kb, and one to Homo sapiens DHFR promoter (J. Mol. Biol., 176, 169 (1984)). However, no further changes were made to Rlucver2.

Summary of Properties for Rlucver2

All 30 low usage codons were eliminated. The introduction of a Kozak sequence changed the second amino acid from Thr to Ala;
base composition: 55.7% GC (Renilla wild-type parent gene: 36.5%);
one undesired restriction site could not be eliminated: EcoR V at position 488;
the synthetic gene had no prokaryotic promoter sequence but one potentially functional ribosome binding site (RBS) at positions 867-73 (about 13 bases upstream of a Met codon) could not be eliminated;
all poly(A) sites were eliminated;
splice sites: 2 donor splice sites could not be eliminated (both share the amino acid sequence MGK);
TF sites: all sites with a consensus of >4 unambiguous bases were eliminated (about 280 TF binding sites were removed) with 3 exceptions due to the preference to avoid changes to the amino acid sequence.

When introduced into pGL3, Rluc-final has a Kozak sequence (CACCATGGCT; SEQ ID NO:65). The changes in Rluc-final relative to Rlucver2 were introduced during gene assembly. One change was at position 619, a C to an A, which eliminated a eukaryotic promoter sequence and reduced the stability of a hairpin structure in the corresponding oligonucleotide employed to assemble the gene. Other changes included a change from CGC to AGA at positions 218-220 (resulted in a better oligonucleotide for PCR).

Gene Assembly Strategy

The gene assembly protocol employed for the synthetic Renilla luciferase was similar to that described in Example 1.

Sense strand primer:
5' AACCATGGCTTCCAAGGTGTACGACCCCGAGCAACGCAAA 3' (SEQ ID NO:66)

Anti-sense strand primer:
5' GCTCTAGAATTACTGCTCGTTCTTCAGCACGCGCTCCACG 3' (SEQ ID NO:67)

The resulting synthetic gene fragment was cloned into a pRAM vector using Nco I and Xba I. Two clones having the correct size insert were sequenced. Four to six mutations were found in the synthetic gene from each clone. These mutations were fixed by site-directed mutagenesis (Gene Editor from Promega Corp., Madison, Wis.) and swapping the correct regions between these two genes. The corrected gene was confirmed by sequencing.

Other Vectors

To prepare an expression vector for the synthetic Renilla luciferase gene in a pGL-3 control vector backbone, 5 µg of pGL3-control was digested with Nco I and Xba I in 50 µl final volume with 2 µl of each enzyme and 5 µl 10x buffer B (nanopure water was used to fill the volume to 50 µl). The digestion reaction was incubated at 37° C. for 2 hours, and the whole mixture was run on a 1% agarose gel in 1xTAE. The desired vector backbone fragment was purified using Qiagen's QIAquick gel extraction kit.

The native Renilla luciferase gene fragment was cloned into pGL3-control vector using two oligonucleotides, Nco I-RL-F and Xba I-RL-R, to PCR amplify native Renilla luciferase gene using pRL-CMV as the template. The sequence for Nco I-RL-F is 5'-CGCTAGCCATGGCTTCGAAAGTTTATGATCC-3' (SEQ ID NO:68); the sequence for Xba I-RL-R is 5'-GGCCAGTAACTCTAGAATTATTGTT-3' (SEQ ID NO:69). The PCR reaction was carried out

from extended protein half-life and, if so, this gives an undesired disadvantage of the new gene. This possibility is ruled out by a cycloheximide chase ("CHX Chase") experiment, which demonstrated that there was no increase of protein half-life resulting from the humanized Renilla luciferase gene.
as follows: To ensure that the increase in expression is not limited to one expression vector backbone, is promoter specific and/or Reaction Mixture (for 100 ul): cell specific, a synthetic Renilla gene (Rluc-final) as well as native Renilla gene were cloned into different vector back 10 bones and under different promoters. The synthetic gene DNA template (Plasmid) 1.0 l (1.0 ngful final) always exhibited increased expression compared to its wild 1OX Rec. Buffer 10.0 l (Stratagene Corp.) type counterpart (Table 5). dNTPs (25 mM each) 1.0 l (final 250M) Primer 1 (10 M) 2.0 l (0.2M final) Primer 2 (10 M) 2.0 l (0.2M final) TABLE 5 Pfu DNA Polymerase 2.0 l (2.5 U?ul, Stratagene Corp.) 15 82.0 ul double distilled water Vector NIH-3T3 HeLa CHO pRL-tk, native 3,834.6 922.4 7,671.9 pRL-tk, synthetic 13,252.5 9,040.2 41,743.5 PCR Reaction: heat 94° C. for 2 minutes; (94° C. for 20 pRL-CMV, native 168,062.2 842,482.5 153,539.5 seconds; 65° C. for 1 minute; 72° C. for 2 minutes; then 72° pRL-CMV, synthetic 2,168,129 8,440,306 2,532,576 pRL-SV40, native 224,224.4 346,787.6 85,323.6 C. for 5 minutes)x25 cycles, then incubate on ice. The PCR pRL-SV40, synthetic 1469,588 2,632,510 1422,830 amplified fragment was cut from a gel, and the DNA purified pRL-null, native 2,853.8 431.7 2.434 and stored at -20°C. pRL-null, synthetic 9,151.17 2,439 28,317.1 To introduce native Renilla luciferase gene fragment into pRGL3b, native 12 21.8 17 pRGL3b, synthetic 13 O.S 212.4 1,094.5 pGL3-control vector, 5ug of the PCR product of the native 25 pRGL3-tk, native 27.9 155.5 186.4 Renilla luciferase gene (RAM-RL-synthetic) was digested pRGL3-tk, synthetic 6,778.2 8,782.5 9,685.9 with Nco I and Xba I. The desired Renilla luciferase gene pRL-tk no intron, native 31.8 16S 93.4 fragment was purified and stored at -20°C. 
pRL-tk no intron, synthetic 6,665.5 6,379 21,433.1 Then 100 ng of insert and 100 ng of pGL3-control vector backbone were digested with restriction enzymes Nco I and 30 Xba I and ligated together. Then 2 ul of the ligation mixture TABLE 6 was transformed into JM109 competent cells. Eight amplicil lin resistance clones were picked and their DNA isolated. Percent of control vector DNA from each positive clone of pCL3-control-native and Vector CHO cells NIH3T3 cells HeLa cells pGL3-control-synthetic was purified. The correct sequences 35 for the native gene and the synthetic gene in the vectors were pRL-control native 1OO 100 100 confined by DNA sequencing. pRL-control synthetic 1OO 100 100 pRL-basic native 4.1 S.6 O.2 To determine whether the synthetic Renilla luciferase gene pRL-basic synthetic 0.4 O.1 O.O has improved expression in mammalian cells, the gene was pRL-promoter native 5.9 7.8 O6 cloned into the mammalian expression vector pGL3-control 40 pRL-promoter synthetic 1S.O 9.9 1.1 vector under the control of SV40 promoter and SV40 early pRL-enhancer native 42.1 123.9 52.7 enhancer. The native Renilla luciferase gene was also cloned pRL-enhancer synthetic 2.6 1.5 5.4 into the pGL-3 control vector so that the expression from synthetic gene and the native gene could be compared. The With reduced spurious expression the synthetic gene expression vectors were then transfected into four common 45 should exhibit less basal level transcription in a promoterless mammalian cell lines (CHO, NIH3T3, Hela and CV-1: Table vector. The synthetic and native Renilla luciferase genes were 9), and the expression levels compared between the vectors cloned into the pGL3-basic vector to compare the basal level with the synthetic gene versus the native gene. The amount of of transcription. 
Because the synthetic gene itself has DNA used was at two different levels to ascertain that expres increased expression efficiency, the activity from the promot sion from the synthetic gene is consistently increased at dif 50 erless vector cannot be compared directly to judge the differ ferent expression levels. The results show a 70-600 fold ence in basal transcription, rather, this is taken into consider increase of expression for the synthetic Renilla luciferase ation by comparing the percentage of activity from the gene in these cells (Table 4). promoterless vector in reference to the control vector (expres sion from the basic vector divided by the expression in the TABLE 4 55 fully functional expression vector with both promoter and enhancer elements). The data demonstrate that the synthetic Cell Type Amount Vector Fold Expression Increase Renilla luciferase has a lower level of basal transcription than CHO 0.2 Ig 142 the native gene in mammalian cells (Table 6). 2.8 Jug 145 It is well known to those skilled in the art that an enhancer NIH3T3 0.2 Ig 326 2.0 Ig 593 60 can substantially stimulate promoter activity. To test whether HeLa 0.2 Ig 18S the synthetic gene has reduced risk of inappropriate transcrip 1.0 Lig 103 tional characteristics, the native and synthetic gene were CV-1 0.2 Ig 68 introduced into a vector with an enhancer element (pGL3 2.0 Ig 72 enhancer vector). Because the synthetic gene has higher 65 expression efficiency, the activity of both cannot be compared One important advantage of luciferase reporter is its short directly to compare the level of transcription in the presence protein half-life. The enhanced expression could also result of the enhancer, however, this is taken into account by using US 7,728,118 B2 41 42 the percentage of activity from enhancer vector in reference each transfection, puC19 carrier DNA was added to a total of to the control vector (expression in the presence of enhancer 3 ug DNA. 
10 fold less pRL-TK DNA gave similar or more divided by the expression in the fully functional expression signal as the native gene, with reduced risk of inhibiting vector with both promoter and enhancer elements). Such expression from the primary reporter pGL3-control. results show that when native gene is present, the enhancer Experimental treatment sometimes may activate cryptic alone is able to stimulate transcription from 42-124% of the sites within the gene and cause induction or Suppression of the control, however, when the native gene is replaced by the co-reporter expression, which would compromise its func synthetic gene in the same vector, the activity only constitutes tion as co-reporter for normalization of transfection efficien 1-5% of the value when the same enhancer and a strong SV40 cies. One example is that TPA induces expression of co promoter are employed. This clearly demonstrates that Syn 10 reporter vectors harboring the wild-type gene when thetic gene has reduced risk of spurious expression (Table 6). transfecting MCF-7 cells. 500 ng pRL-TK (native), 5 lug The synthetic Renilla gene (Rluc-final) was used in in vitro native and synthetic pRG-B, 2.5ug native and synthetic pRG systems to compare translation efficiency with the native TK were transfected per well of MCF-7 cells. 100 ng/well gene. In a T7 quick coupled transcription/translation system pGL3-control (firefly luc--) was co-transfected with all RL (Promega Corp., Madison, Wis.), pRL-null native plasmid 15 plasmids. Carrier DNA, puC 19, was used to bring the total (having the native Renilla luciferase gene under the control of DNA transfected to 5.1 g/well. 15.3 titl TransFastTransfec the T7 promoter) or the same amount of pRL-null-synthetic tion Reagent (Promega Corp., Madison, Wis.) was added per plasmid (having the synthetic Renilla luciferase gene under well. 
Sixteen hours later, cells were trypsinized, pooled and the control of the T7 promoter) was added to the TNT reaction split into six wells of a 6-well dish and allowed to attach to the mixture and luciferase activity measured every 5 minutes up well for 8 hours. Three wells were then treated with the 0.2 to 60 minutes. Dual Luciferase assay kit (Promega Corp.) was nM of the tumor promoter, TPA (phorbol-12-myristate-13 used to measure Renilla luciferase activity. The data showed acetate, Calbiochem #524400-S), and three wells were mock that improved expression was obtained from the synthetic treated with 20 ul DMSO. Cells were harvested with 0.4 ml gene. To further evidence the increased translation efficiency Passive Lysis Buffer 24 hours post TPA addition. The results of the synthetic gene, RNA was prepared by an in vitro 25 showed that by using the synthetic gene, undesirable change transcription system, then purified. pRL-null (native or syn of co-reporter expression by experimental stimuli can be thetic) vectors were linearized with BamHI. The DNA was avoided (Table 7). This demonstrates that using synthetic purified by multiple phenol-chloroform extraction followed gene can reduce the risk of anomalous expression. by ethanol precipitation. An in vitro T7 transcription system was employed by prepare RNAs. The DNA template was 30 TABLE 7 removed by using RNase-free DNase, and RNA was purified by phenol-chloroform extraction followed by multiple iso Vector Ru Fold Induction propanol precipitations. The same amount of purified RNA, pRL-tkuntreated (native) 184 either for the synthetic gene or the native gene, was then pRL-tkTPA treated (native) 812 4.4 added to a rabbit reticulocyte lysate or wheat germ lysate. 35 pRG-B untreated (native) 1 pRG-BTPA treated (native) 8 8.0 Again, the synthetic Renilla luciferase gene RNA produced pRG-B untreated (final) 132 more luciferase than the native one. 
These data Suggest that pRG-BTPA treated (final) 195 1.47 the translation efficiency is improved by the synthetic pRG-tkuntreated (native) 44 sequence. To determine why the synthetic gene was highly pRG-tkTPA treated (native) 192 4.36 expressed in wheat germ, plant codon usage was determined. 40 pRG-tkuntreated (final) 12,816 The lowest usage codons in higher plants coincided with pRG-tkTPA treated (final) 11,347 O.88 those in mammals. Reporter gene assays are widely used to study transcrip tional regulation events. This is often carried out in co-trans EXAMPLE 3 fection experiments, in which, along with the primary 45 reporter construct containing the testing promoter, a second Synthetic Firefly Luciferase Genes control reporter under a constitutive promoter is transfected into cells as an internal control to normalize experimental The luc-gene (U.S. Pat. No. 5,670,356) was optimized variations including transfection efficiencies between the using two approaches. In the first approach (Strategy A), samples. Control reporter signal, potential promoter cross 50 regulatory sequences such as codons were optimized and talk between the control reporter and primary reporter, as well consensus transcription factor binding sites (TFBS) were as potential regulation of the control reporter by experimental removed (see Example 4, although different versions of pro conditions, are important aspects to consider for selecting a grams and databases were used). The sequences obtained for reliable co-reporter vector. the first approach include hluc--ver2AF1 through hluc-- As described above, vector constructs were made by clon 55 ver2AF8 (designations with an “F” indicate the construct ing synthetic Renilla luciferase gene into different vector included flanking sequences). hluc--Ver2AF1 is codon-opti backbones under different promoters. 
All the constructs mized, hluc+ver2AF2 is a sequence obtained after a first showed higher expression in the three mammalian cell lines round of removal of identified undesired sequences including tested (Table 5). Thus, with better expression efficiency, the transcription factor binding sites, hluc--ver2AF3 was synthetic Renilla luciferase gives out higher signal when 60 obtained after a second round of removal of identified undes transfected into mammalian cells. ired sequences including transcription factor binding sites, Because a higher signal is obtained, less promoter activity hluc--ver2AF4 was obtained after a third round of removal of is required to achieve the same reporter signal, this reduced identified undesired sequences including transcription factor risk of promoter interference. CHO cells were transfected binding sites, hluc--ver2AF5 was obtained after a fourth with 50 ng pCL3-control (firefly luc-) plus one of 5 different 65 round of removal of identified undesired sequences including amounts of native pRL-TK plasmid (50, 100, 500, 1000, or transcription factor binding sites, hluc--ver2AF6 was 2000 ng) or synthetic pRL-TK (5, 10, 50, 100, or 200 ng). To obtained after removal of promoter modules and RBS, hluc--

US 7,728,118 B2 53 54 scription factor binding sites, hluc--ver2BF3 was obtained after a second round of removal of identified undesired - Continued sequences including transcription factor binding sites, hluc-- GACAGAGAAGGAGATTGTGGATTATGTGGCTTCTGAGGTGACAACAGCTA ver2BF4 was obtained after a third round of removal of iden AGAAGCTGAGAGGGGGGGTGGTGTTTGTGGATGAGGTGCCTAAGGGGCTG tified undesired sequences including transcription factor binding sites, hluc--ver2BF5 was obtained after a fourth ACAGGGAAGCTGGATGCTAGAAAGATTAGAGAGATTCTGATTAAGGCTAA round of removal of identified undesired sequences including GAAGGGGGGGAAGATTGCTGTGTAATAATTCTAGA transcription factor binding sites, hluc--ver2BF6 was obtained after removal of promoter modules and RBS, hluc-- hilulc-ver2B2 ver2BF7 was obtained after further removal of identified 10 undesired sequences including transcription factor binding AAAGCCACCATGGAAGATGCTAAAAACATTTTAAGAAGGGGCCTGCTCCT sites, and hluc--ver2BF8 was obtained after modifying a TTTCTACCGTCTGGAGGATGGGACTGGGGGGGAGCAGCTGCATAAAGCTA restriction enzyme recognition site. TGAAGCGGTATGCTCTGGTGCCAGGCACAATTGCGTTCACGGATGCTCAC hluc--ver2B1-B5 have the following sequences (SEQID Nos. 
15 24-28): ATTGAGGTGGACATTTACATACGCTGAGTATTTTGAGATGTCGGTGCGGC TGGCTGAGGCTATGAAGCGATATGGGCTGAATACAAACCATAGAATTGTA hilulc-ver2B1 GTGTGCTCTGAGAACTCGTTGCAGTTTTTTTATGCCTGTGGTGGGGGCTC AAAGCCACCATGGAGGATGCTAAGAATATTAAGAAGGGGCCTGCTCCTTT TGTTCATCGGGGTGGGTGTGGCTCCTGCTAACGAGATTTTTACAATGAGA TTATCCTCTGGAGGATGGGACAGCTGGGGAGCAGCTGCATAAGGCTATGA GAGAGCTTTTGAACTCGATGGGGATTTTTCTCAGCCTACAGTGGTGTTTT AGAGATATGCTCTGGTGCCTGGGACAATTGCTTTTACAGATGCTCATATT GTGAGTAAGAAAGGGCTTCAAAAGATTTCT CAATGTGCAAAAGAAGCTGC GAGGTGGATATTTACATATGCTGAGTATTTTTGAGATGTCTGTGAGACTG 25 CTATTATTTTCAAAAGATTATTATTTTATGGACTCTAAGACAGACTACCA GCTGAGGCTATGAAGAGATATGGGCTGAATACAAATCATAGAATTGTGGT GGGGTTTTCAGTCTATGTATACATTTGTGACATCTCATCTGCCTCCTGGG GTGTTCTGAGAATTGTTCTGCAGTTTTTTTTATGCCTGTGCTGGGGGCTC TTCAACGAGTATGACTTTTGTGCCCGAGTCTTTCGACAGAGATAAGACAA TGTTTATTGGGGTGGCTGTGGCTCCTGCTAATGATATTTATAATGAGAGA 30 TTGCTCTGATTTATGAATTCATCTGGGTCTACCGGGCTGCCTAAGGGTGT GAGCTGCTGAATTCTATGGGGATTTCTCAGCCTACAGTGGTGTTTTGTGT AGCTCTGCCACATAGAACAGCTTTGTGTGAGATTTTTCTCATGCTAGGGA CTAAGAAGGGGCTGCAGAAGATTCTGAATGTGCAGAAGAAGCTGCCTATT CCCTATTTTTTTGGGAATCAGATTATTCCTGATACTGCTATTCTGTCGTT ATTCAGAAGATTATTATTATGGATTCTAAGACAGATTATCAGGGGTTTTC 35 TGTGCCCTTTCATCATGGGTTTTGGGATGTTTTACAACACTGGGCTACCT AGTCTATGTATACATTTTGTGACATCTCATCTGCCTCCTGGGTTTAATGA GATATGTGGGTTTAGAGTGGTGCTCATGTATAGGTTTGAGGAGGAGCTTT GTATGATTTTGTGCCTGAGTCTTTTGATAGAGATAAGACAATTGGTCTGA TTTTTTGGGCTCTCTGCAAGATTATAAGATTCAGTCTGCTCTGCTGGTGC TTATGAATTTCTTCTGGGTCTACAGGGCTGCCTAAGGGGGTGGCTCTGCC CTACACTGTLTTCTTTTTTTTGCTAAGTCTACCCTGATCGATAAGTATGA 40 TCATAGAACAGCTTGTGTGAGATTTTCTCATGGTAGAGATCCTATTTTTT TCTGTCCAACCTGCACGAGATTGCTTTTCTGGGGGGGCTCCTCTGTCTAA GGGAATCAGATTATTCCTGATACAGCTATTCTGTCTGTGGTGCGTTTTCA GGAGGTAGGTGAGGCTGTGGCTAAGCGCTTTCATCTGCCTGGAATCAGAC TCATGGGTTTGGGATGTTTACAACACTGGGGTATCTGATTTGTGGGTTTA AGGGGTATGGGCTAACAGAAACAACATCTGCTATTCTGATTTTACACCAG GAGTGGTGCTGATGTATAGATTTTGAGGAGGAGCTGTTTCTGAGATGTCT 45 AGGGGGATGATAAGCCCGGGGCTGTAGGGAAAGTGGTGCCCTTTTTTGAA GCAGGATTATAAGATTCACGTCTGGTCTGCTGGTGCCTAGACTGTTTTCTT 
GCTAAAGTAGTTGATGTTGATACCGGTAAGACACTGGGGGTGAATCAGCG TTTTTGCTAAGTCTACACTGATTGATAAGTATGATCTGTCTAATCTGCAT AGGGGAACTGTGTGTGAGAGGGCCTATGATTATGTCGGGGTATGTGAACA GAGATTGCTTCTGGGGGGGCTCCTCTGTCTAAGGAGGTGGGGGAGGCTGT 50 ACCCTGAGGCTACAAATGCTCTGATTGATAAGGATGGGTGGCTGCATTTC GGCTAAGAGATTTCATCTGCCTGGGATTAGACAGGGGTATGGGCTGACAG GGGCGATATTGCTTACTGGGATGAGGATGAGCATTTCTTCATCGTGGACA AGACAACATCTGCTATTCTGATTACACCTGAGGGGGATGATAAGCCTGGG GACTGAAGTCGTTGATCAAATATAAGGGGTATCAAGTAGCTCCTGCTGAG GCTGTGGGGAAGGTGGTGGCTTTTTTTTTGAGGCTAAGGTGGTGGATCTG 55 CTGGAGTCCATTCTGCTTCAACATCCTAACATTTTCGATGCTGGGGTGGC GATACAGGGAAGACACTGGGGGTGAATCAGAGAGGGGAGCTGTGTGTGAG TGGGGTGGCTGATGATGATGCTGGGGAGCTGCCTGCTGCTGTAGTGGTGC AGGGCGTATGATTATGTCTGGGTATGTGAATAATCCTGAGGGTACAAATG TGGAGCACGGTAAGACAATGACAGAGAAGGAGATTGTGGATTTATGTGGC CTGTGATTGATAAGGATGGGTGGGTGCATTCTGGGGATATTGGTTATTGG 60 GATGAGGATGAGCATTTTTTTATTGTGGATAGACTGAAGTCTCTGATTAA TTCACAAGTGACAACAGCTAAGAAACTGAGAGGTGGCGTTGTGTTTTGTG

GTATAAGGGGTATCAGGTGGCTCCTGCTGAGCTGGAGTCTATTCTGCTGC GATGAGGTGCCTAAAGGGCTGACAGGCAAGCTGGATGCTAGAAAAATTTT

AGCATCCTAATATTTTTGATGCTGGGGTGGCTGGGCTGCGTGATGATGAT CGAGAGATTCTGATTAAGGCTAAGAAGGGTGGAAAGATTGCTGTGTAATA 65 GGTGGGGAGCTGCCTGCTGCTGTGGTGGTGCTGGAGCATGGGAAGACAAT GTTCTAGA


- Continued - Continued TTCATCTGCCTGGtATAGACAGGGGTAcGGGCTaaCAGAaACAACt TGCTATTCTGATTACACCaCAGGGcGATGACAAaCCtGGGGCTGTaGGGA

GCTATTCTGATTAGACCaCAGGGcGATGACAAaCCtGGGGCTGTaGGGAA AaGTGGTGCCCTTTTTTTTGAaGCCAAaGTaGT't GATCTtGATACcGGtA aGTGGTGCCoTTTTTTGAaGCCAAaGTaGTtGATCTtGATACCGGtAAGA AGACAGTagGGGTGAACCAGaGaGGtCAatTGTGTGTGaGgGGcCCTATG CACTagGGGTGAACCAGaGaGGtCAatTGTGTGTGaGgGGCCCTATGATT ATTTATGTCggGGTAccTtAAcAAccCogAagCTAGAAATGCTCTCATag TATGTCgGGGTACGTtAACAACCCCGAagCTACAAATGCTCTCATagACA AcAAGGAcGGgTGGcTtCATagtCGaGAtATTTGCcTAcTGGGAtGAagA. 10 AGGACGGg TGGCTtCATagtCGaGAtATTGCCTACTGGGAtCAagATGAG TGAGCATTTT cTTrcATcCTGGAcAGACTGAAGTCgtTGATcAAaTAcAA CATTTCTTTCATCGTGGACAGACTGAAGTCgtTGATCAAaTACAAGGGGT GGGGTATCAagTagCTCCTGCcGAGCTtcAgTCcATTCTGCTt CAagAcc

ATCAagTagCTCCTGCcCAGCTtcAgTCcATTCTGGTt CAaCAccCoAAt CoAAtATcTTcGATGCTGGGGTGGCTGGGCTGCCTGATGATGATGCTGGa 15 ATTTcGATGCTGGGGTGGCTGGGGTGCCTGATGATGATGCTGGaGAGCT GAGCTGCCTGCTGCTGTaGTaGTGCTt(GAGGAtGGtAAGACAATGACAGA

GGGTGCTGCTGTaGTaGTGGTt(GAGCAtGGtAAGACAATGACAGAGAAGG GAAGGAGATcGTGGATTATGTGGCTTCaCAaGTGACAACAGCTAAGAAaC

AGATcGTGGATTATGTGGCTTCaCAaGTGACAACAGCTAAGAAaCTccGA TccGAGGt GGcGTt(GTGTTTGTGGATGAGGTGCCTAAaGGaGTCACtCGGC AAGCTGGATGCCAGAAAaATTCGAGAGATTCTCATTAAGGCTAAGAAGGG GGtGGcGTtGTGTTTTGTGGATGAGGTGCCTAAaGGGCTCACtGGCAAGC to GaAAGATTGCTGTGTAATAgTTCTAGA. TGGATGCCAGAAAaATTCGAGAGATTCTCATTTAAGGCTAAGAAGGGt aAAGATTGCTGTGTAATAgTTCTAGA. 25 TABLE 11 The BglI sequence in hluc--ver2BF9 was removed resulting Summary of Firefly Luciferase Constructs in hluc--ver2BF10. hluc--ver2BF10 demonstrated poor Number of expression. COSCSUS transcription Number of CG dinucleotides hluc--ver2B10 has the following sequence 30 Firefly luciferase factor binding Promoter (possible Gene sites modules* methylation sites) Luc- 287 7 97 (SEQ ID NO:33) hluc-ver2AF8 3 O 132 AAAGCCACCATGGAaGATGCCAAaAAcATTAAGAAGGGGCCTGCTCCCTT hluc+ver2BF10 3 O 43 35 cTACCCTCTtGAaGATGGGACtCGCtGGcGAGCAaCTtCACAAaGCTATGA *Promoter modules are defined as a composite regulatory element, with 2 TFBS separated by a spacer, which has been shown to exhibitsynergistic or AGCGgTATGCTCTtcTGGCagGgACAATTGCgTTcACggATGCTCAcATT antagonistic function. 
GAagTagAcATcACATAcGCTGAGTATTTGAGATGTCgGTGcGgCTGGCa 40 GAaGCTATGAAGcGCTATGGGCTGAATACAAAcCATAGAATTGTaGTGTG EXAMPLE 4 cagTGAGAAcTCgtTGCAGTTcTTTTATGCCogTGCTGGGGGCTGTcTTc Synthetic Selectable Polypeptide Genes ATtGGGGTGGCTGTGGCTCCTGCTAAtGAcATcTACAAcGAGcGAGAGCT gtTGAAcagtATGGGGATcTCTCAGCCTACAGTGGTGTTTGTGag TAAGA 45 Design Process AagGGCTtCAaAAGATTCTcAATGTGCAaAAGAAGCTaCCATCATaCAa Define Sequences AAGATcATCATcATGGAtagoAAGACcGAcTAcCAGGGGTTTCAGTCCAT Protein Sequence that Should be Maintained: Neo: from neogene of pCI-neo (Promega) (SEQID NO:1) GTACACATTTGTaACCTCTCATCTGCCTCCTGGCTTCAAtGAGTAtGACT 50 Hyg: from hyg gene of pcDNA3.1/Hygro (Invitrogen) TcGTGCCogAGTCTTTcGAcAGggAcAAaACgATTGCTCTGATcATGAAC (SEQID NO:6) agcagtCGGTCTAGcGGGCTGCCTAAGGGtGTag CTCTGCCoCATcGAAC DNA Flanking Regions for Starting Sequence: AGCTTGTGTGAGATTcTCTCATGCcAGgQAccCgATcTTtcGaAAccAGA 5' end: Kozak sequence from neo gene of pCI-neo 55 (GCCACCATGA: SEQ ID NO:34)), PflMI site TcATccCTGAcAGtcCTATTCTGTCgGTgGTGCCoTTTCATCATGGGTTT (CCANNNTGG: SEQ ID NO:35), add Ns at end (to avoid search algorithm errors & keep ORF1): neo/hyg: GGGATGTTCACAACACTGGGaAccTCATtTGcGGGTTTTAGAGTGGTGC NNNNNCCAnnnnnTGGCCACC-ATG-G (SEQ ID TcATGTATAGgTTTGAagAagAacTaTTccTacGcTCTtTGCAagATTAT 60 NO:36) AAGATTCAGTCTGCTCTGCTGGTGCCaACAC TaTTCTCTTTTTTTGCTAA Change: Replace PflMI with Sbfl (CCTGCAGG) GTCTACgCTcATagAcAAGTATGActTGTCCAActTGCAccAGATTGCTT 3' end: two stop codons (at least one TAA), PflMI site (not compatible with that at 5' end to allow directional clon CTGGcGGaGCaCCTGTGTCTAAGGAGGTagGtGAGGCTGTGGCTAAGcGc 65 ing), add NS at end (to avoid search algorithm errors): TTTCATCTGCCTGG taTcAGACAGGGGTAcGGGCTaaCAGAaACAACt neo/hyg: TAATAACCAnnnnnTGGNNN (SEQID NO:37) Change: replace PfiMI with AflII (CTTAAG) US 7,728,118 B2 63 64 Define Codon Usage Codon usage was obtained from the Codon Usage Database - Continued (http://www.kazusa.or.jp/codon?): LSSHLAPAEKVSMADAMRRLHTLDPATCPTDHOAKHRIERARTRMEAGLV Based on: GenBank Release 131.015 Aug. 
2002 (Naka DODDLDEEHOGLAPAELTARLKARMPDGEDLVVTHGDACLPNMVENGRTS mura et al., 2000). GTRDCGRLGVADRYODLALATRDLAEELGGEWADRTLVLYGAAPDSQRAT Codon Usage Tables were Downloaded for: HS: Homo sapiens Igbpril 50,031 CDSs (21,930,294 YRLLDETT codons) and encoded by MM: Mus musculus Igbrod 23,113 CDSs (10,345.401 10 codons) (SEQ ID NO: 1) EC: Escherichia coli Igbbct 11,985 CDSs (3,688,954 Atgattgaacaagatggattgcacgcaggttct cc.gc.cgcttgggtgga codons) gaggct attcggctatgactgggcacaa.calgacaatcggctgctctgatg EC K12: Escherichia coli K12 gbbct 4.291 CDSs (1,363,716 codons) 15 cc.gc.cgtgttc.cggctgtcagcgcaggggcgc.ccggttcttitttgtcaag HS and MM were compared and found to be closely accgacctgtc.cggtgc cctgaatgaactgcaggacgaggcagcgcggct similar, use HS table EC and EC K12 were compared and found to be closely atcgtggctggcc acgacgggcgt.ccttgcgcagctgtgct cacgttgt similar, use EC K12 table Cactgaag.cgggalagggactggctgct attgggcgaagtgc.cggggcagg Codon Selection Strategy: Overall strategy is to adapt codon usage for optimal expres at ct cotgtcatcto accttgct cotgcc.gagaaagtat coat catggct sion in mammalian cells while avoiding low-usage E. gatgcaatgcggcggctgcatacgcttgat CC9gctacctgcc catt Ca coli codons. One “best codon was selected for each ccacca agcgaaa.catcgcatcgagcgagcacgtact.cggatggaag.ccg amino acid and used to back-translate the desired protein 25 sequence to yield a starting gene sequence. gtctgtcgat Caggatgatctggacgaagagcatcaggggg.tc.gc.gc.ca.g Strategy A was chosen for the design of the neo and hyg genes (see Table 12). (Strategy A: Codon bias optimized: cc.gaactgttcgc.caggct Caaggcgc.gcatgc.ccgacggc gaggat ct c emphasis on codons showing the highest usage fre gtcgtgacccatggcgatgcctgcttgc.cgaatat catggtggaaatggc quency in HS. Best codons are those with highest usage 30 in HS, unless a codon with slightly lower usage has cgctttctggatt catcgactgtggc.cggctgggtgtggcggaccgctat Substantially higher usage in E. coli.). 
Caggacat agcgttggctaccc.gtgatatgctgaagagcttggcggcgaa TABLE 12 tgggct gaccgct tcct cqtgctttacggitat cqc.cgct cocgat.cgcag 35 Codon Choices in Codon cgcatcgcct tctat cqccttcttgacgagttcttctga Codon Choices in Bias Optimized Strategy Amino acid Examples 1-2 A. Hyg (based on hygromycin gene from Invitrogen's pcDNA3.1/Hygro) Gly GGCGGT GGC (SEO ID NO : 7) Glu GAG GAG MKKPELTATSVEKTLWKTDSWSDLMOLSEGEESRATSTDWGGRGYWLRVN Asp GAC GAC 40 Wall GTG.GTC GTG SCADGTYKDRYWYRTASAALPTPEVLDGETSESLTYCSRRAOGVTLODLP Ala GCCGCT GCC Arg CGCCGT CGC ETELPAVLQPWAEAMDAAAAADLSOTSGTGPTGPOGGQYTTWPDTCALAD Ser TCTAGC AGC Lys AAG AAG PHWYHWOTVMDDTVSASVAOALDELMLWAEDCPEVRHLVHADTGSNNVLT ASn AAC AAC 45 Ile ATCATT ATC DNGRTAVTDWSEMATGDSOYEVANTTTWRPWLAGMEOOTRYTERRHPELA Thr ACCACT ACC Cys TGC TGC GSPRLRAYMLRGLDOLYOSLVDGNTDDAAWAQGRCDAIVRSGAGTWGRTO Tyr TAC TAC Leu CTG/TTG CTG LARRSAAWWTDGCWEWLADSGNRRPSTRPRAKE Phe TTC TTC 50 Gln CAG CAG encoded by His CAC CAC Pro CCACCT CCC (SEQ ID NO : 6) Atgaaaaa.gc.ctgaact caccgcgacgtctgtc.gagaagtttctgat Ca Generate Starting Gene Sequences 55 aaagttcgacagogt ct Cogacctgatgcagct Ctcggagggcgaagaat Use custom codon usage table in Vector NTI 8.0 (Informax) Ctctgctitt cagct tcgatgtaggagggcgtggatatgtc.ctg.cgggta (“Strategy A') aatagotgcgc.cgatggitttctacaaagat.cgittatgtttatcggcactt Back-translate neo and hyg protein sequences tgcatcggcc.gc.gct cocgatt.ccggaagtgcttgacattggggaattica 60 Neo (based on neomycin gene from Promega's pCI-neo) gcgaga.gc.ct gacct attgcatctoccgc.cgtgcacagggtgtcacgttg Caagacct gcctgaaaccgaactgc.ccgctgttctgcagc.cggtc.gcgga (SEQ ID NO: 2) MEODGLHAGSPAAWVERLTGYDWAOOTGCSDAAVTRLSAOGRPWLTVKTD ggc.catggatgcgat.cgctgcggc.cgat cittagc.ca.gacgagcgggttcg 65 LSGALNELODEAARLSWLATTGVPCAAVLDVVTEAGRDWLLLGEWPGQDL gcc cattcggaccgcaaggaatcggt caat acactacatggcgtgattica

GEMS Launcher Release 3.5.2 (June 2003)
MatInspector professional Release 6.2.1 June 2003
Matrix Family Library Ver 3.1.2 June 2003 (incl. 318 vertebrate matrices in 128 families)
ModelInspector professional Release 4.8 October 2002
Model Library Ver 3.1 March 2003 (226 modules)
SequenceShaper tool
User Defined Matrices

Sequence Motifs to Remove from Starting Gene Sequences (In Order of Priority)

Restriction Enzyme Recognition Sequences:
See user-defined matrix subset neo and hyg. Same as those used for design of hluc+ version 2.0
Generally includes those required for cloning (pGL4) or commonly used for cloning
Change: also SbfI, AflII, AccIII

Transcription Factor Binding Sequences:
Promoter modules (2 TF binding sites with defined orientation) with default score or greater
Vertebrate TF binding sequences with score of at least core=0.75/matrix optimized

Eukaryotic Transcription Regulatory Sites:
Kozak sequence
Splice donor/acceptor sequences in (+) strand
PolyA addition sequences in (+) strand

Prokaryotic Transcription Regulatory Sequences:
E. coli promoters
E. coli RBS (if less than 20 bp upstream of Met codon)

User-Defined Matrix Subset "neo+hyg"
Format: Matrix name (core similarity threshold/matrix similarity threshold)
U$AatII (0.75/1.00)
U$BamHI (0.75/1.00)
U$BglI (0.75/1.00)
U$BglII (0.75/1.00)
U$BsaI (0.75/1.00)
U$BsmAI (0.75/1.00)
U$BsmBI (0.75/1.00)
U$BstEII (0.75/1.00)
U$BstXI (0.75/1.00)
U$Csp45I (0.75/1.00)
U$CspI (0.75/1.00)
U$EC-P-10 (1.00/Optimized)
U$EC-P-35 (1.00/Optimized)
U$EC-Prom (1.00/Optimized)
U$EC-RBS (0.75/1.00)
U$EcoRI (0.75/1.00)
U$HindIII (0.75/1.00)
U$Kozak (0.75/Optimized)
U$KpnI (0.75/1.00)
U$MluI (0.75/1.00)
U$NcoI (0.75/1.00)
U$NdeI (0.75/1.00)
U$NheI (0.75/1.00)
U$NotI (0.75/1.00)
U$NsiI (0.75/1.00)
U$PflMI (0.75/1.00)
U$PmeI (0.75/1.00)
U$PolyAsig (0.75/1.00)
U$PstI (0.75/1.00)
U$SacI (0.75/1.00)
U$SacII (0.75/1.00)
U$SalI (0.75/1.00)
U$SfiI (0.75/1.00)
U$SgfI (0.75/1.00)
U$SmaI (0.75/1.00)
U$SnaBI (0.75/1.00)
U$SpeI (0.75/1.00)
U$Splice-A (0.75/Optimized)
U$Splice-D (0.75/Optimized)
U$XbaI (0.75/1.00)
U$XcmI (0.75/1.00)
U$XhoI (0.75/1.00)
ALL vertebrates.lib (0.75/Optimized)

User-Defined Matrix Subset "neo+hyg-EC"
Format: Matrix name (core similarity threshold/matrix similarity threshold)
U$AatII (0.75/1.00)
U$BamHI (0.75/1.00)
U$BglI (0.75/1.00)
U$BglII (0.75/1.00)
U$BsaI (0.75/1.00)
U$BsmAI (0.75/1.00)
U$BsmBI (0.75/1.00)
U$BstEII (0.75/1.00)
U$BstXI (0.75/1.00)
U$Csp45I (0.75/1.00)
U$CspI (0.75/1.00)
U$EcoRI (0.75/1.00)
U$HindIII (0.75/1.00)
U$Kozak (0.75/Optimized)
U$KpnI (0.75/1.00)
U$MluI (0.75/1.00)
U$NcoI (0.75/1.00)
U$NdeI (0.75/1.00)
U$NheI (0.75/1.00)
U$NotI (0.75/1.00)
U$NsiI (0.75/1.00)
U$PflMI (0.75/1.00)
U$PmeI (0.75/1.00)
U$PolyAsig (0.75/1.00)
U$PstI (0.75/1.00)
U$SacI (0.75/1.00)
U$SacII (0.75/1.00)
U$SalI (0.75/1.00)
U$SfiI (0.75/1.00)
U$SgfI (0.75/1.00)
U$SmaI (0.75/1.00)
U$SnaBI (0.75/1.00)
U$SpeI (0.75/1.00)
U$Splice-A (0.75/Optimized)
U$Splice-D (0.75/Optimized)
U$XbaI (0.75/1.00)
U$XcmI (0.75/1.00)
U$XhoI (0.75/1.00)
ALL vertebrates.lib (0.75/Optimized)

User-Defined Matrix Subset "pGL4-072503"
Format: Matrix name (core similarity threshold/matrix similarity threshold)
U$AatII (0.75/1.00)
U$AccIII (0.75/1.00)
U$AflII (0.75/1.00)
U$BamHI (0.75/1.00)
U$BglI (0.75/1.00)
U$BglII (0.75/1.00)
U$BsaI (0.75/1.00)
U$BsmAI (0.75/1.00)
U$BsmBI (0.75/1.00)
U$BstEII (0.75/1.00)
U$BstXI (0.75/1.00)
U$Csp45I (0.75/1.00)
U$CspI (0.75/1.00)
U$EC-P-10 (1.00/Optimized)
U$EC-P-35 (1.00/Optimized)
U$EC-Prom (1.00/Optimized)
U$EC-RBS (0.75/1.00)
U$EcoRI (0.75/1.00)
U$HindIII (0.75/1.00)
U$Kozak (0.75/Optimized)
U$KpnI (0.75/1.00)
U$MluI (0.75/1.00)
U$NcoI (0.75/1.00)
U$NdeI (0.75/1.00)
U$NheI (0.75/1.00)
U$NotI (0.75/1.00)
U$NsiI (0.75/1.00)
U$PflMI (0.75/1.00)
U$PmeI (0.75/1.00)
U$PolyAsig (0.75/1.00)
U$PstI (0.75/1.00)
U$SacI (0.75/1.00)
U$SacII (0.75/1.00)
U$SalI (0.75/1.00)
U$SbfI (0.75/1.00)
U$SfiI (0.75/1.00)
U$SgfI (0.75/1.00)
U$SmaI (0.75/1.00)
U$SnaBI (0.75/1.00)
U$SpeI (0.75/1.00)
U$Splice-A (0.75/Optimized)
U$Splice-D (0.75/Optimized)
U$XbaI (0.75/1.00)
U$XcmI (0.75/1.00)
U$XhoI (0.75/1.00)
ALL vertebrates.lib

Strategy for Removal of Sequence Motifs

The undesired sequence motifs specified above were removed from the starting gene sequence by selecting alternate codons that allowed retention of the specified protein and flanking sequences. Alternate codons were selected in a way to conform to the overall codon selection strategy as much as possible.

General Steps:
Identify undesired sequence matches with MatInspector using matrix family subset "neo+hyg" or "neo+hyg-EC" and with ModelInspector using default settings.
Identify possible replacement codons to remove undesired sequence matches with SequenceShaper (keep ORF).
Incorporate changes into a new version of the synthetic gene sequence and re-analyze with MatInspector and ModelInspector.

Specific Steps:
First try to remove undesired sequence matches using subset "neo+hyg-EC" and SequenceShaper default remaining thresholds (0.70/Opt-0.20).
For sequence matches that cannot be removed with this approach use lower SequenceShaper remaining thresholds (e.g. 0.70/Opt-0.05).
Use subset "neo+hyg" to check whether problematic E. coli sequence matches were introduced, and if so try to remove them using an analogous approach to that described above for non E. coli sequences.

Use an analogous strategy for the flanking (non-ORF) sequences. Final check with subset "pGL4-072503" after change in flanking cloning sites.

After codon optimizing neo and hyg, hneo and hhyg were obtained. Regulatory sequences were removed from hneo and hhyg yielding hneo-1F and hhyg-1F (the corresponding sequences without flanking regions are SEQ ID Nos. 38 and 30, respectively). Regulatory sequences were removed from hneo-1F and hhyg-1F yielding hneo-2F and hhyg-2F (the corresponding sequences without flanking regions are SEQ ID Nos. 39 and 42, respectively). Regulatory sequences were removed from hneo-2F and hhyg-2F yielding hneo-3F and hhyg-3F. Hneo-3F and hhyg-3F were further modified by altering 5' and 3' cloning sites yielding hneo-3FB and hhyg-3FB:

hneo-3 (after 3rd round of sequence removal, subset "neo+hyg") has the following sequence:

(SEQ ID NO: 4)
CCACTCCGTGGCCACCATGATCGAaCAaGAGGGCCTCCAtGCtGGCAGtC
CCGCagctTGGGTcCAaCGCtTGTTTCGGgTACGACTGGGCCCAGCAGAG
CATCGGaTGtAGCGAtGCgGCCGTGTTCCGtcTaAGCGCtCAagGCCGgC
CCGTGCTGTTCGTGAAGACCGACCTGAGCGGCGCCCTGAACGAGCTtCAa
GACGAGGCtGCCCGCCTGAGCTGGCTGGGCACCACCGGtGTaCCCTGCGC
CGGtcTGtTGGAtcTtcTGACCGAagCCGGCCGgGACTGGCTGCTGCTGG
GCGAGGTCCCtGGCCAGGAtGTGCTGAGCAGCCACCTtGCCCCCGCtGAG
AAGGTttcCATCATGGCCGAtcGaATGCGgCGCCTGCACACCGTGGACCC
CGCtACaTGCCCCTTCGACCACCAGGCtAAGCAtCGgATCGAGCGtcCtC
GgACCCGCATGGAGGCCGGCCTGGTGGACCAGGACGACGTGGACGAGGAG
GAtCAGGGCCTGGCCCGCGCtGAaCTGTTCGCCCGGCTGAAaGCCGGCAT
GGCgCACGGtGAGGACCTGGTtcTGACaCAtGGtGAtGCCTGCCTcCCtA
ACATCATGGTcGAGAAtGGCCGCTTTCtcCGGCTTCATCGACTGCGGtCG
CGTagGaGTtGGGGACCGCTACCAGGACATCGCCCTGGCCACCCGCGACA
TCGCtGAGGAGGTtGGGGGCGAGTGGGCCGACCGCTTCtTaGTctTGTAC
GGCATCGCaGCtCGCGACAGCGAGCGCATCGCCTTCTACCGGCTGCTcGA
CGAGTTCTTtTAATGACCAGgCTCTGG;

hneo-3FB (change PflMI sites to SbfI at 5' end and AflII at 3' end) has the following sequence:

(SEQ ID NO: 5)
cctgcaggCCACCATGATCGAAGAAGACGGCCTCCATGCTGGCAGTCCCG
60 For sequence matches that still cannot be removed, try CAGCTTGGGTCGAACGCTTGTTCGGGTACGACTGGGCCGAGCAGACCATC different combinations of manually chosen replacement codons (especially if more than 3 base changes might be GGATGTAGCGATGCGGCCGTGTTCCGTCTAAGCGCTCAAGGCCGGCCCGT needed). If that introduces new sequence matches, try to GCTGTTCGTGAAGACCGACCTGAGCGGCGCCCTGAACGAGCTTTCAAGAC remove those using the steps above (a different starting 65 sequence Sometimes allows a different removal Solu GAGGCTGCCCGCCTGAGCTGGCTGGCCACCACCGGTGTACCCTGCGCCGC tion).
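
The core idea of the removal strategy (choose alternate codons that keep the encoded protein while eliminating an undesired sequence match) can be illustrated with a minimal Python sketch. This is not the MatInspector/SequenceShaper procedure: it assumes the undesired motif is a literal string such as a restriction site, rather than a weight-matrix match scored against core and matrix similarity thresholds, it ignores the overall codon selection strategy, and the function names (`translate`, `synonymous`, `count_hits`, `remove_motif`) are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(orf):
    """Translate an ORF whose length is a multiple of 3."""
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf), 3))


def synonymous(codon):
    """All codons (including the input) that encode the same amino acid."""
    aa = CODON_TABLE[codon]
    return [c for c, a in CODON_TABLE.items() if a == aa]


def count_hits(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))


def remove_motif(orf, motif):
    """Return an ORF encoding the same protein without motif, or None.

    Assumes len(orf) is a multiple of 3. For each hit, every combination of
    synonymous codons overlapping the hit is tried, and the candidate with
    the fewest base changes that lowers the total hit count is kept.
    """
    orf, motif = orf.upper(), motif.upper()
    while motif in orf:
        pos = orf.find(motif)
        first, last = pos // 3, (pos + len(motif) - 1) // 3
        original = [orf[3 * i:3 * i + 3] for i in range(first, last + 1)]
        hits = count_hits(orf, motif)
        best = None
        for combo in product(*(synonymous(c) for c in original)):
            candidate = orf[:3 * first] + "".join(combo) + orf[3 * last + 3:]
            if count_hits(candidate, motif) >= hits:
                continue  # hit not removed, or a new hit was introduced
            changes = sum(a != b for a, b in zip(candidate, orf))
            if best is None or changes < best[0]:
                best = (changes, candidate)
        if best is None:
            return None  # no synonymous solution for this hit
        orf = best[1]
    return orf


if __name__ == "__main__":
    toy_orf = "ATGGAATTCTAA"  # toy example, not a sequence from this document
    print(remove_motif(toy_orf, "GAATTC"))
```

Requiring the hit count to drop strictly on every pass mirrors the re-analysis step after each change and guarantees that the loop terminates. A `None` result corresponds to the case where manually chosen replacement codons or a different starting sequence would have to be tried.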

Position in ORF: -7 to 11

2) V$PAX5
Family: PAX-5/PAX-9 B-cell-specific activating proteins (4 members)
Best match: B-cell-specific activating protein
Ref: MEDLINE 94010299
Position in ORF: 271 to 299

3) V$AREB
Family: Atp1a1 regulatory element binding (4 members)
Best match: AREB6
Ref: MEDLINE 96061934
Position in ORF: 310 to 322

4) V$VMYB
Family: AMV-viral oncogene (2 members)
Best match: V-Myb
Ref: MEDLINE 94147510
Position in ORF: 619 to 629

Other sequences remaining in hneo-3F included one E. coli RBS 8 bases upstream of Met (ORF position 334 to 337); hneo-3FB included a splice acceptor site (+) and PstI site as part of a 5' cloning site for SbfI, and one E. coli RBS 8 bases upstream of Met (ORF position 334 to 337); hhyg-3F had no other sequence matches; and hhyg-3FB included a splice acceptor site (+) and PstI site as part of a 5' cloning site for SbfI. Subsequently, regulatory sequences were removed from hneo-3F and hhyg-3F yielding hneo-4 and hhyg-4. Then regulatory sequences were removed from hneo-4 yielding hneo-5.

TABLE 14

Gene name    TF binding sequences    Promoter modules*
             5'F/ORF/3'F             5'F/ORF/3'F
Neo          -/53/-                  -/0/-
hneo-F       1,612                   0/2/0
hneo-3F      0/0/0                   0/0/0
hneo-3FB     0/0/0                   0/0/0
Hyg          -/74/-                  -/3/-
hhyg-F       1941                    0/4/0
hhyg-3F      1/3/0                   0/0/0
hhyg-3FB     1/3/0                   0/0/0

* Promoter modules are defined as a composite regulatory element, with 2 transcription factor binding sites separated by a spacer, which has been shown to exhibit synergistic or antagonistic function.

Table 15 summarizes the identity of various genes.

TABLE 15

Pairwise identity of different gene versions
Comparisons were of open reading frames (ORFs).

             neo   hneo   hneo-3   hneo-4   hneo-5   Final hNeo
Neo                79     78       78       78       77
hneo                      90       90       90       89
hneo-3                             100      99       98
hneo-4                                      99       98
hneo-5                                               99
Final hNeo

             hyg   hhyg   hhyg-3   hHygro   hhyg-4   Final hHyg
Hyg                79     78       73       76       78
hhyg                      88       83       86       88
hhyg-3                             94       96       98
hHygro                                      96       94
hhyg-4                                               97
Final hHyg

                 Percent Identity
                 1       2
Divergence  1            82.2    1    Synthetic puro - SEQ ID NO: 11
            2    19.6            2    Starting puro - SEQ ID NO: 15
                 1       2

An expression cassette (hNeo-cassette) with a synthetic neomycin gene flanked by a SV40 promoter and a synthetic poly(A) site is shown below.

(SEQ ID NO: 44)
GGATCCGTTTGCGTATTGGGCGCTCTTCCGCTGATCTGCGCAGCACCATG
GCCTGAAATAACCTCTGAAAGAGGAACTTGGTTAGCTACCTTCTGAGGCG
GAAAGAACCAGCTGTGGAATGTGTGTGAGTTAGGGTGTGGAAAGTCCCCA
GGCTCCCCAGCAGGCAGAAGTATGCAAAGCATGCATCTCAATTAGTCAGC
AACGAGGTGTGGAAAGTCCCCAGGGTCCCCAGCAGGCAGAAGTATGCAAA
GCATGCATCTGAATTAGTCAGCAACGATAGTCCCGCCCCTAACTCGGCCC
ATGCCGCCCCTAACTCCGCCCAGTTCCGCCCATCTCCGCCCCATGGCTGA
CTAATTTTTTTATTTATGCAGAGGCCGAGGCCGCCTCTGCCTCTGAGCTA
TTTCCAGAAGTAGTGAGGAGGCTTTTTTGGAGGCCTAGGCTTTGCAAAAA
GCTCGATTTCTTCTGACACTAGCGCCACCATGATCGAACAAGACGGCCTC
CATGCTGGCAGTCCCGCAGCTTGGGTCGAACGCTTGTTCGGGTACGACTG
GGCCCAGCAGACCATCGGATGTAGCGATGCGGCCGTGTTCCGTCTAAGCG
CTCAAGGCCGGCCCGTGCTGTTCGTGAAGACCGACCTGAGCGGCGCCCTG
AACGAGCTTCAAGACGAGGCTGCCCGCGTGAGCTGGCTGGCCACCACCGG
CGTACCCTGCGCCGCTGTGTTGGATGTTTGTGACCGAAGCCGGCCGGGAC
TGGCTGCTGCTGGGCGAGGTCCCTGGCCAGGATCTGCTGAGCAGCCACCT
TGCCCCCGCTGAGAAGGTTTCTATCATGGCCGATGCAATGCGGCGCCTGC
ACACCCTGGACCCCGCTACCTGCCCCTTCGACCACCAGGCTAAGCATCGG
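
Pairwise identities such as those reported in Table 15 can be illustrated with a short Python sketch. The document does not state how the percentages were computed, so this sketch makes an explicit assumption: the compared ORFs encode the same protein, have the same length, and are compared position by position without gaps. The function names (`percent_identity`, `identity_table`) and the toy sequences are hypothetical and are not taken from the sequence listing.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent of identical positions between two equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def identity_table(seqs):
    """Upper-triangle identities (rounded) for a dict of name -> sequence."""
    return {
        (m, n): round(percent_identity(seqs[m], seqs[n]))
        for m, n in combinations(seqs, 2)
    }


if __name__ == "__main__":
    # Toy ORFs encoding the same peptide with different codon choices.
    toy = {
        "v0": "ATGGAATTCTAA",
        "v1": "ATGGAGTTCTAA",
        "v2": "ATGGAGTTTTAA",
    }
    for pair, value in identity_table(toy).items():
        print(pair, value)
```

Only the upper triangle is produced, matching the layout of Table 15, where each gene version is compared once with every later version.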

US 7,728,118 B2 83 84 and - Continued ggcgagaaaatggtgcttgagaataact tctt.cgt.cgagac catgct coc SEO ID NO: 13) atgattgaacaagatggattgcacgcaggttctic cqgcc.gcttgggtgga aa.gcaagat catgcggaaactggagcctgaggagttcgctgcctacctgg gaggct attcggctatgactgggcaca acaga caatcggctgctctgatg agcc attcaaggaga agggcgaggittagacggcct accctic to Ctggcct cc.gc.cgtgttc.cggctgtcagcgcaggggcgc.ccggttcttitttgtcaag cgc.gagat CCct Ctcgitta agggaggcaa.gc.ccgacgt.cgt.ccagattgt accgacctgtc.cggtgc cctgaatgaactgcaggacgaggcagcgcggct cc.gcaact acaacgc.ctacct tcgggc.cagcgacgatctgcct aagatgt 10 atcgtggctggccacgacgggcgttcCttgcgcagctgtgcticgacgttg to atcgagtc.cgaccCtgggttcttitt C caacgctattgtcgagggagct t cactgaagcgggaagggactggctgctattgggcgaagtgc.cggggcag aagaagttcc ctaac accgagttcgtgaaggtgaagggcct coactt cag gatctoctdt catcto accttgctic ctdcc.gagaaagtat coat catggc cc aggaggacgct coagatgaaatgggtaagtacat caa.gagctt.cgtgg 15 tgatgcaatgcggcggctgcatacgcttgatc.cggctacctgcc catt.cg agcgcgtgctgaagaacgagcagtaa (neo-hirl-fusion; . accaccaa.gcgaaa.catcgcatcgagcgagcacgtact cqgatggaa.gc.c EXAMPLE 5 ggtcttgtcgat caggatgatctggacgaagagcatcaggggg.tc.gc.gc.c agc.cgaactgttcgc.caggct Caaggcgc.gcatgc.ccgacggcgaggat c Transcription Factor Binding Sites Used to Identify Sites in Selected Synthetic Sequences tcgt.cgtgacccatggcgatgcctgcttgc.cgaatat catggtggaaaat ggcc.gcttittctggatt catcgactgtggc.cggctgggtgtggcggaccg TF Binding Site Libraries 25 The TF binding site library (“Matrix Family Library') is citat caggacat agcgttggctaccc.gtgatattgctgaagagcttggcg part of the GEMS Launcher package. 
Table 16 shows the version of the Matrix Family Library which was used in the gcgaatgggctgaccgctt CCtcgtgctttacggitat.cgc.cgctic cc gat design of a particular sequence and Table 17 shows a list of all tcqcagcgcatcgc.ctt citat cqccttcttgacgagttctt caccggtgg vertebrate TF binding sites (“matrices”) in Matrix Family 30 Library Version 2.4, as well as all changes made to vertebrate tgggagcggaggtggcggat.c aggtgg.cggaggctCC9gaggggcttCca matrices in later versions up to 4.1 (section “GENOMATIX MATRIX FAMILY LIBRARY INFORMATION Versions aggtgtacgaccc.cgagcaacgcaaacgcatgat cactgggcct cagtgg 2.4 to 4.1.). (Genomatix has a copyright to all Matrix Library tgggct cqctgcaa.gcaaatgaacgtgctggact cott catcaac tact a Family information). 35 tgatt.ccgagaa.gcacgcc.gagaacgc.cgtgatttittctgcatggtaacg TABLE 16

Ctgcct coagctacctgtggaggcacgt.cgtgcct cacatcgagc.ccgtg Synthetic DNA sequence Genomatix Matrix Family Library gctagatgcatcatCcctgatctgatcggaatgggtaagt ccggcaa.gag pGLAB-NN3* Version 2.4 May 2002 40 luc2A8 and luc2B10 Version 3.0 November 2002 Version 3.1.1 April 2003 cgggaatggct catat cqc ct cotggat cact acaagtacct caccgctt hhyg3 Version 3.1.2 June 2003 hneos ggttcgagctgctgaacct tccaaagaaaatcatCtttgttgggcc acgac hhyg4 Version 3.3 August 2003 SpeI-NcoI-Ver2** Version 4.0 November 2003 tggggggcttgtctggc ctitt cact acticcitacgag caccalagacaagat hineo5 Version 4.1 February 2004 45 hpuro2 Caaggc catcgt.ccatgctgaga.gtgtcgtggacgtgatcgagtic ctggg *NotI-NcoI fragment in pCL4 including amp gene (pGL4B-NN3) acgagtggcctgacatcgaggaggatat cqccct gat Caagagcgaagag *Spel-Nicol-Ver2 (replacement for SpeI-NcoI fragment in pCL4B-NN3

TABLE 17

GENOMATICXMATRIX FAMILY LIBRARY INFORMATION Versions 2.4 to 4.1 Family Family Information Matrix Name Information VSAHRR AHR-arntheterodimers VSAHRARNT.O1 aryl hydrocarbon and AHR-related fArntheterodimers factors VSAHR.O1 aryl hydrocarbon dioxin receptor VSAHRARNT.O2 aryl hydrocarbon/Arnt heterodimers, fixed core VSAP1F AP1 and related factors VSAP1.01 AP1 binding site VSAP1.02 activator protein 1 VSAP1.03 activator protein 1 VSAP1FJ.01 activator protein 1 VSNFE2.01 NE-F2p45 VSVMAF.01 w-Maf US 7,728,118 B2 85 86


GENOMATIXMATRIX FAMILY LIBRARY INFORMATION Versions 2.4 to 4.1 Family Family Information Matrix Name Information VSTCF11MAFG.01 TCF11.MafG heterodimers, binding to subclass of AP1 sites VSBEL1.01 Bel 1 similar region VSAP2F Activator Protein 2 VSAP2.01 activator protein 2 VSAP4R AP4 and Related VSAP4.01 activator protein 4 proteins VSAP4.02 activator protein 4 VSTH1E47.01 Thing1/E47 heterodimer TH1 bLH member specific expression in a variety of embryonic issues TAL1ALPHAE47.01 Tal alpha E47 heterodimer TAL1BETAE47.01 Tal heterodimer TAL1BETAITF2.01 Tal heterodimer AP4.O3 activator protein 4 VSAREB Atp1a1 regulatory AREB6.04 AREB6(Atp1a1 element binding regulatory element bin ing factor 6) V AR E B6.02 AREB6 (Atp1a1 regulatory element bin ing factor 6) WSAREB6.03 AREB6 (Atp1a1 regulatory element bin ing factor 6) WSAREB6.01 AREB6 (Atp1a1 regulatory element Oil ing factor 6) VSARP1 Apollipoprotein a and apo ipoprotein AI cIII gene Repressor regulatory protein 1 Protein VSBARB BARbiturate-Inducible WSBARBIE.O1 barbiturate-inducible E. box from element Pro--eukaryot. genes VSBCL6 POZ domain zinc VSBCL6.01 POZ protein, finger expressed in B transcriptional repressor, Cells translocations observed in diffuse large cell ymphoma VSBCL6.02 POZ zinc finger protein, transcriptional repressor, translocations observed in diffuse large cell ymphoma VSBRAC gene, T-Box factor 5 site mesoderm (TBX5), mutations developmental factor related to Holt-Oram syndrome BRACH.O1 Brachyury WSBRNF Brn POU domain BRN3.01 POU transcription factor factors BRN2O1 POU factor Brn-3 (N-Oct 3) WSCABL C- DNA binding CABL.O1 Mul ifunctional c-Abl Src sites type tyrosine kinase WSCART Cart-1 (cartilage XVENT2.01 Xenopus homeodomain homeoprotein 1) actor Xvent-2; early BM Psignaling response V CART1.01 Cart 1 (cartilage homeoprotein 1) VSCDXF Vertebrate caudal V CDX2.01 Cox 2 mammalian caudal related homeodomain ed intestinal transcr. 
protein actor VSCEBP Ccaat Enhancer CCAAT enhancer Binding Protein Oil ing protein beta CEBPO2 CE BP binding site VSCHOP CHOP binding protein CHOPO1 heterodimers of CHOP and C/EBPalpha WSCLOX CLOX and CLOX cut ike homeodomain homology (CDP) protein factors cut ike homeodomain protein US 7,728,118 B2 87 88


GENOMATIXMATRIX FAMILY LIBRARY INFORMATION Versions 2.4 to 4.1 Family Family Information Matrix Name Information VSCDP02 transcriptional repressor CDP VSCDPCR3.01 cut-like homeodomain protein VSCLOX.O1 Clox VSCMYB C-MYB, cellular VSCMYB.01 c-Myb important in transcriptional hematopoesis, cellular activator equivalent to avian myoblastosis virus oncogene v-myb VSCOMP factors which VSCOMP1.01 COMP 1, cooperates with COoperate with myogenic proteins in Myogenic Proteins multicomponent complex VSCOUP Repr. of RXR VSCOUPO1 COUP antagonizes HNF mediated activ. & 4 by binding site retinoic acid responses competition or synergizes by direct protein-protein interaction with HNF-4 CP2-erythrocyte Factor VSCP2.01 CP2 related to drosophila Ef1 VSCREB Camp-Responsive WSCREBP1.01 cAMP-responsive Element Binding element binding protein 1 proteins CREBP1CUN.O1 CRE-binding protein 1, c un heterodimer CREB.O1 cAMP responsive element binding protein hepatic leukemia factor E4BP4 bZIP domain, transcriptional repressor CREB.O2 cAMP responsive element binding protein CREB.O3 cAMP response element binding protein CREB.04 cAMP response element binding protein CREBP1.02 CRE-binding protein 1 ATF.O2 ATF binding site ATF.O1 activating transcription actor AXCREB.O1 Tax/CREB complex I AXCREB.O2 Tax/CREB complex V W-lil -myc activatoricell E E2F, involved in cell cycle regulator cycle regulation, interacts with Rb p107 protein V E2 F. 
O 3 E2F, involved in cell cycle regulation, interacts with Rb p107 protein E2F involved in cell cycle regulation, interacts with Rb p107 protein papillioma virus E2 WSE2.01 BPV bovine papilloma Transcriptional virus regulator E2 activator E2.02 papilloma virus regulator E2 VSEBOR E-BOx Related factors DELTAEF1.01 deltaFF1 XBP1.01 X-box-binding protein 1 VSEBOX E-BOX binding factors USF.O2 upstream stimulating actor upstream stimulating actor MYCMAX.O3 MYC-MAX binding sites SREBPO3 Sterol regulatory element binding protein SREBPO2 Sterol regulatory element binding protein MYCMAX.O2 c-Myc/Max heterodimer NMYC.O1 N-Myc ATF 6.01 Member of b-zip family, induced by ER damage stress US 7,728,118 B2 89 90


GENOMATIXMATRIX FAMILY LIBRARY INFORMATION Versions 2.4 to 4.1 Family Family Information Matrix Name information USFO1 upstream stimulating actor MYCMAXO1 c-Myc/Max MAXO1 Max ARNTO1 AhR nuclear translocator homodimers SREBPO1 Sterol regulatory element binding protein 1 and 2 VSECAT Enhancer-CoAaT NFYO2 nuclear factor Y (Y-box binding binding factor) factors NFYO3 nuclear factor Y (Y-box binding factor) nuclear factor Y (Y-box binding factor) VSEGRF EGR nerve growth Egr-1/KroX-24/NGFI-A Factor Induced protein immediate-early gene C & rel. fact. product V EGR2.01 Egr-2/KroX-20 early growth response gene product EGR3.01 early growth response gene 3 product NGFIC.O1 nerve growth factor Sup presor induced protein C Wilms Tumor VSEKLF Erythroid krueppel like Erythroid krueppel like factor actor (EKLF) VSETSF Human and murine CETS1PS401 c-Ets-1 (p54) ETS 1 Factors NRF2.01 nuclear respiratory factor 2 GABPO1 GABP:GA binding protein ELK1.02 Elk-1 FLI.O1 ETS family member FLI ETS2.01 c-Ets-2 binding site ETS101 c-EtS-1 binding site ELK1.01 Elk-1 PU1.01 Pu.1 (Pu120) Ets-like transcription factor identified in lymphoid B cells VSEVI1 EVI1-myleoid Ecotropic vira transforming protein integration site 1 encoded Ecotropic vira integration site 1 encoded

integration site 1 encoded

integration site 1 encoded

integration site 1 encoded

integration site I encoded 800 VSFKHD Fork Head Domain HFEH1.01 HNF-3/Fkh Homolog 1 factors HFEH2.01 HNF-3/Fkh Homolog 2 s HFEH3.01 HNF-3/Fkh Homolog 3 (=Freac-6) HFEH8.01 HNF-3/Fkh Homolog-8 y XFD1.01 Xenopus fork head domain factor 1 XFD2.01 Xenopus fork head domain factor 2 XFD3.01 Xenopus fork head domain factor 3 Hepatocyte Nuclear Factor 3beta US 7,728,118 B2 91 92


GENOMATIXMATRIX FAMILY LIBRARY INFORMATION Versions 2.4 to 4.1 Family Family Information Matrix Name Information V FREAC2.01 Forkhead RElated ACtivator-2 Forkhead RElated FREAC3.01 ACtivator-3 Forkhead RElated FREAC4O1 ACtivator-4 Forkhead RElated FREACT.O1 ACtivator-7 complex of Limo2 bound to Tal-1, E2A proteins, and GATA-1, half-site 2 GATA104 GATA-binding factor 1 GATA1.05 GATA-binding factor 1 GATA2O1 GATA-binding factor 2 GATA2O2 GATA-binding factor 2 VSGATA GATA binding factors GATA3.01 GATA-binding factor 3 GATA3.02 GATA-binding factor 3 GATAO1 GATA binding site (consensus) GATA103 GATA-binding factor 1 GATA101 GATA-binding factor 1 GATA102 GATA-binding factor 1 VSGFI1 Growth Factor GFI 1.01 growth factor Independence independence I Zinc transcriptional finger protein acts as repressor transcriptional repressor Gut-enriched Krueppel V gut-enriched Krueppel Like binding Factor ike factor VSGREF Glucocorticoid GR E , responsive and related C2C2 zinc finger protein elements binds glucocorticoid dependent to GREs AR Androgene receptor PRE.O1 binding site WSHAML Human Acute AML 1.01 runt-factor AML-1 Myelogenous Leukemia factors HEAT HEATshock factors HSF1.01 1 y HEN1 E-box binding factor HEN1.01 HEN without transcript. HEN1.02 HEN activation HMTB Human muscle-specific MT muscle specific Mt Mt binding site binding site HNF1 Hepatic Nuclear Factor HNF1.01 hepatic nuclear factor 1 HNF1.02 Hepatic nuclear factor 1 Hepatic Nuclear Factor HNF4.01 Hepatic nuclear factor 4 4 HNF4.02 Hepatic nuclear factor 4 HOMS Homeodomain S8.01 Binding site for S8 type Subfamily S8 homeodmains HOXF Factors with moderate OXA9.01 Member of the vertebrate. 
activity to homeo HOX - cluster of domain consensus factors Sequence V OX1-3.01 Hox-1.3, vertebrate homeobox protein VSLKRS Ikaros Zinc finger LYF1.01 LyE-I (Ikaros 1), family enriched in B and T ymphocytes karos 2, potential regulator of lymphocyte differentiation karos 1, potential regulator of lymphocyte differentiation karos 3, potential regulator of lymphocyte differentiation VSIRFF Interferon Regulatory RF1.01 interferon regulatory Factors actor 1 RF2.01 interferon regulatory actor 2 SRE.O1 interferon-stimulated response element US 7,728,118 B2 93 94


GENOMATIXMATRIX FAMILY LIBRARY INFORMATION Versions 2.4 to 4.1 Family Family Information Matrix Name information VSLEFF LEF1 TCF WSLEF1.01 TCF/LEF-1, involved in he Wnt signal transduction pathway VSLTUP Lentiviral Tata TAACC.O1 Lentiviral TATA UPstream element upstream element VSMEF2 -myocyte MEF2.05 MEF2 specific enhancer MEF2.01 myogenic enhancer binding factor actor 2 HMEF2.01 myocyte enhancer factor MMEF2.01 myocyte enhancer f RSRFC4O1 related to serum response actor, C4 RSRFC4O2 related to serum response actor, C4 AMEF2.01 myocyte enhancer factor MEF2.02 myogenic MADS factor MEF-2 M EF2.03 myogenic MADS factor MEF-2 MEF2.04 myogenic MADS factor MEF-2 VSMEF3 MEF3 BINDING M EF3.01 MEF3 binding site SITES present in skeletal muscle-specific transcriptional enhancers WSMEIS Homeodomain factor V M EIS1.01 Homeobox protein aberrantly expressed in MEIS1 binding site myeloid leukemia VSMINI Muscle INItiator MUSCLE INI.O1 Muscle Initiator Sequence MUSCLE INI.O2 Muscle Initiator Sequence MUSCLE INI.O3 Muscle Initiator Sequence VSMOKF Mouse Krueppel like MOK2O1 Ribonucleoprotein factor associated Zinc finger protein MOK-2 VSMTF1 Metal induced MTF-101 Metal transcription factor transcription factor 1, MRE VSMYOD MYOblast MYOD.O2 myoblast determining Determining factor factor MYF5.01 Myf5 myogenic bHLH protein MYOD.O1 myoblast determination gene product complex of Limo2 bound to Tal-1, E2A proteins, and GATA-1, half-site 1 E47.01 MyoD/E47 and MyoD/E12 dimers E47.02 TAL1 E47 dimers VSMYOF MYOgenic Factors NF1.01 nuclear factor 1 MYOGNF1.01 nuclear factor 1 or related factors VSMYT1 Xenopus MYT1 C2HC MYT1.02 MyT1 Zinc finger Zinc finger protein transcription factor involved in primary neurogenesis VSMYT1.01 MyTi zinc finger transcription factor involved in primary neurogenesis VSMZF1 Myeliod Zinc Finger 1 WSMZF1.01 MZF1 factors VSNFAT Nuclear Factor of WSNFAT.O1 Nuclear factor of Activated T-cells activated T-cells VSNFKB Nuclear Factor Kappa V CREL.O1 c-Rel Bic-rel 
NFKAPPAB.O1 NF-kappaB NFKAPPAB6S.O1 NF-kappaB (p65) NFKAPPABSO.O1 NF-kappaB (p50) NFKAPPAB.O2 NF-kappaB NFKAPPAB.O3 NF-kappaB US 7,728,118 B2 95 96


GENOMATIXMATRIX FAMILY LIBRARY INFORMATION Versions 2.4 to 4.1 Family Family Information Matrix Name Information VSNKXH NKX- Homeodomain VSNKX25.01 homeo domain factor NkX-2.5/CSX, tinman homolog, high affinity sites VSNKX25.02 homeo domain factor NkX-2.5/CSX, tinman homolog low affinity sites VSNKX31.01 prostate-specific homeodomain protein NKX3.1 VSNOLF Neuron-specific WSOLF.101 olfactory neuron-specific OLFactory factor factor VSNRSF Neuron-restrictive WSNRSFO1 neuron-Restrictive Silencer Factor silencer factor WSNRSE.O1 neural-restrictive silencer-element VSOAZF Olfactory associated Rat C2H2 Zn finger Zinc finger protein protein involved in olfactory neuronal differentiation VSOCT1 OCTamer binding OCT1.02 octamer-binding factor 1 protein OCT1.06 octamer-binding factor 1 s OCTO1 Octamer binding site OCT1/OCT2 consensus) OCT1.OS octamer-binding factor 1 OCT1.04 octamer-binding factor 1 OCT1.03 octamer-binding factor 1 : OCT1.01 octamer-binding factor 1 VSOCTB OCT6 Binding factors astrocytes + V TST1.01 POU-factor Tst-1 Oct-6 glioblastoma cells VSOCTP OCT1 binding factor VSOCT1PO1 octamer-binding factor 1, (POU-specific domain) POU-specific domain tumor suppr.-neg. VSP53.01 tumor Suppressor p53 regulat. of the tumor Suppr. Rb VSPAX1 PAX-1 binding site WSPAX1.01 PaX-1 paired domain protein expressed in the developing vertebral column of mouse embryos VSPAX3 PAX-3 binding sites VSPAX3.01 Pax-3 paired domain protein expressed in embryogenesis, mutations correlate to Waardenburg Syndrome VSPAX4 Heterogeneous PAX-4 WSPAX40 PAX-4 paired domain binding sites protein, together with PAX-6 involved in pancreatic development VSPAX5 PAX-SPAX-9B WSPAX9.0 Zebrafish PAX9 binding cell-specific activating sites protein WSPAXS.O B cell specific activating protein WSPAXS.O2 B cell specific activating protein VSPAX6 Activ. 
involved in Iris WSPAX6.O Pax 6 paired domain development in the protein mouse eye VSPAX8 PAX-2/5/8 binding WSPAX8.0 PAX 2/5/8 binding site sites VSPBXF Homeo domain factor WSPBX1.0 homeo domain factor PBX-1 PbX-1 VSPCAT Promoter-CoAaT WSACAATO1 Avian C-type LTR binding factors CCAAT box WSCAATO cellular and viral CCAAT box WSCLTR CAATO1 Mammalian C-type LTR CCAAT box US 7,728,118 B2 97 98


GENOMATIXMATRIX FAMILY LIBRARY INFORMATION Versions 2.4 to 4.1 Family Family Information Matrix Name information VSPDX1 Pancreatic and V PDX1.01 Pdxl (IDX1/IPF1) intestinal pancreatic and intestinal homeodomain transcr. homeodomain TF factor SL1.01 Pancreatic and intestinal im-homeodomain factor VSPERO PEROxisome PPARA.O1 PPARRXRheterodimers proliferator-activated receptor VSPIT1 GHF-1 pituitary PIT1.01 Pitl, GHF-1 pituitary specific specific pou domain transcription factor transcription factor VSRARF for RARO1 , retenoic acid member of nuclear

RTRO1 -related estis-associated receptor (GCNF/RTR) Regulator of B-Cell BRIGHTO1 Bright, B cell regulator gH transcription of IgE transcnption RBPJ - kappa Mammalian transcriptional repressor RBP-Jkappa/CBF1 WSREBV Epstein-Barr virus EBVRO1 Epstein-Barr virus transcription factor R transcription factor R WSRORA and RORA1.01 RAR-related orphan rar-Rel. Orphan receptor alpha1 Receptor Alpha RORA2O1 RAR-related orphan receptor alpha2 ERO1 estrogen receptor WSRREB Ras-REsponsive RREB1.01 Ras-responsive element element Binding binding protein 1 protein RXRheterodimer V Famesoid X - activated binding sites receptor (RXR/FXR dimer) WDRRXRVitamin D receptor RXR heterodimer site WDRIRXRVitamin D receptor RXR heterodimer site Nuclear receptor involved in the regulation lipid homeostasis VSSATB Special AT-rich SATB1.01 Special AT-rich sequence binding sequence-binding protein protein 1, predominantly expressed in thymocytes, binds to matrix attachment regions (MARs) VSSEF1 SEF1 protein in mouse SEF1.01 SEF1 binding site Retrovirus SL3-3 VSSF1F Vertebrate SF1.01 SF1 steroidogenic factor VSSMAD Vertebrate SMAD SMAD3.01 Smad3 transcription family of transcription factor involved in TGF factors beta signaling SMAD4.O1 Smada transcription factor involved in TGF beta signaling FAST1.01 FAST 1 SMAD interacting protein VSSORY SOxisRY-Sextestis SOXS.O1 Sox 5 determinig and related y SRYO1 sex determining region Y HMG Box factors gene product HMGIYO1 HMGICY) high-mobility group protein I (Y), architectural transcription factor organizing the framework of a nuclear protein-DNA transcriptional complex US 7,728,118 B2 99 100


GENOMATIXMATRIX FAMILY LIBRARY INFORMATION Versions 2.4 to 4.1 Family Family Information Matrix Name Information WSSOX9.01 SOX (SRY-related HMG box) VSSP1F GC-Box WSSP1.01 stimulating protein 1 factors SP1. GC SP1, ubiquitous zinc finger transcription factor WSGC.O1 GC box elements VSSRFF Serum Response WSSR.F.O2 element binding Factor WSSR.F.O3 serum responsive factor WSSRFO1 serum response factor VSSTAT Signal Transducer and WSSTAT.O1 signal transducers and Activator of Transcript. activators of transcription factors WSSTATS.O1 STAT5: signal transducer and activator of transcription 5 WSSTAT6.01 STAT6: signal transducer and activator of transcnption 6 WSSTAT1.01 signal transducer and activator of transcription 1 WSSTAT3.01 signal transducer and activator of transcription 3 Viral homolog of WST3RO1 vErbA, viral homolog of thyroid hormon thyroid receptor alpha1 (AEV alpha1 vErbA) Tata-Binding Protein VSTATA.O2 Mammalian C-type LTR Factor TATA box VSATATAO1 Avian C-type LTR TATA box VSTATA.O1 cellular and viral TATA TATA box WSMTATA.O1 Muscle TATA box VSTCFF TCF11 transcription WSTCF 11.01 TCF11, KCR-F1 Nrf Factor homodimers VSTEAF TEAATTS DNA VSTEF1.01 TEF-1 related muscle binding domain factors factor VSTTFF Thyroid transcription VSTTF1.01 Thyroid transcription factor-1 factor 1 (TTF1) binding site chicken Vitellogenin VSVBPO1 PAR-type chicken gene Binding Protein vitellogenin promoter factor binding protein VSVMYB AMV-viral myb Oncogene Winged Helix and ZF5 WSWHN.O1 winged helix protein, binding sites involved in hair keratinization and thymus epithelium differentiation X-box binding Factors WSRFX1.01 X-box binding protein RFX1 WSRFX1.02 X-box binding protein RFX1 WSMIF1.01 MIBP-1/RFX1 complex VSXSEC Xenopus SEleno WSSTAFO2 Se-Cys tRNA gene Cystein t-RNA transcription activating activiating factor factor WSSTAFO1 Se-Cys tRNA gene transcription activating factor activator/repressor VSYY101 Yin and Yang 1 binding to transcr. init. 
site Zinc binding protein WSZBP89.01 Zinc finger transcription factor factor ZBP-89 VSZFIA ZincRinger with WSZID.O1 Zinc finger with InterAction domain interaction domain factors (C) Genomatix Software GmbH 1998–2002 - All rights reserved. US 7,728,118 B2 101 102 B. Chances from Family Library Version 2.4 to Version 3.0 Matrix Family Library Version 3.0 (November 2002) con tains 452 weight matrices in 216 families (Vertebrates: 314 matrices in 128 families) 5 New Weight Matrices Vertebrates

Name Family Information Matrix Name Matrix information VSAP1F AP1 and related VSBACH1.01 BTBPOZ-b2IP factors transcription factor BACH1 forms heterodimers with the Small Mafprotein family VSCIZF CAS interating zinc WSNMP4O1 NMP4 (nuclear matrix finger protei protein 4). CIZ (Cas interacting zinc finger protein) VSCREB Camp-Responsive VSATF 6.02 Activating transcription Element Binding actor 6, member of b-zip proteins amily, induced by ER StreSS VSE4FF Ubiquitous GLI- WSE4FO1 GLI-Krueppel-related Krueppel like Zinc transcription factor, finger involved in regulator of adenovirus cell cycle regulation E4 promoter WSGFI1 Growth Factor WSGf1B.O1 Growth factor independence- independence 1 Zinc transcriptional finger protein Gfi-1B repressor VSGLIF GLI zinc finger WSGLI101 Zinc finger transcription amily actor GLI1 WSHAMIL Human Acute WSAML3.01 Runt-related transcription Myelogenous actor 2/CBFA1 (core Leukemia factors binding factor, runt domain, alpha Subunit 1) VSHESF Vertebrate homologues V&HES1.01 Drosophila hairy and of enhancer of split enhancer of split complex homologue 1 (HES-1) VSHIFF Hypoxia inducible WSHIF1.01 Hypoxia induced factor-1 factor, bHLHPAS (HIF-1) protein family WSHIF1.02 Hypoxia inducible factor, bHLH/PAS protein amily VSHNF6 Onecut WSHNF6.O1 Liver enriched Cut - Homeodomain Homeodomain factor HNF6 transcription factor HNF6 (ONECUT) VSHOXF Factors with WSCRX.O1 Cone-rod homeobox moderate activity to containing transcription homeo domain actoriotx-like consensus sequence homeobox gene WSEN1.01 Homeobox protein (en-1) VSPTX101 Pituitary Homeobox 1 (Ptx1) VSIRFF interferon WSIRF3.01 interferon regulatory Regulatory Factors actor 3 (LRF-3) WSLRF7.01 interferon regulatory actor 7(IRF-7) VSMAZF Myc associated zinc WSMAZ.O1 Myc associated Zinc fingers finger protein (MAZ) WSMAZRO1 MYC-associated zinc finger protein related transcription factor VSMEIS Homeodomain WSMEIS1.01 Binding site for actor aberrantly monomeric Meis 1 expressed in homeodomain 
protein myeloid leukemia VSMITF Microphthalmia WSMITO1 MIT (microphthalmia transcription factor transcription factor) and US 7,728,118 B2 103 104

-continued Name Family Information Matrix Name Matrix information VSMOKF Mouse Krueppel WSMOK2O2 Ribonucleoprotein like factor associated zinc finger protein MOK-2 (human) VSNEUR NeuroD, Beta2, WSNEUROD1.01 DNA binding site for HLH domain NEUROD1 (BETA-2/ E47 dimer) VSNF1F Nuclear Factor 1 WSNF1.02 Nuclear factor 1 (CTF1) VSNXXH NKXNDLX WSDLX101 DLX-1, -2, and -5 binding Homeodomain sites sites WSDLX3.01 Distal-less 3 homeodomain transcription facto WSHMX3.01 H6 homeodornain HMX3.NkxS.1 transcription factor WSMSXO1 Homeodoinain proteins MSX-1 and MSX-2 WSMSX2.01 Muscle segment homeo box 2, homologue of Drosophila (HOX8) VSNRLF Neural retina WSNRL.O1 Neural retinal basic eucine Zipper factor (bZIP) VSPARF PAR/bZIP family VSDBPO1 Albumin D-box binding VSPBXC PBX1 - MEIS1 VSPBX1 MEIS1.01 Binding site for a complexes PbX1 Meis1 heterodimer VSPBX1 MEIS1.02 Binding site for a PbX1 Meis1 heterodimer VSPBX1 MEIS1.03 Binding site for a PbX1 Meis1 heterodimer VSPLZF C2H2 zinc finger WSPLZF.O1 Promyelocytic leukemia protein PLZF Zink finger (TF with nine Krueppel-like Zink fingers) VSPXRF VSPXRCAR.01 Halfsite of PXR (pregnane X receptor)/RXR resp. CAR (constitutive androstane receptor)/RXR heterodimer binding site VSRQRA v-ERB and rar- WSNBRE.O1 Monomers of the nur related Orphan Subfamily of nuclear Receptor Alpha receptors (nur 77, nurr1, nor-1) VSSF1F Vertebrate VSFTF.O1 Alpha (1)-fetoprotein steroidogenic factor transcription factor (FTF), liver receptor homologue 1 (LHR-1) VSSIXF Sine oculis (SIX) WSSIX3.01 SIX3/SIXdomain (SD) homeodomain and Homeodomain (HD) factors transcription factor VSTALE TALE WSTGIFO1 TG-interacting factor Homeodomain class belonging to TALE class recognizing TG of homeodomam factors motives WSZFSF ZFSPOZ domain WSZF5.01 Zinc finger? POZ domain Zinc finger transcription factor

Weight Matrices Renamed
V$MEIS1.01 renamed to V$MEIS1_HOXA9.01

Weight Matrices Moved to Other Families
V$BEL1.01 moved from V$AP1F to V$BEL1
V$NF1.01 moved from V$MYOF to V$NF1
V$ER.01 moved from V$RORA to V$EREF
V$T3R.01 moved from V$T3RH to V$RORA
V$CLTR_CAAT.01 moved from V$PCAT to V$RCAT
V$FAST1.01 moved from V$SMAD to V$FAST

Weight Matrices Removed
V$MUSCLE_INI.03

C. Changes from Family Library Version 3.0 to Version 3.1

Matrix Family Library Version 3.1 contains 456 weight matrices in 216 families (Vertebrates: 318 matrices in 128 families)

New Weight Matrices Vertebrates

Family | Family Information | Matrix Name | Matrix Information
V$LEFF | LEF1/TCF | V$LEF1.02 | TCF/LEF-1, involved in the Wnt signal transduction pathway
V$PAX2 | PAX-2 binding sites | V$PAX2.01 | Zebrafish PAX2 paired domain protein

Version 3.1.2 (June 2003)
Matrix V$GFI1B.01 corrected.

D. Changes from Family Library Version 3.1 to Version 3.3

Matrix Family Library Version 3.3 (August 2003) contains 485 weight matrices in 233 families (Vertebrates: 326 matrices in 130 families)

New Weight Matrices Vertebrates

Family | Family Information | Matrix Name | Matrix Information
V$EREF | Estrogen Response Elements | V$ER.02 | Canonical palindromic estrogen response element (ERE)
V$SP1F | GC-Box factors SP1/GC | V$BTEB3.01 | Basic transcription element (BTE) binding protein, BTEB3, FKLF-2
V$CDEF | Cell cycle regulators: Cell cycle dependent element | V$CDE.01 | Cell cycle-dependent element, CDF-1 binding site (CDE/CHR tandem elements regulate cell cycle dependent repression)
V$CHRF | Cell cycle regulators: Cell cycle homology element | V$CHR.01 | Cell cycle gene homology region (CDE/CHR tandem elements regulate cell cycle dependent repression)
V$HIFF | Hypoxia inducible factor, bHLH/PAS protein family | V$CLOCK_BMAL1.01 | Binding site of Clock/BMAL1 heterodimer, NPAS2/BMAL1 heterodimer
V$FKHD | Fork Head Domain factors | V$FKHRL1.01 | Fkh-domain factor FKHRL1 (FOXO)
V$P53F | p53 tumor suppr.-neg. regulat. of the tumor suppr. Rb | V$P53.02 | Tumor suppressor p53 (5' half site)
 | | V$P53.03 | Tumor suppressor p53 (3' half site)

35 Weight Matrices Modified -continued VSGFI 1.01 Family Family Information Matrix Name Matrix Information E. Changes from Family Library Version 3.3 to Version 4.0 VSPAX5 it. 9 B VSPAX5.03 fAn Matrix Family Library Version 4.0 (November 2003) con activating protein tains 535 weight matrices in 253 families VSPAX6 PAX-4, PAX- VSPAX4 Pol PAX4 paired 6 paired domain binding (Vertebrates: 339 matrices in 136 families) domain binding sites site 45 VSPAX6.02 PAX6 paired New Weight Matrices Vertebrates domain and homeodomain are required for binding to this site VSZBPF Zinc binding protein VSZF9.01 Core promoter- Family Family Information Matrix Name Matrix Information factor binding protein so (CPBP) with 3 t Krueppel-type response element, Zinc fingers ATF4 binding site VSAP1R MAF and AP1 related VSBACH2.01 Bach2 bound TRE factors WSNFE2L2.01 Nuclear factor 55 (erythroid-derived Weight Matrices Modified 2)-like 2, NRF2 VSAMLI.01 VSCDXF Vertebrate caudal WSCDX1.01 intestine specific related homeodomain homeodomain factor VSAML3.01 protein -1 Weight Matrices Moved to Other Families VSDEAF R to deformed VSNUDRO1 NR E. VSARNT.01 moved from VSEBOX to VSHIFF (ARNT is 60 autoregulatory factor-1 transcriptional a synonym for HIF1 B) from D. melanogaster regulator protein VSETSF Human and murine VSELF2.01 ETS - family Weight Matrices Removed factors member ELF-2 VSSEF1.01 (NERF1a) VSOCT1.03 VSGABF GA-boxes VSGAGAO1 GAGA-Box 65 VSHNF1 Hepatic Nuclear Factor VSHNF1.03 Hepatic nuclear Version 3.1.1 (April 2003) 1 actor 1 Matrices VSIRF3.01 and VSIRF7.01 corrected. US 7,728,118 B2 107 108

-continued
Family | Family Information | Matrix Name | Matrix Information
V$HOXF | Factors with moderate activity to homeo domain consensus sequence | V$GSC.01 | Vertebrate bicoid-type homeodomain protein Goosecoid
V$LHXF | Lim homeodomain factors | V$LHX3.01 | Homeodomain binding site in LIM/Homeodomain factor LHX3
V$NKXH | NKX/DLX-homeodomain sites | V$NKX32.01 | Homeodomain protein NKX3.2 (BAPX1, NKX3B, Bagpipe homolog)
V$RBPF | RBPJ-kappa | V$RBPJK.02 | Mammalian transcriptional repressor RBP-Jkappa/CBF1
V$RP58 | RP58 (ZFP238) zinc finger protein | V$RP58.01 | Zinc finger protein RP58 (ZNF238), associated preferentially with heterochromatin

Weight Matrices Modified
V$GRE.01
V$NFY.03

Weight Matrices Moved to Other Families
V$BACH1.01 moved from V$AP1F to V$AP1R
V$NFE2.01 moved from V$AP1F to V$AP1R
V$TCF11MAFG.01 moved from V$AP1F to V$AP1R
V$VMAF.01 moved from V$AP1F to V$AP1R

F. Changes from Family Library Version 4.0 to Version 4.1

Matrix Family Library Version 4.1 (February 2004) contains 564 weight matrices in 262 families (Vertebrates: 356 matrices in 138 families)

New Weight Matrices Vertebrates

Family | Family Information | Matrix Name | Matrix Information
V$BNCF | Basonuclein rDNA transcription factor (PolI) | V$BNC.01 | Basonuclin, cooperates with USF1 in rDNA PolI transcription
V$CMYB | C-myb, cellular transcriptional activator | V$CMYB.02 | c-Myb, important in hematopoiesis, cellular equivalent to avian myoblastosis virus oncogene v-myb
V$CP2F | CP2-erythrocyte Factor related to drosophila Elf1 | V$CP2.02 | LBP-1c (leader binding protein-1c), LSF (late SV40 factor), CP2, SEF (SAA3 enhancer factor)
V$EKLF | Basic and erythroid Krueppel like factors | V$BKLF.01 | Basic krueppel-like factor (KLF3)
V$HAND | bHLH transcription factor dimer of HAND2 and E12 | V$HAND2_E12.01 | Heterodimers of the bHLH transcription factors HAND2 (Thing2) and E12
V$HIFF | Hypoxia inducible factor, bHLH/PAS protein family | V$DEC1.01 | Basic helix-loop-helix protein known as Dec1, Stra13 or Sharp2
V$HNF6 | Onecut Homeodomain factor HNF6 | V$OC2.01 | CUT-homeodomain transcription factor Onecut-2
V$HOXF | Factors with moderate activity to homeo domain consensus sequence | V$OTX2.01 | Homeodomain transcription factor Otx2 (homolog of Drosophila orthodenticle)
 | | V$GSH1.01 | Homeobox transcription factor Gsh-1
V$IRFF | Interferon Regulatory Factors | V$IRF4.01 | Interferon regulatory factor (IRF)-related protein (NF-EM5, PIP, LSIRF, ICSAT)
V$LHXF | Lim homeodomain factors | V$LMX1B.01 | LIM-homeodomain transcription factor
V$MYT1 | MYT1 C2HC zinc finger protein | V$MYT1L.01 | Myelin transcription factor 1-like, neuronal C2HC zinc finger factor 1
V$NEUR | NeuroD, Beta2, HLH domain | V$NEUROG.01 | Neurogenin 1 and 3 (ngn1/3) binding sites
V$VMYB | AMV-viral myb oncogene | V$VMYB.03 | v-Myb, viral myb variant from transformed BM2 cells
 | | V$VMYB.04 | v-Myb, AMV v-myb
 | | V$VMYB.05 | v-Myb, variant of AMV v-myb
V$ZBPF | Zinc binding protein factor | V$ZNF202.01 | Transcriptional repressor, binds to elements found predominantly in genes that participate in lipid metabolism

Weight Matrices Modified
V$CMYB.01
V$PTX1.01

Copyright (C) Genomatix Software GmbH 1998-2004 All rights reserved

EXAMPLE 6

Summary of Design for Particular Selectable Genes

TF Binding Sites and Search Parameters

Each TF binding site ("matrix") belongs to a matrix family that groups functionally similar matrices together, eliminating redundant matches by MatInspector professional (the search program). Searches were limited to vertebrate TF binding sites. Searches were performed by matrix family, i.e., the results show only the best match from a family for each site. MatInspector default parameters were used for the core and matrix similarity values (core similarity=0.75, matrix similarity=optimized).
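The family-based reporting described above (only the best match from a family for each site, with a core similarity cutoff of 0.75) can be illustrated with a minimal Python sketch. This is not the MatInspector implementation: the `Match` record, the `best_per_family` function, the treatment of a "site" as a run of overlapping matches from the same family, and all coordinates and scores are assumptions made for illustration. The per-matrix "optimized" matrix similarity thresholds are not modeled; matrix similarity is used only to rank matches within a site.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one TF binding site match."""
    family: str        # matrix family, e.g. "EGRF"
    matrix: str        # matrix name, e.g. "EGR3.01"
    start: int         # start position in the sequence (inclusive)
    end: int           # end position in the sequence (exclusive)
    core_sim: float    # core similarity score
    matrix_sim: float  # matrix similarity score


def best_per_family(matches, min_core=0.75):
    """Keep one match per family per site.

    Assumption: a "site" is a run of overlapping matches from the same
    family; the match with the highest matrix similarity represents it.
    """
    # Discard matches below the core similarity cutoff.
    eligible = [m for m in matches if m.core_sim >= min_core]
    # Sort so that groupby sees each family as one contiguous run.
    eligible.sort(key=lambda m: (m.family, m.start, m.end))

    kept = []
    for _, group in groupby(eligible, key=lambda m: m.family):
        best, site_end = None, None
        for m in group:
            if best is not None and m.start < site_end:
                # Same site: extend it and keep the better-scoring match.
                site_end = max(site_end, m.end)
                if m.matrix_sim > best.matrix_sim:
                    best = m
            else:
                # New site: flush the previous representative, if any.
                if best is not None:
                    kept.append(best)
                best, site_end = m, m.end
        if best is not None:
            kept.append(best)

    # Report in order of occurrence in the sequence.
    return sorted(kept, key=lambda m: m.start)


if __name__ == "__main__":
    demo = [
        Match("EGRF", "EGR3.01", 10, 25, 0.90, 0.85),
        Match("EGRF", "EGR1.01", 12, 27, 1.00, 0.92),
        Match("SP1F", "GC.01", 11, 25, 0.70, 0.95),
    ]
    for m in best_per_family(demo):
        print(m.family, m.matrix, m.start)
```

Grouping by family before resolving overlaps is what keeps the match counts in the tables that follow from being inflated by several related matrices hitting the same stretch of sequence.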

TABLE 18 TABLE 19-continued Gene Designations Sequences in Synthetic Hygromycin Genes TFBS in hbyg Matrix Sequence Description Library Before removal of TFBS from hyg (94 matches Family/matrix** Further Information A. Synthetic hygromycin gene V CREB, ATF 6.02 Activating transcription factor 6, member hyg from pcDNA3.1/Hygro Not 10 of b-zip family, induced by ER stress applicable V EGRFEGR3.01 early growth response gene 3 product hhyg humanized ORF Not V ZBPFZF9.01 Core promoter-binding protein (CPBP) applicable with 3 Krueppel-type Zinc fingers hhyg-1 First removal of undesired sequence matches Ver 3.1.2 June V HIFF, HIF1.02 Hypoxia inducible factor, bHLHIPAS 2003 protein family hhyg-2 Second removal of undesired sequence Ver 3.12 June 15 E2F, involved in cell cycle regulation, matches 2003 interacts with Rb p107 protein hhyg-3 Third removal of undesired sequence Ver 3.12 June AP4RAP4.01 Activator protein 4 matches 2003 HEN1 HEN1.02 HEN1 hHygro Changes to ORF and add linker Wer 3.3 MYODiE47.01 MyoD/E47 and MyoD/E12 dimers August 2003 EGRFEGR3.01 early growth response gene 3 product hhyg-4 Fourth removal of undesired sequence Wer 3.3 MOKFAMOK2.02 Ribonucleoprotein associated Zinc finger matches August 2003 protein MOK-2 (human) B. Synthetic neomycin gene SP1FGC.O1 GC box elements NRSFNRSE.O1 Neural-restrictive-silencer-element le:O from pCI-neo or psi STRIKE neo Not RORARORA2.01 RAR-related orphan receptor alpha2 applicable ZBPFZF9.01 Core promoter-binding protein (CPBP) (O humanized ORF Not 25 with 3 Krueppel-type Zinc fingers applicable ZFSFZF5.01 Zinc finger? 
POZ domain transcription hneo-1 First removal of undesired sequence matches Ver 3.1.2 June actor 2003 AHRRAHRANTO2 Aryl hydrocarbon/Arntheterodimers, fixed hneo-2 Second removal of undesired sequence Ver 3.12 June CO matches 2003 AP1FTCF11MAFG.O1 TCF11/MafGheterodimers, binding to hneo-3 Third removal of undesired sequence Ver 3.12 June 30 subclass of AP1 sites matches 2003 EKLFEKLF.O1 Erythroid krueppel like factor (EKLF) hneo-4 Changed 5' and 3' flanking regions cloning Ver 4.1 NRSFNRSFO1 Neuron-restrictive silencer factor sites February 2004 NRSFNRSE.O1 Neural-restrictive-silencer-element hneo-5 Fourth removal of undesired sequence Ver 4.1 EBOXFMYCMAX.O3 MYC-MAX binding sites matches February 2004 RXRFFXRE.O1 Farnesoid X - activated receptor C. Synthetic puromycin gene 35 (RXR/FXR dimer) V AHRRAHRARNT.O2 Aryl hydrocarbon/Arntheterodimers, fixed O from psi STRIKE puromycin Not CO applicable Winged helix protein, involved in hair npuro humanized ORF Not keratinization and thymus epithelium applicable differentiation hpuro-1 First removal of undesired sequence matches Ver 4.1 40 EGRFEGR1.01 Egr-1/KroX-24/NGFI-A immediate-early February 2004 gene product hpuro-2 Second removal of undesired sequence Ver 4.1 SMADSMAD3.01 Smad3 transcription factor involved in matches February 2004 TGF-beta signaling MOKFAMOK2.01 Ribonucleoprotein associated Zinc finger Note: the above sequence names designate the ORF only (except for Hhygro protein MOK-2 (mouse) which includes flanking sequences). Addition of F to the sequence name 45 Myoblast determining factor indicates the presence of up- and down-stream flanking sequences. 
Addi GLI-Krueppel-related transcription factor, tional letters (e.g., “B) indicate changes were made only to the flanking regions regulator of adenovirus E4 promoter MOKFAMOK2.01 Ribonucleoprotein associated Zinc finger protein MOK-2 (mouse) EGRFEGR2.01 Egr-2/KroX-20 early growth response gene TABLE 19 50 product Sequences in Synthetic Hygromycin Genes y EGRFEGR3.01 early growth response gene 3 product TFBS in hbyg HIFF HIF1.02 Hypoxia inducible factor, bHLHIPAS Before removal of TFBS from hyg (94 matches protein family EBOXAUSFO2 Upstream stimulating factor Family/matrix** Further Information 55 HIFF ARNT.O1 AhR nuclear translocator homo imers s ZFSFZF5.01 Zinc finger? POZ domain transcription VSPCAT, CAATO1 cellular and viral CCAAT box 800 VSMINIMUSCLE INI.02 Muscle Initiator Sequence EBOXAATF 6.01 Member of b-zip family, induced by ER VSMINIMUSCLE INI.O1 Muscle Initiator Sequence damage/stress, binds to the ERSE in VSETSF/PU1.01 Pu.1 (Pu120) Ets-like transcription factor association with NF-Y identified in lymphoid B-cells 60 BEL1 BEL1.01 Bel-1 similar region (defined in Lentivirus VSAHRRAHRARNT.O2 Aryl hydrocarbon/Arntheterodimers, fixed LTRs) COe NRSFNRSE.O1 Neural-restrictive-silencer-element VSEGRFEGR3.01 early growth response gene 3 product VSAP4RAP4.01 Activator protein 4 MYODMYOD.O1 Myoblast determination gene product VSEGRF/NGFIC.O1 Nerve growth factor-induced protein C s NEURNEUROD1.01 DNA binding site for NEUROD VSMAZF MAZ.O1 Myc associated Zinc finger protein (MAZ) (BETA-2/E47 dimer) VSZBPFZF9.01 Core promoter-binding protein (CPBP) 65 AHRRAHRARNT.O1 Aryl hydrocarbon receptor Arnt with 3 Krueppel-type Zinc fingers heterodimers US 7,728,118 B2 111 112

TABLE 19-continued Sequences in Synthetic Hygromycin Genes TFBS in hbyg3 TFBS in hbyg After removal of TFBS from hyg2 (3 matches Before removal of TFBS from hl 94 matches Family/matrix** Further Information Family/matrix** Further Information VSMINIMUSCLE INI.02 Muscle Initiator Sequence SHIFFARNT.O1 AhR nuclear translocator homodimers VSPAX5/PAX5.02 B-cell-specific activating protein VMYBVMYB.O2 VSVMYB/VMYB.02 v-Myb v-Myb 10 MOKFMOK2.01 Ribonucleoprotein associated Zinc finger protein MOK-2 (mouse) **matches are listed in order of occurrence in the corresponding sequence PAXS PAXS.O1 B-cell-specific activating protein PBXCPBX1 MEIS1.02 Binding site for a PbX1/Meis1 heterodimer s MYOF MYOGNF1.01 Myogenin nuclear factor 1 or related 80OS 15 Serum responsive factor TFBS in hHygro CP2 Before removal of TFBS from hygro (5 matches, excluding linker Rat C2H2 Zn finger protein involved in Family/matrix** Further Information olfactor neuronal differentiation AHRRAHRO1 Aryl hydrocarbon dioxin receptor VSMINIMUSCLE INI.02 Muscle Initiator Sequence MINIMUSCLE INI.O1 Muscle Initiator Sequence VSPAX5/PAX5.02 B-cell-specific activating protein PAXS PAXS.O2 B-cell-specific activating protein VSAREB, AREB6. AREB6 (Atp1a1 regulatory element binding : ZBPFZF9.01 Core promoter-binding protein (CPBP) factor 6) with 3 Krueppel-type Zinc fingers V EBOXFATF 6.01 Member of b-zip family, induced by ER VSCDEFCDE.01 Cell cycle-dependent element, CDF-1 damage/stress, binds to the ERSE in 25 binding site (CDE/CHR tandem elements association with NF-Y regulate cell cycle dependent repression) WSSEGRFNGFIC.O1 Nerve growth factor-induced protein C WSZFSFZF5.01 Zinc finger? 
POZ domain transcription **matches are listed in order of occurrence in the corresponding sequence factor AP4RfAP4O2 Activator protein 4 XBBFAMIF1.01 MIBP-1/RFX1 complex 30 EGRFEGR3.01 early growth response gene 3 product s WHZFWHNO1 Winged helix protein, involved in hair TFBS in hhyg4 keratinization and thymus epithelium After removal of TFBS from hEygro (4 matches differentiation Family/matrix** Further Information WSPAXS PAXS.O1 B-cell-specific activating protein WSWHZFWHNO.1 Winged helix protein, involved in hair 35 VSMINIMUSCLE INI.02 Muscle Initiator Sequence keratinization and thymus epithelium VSPAX5/PAX5.02 B-cell-specific activating protein differentiation VSAREB, AREB6.04 AREB6 (Atpla1 regulatory element binding PAXS PAXS.O1 B-cell-specific activating protein factor 6) PAXS PAXS.O3 PAX5 paired domain protein PAXS PAXS.O3 PAX5 paired domain protein 40 **matches are listed in order of occurrence in the corresponding sequence : ZBPFZF9.01 Core promoter-binding protein (CPBP) with 3 Krueppel-type Zinc fingers CP2F/CP2.01 CP2 TABLE 20 MINIMUSCLE INI.02 Muscle Initiator Sequence AP2FAAP2.01 Activator protein 2 Sequences in Synthetic Neomycin Genes PAXS PAXO1 B cell-specific activating protein 45 TFBS in hineo AHRRAHRARNT.O2 Aryl hydrocarbon/Arntheterodimers, fixed Before removal of TFBS from hineo (69 matches COe VSMINIMUSCLE INI.02 Muscle Initiator Sequence Family/matrix** Further Information WSEGRFEGR3.01 early growth response gene 3 product PCAT, CAATO1 cellular and viral CCAAT box VSSP FSP1.01 stimulating protein 1 SP1, ubiquitous Zinc 50 ZFIAZID.O1 Zinc finger with interaction domain finger transcription factor AP1FTCF11MAFG.O1 TCF11/MafGheterodimers, binding to Core promoter-binding protein (CPBP) subclass of AP1 sites with 3 Krueppel-type Zinc fingers MINIMUSCLE INI.01 Muscle Initiator Sequence EGRFEGR1.01 Egr-1/KroX-24/NGFI-A immediate-early AHRRAHRARNT.O1 Aryl hydrocarbon receptor Arnt gene product heterodimers 55 HIFF, HIF1.02 Hypoxia inducible 
factor, bHLHIPAS EGRFWT1.01 Wilms Tumor Suppressor protein family y SP1FSP1.01 stimulating protein 1 SP1, ubiquitous Zinc SP1FGC.O1 GC box elements finger transcription factor MINIMUSCLE INI.O2 Muscle Initiator Sequence VSRCAT/CLTR CAATO1 Mammalian C-type LTRCCAAT box CP2 WSZBPFAZF9.01 Core promoter-binding protein (CPBP) Winged helix protein, involved in hair with 3 Krueppel-type Zinc fingers 60 keratinization and thymus epithelium EGRFWT1.01 Wilms Tumor Suppressor differentiation EGRFWT1.01 Wilms Tumor Suppressor B-cell-specific activating protein NF1F, NF1.01 Nuclear factor 1 Zinc finger? POZ domain transcription factor s PDX1 PDX1.01 Pdx1 (IDX1/IPF1) pancreatic and Core promoter-binding protein (CPBP) intestinal homeodomain TF with 3 Krueppel-type Zinc fingers 65 Core promoter-binding protein (CPBP) **matches are listed in order of occurrence-in the corresponding sequence with 3 Krueppel-type Zinc fingers US 7,728,118 B2 113 114

TABLE 20-continued
Sequences in Synthetic Neomycin Genes
TFBS in hneo
Before removal of TFBS from hneo (69 matches)

Family/matrix** Further Information Family/matrix** Further Information WSHIFFHIF1.02 Hypoxia inducible factor, bHLHIPAS VSBCL6/BCL6.02 POZ/zinc finger protein, transcriptional protein family repressor, translocations observed in WSAHRRAHRARNT.O1 Aryl hydrocarbon receptor Arnt 10 diffuse large cell lymphoma heterodimers VSCLOX/CDPO1 cut-like homeodomain protein WSNRSFNRSE.O1 Neural-restrictive-silencer-element WSHIFFHIF1.02 Hypoxia inducible factor, bHLHIPAS **matches are listed in order of occurrence in the corresponding sequence protein family WSCREB, ATF 6.02 Activating transcription factor 6, member TFBS in hneo3 of b-zip family, induced by ER stress 15 VSRXRF/VDR RXR01 VDR/RXR RXR After removal of TFBS from hneo2=before removal of heterodimer site TFBS from hneo3 (0 matches) WSPCAT, CAATO1 cellular and viral CCAAT box WSNRSFNRSE.O1 Neural-restrictive-silencer-element VSP53FP53.01 Tumor suppressor p53 WSNEUR NEUROD1.01 DNA binding site for NEUROD1 (BETA 2/E47 dimer) TFBS in himeO4 WSEBOXAUSFO3 Upstream stimulating factor After removal of TFBS from himeO3 = before removal of TFBS from VSMYODMYOD.O2 Myoblast determining factor hineo4 (7 matches WSNRSFNRSE.O1 Neural-restrictive-silencer-element WSWHZFWHN.O1 Winged helix protein, involved in hair Family/matrix** Further Information keratinization and thymus epithelium differentiation 25 VSPAX5/PAX9.01 Zebrafish PAX9 binding sites WSEBOXAMYCMAX.O3 MYC-MAX binding sites VSAARF/AARE.O1 Amino acid response element, ATF4 binding WSHESFHES1.01 Drosophila hairy and enhancer of split site homologue 1 (HES-1) VSP53FP53.02 Tumor suppressor p53 (5' half site) WSNEUR NEUROD1.01 DNA binding site for NEUROD1 (BETA VSAP1RBACH2.01 Bach2 bound TRE 2/E47 dimer) VSNEUR NEUROG.01 Neurogenin 1 and 3 (ngn1.3) binding sites VSMYODMYOD.O2 Myoblast determining factor 30 VSCMYB, CMYB.01 c-Myb, important in hematopoesis, cellular WSREBVEBVRO1 Epstein-Barr virus transcription factor R equivalent to avian myoblastosis virus WSPAXS PAXS.O2 
B-cell-specific activating protein oncogene v-myb WSZFSFZF5.01 Zinc finger? POZ domain transcription Cone-rod homeobox-containing transcription actor factoriotX-like homeobox gene WSZFSFZF5.01 Zinc finger? POZ domain transcription **matches are listed in order of occurrence in the corresponding sequence actor 35 WSEGRFWT1.01 Wilms Tumor Suppressor WSEGRFWT1.01 Wilms Tumor Suppressor TFBS in hineO5 WSZBPFAZF9.01 Core promoter-binding protein (CPBP) with 3 Krueppel-type zinc fingers After removal of TFBS from hneo4 (0 matches) VSMINIMUSCLE INI.01 Muscle Initiator Sequence WSNRSFNRSFO1 Neuron Initiator silencer factor 40 TABLE 21 USPf1 MIPf1MI REI-IP WSNRSFNRSE.O1 Neural-restrictive-silencer-element Sequences in Synthetic Puromycin Genes WSMOKFMOK2.02 Ribonucleoprotein associated Zinc finger TFBS matches in hpuro protein MOK-2 (human) Before removal of TFBS from hpuro (68 matches VSAP2F AP2.01 Activator protein 2 Family/matrix** Further Information VSAP1FAP1F.O1 Activator protein 1 45 VPAXSPAXS.O3 PAX5 paired domain protein WSCDEFCDE.O1 Cell cycle-dependent element, CDF-1 WSEGRFEGR3.01 early growth response gene 3 product binding site (CDE/CHR tandem WSWHZFWHN.O1 Winged helix protein, involved in hair elements regulate cell cycle keratinization and thymus epithelium dependent repression) differentiation WSPAX3 PAX3.01 Pax-3 paired domain protein, VSPAX6/PAX4 PD.01 PAX4 paired domain binding site 50 expressed in embryogenesis, mutations correlate to Waardenburg WSBEL1 BEL1.01 Bel-1 similar region (defined in Lentivirus Syndrome LTRs) V CREB, ATF 6.02 Activating transcription factor 6, WSMOKFMOK2.01 Ribonucleoprotein associated Zinc finger member of b-zip family, induced by protein MOK-2 (mouse) ER stress WSEGRFEGR1.01 Egr-1/KroX-24/NGFI-A immediate-early 55 EBOR, XBP1.01 X-box-binding protein 1 gene product P53FP53.03 Tumor suppressor p53 (3' half site) WSEBOXAATF 6.01 Member of b-zip family, induced by ER HESF.HES1.01 Drosophila hairy and enhancer of damage/stress, 
binds to the ERSE in split homologue 1 (HES-1) association with NF-Y MTF1 MTF-101 Metal transcription factor 1, MRE WSEGRFEGR3.01 early growth response gene 3 product EKLFEKLF.O1 Erythroid krueppel like factor (EKLF) WSNRSFNRSE.O1 Neural-restrictive-silencer-element 60 EGRFEGR1.01 Egr-1/KroX-24/NGFI-A immediate VSETSFETS101 c-Ets-1 binding site early gene product WSNRSFNRSFO1 Neuron-restrictive silencer factor EBOXFATF 6.01 Member of b-zip family, induced by VSSP1FSP1.01 stimulating protein 1 SP1, ubiquitous Zinc ER damage/stress, binds to the ERSE finger transcription factor in association with NF-Y WSZBPFAZBP89.01 Zinc finger transcription factor ZBP-89 WSEBOXAATF 6.01 Member of b-zip family, induced by WSPAXS PAXS.O3 PAX5 paired domain protein 65 ER damage/stress, binds to the ERSE WSGREFARE.O1 Androgene receptor binding site in association with NF-Y US 7,728,118 B2 115 116

TABLE 21-continued TABLE 21-continued Sequences in Synthetic Puromycin Genes Sequences in Synthetic Puromycin Genes TFBS matches in hpuro TFBS matches in hpuro Before removal of TFBS from hpuro (68 matches Before removal of TFBS from hpuro (68 matches) Family/matrix** Further Information Family/matrix** Further Information WSCM c-Myb, important in hematopoesis, cellular equivalent to avian VSBCL6 BCL6.O1 POZ zinc finger protein, myoblastosis virus oncogene v-myb 10 transcriptional repressor, AH RRAHRARNT.O1 Aryl hydrocarbon receptor Arnt translocations observed in diffuse heterodimers arge cell lymphoma EBOXFMYCMAX.O3 MYC-MAX binding sites RORARORA2.01 RAR-related orphan receptor alpha2 WSZFSFZF5.01 Zinc finger? POZ domain EBOXFMYCMAX.O3 MYC-MAX binding sites transcription factor VSBCL6 BCL6.O2 POZ zinc finger protein, HIFFHIF1.02 Hypoxia inducible factor, bHLH/ 15 PAS protein family transcriptional repressor, EGRFEGR3.01 early growth response gene 3 product translocations observed in diffuse EGRFWT1.01 Wilms Tumor Suppressor arge cell lymphoma HAMLAML3.01 Runt-related transcription factor 2. 
WSEGRFEGR3.01 early growth response gene 3 product CBFA1 (core-binding factor, runt WSCREB, ATF 6.02 Activating transcription factor 6, domain, alpha Subunit 1) PAXS PAXS.O3 PAX5 paired domain protein member of b-zip family, induced by EBOXFATF 6.01 Member of b-zip family, induced by ER stress ER damage/stress, binds to the ERSE WSHIFF, HIF1.02 Hypoxia inducible factor, bHLH/ Vy in association with NF-Y PAS protein family HIFFHIF1.02 Hypoxia inducible factor, bHLH/ WSEBORXEP1.01 X-box-binding protein 1 PAS protein family VSDEAFNUDRO1 NUDR (nuclear DEAF-1 related y Zinc finger transcription factor ZBP-89 25 transcriptional regulator protein) Rat C2H2 Zn finger protein involved VSRXRF/VDR RXR01 VDR/RXR Vitamin D receptor RXR in olfactory neuronal differentiation heterodimer site GA BFGAGA.01 GAGA-Box VSAP2FAP2.01 Activator protein 2 EBOXFMYCMAX.O3 MYC-MAX binding sites WSREBVEBVRO1 Epstein-Barr virus transcription MYODMYFS.O1 Myf5 myogenic bHLH protein actor R AP4RTAL1BETAE47.01 Tal-1beta E47 heterodimer 30 NEURNEUROG.O1 Neurogenin 1 and 3 (ngn1.3) binding WSZBPFAZF9.01 Core promoter-binding protein sites (CPBP) with 3 Krueppel-type zinc V HAND HAND2 E12.01 Heterodimers of the bFHLH fingers transcription factors HAND2 WSMYOD, LMO2COM.O1 Complex of Limo2 bound to Tal-1, (Thing2) and E12 E2A proteins, and GATA-1, half-site WSMAZF MAZR.O1 MYC-associated zinc finger protein 35 related transcription factor WSAREBAREB6.03 AREB6 (Atp1al regulatory element Transcriptional repressor, binds to binding factor 6) elements found predominantly in WSRXRF,FXRE.O1 Farnesoid X- activated receptor genes that participate in lipid (RXR/FXR dimer) metabolism 40 WSAHRRAHRO1 Aryl hydrocarbon dioxin receptor VSSP1 FSP1.01 Stimulating protein 1 SP1, ubiquitous Zinc finger transcription factor **matches are listed in order of occurrence in the corresponding sequence AP2FAAP2.01 Activator protein 2 y EBARREB1.01 Ras-responsive element binding protein 1 BFAMIF1.01 MIBP-1/RFX1 complex 45 
EBFTAXCREB.O1 Tax/CREB complex TFBS matches in hpuro1 RFAEGR3.01 early growth response gene 3 product After removal of TFBS from hpuro = before removal of TFBS from : MOKFMOK2.01 Ribonucleoprotein associated Zinc hpurol (4 matches finger protein MOK-2 (mouse) V MOKFMOK2.01 Ribonucleoprotein associated Zinc Family/matrix** Further Informafion finger protein MOK-2 (mouse) 50 PAXS PAXS.O1 B-cell-specific activating protein VSNEUR NEUROG.01 Neurogenin 1 and 3 (ngn1.3) binding sites NR Neural-restrictive-silencer-element VSPAX5/PAX5.02 B-cell-specific activating protein MINIMUSCLE INI.02 Muscle Initiator Sequence VSREBVEBVR.O1 Epstein-Barr virus transcription factor R : EBOXFATF 6.01 Member of b-zip family, induced by VSAHRRAHR.O1 Aryl hydrocarbon dioxin receptor ER damage/stress, binds to the ERSE in association with NF-Y 55 **matches are listed in order of occurrence in the corresponding sequence V DEAF NUDRO1 NUDR (nuclear DEAF-1 related transcriptional regulator protein) WSAH RRAHRARNT.O1 Aryl hydrocarbon receptor Arnt heterodimers Zinc finger? 
POZ domain TFBS matches in hpuro2 transcription factor 60 After removal of TFBS from hpuro 1 (2 matches) Family/matrix** Further Information WSEGRFEGR1.01 Egr-1/KroX-24/NGFI-A immediate early gene product VSNEUR NEUROG.01 Neurogenin 1 and 3 (ngn1.3) binding sites WSHIFFHIF1.02 Hypoxia inducible factor, bHLH/ VSBCL6/BCL6.02 POZ/zinc finger protein, transcriptional PAS protein family repressor, translocations observed in ETSFETS101 c-Ets-1 binding site diffuse large cell lymphoma y STATSTAT1.01 Signal transducer and activator of 65 transcription 1 **matches are listed in order of occurrence in the corresponding sequence US 7,728,118 B2 117 118 EXAMPLE 7 TABLE 23 Summary of Design of Synthetic Firefly Luciferase Sequences in Synthetic Luc Genes (version A) Genes TFBS in hluc + ver2A1 Before removal of TFBS from hluc + ver2A1 (110 matches TF Binding Sites and Search Parameters Family matrix* Further Information The TF binding sites are from the TF binding site library V MINIMUSCLE INI.O2 Muscle Initiator Sequence (“Matrix Family Library”) that is part of the GEMS Launcher V WHZFWHN.O1 winged helix protein, involved in hair 10 keratinization and thymus epithelium package. Each TF binding site (“matrix”) belongs to a matrix differentiation family that groups functionally similar matrices together, GREFPRE.O1 Progesterone receptor binding site MAZF MAZR.01 MYC-associated zinc finger protein eliminating redundant matches by Matinspector professional related transcription factor (the search program). Searches were limited to vertebrate TF SP1F SP1.01 stimulating protein 1 SP1, ubiquitous Zinc binding sites. Searches were performed by matrix family, i.e. 15 finger transcription factor ZBPFZBP89.01 Zinc finger transcription factor ZBP-89 the results show only the best match from a family for each SF1FFSF1.01 SF1 steroidogenic factor 1 site. 
Matinspector default parameters were used for the core EGRFNGFIC.O1 Nerve growth factor-induced protein C and matrix similarity values (core similarity=0.75, matrix MINIMUSCLE INI.01 Muscle Initiator Sequence EGRFEGR2.01 Egr-2/KroX-20 early growth response gene similarity-optimized). product ZFSFZF5.01 Zinc finger? POZ domain transcription TABLE 22 actor HESF HES1.01 Drosophila hairy and enhancer of split Luc Gene Designations homologue 1 (HES-1) Synthetic luc gene (versions A and B NRSFNRSE.O1 neural-restrictive-silencer-element PAXSAPAXS.O2 B-cell-specific activating protein Sequence Description Matrix Library 25 HAMLAML3.01 Runt-related transcription factor 2/CBFA1 (core-binding factor, runt domain, alpha LllC wild-type gene (not applicable) subunit 1) C- improved gene from Promega's pCL3 (not applicable) Progesterone receptor binding site Wectors tumor Suppressor p53 C- Improved gene form Promega's (not applicable) Zinc finger? POZ domain transcription pGL3 (R2.1)-Basic 30 actor Codon optimization strategy A EBOXAATF 6.01 Member of b-zip family, induced by ER damage/stress, binds to the ERSE in hluc + ver2A1 codon optimized luc-- (strategy A) Wer 3.0 association with NF-Y November 2002 EGRFEGR3.01 (early growth response gene 3 product hluc + ver2A2 First removal of undesired sequence Ver 3.0 NF1F, NF1.01 Nuclear factor 1 matches November 2002 35 EGRFEGR3.01 early growth response gene 3 product hluc + ver2A3 Second removal of undesired se- Wer 3.0 REBVEBVRO1 Epstein-Barr virus transcription factor R quence matches November 2002 MOKFAMOK2.01 Ribonucleoprotein associated Zinc finger hluc + ver2A4. 
Third removal of undesired sequence Ver 3.0 protein MOK-2 (mouse) matches November 2002 PBXCPBXI MEIS1.01 Binding site for a PbX1/Meis1 heterodimer hluc + ver2A5 Fourth removal of undesired se- Wer 3.0 XSECSTAF.O1 Se-CystRNA gene transcription activating quence matches November 2002 40 800 hluc + ver2A6 Fifth removal of undesired sequence Ver 3.0 COMPCOMP1.01 COMP1, cooperates with myogenic matches November 2002 proteins in multicomponent complex hluc + ver2A7 Sixth removal of undesired sequence Ver 3.1.1 MYOF MYOGNF1.01 Myogenin nuclear factor 1 or related matches April 2003 80OS hluc + ver2A8 Removal of Bg|I (RE) site Wer 3.1.1 NEURNEUROD1.01 DNA binding site for NEUROD1 April 2003 (BETA-2/E47 dimer) Codon optimization strategy B 45 MYODMYOD.O2 myoblast determining factor AP2FAP2.01 Activator protein 2 hluc + ver2B1 codon optimized luc-- (strategy B) Wer 3.0 EVI1 EVI1.02 Ecotropic viral integration site 1 encoded November 2002 800 hluc + ver2B2 First removal of undesired sequence Ver 3.0 SMADSMAD4.01 Smada transcription factor involved in matches November 2002 TGF-beta signaling hluc + ver2B3 Second removal of undesired se- Wer 3.0 50 MYODMYFS.O1 Myf5 myogenic bHLH protein HESF.HES1.01 Drosophila hairy and enhancer of split quence matches November 2002 homologue 1 (HES-1) hluc + ver2B4 Third removal of undesired sequence Ver 3.0 PAXS PAXS.O1 B-cell-specific activating protein matches November 2002 EBOXFATF 6.01 hluc + ver2B5 Fourth removal of undesired se- Wer 3.0 Member of b-zip family, induced by ER quence matches November 2002 damage/stress, binds to the ERSE in 55 association with NF-Y hluc + ver2B6 Fifth removal of undesired sequence Ver 3.0 GC box elements matches November 2002 y MAZF MAZR.01 MYC-associated zinc finger protein hluc + ver2B7 Sixth removal of undesired sequence Ver 3.1.1 matches April 2003 related transcription factor RREBFRREB1.01 Ras-responsive element binding protein 1 hluc + ver2B8 Removal of SmaI (RE), Ptx 1 (TF) Wer 3.1.1 AHRRAHRARNT.O1 
Aryl hydrocarbon receptor Arnthetero sites April 2003 60 dimers hluc + ver2B9 Removal of additional CpG se- Wer 3.1.1 HIFF, HIF1.02 Hypoxia inducible factor, bHLHIPAS quences April 2003 protein family hluc + ver2B10 Removal of Bg|I (RE) site Wer 3.1.1 Zinc finger? POZ domain transcription April 2003 factor *the sequence names designate open reading frames; RE = restriction EBOXFATF 6.01 Member of b-zip family, induced by ER enzyme recognition sequence 65 damage/stress, binds to the ERSE in association with NF-Y US 7,728,118 B2 119 120

TABLE 23-continued TABLE 23-continued Sequences in Synthetic Luc Genes (version A) TFBS in hluc + ver2A1 Sequences in Synthetic Luc Genes (version A) Before removal of TFBS from hluc + ver2A1 (110 matches TFBS in hluc + ver2A1 Before removal of TFBS from hluc + ver2A1 (110 matches) Fami Further Information YY1F/YY101 Yin and Yang 1 Family matrix* Further Information ETSF GABPO1 GABP:GA binding protein MOKFMOK2.01 Ribonucleoprotein associated Zinc finger 10 WSEGRFEGR3.01 early growth response gene 3 product protein MOK-2 (mouse) V EGRFEGR3.01 early growth response gene 3 product Elk-1 MYC-MAX binding sites WSWHZFWHN.O1 winged helix protein, involved in hair GLI-Krueppel-related transcription factor, keratinization and thymus epithelium regulator of adenovirus E4 promoter differentiation XBBFFRFX1.01 X-box binding protein RFX1 15 VSAP2FAP2.01 Activator protein 2 EVI1 EVI1.06 Ecotropic viral integration site 1 encoded WSHIFF, HIF1.02 actor Hypoxia inducible factor, bHLHIPAS MOKFMOK2.01 Ribonucleoprotein associated Zinc finger protein family protein MOK-2 (mouse) WSNRSFNRSE.O1 neural-restrictive-silencer-element NF1F, NF1.01 Nuclear factor 1 WSZFIAZID.O1 Zinc finger with interaction domain PBXCPBX1 MEIS1.02 Binding site for a Pbxl/Meis1 heterodimer WSSMAD SMAD4O1 ZFSFZF5.01 Zinc finger? 
POZ domain transcription Smada transcription factor involved actor in TGF-beta signaling HESFHES1.01 Drosophila hairy and enhancer of split WSAHRRAHRARNT.O2 Aryl hydrocarbon/Arntheterodimers, homologue 1 (HES-1) fixed core PAXS PAXS.O1 B-cell-specific activating protein WSEBOXAMYCMAXO1 ETSF GABPO1 GABP:GA binding-protein c-Myc/Max heterodimer MYODMYOD.O2 myoblast determining factor 25 WSEBOXAUSFO3 upstream stimulating factor XSECfSTAF.O1 Se-CystRNA gene transcription activating WSEGRFEGR1.01 Egr-1/KroX-24/NGFI-A immediate-early actor gene product OAZFROAZ.01 Rat C2H2 Zn finger protein involved in olfactory neuronal differentiation VSMINIMUSCLE INI.O1 Muscle Initiator Sequence AP2FAAP2.01 Activator protein 2 WSMOKFAMOK2.01 Ribonucleoprotein associated Zinc finger PAX3 PAX3.01 Pax-3 paired domain protein, expressed in 30 protein MOK-2 (mouse) embryogenesis, mutations correlate to neural-restrictive-silencer-element Waardenburg Syndrome AP2FAAP2.01 Activator protein 2 V Nuclear factor 1 y MTF1 MTF-101 Metal transcription factor 1, MRE SF1 steroidogenic factor 1 SF1FFTF.O1 Alpha (1)-fetoprotein transcription factor (FTF), liver receptor homologue-1 35 (LHR-1) **matches are listed in order of occurrence in the corresponding sequence SMADSMAD4.01 Smada transcription factor involved in TGF-beta signaling NFKBNFKAPPAB.O1 NF-kappaB EKLFEKLF.O1 Erythroid krueppel like factor (EKLF) CREBFTAXCREB.O1 Tax/CREB complex TFBS in hluc + ver2A3 E2FFE2F03 E2F, involved in cell cycle regulation, 40 After removal of TFBS from hluc + ver2A2 = before removal of TFBS interacts with Rb p107 protein from hluc + ver2A3 (8 matches CP2F/CP2.01 AHRRAHRARNT.O1 Aryl hydrocarbon receptor Arnt Family/matrix** Further Information heterodimers EGRFEGR2.01 Egr-2/KroX-20 early growth response VSEGRFEGR2.01 Egr-2/KroX-20 early growth response gene gene product 45 product Zinc finger? 
POZ domain transcription VSHAMLAML3.01 Runt-related transcription factor 2/CBFA1 factor (core-binding factor, runt domain, alpha EBOR, XBP1.01 X-box-binding protein 1 subunit 1) FKHDXFD3.01 Xenopus fork head domain factor 3 VSMYOF/MYOGNF1.01 Myogenin nuclear factor 1 or related factors AP2FAAP2.01 Activator protein 2 VSNF1F/NF1.01 Nuclear factor 1 EGRFNGFIC.O1 Nerve growth factor-induced protein C 50 VSETSF, GABPO1 GABP:GA binding protein PCATACAATO1 Avian C-type LTR CCAAT box VSNFKB, NFKAPPAB.01 NF-kappaB PBXCPBX1 MEIS1.02 Binding site for a PbX1/Meis1 heterodimer VSEKLF, EKLF.01 Erythroid krueppel like factor (EKLF) AHRRAHRARNT.O2 Aryl hydrocarbon/Arntheterodimers, fixed VSFKHDXFD3.01 Xenopus fork head domain factor 3 COe MOKFMOK2.01 Ribonucleoprotein associated Zinc finger **matches are listed in order of occurrence in the corresponding sequence protein MOK-2 (mouse) 55 GREF,GRE.O1 Glucocorticoid receptor, C2C2 zinc finger protein binds glucocorticoid dependent to GRES EURNEUROD1.01 DNA binding site for NEUROD1 (BETA TFBS in hluc + ver2A6 2/E47 dimer) After removal of TFBS from hluc + ver2A5 (2 matches RSFNRSE.O1 neural-restrictive-silencer-element 60 RSFNRSE.O1 neural-restrictive-siiencer-element Family/matrix** Further Information HRRAHRARNT.O2 Aryl hydrocarbon/Arntheterodimers, fixed COe VSHAMLAML3.01 Runt-related transcription factor 2/CBFA1 EBOXFATF 6.01 Member of b-zip family, induced by ER (core-binding factor, runt domain, alpha damage/stress, binds to the ERSE in subunit 1) association with NF-Y VSFKHDXFD3.01 Xenopus fork head domain factor 3 HIFFHIF1.02 Hypoxia inducible factor, bHLHIPAS 65 protein family **matches are listed in order of occurrence in the corresponding sequence US 7,728,118 B2 121 122

TABLE 24-continued TFBS in hluc + ver2A6 Sequences in Synthetic Luc Genes (version B) Before removal of TFBS from hluc + ver2A6 (4 matches TFBS in hluc + ver2B1 Before removal of TFBS from hluc + ver2B1 (187 matches Family/matrix** Further Information Family/matrix** Further Information VSPAX5/PAX5.03 PAX5 paired domain protein VSLEFF/LEF1.02 TCF/LEF-1, involved in the Wnt signal transduction CLOXFCDPCR3.01 cut-like homeodomain protein pathway GFI1, Gf1B.O1 Growth factor independence 1 Zinc VSIRFF/IRF7.01 Interferon regulatory factor 7 (IRF-7) 10 finger protein Gfi-1B VSFKHDXFD3.01 Xenopus fork head domain factor 3 GATALMO2COM.02 complex of Limo2 bound to Tal-1, E2A proteins, and GATA-1, half-site 2 **matches are listed in order of occurrence in the corresponding sequence SRFFSRFO1 serum response factor HOXTMEIS1. HOXA9.01 Homeobox protein MEIS1 binding site OCT1, OCT1.03 octamer-binding factor 1 15 GFI1, GFI1.01 Growth factor independence 1 Zinc finger protein acts as transcriptional repressor TFBS in hluc + ver2A7 Liver enriched Cut - Homeodomain After removal of TFBS from hluc + ver2A6 = before removal of TFBS transcription factor HNF6 (ONECUT) from hluc + ver2A7 (1 match HAMLAML1.01 runt-factor AML GREFPRE.O1 Progesterone receptor binding site Family/matrix Further Information STATSTATS.O1 STAT5: signal transducer and activator of transcription 5 VSFKHDXFD3.01 Xenopus fork head factor 3 cellular and viral TATA box elements cut-like homeodomain protein HNF-3/Fkh Homolog-8 FAST-1 SMAD interacting protein 25 Growth factor independence 1 Zinc finger protein Gfi-1B TFBS in hluc + ver2A8 CART CART1.01 Cart-1 (cartilage homeoprotein 1) After removal of TFBS from hluc + ver2A7 (1 match HMTBAMTBF.O1 muscle-specific Mt binding site TBPFTATA.O1 cellular and viral TATA box elements Family matrix Further Information FKHDXFD2O1 Xenopus forkhead domain factor 2 30 BRNFBRN2O1 POU factor Brn-2 (N-Oct 3) VSFKHDXFD3.01 Xenopus fork head domain factor 3 MEF2A AMEF2.01 myocyte 
enhancer factor BRNFBRN2O1 POU factor Brn-2 (N-Oct 3) BEL1 BEL1.01 Bel-1 similar region (defined in Lentivirus LTRs) TABLE 24 NOLFOLF1.01 olfactory neuron-specific factor 35 OCT1, OCT1.06 octamer-binding factor 1 Sequences in Synthetic Luc Genes (version B) NFKBNFKAPPAB.O2 NF-kappaB TFBS in hluc + ver2B1 BCL6 BCL6.02 POZ zinc finger protein, transcriptional Before removal of TFBS from hluc + ver2B1 (187 matches repressor, translocations observed in diffuse large cell lymphoma Family/matrix** Further Information MOKFAMOK2.01 Ribonucleoprotein associated Zinc 40 finger protein MOK-2 (mouse) V HOXFAPTX101 Pituitary Homeobox 1 (Ptx1) HEATHSF1.01 heat shock factor 1 V OCT1, OCT1.04 octamer-binding factor 1 OCTP, OCT1 PO1 octamer-binding factor 1, POU-specific V OCTP, OCT1 PO1 octamer-binding factor 1, POU-specific domain domain PIT1. PIT1.01 Pitl, GHF-1 pituitary specific pou V homeo domain factor Nkx-2.5/CSX, domain transcription factor inman homolog low affinity sites HOXFACRXO1 Cone-rod homeobox-containing WSBARB, BARBIE.O1 barbiturate-inducible element 45 transcription factor otX-like homeobox WSTBPFTATA.O1 cellular and viral TATA box elements gene VSGATAGATA.01 GATA binding site (consensus) V Liver enriched Cut - Homeodomain VSAP4RfAP4O1 Activator protein 4 transcription factor HNF6 (ONECUT) WSHEN1 HEN1.02 HEN1 CLOXFCLOXO1 Clox WSSRFFSRFO1 serum response factor y BCL6 BCL6.02 POZ zinc finger protein, transcriptional WSPARFDBPO1 Albumin D-box binding protein 50 repressor, translocations observed in WSMOKFMOK2.01 Ribonucleoprotein associated Zinc diffuse large cell lymphoma finger protein MOK-2 (mouse) HOXFAPTX101 Pituitary Homeobox 1 (Ptx1) WSEV11 EVI1.04 Ecotropic viral integration site 1 GATAGATA1.02 GATA-binding factor 1 encoded factor FKHD,FREAC4O1 Fork head RElated ACtivator-4 WSGFI1, Gf1B.O1 Growth factor independence 1 Zinc s E4FF.E4FO1 GLI-Krueppel-related transcription finger protein Gfi-1B 55 actor, regulator of adenovirus E4 WSRBPFRBPJKO1 Mammalian 
transcriptional repressor broiloter RBP-kappa/CBF1 V PDX1, ISL1.01 Pancreatic and intestinal WSTBPFTATA.O2 Mammalian C-type LTRTATA box im-homeodomain factor VSAP4R/TAL1ALPHAE47.01 Tal-1alpha/E47 heterodimer CART CART1.01 Cart-1 (cartilage homeoprotein 1) WSSRFFSRFO1 serum response factor y GFI1, GFI1.01 Growth factor independence 1 Zinc VSOCTPOCT1PO1 octamer-binding factor 1, POU-specific finger protein acts as transcriptional 60 domain repressor WSBRNFABRN2.01 POU factor Brn-2 (N-Oct 3) IRFFAIRF3.01 interferon regulatory factor 3 (IRF-3) WSCREB,E4BP4O1 E4BP4, bZIP domain, transcriptional BARB.BARBIE.O1 barbiturate-inducible element repressor PBXFPBX1.01 homeo domain factor Pbx-1 VSVBPFVBPO1 PAR-type chicken vitellogenin EVI1 EVI1.02 Ecotropic viral integration site 1 promoter-binding protein encoded factor WSEVI1 EVI1.04 Ecotropic viral integration site 1 65 GATAGATA2.01 GATA-binding factor 2 encoded factor BRNFBRN2O1 POU factor Brn-2 (N-Oct 3) US 7,728,118 B2 123 124

TABLE 24-continued
Sequences in Synthetic Luc Genes (version B)
TFBS in hluc+ver2B1
Before removal of TFBS from hluc+ver2B1 (187 matches)

Family/matrix** Further Information Family/matrix** Further Information PARFDBPO1 Albumin D-box binding protein VSAP4RTH1E47.01 Thing1/E47 heterodimer, TH1 bHLH BRNFABRN3.01 POU transcription factor Brn-3 member specific expression in a variety ZBPFZBP89.01 Zinc finger transcription factor ZBP-89 10 of embryonic tissues CREBFTAXCREB-02 Tax/CREB complex XSECSTAF.O1 Se-Cys tRNA gene transcription GREFPRE.O1 Progesterone receptor binding site activating factor RBPFRBPJKO1 Mammalian transcriptional repressor IKRSIK3.01 Ikaros 3, potential regulator of RBP-Jkappa/CBF1 lymphocyte differentiation GATAGATA3.02 GATA-binding factor 3 AP1FAP1.01 AP1 binding site STATSTATO1 signal transducers and activators of 15 MAZF MAZ.01 Myc associated Zinc finger protein transcription (MAZ) KRSIK2.01 karos 2, potential regulator of MZF1 MZF1.01 MZF ymphocyte differentiation CLOXFCDPCR3.01 cut-like homeodomain protein SRFF, SRFO1 serum response factor P53FP53.01 tumor Suppressor p53 SEF1 SEF1.01 SEF1 binding site SMADSMAD3.01 Smad3 transcription factor involved in HAMLAML1.01 runt-factor AML-1 TGF-beta signaling MOKFMOK2O2 Ribonucleoprotein associated Zinc HMTBAMTBF.O1 muscle-specific Mt binding site finger protein MOK-2 (human) OCT1, OCT1.03 octamer-binding factor 1 Forkhead RElated ACtivator-2 FKHDXFD3.01 Xenopus forkhead domain factor 3 muscle-specific Mt binding site PIT1. PIT1.01 Pitl, GHF-1 pituitary specific pou Growth factor independence 1 Zinc domain transcription factor finger protein acts as transcriptional OCTP, OCT1 PO1 octamer-binding factor 1, POU-specific repressor 25 domain ECATNFYO3 nuclear factor Y (Y-box binding factor) HOXFAHOX1-3.01 Hox-1.3, vertebrate homeobox protein HOXT/MEIS1. 
HOXA9.01 Homeobox protein MEIS 1 binding site PBXFPBX1.01 homeo domain factor Pbx-1 PCATACAATO1 Avian C-type LTRCCAAT box ECATNFYO3 nuclear factor Y (Y-box binding factor) HNF6, HNF6.01 Liver enriched Cut - Homeodomain PBXCPBX1 MEIS1.02 Binding site for a PbX1/Meis1 transcription factor HNF6 (ONECUT) heterodimer CLOXFCLOXO1 Clox 30 CLOXFCDP.O2 transcriptional repressor CDP GATAGATA3.02 GATA-binding factor 3 HOXTMEIS1. HOXA9.01 Homeobox protein MEIS1 binding site AREB, AREB6.04 AREB6 (Atp1al regulatory element HOXFAHOXA9.01 Member of the vertebrate HOX- cluster binding factor 6) of homeobox factors GATAGATA3.02 GATA-binding factor 3 GATAGATA1.02 GATA-binding factor 1 FKHD.HNF3B.O1 Hepatocyte Nuclear Factor 3beta PCATACAATO1 Avian C-type LTR CCAAT box IRFFAIRF1.01 interferon regulatory factor 1 35 XSECSTAF.O1 Se-Cys tRNA gene transcription NKXHFNKX31.01 prostate-specific homeodomain protein activating factor NKX3.1 OCTP, OCT1 PO1 octamer-binding factor 1, POU-specific PBXFPBX1.01 homeo domain factor Pbx-1 domain ECATNFYO3 nuclear factor Y (Y-box binding factor) CLOXFCDPO1 cut-like homeodomain protein PBXCPBX1 MEIS1.02 Binding site for a PbX1/Meis1 FASTFAST1.01 FAST-1 SMAD interacting protein heterodimer ECATNFYO1 nuclear factor Y (Y-box binding factor) CLOXFGDP.O2 transcriptional repressor CDP 40 MEF2AMMEF2.01 myocyte enhancer factor HOXTMEIS1. 
HOXA9.01 Homeobox protein MEIS1 binding site TBPFTATA.O2 Mammalian C-type LTRTATA box HOXFAHOXA9.01 Member of the vertebrate HOX- cluster FASTFAST1.01 FAST-1 SMAD interacting protein of homeobox factors LTUPFTAACC.O1 Lentiviral TATA upstream element GATAGATA.01 GATA binding site (consensus) MOKFAMOK2.01 Ribonucleoprotein associated Zinc NKXHFNKX31.01 prostate-specific homeodomain protein finger protein MOK-2 (mouse) NKX3.1 45 BRNFABRN2O1 POU factor Brn-2 (N-Oct 3) GATAGATA3.02 GATA-binding factor 3 HOXFACRXO1 Cone-rod homeobox-containing HOXFACRX.O1 Cone-rod homeobox-containing transcription factor otX-like homeobox transcription factor otX-like homeobox gene gene prostate-specific homeodomain protein CART CART1.01 Cart-1 (cartilage homeoprotein 1) NKX3.1 OCT1, OCT1.02 octamer-binding factor 1 50 HEN1 HEN1.01 HEN1 MAZF MAZR.01 MYC-associated zinc finger protein EL1 BEL1.01 Bel-1 similar region (defined in related transcription factor Lentivirus LTRs) ZBPFZBP89.01 Zinc finger transcription factor ZBP-89 HOXFAPTX101 Pituitary Homeobox 1 (Ptx1) GATAGATA3.02 GATA-binding factor 3 BRNFABRN2O1 POU factor Brn-2 (N-Oct 3) HOXFACRX.O1 Cone-rod homeobox-containing NFKBNFKAPPAB.O1 NF-kappaB transcription factoriotx-like homeobox 55 HAMLAML1.01 runt-factor AML-1 gene ZFIAZID.O1 Zinc finger with interaction domain CLOXFCDPCR3.01 cut-like homeodomain protein XSECSTAF.O2 Se-Cys tRNA gene transcription AP1F.VMAFO1 activating factor AP4RTAL1ALPHAE47.01 Tal-1alpha E47 heterodimer KRSIK1.01 Ikaros 1, potential regulator of PAX8 PAX8.01 PAX 2/5/8 binding site lymphocyte differentiation BRACBRACH.O1 Brachyury FASTFAST1.01 FAST-1 SMAD interacting protein GATAGATA1.02 GATA-binding factor 1 60 MOKFAMOK2.01 Ribonucleoprotein associated Zinc RREBFRREB1.01 Ras-responsive element binding finger protein MOK-2 (mouse) protein 1 BEL1 BEL1.01 Bel-1 similar region (defined in MZF1 Lentivirus LTRs) Ribonucleoprotein associated Zinc EGRF.WT1.01 Wilms Tumor Suppressor finger protein MOK-2 (human) 
MAZF MAZR.01 MYC-associated zinc finger protein HOXFAPTX101 Pituitary Homeobox 1 (Ptx1) 65 related transcription factor LTUPTAACC.O1 Lentiviral TATA upstream element Zinc finger transcription factor ZBP-89 US 7,728,118 B2 125 126

TABLE 24-continued
Sequences in Synthetic Luc Genes (version B)
TFBS in hluc+ver2B1
Before removal of TFBS from hluc+ver2B1 (187 matches)

-continued
TFBS in hluc+ver2B3
After removal of TFBS from hluc+ver2B2 = before removal of TFBS from hluc+ver2B3 (35 matches)

Family/matrix** Further Information Family/matrix** Further Information ZBPFZBP89.01 Zinc finger transcription factor ZBP-89 VSGATAGATA1.02 GATA-binding factor 1 SP1 FGC.O1 GC box elements VSMINIMUSCLE INI.01 Muscle Initiator Sequence RREBFRREB1.01 Ras-responsive element binding 10 VSCLOX/CDPO1 cut-like homeodomain protein protein 1 VSBRNFBRN2.01 POU factor Brn-2 (N-Oct 3) MOKFMOK2.01 Ribonucleoprotein associated Zinc VSNFKB, NFKAPPAB.01 NF-kappaB finger protein MOK-2 (mouse) VSZFIAZID.O1 MEIS, MEIS1.01 Binding site for monomeric Meis1 Zinc finger with interaction domain homeodomain protein VSBCL6/BCL6.02 POZ zinc finger protein, transcriptional repressor, translocations observed in POZ/zinc finger protein, transcriptional 15 repressor, translocations observed in diffuse large cell lymphoma diffuse large cell lymphoma Cone-rod homeobox-containing GATAGATA3.02 GATA-binding factor 3 transcription factoriotX-like homeobox y HOXFACRX.O1 Cone-rod homeobox-containing gene transcription factor otx-like homeobox gene **matches are listed in order of occurrence in the corresponding sequence HOXACRXO1 Cone-rod homeobox-containing transcription factor otx-like homeobox gene MAZF MAZR.01 MYC-associated zinc finger protein related transcription factor TFBS in hluc + ver2B6 MZF1 25 After removal of TFBS from hluc + ver2B5 (2 matches PDX1 PDX1.01 Pdx1 (IDX1/IPF1) pancreatic and intestinal homeodomain TF Family/matrix** Further Information VSHOXF/PTX1.01 Pituitary Homeobox 1 (Ptx1) **matches are listed in order of occurrence in the corresponding sequence VSFKHDXFD3.01 Xenopus fork head domain factor 3 30 **matches are listed in order of occurrence in the corresponding sequence

TFBS in hluc+ver2B3
After removal of TFBS from hluc+ver2B2 = before removal of TFBS from hluc+ver2B3 (35 matches)

Family/matrix**            Further Information
V$OCT1/OCT1.04             octamer-binding factor 1
V$BARB/BARBIE.01           barbiturate-inducible element
V$NFKB/NFKAPPAB.02         NF-kappaB
V$OCTP/OCT1P.01            octamer-binding factor 1, POU-specific domain
V$PIT1/PIT1.01             Pit1, GHF-1 pituitary specific pou domain transcription factor
V$HOXF/PTX1.01             Pituitary Homeobox 1 (Ptx1)
V$FKHD/FREAC4.01           Forkhead RElated ACtivator-4
V$E4FF/E4F.01              GLI-Krueppel-related transcription factor, regulator of adenovirus E4 promoter
V$EVI1/EVI1.02             Ecotropic viral integration site 1 encoded factor
V$GATA/GATA2.01            GATA-binding factor 2
V$GREF/PRE.01              Progesterone receptor binding site
V$RBPF/RBPJK.01            Mammalian transcriptional repressor RBP-Jkappa/CBF1
V$STAT/STAT.01             signal transducers and activators of transcription
V$IKRS/IK2.01              Ikaros 2, potential regulator of lymphocyte differentiation
V$FKHD/FREAC2.01           Forkhead RElated ACtivator-2
V$SRFF/SRF.01              serum response factor
V$GREF/PRE.01              Progesterone receptor binding site
V$CLOX/CDPCR3.01           cut-like homeodomain protein
V$AP4R/TAL1ALPHAE47.01     Tal-1alpha/E47 heterodimer
V$GATA/GATA1.02            GATA-binding factor 1
V$FKHD/XFD3.01             Xenopus fork head domain factor 3
V$PBXF/PBX1.01             homeo domain factor Pbx-1
V$ECAT/NFY.03              nuclear factor Y (Y-box binding factor)
V$PBXC/PBX1_MEIS1.02       Binding site for a Pbx1/Meis1 heterodimer
V$CLOX/CDP.02              transcriptional repressor CDP
V$HOXT/MEIS1_HOXA9.01      Homeobox protein MEIS1 binding site
V$HOXF/HOXA9.01            Member of the vertebrate HOX-cluster of homeobox factors

TFBS in hluc+ver2B6
Before removal of TFBS from hluc+ver2B6 (6 matches)

Family/matrix**            Further Information
V$PAX6/PAX4_PD.01          PAX4 paired domain binding site
V$HOXF/PTX1.01             Pituitary Homeobox 1 (Ptx1)
V$FKHD/XFD3.01             Xenopus fork head domain factor 3
V$PAX6/PAX6.02             PAX6 paired domain and homeodomain are required for binding to this site
V$PAX5/PAX5.03             PAX5 paired domain protein
V$IRFF/IRF3.01             Interferon regulatory factor 3 (IRF-3)

**matches are listed in order of occurrence in the corresponding sequence

TFBS in hluc+ver2B7
After removal of TFBS from hluc+ver2B6 = before removal of TFBS from hluc+ver2B7 (2 matches)

Family/matrix**            Further Information
V$HOXF/PTX1.01             Pituitary Homeobox 1 (Ptx1)
V$FKHD/XFD3.01             Xenopus fork head domain factor 3

**matches are listed in order of occurrence in the corresponding sequence

TFBS in hluc+ver2B8
After removal of TFBS from hluc+ver2B7 = before removal of TFBS from hluc+ver2B8 (1 match)

Family/matrix              Further Information
V$FKHD/XFD3.01             Xenopus fork head domain factor 3

TFBS in hluc+ver2B9
After removal of TFBS from hluc+ver2B8 = before removal of TFBS from hluc+ver2B9 (1 match)

Family/matrix              Further Information
V$FKHD/XFD3.01             Xenopus fork head domain factor 3

TFBS in hluc+ver2B10
After removal of TFBS from hluc+ver2B9 (1 match)

Family/matrix              Further Information
V$FKHD/XFD3.01             Xenopus fork head domain factor 3

EXAMPLE 8

Summary of Design for pGL4 Sequences

FIG. 2 depicts the design scheme for the pGL4 vector. A portion of the vector backbone in pGL3 which includes a bla gene and a sequence between bla and a multiple cloning region, but not a second open reading frame, was modified to yield pGL4. pGL4 includes an ampicillin resistance gene between a NotI and a SpeI site, the sequence of which was modified to remove regulatory sequences but not to optimize codons for mammalian expression (bla-1 to bla-5), and a SpeI-NcoI fragment that includes a multiple cloning region and a translation trap. The translation trap includes about 60 nucleotides having at least two stop codons in each reading frame. The SpeI-NcoI fragment from a parent vector, pGL4-basics-5F2G-2, was modified to decrease undesired regulatory sequences (MCS-1 to MCS-4; SEQ ID Nos. 76-79). One of the resulting sequences, MCS-4, was combined with a modified ampicillin resistance gene, bla-5 (SEQ ID NO:84), to yield pGL4B-4NN (SEQ ID NO:95). pGL4B-4NN was further modified (pGL4-NN1-3; SEQ ID Nos. 96-98). To determine if additional polyA sequences in the SpeI-NcoI fragment further reduced expression from the vector backbone, various polyA sequences were inserted therein. For instance, pGL4NN-Blue Heron included a c-mos polyA sequence in the SpeI-NcoI fragment. However, removal of regulatory sequences in polyA sequences may alter the secondary structure and thus the function of those sequences.

In one vector, the SpeI-NcoI fragment from pGL3 (SpeI-NcoI start ver2; SEQ ID NO:48) was modified to remove one transcription factor binding site and one restriction enzyme recognition site, and alter the multiple cloning region, yielding SpeI-NcoI ver2 (SEQ ID NO:49).

TF Binding Sites and Search Parameters

Each TF binding site ("matrix") belongs to a matrix family that groups functionally similar matrices together, eliminating redundant matches by MatInspector professional (the search program). Searches were limited to vertebrate TF binding sites. Searches were performed by matrix family, i.e., the results show only the best match from a family for each site. MatInspector default parameters were used for the core and matrix similarity values (core similarity=0.75, matrix similarity=optimized), except for sequence MCS-1 (core similarity=1.00, matrix similarity=optimized).

TABLE 25
Description of Designed Sequences
pGL4 sequences

Sequence                Description                                                Matrix Library

SpeI-NcoI fragment with MCS, translation trap
MCS-1                   SpeI-NcoI from pGL4-basics-5F2G-2                          Ver 2.2 September 200
MCS-2                   First removal of undesired sequence matches                Ver 2.2 September 200
MCS-3                   Second removal of undesired sequence matches               Ver 2.2 September 200
MCS-4                   Third removal of undesired sequence matches                Ver 2.3 February 200

NotI-SpeI fragment with bla gene
bla                     Beta-lactamase gene from pGL3 vectors
bla-1*                  SacII (RE) added, BsmAI (RE) site removed (*)              Ver 2.2 September 200
bla-2*                  First removal of undesired sequence matches                Ver 2.3 February 200
bla-3*                  Second removal of undesired sequence matches               Ver 2.3 February 200
bla-4*                  Third removal of undesired sequence matches                Ver 2.3 February 200
bla-5*                  Fourth removal of undesired sequence matches               Ver 2.3 February 200

NotI-NcoI fragment with bla, translation trap, MCS
pGL4B-4NN               Combination of bla-5 and MCS-4 sections                    Ver 2.4 May 2002
pGL4B-4NN1              First removal of undesired sequence matches                Ver 2.4 May 2002
pGL4B-4NN2              Second removal of undesired sequence matches               Ver 2.4 May 2002
pGL4B-4NN3              Third version after removal of CEBP (TF) site              Ver 2.4 May 2002

SpeI-NcoI fragment with translation trap, polyA, MCS
SpeI-NcoI-Ver2-start    Existing MCS replaced with new MCS                         Ver 4.0 November 2003
SpeI-NcoI-Ver2          First removal of undesired sequence matches                Ver 4.0 November 2003

(*) Bla codon usage was not optimized for expression in mammalian cells. Low usage E. coli codons were avoided when changes were introduced to remove undesired sequence elements.

TABLE 26
Sequences in Synthetic SpeI-NcoI Fragment of pGL4
TFBS in MCS-1
Before removal of TFBS from MCS-1 (14 matches)

Name of family/matrix**    Further Information
V$PAX3/PAX3.01             Pax-3 paired domain protein, expressed in embryogenesis, mutations correlate to Waardenburg Syndrome
V$GATA/GATA.01             GATA binding site (consensus)
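
The family-level reporting rule stated under "TF Binding Sites and Search Parameters" (only the best match from a matrix family is shown for each site, subject to a core similarity threshold) can be illustrated with a minimal sketch. This is not MatInspector's implementation. The `Match` record, its field names, the `best_per_family` helper, and the example positions and scores are all hypothetical, and a "site" is assumed here to be identified by a single sequence position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One hypothetical search hit (not MatInspector's real output format)."""
    family: str        # matrix family, e.g. "V$GATA"
    matrix: str        # individual matrix, e.g. "GATA1.02"
    position: int      # assumed anchor position of the site in the sequence
    core_sim: float    # core similarity, 0.0 to 1.0
    matrix_sim: float  # matrix similarity, 0.0 to 1.0


def best_per_family(matches, min_core_sim=0.75):
    """Keep only the best-scoring match per (family, site).

    Hits below the core similarity threshold are discarded first; the
    survivors are returned in order of occurrence in the sequence.
    """
    best = {}
    for m in matches:
        if m.core_sim < min_core_sim:
            continue
        key = (m.family, m.position)
        if key not in best or m.matrix_sim > best[key].matrix_sim:
            best[key] = m
    return sorted(best.values(), key=lambda m: m.position)


# Hypothetical hits: two GATA-family matrices compete for the same site.
example = [
    Match("V$GATA", "GATA1.02", 12, 1.00, 0.95),
    Match("V$GATA", "GATA3.02", 12, 1.00, 0.91),
    Match("V$FKHD", "XFD3.01", 40, 0.80, 0.88),
    Match("V$HOXF", "PTX1.01", 5, 0.70, 0.93),  # core similarity too low
]

for hit in best_per_family(example):
    print(f"{hit.family}/{hit.matrix} at {hit.position}")
```

Calling `best_per_family(example, min_core_sim=1.00)` applies the stricter core similarity quoted for sequence MCS-1, which in this made-up data leaves only the GATA-family hit.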

TABLE 26-continued TABLE 27-continued Sequences in Synthetic SpeI-NcoI fragment of pCL4 Sequences in Synthetic Not-SpeI Fragment of pGL4 TFBS in MCS-1 TFBS in bla-1 Before removal of TFBS from MCS-1 (14 matches Before removal of TFBS from bla-1 (94 matches Name of Name of family/matrix** Further Information Family/matrix** Further Information Elk-1 VSNKXHINKX31.01 prostate-specific homeodomain protein gut-enriched Krueppel-like factor NKX3.1 10 E2F, involved in cell cycle regulation, CREB,E4BP4O1 E4BP4, bZIP domain, transcriptional interacts with Rb p107 protein reeSSO ETSFNRF2.01 nuclear respiratory factor 2 BRN2. BRN2O1 POU factor Brn-2 (N-Oct 3) AP1F.VMAF.O1 CREB,E4BP4O1 E4BP4, bZIP domain, transcriptional XBBFFRFX1.01 X-box binding protein RFX1 reeSSO AREB, AREB6.04 AREB6 (Atp1a1 regulatory element KXHINKX31.01 prostate-specific homeodomain protein 15 binding factor 6) NKX3.1 c-Myb, important in hematopoesis, FIAZID.O1 Zinc finger with interaction domain cellular equivalent to avian P2F/CP2.01 CP2 myoblastosis virus oncogene V-myb RAC/BRACH.01 Brachyury v-Myb 'AX6 PAX6.01 PaX-6 paired domain protein . 
KXHINKX31.01 prostate-specific homeodomain protein PAR-type chicken vitellogenin NKX3.1 promoter-binding protein TEAFFTEF1.01 TEF-1 related muscle factor c-Myb, important in hematopoesis, ETSFELK1.02 Elk-1 cellular equivalent to avian myoblastosis virus oncogene V-myb **matches are listed in order of occurrence in the corresponding sequence GATAGATA3.02 GATA-binding factor 3 PAX8 PAX8.01 PAX 2/5/8 binding site 25 HNF4FEHNF4.02 Hepatic nuclear factor 4 E2FFE2FO1 E2F, involved in cell cycle regulation, interacts with Rb p107 protein TFBS in MCS-2 NFAT, NFAT.O1 Nuclear factor of activated T-cells After removal of TFBS from MCS-1 = before removal of TFBS from ECATNFYO2 nuclear factor Y (Y-box binding factor) MCS-2 (12 matches TBPFTATA.O2 Mammalian C-type LTRTATA box 30 MYT1 MYT1.02 MyT1 Zinc finger transcription factor Name of involved in primary neurogenesis family matrix** Further Information GATAGATA3.01 GATA-binding factor 3 CREB, CREB.O2 cAMP-responsive element binding VSGATAGATA.O1 GATA binding site (consensus) protein VSNKXHFNKX31.01 prostate-specific homeodomain protein winged helix protein, involved in hair NKX3.1 35 keratinization and thymus epithelium VSTBPFATATA.01 Avian C-type LTRTATA box differentiation VSCART/CART1.01 Cart-1 (cartilage homeoprotein 1) IRFFISRE.O1 interferon-stimulated response element VSCREB/E4BP4.01 E4BP4, bZIP domain, transcriptional repressor NRSFNRSE.O1 neural-restrictive-silencer-element VSBRN2 BRN2.01 POU factor Brn-2 (N-Oct 3) TCFF/TCF 11.01 TCF11 KCR-F1, Nrf1 homodimers VSCREB/E4BP4.01 E4BP4, bZIP domain, transcriptional repressor STATSTAT.O1 signal transducers and activators of transcription VSTBPFATATA.01 Avian C-type LTRTATA box 40 VSNKXHFNKX31.01 prostate-specific homeodomain protein ECATNFYO3 nuclear factor Y (Y-box binding factor) NKX3.1 OCT1 OCT1.05 octamer-binding factor 1 VSPAX6/PAX6.01 PaX-6 paired domain protein OCTP, OCT1 PO1 octamer-binding factor 1, POU-specific VSPAX8/PAX8.01 PAX 2/5/8 binding site domain VSPAX1 
PAX1.01 Pax1 paired domain protein, expressed in the homeo domain factor Nkx-2.5/CSX, developing vertebral column of mouse tinnan homolog low affinity sites embryos 45 PIT1. PIT1.01 Pitl, GHF-1 pituitary specific pou domain transcription factor **matches are listed in order of occurrence in the corresponding sequence CLOXFCDPCR3.01 cut-like homeodomain protein GREFARE.O1 Androgene receptor binding site GATAGATA1.04 GATA-binding factor 1 TFBS in MCS-3 E2TFE2.02 papilloma virus regulator E2 After removal of TFBS from MCS-2=before removal of 50 RPOAPOLYAO1 Mammalian C-type LTR PolyA signal VMYB.VMYB.O2 v-Myb TFBS from MCS-4 (0 matches) CEBPCEBPB.O1 CCAAT?enhancer binding protein beta TFBS in MCS-4 VBPFIVBPO1 PAR-type chicken vitellogenin promoter-binding protein After removal of TFBS from MCS-3 (0 matches) CREB, HLF.O1 hepatic leukemia factor 55 SF1FFSF1.01 SF1 steroidogenic factor 1 TABLE 27 XBBFAMIF1.01 MIBP-1/RFX1 complex IKRSIK2.01 Ikaros 2, potential regulator of Sequences in Synthetic Not-SpeI Fragment of pGL4 lymphocyte differentiation TFBS in bla-1 MINIMUSCLE INI.02 Muscle Initiator Sequence Before removal of TFBS from bla-1 (94 matches PCAT, CLTR CAATO1 Mammalian C-type LTRCCAAT box 60 PAXS PAXS.O1 B-cell-specific activating protein Name of family/matrix** Further Information RPADPADS.O1 Mammalian C-type LTR Poly A VSGATAGATA1.02 GATA-binding factor 1 downstream element VSHOXF/HOX1-3.01 Hox-1.3, vertebrate homeobox protein X-box binding protein RFX1 VSTBPFATATA.01 Avian C-type LTRTATA box CCAAT?enhancer binding protein beta VSETSFNRF2.01 nuclear respiratory factor 2 hepatic leukemia factor VSOCTP, OCT1 PO1 octamer-binding factor 1, POU-specific 65 hepatic nuclear factor 1 domain US 7,728,118 B2 131 132

TABLE 27-continued -continued Sequences in Synthetic Not-SpeI Fragment of pGL4 TFBS in bla-2 TFBS in bla-1 After removal of TFBS from bla-1 = before removal of TFBS from Before removal of TFBS from bla-1 (94 matches bla-2 = (51 matches Name of family/matrix** Further Information Name of family/matrix** Further Information prostate-specific homeodomain protein E2F, involved in cell cycle regulation, NKX3.1 interacts with Rb p107 protein XBBFFRFX1.01 X-box binding protein RFX1 10 NFAT, NFAT.O1 Nuclear factor of activated T-cells STATSTATO1 signal transducers and activators of ECATNFYO2 nuclear factorY (Y-box binding factor) transcription TBPFTATA.O2 Mammalian C-type LTRTATA box HNF1 HNF1.01 hepatic nuclear factor 1 MYT1 MYT1.02 MyT1 Zinc finger transcription factor HMYOS8.01 involved in primary neurogenesis SORYSOXS.O1 Sox-5 GATAGATA3.01 GATA-binding factor 3 RBITBRIGHTO1 Bright, B cell regulator of IgE 15 CREB, CREB.O2 cAMP-responsive element binding transcription protein homeo domain factor Nkx-2.5/CSX, winged helix protein, involved in hair inman homolog low affinity sites keratinization and thymus epithelium GATAGATA1.02 GATA-binding factor 1 differentiation BARB.BARBIE.O1 barbiturate-inducible element RSFNRSE.O1 neural-restrictive-silencer-element MTF1 MTF-101 Metal transcription factor 1, MRE CT T1 OCT1.OS octamer-binding factor 1 NFKBCREL.O1 c-Rel LOXFCDPCR3.01 cut-like homeodomain protein ETSF ELK1.02 Elk-1 REFARE.O1 Androgene receptor binding site CLOXFCDPO1 cut-like homeodomain protein ATAGATA1.04 GATA-binding factor 1 RPOALPOLYAO1 Lentiviral PolyA signal EBPCEBPB.O1 CCAAT?enhancer binding protein beta GATAGATA1.03 GATA-binding factor 1 REBFHLF.O1 hepatic leukemia factor ZFIAZID.O1 Zinc finger with interaction domain BPFIVBPO1 PAR-type chicken vitellogenin WHZFWHN.O1 winged helix protein, involved in hair 25 promoter-binding protein keratinization and thymus epithelium BBF MIF1.01 MIBP-1/RFX1 complex differentiation IKR S IK2.01 karos 2, potential regulator 
of V PAX1 PAX1.01 Pax1 paired domain protein, expressed ymphocyte differentiation in the developing vertebral column of PAXSAPAXS.O1 B-cell-specific activating protein mouse embryos FFRFX1.02 X-box binding protein RFX1 GATALMO2COM.02 complex of Limo2 bound to Tal-1, E2A 30 PiCEBPB.O1 CCAAT?enhancer binding protein beta proteins, and GATA-1, half-site 2 BiHLF.O1 hepatic leukemia factor NRSFNRSFO1 neuron-restrictive silencer factor FFRFX1.02 X-box binding protein RFX1 AP4RTAL1BETAE47.01 Tal-1beta E47 heterodimer GATA102 GATA-binding factor 1 G complex of Limo2 bound to Tal-1, E2A B BiBARBIE.O1 barbiturate-inducible element proteins, and GATA-1, half-site 2 fMTF-101 Metal transcription factor 1, MRE GATA-binding factor 1 35 FKB,CREL.O1 c-Rel XBBFFRFX1.01 X-box binding protein RFX1 TSFELK1.02 Elk-1 AHRRAHRARNT.O2 aryl hydrocarbon/Arntheterodimers, fixed core BPFTATA.O1 cellular and viral TATA box elements PAXS PAX9.01 Zebrafish PAX9 binding sites MEIS, MEIS1.01 Homeobox protein MEIS1 binding site LOXFCDP.O2 transcriptional repressor CDP HOXFAHOXA9.01 Member of the vertebrate HOX- cluster ATAGATA1.01 GATA-binding factor 1 of homeobox factors i. 
P1FFTCF11MAFG.O1 TCF11/MafGheterodimers, binding to 40 GATAGATA1.03 GATA-binding factor 1 subclass of AP1 sites MEIS, MEIS1.01 Homeobox protein MEIS1 binding site RN2 BRN2O1 POU factor Brn-2 (N-Oct 3) NOLFOLF1.01 olfactory neuron-specific factor y KXEHANKX25.02 homeo domain factor Nkx-2.5/CSX, AP4RTAL1BETAE47.01 Tal-1 beta E47 heterodimer inman homolog low affinity sites GATAGATA1.02 GATA-binding factor 1 XBBFFRFX1.01 X-box binding protein RFX1 nuclear factor Y (Y-box binding factor) 45 Forkhead RElated ACtivator-4 AHRRAHRARNT.O2 aryl hydrocarbon/Arntheterodimers, Nuclear factor of activated T-cells fixed core interferon regulatory factor 1 PAXS PAX9.01 Zebrafish PAX9 binding sites E2F, involved in cell cycle regulation, CLOXFCDP.O2 transcriptional repressor CDP interacts with Rb p107 protein GATAGATA1.01 GATA-binding factor 1 IRFFAIRF1.01 50 interferon regulatory factor 1 **matches are listed in order of occurrence in the corresponding sequence E2FFE2FO2 E2F, involved in cell cycle regulation, interacts with Rb p107 protein **matches are listed in order of occurrence in the corresponding sequence

TFBS in bla-2 55 After removal of TFBS from bla-1 = before removal of TFBS from bla-2 = (51 matches TFBS in bla-3 Name of family/matrix** Further Information After removal of TFBS from bla-2 = before removal of TFBS from bla-3 = (16 matches VSGATAGATA1.02 GATA-binding factor 1 60 VSETSFNRF2.01 nuclear respiratory factor 2 Name of VSOCTP, OCT1 PO1 octamer-binding factor 1, POU-specific family/matrix** Further Information domain VSETSF/ELK1.02 Elk-1 nuclear respiratory factor 2 VSEBOXINMYC.01 E2F, involved in cell cycle regulation, interacts VSGATAGATA3.02 GATA-binding factor 3 with Rb p107 protein VSPAX8/PAX8.01 PAX 2/5/8 binding site 65 VSNFAT, NFAT.O1 Nuclear factor of activated T-cells VSHNF4HNF4.02 Hepatic nuclear factor 4 VSTBPFTATA.02 Mammalian C-type LTRTATA box US 7,728,118 B2 133 134

-continued TABLE 28 TFBS in bla-3 Sequences in Synthetic NotI-NcoI Fragment of pCL4 After removal of TFBS from bla-2 = before removal of TFBS from bla-3 = (16 matches Before removal of TFBS from pCL4B-4NN = (11 matches Name of Name of family/matrix** Further Information family/matrix** Further Information VSMYT1 MYT1.02 MyT1 Zinc finger transcription factor involved in VSSMAD. FAST1.01 FAST-1 SMAD interacting protein primary neurogenesis 10 VSSMAD. FAST1.01 FAST-1 SMAD interacting protein winged helix protein, involved in hair VSETSF/FLI.01 ETS family member FLI keratinization and thymus epithelium VSRBPFRBPJKO1 Mammalian transcriptional repressor RBP differentiation Jkappa CBF1 SORYSOXS.O1 Sox-5 VSETSF/FLI.01 ETS family member FLI CEBPCEBPB.O1 CCAAT?enhancer binding protein beta VSEBOX/USF.O2 upstream stimulating factor CREB, HLFO1 hepatic leukemia factor 15 VSCEBPCEBPB.01 CCAAT?enhancer binding protein beta VBPFVBPO1 PAR-type chicken Vitellogenin promoter-binding VSGATAGATA3.01 GATA-binding factor 3 protein VSWHZF/WHN.01 winged helix protein, involved in hair PAXS PAXS.O1 B-cell-specific activating protein keratinization and thymus epithelium XBBFFRFX1.02 X-box binding protein RFX1 differentiation CREB, HLFO1 hepatic leukemia factor VSETSFNRF2.01 nuclear respiratory factor 2 GATAGATA1.03 GATA-binding factor 1 VSTBPFATATA.01 Avian C-type LTRTATA box MEIS, MEIS1.01 Homeobox protein MEIS1 binding site NOLF.OLF1.01 olfactory neuron-specific factor **matches are listed in order of occurrence in the corresponding sequence **matches are listed in order of occurrence in the corresponding sequence

TFBS in bla-4
After removal of TFBS from bla-3 = before removal of TFBS from bla-4 (14 matches)

Name of family/matrix**   Further Information
                          nuclear respiratory factor 2
                          Nuclear factor of activated T-cells
                          winged helix protein, involved in hair keratinization and thymus epithelium differentiation
V$GATA/GATA3.01           GATA-binding factor 3
V$CEBP/CEBPB.01           CCAAT/enhancer binding protein beta
V$EBOX/USF.02             upstream stimulating factor
V$PAX5/PAX5.01            B-cell-specific activating protein
V$XBBF/RFX1.02            X-box binding protein RFX1
V$GATA/GATA1.03           GATA-binding factor 1
V$MEIS/MEIS1.01           Homeobox protein MEIS1 binding site
V$ZFIA/ZID.01             Zinc finger with interaction domain
V$WHZF/WHN.01             winged helix protein, involved in hair keratinization and thymus epithelium differentiation
V$PAX1/PAX1.01            Pax1 paired domain protein, expressed in the developing vertebral column of mouse embryos
V$GATA/LMO2COM.02         complex of Lmo2 bound to Tal-1, E2A proteins, and GATA-1, half-site 2

**matches are listed in order of occurrence in the corresponding sequence

TFBS in bla-5
After removal of TFBS from bla-4 (5 matches)

Name of family/matrix**   Further Information
V$ETSF/NRF2.01            nuclear respiratory factor 2
V$WHZF/WHN.01             winged helix protein, involved in hair keratinization and thymus epithelium differentiation
V$GATA/GATA3.01           GATA-binding factor 3
V$CEBP/CEBPB.01           CCAAT/enhancer binding protein beta
V$EBOX/USF.02             upstream stimulating factor

**matches are listed in order of occurrence in the corresponding sequence

After removal of TFBS from pGL4B-4NN = before removal of TFBS from pGL4B-4NN1 (7 matches)

Name of family/matrix**   Further Information
                          nuclear respiratory factor 2
                          winged helix protein, involved in hair keratinization and thymus epithelium differentiation
V$CEBP/CEBPB.01           CCAAT/enhancer binding protein beta
V$EBOX/USF.02             upstream stimulating factor
V$ETSF/FLI.01             ETS family member FLI
V$SMAD/FAST1.01           FAST-1 SMAD interacting protein
V$SMAD/FAST1.01           FAST-1 SMAD interacting protein

**matches are listed in order of occurrence in the corresponding sequence

After removal of TFBS from pGL4B-4NN1 = before removal of TFBS from pGL4B-4NN2 (4 matches)

Name of family/matrix**   Further Information
V$ETSF/NRF2.01            nuclear respiratory factor 2
V$WHZF/WHN.01             winged helix protein, involved in hair keratinization and thymus epithelium differentiation
V$CEBP/CEBPB.01           CCAAT/enhancer binding protein beta
V$EBOX/USF.02             upstream stimulating factor

**matches are listed in order of occurrence in the corresponding sequence

After removal of TFBS from pGL4B-4NN2 (3 matches)

Name of family/matrix**   Further Information
                          upstream stimulating factor
                          winged helix protein, involved in hair keratinization and thymus epithelium differentiation
V$ETSF/NRF2.01            nuclear respiratory factor 2

**matches are listed in order of occurrence in the corresponding sequence
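The match counts stated in the table headings above (94, 51, 16, 14 and 5 for bla-1 through bla-5; 11, 7, 4 and 3 for pGL4B-4NN through pGL4B-4NN3) can be tabulated to show how much each redesign round removed. The following Python sketch is illustrative only: the function name and data layout are assumptions, and only the counts themselves are taken from the headings.

```python
def stepwise_reduction(counts):
    """For each (name, matches) pair, report the matches removed since the
    previous version and the percentage of the starting count that remains."""
    rows = []
    start = counts[0][1]
    previous = None
    for name, matches in counts:
        removed = 0 if previous is None else previous - matches
        rows.append((name, matches, removed, round(100.0 * matches / start, 1)))
        previous = matches
    return rows


# Counts copied from the table headings; the list layout is an assumption.
BLA = [("bla-1", 94), ("bla-2", 51), ("bla-3", 16), ("bla-4", 14), ("bla-5", 5)]
PGL4B = [("pGL4B-4NN", 11), ("pGL4B-4NN1", 7), ("pGL4B-4NN2", 4), ("pGL4B-4NN3", 3)]

if __name__ == "__main__":
    for series in (BLA, PGL4B):
        for name, matches, removed, pct in stepwise_reduction(series):
            print(f"{name:12s} matches={matches:3d} removed={removed:3d} remaining={pct}%")
```

The per-round view makes it apparent that most of the reduction for bla occurs in the first two rounds, with later rounds removing only a few remaining sites.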

TABLE 29 -continued Sequences in Synthetic Spel-NcoI section of pGL4 TFBS in SpeI-NcoI-Ver2 TFBS in SpeI-NcoI-Ver2-start After removal of TFBS from Spel-Neo-Ver2-start (28 matches Before removal of TFBS from Spel-NCOI-Ver2-start (34 matches Family/matrix** Further Information Family/matrix** Further Information CART CART1.01 Cart-1 (cartilage homeoprotein 1) PAX8 PAX8.01 PAX 2/5/8 binding site NKXHFNKX25.02 Homeo domain factor NkX-2.5/CSX, GATAGATA1.02 GATA-binding factor 1 inman homolog low affinity sites CREB,E4BP4O1 E4BP4, bZIP domain, transcriptional 10 CDXFCDX2.01 Cox-2 mammalian caudal related repressor intestinal transcr. factor NKXHFNKX31.01 Prostate-specific homeodomain protein BRNFBRN3.01 POU transcription factor Brn-3 NKX3.1 TBPFTATA.O2 Mammalian C-type LTRTATA box Avian C-type LTRTATA box FKHD,FREAC3.01 Fork head related activator-3 (FOXC1) E4BP4, bZIP domain, transcriptional OCT1, OCT1.02 Octamer-binding factor 1 repressor 15 CART CART1.01 Cart-1 (cartilage homeoprotein 1) Prostate-specific homeodomain protein PDX1 PDX1.01 Pdx1 (IDX1/IPF1) pancreatic and NKX3.1 intestinal homeodomain TF CART CART1.01 Cart-1 (cartilage homeoprotein 1) PARFFDBPO1 Albumin D-box binding protein NKXHFNKX25.02 Homeo domain factor NkX-2.5/CSX, GATAGATA3.02 GATA-binding factor 3 inman homolog low affinity sites VBPFIVBPO1 PAR-type chicken vitellogenin ETSF ELK1.01 Elk-1 promoter-binding protein CDXFCDX2.01 Cdx-2 mammalian caudal related AP4RTAL1ALPHAE47.01 Tal-1alpha E47 heterodimer intestinal transcr. 
factor RPS8.RPS8.01 Zinc finger protein RP58 (ZNF238), RNFBRN3.01 POU transcription factor Brn-3 associated preferentially with BPFATATA.O2 Mammalian C-type LTRTATA box heterochromatin KHD,FREAC3.01 Fork head related activator-3 (FOXC1) COMPCOMP1.01 COMP1, cooperates with myogenic CT1, OCT1.02 Octamer-binding factor 1 proteins in multicomponent complex ARTICART1.01 Cart-1 (cartilage homeoprotein 1) 25 CLOXFCLOXO1 Clox DX1.PDX1.01 Pdx1 (IDX1/IPF1) pancreatic and TBPFATATA.O1 Avian C-type LTRTATA box intestinal homeodomain TF PBXCPBX1 MEIS1.02 Binding site for a PbX1/Meis1 PARFDBPO1 Albumin D-box binding protein heterodimer GATAGATA3.02 GATA-binding factor 3 PBXFPBX1.01 Homeo domain factor Pbx-1 VBPFVBPO1 PAR-type chicken vitellogenin IRFFAIRF1.01 interferon regulatory factor 1 promoter-binding protein 30 TEAFTEF1.01 TEF-1 related muscle factor AP4RTAL1ALPHAE47.01 Tal-1 alpha E47 heterodimer RPS8.RPS8.01 Zinc finger protein RP58 (ZNF238), *matches are listed in order of occurrence in the corresponding sequence associated preferentially with heterochromatin COMPCOMP1.01 COMP1, cooperates with myogenic The number of consensus transcription factor binding sites proteins in multicomponent complex 35 present in the vector backbone (including the amplicillin resis CLOXFCLOXO1 Clox tance gene) was reduced from 224 in pGL3 to 40 in pGL4, and TBPFATATA.O1 Avian C-type LTRTATA box the number of promoter modules was reduced from 10 in PBXCPBX1 MEIS1.02 Binding site for a PbX1/Meis1 heterodimer pGL3 to 4 for pGL4, using databases, search programs and PBXFPBX1.01 Homeo domain factor Pbx-1 the like as described herein. Other modifications in pGL4 IRFFAIRF1.01 interferon regulatory factor 1 40 relative to pGL3 include the removal of the fl origin of TEAFTEF1.01 TEF-1 related muscle factor replication and the redesign of the multiple cloning region. 
EBOXFATF 6.01 Member of b-zip family, induced by ER damage/stress, binds to the ERSE in MCS-1 to MCS-4 have the following sequences (SEQ ID association with NF-Y Nos:76-79) V Homeodomain protein NKX3.2 (BAPX1, NKX3B, Bagpipe homolog) 45 V E2TFE2.02 Papilloma virus regulator E2 MCS-1 V EVI1 EVI1.05 Ecotropic viral integration site 1 ACTAGTCGTCTCTCTTGAGAGACCGCGATCGCCACCATGATAAGTAA encoded factor V GATAGATA3.02 GATA-binding factor 3 GTAATATTTAAATAAGTAAGGCCTGAGTGGCCCTCGAGCCAGCCTTGA

**matches are listed in order of occurrence in the corresponding sequence 50 GTTGGTTGAGTCCAAGTCACGTCTGGAGATCTGGTACCTACGCGTGA

GCTCTACGTAGCTAGCGGCCTCGGCGGCCGAATTCTTGCGATCTAAG

TAAGCTTGGCATTCCGGTACTGTTGGTAAAGCCACCATGG TFBS in SpeI-NcoI-Ver2 After removal of TFBS from Spel-Neo-Ver2-start (28 matches 55 MCS-2 ACTAGTACGTCTCTCTTGAGAGACCGCGATCGCCACCATGATAAGTA Family/matrix** Further Information AGTAATATTAAATAAGTAAGGCCTGAGTGGCCCTCGAGTCCAGCCTT VSPAX8/PAX8.01 PAX 2/5/8 binding site VSGATAGATA1.02 GATA-binding factor 1 GAGTTGGTTGAGTCCAAGTCACGTGTGGAGATCTGGTACCTTACGCGT VSCREB/E4BP4.01 E4BP4, bZIP domain, transcriptional 60 repressor AGAGCTCTACGTAGCTAGCGGCCTGGGCGGCCGAATTCTTGCGATCT Prostate-specific homeodomain protein NKX3.1 AAGCTTGGCAATCCGGTACTGTTGGTAAAGCCACCATGG VSTBPFATATA.01 Avian C-type LTRTATA box VSCREB/E4BP4.01 E4BP4, bZIP domain, transcriptional MCS-3 repressor ACTAGTACGTCTCTCTTGAGAGACCGCGATCGCATGCCTAGGTAGGT Prostate-specific homeodomain protein 65 NKX3.1 AGTATTAGAGCATAGGTAGAGGCCTAAGTGGCCCTCGAGTCCAGGCT
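As a rough illustration of how a multiple cloning region such as MCS-1 can be inspected, the Python sketch below scans a sequence for a few restriction enzyme recognition sequences. The MCS-1 string is copied from the text as printed here and may carry transcription errors, so it should be checked against SEQ ID NO:76 before any real use. The enzyme selection and the helper name `find_sites` are assumptions made for illustration.

```python
# Recognition sequences for a few common restriction enzymes.
SITES = {
    "SpeI": "ACTAGT",
    "NcoI": "CCATGG",
    "XhoI": "CTCGAG",
    "BglII": "AGATCT",
    "KpnI": "GGTACC",
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
}

# MCS-1 as transcribed in this text (assumption: no transcription errors).
MCS_1 = (
    "ACTAGTCGTCTCTCTTGAGAGACCGCGATCGCCACCATGATAAGTAA"
    "GTAATATTTAAATAAGTAAGGCCTGAGTGGCCCTCGAGCCAGCCTTGA"
    "GTTGGTTGAGTCCAAGTCACGTCTGGAGATCTGGTACCTACGCGTGA"
    "GCTCTACGTAGCTAGCGGCCTCGGCGGCCGAATTCTTGCGATCTAAG"
    "TAAGCTTGGCATTCCGGTACTGTTGGTAAAGCCACCATGG"
)


def find_sites(seq, sites=SITES):
    """Return {enzyme: [0-based start positions]} for every recognition
    sequence found on the given strand, including overlapping occurrences."""
    seq = seq.upper()
    hits = {}
    for name, motif in sites.items():
        positions = []
        i = seq.find(motif)
        while i != -1:
            positions.append(i)
            i = seq.find(motif, i + 1)
        hits[name] = positions
    return hits


if __name__ == "__main__":
    for enzyme, positions in find_sites(MCS_1).items():
        print(enzyme, positions)
```

The transcribed sequence begins with the SpeI recognition sequence and ends with the NcoI recognition sequence, which is consistent with the "SpeI-NcoI" naming used for this region.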


- Continued - Continued tatic gccactggcaggagccactgg talacaggattagcagagcgaggitat actttgtc.cgcct coat coagt citatgagctgctgtcgtgatgctagagt gtaggcggtgct acagagttcttgaagtggtggcctaact acggctacac aagaagttcgc.ca.gtgagtag titt cogalagagttgttggc cattgctactg tagaagaacagt atttggt atctg.cgctctgctgaa.gc.cagttacct tcg gcatcgtggitat cacgctcgt.cgttctgg tatggctt.cgttcaact ctggit gaaaaagagttggtagcticttgatc.cggcaaacaaaccaccgctggtagc to coagcggt caa.gc.cgggtcacatgat Cacc catatt atgaagaaatgc ggtggitttittttgtttgcaa.gcagcagattacgc.gcagaaaaaaaggat C agticagotcCttagggcct Cogatcgttgtcagaagtaagttggcc.gcgg 10 t caagaagat.cctittgatcttittctacggggtctgacgct cagtggaacg tgttgtc.gct catgg taatggcagdact acacaattct cittac.cgt.catg aaaact cacgittaagggattittggit catgagattatcaaaaaggat.cttic ccatcc.gtaagatgcttitt cogt gaccc.gc.gagtact caac caagttcgtt acctagat cottittaaattaaaaatgaagttittaaatcaatctaaagtat ttgtgagtag titat acggcgaccaa.gctgct Cttgc.ccggcgt.ctatac 15 atatgagtaaacttggtctgacagcggcc.gcaaatgctaalaccactgcag ggga caac accgc.gc.ca catagcagtactittgaaagtgct catcatcggg tggittaccagtgcttgat cagtgaggcaccgatctgagcgatctgcctat aatcgttctt.cgggg.cggaaagacticaaggat Cttgcc.gct attgagat c titcgttcgt.ccatagtggcctgact coccgt.cgtgtagat cactacgatt cagttcgatatagcc cact cittgcacccagttgat citt cagcatcttitta cgtgagggct taccatcaggc cc cagogcagcaatgatgcc.gc.gaga.gc.c Cttt Caccagcgttt C9ggtgtgcaaaaacaggcaa.gcaaaatgcc.gca gcgttcaccggc.ccc.cgatttgtcagcaatgalaccagc.ca.gcagggaggg aagaagggaatgagtgcgacacgaaaatgttggatgct catact.cgt.cct cc.gagcgaagaagtggit Cotgct actttgtcc.gc.ct coat coagtictatg ttittcaat attattgaagcatttatcagggittac tagtacgtc.t.ct caag agctgctgtcgtgatgctagagtaagaagttcgc.ca.gtgagtag titt cog 25 gatalagtaagtaat attalaggtacgggagg tattggacaggcc.gcaataa aagagttgttggc cattgct actggcatcgtggitat cacgct citcgttctg aatat ctittattitt cattacatctgttgttgttggitttitttgttgttgaatcga gtatggct tcgttcaactctggttcc.cagoggtcaa.gc.cgggt cacatga tag tactaacatacgct ct coat caaaacaaaacgaaacaaaacaaacta t cacccatat tatgaagaaatgcagt cagotcct 
tagggcctic catcgt. 30 gcaaaataggctgtc.cccagtgcaagtgcaggtgcc agaac attt Ctctg tgtcagaagtaagttggcc.gcggtgttgtc.gct catggtaatggcagcac gcctaactgg.ccggtacctgagotcgctagoctogaggatat Caagat ct tacaca attct cittaccgt catgccatcc.gtaagatgcttitt.ccgtgacc ggccticggcggccaa.gcttggcaatc.cggtactgttggtaaagccaccat ggcgagtact caac Caagt cattttgtgagtagtgtatacggc gaccalag 35 99. Ctgcticttgc.ccggcgt.ct at acggga caacaccgc.gc.ca catagcagta EXAMPLE 10 Ctttgaaagtgct catcatcgggaatcgttct tcgggg.cggaaagactica aggatc.ttgcc.gctattgagatccagttcgatatagcc cactic ttgcacc 40 Summary of Sequences Removed in Synthetic Genes cagttgat citt cagcatcttt tactitt caccagogttt cqgggtgtgcaa Search Parameters: aaac aggcaa.gcaaaatgcc.gcaaagaagggaatgagtgcgacacgaaaa TFBS searches were limited to vertebrate TF binding sites. tgttggatgct catact cqtcctttittcaat attattgaagcatttatca 45 Searches were performed by matrix family, i.e., the results show only the best match from a family for each site. Matin gggttact agtacgt.ct ct caaggataagtaagtaat attalaggtacggg spector default parameters were used for the core and matrix similarity values (core similarity=0.75, matrix agg tattgga caggcc.gcaataaaatat ctittattitt cattacatctgtg similarity=optimized), except for sequence MCS-1 (core tgttggitttitttgttgttgaatcgatag tactaa catacgct citc catcaaa 50 similarity=1.00, matrix similarity-optimized). Promoter module searches included all available promoter acaaaacgaaacaaaacaaactagdaaaataggctgtc.cccagtgcaagt modules (vertebrate and others) and were performed using gCaggtgc.ca.gaac atttctict. default parameters (optimized threshold or 80% of maximum 55 score). The pGL4 backbone (NotI-NcoI) has the following sequence: Splice site searches were performed for splice acceptor or donor consensus sequences.
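The family-level reporting described in the search parameters (only the best match from a matrix family is shown for each site) can be illustrated with a small filter. This is a sketch under stated assumptions and not MatInspector's implementation: the `Hit` record, its field names and the example scores are hypothetical, a "site" is simplified to an identical start position and strand, and the per-matrix "optimized" matrix similarity threshold is not modeled.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one matrix match (all fields are assumptions)."""
    family: str
    matrix: str
    position: int
    strand: str
    core_sim: float
    matrix_sim: float


def best_per_family(hits, min_core=0.75):
    """Drop hits below the core similarity cutoff, then keep only the hit with
    the highest matrix similarity for each (family, position, strand)."""
    best = {}
    for hit in hits:
        if hit.core_sim < min_core:
            continue
        key = (hit.family, hit.position, hit.strand)
        if key not in best or hit.matrix_sim > best[key].matrix_sim:
            best[key] = hit
    return sorted(best.values(), key=lambda h: (h.position, h.family))


# Invented example values, for illustration only.
EXAMPLE_HITS = [
    Hit("V$GATA", "GATA1.02", 12, "+", 1.00, 0.95),
    Hit("V$GATA", "GATA1.03", 12, "+", 0.90, 0.91),
    Hit("V$ETSF", "ELK1.02", 40, "-", 0.70, 0.88),
]

if __name__ == "__main__":
    for hit in best_per_family(EXAMPLE_HITS):
        print(hit)
```

Raising `min_core` to 1.00, the value stated for sequence MCS-1, would keep only matches whose core region matches perfectly.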

(SEO ID NO: 74) TABLE 31 gcggcc.gcaaatgctaalaccactgcagtggittaccagtgcttgat cagtg 60 TFBS aggcaccgat ct cagogatctgcct attitcgttcgt.cc at agtggcctga Matrix (family Promoter Splice sites Sequence Library matches modules (+Strand) Ctcc.ccgt.cgtgtagat Cactacgatt.cgtgagggcttaccat Caggcc C puro (not 62 5 O Cagcgcagcaatgatgcc.gc.gaga.gc.cgcgttcaccggc.ccc.cgatttgt applicable) 65 hpuro (not 68 4 1 Cagcaatgalaccagc.ca.gcagggagggc.cgagcgaagaagtggtc.ctgct applicable) US 7,728,118 B2 151 152

TABLE 31-continued TABLE 31-continued

TFBS TFBS Matrix (family Promoter Splice sites Matrix (family Promoter Splice sites Sequence Library matches modules (+Strand) Sequence Library matches modules (+Strand) hpuro1 Wer 4.1 4 2 1 hluc-ver2 Wer 3.0 187 2 8 February November 2004 2002 hpuro2 Wer 4.1 hluc-ver2 Wer 3.0 No data No data No data February 10 November 2004 2002 hluc-ver2 Wer 3.0 35 No data Neo (not 53 No data November applicable) 2002 (O (not 61 hluc-ver2 Wer 3.0 No data No data No data applicable) 15 November hneo-1 Wer 3.1 .2 June No data No data No data 2002 2003 hluc-ver2 Wer 3.0 No data No data No data hneo-2 Wer 3.1 .2 June No data No data No data November 2003 2002 hneo-3 Wer 3.1 .2 June 2003 hluc-ver2 Wer 3.0 hneo-4 Wer 4.1 November February 2002 2004 hluc-ver2 Wer 3.1.1 hneo-5 Wer 4.1 April February 2003 2004 hluc-ver2 Wer 3.1.1 25 April (not 74 No data 2003 applicable) hluc-ver2 Wer 3.1.1 (not 94 April applicable) 2003 Wer 3.1 .2 June No data No data No data hluc-ver2 Wer 3.1.1 2003 30 April Wer 3.1 .2 June No data No data No data 2003 2003 hluc-ver2 B10 Ver 3.1.1 Wer 3.1 .2 June April 2003 2003 Wer 3.3 35 August MCS-1 Wer 22 14 No data 2003 September Wer 3.3 2001 August MCS-2 Wer 22 12 No data 2003 September 40 2001 LllC (not 213 No data MCS-3 Wer 22 No data applica ble) September LC (not 189 No data 2001 applica ble) MCS-4 Wer 23 hluc-ver2A1 Wer 3.0 110 February Novem e 45 2001 2002 hluc-ver2A2 Wer 3.0 No data No data No data Bla (no No data No data Novem e applicable) 2002 bla-1 Wer 22 94 Wer 3.0 No data September Novem e 50 2OO 2002 bla-2 Wer 23 51 No data hluc-ver2A4 Wer 3.0 No data No data No data February Novem e 2OO 2002 bla-3 Wer 23 16 No data Wer 3.0 No data No data No data February Novem e 55 2OO 2002 Wer 23 14 No data hluc-ver2A6 Wer 3.0 February Novem e 2OO 2002 bla-5 Wer 23 hluc-ver2A6 Wer 3.1 February April 60 2OO 2003 Wer 3.1 Ver 24 May April 2002 2003 Ver 24 May 7 Wer 3.1 2002 April 65 Ver 24 May 2003 2002 US 7,728,118 B2 154 VSPAX5 (PAX-5/PAX-9 B-cell-specific 
TABLE 31-continued

Sequence               TFBS Library            Matrix (family) matches   Promoter modules   Splice sites (+strand)
pGL4B-4NN3             Ver 2.4 May 2002        3                         0                  (not applicable)
SpeI-NcoI-Ver2-Start   Ver 4.0 November 2003   34                        1                  (not applicable)
SpeI-NcoI-Ver2         Ver 4.0 November 2003   28                        1                  (not applicable)

Using the 5 sequences, i.e., hluc+ver2A1, bla-1, hneo-1, hpuro-1, hhyg-1 (humanized codon usage) for analysis, TFBS from the following families were found in 3 out of 5 sequences:

V$AHRR (AHR-arnt heterodimers and AHR-related factors)
V$ETSF (Human and murine ETS1 factors)
V$NFKB (Nuclear Factor Kappa B/c-rel)
V$VMYB (AMV-viral myb oncogene)
V$CDEF (Cell cycle regulators: Cell cycle dependent element)
V$HAND (bHLH transcription factor dimer of HAND2 and E12)
V$NRSF (Neuron-Restrictive Silencer Factor)
V$WHZF (Winged Helix and ZF5 binding sites)
V$CMYB (C-myb, cellular transcriptional activator)
V$MINI (Muscle INItiator)
V$P53F (p53 tumor suppr.-neg. regulat. of the tumor suppr. Rb)
V$ZF5F (ZF5 POZ domain zinc finger)
V$DEAF (Homolog to deformed epidermal autoregulatory factor-1 from D. melanogaster)
V$MYOD (MYOblast Determining factor)
V$PAX5 (PAX-5/PAX-9 B-cell-specific activating protein)
V$EGRF (EGR/nerve growth Factor Induced protein C & rel. fact.)
V$NEUR (NeuroD, Beta2, HLH domain)
V$REBV (Epstein-Barr virus transcription factor R);

TFBS from the following families were found in 4 out of 5 sequences:

V$ETSF (Human and murine ETS1 factors)
V$CDEF (Cell cycle regulators: Cell cycle dependent element)
V$HAND (bHLH transcription factor dimer of HAND2 and E12)
V$NRSF (Neuron-Restrictive Silencer Factor)
V$PAX5 (PAX-5/PAX-9 B-cell-specific activating protein)
V$NEUR (NeuroD, Beta2, HLH domain); and

TFBS from the following families were found in 5 out of 5 sequences:

V$PAX5 (PAX-5/PAX-9 B-cell-specific activating protein).

REFERENCES

Altschul et al., Nucl. Acids Res., 25, 3389 (1997).
Aota et al., Nucl. Acids Res., 16, 315 (1988).
Boshart et al., Cell, 41, 521 (1985).
Bronstein et al., Anal. Biochem., 219, 169 (1994).
Corpet et al., Nucl. Acids Res., 16, 881 (1988).
deWet et al., Mol. Cell. Biol., 7, 725 (1987).
Dijkema et al., EMBO J., 4, 761 (1985).
Faist and Meyer, Nucl. Acids Res., 20, 26 (1992).
Gorman et al., Proc. Natl. Acad. Sci. USA, 79, 6777 (1982).
Higgins et al., Gene, 73, 237 (1985).
Higgins et al., CABIOS, 5, 151 (1989).
Huang et al., CABIOS, 8, 155 (1992).
Itolcik et al., PNAS, 94, 12410 (1997).
Johnson et al., Mol. Reprod. Devel., 50, 377 (1998).
Jones et al., Mol. Cell. Biol., 17, 6970 (1997).
Karlin and Altschul, Proc. Natl. Acad. Sci. USA, 87, 2264 (1990).
Karlin and Altschul, Proc. Natl. Acad. Sci. USA, 90, 5873 (1993).
Keller et al., J. Cell Biol., 84, 3264 (1987).
Kim et al., Gene, 91, 217 (1990).
Lamb et al., Mol. Reprod. Devel., 51, 218 (1998).
Maniatis et al., Science, 236, 1237 (1987).
Michael et al., EMBO J., 9, 481 (1990).
Mizushima and Nagata, Nucl. Acids Res., 18, 5322 (1990).
Murray et al., Nucl. Acids Res., 17, 477 (1989).
Myers and Miller, CABIOS, 4, 11 (1988).
Nakamura et al., Nucl. Acids Res., 28, 292 (2000).
Needleman and Wunsch, J. Mol. Biol., 48, 443 (1970).
Pearson and Lipman, Proc. Natl. Acad. Sci. USA, 85, 2444 (1988).
Pearson et al., Meth. Mol. Biol., 24, 307 (1994).
Sharp et al., Nucl. Acids Res., 16, 8207 (1988).
Sharp et al., Nucl. Acids Res., 15, 1281 (1987).
Smith and Waterman, Adv. Appl. Math., 2, 482 (1981).
Stemmer et al., Gene, 164, 49 (1995).
Uetsuki et al., J. Biol. Chem., 264, 5791 (1989).
Voss et al., Trends Biochem. Sci., 11, 287 (1986).
Wada et al., Nucl. Acids Res., 18, 2367 (1990).
Watson et al., eds., Recombinant DNA: A Short Course, Scientific American Books, W. H. Freeman and Company, New York (1983).
Wood, K., Photochemistry and Photobiology, 62, 662 (1995).
Wood, K., Science, 244, 700 (1989).

All publications, patents and patent applications are incorporated herein by reference. While in the foregoing specification, this invention has been described in relation to certain preferred embodiments thereof, and many details have been set forth for purposes of illustration, it will be apparent to those skilled in the art that the invention is susceptible to additional embodiments and that certain of the details herein may be varied considerably without departing from the basic principles of the invention.
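The "3 out of 5", "4 out of 5" and "5 out of 5" tallies above amount to counting, for each matrix family, the number of analyzed sequences in which it has at least one match. A minimal Python sketch of that count follows. The function name, the input mapping and the placeholder data are assumptions for illustration and are not the patent's data.

```python
from collections import Counter


def families_in_at_least(families_by_sequence, k):
    """Return, sorted, the families that occur in at least k of the sequences.

    families_by_sequence maps a sequence name to the collection of matrix
    families with one or more matches in that sequence."""
    counts = Counter()
    for families in families_by_sequence.values():
        counts.update(set(families))  # count each family once per sequence
    return sorted(name for name, n in counts.items() if n >= k)


# Placeholder input (invented): three sequences and generic family labels.
EXAMPLE = {
    "seq_a": {"FAM_A", "FAM_B"},
    "seq_b": {"FAM_A"},
    "seq_c": {"FAM_A", "FAM_B", "FAM_C"},
}

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, families_in_at_least(EXAMPLE, k))
```

Because the test is "at least k", a family that appears in every sequence is also reported for every smaller k, which matches the way V$PAX5 appears in all three lists above.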

SEQUENCE LISTING

<160> NUMBER OF SEQ ID NOS: 97

<210> SEQ ID NO 1
<211> LENGTH: 795
<212> TYPE: DNA
<213> ORGANISM: Unknown
<220> FEATURE:
<223> OTHER INFORMATION: Neo from neomycin gene from Promega's pCI-neo.

SEQUENCE: 1 atgattgaac aagatggatt gcacgcaggt tct Coggcc.g Cttgggtgga gaggctatt C 6 O ggctatgact ggg cacaiaca gacaatcggc tgctctgatg ccggctgtca 12 O gcgc aggggc gcc.cggttct ttttgtcaag accacctgt ccggtgc cct gaatgaactg 18O

Caggacgagg Cagcgcggct atcgtggctg gcc acgacgg gcgttcCttg cgcagctgtg 24 O

Ctcgacgttg t cactgaagc gggaagggac tggctgctat tgggcgaagt 3OO gatctoctdt Catctoac Ct tgctic ctdcc gagaaagtat c catcatggc tgatgcaatg 360 cggcggctgc atacgcttga tccggctacc tgcc cattcq accaccaagc gaaac at CC atcgagcgag cacgtact cq gatggaagcc ggt Cttgtcg at Caggatga tctggacgaa gaggat Cagg agc.cgaactg titcgc.caggc t caaggcgc.g catgc.ccgac 54 O ggcgaggat.c c catggcgat gcc tecttgc cgaatat cat ggtggaaaat ggcc.gcttitt ctggatt cat cgactgtggc cggctgggtg tgg.cggaccg citat caggac 660 at agcgttgg ctaccc.gtga tattgctgaa gagcttggcg gcgaatgggc tgaccgcttic 72 O citcgtgctitt acggitat.cgc cgctic ccgat tcgcagcgca tcqccttcta togccttctt gacgagttct tctga 79.
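The listing gives SEQ ID NO:1 as 795 nucleotides and SEQ ID NO:2 as 264 residues, which is consistent with a 264-codon open reading frame plus one stop codon (3 × 264 + 3 = 795). The Python sketch below checks that arithmetic and translates the first 33 legible nucleotides of SEQ ID NO:1 with the standard genetic code. The helper names are assumptions, and the nucleotide string is copied from the listing as printed here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string (whitespace ignored) in frame 1; a trailing
    partial codon is dropped and stop codons are shown as '*'."""
    dna = "".join(dna.lower().split())
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def coding_length(n_residues, stop_codon=True):
    """Nucleotides needed to encode n_residues, optionally with a stop codon."""
    return 3 * n_residues + (3 if stop_codon else 0)


# First 33 nucleotides of SEQ ID NO:1 as printed in the listing.
NEO_START = "atgattgaac aagatggatt gcacgcaggt tct"

if __name__ == "__main__":
    print(translate(NEO_START))
    print(coding_length(264))
```

The translated prefix corresponds to the first eleven residues given for SEQ ID NO:2 (Met Ile Glu Gln Asp Gly Leu His Ala Gly Ser).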

<210> SEQ ID NO 2
<211> LENGTH: 264
<212> TYPE: PRT
<213> ORGANISM: Unknown
<220> FEATURE:
<223> OTHER INFORMATION: Neo from neomycin gene from Promega's pCI-neo.

<400> SEQUENCE: 2

Met Ile Glu Gln Asp Gly Leu His Ala Gly Ser Pro Ala Ala Trp Val 1 5 10 15

Glu Arg Leu Phe Gly Tyr Asp Trp Ala Gln Gln Thr Ile Gly Cys Ser 25

Asp Ala Ala Val Phe Arg Leu Ser Ala Gln Gly Arg Pro Val Leu Phe 35 40 45

Val Lys Thr Asp Leu Ser Gly Ala Leu Asn Glu Leu Gln Asp Glu Ala 50 55 60

Ala Arg Leu Ser Trp Leu Ala Thr Thr Gly Val Pro Ala Ala Val 65 70 75 80

Leu Asp Val Val Thr Glu Ala Gly Arg Asp Trp Leu Leu Leu Gly Glu 85 90 95

Val Pro Gly Gln Asp Leu Leu Ser Ser His Leu Ala Pro Ala Glu Lys 105 110

Val Ser Ile Met Ala Asp Ala Met Arg Arg Leu His Thr Leu Asp Pro 115 120 125

Ala Thr Pro Phe Asp His Gln Ala Lys His Arg Ile Glu Arg Ala 130 135 140


- Continued atcggit caat acactacatg gcgtgatttic atatgcgcga ttgct gatcc ccatgtgitat

Cactggcaaa Ctgtgatgga cgacaccgt.c agtgcgt.ccg tcgc.gcaggc tict catgag 54 O citgatgctitt ggg.ccgagga ctgcc.ccgaa gtc.cggCacc tcgtgcacgc ggattitcggc tccaacaatg t cctacgga Caatggcc.gc atalacagcgg t cattgact g gagcgaggcg 660 atgttcgggg att CCCaata cgaggtogcc aaCat Cttct tctggaggcc gtggttggct 72 O tgtatggagc agcagacgcg c tact tcgag cggaggcatc cggagcttgc aggat.cgc.cg cggctic.cggg cgtatatgct cc.gcattggit cittgaccaac tct at cagag Cttggttgac 84 O ggcaattit.cg atgatgcagc ttggg.cgcag ggit catgcg acgcaatcgt. C catc.cgga 9 OO gcc.gggactg tcgggcgtac acaaatcgcc cgcagaa.gcg cggcc.gtctg gaccgatggc 96.O tgtgtagaag tact.cgc.cga tagtggaaac cgacgc.ccca gCact cqtcc gagggcaaag gaat 1024

<210> SEQ ID NO 7
<211> LENGTH: 341
<212> TYPE: PRT
<213> ORGANISM: Escherichia coli

<400> SEQUENCE: 7

Met Lys Lys Pro Glu Leu Thr Ala Thr Ser Val Glu Phe Leu Ile 1 5 10 15

Glu Lys Phe Asp Ser Val Ser Asp Leu Met Gln Leu Ser Glu Gly Glu 25 30

Glu Ser Arg Ala Phe Ser Phe Asp Val Gly Gly Arg Gly Tyr Val Leu 35 40 45

Arg Val Asn Ser Cys Ala Asp Gly Phe Tyr Lys Asp Arg Tyr Val Tyr 50 55 60

Arg His Phe Ala Ser Ala Ala Leu Pro Ile Pro Glu Val Leu Asp Ile 65 70 75 80

Gly Glu Phe Ser Glu Ser Leu Thr Tyr Cys Ile Ser Arg Arg Ala Gln 85 90 95

Gly Val Thr Leu Gln Asp Leu Pro Glu Thr Glu Leu Pro Ala Val Leu 105 110

Gln Pro Val Ala Glu Ala Met Asp Ala Ile Ala Ala Ala Asp Leu Ser 115 120 125

Gln Thr Ser Gly Phe Gly Pro Phe Gly Pro Gln Gly Ile Gly Gln Tyr 130 135 140

Thr Thr Trp Arg Asp Phe Ile Cys Ala Ile Ala Asp Pro His Val Tyr 145 150 155 160

His Trp Gln Thr Val Met Asp Asp Thr Val Ser Ala Ser Val Ala Gln 165 170 175

Ala Leu Asp Glu Leu Met Leu Trp Ala Glu Asp Pro Glu Val Arg 180 185 190

His Leu Val His Ala Asp Phe Gly Ser Asn Asn Val Leu Thr Asp Asn 195

Gly Arg Ile Thr Ala Val Ile Asp Trp Ser Glu Ala Met Phe Gly Asp 210 215 220

Ser Gln Tyr Glu Val Ala Asn Ile Phe Phe Trp Arg Pro Trp Leu Ala 225 230 235 240

Met Glu Gln Gln Thr Arg Tyr Phe Glu Arg Arg His Pro Glu Leu 245 250 255