(12) United States Patent (10) Patent No.: US 8,326,547 B2 Liu Et Al


(45) Date of Patent: Dec. 4, 2012

(54) METHOD OF SEQUENCE OPTIMIZATION FOR IMPROVED RECOMBINANT PROTEIN EXPRESSION USING A PARTICLE SWARM OPTIMIZATION ALGORITHM

(75) Inventors: Xiaowu Liu, Nanjing (CN); Yun He, Nanjing (CN); Zhuying Wang, Monmouth Junction, NJ (US); Chunjiao Wang, Nanjing (CN); Zhibing Liu, Nanjing (CN); Tianhui Xia, Edison, NJ (US); Luquan Wang, East Brunswick, NJ (US); Fang Liang Zhang, Fanwood, NJ (US)

(73) Assignee: Nanjingjinsirui Science & Technology Biology Corp., Nanjing (CN)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 114 days.

(21) Appl. No.: 12/894,401

(22) Filed: Sep. 30, 2010

(65) Prior Publication Data
US 2011/0081708 A1 Apr. 7, 2011

Related U.S. Application Data

(60) Provisional application No. 61/249,411, filed on Oct. 7, 2009.

(51) Int. Cl.
G06F 9/10 (2011.01)
G06F 9/00 (2011.01)

(52) U.S. Cl.: 702/19; 702/20

(58) Field of Classification Search: None
See application file for complete search history.

(56) References Cited

OTHER PUBLICATIONS

Cai et al. (Journal of Theoretical Biology (2008) vol. 254, pp. 123-127).
Khalid et al. (Prosiding Simposium Kebangsaan Sains Matematik ke-16 (2008) Jun. 205; pp. 1-11).*
Shen et al. (Computational Biology and Chemistry (2008) vol. 32; pp. 53-60).*
Xiao et al. (Concurrency and Computation: Practice and Experience (2004) vol. 16; pp. 895-915).*
Khalid et al. (Second Asia International Conference on Modelling and Simulation; IEEE Computing Society (2008); AICMS, 08: ACM, Kuala Lumpur, Malaysia).
O'Neill et al. (Congress on Evolutionary Computation (2004) CEC, 2004, Jun. 19-23, 2004, vol. 1; pp. 104-110).*
Rouchka et al. (BMC Bioinformatics (2007) vol. 8; pp. 292-299).*
Zhang et al. (The 1st International Congress on Bioinformatics and Biomedical Engineering (2007) ICBBE 2007, Jul. 6-9, 2007; pp. 53-56).*

* cited by examiner

Primary Examiner: Lori A Clow

(74) Attorney, Agent, or Firm: Panitch Schwarze Belisario & Nadel LLP

(57) ABSTRACT

An improved gene sequence optimization method, the systematic optimization method, is described for boosting the recombinant expression of genes in bacteria, yeast, insect and mammalian cells. This general method takes into account multiple, preferably most or all, of the parameters and factors affecting protein expression, including codon usage, tRNA usage, GC-content, ribosome binding sequences, promoter, 5'-UTR, ORF and 3'-UTR sequences of the genes, to improve and optimize the gene sequences and thereby boost the protein expression of the genes in bacteria, yeast, insect and mammalian cells. In particular, the invention relates to a system and a method for sequence optimization for improved recombinant protein expression using a particle swarm optimization algorithm. The improved systematic optimization method can be incorporated into software for more efficient optimization.

20 Claims, 1 Drawing Sheet


METHOD OF SEQUENCE OPTIMIZATION FOR IMPROVED RECOMBINANT PROTEIN EXPRESSION USING A PARTICLE SWARM OPTIMIZATION ALGORITHM

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 61/249,411, filed Oct. 7, 2009, the entire disclosure of which is hereby incorporated herein by reference.

FIELD OF THE INVENTION

The invention relates to recombinant protein expression in bacterial, yeast, insect or mammalian cells. In particular, the invention relates to a system and a method for sequence optimization for improved recombinant protein expression using a particle swarm optimization algorithm.

BACKGROUND OF THE INVENTION

Recombinant protein expression has become a major tool to analyze intracellular processes. The expression of foreign genes in transformed organisms is now an indispensable method for purification of the proteins for subsequent uses, such as protein characterization, protein identification, protein function and structure study, etc. Proteins also need to be expressed at large scale to be used as enzymes, as nutritional proteins and as biopharmaceuticals (drugs).

Escherichia coli (E. coli) is one of the most widely used protein expression host systems because it allows rapid expression and subsequent large-scale, cost-effective manufacturing of the recombinant proteins. While most prokaryotic genes are readily expressed in a prokaryotic expression system, such as E. coli, many eukaryotic genes cannot be expressed efficiently in a prokaryotic system. The completion of the human genome sequencing project has led to a rapid increase in genetic information, with tens of thousands of new proteins waiting to be expressed and explored. Efficiently expressing these proteins in a recombinant system, such as an E. coli cell, for further study and use has become a pressing issue.

Many sequence factors, such as codon usage, mRNA secondary structures, cis-regulatory sequences, GC content and other similar variables affect protein expression (Villalobos et al., 2006, "Gene Designer: a synthetic biology tool for constructing artificial DNA segments," BMC Bioinformatics 7, 285). Methods have been developed to optimize one or more sequence elements to improve protein expression. For example, it has been demonstrated that codon optimization can increase protein expression level (Pikaart et al., 1996, Expression and codon usage optimization of the erythroid-specific transcription factor cGATA-1 in baculoviral and bacterial systems, Protein Expression and Purification, vol. 8, pp. 469-475; and Hale et al., 1998, Codon optimization of the gene encoding a domain from human type 1 neurofibromin protein results in a threefold improvement in expression level in Escherichia coli, Protein Expression and Purification, vol. 12, pp. 185-188). However, the prior art methods are generally limited to the optimization of a particular sequence factor, e.g., codon usage, that improves recombinant expression of a particular protein in a specific host cell. There remains a need for a general method for sequence optimization that takes into account multiple or all sequence factors and is applicable for improved expression of any protein in any host cell.

Particle Swarm Optimization (PSO) is a population-based stochastic optimization technique modeled on swarm intelligence that finds a solution to an optimization problem in a search space or model and predicts social behavior in the presence of objectives. It was first developed by Dr. Eberhart and Dr. Kennedy in 1995, inspired by the social behavior of bird flocking or fish schooling (Proceedings of the IEEE International Conference on Neural Networks, 1942-1948). In PSO, the potential solutions, called particles, fly through a multidimensional problem space by following the current optimum particles. Each particle keeps track of its coordinates (position and velocity) in the problem space which are associated with the best solution (fitness) it has achieved so far, the local best. Each particle also tracks the "best" value obtained so far by any particle among the neighbors of the particle, the neighboring best. When a particle takes the whole population as its topological neighbors, the best value is a global best, which is known to all and immediately updated when a new best position is found by any particle in the problem space.

The particle swarm optimization concept consists of, at each time step, changing the velocity of each particle toward its local best and neighboring best locations. The change in velocity is weighted by a random term, with separate random numbers being generated for the change in velocity toward its local best and neighboring best locations.

It has been demonstrated that PSO gets better results in a faster, cheaper way compared with other methods. In addition, there are few parameters to adjust in the PSO algorithm. PSO can be used across a wide range of applications, as well as for specific applications focused on a specific requirement. In the past several years, PSO has been successfully applied in several research and application areas, such as bellow optimum design (Ying et al., 2007, Application of particle swarm optimization algorithm in bellow optimum design, Journal of Communication and Computer, 32, 50-56). It has also been used for optimization of codon usage (Cai et al., 2008, Optimizing the codon usage of synthetic gene with QPSO algorithm, Journal of Theoretical Biology, 254, 123-127).

Despite the exhaustive effort of protein expression researchers and ever-increasing knowledge of protein expression, significant obstacles remain when one attempts to express a foreign or synthetic gene in a protein expression system such as E. coli. There is a need for a faster and simpler systematic sequence optimization method that coordinates various sequence factors, resulting in improved protein expression in a recombinant system. Such a method is described here.

BRIEF SUMMARY OF THE INVENTION

In one general aspect, embodiments of the present invention relate to a method for optimizing a gene sequence for expression of a protein in a host cell. The method comprises:

(a) identifying a plurality of sequence factors that affect the expression of the protein in the host cell;

(b) defining a particle swarm optimization algorithm comprising a function for each of the plurality of sequence factors; and

(c) applying the particle swarm optimization algorithm to the gene sequence to obtain an optimized gene sequence for expression of the protein in the host cell, wherein the optimized gene sequence takes into account the plurality of sequence factors and achieves the maximum value of the swarm optimization algorithm,

wherein at least the applying step is performed on a computer.

In another general aspect, embodiments of the present invention relate to a system for optimizing a gene sequence for expression of a protein in a host cell. The system comprises a computer system for applying a particle swarm optimization algorithm to the gene sequence to obtain an optimized gene sequence for expression of the protein in the host cell, wherein the particle swarm optimization algorithm comprises a function of each of a plurality of sequence factors that affect the expression of the protein in the host cell, and the optimized gene sequence takes into account the plurality of sequence factors and achieves the maximum value of the swarm optimization algorithm.

In yet another general aspect, the present invention relates to a program product stored on a recordable medium for optimizing a gene sequence for expression of a protein in a host cell. The program product comprises computer software for applying a particle swarm optimization algorithm to the gene sequence to obtain an optimized gene sequence for expression of the protein in the host cell, wherein the particle swarm optimization algorithm comprises a function of each of a plurality of sequence factors that affect the expression of the protein in the host cell, and the optimized gene sequence takes into account the plurality of sequence factors and achieves the maximum value of the swarm optimization algorithm.

In a preferred embodiment of the present invention, the particle swarm optimization algorithm is defined as:

[Equation (1): F(x) expressed in terms of $f_{total}(x)$ and the weighted terms $\omega_i f_i(x)$, $i = 1, \dots, p$]

wherein,

$$f_{total}(x) = \sum_{i=1}^{n} S_{codon,i} \tag{2}$$

$f_{total}(x)$ is an initiation total score of the gene sequence; $n$ is the length of the protein; and $S_{codon}$ represents a function of codons within the protein;

$p$ is the number of the identified plurality of sequence factors and $p > 1$;

$f_i(x)$ denotes a function of the $i$-th sequence factor of the identified $p$ sequence factors; and

$\omega_i$ denotes the relative weight given to $f_i(x)$;

wherein the optimized gene sequence achieves the maximum value of F(x).

In another embodiment, the plurality of sequence factors comprises GC-content, cis elements, repetitive elements, RNA splicing sites, ribosome binding sequences, promoter, 5'-UTR, ORF and 3'-UTR sequences of the genes, etc.

Other aspects of the invention relate to a method of expressing a protein using the optimized gene sequence obtained from a method of the present invention, an isolated nucleic acid molecule comprising the optimized gene sequence, and a vector or a recombinant host cell comprising the isolated nucleic acid.

The details of one or more embodiments of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The foregoing summary, as well as the following detailed description of the invention, will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there are shown in the drawings embodiments which are presently preferred. It should be understood, however, that the invention is not limited by the drawings.

In the drawing:

FIG. 1 is a picture of an SDS-PAGE gel after Coomassie Blue staining, which illustrates recombinant expression in E. coli cells of the human OCT4 gene: Lanes 1-2 contained cell lysates from cells transformed with the human OCT4 gene sequence without any sequence optimization, the cells having been grown under conditions for induced expression of the OCT4 gene; Lane 3 contained cell lysate from cells transformed with the human OCT4 gene sequence optimized by systematic optimization according to an embodiment of the present invention, the cells having been grown under conditions for non-induced expression of the optimized OCT4 gene; Lanes 4-6 contained cell lysates from the same cells as those used in Lane 3, except that the cells were grown under conditions for induced expression of the optimized OCT4 gene; and Lane M contains protein markers with the molecular weight shown on the left side of the picture.

DETAILED DESCRIPTION OF THE INVENTION

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of skill in the art to which this invention belongs. All publications and patents referred to herein are incorporated by reference.

The invention provides a method useful for performing gene sequence optimization to boost the protein expression of genes in expression host cells. In one aspect, the invention provides a significant improvement of the gene sequence optimization method for protein expression. The invention provides a systematic method whereby preferably all or most of the parameters and factors affecting protein expression including, but not limited to, codon usage, tRNA usage, GC-content, ribosome binding sequences, promoter, 5'-UTR, ORF and 3'-UTR sequences of the genes are taken into consideration to improve and optimize the gene sequences to boost the protein expression of genes in expression host cells. Omitting one or more factors or parameters from the consideration may result in low or no expression of the genes of interest in the expression host cells.

According to embodiments of the present invention, an inventive particle swarm optimization algorithm is applied to accomplish the systematic optimization of gene sequences. This systematic approach represents a significant shift from the prior art approaches that focused on individual factors, such as codon optimization, mRNA secondary structures or other factors, and thus results in great improvement in gene expression of recombinant proteins, particularly those that could not be optimally expressed using the conventional methods.

Protein expression is the translation of mRNA. To boost protein expression, the expressed proteins are preferably produced at high level and remain stable with no or very little degradation. To reduce or minimize the proteolytic degradation of the protein, host strains with several deficient protease genes are preferably used for protein expression. To produce a high level of protein, mRNA is preferably produced at high level, not degraded quickly, and translated efficiently.

To reduce or minimize mRNA degradation, or to increase the stability of mRNA and thereby slow mRNA turnover, cis-acting mRNA destabilizing motifs including, but not limited to, AU-rich elements (AREs) and RNase recognition and cleavage sites are preferably mutated or deleted from the gene sequences. AU-rich elements (AREs) with the core motif of AUUUA (SEQ ID NO:3) are usually found in the 3' untranslated regions of mRNA. Another example of an mRNA cis-element consists of the sequence motif TGYYGATGYYYYY (SEQ ID NO:2), where Y stands for either T or C. RNase recognition sequences include, but are not limited to, the RNase E recognition sequence. A host strain with deficient RNases can also be used for protein expression.

RNase splicing sites can cause RNA splicing to produce a different mRNA and therefore reduce the original mRNA level. RNase splicing sites are also preferably mutated to be non-functional to maintain the mRNA level.

To produce a high level of mRNA, the optimal transcription promoter sequence is preferably used in the gene sequences. For a prokaryotic host such as E. coli, one of the strong promoters is the T7 promoter for T7 RNA Polymerase (T7 RNAP). Some bases of long or short tandem simple sequence repeats (SSR) are preferably mutated using codon degeneracy to break the repeats and reduce polymerase slippage, and thus to reduce premature protein or protein mutations.

There are additional factors and parameters that affect mRNA translation and the resulting protein expression level. These factors affect translation from translation initiation through translation termination. Ribosomes bind mRNA at the ribosome binding site (RBS) to initiate translation. Because ribosomes do not bind to double-stranded RNA, the local mRNA structure around this region is preferably single-stranded and does not form any stable secondary structure. The consensus RBS sequence, AGGAGG (SEQ ID NO:1), for prokaryotic cells such as E. coli, also called the Shine-Dalgarno sequence, is preferably placed a few bases just before the translation start site in the genes to be expressed. However, an internal ribosome entry site (IRES) is preferably mutated to prevent ribosome binding and avoid non-specific translation initiation.

After translation initiation, ribosomes read the mRNA and enlist the tRNAs to transfer the correct amino acid building blocks to make proteins. Since there exist 61 codons to encode 20 naturally occurring amino acids and 3 additional codons (amber, ochre, opal) to encode one stop signal of translation, which is called "degeneracy of the genetic code", each amino acid can be coded by several different codons. Accordingly, the same amino acid can be transferred to ribosomes by several different tRNAs. However, the use of synonymous codons is strongly biased in both the prokaryotic and eukaryotic systems, comprising both bias between codons recognized by the same transfer RNA and bias between groups of codons recognized by different synonymous tRNAs (Michael Bulmer, 1987, Coevolution of codon usage and transfer RNA abundance, Nature 325, 728-730).

Several statistical methods have been proposed for quantitatively analyzing codon usage bias. One of the most commonly used methods is the codon adaptation index (CAI). The codon adaptation index is a measurement of the relative adaptiveness of the codon usage, where the relative adaptiveness is calculated as the ratio of the usage of each codon to that of the most abundant synonymous codon for the same amino acid (Sharp P M and Li W H, 1987, The Codon Adaptation Index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 15 (3), 1281-1295).

To boost protein translation efficiency, the gene sequences are preferably optimized so that codon usages are optimized according to the tRNA abundance or the availability of the different tRNAs. Generally the optimal range for CAI is preferably from 0.8 to 1.0. A second method for quantitatively analyzing codon usage bias is the codon context index (CCI) derived from "codon pair theory" (Irwin B, Heck J D and Hatfield G W, 1995, Codon pair utilization biases influence translational elongation step times. Journal of Biological Chemistry, 270 (39), 22801-22806), and the optimal range for CCI is preferably from 0.7 to 1.0.

The maximization of CAI or CCI is not enough to boost protein expression. In the traditional codon optimization methods, the most preferred codons are always selected, which will result in the quick exhaustion of the tRNAs of the most preferred codons and hence a subsequent decrease of the translation efficiency.

According to embodiments of the present invention, the codon diversity is also taken into account. The most preferred codons are used the most for codon optimization; however, less preferred codons are also used, although to a lesser extent, to increase the tRNA usage efficiency and thus the translation efficiency.

Potential strong stem-loop secondary structures of mRNA located downstream of the start codon may hinder the movement of the ribosome complex, and thus slow down the translation and reduce the translation efficiency. Strong secondary structures of mRNA can even cause the ribosome complex to fall off the mRNA and result in the termination of translation. There are several methods for free energy calculation and secondary structure prediction. One of them is the mfold program (Mathews et al., 1999, Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure, J. Mol. Biol. 288, 911-940).

According to embodiments of the present invention, local secondary structures of mRNA with a low free energy (ΔG < -18 kcal/mol) or a long complementary stem (>10 bp) are defined as too stable for efficient translation. The gene sequences are preferably optimized to make the local structure less stable.

Both the 5'-UTR and the 3'-UTR of mRNA are preferably taken into consideration for mRNA structure free energy calculation and secondary structure prediction.

GC-content of mRNA is also preferably monitored. An ideal range for GC% is approximately 30-70%. High GC content will cause mRNAs to form strong stem-loop secondary structures. It will also cause problems for PCR amplification and gene cloning. A high GC-content of the target sequence is preferably mutated using codon degeneracy to be around 50-60%. There are two different measurements for GC%. One is the global GC%, which is averaged along the whole sequence; the other, which is more useful, is the local GC% calculated within a shifted "window" of fixed size (e.g. 60 bp).

According to embodiments of the present invention, the local GC% is optimized to around 50-60%.

Theoretically all the parameters and factors affecting gene expression, including those described above, can be taken into account to optimize the genes for optimal expression of the genes. For a short gene of a few hundred base pairs, it is possible to optimize the sequences of the genes manually by checking and modifying the sequences using those parameters. However, most genes are much longer, even up to tens of thousands of base pairs. It is not possible to manually perform a systematic optimization of such gene sequences. Embodiments of the present invention tackle this problem using an inventive algorithm based on Particle Swarm Optimization (PSO) theory.
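The velocity and position updates that PSO theory relies on can be illustrated with a short Python sketch. This is a generic, continuous-valued global-best PSO under stated assumptions, not the algorithm of the invention: the function names `pso_maximize` and `toy_score`, the inertia and acceleration constants, and the toy score function are all hypothetical, and how a gene sequence would be encoded as a particle is not addressed here.

```python
import random


def pso_maximize(score, dim, bounds, n_particles=20, n_steps=200,
                 inertia=0.729, c_local=1.494, c_global=1.494, seed=0):
    """Maximize `score` over a box using a basic global-best PSO.

    All parameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    low, high = bounds
    # Random initial positions, zero initial velocities.
    pos = [[rng.uniform(low, high) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers its own best position (the "local best").
    best_pos = [p[:] for p in pos]
    best_val = [score(p) for p in pos]
    # The whole swarm is the neighborhood, so the neighboring best is global.
    g = max(range(n_particles), key=lambda i: best_val[i])
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]

    for _ in range(n_steps):
        for i in range(n_particles):
            for d in range(dim):
                # Separate random numbers weight the pull toward each best.
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_local * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (gbest_pos[d] - pos[i][d]))
                # Move, keeping the particle inside the search box.
                pos[i][d] = min(high, max(low, pos[i][d] + vel[i][d]))
            val = score(pos[i])
            if val > best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val > gbest_val:
                    # The global best is updated as soon as it improves.
                    gbest_val, gbest_pos = val, pos[i][:]
    return gbest_pos, gbest_val


def toy_score(x):
    """Hypothetical objective with its maximum (0) at x = (1, 1, ...)."""
    return -sum((xi - 1.0) ** 2 for xi in x)
```

Calling `pso_maximize(toy_score, dim=2, bounds=(-5.0, 5.0))` returns the best position found and its score. In a sequence-optimization setting the score function would be replaced by an objective over candidate sequences, and the position update would need a discrete analogue.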
A novel PSO algorithm was defined to systematically optimize gene sequences by taking into account multiple, preferably most or all, of those parameters and factors that affect gene expression. An objective function, F(x), is defined as:

[Equation (1): F(x) expressed in terms of $f_{total}(x)$ and the weighted terms $\omega_i f_i(x)$, $i = 1, \dots, 7$]

wherein $\omega_1$, $\omega_2$, $\omega_3$, $\omega_4$, $\omega_5$, $\omega_6$ and $\omega_7$ denote the relative weights given to $f_1(x)$, $f_2(x)$, $f_3(x)$, $f_4(x)$, $f_5(x)$, $f_6(x)$ and $f_7(x)$, respectively;

$$f_{total}(x) = \sum_{i=1}^{n} S_{codon,i} \tag{2}$$

wherein $f_{total}(x)$ is an initiation total score of the optimized DNA sequence, $n$ is the total length of the protein sequence decoded by this DNA sequence, and $S_{codon}$ represents a score of codons;

$$f_1(x) = \sum_{i=1}^{c_1} \sum_{j=1}^{l_1} d_{ij} \tag{3}$$

wherein $f_1(x)$ scores the direct repeats in the optimized DNA sequence, $c_1$ is the number of occurrences of repetitive fragments, $l_1$ is the length of the repeats, and $d_{ij}$ represents the score of the $j$-th nucleotide of the $i$-th direct repeat;

$$f_2(x) = \sum_{i=1}^{c_2} \sum_{j=1}^{l_2} r_{ij} \tag{4}$$

wherein $f_2(x)$ scores the reverse repeats in the optimized DNA sequence, $c_2$ is the number of occurrences of reverse repetitive fragments, $l_2$ is the length of the reverse repeats, and $r_{ij}$ represents the score of the $j$-th nucleotide of the $i$-th reverse repeat;

$$f_3(x) = \sum_{i=1}^{c_3} \sum_{j=1}^{l_3} dy_{ij} \tag{5}$$

wherein $f_3(x)$ scores the dyad repeats in the optimized DNA sequence, $c_3$ is the number of occurrences of dyad repetitive fragments, $l_3$ is the length of the dyad repeats, and $dy_{ij}$ represents the score of the $j$-th nucleotide of the $i$-th dyad repeat;

$$f_4(x) = \sum_{i=1}^{c_4} e_i \times s_i \tag{6}$$

wherein $f_4(x)$ scores the negative motifs in the optimized DNA sequence, such as PolyA and restriction sites, $c_4$ is the number of occurrences of negative motifs, $e_i$ is the corresponding weight given to the $i$-th motif, and $s_i$ scores the $i$-th negative motif;

$$f_5(x) = \left( \prod_{i=1}^{n} \frac{f_{ik}}{f_{i\max}} \right)^{1/n} \tag{7}$$

wherein $f_5(x)$ measures the codon usage bias of the target gene sequence, $f_{ik}$ represents the frequency of the $k$-th synonymous codon of the $i$-th amino acid, $f_{i\max}$ represents the frequency of the most frequent codon among all synonymous codons of the $i$-th amino acid, and $n$ is the length of the protein sequence decoded by the DNA sequence;

[Equation (8): $f_6(x)$, a sum over the $c_6$ candidate splicing sites]

wherein $f_6(x)$ scores the undesirable splicing sites in the optimized gene sequence, $c_6$ is the number of occurrences of the candidate splicing sites, $x$ is the base of the score function, $\alpha$ is a threshold for scoring the splicing site, and $s_6$ represents a score of the splice site evaluated by a splicing site prediction system;

[Equation (9): $f_7(x)$]

wherein $f_7(x)$ scores GC content with a fixed window in the optimized gene sequence, $l$ is the length of the target DNA sequence, $v$ is the cutoff value of ideal GC content, and $c_i$ is the occurrence of base G or C: $c_i$ is 1 if the $i$-th nucleotide is G or C, and 0 otherwise.

F(x) is an objective function that can be expanded to include multiple or all parameters or factors that affect gene expression. As the optimization proceeds, the value of F(x) goes up until it reaches its maximum, i.e., the global best, at which point the optimized sequence is obtained.

The invention therefore relates to a process for optimizing the gene sequences using the systematic method. The above objective functions, F(x) and $f_1(x)$ through $f_7(x)$, can be programmed into software for easy operation. Using a computer loaded with the software, one can optimize a gene sequence for improved expression of the gene in a host cell, for example, by removing mRNA destabilizing motifs via mutating or deleting the motifs from the gene sequence to be expressed, adding DNA sequences or motifs that enhance transcription or mRNA production to the gene sequence to be expressed, adding DNA sequences or motifs that stabilize mRNA to the gene sequence to be expressed, placing the most favorable RBS sequences just before or a few bases before the translation start site in the gene to be expressed, optimizing the ORF sequences to maximize the codon usage efficiency, optimizing the gene sequences by using alternative codons until the local mRNA structure around the RBS region is single-stranded and does not form any stable secondary structure, so as to increase the translation efficiency, etc. Within minutes, one can optimize a gene sequence with all the parameters and factors considered for optimal expression of the gene, assisted by a computer loaded with software executing a PSO algorithm according to an embodiment of the present invention.

In the above-mentioned embodiments, in view of the present disclosure, those skilled in the art will know how to screen for cis-acting mRNA destabilizing motifs such as AU-rich elements (AREs) and RNase recognition and cleavage sites. Those skilled in the art will also know how to calculate CAI and the free energy of mRNA and how to mutate the gene sequences.

According to embodiments of the present invention, the systematic method to optimize gene sequences can be used for any protein expression system, such as one using bacteria, yeast, insect or mammalian cells as the host cells.

In one embodiment, the optimized gene sequence obtained by a method of the present invention can be synthesized, cloned into the host cell and expressed in the host cell for the production of the encoded protein.

Thus, another embodiment of the present invention relates to a method for expressing a protein in a host cell. The method comprises:

(a) obtaining an optimized gene sequence for expression of the protein in the host cell using a method according to an embodiment of the present invention;

(b) synthesizing a nucleic acid molecule comprising the optimized gene sequence;

(c) introducing the nucleic acid molecule into the host cell to obtain a recombinant host cell; and

(d) cultivating the recombinant host cell under conditions to allow expression of the protein from the optimized gene sequence.

In view of the present invention, any method can be used to synthesize the nucleic acid molecule comprising the optimized gene sequence, e.g., by using a DNA synthesizer, by introducing mutations into an existing nucleic acid molecule, etc. In view of the present disclosure, those skilled in the art can readily clone the nucleic acid molecule and express the protein from the optimized gene sequence in the host cell using known molecular biology techniques, all without undue experimentation.

Embodiments of the present invention also relate to a nucleic acid molecule comprising the optimized gene sequence obtained from a method of the present invention, as well as vectors and host cells comprising the nucleic acid molecule of the present invention.

Various embodiments of the invention have now been described. It is to be noted, however, that this description of these specific embodiments is merely illustrative of the principles underlying the inventive concept. It is therefore contemplated that various modifications of the disclosed embodiments will, without departing from the spirit and scope of the invention, be apparent to persons skilled in the art.

The following specific examples of the methods of the invention are further illustrative of the nature of the invention; it needs to be understood that the invention is not limited thereto.

EXAMPLE

This example illustrates the optimization and expression of a gene sequence, e.g., the human OCT4 gene encoding POU class 5 homeobox 1, for recombinant expression in E. coli. A similar method can be used for optimization and expression of other genes in E. coli or other host cells.

The DNA sequence of the wild-type human OCT4 gene (gi261859841) (SEQ ID NO: 1) was subjected to Particle Swarm Optimization (PSO) analysis using a PSO algorithm having an objective function F(x) as described above. During the sequence optimization, the value of F(x) went up until it reached its maximum, i.e., the global best, at which point the optimized OCT4 gene sequence (SEQ ID NO:2) was obtained. A DNA molecule having the optimized OCT4 gene sequence was synthesized using a known method.

Each of the wild-type human OCT4 gene and the optimized OCT4 gene was cloned into an inducible expression vector pET43a(+) (Invitrogen), using standard molecular biology techniques. Each of the expression vectors for the wild-type OCT4 gene and the optimized OCT4 gene was transformed into an E. coli host cell BL21 (DE3), using standard molecular biology techniques. The resulting recombinant E. coli cells containing the expression vector were cultured under conditions inducible or non-inducible for the expression of the cloned OCT4 gene. The total proteins in the cells were analyzed by SDS-PAGE followed by Coomassie Blue staining.

As shown in FIG. 1, when grown under conditions for induced expression of the cloned OCT4 gene, the optimized OCT4 gene resulted in significantly increased protein expression in the E. coli host cells.

It will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the broad inventive concept thereof. It is understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover modifications within the spirit and scope of the present invention as defined by the appended claims.
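Two of the sequence metrics referred to in this description, the local GC% within a shifted window and the codon adaptation index (CAI), can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions rather than the software of the invention: the names `local_gc`, `cai` and `TOY_USAGE` are hypothetical, and the codon frequencies in `TOY_USAGE` are illustrative values for two amino acids only, not a measured host usage table.

```python
import math

# Hypothetical, illustrative codon frequencies for two amino acids only;
# a real analysis would use a complete usage table for the chosen host.
TOY_USAGE = {
    "K": {"AAA": 0.74, "AAG": 0.26},
    "E": {"GAA": 0.68, "GAG": 0.32},
}


def local_gc(seq, window=60):
    """Return the GC fraction of every full window, shifted one base at a time."""
    seq = seq.upper()
    if window <= 0 or len(seq) < window:
        return []
    flags = [1 if base in "GC" else 0 for base in seq]
    count = sum(flags[:window])
    fractions = [count / window]
    for end in range(window, len(seq)):
        # Slide the window: add the entering base, drop the leaving base.
        count += flags[end] - flags[end - window]
        fractions.append(count / window)
    return fractions


def cai(seq, usage_by_amino_acid):
    """Geometric mean of relative adaptiveness over the codons of `seq`.

    Relative adaptiveness of a codon is its frequency divided by the
    frequency of the most used synonymous codon. Codons missing from the
    table are skipped.
    """
    weights = {}
    for codon_freqs in usage_by_amino_acid.values():
        top = max(codon_freqs.values())
        for codon, freq in codon_freqs.items():
            weights[codon] = freq / top
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        return 0.0
    return math.exp(sum(logs) / len(logs))
```

With the toy table, a sequence built only from the most frequent codons (for example `"AAAGAA"`) has a CAI of 1, while `"AAGGAG"` scores lower. A window size of 60 matches the example window given in the description.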

SEQUENCE LISTING

<160> NUMBER OF SEQ ID NOS: 2

<210> SEQ ID NO 1
<211> LENGTH: 1080
<212> TYPE: DNA
<213> ORGANISM: Homo sapiens

<400> SEQUENCE: 1

atggcgggac acctggcttic ggattitcgcc ttct cqc ccc ct coaggtgg tdgaggtgat 60
gggcc agggggg.ccggagcc gggctgggitt gatcCtcgga cctggctaag ct tcca aggc 120
CctCctggag gccaggaat C9ggc.cgggg gttgggc.cag gCtctgaggt gtgggggatt 180
cc.cccatgcc ccc.cgc.cgta tagttctgt ggggggatgg cqtact.gtgg gCCC caggtt 240
ggagtggggc tagtgc.ccca aggcggcttg gagacct ct c agcctgaggg caa.gcagga 300
gtcggggtgg agagcaactic catggggcc tocc cqgagc cctgcaccgt caccCCtggit 360
gcc.gtgaagc tiggagaagga gaagctggag caaaacccgg aggagt ccca ggacatcaaa 420
gct ctgcaga aagaactica gcaatttgcc aagctcctga agcagaagag gat caccctg 480
ggatatacac aggc.cgatgt ggggotcacc Ctgggggttc tatttgggaa ggt attcagc 540
caaacgacca totgcc.gctt tdaggctctg cagcttagct tcaagaacat gtgtaagctg 600
cggc ccttgc tigcagaagtg ggtggaggaa gotgacaa.ca atgaaaatct t caggagata 660
tgcaaagcag aaa.ccct cit gcaggc.ccga aagagaaagc galaccagtat cgaga accga 720
gtgagaggca acctggagaa tttgttcCtg cagtgc.ccga aacccacact gcagoagat C 780
agccacat cq CCC agcagct tdgct cag aaggatgtgg to cagtgtg gttctgtaac 840
cggcgc.ca.ga agggcaa.gcg at Caagcagc gactatgcac aacgagagga ttittgaggct 900
gctgggtctic Ctttct Cagg gggaccagtg tcc titt cct c tdgcc cc agg gcc cc attitt 960
ggtaccc.cag gctatgggag ccct cactitc actgcactgt act cotcggit ccott tocct 1020
gagggggaag cctitt.ccc cc tdt citcc.gtc accactctgg gct ct cocat gcattcaaac 1080

<210> SEQ ID NO 2
<211> LENGTH: 1080
<212> TYPE: DNA
<213> ORGANISM: Artificial Sequence
<220> FEATURE:
<223> OTHER INFORMATION: Synthetic gene for human OCT4 gene

<400> SEQUENCE: 2

atggc.cggtc atctggctag tattittgca ttittct cc.gc cqc.cgggtgg tggtggcgat 60
ggccCaggtg gtc.ca.galacc aggttgggta gatccacgca catggctgtc. Ctt coagggit 120
cc.gc.caggtg gtcCaggitat cqgtcCaggit gtagg to cqg gtag taagt atggggt at C 180
cc.gc.catgtc. caccgc.cgta caattctgt ggtgg catgg Ctt actgtgg tcc.gcaagta 240
ggtgtaggcc tigtaccaca gggtggtctg. gaaacaagtic agc.ca.galagg cgaggctggg 300
gtaggggit cq aatcgaattic agatggcgct agc.ccggagc catgcactgt aactcCaggc 360
gcc.gtaaaac taaaaaga aaaactggag Cagaatc.cag aagagtc.gca agatat caaa 420
gCactgcaaa aagagctgga acaatttgct aaactgctga aacaaaaacg cattacgctg 480
ggittatacac aag.ccgacgt aggtotgaca Ctggggg.tcc titt.cggtaa agtatt citcg 540
cagacaacaa tittgcc.gctt tdaag.ccctg cagctgtcat ttaaaaatat gtgtaaactg 600
cgcc cactgc tigcagaaatg ggtagaggaa gocgacaa.ca acgagaatct gcaa.gagatt 660
tgtaaagctgaaacgctggit acaggc.ccgt aaacgtaaac gcacaagtat cgaaaatcgt. 720
gtc.cgtggta acctggagaa totgttcc tig caatgtc.caa alaccaacgct gcaacaaatc 780
tcticacat cq Cacaacaact gggtctggag aaagacgtag tacgc.gt atg gttctgtaac 840
cgcc.gc.ca.ga aaggtaaacg tag tagtagc gattacgctic agcgc galaga ctittgaagcc 900
gCagg tagtc. c9ttct Cogg gggtc.cagta agttt CCCaC tdgcaccggg to cacatt to 960
ggtacaccag gctacggttc. tcc.gc actitt acagc cctgt at agttcggit tcc attcc.cg 1020
gaaggtgaag Ctttitccacc agitat cogta acaacgctgg ggit coccaat gcatagtaat 1080
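Sequence-listing rows carry blocks of bases followed by a running position count, and the GC-content factor in the claims is described through an indicator that is 1 for G or C and 0 for A or T. The sketch below shows one way to join such rows and compute the GC fraction. The helper names `parse_listing_rows` and `gc_fraction` and the toy rows are hypothetical; the sketch assumes clean rows and does not attempt to repair extraction errors.

```python
def parse_listing_rows(rows):
    """Join sequence-listing rows, dropping spaces and the trailing position count."""
    bases = []
    for row in rows:
        tokens = row.split()
        # The last token of each row is the running position count (60, 120, ...).
        if tokens and tokens[-1].isdigit():
            tokens = tokens[:-1]
        bases.append("".join(tokens).lower())
    return "".join(bases)

def gc_fraction(seq):
    """Fraction of positions that are G or C (indicator 1 for G/C, 0 for A/T)."""
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    return sum(1 for base in seq if base in "gc") / length

# Hypothetical toy rows for illustration; not taken from SEQ ID NO: 1 or 2.
rows = ["atgc atgc 8", "ggcc aatt 16"]
seq = parse_listing_rows(rows)
gc = gc_fraction(seq)
```

How the resulting fraction is compared against the ideal GC cutoff for a host cell is left out, since that comparison is not assumed here.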

We claim:

1. A method for optimizing a gene sequence for expression of a protein in a host cell, the method comprising:
(a) identifying a plurality of sequence factors that affect the expression of the protein in the host cell;
(b) defining a computer-assisted particle swarm optimization algorithm comprising a function for each of the plurality of sequence factors; and
(c) analyzing the gene sequence to obtain an optimized gene sequence for expression of the protein in the host cell via the computer-assisted particle swarm optimization algorithm that takes into account the plurality of sequence factors and achieves the maximum value of the swarm optimization algorithm to thereby obtain the optimized gene sequence,
wherein the plurality of sequence factors comprises at least one factor that is not codon usage of the host cell.

2. The method of claim 1, wherein the plurality of sequence factors comprises at least two sequence factors selected from the group consisting of codon usage of the host cell; tRNA usage of the host cell; GC-content of the gene sequence; a DNA cis-acting element of the gene sequence; a repetitive element of the gene sequence; a promoter of the gene sequence; 5'-UTR sequence; ribosome binding site (RBS) sequence; RNA splicing site sequence; 3'-UTR sequence; and an mRNA cis-element sequence.

3. The method of claim 2, wherein the DNA cis-element is selected from the group consisting of a TATA box, Pribnow box, SOS box, CAAT box, CCAAT box and an operator; the mRNA cis-element sequence is selected from the group consisting of a sequence of a ribosomal protein leader, a zip code motif, an mRNA stability element, an mRNA destability element, a translational repressor, a translational enhancer, a polyadenylation element that affects 3' UTR maturation, a splicing enhancer or silencer, and an internal ribosome entry site (IRES); and the ribosome binding site (RBS) is selected from the group consisting of Shine-Dalgarno sequence (SEQ ID NO:1-AGGAGG), Kozak sequence, and a derivative thereof.

4. The method of claim 1, wherein the host cell is selected from the group consisting of a bacterial cell, a yeast cell, an insect cell and a mammalian cell.

5. The method of claim 1, wherein the particle swarm optimization algorithm is defined as:

[equation (1) not recoverable from the source text]

wherein,

[equation (2) not recoverable from the source text]

$f_0(x)$ is an initiation total score of the gene sequence;
n is the length of the protein; and
$S_{codon}$ represents a function of codons within the protein;
p is the number of the identified plurality of sequence factors and p ≥ 1;
$f_i(x)$ denotes a function of the ith sequence factor of the identified p sequence factors;
$\omega_i$ denotes the relative weight given to $f_i(x)$; and
wherein the optimized gene sequence achieves the maximum value of F(x), and the method further comprising synthesizing a nucleic acid molecule comprising the optimized gene sequence.

6. The method of claim 5, wherein $f_i(x)$ comprises two or more selected from the group consisting of

a function of direct repeats defined as

$$f_1(x) = \sum_{i=1}^{c_1} \sum_{j=1}^{l_1} d_{ij} \tag{3}$$

wherein $c_1$ is the number of occurrences of a direct repeat within the gene sequence, $l_1$ is the length of the direct repeat, and $d_{ij}$ represents a score of the jth nucleotide in the ith direct repeat in the gene sequence;

a function of reverse repeats defined as

$$f_2(x) = \sum_{i=1}^{c_2} \sum_{j=1}^{l_2} r_{ij} \tag{4}$$

wherein $c_2$ is the number of occurrences of a reverse repeat within the gene sequence, $l_2$ is the length of the reverse repeat, and $r_{ij}$ represents a score of the jth nucleotide in the ith reverse repeat in the gene sequence;

a function of dyad repeats defined as

$$f_3(x) = \sum_{i=1}^{c_3} \sum_{j=1}^{l_3} dy_{ij} \tag{5}$$

wherein $c_3$ is the number of occurrences of a dyad repeat within the gene sequence, $l_3$ is the length of the dyad repeat, and $dy_{ij}$ represents a score of the jth nucleotide in the ith dyad repeat in the gene sequence;

a function of negative motifs defined as

$$f_4(x) = \sum_{i=1}^{c_4} \varepsilon_i \times S_{\mathrm{motif},i} \tag{6}$$

wherein $c_4$ is the number of occurrences of a negative motif within the gene sequence, $\varepsilon_i$ is the corresponding weight given to the ith negative motif in the gene sequence, and $S_{\mathrm{motif},i}$ represents a score of the ith negative motif;

a function of used codon bias defined as

[equation (7) not recoverable from the source text]

wherein $f_{ik}$ represents the frequency of the kth synonymous codon of the ith amino acid of the protein, $f_{i\max}$ represents the frequency of the most frequent synonymous codon of the ith amino acid of the protein, and n is the length of the protein sequence;

a function of undesirable splicing sites defined as

[equation (8) not recoverable from the source text]

wherein $c_6$ is the number of occurrences of an undesirable splicing site within the gene sequence, X is the base of the score function; $\alpha$ is a threshold of scoring the undesirable splicing site; $S_6$ represents a score of the undesirable splicing site evaluated by a splicing site prediction system; and

a function of GC content defined as

[equation (9) not recoverable from the source text]

wherein l is the length of the gene sequence, $v_{gc}$ is the cutoff value of ideal GC content for the host cell, $c_i$ is the number of occurrences of base G and C within the gene sequence, $c_j$ is 1 if the jth nucleotide is G or C, and $c_j$ is 0 if the jth nucleotide is A or T.

7. The method of claim 5, wherein p is selected from the group consisting of 2, 3, 4, 5, 6, and 7.

8. A system for optimizing a gene sequence for expression of a protein in a host cell, the system comprising a computer system for applying a computer-assisted particle swarm optimization algorithm to the gene sequence to obtain an optimized gene sequence for expression of the protein in the host cell, wherein the computer-assisted particle swarm optimization algorithm comprises a function of each of a plurality of sequence factors that affect the expression of the protein in the host cell, and takes into account the plurality of sequence factors and achieves the maximum value of the swarm optimization algorithm to thereby obtain the optimized gene sequence, wherein the plurality of sequence factors comprises at least one factor that is not codon usage of the host cell.

9. The system of claim 8, wherein the plurality of sequence factors comprises at least two sequence factors selected from the group consisting of codon usage of the host cell; tRNA usage of the host cell; GC-content of the gene sequence; a DNA cis-acting element of the gene sequence; a repetitive element of the gene sequence; a promoter of the gene sequence; 5'-UTR sequence; ribosome binding site (RBS) sequence; RNA splicing site sequence; 3'-UTR sequence; and an mRNA cis-element sequence.

10. The system of claim 8, wherein the particle swarm optimization algorithm is defined as:

[equation (1) not recoverable from the source text]

wherein,

[equation (2) not recoverable from the source text]

$f_0(x)$ is an initiation total score of the gene sequence;
n is the length of the protein; and
$S_{codon}$ represents a function of codons within the protein;
p is the number of the identified plurality of sequence factors and p ≥ 1;
$f_i(x)$ denotes a function of the ith sequence factor of the identified p sequence factors; and
$\omega_i$ denotes the relative weight given to $f_i(x)$;
wherein the optimized gene sequence is obtained when F(x) reaches the maximum.

11. The system of claim 10, wherein $f_i(x)$ comprises two or more selected from the group consisting of

a function of direct repeats defined as

$$f_1(x) = \sum_{i=1}^{c_1} \sum_{j=1}^{l_1} d_{ij} \tag{3}$$

wherein $c_1$ is the number of occurrences of a direct repeat within the gene sequence, $l_1$ is the length of the direct repeat, and $d_{ij}$ represents a score of the jth nucleotide in the ith direct repeat in the gene sequence;

a function of reverse repeats defined as

$$f_2(x) = \sum_{i=1}^{c_2} \sum_{j=1}^{l_2} r_{ij} \tag{4}$$

wherein $c_2$ is the number of occurrences of a reverse repeat within the gene sequence, $l_2$ is the length of the reverse repeat, and $r_{ij}$ represents a score of the jth nucleotide in the ith reverse repeat in the gene sequence;

a function of dyad repeats defined as

$$f_3(x) = \sum_{i=1}^{c_3} \sum_{j=1}^{l_3} dy_{ij} \tag{5}$$

wherein $c_3$ is the number of occurrences of a dyad repeat within the gene sequence, $l_3$ is the length of the dyad repeat, and $dy_{ij}$ represents a score of the jth nucleotide in the ith dyad repeat in the gene sequence;

a function of negative motifs defined as

$$f_4(x) = \sum_{i=1}^{c_4} \varepsilon_i \times S_{\mathrm{motif},i} \tag{6}$$

wherein $c_4$ is the number of occurrences of a negative motif within the gene sequence, $\varepsilon_i$ is the corresponding weight given to the ith negative motif in the gene sequence, and $S_{\mathrm{motif},i}$ represents a score of the ith negative motif;

a function of used codon bias defined as

[equation (7) not recoverable from the source text]

wherein $f_{ik}$ represents the frequency of the kth synonymous codon of the ith amino acid of the protein, $f_{i\max}$ represents the frequency of the most frequent synonymous codon of the ith amino acid of the protein, and n is the length of the protein sequence;

a function of undesirable splicing sites defined as

[equation (8) not recoverable from the source text]

wherein $c_6$ is the number of occurrences of an undesirable splicing site within the gene sequence, X is the base of the score function; $\alpha$ is a threshold of scoring the undesirable splicing site; $S_6$ represents a score of the undesirable splicing site evaluated by a splicing site prediction system; and

a function of GC content defined as

[equation (9) not recoverable from the source text]

wherein l is the length of the gene sequence, $v_{gc}$ is the cutoff value of ideal GC content for the host cell, $c_i$ is the number of occurrences of base G and C within the gene sequence, $c_j$ is 1 if the jth nucleotide is G or C, and $c_j$ is 0 if the jth nucleotide is A or T.

12. The system of claim 10, wherein p is selected from the group consisting of 2, 3, 4, 5, 6, and 7.

13. A program product stored on a non-transitory computer readable medium for optimizing a gene sequence for expression of a protein in a host cell, the program product comprising: a computer software for applying a computer-assisted particle swarm optimization algorithm to the gene sequence to obtain an optimized gene sequence for expression of the protein in the host cell, wherein the particle swarm optimization algorithm comprises a function of each of a plurality of sequence factors that affect the expression of the protein in the host cell, the computer-assisted swarm optimization algorithm takes into account the plurality of sequence factors and achieves the maximum value of the swarm optimization algorithm to thereby obtain the optimized gene sequence, wherein the plurality of sequence factors comprises at least one factor that is not codon usage of the host cell.

14. The program product of claim 13, wherein the particle swarm optimization algorithm is defined as:

[equation (1) not recoverable from the source text]

wherein,

[equation (2) not recoverable from the source text]

$f_0(x)$ is an initiation total score of the gene sequence;
n is the length of the protein; and
$S_{codon}$ represents a function of codons within the protein;
p is the number of the identified plurality of sequence factors and p ≥ 1;
$f_i(x)$ denotes a function of the ith sequence factor of the identified p sequence factors; and
$\omega_i$ denotes the relative weight given to $f_i(x)$;
wherein the optimized gene sequence is obtained when F(x) reaches the maximum.

15. The program product of claim 14, wherein $f_i(x)$ comprises two or more selected from the group consisting of

a function of direct repeats defined as

$$f_1(x) = \sum_{i=1}^{c_1} \sum_{j=1}^{l_1} d_{ij} \tag{3}$$

wherein $c_1$ is the number of occurrences of a direct repeat within the gene sequence, $l_1$ is the length of the direct repeat, and $d_{ij}$ represents a score of the jth nucleotide in the ith direct repeat in the gene sequence;

a function of reverse repeats defined as

$$f_2(x) = \sum_{i=1}^{c_2} \sum_{j=1}^{l_2} r_{ij} \tag{4}$$

wherein $c_2$ is the number of occurrences of a reverse repeat within the gene sequence, $l_2$ is the length of the reverse repeat, and $r_{ij}$ represents a score of the jth nucleotide in the ith reverse repeat in the gene sequence;

a function of dyad repeats defined as

$$f_3(x) = \sum_{i=1}^{c_3} \sum_{j=1}^{l_3} dy_{ij} \tag{5}$$

wherein $c_3$ is the number of occurrences of a dyad repeat within the gene sequence, $l_3$ is the length of the dyad repeat, and $dy_{ij}$ represents a score of the jth nucleotide in the ith dyad repeat in the gene sequence;

a function of negative motifs defined as

$$f_4(x) = \sum_{i=1}^{c_4} \varepsilon_i \times S_{\mathrm{motif},i} \tag{6}$$

wherein $c_4$ is the number of occurrences of a negative motif within the gene sequence, $\varepsilon_i$ is the corresponding weight given to the ith negative motif in the gene sequence, and $S_{\mathrm{motif},i}$ represents a score of the ith negative motif;

a function of used codon bias defined as

[equation (7) not recoverable from the source text]

wherein $f_{ik}$ represents the frequency of the kth synonymous codon of the ith amino acid of the protein, $f_{i\max}$ represents the frequency of the most frequent synonymous codon of the ith amino acid of the protein, and n is the length of the protein sequence;

a function of undesirable splicing sites defined as

[equation (8) not recoverable from the source text]

wherein $c_6$ is the number of occurrences of an undesirable splicing site within the gene sequence, X is the base of the score function; $\alpha$ is a threshold of scoring the undesirable splicing site; $S_6$ represents a score of the undesirable splicing site evaluated by a splicing site prediction system; and

a function of GC content defined as

[equation (9) not recoverable from the source text]

wherein l is the length of the gene sequence, $v_{gc}$ is the cutoff value of ideal GC content for the host cell, $c_i$ is the number of occurrences of base G and C within the gene sequence, $c_j$ is 1 if the jth nucleotide is G or C, and $c_j$ is 0 if the jth nucleotide is A or T.

16. The program product of claim 14, wherein p is selected from the group consisting of 2, 3, 4, 5, 6, and 7.

17. A method for expressing a protein in a host cell, the method comprising:
(a) obtaining an optimized gene sequence for expression of the protein in the host cell using a method of claim 1;
(b) synthesizing a nucleic acid molecule comprising the optimized gene sequence;
(c) introducing the nucleic acid molecule into the host cell to obtain a recombinant host cell; and
(d) cultivating the recombinant host cell under conditions to allow expression of the protein from the optimized gene sequence.

18. An isolated nucleic acid molecule comprising the optimized gene sequence obtained from the method of claim 1.

19. A vector comprising the isolated nucleic acid molecule of claim 18.

20. A recombinant host cell comprising the isolated nucleic acid molecule of claim 18.
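The codon bias factor in the claims is built from the ratio of the frequency of the codon actually used to the frequency of the most frequent synonymous codon for the same amino acid. The following sketch computes that per-codon ratio only. The function names, the `usage` and `synonyms` tables, and all numeric values are hypothetical and chosen for illustration; how the ratios are aggregated into the claimed function (7) is not assumed here.

```python
def split_codons(seq):
    """Split a coding sequence into consecutive three-base codons."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def relative_codon_frequencies(seq, usage, synonyms):
    """For each codon, return (frequency of used codon) / (frequency of the
    most frequent synonymous codon for the same amino acid)."""
    ratios = []
    for codon in split_codons(seq):
        f_ik = usage[codon]
        f_imax = max(usage[s] for s in synonyms[codon])
        ratios.append(f_ik / f_imax)
    return ratios

# Hypothetical usage values for illustration only; not measured for any host.
usage = {"aaa": 0.75, "aag": 0.25, "gaa": 0.6, "gag": 0.4}
synonyms = {
    "aaa": ["aaa", "aag"], "aag": ["aaa", "aag"],   # lysine codons
    "gaa": ["gaa", "gag"], "gag": ["gaa", "gag"],   # glutamate codons
}
ratios = relative_codon_frequencies("aaaaaggag", usage, synonyms)
```

A ratio of 1.0 marks a position that already uses the host's most frequent synonymous codon; lower values mark candidates for substitution.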