Computational Codon Optimization of Synthetic Gene for Protein Expression Bevan Kai-Sheng Chung1,2,3 and Dong-Yup Lee1,2,3*

Chung and Lee BMC Systems Biology 2012, 6:134 http://www.biomedcentral.com/1752-0509/6/134 METHODOLOGY ARTICLE Open Access Computational codon optimization of synthetic gene for protein expression Bevan Kai-Sheng Chung1,2,3 and Dong-Yup Lee1,2,3* Abstract Background: The construction of customized nucleic acid sequences allows us to have greater flexibility in gene design for recombinant protein expression. Among the various parameters considered for such DNA sequence design, individual codon usage (ICU) has been implicated as one of the most crucial factors affecting mRNA translational efficiency. However, previous works have also reported the significant influence of codon pair usage, also known as codon context (CC), on the level of protein expression. Results: In this study, we have developed novel computational procedures for evaluating the relative importance of optimizing ICU and CC for enhancing protein expression. By formulating appropriate mathematical expressions to quantify the ICU and CC fitness of a coding sequence, optimization procedures based on genetic algorithm were employed to maximize its ICU and/or CC fitness. Surprisingly, the in silico validation of the resultant optimized DNA sequences for Escherichia coli, Lactococcus lactis, Pichia pastoris and Saccharomyces cerevisiae suggests that CC is a more relevant design criterion than the commonly considered ICU. Conclusions: The proposed CC optimization framework can complement and enhance the capabilities of current gene design tools, with potential applications to heterologous protein production and even vaccine development in synthetic biotechnology. Background expressed recombinant protein since the wild-type for- Recent developments in artificial gene synthesis have eign genes have not been evolved for optimum expres- enabled the construction of synthetic gene circuits [1] sion in the host. Thus, it is highly desirable to harness and even the synthesis of whole bacterial genome [2]. the flexibility in synthetic biology to create customized The introduction of synthetic genes into a living system artificial gene designs that are optimal for heterologous can either modulate existing biological functions or give protein expression. To aid the gene design process, com- rise to novel cellular behavior. In this sense, de novo putational tools have been developed for designing cod- gene synthesis is a valuable synthetic biological tool for ing sequences based on some performance criteria. biotechnological studies, which typically aims to improve Specifically, the degeneracy of the genetic code, tolerance to toxic molecules, retrofit existing biosyn- reflected by the use of sixty-four codons to encode thetic pathways, design novel biosynthetic pathways twenty amino acids and translation termination signal, and/or enhance heterologous protein production [3,4]. leads to the situation whereby all amino acids, except In the aspect of recombinant protein production, natural methionine and tryptophan, can be encoded by two to genes found in wild-type organisms are usually trans- six synonymous codons. Notably, the synonymous formed into the heterologous hosts for recombinant ex- codons are not equally utilized to encode the amino pression. This approach typically results in poorly acids, thus resulting in phenomenon of codon usage bias which was first reported in a study that examines the fre- quencies of 61 amino acid codons (i.e. termination * Correspondence: [email protected] 1Department of Chemical and Biomolecular Engineering, National University codons are excluded) in 90 genes [5]. The emergence of of Singapore, 4 Engineering Drive 4, Singapore 117576, Singapore codon usage bias in organisms has been largely attributed 2NUS Graduate School for Integrative Sciences and Engineering, National to natural selection, mutation, and genetic drift [6]. More University of Singapore, 28 Medical Drive, #05-01, Singapore 117456, Singapore importantly, codon usage bias has been shown to be cor- Full list of author information is available at the end of the article related to gene expression level [7,8]. As a result, this bias © 2012 Chung and Lee; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Chung and Lee BMC Systems Biology 2012, 6:134 Page 2 of 14 http://www.biomedcentral.com/1752-0509/6/134 has been proposed as an important design parameter for individual codon usage optimization (ICO) method gen- enhancing recombinant protein production in heterol- erates a sequence with optimal ICU only; the codon con- ogous expression hosts [9]. Consequently, the algorithms text optimization (CCO) method optimizes sequences implemented in many of the sequence design software with regard to codon context only; and the multi- tools, such as Codon optimizer [10], Gene Designer [11], objective codon optimization (MOCO) method simul- and OPTIMIZER [12], are mainly focused on the fre- taneously considers both ICU and CC. Thus, the quency of individual codon occurrences. Notably, the resultant sequence is ICU-/CC-optimal when its ICU/CC popular web-based software, known as the Java Codon distribution is closest to the organism’s reference ICU/ Adaptation Tool (JCat), is integrated with the PRODORIC CC distribution calculated based on the sequences of na- database to allow convenient retrieval of prokaryotic gen- tive high-expression genes. Based on the mathematical etic information [13,14]. However, apart from individual formulation presented in Methods, the ICO problem can codon usage (ICU) bias, nonrandom utilization of adjacent be described as the maximization of ICU fitness, ΨICU codon pairs in organisms has also been reported in several (see Eqn. 23), subject to the constraint that the codon se- studies [15,16]. This phenomenon is termed “codon con- quence can be translated into the target protein (see text” as it implicates some “rule” for organizing neighbor- Eqns. 3, 4 and 11). Due to the discrete codon variables ing codons as a result of potential tRNA-tRNA steric and nonlinear fitness expression of ΨICU, ICO is classi- interaction within the ribosomes [17,18]. Codon context fied a mixed-integer nonlinear programming (MINLP) (CC) was shown to correlate with translation elongation problem. Nonetheless, it can be linearized using a strat- rate such that the usage of rare codon pairs decreased pro- egy shown in an earlier study by decomposing the non- k k tein translation rates [19]. Therefore, the incorporation of linear |p0 − p1| term (see Equation 23) into a series of CC has been proposed in the conventional ICU-based linear and integer constraints which consist of binary gene optimization algorithm GeneOptimizer [20]. Further- and positive real variables [24]. The resultant mixed- more, a patented technology, known as “Translation integer linear programming (MILP) problem can be Engineering”, demonstrated that better enhancement in solved using well established computational methods translational efficiency is achievable by optimizing codon such as either branch-and-bound and branch-and-cut pair usage in addition to ICU optimization [21]. However, [25]. However, due to the large and discrete search space there is yet a study to investigate the relative effects of which contains all possible DNA sequences that can en- ICU and CC on protein expression. To address this issue, code the target protein, solving the MILP using these we propose a computational analysis to evaluate the per- methods may require a long computational time. Thus, formance of sequences generated by various ICU and CC alternative methods, such as GASCO [26] and QPSOBT optimization approaches. [27], have been proposed for solving ICO using genetic In this study, we applied novel computational proce- algorithm and particle swarm optimization. Although dures to generate DNA sequences exhibiting optimal these heuristic methods are more efficient than conven- ICU and CC in Escherichia coli, Lactococcus lactis, Pichia tional MILP solving procedures, they still require a sig- pastoris and Saccharomyces cerevisiae based on informa- nificant amount of computational resources due to the tion obtained from omics data analysis. While E. coli and iterative nature of the algorithms. To circumvent the S. cerevisiae has been model organisms for recombinant high computational costs, we developed the non-iterative protein production studies, we also consider codon method for solving ICO using the following steps: optimization in the Gram-positive bacterium L. lactis and methylotrophic yeast P. pastoris since they are also I1. Calculate the host’s individual codon usage k promising candidates for expressing recombinant pro- distribution, p0. j teins [22,23]. Assuming that the native DNA sequences I2. Calculate the subject’s amino acid counts, θAA,1. of highly expressed genes have evolved to exhibit optimal I3. Calculate the optimal codon counts for the subject X21 h ICU and CC for high in vivo expression, we demon- È using the expression: θk ¼ pk Â θj Â 1 αj ¼ strated the efficacy of our computational approaches by C;opt 0 A;1 ÀÁ j¼1 performing a leave-one-out cross-validation on the high- f κk g ∀k∈fg1; 2; ...; 64 . expression genes for each expression host. I4. For each τi in the subject’s sequence, randomly k k k assign a codon κ if θC > 0, and decrement θC,opt by one.

Computational Codon Optimization of Synthetic Gene for Protein Expression Bevan Kai-Sheng Chung1,2,3 and Dong-Yup Lee1,2,3*

Gene Composer: Database Software for Protein Construct Design, Codon

Paulo Miguel Da Silva Gaspar Optimizaç˜Ao De Genes Para

Attenuated Viruses Useful for Vaccines Für Impfstoffe Geeignete Abgeschwächte Viren Virus Attenues Utiles Pour Des Vaccins

Computational Tools for Synthetic Gene Optimization

Computational Tools and Algorithms for Designing Customized

(12) United States Patent (10) Patent No.: US 8,697.417 B2 Bakker Et Al

All-In-One Molecular Cloning and Genetic Engineering Design, Simulation & Management Software for Complex Synthetic Biology and Metabolic Engineering Projects

Open-Source Web Application for Rare Codon Identification and Custom DNA Sequence Optimization Edward Daniel1, Goodluck U

2008 Holiday Publication

Views and Protocols

( 12 ) United States Patent

Efficient Codon Optimization with Motif Engineering