Views and Protocols
Total Page:16
File Type:pdf, Size:1020Kb
Tuan-Anh et al. BMC Bioinformatics (2017) 18:100 DOI 10.1186/s12859-017-1517-z METHODOLOGY ARTICLE Open Access Novel methods to optimize gene and statistic test for evaluation – an application for Escherichia coli Tran Tuan-Anh1, Le Thi Ly2, Ngo Quoc Viet3 and Pham The Bao1* Abstract Background: Since the recombinant protein was discovered, it has become more popular in many aspects of life science. The value of global pharmaceutical market was $87 billion in 2008 and the sales for industrial enzyme exceeded $4 billion in 2012. This is strong evidence showing the great potential of recombinant protein. However, native genes introduced into a host can cause incompatibility of codon usage bias, GC content, repeat region, Shine-Dalgarno sequence with host’s expression system, so the yields can fall down significantly. Hence, we propose novel methods for gene optimization based on neural network, Bayesian theory, and Euclidian distance. Result: The correlation coefficients of our neural network are 0.86, 0.73, and 0.90 in training, validation, and testing process. In addition, genes optimized by our methods seem to associate with highly expressed genes and give reasonable codon adaptation index values. Furthermore, genes optimized by the proposed methods are highly matched with the previous experimental data. Conclusion: The proposed methods have high potential for gene optimization and further researches in gene expression. We built a demonstrative program using Matlab R2014a under Mac OS X. The program was published in both standalone executable program and Matlab function files. The developed program can be accessed from http://www.math.hcmus.edu.vn/~ptbao/paper_soft/GeneOptProg/. Keywords: Gene optimization, Neural network, Bayes’ theorem, Euclidean distance, Codon usage bias, Highly expressed gene Background The increment in recombinant protein productivity Since Paul Berg and Peter Lobban each independently pro- reduces a significant production cost, so it might dramatic- posed an approach to generate recombinant DNA in 1969 ally raise profits. In order to improve the productivity, sev- – 1970, recombinant protein has become a widespread tool eral aspects can be optimized such as purification process, for both cellular and molecular biology. For instance, in culture medium and genetic materials (including operator, 2004, more than 75 recombinant proteins were used as promoter, and gene). In this study, we only focus on gene medicine and more than 360 pharmaceuticals based on re- optimization. combinant protein were under development [1]. Moreover, Introducing native genes into a host can cause incompati- Elena’s study indicated that the global market of industrial bility of codon usage bias, GC content, repeat region, enzymes exceeded $4 billion in 2012 [2]. In the future, this Shine-Dalgarno sequence with host’sexpressionsystem. figure can be raised considerably thanks to the applications The yields can fall down significantly [3–7]. In a culture of synthetic biology tools which will improve the productiv- medium, synonymous genes, which share the same oper- ity of recombinant proteins production. ator and promoter, can be expressed at different levels. The synthetic codon optimized gene results in protein level that were ~2 to 22 fold greater than the amounts reported for * Correspondence: [email protected] the native genes [8–12]. A gene optimization program 1Faculty of Mathematics and Computer Science, VNUHCM-University of Science, 227 Nguyen Van Cu Street, District 5, Ho Chi Minh City, Vietnam based on machine learning approach and experimental data Full list of author information is available at the end of the article can handle redesign task rapidly instead of using “brute © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Tuan-Anh et al. BMC Bioinformatics (2017) 18:100 Page 2 of 10 force” method, which consume more significant times than Methods other resources. Data collection The foundation of gene optimization is a phenomenon We used highly expressed genes (HEG) as the reference set of codon synonym. A codon is constructed by three ri- for codon adaptation index (CAI) computing [32]. We also bonucleotides, so there are 43 = 64 codons (there are 4 collected redesigned genes and respective translational ex- kinds of ribonucleotides: A – Adenine, U – Uracil, G – pression levels of product (Table 1). The experimental data Guanine and C - Cytosine). However, 61 types of codons collection process was based on four criteria: 1) expression can code for only 20 kinds of amino acids. This means system should be Escherichia coli, 2) the experiments should there must be several amino acids encoded by at least two express both native and optimized genes, 3) the sequences codons. If one amino acid is coded by several codons, these and respective quantitative productivity should be provided, codons are called synonymous codons. Moreover, codon and 4) expression level should be recorded or could be con- usage is diverse from organism to organism [3, 13–15]. verted to mg/L. The data would be used to form an NN in a Generally, genes having compatible codon usage bias with later step. host’s expression system are usually highly expressed in translational levels. Codon usage bias measurements The aim of gene optimization program is to indicate The preference of codons is correlated with intracellular which synonymous genes can give higher yield by using tRNA concentration in a host environment and reflects a variety of approach including one amino acid – one codon balance necessary for expression efficiency [3, 6, 9, 15]. (JCAT, Codon Optimizer, INCA, UPGene, Gene De- Translation process can be delayed when ribosomes signer), randomization (DNA Works), hybrid (DNA encounter rare codons, which can cause ribosomes to de- Works), Monte Carlo (Gene Designer), genetic algorithm tach from mRNA and abort translation [9]. Moreover, (GASSCO, EuGene), etc. [16–22]. In some case, one mRNA translation rate may impact the secondary structure amino acid – one codon method which replaces rare co- of encoded protein in that frequently used codons tend to dons by the most preferable usage codons can result in encode for structural elements while rare codons are associ- worse protein expression as reported in many past studies ated with linkers [11, 33, 34]. [11, 23–26]. Yields of genes redesigned by randomization CAI is one of the most popular and effective measures for method are greater than yields of native genes, yet the re- quantifying codon usage bias of a gene toward a reference sult is uncertain and we cannot predict expression level set of highly expressed genes [35]. Given a gene g ={g1, g2, until experiment finished [11, 20]. Genetic algorithm and …, gi, … gL(g)}, CAI is defined as (1) Monte Carlo method with linear target function seem !1 more reasonable than other reported methods. However, LgYðÞ LgðÞ parameter estimation has been yet reported [18, 27]. A ðÞ¼ ð Þ CAI g wagi 1 nonlinear method based on neural network was proposed i¼1 but an analysis of its performance was not provided [28]. Some redesigned genes were proven for high expression where L(g) is the length of gene g counted by codon, gi th by experiment [19, 29–31]. However, the most important is the i codon of gene g, ac is generally a codon c cod- ≡ disadvantage is that almost all of these studies did not pro- ing for amino acid a. In this case, c gi, wac described as vide any method to construct the model within an actual (2) is the relative adaptiveness of ac, and oac(HEG) is the experimental data and to evaluate the optimization count of ac in HEG set. methods based on statistics [16–20, 22, 27]. ðÞ Machine learning approaches have been developed rap- oac HEG wac ¼ ð2Þ max oacðÞHEG idly for recent decades. These methods could analyze and oac “learn” pattern from data sources and predict precisely the outcome of a new data instance. Artificial neural network Relative synonymous codon usage (RSCU), which maps (NN) and Bayesian decision are two of the most efficient genes into a 59-dimensional vector space is also a common and popular machine learning algorithm worldwide. NN is a strong learning technique and appropriated with both Table 1 Collected data including redesigned genes and regression and classification problem. Bayesian decision is respective product highly acclaimed due to its simplicity. Host Product Number of genes Reference These are the reason why we propose a novel method E. coli BL21 DNA Polymerase and scFV 62 [46] for gene optimization base on Bayesian theory and Neural E. coli BL21 Cystatin C 2 [12] network which are the most common learning methods E. coli BL21 PEDF 2 [9] using probability and statistics background. We also use E. coli W3110 Prochymosin 7 [11] statistic test to evaluate and compare these methods. Tuan-Anh et al. BMC Bioinformatics (2017) 18:100 Page 3 of 10 Fig. 1 Properties of HEGP and DHEG. The top plots are distribution of HEG’s CAI value. The bottom right plot is distribution of HEG’s GC value. The bottom center plots illustrate HEG probability and distance to HEG of randomly generated gene sequences with respect to their CAI and GC value measure and widely used in gene clustering [36]. The GC content RSCU is Some studies indicated that GC content can impact the stability of the 2nd structure of mRNA which was benefi- o ðÞg cial for translation [7, 10].