Open-Source Web Application for Rare Codon Identification and Custom DNA Sequence Optimization Edward Daniel1, Goodluck U

Daniel et al. BMC Bioinformatics (2015) 16:303 DOI 10.1186/s12859-015-0743-5 SOFTWARE Open Access ATGme: Open-source web application for rare codon identification and custom DNA sequence optimization Edward Daniel1, Goodluck U. Onwukwe1, Rik K. Wierenga1, Susan E. Quaggin2, Seppo J. Vainio3 and Mirja Krause3* Abstract Background: Codon usage plays a crucial role when recombinant proteins are expressed in different organisms. This is especially the case if the codon usage frequency of the organism of origin and the target host organism differ significantly, for example when a human gene is expressed in E. coli. Therefore, to enable or enhance efficient gene expression it is of great importance to identify rare codons in any given DNA sequence and subsequently mutate these to codons which are more frequently used in the expression host. Results: We describe an open-source web-based application, ATGme, which can in a first step identify rare and highly rare codons from most organisms, and secondly gives the user the possibility to optimize the sequence. Conclusions: This application provides a simple user-friendly interface utilizing three optimization strategies: 1. one-click optimization, 2. bulk optimization (by codon-type), 3. individualized custom (codon-by-codon) optimization. ATGme is an open-source application which is freely available at: http://atgme.org Keywords: Codon usage, Sequence optimization, Protein, Translation, DNA Background [4]. Studies have shown that the presence of rare codons Alternative synonymous codons are not used with equal influences gene expression levels [4, 5] and the solubility frequencies in one organism or also when different organ- and amount of the expressed protein [2]. isms are compared [1]. Rare codons are those codons Interestingly, a recent study suggests that some clusters which are used with lower frequencies, < 10 ‰, in a specific of rare codons (in proteins longer than 300 amino acid res- expression organism such as Escherichia coli (E.coli)as idues) called slow-translating regions or slow patches, pro- compared to the original host [2]. Rare codons have for bet- vide the protein domains enough time to fold accurately ter assessment sometimes been divided into those codons [6, 7], and thus playing a role in proper protein folding. In that are used with lower frequencies (5 to 10 ‰)andthose fact, it has been reported that increased rate of translation used with lowest frequencies (≤5 ‰) [3]). To differentiate caused by the elimination of translational pauses due to between these frequencies, these codons are classified as the rare codons, resulted in improper protein folding and rare and highly rare codons, respectively [3]. insolubilization [2]. Once the role of such rare codons has When recombinant gene expression is carried out, codon been considered codon usage can be optimized prior to usage plays a significant role in the efficiency of the host protein production to enhance gene expression rates, in expression system. Gene expression accuracy and efficiency any expression system [8]. In this optimization process, can be reduced if the codon usage frequencies of the rare codons are identified and mutated to more frequently organism of origin and the target host organism differ sig- used codons in the host organism without changing the nificantly, and rare codons dominate within the sequence amino acid sequence of the protein. A variety of mathemat- ical and statistical approaches is available to analyze codon * Correspondence: [email protected] usage. These approaches also enable the analysis of codon 3Biocenter Oulu, Laboratory of Developmental Biology, InfoTech Oulu, Center usage bias in whole groups of organisms and multiple gene for Cell Matrix Research, Faculty of Biochemistry and Molecular Medicine, University of Oulu, Aapistie 5A, FIN-90220 Oulu, Finland sets. This has recently been extensively reviewed [9]. Full list of author information is available at the end of the article © 2015 Daniel et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Daniel et al. BMC Bioinformatics (2015) 16:303 Page 2 of 6 With only a few rare codons present these codons can codon-type), c) optimize by codon (individual codon). be changed by point-mutations. However, recent improve- As the user changes codons the progress is displayed ments in technology have enabled cost-effective produc- in the automatically updated output sequence. See tion of synthetic genes, making the simple ordering of an Fig. 2 for detailed screenshots. optimized gene sequence a feasible alternative, irrespective Furthermore, the user has the option to check if his of the number of rare codons present in the target gene. sequence contains certain restriction enzyme recognition Identification of rare codons is done by using a codon sites. Two enzymes can be checked simultaneously. usage table of the host organism. Throughout the process the user can see the alignment of An online database, called “Codon Usage Database” offers the amino acid sequences of the original and optimized access to the codon usage tables of over 35,000 organisms sequences in a separate box (see Fig. 2a). The translation [10, 11]. This database offers the possibility to explore of the DNA sequence into the protein sequence is accord- expression of genes in organisms different to the commonly ing to the standard genetic code. used ones. In contrast to E.coli, other organisms can offer The final output will be the optimized sequence, in which post-translational modification systems that might be useful rare codons (if any should remain) would be highlighted in to express mammalian proteins of scientific and industrial orange and red. Additionally, the A + T and G + C content interest. The usage frequency values available from the and the number of bases will be given. To provide an over- Codon Usage Database represent the mean values of the view throughout the process the usage data is displayed, codonusagebasedoneverygeneofaspecificorganism namely the codon usage table in which rare and highly rare present in the Genbank® [12] as of June 2007. codons are highlighted, respectively. Currently, commercially available tools exist which offer researchers the possibility for codon optimization. These Optimization methods applications are usually expensive for small to medium ATGme displays rare codons in the target sequence in scale laboratories. Additionally, while several openly avail- color. It differentiates between highly rare codons (red) able codon optimization tools have been created, many of and rare codons (orange). The software offers three differ- them are no longer available; others require the users to ent approaches to optimize the target sequence. The first install the software on their computers. Furthermore, they approach (one-click optimize) exchanges all highly rare are commonly limited to the application of the codon and rare codons with the most frequently used synonym- usage for only a few organisms. Both commercial and ous counterpart. The second (bulk optimize) exchanges all non-commercial tools often provide complex results or instances of a specific rare codon always with the same, analysis requiring significant effort or consultation to in- better (or also worse), codon of the user’s choice. The third terpret. Here we describe a simple user-friendly and flex- approach (optimize by codon) gives the user the possibility ible web-based application, called ATGme, which identifies to look at the sequence and change each codon one by rare codons and gives several options for codon usage one. This can be used to address the problem of repetitive optimization. elements or also the generation and/or modification of restriction enzyme cleavage sites. Implementation Technical details Results and discussion ATGme is an open-source web-based application imple- Codon optimization for gene design is usually applied to mented in HTML/CSS for presentation and pure Javascript enable and/or increase protein expression levels in a spe- for program logic. cific host organism. Generally, there are a variety of possible synthetic sequences derived from the starting sequence, Input and output which could lead to increased expression levels. How does The data input requires four steps: (i) Input of the target our software compare to other public web servers and DNA sequence to be optimized in fasta or text format; (ii) stand-alone applications that allow some kind of codon Input of the codon usage table copied from the Codon optimization? Several codon optimization applications have Usage Database [11] (http://www.kazusa.or.jp/codon/) been created over the last decade, but most are not avail- (Fig. 1). After copying, users can freely modify the able any longer, like e.g. UpGene [13], GeneDesign [14], usage table according to their needs; (iii) After starting GeMS [15], Synthetic Gene Designer [16]. the process, the rare and highly rare codons are ATGme does not need to be downloaded, but is available highlighted in orange and red respectively in the input onlinefreeofcharge[17].Itwillnotbeattheriskofnot sequence as well as in the output sequence (in this being available after some time. In terms of the codon step not optimized yet); (iv) The user can then choose optimization the ATGme software applies a highly simpli- between the three different ways to optimize the se- fied approach. It will replace rare codons in the target quence. a) one-click optimize, b) bulk optimize (per sequence with the single most abundant codon of the host Daniel et al. BMC Bioinformatics (2015) 16:303 Page 3 of 6 Fig.

Open-Source Web Application for Rare Codon Identification and Custom DNA Sequence Optimization Edward Daniel1, Goodluck U

Gene Composer: Database Software for Protein Construct Design, Codon

Paulo Miguel Da Silva Gaspar Optimizaç˜Ao De Genes Para

Attenuated Viruses Useful for Vaccines Für Impfstoffe Geeignete Abgeschwächte Viren Virus Attenues Utiles Pour Des Vaccins

Computational Tools for Synthetic Gene Optimization

Computational Tools and Algorithms for Designing Customized

(12) United States Patent (10) Patent No.: US 8,697.417 B2 Bakker Et Al

All-In-One Molecular Cloning and Genetic Engineering Design, Simulation & Management Software for Complex Synthetic Biology and Metabolic Engineering Projects

Computational Codon Optimization of Synthetic Gene for Protein Expression Bevan Kai-Sheng Chung1,2,3 and Dong-Yup Lee1,2,3*

2008 Holiday Publication

Views and Protocols

( 12 ) United States Patent

Efficient Codon Optimization with Motif Engineering