Design of Nucleic Acid Sequences for DNA Computing Based on a Thermodynamic Approach
Total Page:16
File Type:pdf, Size:1020Kb
Title Design of nucleic acid sequences for DNA computing based on a thermodynamic approach Author(s) Tanaka, Fumiaki; Kameda, Atsushi; Yamamoto, Masahito; Ohuchi, Azuma Nucleic Acids Research, 33(3), 903-911 Citation https://doi.org/10.1093/nar/gki235 Issue Date 2005 Doc URL http://hdl.handle.net/2115/64504 Type article File Information gki235.pdf Instructions for use Hokkaido University Collection of Scholarly and Academic Papers : HUSCAP Published online February 8, 2005 Nucleic Acids Research, 2005, Vol. 33, No. 3 903–911 doi:10.1093/nar/gki235 Design of nucleic acid sequences for DNA computing based on a thermodynamic approach Fumiaki Tanaka1,*, Atsushi Kameda2, Masahito Yamamoto1,2 and Azuma Ohuchi1,2 1Graduate School of Engineering, Hokkaido University, North 13, West 8, Kita-ku, Sapporo 060-8628, Japan and 2CREST, Japan Science and Technology Corporation, 4-1-8, Honmachi, Kawaguchi, Saitama, 332-0012, Japan Received October 19, 2004; Revised and Accepted January 20, 2005 ABSTRACT order to satisfy the constraints based on the physicochemical properties of nucleic acids. In particular, it is essential to We have developed an algorithm for designing prevent undesired hybridization. In DNA computing, multiple multiple sequences of nucleic acids that have a sequences need to be designed that do not hybridize non- uniform melting temperature between the sequence specifically with each other (10,11), while in RNA secondary and its complement and that do not hybridize non- structure design, a single sequence needs to be designed that specifically with each other based on the minimum folds into the desired secondary structure (12–15). For free energy (DGmin). Sequences that satisfy these example, 40 DNA sequences of length 15 were designed to constraints can be utilized in computations, various prevent undesired hybridization and used for solving a 20- engineering applications such as microarrays, and variable instance of a three-satisfiability problem (2). In the nano-fabrications. Our algorithm is a random field of microarrays, 69 122 DNA sequences of length 45–47 generate-and-test algorithm: it generates a candidate were designed to hybridize specifically to 24 502 transcripts from Arabidopsis thaliana (4). Furthermore, four DNA sequence randomly and tests whether the sequence sequences of length 26 and four DNA sequences of length satisfies the constraints. The novelty of our algorithm 48 were used to construct a periodic two-dimensional crystal- is that the filtering method uses a greedy search to line lattice (7). These sequences were carefully designed for calculate DGmin. This effectively excludes inappropri- the intended hybridization. Thus, an algorithm/program that ate sequences before DGmin is calculated, thereby generates multiple sequences, which do not hybridize non- reducing computation time drastically when specifically with each other, is useful for various applications compared with an algorithm without the filtering. of nucleic acid. Experimental results in silico showed the superiority We applied a thermodynamic approach to the nucleic acid of the greedy search over the traditional approach design for DNA computing. In particular, we focused on de- signing a pool P containing n sequences of length l for which based on the hamming distance. In addition, experi- À (i) the duplex melting temperature (TM) is in the range TM to mental results in vitro demonstrated that the experi- þ TM for any pairwise duplex of a sequence in P and its com- mental free energy (DGexp) of 126 sequences plement and (ii) the minimum free energy (DGmin) is greater correlated well with DGmin ( jRj = 0.90) than with the à than a threshold (DGmin) in any pairwise duplex of sequences hamming distance ( jRj = 0.80). These results validate in P and any concatenation of two sequences in P plus their the rationality of a thermodynamic approach. We complements except for the pairwise duplex of a sequence in P implemented our algorithm in a graphic user and its complement. Traditional approaches to the sequence interface-based program written in Java. design in DNA computing have approximated the stability between two sequences using the hamming distance (i.e. the number of base pairs) rather than DGmin. However, INTRODUCTION since the hamming distance is only an approximation of the Nucleic acids are now being utilized in computations (1–3), stability, DGmin is preferable for predicting the stability. In various engineering applications such as microarrays (4–6), practice, an RNA secondary structure can be adequately and nano-fabrications (7–9). In these fields of research, called predicted using DGmin rather than the number of base pairs ‘DNA computing’, nucleic acid design or sequence design is a (16). However, the algorithm for calculating the DGmin for crucial problem in engineering using nucleic acids. Nucleic double-stranded DNA requires time complexity O(l3), acid design is deciding the base sequences (e.g. ‘GCTAGCT- where l is the length of the sequence, as is the case with AGGTTTA’, ..., ‘ATCGTACGCTATGTGCA’ in DNA) in secondary structure prediction for single-stranded RNA. *To whom correspondence should be addressed. Tel: +81 11 716 2111, ext 6498; Fax: +81 11 706 7834; Email: [email protected] ª The Author 2005. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] 904 Nucleic Acids Research, 2005, Vol. 33, No. 3 Since all the combinations of three sequences must be evalu- paper, we formulated li such that li = lj = l (0 < i, ated in this design, dozens of sequences cannot be designed j < n À 1), although we can extend our algorithm easily within a reasonable computation time. Another approach that for li „ lj (0 < i „ j < n À 1). We define P as the pool of reduces the computation time is thus needed. Andronescu et al. n sequences, with length l, to be designed. Furthermore, let (13) overcame this drawback in the secondary structure design P = fU0, U1, ..., UnÀ1g and Q = fV0, V1, ..., VnÀ1g such that À þ by hierarchically decomposing the secondary structure into Vi (0 < i < n À 1) is the complement of Ui. TM and TM are smaller substructures and integrating the partial sequences. defined as the lower and upper thresholds of TM for the duplex à By using this method, they designed sequences that a tradi- between a sequence and its complement. Moreover, DGmin is tional program cannot. defined as the threshold of DGmin given by the sequence We have developed a random generate-and-test algorithm designer. Thus, our algorithm designs a pool P containing that generates a candidate sequence randomly and tests n sequences of length l for which (i) the duplex TM is in the À þ whether the sequence satisfies the constraints. It stores the range from TM to TM for any duplex of Ui (0 < i < n À 1) and à sequence in a sequence pool if and only if it satisfies all Vi, and (ii) DGmin is greater than DGmin for any pairwise duplex the constraints. To reduce the difficulty in computation of sequences in P and any concatenation of two sequences time, we use a greedy search to calculate DGmin. The advant- in P [ Q except for the pairwise duplex of Ui and Vi. age of a greedy search is that it approximates DGmin in less Here, we describe more specifically the combination of 2 time, with time complexity O(l ), than a rigorous algorithm, sequences to be calculated using the DGmin. The sequence 3 which calculates DGmin with time complexity O(l ). The Ui (0 < i < n À 1) is denoted by a string of bases such as 0 1 lÀ1 0 0 DGmin approximated using the greedy search (denoted by ui ui ÁÁÁui (5 to 3 direction), and similarly, Vi (0 < i < 0 1 lÀ1 0 0 DGgre) correlated well with DGmin. The correlation coefficients n À 1) is denoted as vi vi ÁÁÁvi (5 to 3 direction). For 0 0 0 were 0.95, 0.85 and 0.76 at 20mer, 40mer and 60mer lengths, example, if Ui = 5 -AAATTTCCCGGG-3 , then Vi = 5 -CCC- 0 respectively. Furthermore, since DGgre is the upper bound GGGAAATTT-3 . Furthermore, let <X, Y> be the combination for DGmin (i.e. DGmin < DGgre), the sequence such that of sequences X and Y, and XY be the concatenation of se- à à DGgre < DGmin is sure to satisfy DGmin < DGmin. Therefore, quences X and Y in that order. For example, if Ui, Uj and 0 0 0 using a greedy search excludes in advance most inappropriate Uk (0 < i, j, k < n À 1) are 5 -AAATTT-3 ,5-CCCGGG- 0 0 0 sequences before DGmin is calculated without excluding an 3 and 5 -TCTCTC-3 , respectively, then <UiUj,Uk> means the appropriate sequence. With this approach, our algorithm combination of sequences 50-AAATTTCCCGGG-30 and 50- reduces computation time. TCTCTC-30. In our algorithm, the following combinations To evaluate our algorithm, we investigated the effectiveness are considered for the DGmin calculation. of the greedy search filtering. We compared the computation (i) hU U U i (0 < i, j, k < n À 1) time of our algorithm with that of the same algorithm but i j k (ii) hU U V i (0 < i, j, k < n À 1), i „ k without the greedy search filtering.