Title of sequences for DNA computing based on a thermodynamic approach

Author(s) Tanaka, Fumiaki; Kameda, Atsushi; Yamamoto, Masahito; Ohuchi, Azuma

Nucleic Acids Research, 33(3), 903-911 Citation https://doi.org/10.1093/nar/gki235

Issue Date 2005

Doc URL http://hdl.handle.net/2115/64504

Type article

File Information gki235.pdf

Instructions for use

Hokkaido University Collection of Scholarly and Academic Papers : HUSCAP Published online February 8, 2005

Nucleic Acids Research, 2005, Vol. 33, No. 3 903–911 doi:10.1093/nar/gki235 Design of nucleic acid sequences for DNA computing based on a thermodynamic approach Fumiaki Tanaka1,*, Atsushi Kameda2, Masahito Yamamoto1,2 and Azuma Ohuchi1,2

1Graduate School of , Hokkaido University, North 13, West 8, Kita-ku, Sapporo 060-8628, Japan and 2CREST, Japan Science and Technology Corporation, 4-1-8, Honmachi, Kawaguchi, Saitama, 332-0012, Japan

Received October 19, 2004; Revised and Accepted January 20, 2005

ABSTRACT order to satisfy the constraints based on the physicochemical properties of nucleic acids. In particular, it is essential to We have developed an for designing prevent undesired hybridization. In DNA computing, multiple multiple sequences of nucleic acids that have a sequences need to be designed that do not hybridize non- uniform melting temperature between the sequence specifically with each other (10,11), while in RNA secondary and its complement and that do not hybridize non- structure design, a single sequence needs to be designed that specifically with each other based on the minimum folds into the desired secondary structure (12–15). For free energy (DGmin). Sequences that satisfy these example, 40 DNA sequences of length 15 were designed to constraints can be utilized in computations, various prevent undesired hybridization and used for solving a 20- engineering applications such as microarrays, and variable instance of a three-satisfiability problem (2). In the nano-fabrications. Our algorithm is a random field of microarrays, 69 122 DNA sequences of length 45–47 generate-and-test algorithm: it generates a candidate were designed to hybridize specifically to 24 502 transcripts from Arabidopsis thaliana (4). Furthermore, four DNA sequence randomly and tests whether the sequence sequences of length 26 and four DNA sequences of length satisfies the constraints. The novelty of our algorithm 48 were used to construct a periodic two-dimensional crystal- is that the filtering method uses a greedy search to line lattice (7). These sequences were carefully designed for calculate DGmin. This effectively excludes inappropri- the intended hybridization. Thus, an algorithm/program that ate sequences before DGmin is calculated, thereby generates multiple sequences, which do not hybridize non- reducing computation time drastically when specifically with each other, is useful for various applications compared with an algorithm without the filtering. of nucleic acid. Experimental results in silico showed the superiority We applied a thermodynamic approach to the nucleic acid of the greedy search over the traditional approach design for DNA computing. In particular, we focused on de- signing a pool P containing n sequences of length l for which based on the hamming distance. In addition, experi- À (i) the duplex melting temperature (TM) is in the range TM to mental results in vitro demonstrated that the experi- þ TM for any pairwise duplex of a sequence in P and its com- mental free energy (DGexp) of 126 sequences plement and (ii) the minimum free energy (DGmin) is greater correlated well with DGmin ( jRj = 0.90) than with the à than a threshold (DGmin) in any pairwise duplex of sequences hamming distance ( jRj = 0.80). These results validate in P and any concatenation of two sequences in P plus their the rationality of a thermodynamic approach. We complements except for the pairwise duplex of a sequence in P implemented our algorithm in a graphic user and its complement. Traditional approaches to the sequence interface-based program written in Java. design in DNA computing have approximated the stability between two sequences using the hamming distance (i.e. the number of base pairs) rather than DGmin. However, INTRODUCTION since the hamming distance is only an approximation of the Nucleic acids are now being utilized in computations (1–3), stability, DGmin is preferable for predicting the stability. In various engineering applications such as microarrays (4–6), practice, an RNA secondary structure can be adequately and nano-fabrications (7–9). In these fields of research, called predicted using DGmin rather than the number of base pairs ‘DNA computing’, nucleic acid design or sequence design is a (16). However, the algorithm for calculating the DGmin for crucial problem in engineering using nucleic acids. Nucleic double-stranded DNA requires time complexity O(l3), acid design is deciding the base sequences (e.g. ‘GCTAGCT- where l is the length of the sequence, as is the case with AGGTTTA’, ..., ‘ATCGTACGCTATGTGCA’ in DNA) in secondary structure prediction for single-stranded RNA.

*To whom correspondence should be addressed. Tel: +81 11 716 2111, ext 6498; Fax: +81 11 706 7834; Email: fumiaki@-comp.org

ª The Author 2005. Published by Oxford University Press. All rights reserved.

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] 904 Nucleic Acids Research, 2005, Vol. 33, No. 3

Since all the combinations of three sequences must be evalu- paper, we formulated li such that li = lj = l (0 < i, ated in this design, dozens of sequences cannot be designed j < n À 1), although we can extend our algorithm easily within a reasonable computation time. Another approach that for li „ lj (0 < i „ j < n À 1). We define P as the pool of reduces the computation time is thus needed. Andronescu et al. n sequences, with length l, to be designed. Furthermore, let (13) overcame this drawback in the secondary structure design P = fU0, U1, ..., UnÀ1g and Q = fV0, V1, ..., VnÀ1g such that À þ by hierarchically decomposing the secondary structure into Vi (0 < i < n À 1) is the complement of Ui. TM and TM are smaller substructures and integrating the partial sequences. defined as the lower and upper thresholds of TM for the duplex à By using this method, they designed sequences that a tradi- between a sequence and its complement. Moreover, DGmin is tional program cannot. defined as the threshold of DGmin given by the sequence We have developed a random generate-and-test algorithm . Thus, our algorithm a pool P containing that generates a candidate sequence randomly and tests n sequences of length l for which (i) the duplex TM is in the À þ whether the sequence satisfies the constraints. It stores the range from TM to TM for any duplex of Ui (0 < i < n À 1) and à sequence in a sequence pool if and only if it satisfies all Vi, and (ii) DGmin is greater than DGmin for any pairwise duplex the constraints. To reduce the difficulty in computation of sequences in P and any concatenation of two sequences time, we use a greedy search to calculate DGmin. The advant- in P [ Q except for the pairwise duplex of Ui and Vi. age of a greedy search is that it approximates DGmin in less Here, we describe more specifically the combination of 2 time, with time complexity O(l ), than a rigorous algorithm, sequences to be calculated using the DGmin. The sequence 3 which calculates DGmin with time complexity O(l ). The Ui (0 < i < n À 1) is denoted by a string of bases such as 0 1 lÀ1 0 0 DGmin approximated using the greedy search (denoted by ui ui ÁÁÁui (5 to 3 direction), and similarly, Vi (0 < i < 0 1 lÀ1 0 0 DGgre) correlated well with DGmin. The correlation coefficients n À 1) is denoted as vi vi ÁÁÁvi (5 to 3 direction). For 0 0 0 were 0.95, 0.85 and 0.76 at 20mer, 40mer and 60mer lengths, example, if Ui = 5 -AAATTTCCCGGG-3 , then Vi = 5 -CCC- 0 respectively. Furthermore, since DGgre is the upper bound GGGAAATTT-3 . Furthermore, let be the combination for DGmin (i.e. DGmin < DGgre), the sequence such that of sequences X and Y, and XY be the concatenation of se- à à DGgre < DGmin is sure to satisfy DGmin < DGmin. Therefore, quences X and Y in that order. For example, if Ui, Uj and 0 0 0 using a greedy search excludes in advance most inappropriate Uk (0 < i, j, k < n À 1) are 5 -AAATTT-3 ,5-CCCGGG- 0 0 0 sequences before DGmin is calculated without excluding an 3 and 5 -TCTCTC-3 , respectively, then means the appropriate sequence. With this approach, our algorithm combination of sequences 50-AAATTTCCCGGG-30 and 50- reduces computation time. TCTCTC-30. In our algorithm, the following combinations To evaluate our algorithm, we investigated the effectiveness are considered for the DGmin calculation. of the greedy search filtering. We compared the computation (i) hU U U i (0 < i, j, k < n À 1) time of our algorithm with that of the same algorithm but i j k (ii) hU U V i (0 < i, j, k < n À 1), i „ k without the greedy search filtering. The experimental results i j k (iii) hU V U i (0 < i, j, k < n À 1), i „ j showed that the greedy search filtering reduces the total com- i j k (iv) hU V V i (0 < i, j, k < n À 1), (i „ j) ^ (i „ k). putation time drastically. For example, using the filtering i j k 0 reduced the computation time to 83% for 30 sequences For generality, we use two sequences, S(=s0 s1 ÁÁÁsNÀ1)(5 to 0 0 0 with length 20 and to 87% for 20 sequences with length 15 3 direction) and T(=t0 t1 ÁÁÁtMÀ1)(3 to 5 direction), to À þ à when TM ¼ 69:58, TM ¼ 72:58 and DGmin ¼À10:0, and describe the DGmin calculation in detail. The sequences are À þ à TM ¼ 60:49, TM ¼ 63:49 and DGmin ¼À7:0, respectively. defined to be antiparallel to each other. Note that any two In addition, we compared the greedy search with the tradi- sequences can be represented by S and T. In this paper, tional approach based on a hamming distance in terms of the S and T represent Ui and the reverse sequence of XjYk filtering performance. We demonstrated that the greedy search (Xj 2fUj, Vjg, Xk 2fUk, Vkg). can filter out inappropriate sequences better than using the The notation si Á tj represents the between the i-th hamming distance. In a laboratory experiment, we investig- base in sequence S and the j-th base in sequence T, hence the ated the correlation coefficient between DGmin and the experi- structure between S and T is a set of base pairs such that each mental free energy (DGexp) and that between the hamming base is paired at most once. In addition, (si Á tj, si0 Á tj0) 0 0 distance and the DGexp using 126 duplexes. The DGexp cor- fðÞ^0 < i < i < NÀ1 ðÞg0 < j < j < MÀ1 is defined as a related better with DGmin ( j R j = 0.90) than with the hamming structure in which base pairs si Á tj and si0 Á tj0 are formed and distance ( j R j = 0.80). the sequences siþ1siþ2 ÁÁÁsi0À1 and tjþ1tjþ2 ÁÁÁtj0À1 do not form To implement our algorithm, we developed a computer any base pairs. The term (x‚si0 Á tj0 ) fx 2fsi‚tjg‚ð0 < i < program called DNA-SDT, a graphic user interface (GUI)- i0 < NÀ1Þ^ðÞg0 < j < j0 < MÀ1 is defined as a structure based application written in Java. This program enables in which base pair si0 Á tj0 is formed and sequence sisiþ1 ÁÁÁ users to design DNA sequences with our algorithm and can si0À1 in the case x ¼ si (tjtjþ1 ÁÁÁtj0À1 in the case x ¼ tj) do not be downloaded freely from the web site (http://ses3.complex. form any base pair. Similarly, ðsi Á tj‚xÞfx 2fsi0 ‚tj0 g‚0ð < i < eng.hokudai.ac.jp/~fumi95/DNA-SDT/index.html). i0 < NÀ1Þ^ðÞg0 < j < j0 < MÀ1 is defined as a structure in which base pair si Á tj is formed and the sequence siþ1siþ2 ÁÁÁsi0 in the case x ¼ si0 (tjþ1tjþ2 ÁÁÁtj0 in the case ALGORITHM x ¼ tj0 ) do not form any base pair. Definition Outline

Let n be the number of sequences to be designed and li Our algorithm is a random generate-and-test algorithm that (0 < i < n À 1) be the length of each sequence. In this generates a sequence randomly and then stores it in the pool if Nucleic Acids Research, 2005, Vol. 33, No. 3 905

and only if the sequence satisfies all the constraints. The is calculated using a greedy search. Both DGmin and DGgre are main feature of our algorithm is that it uses a greedy search calculated based on the nearest-neighbor model, which calcu- for calculating DGmin to filter out inappropriate sequences. lates the total free energy as the summation of the contribu- The advantage of a greedy search is that it approximates tions of various elementary structures (18,19). The elementary DGmin in less time when compared with a well-known structures considered in this paper are stacking base pairs, dynamic programming algorithm (17). Therefore, using bulge loops, internal loops, dangling ends and free ends. the greedy search before the DGmin calculation to exclude The contributions of the stacking base pairs, defined as inappropriate sequences reduces the computation time. Here- si Á tj‚siþ1 Á tjþ1 , to the free energy are calculated using after, the approximated DGmin using a greedy search is denoted 12 parameters reported previously (19). The free energy con- by DGgre. tributions of the loop regions are sequence dependent The algorithm uses three filters: (15,16,20). The free energies of single bulge loops, defined as s Á t ‚s Á t or s Á t ‚s Á t , are calculated (i) T filter: checks whether the T of candidate sequence i j iþ2 jþ1 i j iþ1 jþ2 M M using 64 parameters covering all the possible combinations U ðÞ0 < c < n À 1 and V is in the range from TÀ to Tþ . c c M M of bulged base and flanking base pairs (19). The free energies If not, U is rejected. c of the other loops, bulge loops longer than one and internal (ii) DG filter: checks whether DG is greater than the gre gre loops, are calculated using conventional parameters and equa- threshold, DGà , for all the combinations above in P, pro- gre tions (20,21). Bulge loops longer than one are defined as vided that the candidate sequence U ðÞ0 < c < n À 1 or c f s Á t ‚s Á t ^ ðÞgl>3 or s Á t ‚s Á t ^ ðÞl > 3 , V is included in that combination. If the DG of any i j iþl jþ1 i j iþ1 jþl c gre and the internal loops are defined as s Á t ‚s Á t ^ combination is less than or equal to DGà , U is rejected. i j iþl jþm gre c ðÞl‚m>2 . The free energies of dangling ends, defined as (iii) DG filter: checks whether DG is greater than the min min ðÞs ‚s Á t , ðÞt ‚s Á t , ðÞs Á t ‚s or ðs Á t ‚ threshold, DGà , for all the combinations above in P, 0 1 0 0 0 1 NÀ2 MÀ1 NÀ1 NÀ1 MÀ2 min t Þ, are calculated using 32 parameters covering all the provided that the candidate sequence U ð0 < c < MÀ1 c possible combinations (22). The free ends are defined as n À 1Þ or Vc is included in that combination. If the the sequences s0 ÁÁÁsi and t0 ÁÁÁtj closing with si Á tj such DGmin of any combination is less than or equal to à that both si0ðÞ