Title Design of nucleic acid sequences for DNA computing based on a thermodynamic approach
Author(s) Tanaka, Fumiaki; Kameda, Atsushi; Yamamoto, Masahito; Ohuchi, Azuma
Nucleic Acids Research, 33(3), 903-911 Citation https://doi.org/10.1093/nar/gki235
Issue Date 2005
Doc URL http://hdl.handle.net/2115/64504
Type article
File Information gki235.pdf
Instructions for use
Hokkaido University Collection of Scholarly and Academic Papers : HUSCAP Published online February 8, 2005
Nucleic Acids Research, 2005, Vol. 33, No. 3 903–911 doi:10.1093/nar/gki235 Design of nucleic acid sequences for DNA computing based on a thermodynamic approach Fumiaki Tanaka1,*, Atsushi Kameda2, Masahito Yamamoto1,2 and Azuma Ohuchi1,2
1Graduate School of Engineering, Hokkaido University, North 13, West 8, Kita-ku, Sapporo 060-8628, Japan and 2CREST, Japan Science and Technology Corporation, 4-1-8, Honmachi, Kawaguchi, Saitama, 332-0012, Japan
Received October 19, 2004; Revised and Accepted January 20, 2005
ABSTRACT order to satisfy the constraints based on the physicochemical properties of nucleic acids. In particular, it is essential to We have developed an algorithm for designing prevent undesired hybridization. In DNA computing, multiple multiple sequences of nucleic acids that have a sequences need to be designed that do not hybridize non- uniform melting temperature between the sequence specifically with each other (10,11), while in RNA secondary and its complement and that do not hybridize non- structure design, a single sequence needs to be designed that specifically with each other based on the minimum folds into the desired secondary structure (12–15). For free energy (DGmin). Sequences that satisfy these example, 40 DNA sequences of length 15 were designed to constraints can be utilized in computations, various prevent undesired hybridization and used for solving a 20- engineering applications such as microarrays, and variable instance of a three-satisfiability problem (2). In the nano-fabrications. Our algorithm is a random field of microarrays, 69 122 DNA sequences of length 45–47 generate-and-test algorithm: it generates a candidate were designed to hybridize specifically to 24 502 transcripts from Arabidopsis thaliana (4). Furthermore, four DNA sequence randomly and tests whether the sequence sequences of length 26 and four DNA sequences of length satisfies the constraints. The novelty of our algorithm 48 were used to construct a periodic two-dimensional crystal- is that the filtering method uses a greedy search to line lattice (7). These sequences were carefully designed for calculate DGmin. This effectively excludes inappropri- the intended hybridization. Thus, an algorithm/program that ate sequences before DGmin is calculated, thereby generates multiple sequences, which do not hybridize non- reducing computation time drastically when specifically with each other, is useful for various applications compared with an algorithm without the filtering. of nucleic acid. Experimental results in silico showed the superiority We applied a thermodynamic approach to the nucleic acid of the greedy search over the traditional approach design for DNA computing. In particular, we focused on de- signing a pool P containing n sequences of length l for which based on the hamming distance. In addition, experi- À (i) the duplex melting temperature (TM) is in the range TM to mental results in vitro demonstrated that the experi- þ TM for any pairwise duplex of a sequence in P and its com- mental free energy (DGexp) of 126 sequences plement and (ii) the minimum free energy (DGmin) is greater correlated well with DGmin ( jRj = 0.90) than with the à than a threshold (DGmin) in any pairwise duplex of sequences hamming distance ( jRj = 0.80). These results validate in P and any concatenation of two sequences in P plus their the rationality of a thermodynamic approach. We complements except for the pairwise duplex of a sequence in P implemented our algorithm in a graphic user and its complement. Traditional approaches to the sequence interface-based program written in Java. design in DNA computing have approximated the stability between two sequences using the hamming distance (i.e. the number of base pairs) rather than DGmin. However, INTRODUCTION since the hamming distance is only an approximation of the Nucleic acids are now being utilized in computations (1–3), stability, DGmin is preferable for predicting the stability. In various engineering applications such as microarrays (4–6), practice, an RNA secondary structure can be adequately and nano-fabrications (7–9). In these fields of research, called predicted using DGmin rather than the number of base pairs ‘DNA computing’, nucleic acid design or sequence design is a (16). However, the algorithm for calculating the DGmin for crucial problem in engineering using nucleic acids. Nucleic double-stranded DNA requires time complexity O(l3), acid design is deciding the base sequences (e.g. ‘GCTAGCT- where l is the length of the sequence, as is the case with AGGTTTA’, ..., ‘ATCGTACGCTATGTGCA’ in DNA) in secondary structure prediction for single-stranded RNA.
*To whom correspondence should be addressed. Tel: +81 11 716 2111, ext 6498; Fax: +81 11 706 7834; Email: fumiaki@dna-comp.org
ª The Author 2005. Published by Oxford University Press. All rights reserved.
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected] 904 Nucleic Acids Research, 2005, Vol. 33, No. 3
Since all the combinations of three sequences must be evalu- paper, we formulated li such that li = lj = l (0 < i, ated in this design, dozens of sequences cannot be designed j < n À 1), although we can extend our algorithm easily within a reasonable computation time. Another approach that for li „ lj (0 < i „ j < n À 1). We define P as the pool of reduces the computation time is thus needed. Andronescu et al. n sequences, with length l, to be designed. Furthermore, let (13) overcame this drawback in the secondary structure design P = fU0, U1, ..., UnÀ1g and Q = fV0, V1, ..., VnÀ1g such that À þ by hierarchically decomposing the secondary structure into Vi (0 < i < n À 1) is the complement of Ui. TM and TM are smaller substructures and integrating the partial sequences. defined as the lower and upper thresholds of TM for the duplex à By using this method, they designed sequences that a tradi- between a sequence and its complement. Moreover, DGmin is tional program cannot. defined as the threshold of DGmin given by the sequence We have developed a random generate-and-test algorithm designer. Thus, our algorithm designs a pool P containing that generates a candidate sequence randomly and tests n sequences of length l for which (i) the duplex TM is in the À þ whether the sequence satisfies the constraints. It stores the range from TM to TM for any duplex of Ui (0 < i < n À 1) and à sequence in a sequence pool if and only if it satisfies all Vi, and (ii) DGmin is greater than DGmin for any pairwise duplex the constraints. To reduce the difficulty in computation of sequences in P and any concatenation of two sequences time, we use a greedy search to calculate DGmin. The advant- in P [ Q except for the pairwise duplex of Ui and Vi. age of a greedy search is that it approximates DGmin in less Here, we describe more specifically the combination of 2 time, with time complexity O(l ), than a rigorous algorithm, sequences to be calculated using the DGmin. The sequence 3 which calculates DGmin with time complexity O(l ). The Ui (0 < i < n À 1) is denoted by a string of bases such as 0 1 lÀ1 0 0 DGmin approximated using the greedy search (denoted by ui ui ÁÁÁui (5 to 3 direction), and similarly, Vi (0 < i < 0 1 lÀ1 0 0 DGgre) correlated well with DGmin. The correlation coefficients n À 1) is denoted as vi vi ÁÁÁvi (5 to 3 direction). For 0 0 0 were 0.95, 0.85 and 0.76 at 20mer, 40mer and 60mer lengths, example, if Ui = 5 -AAATTTCCCGGG-3 , then Vi = 5 -CCC- 0 respectively. Furthermore, since DGgre is the upper bound GGGAAATTT-3 . Furthermore, let
Let n be the number of sequences to be designed and li Our algorithm is a random generate-and-test algorithm that (0 < i < n À 1) be the length of each sequence. In this generates a sequence randomly and then stores it in the pool if Nucleic Acids Research, 2005, Vol. 33, No. 3 905
and only if the sequence satisfies all the constraints. The is calculated using a greedy search. Both DGmin and DGgre are main feature of our algorithm is that it uses a greedy search calculated based on the nearest-neighbor model, which calcu- for calculating DGmin to filter out inappropriate sequences. lates the total free energy as the summation of the contribu- The advantage of a greedy search is that it approximates tions of various elementary structures (18,19). The elementary DGmin in less time when compared with a well-known structures considered in this paper are stacking base pairs, dynamic programming algorithm (17). Therefore, using bulge loops, internal loops, dangling ends and free ends. the greedy search before the DGmin calculation to exclude The contributions of the stacking base pairs, defined as inappropriate sequences reduces the computation time. Here- si Á tj‚siþ1 Á tjþ1 , to the free energy are calculated using after, the approximated DGmin using a greedy search is denoted 12 parameters reported previously (19). The free energy con- by DGgre. tributions of the loop regions are sequence dependent The algorithm uses three filters: (15,16,20). The free energies of single bulge loops, defined as s Á t ‚s Á t or s Á t ‚s Á t , are calculated (i) T filter: checks whether the T of candidate sequence i j iþ2 jþ1 i j iþ1 jþ2 M M using 64 parameters covering all the possible combinations U ðÞ0 < c < n À 1 and V is in the range from TÀ to Tþ . c c M M of bulged base and flanking base pairs (19). The free energies If not, U is rejected. c of the other loops, bulge loops longer than one and internal (ii) DG filter: checks whether DG is greater than the gre gre loops, are calculated using conventional parameters and equa- threshold, DGà , for all the combinations above in P, pro- gre tions (20,21). Bulge loops longer than one are defined as vided that the candidate sequence U ðÞ0 < c < n À 1 or c f s Á t ‚s Á t ^ ðÞgl>3 or s Á t ‚s Á t ^ ðÞl > 3 , V is included in that combination. If the DG of any i j iþl jþ1 i j iþ1 jþl c gre and the internal loops are defined as s Á t ‚s Á t ^ combination is less than or equal to DGà , U is rejected. i j iþl jþm gre c ðÞl‚m>2 . The free energies of dangling ends, defined as (iii) DG filter: checks whether DG is greater than the min min ðÞs ‚s Á t , ðÞt ‚s Á t , ðÞs Á t ‚s or ðs Á t ‚ threshold, DGà , for all the combinations above in P, 0 1 0 0 0 1 NÀ2 MÀ1 NÀ1 NÀ1 MÀ2 min t Þ, are calculated using 32 parameters covering all the provided that the candidate sequence U ð0 < c < MÀ1 c possible combinations (22). The free ends are defined as n À 1Þ or Vc is included in that combination. If the the sequences s0 ÁÁÁsi and t0 ÁÁÁtj closing with si Á tj such DGmin of any combination is less than or equal to à that both si0ðÞpseudoknot-free constraint in the RNA secondary structure prediction. At this point, we do Input: n, l, TÀ , Tþ , DGà , DGà M M gre min not consider intra-molecular base pairs or the interactions Output: pool P consisting of n sequences with length l. between loop regions. Procedure: (i) Initialize pool P as an empty set. (ii) Iterate the following procedure until P has n sequences. TM filter (a) Generate candidate sequence Uc ðÞ0 < c < n À 1 This filter checks whether a candidate sequence paired to its with length l randomly, then add U to P. À þ c complement has a TM in the range TM to TM. (b) Evaluate U with T filter. If U is rejected, exclude U c M c c from P and return to ii(a). DH TM ¼ þ DS ‚ 1 (c) Evaluate Uc with DGgre filter. If Uc is rejected, exclude R ln ðCT=aÞ Uc from P and return to ii(a). (d) Evaluate Uc with DGmin filter. If Uc passes, leave Uc in where R is the gas constant, CT is the concentration, DH is P; else exclude Uc from P and return to ii(a). the enthalpy and DS is the entropy. Parameter a is set to 1 for self-complementary and to 4 for non-self-complementary. The order of the filters is important. Each candidate Parameters DH and DS are calculated based on the sequence should be evaluated in this order to reduce the com- nearest-neighbor model (18,19). putation time. If pool P has m sequences, the time complexities to evaluate the (m + 1)-th candidate sequence are O(l ), O(m2l2) 2 3 DGgre filter and O(m l ) at the TM, DGgre and DGmin filters, respectively. à By evaluating the candidate sequences in ascending order of This filter checks whether DGgre is greater than DGgre for all time complexity, the inappropriate ones can be excluded the combinations as mentioned in Algorithm. sooner. Using a ‘greedy search’ reduces the computation time for calculating DGmin. The greedy search works well because a structure with DGmin tends to include stable helices (i.e. con- Energy model tinuous complementary regions). Therefore, the greedy search The DGmin between S and T is calculated using a dynamic approximates DGmin by iteratively searching for the most programming algorithm (17), while the DGgre between S and T stable helix and fixing the helix. 906 Nucleic Acids Research, 2005, Vol. 33, No. 3
R The greedy search algorithm is as follows: where DGfreeEnd is the free energy of the helix with a R 1. First, calculate the free energies of all helices over the free end and DGcore is that of one without a free end. (d) Search for a region where DGR is minimum over all structure between sequence s0s1 ÁÁÁsNÀ1 and sequence freeEnd helices. This region is freshly denoted as si00 si00þ1 ÁÁÁsk00 t0t1 ÁÁÁtMÀ1. Calculate the free energies of the helices 0 0 0 0 with free ends using the following equations. Here, a (5 ! 3 ) and tj00 tj00þ1 ÁÁÁtl00ðÞ¼j00þk00Ài00 (3 ! 5 ). Further- 0 0 more, the free energies of the regions with and without a helix is denoted by sisiþ1 ÁÁÁsk (5 ! 3 ) and R R 0 0 free end are freshly denoted as DGfreeEnd and DGcore, tjtjþ1 ÁÁÁtlðÞ¼jþkÀi (3 ! 5 ). R respectively. If DG < DtðÞMÀ1‚tl‚sNÀ1‚sk holds, DGfreeEnd P¼ DGcore þDs0‚si‚t0‚tj þDtðÞMÀ1‚tl‚sNÀ1‚sk freeEnd kÀ1 fix the base pairs, si00 Á tj00 ‚si00þ1 Á tj00þ1‚ ...‚sk00 Á tl00 , and DGcore ¼ a¼i eS sa‚saþ1‚tjþaÀi‚tjþaÀiþ1 00 00 R DG is the free energy of a helix with free ends; DG update k, l and DGcore to k , l and DGcore þ DGcore, freeEnd core respectively. This means that the base pairs in the region is that without free ends. Ds0‚si‚t0‚tj represents the free energy contribution of the dangling or free end between are energetically favorable. 0 0 0 0 4. Calculate DGgre using sequence 5 -s0s1 ÁÁÁsi-3 and sequence 3 -t0t1 ÁÁÁtj-5 closing with base pair si Á tj. For ðÞ^i ¼ 0 ðÞj ¼ 0 , DGgre ¼DGcoreþDs0‚si‚t0‚tj þDtðÞþMÀ1‚tl‚sNÀ1‚sk init‚ Ds0‚si‚ t0‚tj is zero because there is no dangling or free end. eS s ‚s ‚t ‚t is the free energy of the a aþ1 jþaÀi jþaÀiþ1 where init is the energy penalty for forming double-stranded stacked base pair between sequence 50-s s -30 and a aþ1 DNA. If DG > 0, however, set DG = 0. This is because sequence 30-t t -50. gre gre jþaÀi jþaÀiþ1 the free energies are calculated relative to two non- 2. Search for a minimal value of DG over all helices. freeEnd interacting sequences, for which the free energy is defined Then, fix the base pairs in the helix where DG is mini- freeEnd as zero. Thus, two sequences must remain separate with mum. This region is freshly denoted as s s ÁÁÁs (50 ! 30) i iþ1 k zero free energy rather than form a structure with positive and t t ÁÁÁt (30 ! 50). Furthermore, the free ener- j jþ1 lðÞ¼jþkÀi free energy. gies of the regions with and without free ends are freshly denoted as DGfreeEnd and DGcore, respectively. In the above procedure, the number of iterations for a 3. Iterate the following procedure d times, where d is a para- search, d, is defined as ‘degree’. A structure with degree = 0 meter defined below. has at most one helix. Note that base pairing does not occur (a) IfðÞ^ 0 < iÀ1 ðÞ0 < jÀ1 holds, calculate the free ener- during a greedy search when more than two continuous com- gies of all helices with the loop closing with base pair plementary bases do not exist between two sequences. For si Á tj over the structure between sequence s0s1 ÁÁÁsiÀ1 degree = 1, there are at most three helices. Eventually, for and the sequence t0t1 ÁÁÁtjÀ1. Thus, the free energy degree = d, there are at mostðÞ 1 þ 2 Á d helices. If the length 0 0 of helix, denoted as si0 si0þ1 ÁÁÁsk0 (5 ! 3 ) and of sequences S and T are l and 2 Á l, respectively, the number of 0 0 tj0 tj0þ1 ÁÁÁtl0ðÞ¼j0þk0Ài0 (3 ! 5 ), is calculated as follows: iterations is at most ðÞl À 1 =3. Thus, the time complexity L L 2 3 DG ¼ DG þ Ds‚s 0 ‚t ‚t 0 of greedy search O degree Á l is Ol at worst. However, freeEnd Pcore 0 i 0 j L k0À1 because the degree increases slowly with the sequence length, DG ¼ 0 eS s ‚s ‚t 0 0 ‚t 0 0 þ core a¼i a aþ1 j þaÀi j þaÀi þ1 the time complexity of a greedy search can be in practice eL sk0 ‚si‚tl0 ‚tj ‚ 2 L regarded as Ol . To confirm this, we calculated the degree where DGfreeEnd is the free energy of the helix with a free L for 10 000 random pairs of sequences from 10mer to 100mer in end, DG is that of one without a free end and core steps of 10mer. As shown in Table 1, degree was at most 9 at eL s 0 ‚s ‚t 0 ‚t is the free energy contribution of a k i l j 80mer and 90mer, while it was 33 [=(100 À 1)/3]) at 100mer in bulge or internal loop between sequence s 0 s 0 ÁÁÁs k k þ1 i the worst case. Furthermore, degree was rarely >6(< 1%), and sequence t 0 t 0 ÁÁÁt closing with base pairs l l þ1 j and, for >98% of the 10 000 random pairs, degree was in the sk0 Á tl0 and si Á tj. (b) Search for a region where DGL is minimum over all range 0–4 at any length. Therefore, degree was nearly constant freeEnd regardless of the sequence length on average, indicating that helices. This region is freshly denoted as si0 si0þ1 ÁÁÁsk0 2 0 0 0 0 the time complexity of a greedy search is Ol in practice. (5 ! 3 ) and tj0 tj0þ1 ÁÁÁtl0ðÞ¼j0þk0Ài0 (3 ! 5 ). Further- more, the free energies of the regions with and without a L L free end are freshly denoted as DGfreeEnd and DGcore, DGmin filter L respectively. If DGfreeEnd < Ds0‚si‚t0‚tj holds, fix à This filter checks whether the DGmin is greater than DGmin for the base pairs, si0 Á tj0 ‚si0þ1 Á tj0þ1‚...‚sk0 Á tl0 , and update i, j and DG to i0, j0 and DG þ DGL , all the combinations as mentioned in Algorithm. core core core DG between S and T can be decomposed into two terms: respectively. This means that the base pairs in the min region are energetically favorable. DGmin ¼ min fDs0‚si‚t0‚tj þ Vsi‚tj g‚ 2 (c) If ðÞ^k þ 1 < NÀ1 ðÞl þ 1 < MÀ1 holds,calculatethe 0< i < NÀ1 0< j < MÀ1 free energies of all helices with a loop closing with base pair sk Á tl over the structure between sequence skþ1 where Vsi‚tj represents the minimum value of the free en- 0 0 skþ2 ÁÁÁsNÀ1 andsequencetlþ1tlþ2 ÁÁÁtMÀ1.Thus,thefree ergy between sequence sisiþ1 ÁÁÁsNÀ1 (5 ! 3 direction) and 00 00 00 0 0 0 0 energy of helix, denoted as si si þ1 ÁÁÁsk (5 ! 3 )and sequence tjtjþ1 ÁÁÁtMÀ1 (3 ! 5 direction) closing with si Á tj. 0 0 tj00 tj00þ1 ÁÁÁtl00ðÞ¼j00þk00Ài00 (3 ! 5 ), is calculated as follows: Recall that Ds‚s ‚t ‚t represents the free energy of the R R 0 i 0 j DG DG Dt ‚t 00 ‚s ‚s 00 0 0 freeEnd ¼P 00core þ ðÞMÀ1 l NÀ1 k dangling or free ends between sequence s0s1 ÁÁÁsi (5 ! 3 R k À1 0 0 DG core ¼ a¼i00 eS sa‚saþ1‚tj00þaÀi00 ‚tj00þaÀi00þ1 þ direction) and sequence t0t1 ÁÁÁtj (3 ! 5 direction) closing 00 00 eL sk‚si ‚tl‚tj ‚ with base pair si Á tj. Nucleic Acids Research, 2005, Vol. 33, No. 3 907
Table 1. Distribution of degree for greedy search for 10 000 random pairs of sequences from 10mer to 100mer in steps of 10mer
Degree Sequence length (mer) 10 20 30 40 50 60 70 80 90 100
0 9426 7941 6679 5813 5032 4544 3973 3575 3196 2931 1 560 1838 2783 3238 3638 3723 3953 3856 3861 3863 2 14 204 471 767 1040 1301 1467 1752 1959 2067 3 0 17 57 161 244 336 458 596 698 770 4 0 0 10 16 38 78 106 165 206 242 5 0 0 0 4 7 13 28 43 48 89 6 0 0 0 1 1 5 15 12 24 30 7 0000000047 8 0000000031 9 0000000110