Paradigms for Computational Nucleic Acid Design

Robert M. Dirks∗, Milo Lin¶, Erik Winfree†‡ and Niles A. Pierce§‡ ∗Chemistry, ¶Undergraduate, †Computer Science and Computation & Neural Systems, §Applied & Computational Mathematics, ‡Bioengineering, California Institute of Technology, Pasadena, CA 91125

Nucleic Acids Research, in press

ABSTRACT

The design of DNA and RNA sequences is critical for many endeavors, from DNA nanotechnology, to PCR-based applications, to DNA hybridization arrays. Results in the literature rely on a wide variety of design criteria adapted to the particular requirements of each application. Using an extensively-studied thermodynamic model, we perform a detailed study of several criteria for designing sequences intended to adopt a target secondary structure. We conclude that superior design methods should explicitly implement both a positive design paradigm (optimize affinity for the target structure) and a negative design paradigm (optimize specificity for the target structure). The commonly used approaches of sequence symmetry minimization and minimum free energy satisfaction primarily implement negative design and can be strengthened by introducing a positive design component. Surprisingly, our findings hold for a wide range of secondary structures and are robust to modest perturbation of the thermodynamic parameters used for evaluating sequence quality, suggesting the feasibility and ongoing utility of a unified approach to nucleic acid design as parameter sets are further refined. Finally, we observe that designing for thermodynamic stability does not determine folding kinetics, emphasizing the opportunity for extending design criteria to target kinetic features of the energy landscape.

INTRODUCTION

Understanding how to design molecular structures is an essential step in allowing technology to interface with biology and in developing systems with increasing functional density. Nucleic acids hold great promise as a design medium for the construction of nanoscale devices with novel mechanical or chemical function (1, 2). Efforts are currently underway in many laboratories to use DNA and RNA molecules for applications in patterning (3), assembly (4–6), transport, switching (7–9), circuitry (10), DNA computing (11), and DNA chips (12, 13). Computational sequence selection algorithms (1, 14–21) are likely to play an increasing role in exploring this new design space.

A fundamental design problem consists of selecting the sequence of a nucleic acid strand that will adopt a target secondary structure. As depicted in Figure 1a, this is the inverse of the more famous folding problem of determining the structure (and folding mechanism) for a given sequence. To attempt the design of novel nucleic acid structures, we require both an approximate empirical physical model and a search algorithm for selecting promising sequences based on this model. Experimental feedback on the quality of the design and the performance of the design algorithm can then be obtained by folding the molecule in vitro. Alternatively, if this feedback loop can be closed computationally by folding the molecule in silico, the quality of sequence designs could be rapidly assessed and improved before attempting laboratory validation.

In designing nucleic acid sequences, we consider the two principal paradigms illustrated in Figure 1b. Positive design methods attempt to select for a desired outcome by optimizing sequence affinity for the target structure. Negative design methods attempt to select against unwanted outcomes by optimizing sequence specificity for the target structure. A successful design must exhibit both high affinity and high specificity (14), so useful design algorithms must satisfy the objectives of both paradigms, even if they explicitly implement only one.

For some applications, it may be desirable to supplement these thermodynamic design considerations with additional kinetic requirements. For example, in designing molecular machines (8), selecting sequences that fold or assemble quickly may be crucial, since naturally occurring RNA sequences have been observed to have persistent metastable states (22) and theoretical models suggest that random sequences have highly frustrated energy landscapes with folding times that grow exponentially with sequence length (23). Alternatively, it may be important to design interactions with intentionally frustrated folding kinetics in order to control fuel delivery during the work cycle (24).

The present study uses efficient partition function algorithms and kinetics simulations to examine the thermodynamic and kinetic properties of sequences designed using seven methods that capture aspects of the positive and negative design paradigms. Although several of these design criteria have been widely used, we are not aware of any previous attempt to assess their relative performance. When the methods are evaluated based on thermodynamic considerations, we consistently observe that sequence selection methods that implement both positive and negative design paradigms outperform methods that implement either paradigm alone. This trend appears to be robust to changes in both the target secondary structure and the parameters in the physical model, and to the choice of either RNA or DNA as the design material. The trend does not hold when the design criteria are judged based on kinetic considerations, as favorable thermodynamic properties do not ensure fast folding.

Figure 1. a) Feedback loop for evaluating nucleic acid sequence designs and methodologies. b) Positive and negative design paradigms. Two sequences are evaluated using an empirical potential on both the desired target structure and an undesired structure. Using a positive design paradigm, sequence A would be selected since it exhibits a stronger affinity than sequence B for the target structure (i.e. lower ∆G). Using a negative design paradigm, sequence B would be selected since it exhibits specificity for the target structure while sequence A exhibits specificity for the undesired structure. To provide a common basis for comparison, ∆G = 0 for a strand with no base pairs. c) Canonical loops of nucleic acid secondary structure: hairpin loops, stacked base pairs, a bulge loop, an interior loop, and a multiloop. These loop structures are all nested (i.e. there are no crossing arcs in the corresponding polymer graph with the backbone drawn as a straight line). d) A sample pseudoknot with base pairs a·f and c·h (with a < c) that fail to satisfy the nesting property a < c < h < f, yielding crossing arcs in the corresponding polymer graph.

Physical Model

The secondary structure of a nucleic acid strand is simply a list of base pairs between Watson-Crick complements (A·U, C·G for RNA and A·T, C·G for DNA) or wobble pairs (G·U or G·T); it may be described as a graph with connections between paired bases on a polymer backbone, as depicted in Figure 1c. A coarse-grained energy landscape may be defined over the finite number of all possible secondary structures, where the properties of each secondary structure represent an ensemble average over the three-dimensional atomic structures consistent with that base-pairing graph. Decades of effort have been invested in the formulation and parameterization of an empirical potential for the free energy of a nucleic acid strand based on a loop decomposition of the base-pairing graph (25–27). Despite both conceptual and practical limitations, this model has great utility for studying the properties of natural and engineered RNA and DNA structures (10, 27), serving as the basis for efficient dynamic programming algorithms that calculate the minimum energy structure (15, 28–32) and partition function (21, 33) for a given nucleic acid strand over a large class of secondary structures including many pseudoknots (see Figure 1d).†

The folding kinetics of a sequence can be addressed by simulating the trajectory through secondary structure space as a continuous-time Markov process (36, 37). Changes in secondary structure are described in terms of elementary steps corresponding to the breaking or formation of a single base pair. For each elementary step, the ratio of the forward and backward rates is defined to be consistent with the equilibrium probabilities of the two end states (37, 38). However, ad hoc arguments are required to set the magnitude of the rates. There is some evidence that qualitative properties of kinetic simulations are insensitive to the specific rate model (37).

†Many of these methods are now implemented for use via online servers (34, 35).

Thermodynamic and Kinetic Evaluation Metrics

The partition function over secondary structure space provides an ideal conceptual framework for evaluating the affinity and specificity of a sequence for the target structure. If ∆G(s) is the free energy of a sequence in secondary structure s, then the probability of sampling s at thermodynamic equilibrium is given by

$$p(s) = \frac{1}{Q}\, e^{-\Delta G(s)/RT}, \tag{1}$$

where the partition function

$$Q = \sum_{s \in \Omega} e^{-\Delta G(s)/RT}, \tag{2}$$

is a weighted sum over the set of all secondary structures Ω, R is the universal gas constant and T is the temperature. If the probability p(s∗) of folding to the target graph s∗ is close to unity, then within the context of the approximate physical model, the sequence achieves both high affinity and high specificity for the target structure.

The probability p(s∗) represents a very stringent de-

where the quantity in brackets is just the matrix of pair probabilities P with entries P_{i,j} equal to the probability of forming base pair i·j. The extra column has entries P_{i,N+1} equal to the probability that base i is unpaired. Again, each row sum is one. Hence the average number of incorrect nucleotides‡ may be expressed

$$n(s^*) = N - \sum_{\substack{1 \le i \le N \\ 1 \le j \le N+1}} P_{i,j}\, S^*_{i,j}. \tag{5}$$

This metric has the advantage that sequences that adopt secondary structures similar to s∗ (e.g. due to breathing) are now identified as promising candidates. However, even if n(s∗)<
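As an illustration of Equations (1), (2) and (5), the following Python sketch evaluates Q, p(s) and n(s∗) by brute force over a small, explicitly enumerated ensemble. The three structures, their free energies, the function names and the assumed temperature of 37 °C are all hypothetical choices made for illustration; practical calculations rely on dynamic programming over the full ensemble Ω rather than on enumeration.

```python
import math

R = 0.0019872   # universal gas constant in kcal/(mol K)
T = 310.15      # assumed temperature in K (37 C)

def boltzmann(energies, RT=R * T):
    """Return (Q, probabilities) for a dict mapping structure -> free energy."""
    weights = {s: math.exp(-dG / RT) for s, dG in energies.items()}
    Q = sum(weights.values())                          # Equation (2)
    return Q, {s: w / Q for s, w in weights.items()}   # Equation (1)

def structure_matrix(dotparen):
    """N x (N+1) 0/1 matrix for a nested dot-parenthesis structure.

    Entry [i][j] is 1 if base i pairs with base j; the extra column
    [i][N] is 1 if base i is unpaired, so every row sums to one.
    """
    n = len(dotparen)
    S = [[0] * (n + 1) for _ in range(n)]
    stack = []
    for i, c in enumerate(dotparen):
        if c == "(":
            stack.append(i)
        elif c == ")":
            j = stack.pop()
            S[i][j] = S[j][i] = 1
        else:
            S[i][n] = 1
    return S

def average_incorrect(probabilities, target):
    """n(s*) = N - sum_ij P_ij S*_ij, as in Equation (5)."""
    n = len(target)
    # Pair probability matrix: ensemble average of the structure matrices.
    P = [[0.0] * (n + 1) for _ in range(n)]
    for s, p in probabilities.items():
        S = structure_matrix(s)
        for i in range(n):
            for j in range(n + 1):
                P[i][j] += p * S[i][j]
    S_star = structure_matrix(target)
    return n - sum(P[i][j] * S_star[i][j]
                   for i in range(n) for j in range(n + 1))

# Hypothetical three-structure ensemble (free energies in kcal/mol).
ensemble = {"(((...)))": -3.0, "((.....))": -1.5, ".........": 0.0}
target = "(((...)))"
Q, prob = boltzmann(ensemble)
print(f"Q = {Q:.2f}, p(s*) = {prob[target]:.3f}, "
      f"n(s*) = {average_incorrect(prob, target):.3f}")
```

In this toy setting a structure that differs from the target by a single base pair still contributes mostly correct nucleotides to n(s∗), while contributing nothing to p(s∗), which is the distinction between the two metrics drawn in the text.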

2. Energy minimization. Sequences are selected that attain a low energy on the target structure using the standard energy model. This approach implements explicit positive design.

3. Minimum free energy (MFE) satisfaction. Sequences are selected so as to ensure that the target structure is the lowest energy structure (15, 19). Note that a sequence with the correct minimum energy structure may nonetheless have a low probability of adopting the target fold. This approach implements explicit negative design.

4. Sequence symmetry minimization (SSM). Sequences are selected so as to prohibit repeated subsequences of a specified word length (1). For subsequences that are not base-paired to consecutive bases in the target graph (e.g. single stranded or branched regions), the complementary words are also prohibited from appearing in the design. This is a heuristic approach to negative design, attempting to ensure specificity for the target structure by guaranteeing mismatches within any subsequence of the word length that hybridizes incorrectly.

5. Energy minimization and SSM. Sequences are selected that attain a low energy on the target graph subject to the constraint that SSM is satisfied (14). This approach explicitly addresses both paradigms, combining rigorous positive design and heuristic negative design.

6. Probability. Sequences are selected to maximize the probability (15, 17, 21) of sampling the target structure p(s∗). Positive and negative design are simultaneously addressed in a single rigorous approach.

7. Average incorrect nucleotides. Sequences are selected to minimize the average number of incorrect nucleotides n(s∗). Positive and negative design are simultaneously addressed in a single rigorous approach.

In each case, a design method is obtained by employing a heuristic search procedure to optimize one of the design criteria. It is these design criteria that are the focus of the present work. Examining a set of sequences obtained by independent search processes provides a characterization of typical performance. For the random, MFE satisfaction and SSM methods, any sequence that satisfies the criterion is a global minimum. For the probability and average incorrect nucleotide methods, the global optimum is not necessarily attained, but there is an absolute standard of success (i.e. p(s∗) ≈ 1 or n(s∗) ≈ 0) that is frequently achieved. For methods involving energy minimization, there is no mathematical guarantee that the selected sequences are near the global minimum.§ Implementation details for all design methods are provided in Materials and Methods.

RESULTS

We now compare the performance of these seven design methods. All designed sequences are at local minima in the sense that no mutation of one base pair or of one unpaired base results in a better sequence based on the given design criterion.

RNA Multiloop Design. Each method was used to perform 100 independent sequence designs for a 4-stem RNA multiloop comprising 71 nucleotides. Histograms of p(s∗) and n(s∗) are shown in Figures 2a and 2b, with median values recorded in Table 1. For random sequences, approximately 95% of the designs have p(s∗) < 0.1 and the median value of n(s∗) is 7.2. Energy minimization performs worse than random while MFE satisfaction and SSM perform somewhat better. There is a dramatic improvement in sequence quality using a combination of energy minimization and SSM. Directly optimizing either p(s∗) or n(s∗) leads to sequences with excellent thermodynamic properties.

To provide an alternative view of average design performance, Figure 2c depicts the base-pairing probabilities P_{i,j} for the median sequence based on p(s∗). Energy minimization completely fails to capture the connectivity of the target structure. The other methods demonstrate the correct basic structure with varying propensities for extending or adding helices.

Model Robustness. It is inevitable that new parameter sets will continue to be developed for the loop-based potential functions used for these studies (26, 27). For our design methods to be useful, the quality of a sequence must be robust to perturbations in the approximate physical model; sequences that behave well using many different parameter sets are more likely to perform well in the laboratory. To examine this issue, we consider 1000 randomized potential functions for RNA where every parameter¶ is independently adjusted by an amount uniformly distributed on ±10%, ±20% or ±50%.

For each design method, the top-ranked sequence based on p(s∗) is reexamined using these modified potentials. The new probabilities are shown in Figure 3, with the original probabilities depicted as dashed lines. For perturbations distributed uniformly on ±10%, these probability distributions are peaked near the original probabilities, with the sharpest peaks occurring for the best original sequences. The studies

§For energy minimization, the global minimum energy is achieved by at least one sequence for all structures we consider. For energy minimization plus SSM, we verified this property only for small structures (e.g. the one in Figure 1a), where the global minimum was determined using an algorithm (39) (see Materials and Methods).

¶There are 10,692 and 12,198 nonzero parameters for the RNA (27) and DNA (26) models, respectively.
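The perturbation procedure described under Model Robustness can be sketched in a few lines of Python. This is a toy illustration only: the three-parameter energy model, the loop counts for each structure and all function names are hypothetical stand-ins, whereas the study itself perturbs the full RNA parameter set and recomputes p(s∗) over the complete ensemble.

```python
import math
import random

RT = 0.0019872 * 310.15  # kcal/mol at an assumed 37 C

def target_probability(params, ensemble, target):
    """p(s*) when each structure's free energy is a weighted sum of parameters."""
    weights = {}
    for name, counts in ensemble.items():
        dG = sum(params[key] * n for key, n in counts.items())
        weights[name] = math.exp(-dG / RT)
    return weights[target] / sum(weights.values())

def perturb(params, fraction, rng):
    """Scale every parameter independently by a factor uniform on 1 +/- fraction."""
    return {key: value * (1.0 + rng.uniform(-fraction, fraction))
            for key, value in params.items()}

def perturbation_study(params, ensemble, target, fraction, trials=1000, seed=0):
    """Recompute p(s*) under `trials` independently randomized parameter sets."""
    rng = random.Random(seed)
    return [target_probability(perturb(params, fraction, rng), ensemble, target)
            for _ in range(trials)]

# Hypothetical loop parameters (kcal/mol) and loop counts per structure.
params = {"stack": -2.0, "hairpin": 4.0, "bulge": 3.0}
ensemble = {
    "target": {"stack": 4, "hairpin": 1},
    "misfold": {"stack": 3, "hairpin": 1, "bulge": 1},
    "open": {},
}
original = target_probability(params, ensemble, "target")
for fraction in (0.1, 0.2, 0.5):
    values = sorted(perturbation_study(params, ensemble, "target", fraction))
    print(f"+/-{int(fraction * 100)}%: original {original:.3f}, "
          f"median perturbed {values[len(values) // 2]:.3f}")
```

A histogram of the returned values for each perturbation size plays the role of one curve in the perturbation study, with the unperturbed value serving as the dashed reference line.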

[Figure 2: a), b) "RNA Design Algorithm Performance", histograms of number of sequences versus p(s∗) and n(s∗); c) "Pair Probabilities", base versus base, one panel per method; d) "RNA Probability and Free Energy Comparison", p(s∗) versus ∆G(s∗); e) "RNA Thermodynamic and Kinetic Comparison", p(s∗) versus t(s∗). Legend: Random, Energy Minimization, MFE Satisfaction, SSM, Energy Minimization & SSM, Probability, Average Incorrect Nucleotides.]

Figure 2. RNA Multiloop: a) Histograms for 100 sequence designs based on probability of sampling the target graph, p(s∗). The color legend applies to all plots. b) Histograms for the same 100 sequence designs based on average number of incorrect nucleotides, n(s∗). c) Base-pairing probabilities P_{i,j} for the median sequence based on p(s∗). Square sizes correspond to P_{i,j} ≥ {0.5, 0.05, 0.005}, respectively. The target structure is identical to that obtained by optimizing probability (black) or the average number of incorrect nucleotides (not shown). d) p(s∗) versus free energy, ∆G(s∗). Each dot corresponds to one of 100 sequences designed using each method. Each bold square corresponds to the median over the 100 sequences designed using each method. e) p(s∗) versus median folding time, t(s∗), over 1000 kinetic trajectories starting from random coil initial conditions. Dots and squares are interpreted as in part d).
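The median folding time t(s∗) in panel e) is obtained from stochastic trajectories through secondary structure space. The Python sketch below shows one way such an estimate can be set up on a toy landscape. The four states, their free energies and neighbor relations, the function names, and the Metropolis form of the rates are assumptions made for illustration; the source requires only that the ratio of forward and backward rates be consistent with the equilibrium probabilities of the two end states, and the rate model actually used may differ.

```python
import math
import random
import statistics

def metropolis_rate(dG_from, dG_to, RT, k0=1.0):
    """Rate for one elementary step; downhill moves occur at rate k0.

    The ratio of forward and backward rates equals exp(-(dG_to - dG_from)/RT),
    so the equilibrium probabilities of the two end states are respected.
    """
    delta = dG_to - dG_from
    return k0 if delta <= 0 else k0 * math.exp(-delta / RT)

def first_passage_time(energies, neighbors, start, target, RT, rng, t_max=1e4):
    """Simulate one trajectory and return the time at which the target is
    first reached, or t_max if the trajectory is terminated first."""
    state, t = start, 0.0
    while state != target:
        moves = neighbors[state]
        rates = [metropolis_rate(energies[state], energies[m], RT) for m in moves]
        t += rng.expovariate(sum(rates))      # exponential waiting time
        if t >= t_max:
            return t_max
        state = rng.choices(moves, weights=rates)[0]
    return t

def median_folding_time(energies, neighbors, start, target, RT,
                        runs=1000, seed=0):
    """Median first passage time over independent trajectories."""
    rng = random.Random(seed)
    return statistics.median(
        first_passage_time(energies, neighbors, start, target, RT, rng)
        for _ in range(runs))

# Hypothetical four-state landscape (free energies in units where RT = 1).
energies = {"coil": 0.0, "partial": -1.0, "trap": -2.0, "target": -4.0}
neighbors = {"coil": ["partial", "trap"], "partial": ["coil", "target"],
             "trap": ["coil"], "target": ["partial"]}
print(median_folding_time(energies, neighbors, "coil", "target", RT=1.0))
```

Deepening the "trap" state in this toy model lengthens the median time without changing which state is the global minimum, a small-scale analogue of the decoupling between equilibrium and kinetic properties.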

Table 1. Sequence statistics for RNA multiloop designs of Figure 2.

Design Method                  p(s∗)  n(s∗)  CG content  Entropy
Random                         0.00   7.22   0.50        1.00
Energy Minimization            0.00   32.46  0.91        0.35
MFE Satisfaction               0.16   4.14   0.50        0.99
SSM                            0.08   4.89   0.50        0.99
Energy Minimization & SSM      0.87   0.28   0.66        0.68
Probability                    0.97   0.06   0.69        0.40
Average Incorrect Nucleotides  0.97   0.06   0.68        0.39

Top-ranked sequence based on p(s∗), listed beneath the target structure:

Target structure               ((((((..((((((.....))))))..((((((.....))))))..((((((.....))))))..))))))
Random                         AUGGGUUAUCACUGCGGCUCAGUGAAACAAGCGUCGUUCGCUUGGGACGUCUAUAUAAGACGUUUACCCAU
Energy Minimization            GGGGGCACGGGGGCCUCUGGCCCCCACGGCCCCCGCCGGGGGCCACGGGCCCCUCUGGGGCCCACGCCCCC
MFE Satisfaction               GGCGUCUAAAGAACGAUAAGUUCUUAUGAUUCAAAGACUGAAUCUGGAUCGAGGACGUCGAUCGUGACGCC
SSM                            GACGCACCCCUGAGACCGCCUCAGGUUGUAAGCGAUGGGCUUACCAGAUUCCACAUAGGAAUCAAUGCGUC
Energy Minimization & SSM      GGAGCCAAGACCUCGUUCAGAGGUCACGCCCUGGAAAACAGGGCAACCCCGCUUAGUGCGGGGACGGCUCC
Probability                    GCCGGCAAGCCCUCGACUAGAGGGCAAGCGGUCGACUAGACCGCAAGCCGUCGAAUAGACGGCAAGCCGGC
Average Incorrect Nucleotides  GCCGGCACGGCCUCGACUAGAGGCCAAGCCGUCGAAUAGACGGCAAGCCCUCGACUAGAGGGCAAGCCGGC
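The CG content and Entropy columns of Table 1 summarize sequence composition. A minimal Python sketch of both statistics follows, assuming that CG content is the fraction of C and G bases and that entropy is the base-4 Shannon entropy of the bases observed at each position across a set of designs, averaged over positions. The function names and the short example sequences are hypothetical and are not taken from the designs reported here.

```python
import math
from collections import Counter

def cg_content(seq):
    """Fraction of bases in seq that are C or G."""
    return sum(1 for base in seq if base in "CG") / len(seq)

def position_entropy(column):
    """Base-4 Shannon entropy of the bases seen at one position.

    Ranges from 0 (all sequences share the same base) to 1 (the four
    bases occur equally often).
    """
    n = len(column)
    return -sum((c / n) * math.log(c / n, 4) for c in Counter(column).values())

def average_entropy(seqs):
    """Mean per-position entropy over a set of equal-length sequences."""
    length = len(seqs[0])
    return sum(position_entropy([s[i] for s in seqs])
               for i in range(length)) / length

# Made-up designs: the first set is diverse, the second nearly uniform.
diverse = ["AUGC", "CGAU", "GCUA", "UACG"]
uniform = ["GCGC", "GCGC", "GCGC", "GCCC"]
for label, seqs in (("diverse", diverse), ("uniform", uniform)):
    mean_cg = sum(cg_content(s) for s in seqs) / len(seqs)
    print(f"{label}: CG content {mean_cg:.2f}, entropy {average_entropy(seqs):.2f}")
```

A low average entropy indicates that independent designs converge on similar base placements, which is how the table distinguishes methods that share a similar CG content.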

[Figure 3: three histograms titled "RNA Model Study with 10% Perturbations", "RNA Model Study with 20% Perturbations" and "RNA Model Study with 50% Perturbations", each showing number of parameter sets versus p(s∗).]

Figure 3. RNA model perturbation study. For the multiloop designs of Figure 2, the top-ranked sequence for each method based on p(s∗) is reexamined using 1000 randomized potential functions where every parameter is independently adjusted by an amount uniformly distributed on ±10%, ±20%, or ±50%. The original probabilities are depicted as dashed lines.

with perturbations distributed uniformly on ±20% and ±50% demonstrate that the best sequences are surprisingly robust even to large perturbations.

Sequence Composition. The contrasting behavior of sequences designed by different methods is partly attributable to the variation in sequence composition as summarized by Table 1 in terms of fraction of CG nucleotides and average Shannon entropy per position.‖ As expected, the random and SSM designs have a CG content of 50% and an average sequence entropy of approximately one, meaning that each base is equally likely at each position. Similar trends are observed for MFE satisfaction, emphasizing that it is a negative design approach, in that it does not attempt to optimize affinity for the target structure by increasing the CG content. By contrast, energy minimization leads to 91% CG content with a dramatic drop in the sequence entropy at each position. The combined approach of energy minimization and SSM increases the average sequence entropy and reduces the CG content to about 65%. By comparison, designs based on direct optimization of p(s∗) or n(s∗) have similar CG contents, but much lower average sequence entropies, suggesting greater uniformity across independent sequence designs in the placement of C and G bases throughout the strand.

‖For 100 sequences designed by a given method, the information entropy at position i is defined by

$$\sigma_i = -\sum_{\eta \in \{A,C,G,U\}} f_i(\eta)\, \log_4 f_i(\eta),$$

where f_i(η) is the fraction of base η at position i, and σ_i varies between zero (all nucleotides are identical) and one (equal number of each nucleotide). The average entropy per position over a sequence of length N is then $\sum_{i=1}^{N} \sigma_i / N$.

The differing design objectives of methods that implement positive and negative design paradigms are amply illustrated by the plot of probability versus free energy in Figure 2d. Here, the methods of SSM and MFE satisfaction produce sequences with ∆G values comparable to those for the random method. Energy minimization naturally produces the lowest ∆G values, while the methods that combine positive and negative design sacrifice some level of affinity to achieve greater specificity and hence higher p(s∗) values.

Kinetics. We estimate t(s∗) as the median folding time over 1000 stochastic simulation runs as plotted against p(s∗) in Figure 2e. Each simulation was terminated after 10^4 dimensionless time units had elapsed. Sequences designed by energy minimization had very low probabilities and failed to fold during the time frame of the simulations. Random sequences also had very low probabilities but did succeed in folding. On average, the negative design approaches of MFE satisfaction and SSM yielded sequences with improved probabilities and folding times relative to random sequences. The combined approach of energy minimization and SSM yielded significantly higher probabilities with folding times that are comparable with SSM. Sequences designed by direct optimization of p(s∗) or n(s∗) yielded the highest probabilities but somewhat slower folding times. Figure 2e illustrates two distinct classes of slow folding sequences: sequences with low p(s∗) have energy landscapes in which s∗ is not a prominent local minimum, while sequences with high p(s∗) have s∗ as the global minimum, but often have highly frustrated energy landscapes, possibly due to high CG content. Each of the three methods that implement both positive and negative design paradigms produces a number of sequences that appear excellent based on both equilibrium and kinetic properties. However, in general, the depth of the global minimum in the energy landscape does not determine the kinetic accessibility of that conformation (37).

Other RNA structures. The multiloop structure considered in Figure 2 had stems of length α = 6 and single stranded multiloop regions of length β = 2. In Figure 4, the design conclusions are generalized to a related family of multiloop structures with α ∈ {4, 6, 8}, β ∈ {0, 2, 4}. Results for a larger RNA multiloop with 122 nucleotides and a small RNA pseudoknot with 30 nucleotides are shown in Figures 5 and 6. We have also examined design performance for open structures, hairpins, and three-stem multiloop structures (not shown). In all of these cases, the same trends are observed in the relative performance of the different design methods.

Figure 4. RNA Multiloop Variations: Design performance based on a) p(s∗) and b) n(s∗) with stem lengths α = (4, 6, 8) and single stranded multiloop regions β = (0, 2, 4). Surfaces show the mean values plus and minus one standard deviation for 100 independently designed sequences. The results for optimizing average incorrect nucleotides (not shown) are nearly indistinguishable from those obtained by optimizing probability.

DNA Design. For each of the non-pseudoknotted cases, analogous data is provided for DNA in Supplementary Material Figures 7-10 and Table 2. Similar trends are observed in the relative performance of the different design methods. Based on equilibrium properties, the most noticeable differences compared to the RNA designs are: a) the best methods no longer consistently produce sequences with p(s∗) > 0.90, b) structures with helices of length α = 4 are difficult to stabilize. Comparing equilibrium and kinetic properties, higher probabilities are achievable with RNA, and faster folding times are typical for DNA.

DISCUSSION

Relative Merits of Design Criteria. Based on thermodynamic considerations, our results support classifying design criteria according to the extent to which they implement positive (affinity) and negative (specificity) design paradigms. The design methods that implement both paradigms (energy minimization plus SSM, probability, average incorrect nucleotides) significantly outperform other methods. In general, the worst performance was observed for methods that implemented neither paradigm (random) or positive design alone (energy minimization), with somewhat better performance observed for negative design methods (MFE satisfaction, SSM). It is perhaps surprising that MFE satisfaction, which performs negative design using the full thermodynamic energy model, performs so similarly to SSM, which neglects the model. Methods based on SSM are widely used; our results suggest that they could be improved by incorporating a positive design component. For many structures, high probabilities (within the context of an approximate physical model) are obtained by directly optimizing p(s∗) or n(s∗).

Optimization of equilibrium properties leads to sequences with widely differing folding times. One simple design approach is to filter sequences to identify fast or slow folders as desired. Alternatively, new sequence selection algorithms could be developed that explicitly take into account the structure of the energy

[Figure 5: a), b) "RNA Design Algorithm Performance", histograms of number of sequences versus p(s∗) and n(s∗); c) "Pair Probabilities", base versus base, one panel per method.]

Figure 5. Large RNA multiloop. See captions for Figures 2abc.

[Figure 6: a), b) "RNA Design Algorithm Performance", histograms of number of sequences versus p(s∗) and n(s∗); c) "Pair Probabilities", base versus base, one panel per method.]

Figure 6. RNA Pseudoknot. See captions for Figures 2abc.

landscape so as to optimize the kinetic accessibility of the global minimum energy secondary structure. Furthermore, the observed decoupling of thermodynamic and kinetic properties suggests that there are sufficient degrees of freedom in sequence space to allow the design of more complex features of the energy landscape (e.g. metastable states (37)).

Robustness of Claims. The consistency in the relative merits of these design methods suggests a level of generality that goes beyond the structures investigated here. It appears that it is not necessary to classify target structures according to the demands that they place on positive or negative design, as methods that implement both paradigms are generally preferred. Furthermore, we observe the same relative performance rankings for RNA and DNA despite systematically different thermodynamic parameters for the two materials. Evaluations of sequence quality for either material appear robust to perturbations in the parameter sets. This suggests that the relative merits of the design criteria are not likely to change as the empirical models are improved.

The validity of our thermodynamic metrics is linked to the validity of the underlying empirical models, which continue to be refined and evaluated by experimental studies (27, 40). Further improvement of these models for both thermodynamic and kinetic predictive capability will directly benefit rational design methods. Historically, some parameters have experienced adjustments significantly larger than 10% as the model was refined (40). It seems likely that parameters that have undergone extensive study (e.g. base-pair stacking) will experience relatively small changes in the future, while other parameters that have not received the same degree of scrutiny (e.g. coaxial stacking or dangling ends) may change more dramatically. These adjustments could alter the design conclusions for some target structures.

The partition function may retain utility for design in certain cases where the energy model is known to be incorrect. For example, pseudoknot energy models do not fully consider geometric constraints, such as steric hindrance. Nevertheless, it is reasonable to believe that the unknown energy correction terms are non-negative; that is, structures violating geometric constraints are in fact less likely than predicted. In this case, for design targets that are geometrically unstrained (so that the missing energy term is small), the predicted p(s∗) will be strictly lower than if the energy correction terms had been included (since all undesired structures have non-negative correction terms). Consequently, high-ranking sequences based on existing models are likely to be successful in practice.

Algorithmic Considerations. Each design method consists of a criterion score and a heuristic for optimizing that score. Evaluating the score for a single sequence of length N is an O(N) operation for random, energy minimization, SSM, and energy minimization plus SSM methods. When used in an adaptive walk, each incremental change to the score can be evaluated in constant time. For designs based on MFE satisfaction, probability, and average incorrect nucleotides, each score evaluation is an O(N^3) operation if pseudoknots are excluded and an O(N^5) operation if a class of pseudoknots is included. The cost of these latter
The cost of these latter methods motivates further investigation of optimization techniques for these scores, including filtering the designs of less expensive methods (14) and assembling larger structures hierarchically (15, 19).

The random and SSM methods apply to single or multi-stranded structures with or without pseudoknots. The other five methods require extensions to the standard empirical potential functions to handle multiple strands or pseudoknots.∗∗ The dynamic programming algorithms that underlie three of the methods also require generalization to handle problems with multiple strands (16, 19, 41).

∗∗This complication can be avoided for the two methods involving energy minimization if only stacking energies are considered and other loop terms (which are largely independent of sequence) are neglected.

Additional Design Constraints. Each design method considered here has been simplified to reflect the essence of the approach so as to admit easy description, comparison and replication. In practice, there are many additional considerations that might be used to modify these approaches so as to satisfy various additional constraints. For example, the designer may wish to limit the CG content, to use a three-letter alphabet (11, 42), to prohibit consecutive stretches of a single base, to fix the melting temperature, or to impose various other rules of thumb that have been garnered from years of lab experience. The intended function of the design may also impose additional requirements, such as the inclusion of subsequences of biological or biochemical relevance (e.g. promoters, restriction sites, genomic targets, ribozymes and deoxyribozymes). Frequently, the intention is to design a set of strands that interact to form one of several allowed secondary structures (e.g. a DNA beacon switches from a hairpin to a helix in the presence of a target ligand). In DNA computing, it is often necessary to design a combinatorial library of strands, each of which is devoid of secondary structure (18). These problems naturally lead to multi-objective optimizations, where we expect positive and negative design paradigms to continue to play a critical role.

Comparison with Protein Design. It is informative to compare rational nucleic acid design efforts to those in the related area of rational protein design. Proteins provide a rich design space with a much greater demonstrated range of natural function than RNA and DNA. Hence, they represent a fertile medium for the design of new medical and industrial products. While fold affinity and specificity remain fundamental design objectives for protein design, it is not clear to what degree the explicit implementation of both positive and negative design paradigms remains critical. It is possible that the biochemical properties of the twenty amino acids are sufficiently different from those of the four nucleotides that there is a change in the degree to which positive and negative design methods yield collateral specificity and affinity, respectively.

Computational models for protein design currently require three-dimensional fold information. To stabilize a given target fold, rational design efforts have focused on positive design for fold affinity: identify the sequence with the lowest energy on the target fold (43–45). Explicit negative design for fold specificity is problematic since it is challenging to describe the space of unwanted three-dimensional folds. However, small ensembles of unwanted structures have been used to explicitly design for fold specificity (46, 47). Arguments based on the random energy model suggest that implicit negative design may be achieved by fixing the sequence composition prior to optimization (48–50). Recently, a novel protein fold was designed from scratch (51) by alternately optimizing the sequence on a fixed backbone and the backbone for a fixed sequence. The former step represents positive design via energy minimization. The latter step was implemented by searching nearby structure space and redefining the target to be the minimum energy structure – a local form of negative design. Conceptually, it is unclear to what extent this local structural optimization implements global negative design.††

††The related hypothetical approach of performing global structure prediction and adjusting the target to be the minimum free energy structure would correspond to explicit negative design (identical to MFE satisfaction except that specificity is achieved by adjusting the structure instead of the sequence).

There is currently no physical abstraction (akin to nucleic acid secondary structure) that facilitates the prediction of structure from protein sequence. Hence, the feedback loop in Figure 1a must be closed either by experimental structural characterization methods or by computationally solving the protein structure prediction problem (52). This feedback can be used both to improve particular sequence designs and to improve the physical model on which the design process is based. To avoid the risk of introducing artifacts into the physical model (53), significant effort has been invested in developing exact search methods to find the globally optimal sequence based on fold affinity, though approximate search methods have also proved useful in practice (45, 54).

These limitations would similarly apply to nucleic acid design based on three-dimensional atomic coordinates. However, by designing at the level of secondary structure, it is possible to explicitly address both positive and negative design paradigms and to use partition function algorithms to evaluate design quality computationally. The probability of sampling the target graph p(s∗) has a maximum value of unity. Hence, it
is no longer necessary to perform exact global sequence searches in order to be sure that the sequence accurately reflects the properties of the physical model – it is enough to check that p(s∗) is near unity or that n(s∗) is sufficiently small.

Implications for Design of Nanodevices. For many design applications, nucleic acids represent an attractive building material. Consider, for example, an attempt to design a mechanical device that performs work by moving through a series of conformations. It would be cumbersome to parameterize the protein design problem for mechanical devices in terms of atomic coordinates. It also seems unlikely that it would be possible to conditionally stabilize a sequence of non-natural folds using positive design methods that do not explicitly treat fold specificity. However, DNA devices with moving parts and complex conditional conformational changes have already been designed (using ad hoc methods) and experimentally demonstrated (8, 10, 24). We expect that nucleic acid secondary structure will provide a productive framework for formulating the design problem for functional multi-state machines in a way that simultaneously addresses positive and negative design requirements. Ultimately, the objective of rational nucleic acid design efforts is to develop a ‘molecular compiler’ that takes as input a specification for a device and produces as output a list of nucleic acid sequences that can be expected to assemble into the desired structures and function robustly.

MATERIALS AND METHODS

Design Implementation Details

The parameter sets for RNA and DNA are taken from Mfold3.1 (27) with RNA pseudoknot parameters provided by reference (21). There are currently no pseudoknot parameters for DNA. Dangle energies were treated as the d2 option in the Vienna package (15). After each sequence search is performed with any of the methods described below, we check to see whether the sequence is quenched, in the sense that no mutation of a single base pair or of a single unpaired base improves the design according to the design metric. If the sequence is not quenched, we run a further adaptive walk, checking every 1000 steps to see if the sequence is quenched and terminating the search when quenching is achieved. All RNA and DNA sequences used for these studies are provided in Supplementary Material.

Random. 100 random sequences are independently generated that satisfy the target graph base-pairing requirements.

Energy minimization. 100 independent simulated annealing runs with different random initial sequences are used to identify 100 sequences with a low free energy on the target graph according to the standard loop-based energy model. Each search uses an exponentially decreasing temperature profile over 10^6 steps, where each step corresponds to a point mutation that is accepted if exp(−∆G/RT) ≥ ρ, where ∆G is the change in energy and ρ ∈ [0, 1] is a uniformly distributed random number.

MFE Satisfaction. An adaptive walk of 1000 steps is used to identify a sequence for which the target structure is the lowest energy structure. Each step consists of a random point mutation that is accepted if the new minimum energy structure calculated using dynamic programming methods (15, 21, 30–32) does not increase the number of mismatches with the target graph (15, 19). The 100 sequences used for the study are obtained from 100 independent searches starting from different random initial sequences. In each case, the target is the minimum energy structure.

Sequence symmetry minimization (SSM). 100 sequences are independently selected that are compatible with the target graph and satisfy SSM (1) with word length 4. For the Large Multiloop structure, the word length was increased to 5 to provide a larger vocabulary.

Energy minimization and SSM. A penalty term is added to the standard energy model to bias the simulated annealing search against sequences that violate SSM. The top-ranked sequences from each of 100 independent searches are free of SSM violations for the cases presented.

Probability. An adaptive walk of 1000 steps is used to search for the sequence with the highest probability of sampling the target structure based on dynamic programming calculations of the partition function (21, 33). Each step consists of a random point mutation that is rejected if the probability decreases and accepted otherwise (15, 17, 21). The study uses the top-ranked sequence from each of 100 independent searches starting from different random initial sequences.

Average number of incorrect nucleotides. Independent adaptive walks based on n(s∗) are used to obtain 100 sequences in a manner analogous to the direct optimization of probability described above.

Global energy minimization

For methods involving energy minimization, there is no mathematical guarantee that the selected sequences are near a global minimum. For small problems, the performance of heuristic search methods may be assessed by comparison to the global minimum energy obtained using an exact exponential-time branch-and-bound algorithm developed for protein design (39). If a protein is modeled as a rigid backbone with side-chains represented by discrete rotamers, the protein design problem may be formulated as follows (43, 44): Given p disjoint

sets of rotamers R_i (one set for each position i) and a potential function E(·, ·) that returns the energy between a pair of rotamers at different positions, choose the rotamer r_i ∈ R_i at each position that minimizes the sum of the pairwise interaction energies between all positions

E_{\mathrm{total}} = \sum_{i} \sum_{j,\, j \ldots} E(r_i, r_j).

6. LaBean, T., Yan, H., Kopatsch, J., Liu, F., Winfree, E., Reif, J., and Seeman, N. C. (2000) Construction, analysis, ligation and self-assembly of DNA triple crossover complexes. J. Am. Chem. Soc., 122, 1848–1869.
7. Soukup, G. and Breaker, R. (1999) Engineering precision RNA molecular switches. Proc. Natl. Acad. Sci. USA, 96, 3584–3589.
8. Yurke, B., Turberfield, A., Mills, A.P., Jr., Simmel, F., and Neumann, J. (2000) A DNA-fuelled molecular machine made