
Natural Computing 3: 411–426, 2004. © 2004 Kluwer Academic Publishers. Printed in the Netherlands.

Physical modeling of biomolecular computers: Models, limitations, and experimental validation

JOHN A. ROSE¹,*,† and AKIRA SUYAMA²,†
¹Department of Computer Science, and UPBSB, The University of Tokyo, and Japan Science and Technology Corporation, CREST, Japan (*Author for correspondence, e-mail: [email protected]); ²Institute of Physics, The University of Tokyo, and Japan Science and Technology Corporation, CREST, Japan

Abstract. A principal challenge facing the development and scaling of biomolecular computers is the design of physically well-motivated, experimentally validated simulation tools. In particular, accurate simulations of computational behavior are needed to establish the feasibility of new architectures, and to guide process implementation by aiding strand design. Key issues accompanying simulator development include model selection, determination of the appropriate level of chemical detail, and experimental validation. In this work, each of these issues is discussed in detail, as presented at the workshop on simulation tools for biomolecular computers (SIMBMC), held at the 2003 Congress on Evolutionary Computation. The three major physical models commonly applied to model biomolecular processes, namely molecular mechanics, chemical kinetics, and statistical thermodynamics, are compared and contrasted, with a focus on the potential of each to simulate various aspects of biomolecular computers. The fundamental and practical limitations of each approach are considered, along with a discussion of appropriate chemical detail, at the biopolymer, process, and system levels. The relationship between system analysis and design is addressed, and formalized via the DNA Strand Design problem (DSD). Finally, the need for experimental validation of both underlying parameter sets and overall predictions is discussed, along with illustrative examples.

Key words: biomolecular computing, DNA-based computing, DNA strand design, kinetics, molecular mechanics, statistical thermodynamics

1. Introduction

Since the advent of biomolecular computing (Adleman, 1994), a principal difficulty facing development has been the need for physically well-motivated, experimentally validated simulation tools. In particular, accurate simulations of computational behavior are critical for evaluating feasibility, and for supporting selection of DNA encodings and reaction conditions which optimize reliability and efficiency. A key issue accompanying development is selection of a model that is theoretically adequate, employs experimentally valid parameters, and provides predictions that lend themselves to clear interpretation and experimental validation. For this reason, a number of algorithms and tools for simulation and design have been proposed in the context of DNA computing, including implementations of mass action, via statistical thermodynamic (Hartemink and Gifford, 1999; Rose and Deaton, 2001; Rose et al., 2001; Rose et al., 2002; Andronescu et al., 2002; Rose et al., 2004) and kinetic models (Nishikawa et al., 2001; Uejima and Hagiya, 2004), and implementations based on simple Watson–Crick sequence-similarity (Deaton et al., 1998; Condon et al., 2001; Garzon et al., 2004). Significantly, numerous well-established tools for modeling biopolymer physical behavior, which employ the same fundamental principles (Steger, 1994; Blake et al., 1999; Hofacker, 2003; SantaLucia and Hicks, 2004), are also available outside of the DNA computing field, resulting in the potential for considerable overlap between emerging and established models and packages. To facilitate smooth integration, and to discuss difficulties facing implementation, a Workshop on Biomolecular Simulation Tools (SIMBMC’03) was recently organized, at the 2003 Congress on Evolutionary Computation.
In this work, the physical models commonly used to analyze biopolymer systems are compared, along with a discussion of the relationship to system design (for a discussion of related issues regarding simulator design, see the companion paper (Blain et al., 2004)). Issues regarding model selection and application are discussed, including: appropriateness; limitations; level of detail; and experimental validation, at both parameter and prediction levels. Organization is as follows. Section 2 surveys the principal methods applied to model biomolecular processes, including single-molecule models, via molecular mechanics (Section 2.1), and approaches based on mass action (Section 2.2), including kinetics and statistical thermodynamics. Section 2.3 discusses the appropriate level of chemical detail for systems, at the biopolymer, process, and system levels. Section 2.4 addresses the relationship between analysis and design, via formulation of the DNA Strand Design problem. Section 3 considers issues regarding experimental validation, via a pair of simple examples: (1) a two-state model for characterizing the annealing of long oligonucleotides; and (2) a Hamming encoding strategy for low-error, intermediate-size Tag–Antitag (TAT) system design.

† Authors contributed equally to the present work.

2. Alternative physical models for simulation

A number of approaches may be employed to model chemical systems. These models generally fall into two categories: (1) methods based on single-molecule simulation (e.g., molecular mechanics), which provide a detailed picture of the dynamics of a molecule (van Holde et al., 1998; Leach, 2001); and methods based on simulation of mass action, including (2) kinetics (Wetmur, 1991; Voit, 2000) and (3) statistical thermodynamics (Cantor and Schimmel, 1980; Wartell and Benight, 1985), which provide averaged measures of system behavior. The appropriate method depends upon the chemical system of interest, and the scales associated with the primary system processes under consideration.

2.1. Single-molecule simulation: molecular mechanics

The most accurate approach is the explicit modeling of a single instance of the molecule of interest. For purposes of modeling behavior following association, a hydrogen-bonded network (e.g., dsDNA) is often considered to be a single biopolymer. Although the most detailed method involves calculation of electronic wave functions (ab initio methods), a more popular approach for large molecular systems is to dispense with wave functions, in favor of an empirical, Newtonian ball-and-spring view of biopolymer energetics. In this approach, known as molecular mechanics, the energy of a conformation is computed as the sum of two components: the bonding interactions, and the non-bonding interactions. A brief overview of this process will now be provided. For a detailed discussion of force-field construction and use for modeling DNA, see (von Kitzing, 1992; Cheatham and Kollman, 2000). The overall bonding energy is defined in terms of the deformation energy of each two-atom, three-atom, and four-atom center, from the experimentally determined, context-independent equilibrium values characteristic of the group (bond-stretching, bond-angle bending, and dihedral angles, respectively). A harmonic force-field is distinguished by use of a harmonic potential to model the first two types of terms. The low-level assignment of spring constants and equilibrium values, known as field parametrization, provides the field’s connection with experimental observation. Taken over all centers, the sum estimates the energy of forming the covalent bonds characterizing a given configuration. At ambient temperatures, the large magnitude of this energy away from the vicinity of equilibrium values restricts dynamical changes to rotations about single bonds (i.e., variations in conformation), effectively limiting motion to a conformational subspace, within which folding is driven by more modest, non-bonding interactions.
These include the long-range stabilizing forces (electrostatic, dipole–dipole, van der Waals, hydrogen bonding), taken over relevant pairs or groups of atoms. The number of degrees of freedom is commonly reduced by treating chemical groups whose internal dynamics are beyond the scope of the simulation (e.g., hydrogen vibrations) in terms of a static, ‘united atom’. An additional consideration when modeling a charged poly-ion such as DNA in buffered solution is the need to account for the stabilizing effect of partial counter-ion screening of the backbone charges (von Kitzing, 1992). The primary stabilizing interaction for a DNA duplex in solution, namely base-pair stacking, arises from the sum of three distinct favorable interactions: (1) van der Waals interaction between rings; (2) interactions between induced ring dipole moments; and (3) hydrophobic sequestering of aromatic rings (Wartell and Benight, 1985). In addition to estimation of the above enthalpic components of the free energy, a separate consideration must therefore be made of the entropy change accompanying solvation (e.g., modeling the hydrophobic interaction, which drives biopolymer folding (van Holde et al., 1998)). For each conformation, the sum over the bonding and non-bonding components yields the total conformational potential energy, $V$. For a biopolymer containing $N$ atoms, the resulting values form a potential energy landscape of dimension $3N - 3$. Molecular mechanics simulations on an energy landscape may be classified into two categories: energy minimization, and molecular dynamics (MD) (van Holde et al., 1998). In energy minimization, the conformational space is sampled via a search method of choice (e.g., gradient descent), from a set of initial states (Smellie et al., 1995), with the minimum energy state encountered over the sampled subspace identified as the biopolymer’s equilibrium conformation, or native state.
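The energy-minimization procedure just described can be illustrated with a minimal Python sketch. The one-dimensional double-well function below is a toy stand-in chosen for illustration only; it is not a molecular mechanics force-field, and the names `potential`, `descend`, and `minimize`, as well as the step size and iteration count, are assumptions of this sketch.

```python
def potential(x):
    """Toy one-dimensional double-well energy (arbitrary units); not a real force-field."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    """Analytic derivative dV/dx of the toy potential."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def descend(x0, step=0.01, n_iter=5000):
    """Plain gradient descent from a single initial state."""
    x = x0
    for _ in range(n_iter):
        x -= step * gradient(x)
    return x


def minimize(initial_states):
    """Descend from every initial state and keep the lowest-energy endpoint."""
    endpoints = [descend(x0) for x0 in initial_states]
    return min(endpoints, key=potential)


starts = [-2.0, -0.5, 0.5, 2.0]
x_best = minimize(starts)
print(f"lowest-energy state found: x = {x_best:.4f}, V = {potential(x_best):.4f}")
```

Starting points on the right-hand side of the barrier settle in the shallower well, which is why the search is run from several initial states and only the lowest-energy endpoint is reported.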
In a Molecular Dynamics simulation, biopolymer dynamical behavior is modeled as proceeding in accordance with Newton’s 2nd law, with the overall restoring force towards equilibrium on a conformation equal to the negative gradient of the potential energy, $\vec{F} = -\nabla V$ (the molecular mechanics force-field). Biopolymer kinetic energy, $K$, is related to the temperature, $T$, and the number of degrees of freedom by equipartition of energy, as $K = \frac{1}{2}(3N - 3) k_B T$. Note that simulations generally require the use of unrealistically high $T$, to compensate for excessive biopolymer stiffness accompanying use of harmonic potentials. Biopolymer dynamical motion is then modeled as a trajectory in conformation space which proceeds from initial state $x_0$ in a step-wise fashion. At each time-step $\Delta t$, the motion of each particle $i$ proceeds in accordance with Newton’s 2nd law, using a truncated Taylor series,

$$x_i(t + \Delta t) \approx x_i(t) + \frac{dx_i}{dt}\,\Delta t + \frac{1}{2}\,\frac{d^2 x_i}{dt^2}\,\Delta t^2, \qquad (1)$$

where $dx_i/dt = v_i$ and $d^2 x_i/dt^2 = a_i$ are particle velocity and acceleration, respectively. The initial distribution of particle velocities, $v_i = \sqrt{2K_i/m_i}\,\hat{r}$, is generally chosen randomly, according to a Maxwell–Boltzmann distribution, while the acceleration of particle $i$ is related to the gradient of the potential energy, by $a_i = -\nabla_i V_i / m_i$. An MD simulation may also be employed for energy minimization, by interpreting the time-averaged behavior of the simulated molecule as equivalent to an ensemble average, or identifying the native state as the minimum energy state encountered over the sampled subspace.
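As an illustration of the time-stepping in Eq. 1, the following Python sketch propagates a single particle in a one-dimensional harmonic potential. The toy system (unit mass and spring constant) is an assumption of the sketch. Eq. 1 only specifies the position update; the velocity update used here (averaging the old and new accelerations, as in the velocity Verlet scheme) is a choice made for this sketch and is not taken from the text.

```python
import math


def acceleration(x, k=1.0, m=1.0):
    """a = -(dV/dx)/m for the toy harmonic potential V = k x^2 / 2."""
    return -k * x / m


def simulate(x0, v0, dt, n_steps):
    """Step a single particle forward; returns the trajectory and final velocity."""
    x, v = x0, v0
    a = acceleration(x)
    trajectory = [x]
    for _ in range(n_steps):
        # Position update: the truncated Taylor series of Eq. 1.
        x = x + v * dt + 0.5 * a * dt * dt
        # Velocity update (assumption of this sketch): velocity Verlet.
        a_new = acceleration(x)
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
        trajectory.append(x)
    return trajectory, v


dt = 0.001
n_steps = round(2.0 * math.pi / dt)  # roughly one oscillation period
trajectory, v_end = simulate(x0=1.0, v0=0.0, dt=dt, n_steps=n_steps)
energy_end = 0.5 * v_end ** 2 + 0.5 * trajectory[-1] ** 2
print(f"x after one period: {trajectory[-1]:.6f}, total energy: {energy_end:.6f}")
```

After one period the particle should return close to its starting position with its total energy nearly unchanged, which gives a quick check that the time-step is small compared with the time-scale of the motion.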
Simulation tools employing various forms of harmonic force-field are commercially available, including CHARMM (Brooks et al., 1983) and AMBER (Pearlman et al., 1995) (for a comparison of these tools for modeling DNA stacking and helix formation, see (Sponer and Kypr, 1993; Feig and Pettitt, 1998); a general discussion of field selection and parametrization is in (Jensen, 2001)). Although such tools have been very successful in modeling small systems, application to biopolymer folding has been more limited, due to the large computational load required for accurate simulation. First of all, a very large number of single-path simulations are required to obtain an accurate picture of the energetic landscape. Furthermore, for each run the motions of a large number of atoms must be simulated at each time-step, $\Delta t$, which should be small compared with times of interest (molecular vibrations and rotations occur on time-scales around $\Delta t = 10^{-15}$–$10^{-12}$ s). Given current processor speeds, total simulation times range from tens of picoseconds (typical) to a maximum of about $10^{-6}$ s (van Holde et al., 1998). In contrast, biopolymer folding/melting occurs on the millisecond-to-second scale. Although the recent development of coarse-grained models has supported special-case applications (Takano et al., 2004), the general application of molecular mechanics methods to model biopolymer folding and association remains problematic.

2.2. Mass action simulations

In a mass-action approach, an attempt to model specific details of molecular dynamics is abandoned, in favor of an estimate of the mean behavior of the chemical system. Of the two basic categories of mass action modeling (kinetic and equilibrium), a kinetic approach is more detailed, since it retains complete information regarding the temporal behavior of all species comprising the reacting system of interest. In contrast, an equilibrium model simplifies, by further restricting attention to processes at equilibrium. The validity of an equilibrium model relies on the correctness of the assumption of system equilibrium (Cantor and Schimmel, 1980). More precisely, experimental times of interest, $t_{exp}$, should be long compared with the corresponding relaxation time to equilibrium, $\tau$. For instance, in the case of Whiplash PCR, application of an equilibrium model requires the time between successive extensions of a DNA hairpin to be large compared with the hairpin relaxation time, $\tau$ (Rose et al., 2002). If this condition is not met, experimental behavior is likely to deviate from the predictions of an equilibrium model. In this case, it is advisable to instead adopt a kinetic analysis.

2.2.1. Chemical kinetics

In a kinetic analysis, a chemical system of interest is modeled via a set of differential equations, each of which represents the dynamics of one system constituent (Voit, 2000). In particular, the time rate of change of the concentration of each species $i$, denoted $\dot{X}_i$, is modeled as the difference between two functions: a production term, $V_i^+(X_{p_1}, \ldots, X_{p_j})$, which models the net production of species $i$ from substrates $p_1, \ldots, p_j$, and a depletion term, $V_i^-(X_{d_1}, \ldots, X_{d_k})$, which models the net depletion of species $i$ into products, $p_1, \ldots, p_k$. Depending on the model employed, $V_i^+$ and $V_i^-$ are each modeled in terms of a specific power-law representation of the component variables. In the most intuitive representation, the Generalized Mass Action (GMA) model (see (Voit, 2000) for a discussion of equally valid alternatives), each is a sum over power-law expressions for each contributing elementary process. For instance, a term in $V_0^+$ of the form $k^+ X_1^a X_2^b$ models a single production pathway, which results in the production of a single molecule of species $i = 0$ from $a$ molecules of species $i = 1$ and $b$ molecules of species $i = 2$. Here, $k^+$ is the corresponding elementary forward rate constant, while $a$ and $b$ represent the order of the pathway with respect to species 1 and 2, respectively. Given a complete parametrization (values for all forward ($k^+$) and reverse ($k^-$) rate constants), numerical solution of the resulting set of equations may be accomplished by well-established methods (Voit, 2000), to yield the complete dynamics of the system. For discussions of DNA kinetics, see (Cantor and Schimmel, 1980; Wetmur, 1991).
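The GMA formulation can be illustrated with a small Python sketch for a single reversible association, A + B ⇌ C. The species names, rate constants, initial concentrations, and the use of a fixed-step fourth-order Runge–Kutta integrator are all assumptions made for illustration; none of the numerical values are taken from the text.

```python
def gma_rates(conc, pathways):
    """Return dX_i/dt for every species as a sum of power-law terms."""
    dxdt = {s: 0.0 for s in conc}
    for k, orders, net in pathways:
        flux = k
        for species, order in orders.items():
            flux *= conc[species] ** order      # k * X_1^a * X_2^b ...
        for species, change in net.items():
            dxdt[species] += change * flux      # production (+) or depletion (-)
    return dxdt


def rk4_step(conc, pathways, dt):
    """Advance all concentrations by one fixed Runge-Kutta (RK4) step."""
    def shifted(deriv, h):
        return {s: conc[s] + h * deriv[s] for s in conc}
    k1 = gma_rates(conc, pathways)
    k2 = gma_rates(shifted(k1, dt / 2), pathways)
    k3 = gma_rates(shifted(k2, dt / 2), pathways)
    k4 = gma_rates(shifted(k3, dt), pathways)
    return {s: conc[s] + dt / 6 * (k1[s] + 2 * k2[s] + 2 * k3[s] + k4[s])
            for s in conc}


K_PLUS = 1.0e6   # hypothetical forward rate constant, 1/(M s)
K_MINUS = 1.0    # hypothetical reverse rate constant, 1/s
pathways = [
    (K_PLUS, {"A": 1, "B": 1}, {"A": -1, "B": -1, "C": +1}),   # A + B -> C
    (K_MINUS, {"C": 1}, {"A": +1, "B": +1, "C": -1}),          # C -> A + B
]

conc = {"A": 1.0e-6, "B": 1.0e-6, "C": 0.0}   # hypothetical initial concentrations, M
for _ in range(2000):                          # 2000 steps of 0.01 s = 20 s
    conc = rk4_step(conc, pathways, dt=0.01)
print({s: f"{c:.3e}" for s, c in conc.items()})
```

At long times the ratio [C]/([A][B]) should approach $k^+/k^-$, which connects the kinetic description to the equilibrium description.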

Adopting a kinetic model as a general scheme for the analysis of hybridizing DNA systems requires a parameter set for use in estimating sequence-specific forward and reverse rate constants for DNA helix formation. Unfortunately, reported work on DNA kinetics has generally been limited to determination of characteristic rates for specific hairpins and duplexes, and renaturation rates for randomly sheared dsDNAs (Cantor and Schimmel, 1980). As a result, in marked contrast with the sequence dependence of duplex energetics, which is fairly well-established (SantaLucia and Hicks, 2004), the sequence-dependence of the rate constants for modeling DNA helix formation is still not well-understood (Cantor and Schimmel, 1980). Parametrization is therefore probably the most serious practical obstacle to the development of a general kinetic tool for the detailed analysis of hybridizing DNA systems. A sequence-dependent empirical expression for $k^+$ for DNA hairpins, for use in the decomposition of $K_{eq} = k^+/k^-$ into $k^+$ and $k^-$, was reported in (Bonnet et al., 1998), and later applied to Whiplash PCR (Rose et al., 2002). A similar approach for associating DNA systems, suggested by the observation that $k^+$ for duplex formation is roughly sequence-independent (Cantor and Schimmel, 1980), would be to adopt a mean value for all forward rate constants, $k^+$, for species of a given length, via averaging over available experimental estimates. Approximate values of $k^-$ for the various duplexes in a solution of interest could then be assigned straightforwardly (via $k^- = k^+/K_{eq}$).
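The assignment of reverse rate constants from a shared forward rate constant can be sketched in a few lines of Python. The mean forward rate constant, the temperature, and the duplex free energies below are hypothetical placeholders; in practice the free energies would come from a validated parameter set, and the equilibrium constant is taken here as $K_{eq} = \exp(-\Delta G^\circ/RT)$.

```python
import math

R_KCAL = 1.987e-3  # molar gas constant, kcal/(mol K)


def equilibrium_constant(delta_g, temp_k):
    """K_eq = exp(-dG/RT) for duplex formation (dG in kcal/mol)."""
    return math.exp(-delta_g / (R_KCAL * temp_k))


def reverse_rate(k_plus, delta_g, temp_k):
    """k- = k+ / K_eq, using one shared forward rate constant."""
    return k_plus / equilibrium_constant(delta_g, temp_k)


K_PLUS_MEAN = 1.0e6   # hypothetical mean forward rate constant, 1/(M s)
TEMP_K = 310.15       # hypothetical reaction temperature, K
duplex_dg = {"duplex_1": -12.0, "duplex_2": -8.0}   # hypothetical free energies, kcal/mol

k_minus = {name: reverse_rate(K_PLUS_MEAN, dg, TEMP_K)
           for name, dg in duplex_dg.items()}
for name, value in k_minus.items():
    print(f"{name}: k- = {value:.3e} 1/s")
```

Under this scheme all of the sequence dependence is carried by $k^-$: the more stable duplex dissociates more slowly.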

2.2.2. Equilibrium analysis

An equilibrium approach abandons attempts to model dynamics, and instead restricts attention to predicting the equilibrium distribution of an ensemble of biopolymer instances amongst the various accessible conformations. In the simplest approach, an equilibrium constant for each component equilibrium is estimated via statistical thermodynamic considerations, while the impact of specific chemistry is modeled via strand conservation and mass action (Wartell and Benight, 1985; Rose et al., 2004). The thermodynamic apparatus for modeling the helix-coil transition forms the basis for treating general DNA mixtures at equilibrium. Briefly, each accessible conformation is assigned a statistical weight, by decomposition into a sequence of B-helical islands, punctuated by ssDNA coils. Each helix is assigned a sequence-dependent weight of the form,

$$w = \exp(-\Delta G^\circ / RT), \qquad (2)$$

where $\Delta G^\circ$ is the standard Gibbs free energy of duplex formation, which may be assessed for both Watson–Crick duplexes and duplexes with single internal mismatches via the nearest-neighbor model (SantaLucia and Hicks, 2004), and $R$ is the molar gas constant. Coil regions take a Jacobson–Stockmayer weight, $w \propto n^{-1.5}$, where $n$ refers to the number of loop links, penalizing the conformation for secondary nucleation (Wartell and Benight, 1985). The cooperativity of stacking is modeled by assigning a penalty of $\sigma^{1/2}$ to each helix-coil interface. Here, $\sigma \approx 4.5 \times 10^{-5}$ is the cooperativity parameter (Wartell and Benight, 1985). For conformations requiring strand association, the penalty for bimolecular nucleation, $\kappa$ (which includes a factor of $\sigma$), may be treated via external partition function arguments (Wartell and Benight, 1985), or for oligonucleotides, via an initiation parameter (SantaLucia and Hicks, 2004).
A conformation’s overall weight is then estimated by the product of all sub-weights, which is regarded as the equilibrium constant of formation from isolated strands. An equilibrium approach is appropriate when addressing (van Holde et al., 1998):
• Equilibrium constants: estimation of the equilibrium constant of formation, $K_k = w_k$, for conformation $k$; or estimation of bulk equilibrium constants, including: overall equilibrium constants of hairpin folding, $K_{eq} = \sum_k w_k = Z_{fold} - 1$, where $Z_{fold}$ is the folding partition function; and overall equilibrium constants of strand association, $K_{assoc}(i, j) = \sum_k w_k = Z_c$, where the sum, taken over all nucleated conformations $k$ between ssDNA species $i$ and $j$, is the conformal partition function, $Z_c$ (Cantor and Schimmel, 1980);
• Equilibrium occupancies: prediction of a biopolymer’s most likely conformation (i.e., native state; e.g., the protein (van Holde et al., 1998) and RNA (McCaskill, 1990) folding problems);
• Ensemble averages: prediction of the experimentally observed value of a physical property of interest, at a given set of reaction conditions (e.g., folding free energy, $\Delta G^\circ$, versus temperature);
• Structural transitions: analysis of the transition of a biopolymer chain between two well-defined equilibrium states, as induced by a change of reaction conditions (i.e., DNA melting, modeled as a decrease in mean helicity with increasing temperature).
For predicting a physical quantity of interest, the associated ensemble average is estimated by taking a weighted average over accessible states. Quantities of interest include: the fraction of stacked base-pairs, or fractional helicity, $\theta$ (Wartell and Benight, 1985), and the per-structure probability of error hybridization, or computational incoherence (Rose and Deaton, 2001; Rose et al., 2004).
For a folding process, such an average takes the simple, concentration-independent form,

$$\langle x \rangle = \frac{\sum_k w_k x_k}{Z_{fold}}, \qquad (3)$$

where $w_k$ and $x_k$ are the equilibrium constant of folding, and the value of quantity $x$ characteristic of conformation $k$, respectively. For a process involving strand association, this procedure is complicated by the need to account for the concentration-dependence of nucleation. Estimation of an ensemble average for such a process strictly involves construction of a concentration-dependent partition function for the ensemble, followed by estimation of a weighted average (Poland and Scheraga, 1970). In practice, however, such processes may be approached more straightforwardly by assuming separability into successive association and zipping processes (with or without loops), allowing the concentration-dependent probability of nucleation to be treated via simple equilibrium-chemistry arguments, while conformational statistical weights are used to approximate gross equilibrium constants, and an ensemble average value for each nucleated strand-pair (via Eq. 3, with $Z_{fold}$ replaced by $Z_c$). For detailed discussions, see (Wartell and Benight, 1985; Rose et al., 2004).
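The weighted average of Eq. 3 can be illustrated with a short Python sketch for a folding process. The list of folded conformations, their free energies, and their property values are hypothetical. The sketch also assumes that the unfolded coil is the reference state, with weight 1 and property value 0, so that $Z_{fold} = 1 + \sum_k w_k$.

```python
import math

R_KCAL = 1.987e-3  # molar gas constant, kcal/(mol K)


def weight(delta_g, temp_k):
    """Statistical weight of Eq. 2: w = exp(-dG/RT), dG in kcal/mol."""
    return math.exp(-delta_g / (R_KCAL * temp_k))


def ensemble_average(conformations, temp_k):
    """Eq. 3 for a folding process.

    conformations: list of (dG, x_k) pairs for the folded states.
    Assumption of this sketch: the coil has weight 1 and contributes x = 0.
    """
    weights = [weight(dg, temp_k) for dg, _ in conformations]
    z_fold = 1.0 + sum(weights)
    return sum(w * x for w, (_, x) in zip(weights, conformations)) / z_fold


# Hypothetical hairpin conformations: (dG in kcal/mol, fraction of stacked base-pairs).
hairpin = [(-1.5, 1.0), (-0.5, 0.6), (0.3, 0.3)]
for temp_k in (298.15, 323.15, 348.15):
    print(f"T = {temp_k:.2f} K, mean helicity = {ensemble_average(hairpin, temp_k):.3f}")
```

With temperature-independent free energies, as assumed here, the temperature dependence enters only through $RT$; a melting-curve calculation would instead recompute $\Delta G^\circ$ from enthalpy and entropy terms at each temperature.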

2.3. The issue of chemical detail

At the biopolymer level, the fundamental issue regarding chemical detail is the need to discretize the conformational energies accessible to each polymer. Given the huge number of accessible folded conformations, and since much relevant information is contained in the details of secondary structure, it is typical to focus on the partitioning of the biopolymer into distinct segments of helix and coil, while neglecting larger-scale details of folding (van Holde et al., 1998). The assumption that the helix-coil transition occurs on an all-or-none basis for each residue (the Ising model) then recasts the folding problem in terms of local, residue-by-residue transitions between two well-defined states. In modeling DNA annealing, pairs of H-bonded bases are considered to execute the transition between a pair of random coils and a stacked, dsDNA B-helix. For purposes of equilibrium modeling, this provides an unambiguous answer to the question of appropriate granularity.
At the process level, considerations of chemical detail concern selection of an appropriate model of duplex formation. Application of a general Ising model, which includes conformations with loops of all sizes, requires consideration of a conformation space which scales exponentially with strand-length, $L$ (Cantor and Schimmel, 1980). For folding processes, this difficulty is circumvented via McCaskill’s algorithm (McCaskill, 1990), which estimates $Z_{fold}$ in $O(L^3)$ time. For conformations requiring association, an equilibrium constant of formation is usually approximated by focusing on a tractable subset of conformation space whose occupancy is expected to dominate.
For instance: a two-state model is adequate for modeling very short oligonucleotides (SantaLucia and Hicks, 2004); a perfectly-aligned general model is very satisfactory for modeling the melting of long polynucleotides (Wartell and Benight, 1985); a statistical zipper model (SZM), which discards looped conformations, is satisfactory for modeling the annealing of short, quasirandom sequences (Benight et al., 1981); while a mismatched SZM with singlet bulges is more appropriate for modeling error-annealing of long oligonucleotides (Rose et al., 2001). These models require $O(L)$, $O(L^2)$, $O(L^2)$, and $O(L^3)$ time, respectively.
A higher-level issue concerns the inclusion of side-processes and process coupling. An equilibrium approach is remarkably effective at predicting the melting of ssDNAs and perfectly-aligned dsDNAs (Wartell and Benight, 1985), and may also be used to address more complex DNA chemistries. However, it is important to consider not only the energetics of the helices which may form, but also the effects of process competition. The wholesale application of a set of DNA $T_m$’s derived assuming simple, binary melting to all processes in which the helices may form may lead to spurious conclusions (Rose et al., 2004).
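To make the $O(L^2)$ cost of a zipper-type model concrete, the following Python sketch sums statistical weights over every single contiguous helix of a perfectly aligned duplex. It is a simplified illustration rather than the SZM of the cited works: the per-stack weights and the nucleation factor are passed in as hypothetical inputs, and a lone base pair is assumed to carry the nucleation factor alone.

```python
def zipper_partition(stack_weights, kappa):
    """Sum statistical weights over all single contiguous helices.

    stack_weights[m] is the weight exp(-dG_m/RT) of the stack between
    base pairs m and m+1, so a duplex of L base pairs has L-1 entries.
    kappa is the (hypothetical) bimolecular nucleation factor.
    Returns (Z_c, mean number of base pairs per nucleated strand-pair).
    """
    n_bp = len(stack_weights) + 1
    z_c = 0.0
    bp_sum = 0.0
    for start in range(n_bp):
        w = kappa                              # lone base pair: nucleation only (assumption)
        for end in range(start, n_bp):
            if end > start:
                w *= stack_weights[end - 1]    # extend the helix by one stack
            z_c += w
            bp_sum += w * (end - start + 1)
    return z_c, bp_sum / z_c


# Hypothetical 4-bp duplex with three identical stacks of weight 2 and kappa = 1.
z_c, mean_bp = zipper_partition([2.0, 2.0, 2.0], kappa=1.0)
print(f"Z_c = {z_c}, mean base pairs per nucleated pair = {mean_bp:.3f}")
```

The running product over `end` is what keeps the cost at one multiplication per (start, end) pair, that is, $O(L^2)$ overall.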

2.4. The inverse problem: DNA strand design

The inverse problem of analysis concerns system design for effective implementation. Instances include the design of: high-fidelity PCR primers (Hartemink and Gifford, 1999) and DNA microarrays (Ben-Dor et al., 2000; Rose et al., 2003), and low-error or high-efficiency DNA computers (Deaton et al., 1998; Condon et al., 2001; Rose et al., 2002; Andronescu et al., 2002; Rose et al., 2004). Here, design refers to the encoding of each of a set of ssDNA strands of interest as a 5′ to 3′ string over {A, T, G, C}, with the goal of ensuring both high affinity of all strands to specific target sequences, and high specificity, via low mutual affinity to all other binding sites (Eaton, 1995). This problem may be formalized as the DNA Strand Design Problem (DSD), an instance of which is a quadruple, $X = (S, R, C, t)$, where: (1) $S$ is a set of distinct ssDNA species, each of which is distinctly identified by a planned computational role, but initially unencoded; (2) $R$ is a set of hybridization rules, each of which defines a target, annealed conformation between two species, $i, j \in S$; (3) $C$ is a set of similarity-constraints on $S$, where each external constraint restricts a subsequence of a strand, $i \in S$, to be Watson–Crick complementary to a region of a fixed, biologically-relevant target (e.g., PCR primer design), while each internal constraint restricts specific subsequences within a strand-pair, $i, j \in S$, to be related by either identity or complementarity; and (4) $t$ is a threshold value, relative to a chosen measure of hybridization error. DSD on $X$ then asks:
DNA Strand Design (DSD). ‘‘Given constraints $C$, does an encoding of strands in $S$ exist such that, under reaction conditions of interest, annealing of strands in $S$ proceeds as directed by $R$, with mean probability per H-bonded structure $\geq 1 - t$?’’
The optimization version asks for the encoding set that minimizes the error measure.
Note that a bias towards design for process fidelity, rather than efficiency, is implicit in the DSD problem statement, via use of a per-structure, rather than per-strand, measure of controllability. From a complexity perspective, two questions are of interest: (1) Does an algorithm exist for evaluating any trial solution of any DSD instance in polynomial time, in both strand length, $L$, and species number, $N = |S|$? If so, the assignment DSD $\in$ NP may be made (likely, given the existence of efficient methods for equilibrium constant approximation); (2) Does an algorithm exist for solving any instance of DSD in polynomial time, in both $L$ and $N$? If so, then DSD $\in$ P. Several methods have been developed to systematically generate solutions for special cases of DSD, under shifted Hamming measures of sequence-similarity (Arita and Kobayashi, 2003; Garzon et al., 2004). However, the development of an efficient, systematic algorithm for producing solutions to general DSD instances, via a detailed physical model, remains a major open problem. Most current methodologies are thus stochastic, employing the following procedure: (1) The mixture of interest is expressed as a DSD instance; (2) A population, $Q$, of encodings is generated, each of which obeys constraints $C$ and rules $R$; (3) A tool for analysis is applied, which assigns a measure of encoding goodness to each member of $Q$. Alternatives include measures based on: (a) sequence similarity (Deaton et al., 1998; Liu et al., 1999; Condon et al., 2001; Garzon et al., 2004); (b) stringency (Hartemink and Gifford, 1999); (c) H-bonding probabilities (McCaskill, 1990); (d) structure-freeness (Andronescu et al., 2002); and (e) computational incoherence (Rose and Deaton, 2001; Rose et al., 2004). If any member of $Q$ satisfies the goodness criterion (error measure below $t$), the process halts; (4) Otherwise, a stochastic optimization method is recursively applied to $Q$ (e.g., a Genetic Algorithm (Michalewicz, 1996)), until a satisfactory encoding is produced, or a maximum number of generations is exceeded.
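The four-step stochastic procedure can be sketched in Python as follows. This is a deliberately reduced illustration: the constraint set $C$ and rule set $R$ are taken to be empty, the goodness measure is a toy sequence-similarity score (the largest fraction of matching positions over all word pairs), and a simple mutate-and-select loop stands in for a full Genetic Algorithm. All function names, parameter values, and the random seed are assumptions of the sketch.

```python
import random

ALPHABET = "ATGC"


def random_word(rng, length):
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def error_measure(words):
    """Toy similarity measure: largest fraction of matching positions over word pairs."""
    worst = 0.0
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            matches = sum(a == b for a, b in zip(words[i], words[j]))
            worst = max(worst, matches / len(words[i]))
    return worst


def mutate(rng, words):
    """Copy an encoding and redraw one randomly chosen base."""
    new = list(words)
    w = rng.randrange(len(new))
    p = rng.randrange(len(new[w]))
    new[w] = new[w][:p] + rng.choice(ALPHABET) + new[w][p + 1:]
    return new


def design(n_words, length, t, pop_size=20, max_gen=200, seed=0):
    """Return an encoding with error_measure < t, or None if none is found."""
    rng = random.Random(seed)
    # Step 2: generate a population Q of candidate encodings.
    population = [[random_word(rng, length) for _ in range(n_words)]
                  for _ in range(pop_size)]
    for _ in range(max_gen):
        # Step 3: score every member of Q; halt if the best one passes.
        population.sort(key=error_measure)
        if error_measure(population[0]) < t:
            return population[0]
        # Step 4: keep the better half and refill with mutated copies.
        survivors = population[: pop_size // 2]
        population = survivors + [mutate(rng, rng.choice(survivors))
                                  for _ in range(pop_size - len(survivors))]
    best = min(population, key=error_measure)
    return best if error_measure(best) < t else None


encoding = design(n_words=4, length=8, t=0.6)
print(encoding)
```

Replacing `error_measure` with a thermodynamically motivated measure changes only step 3; the surrounding search loop is unaffected.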

3. The issue of experimental model validation

A tool for solving DSD instances usually implements a model for predicting the static or kinetic behavior of DNA hybridization. The issue of model validity, which is critical for reliable application, hinges upon the appropriateness of the implemented model. The chemical details and time-scales considered should be appropriate to address relevant questions regarding the target process, while neglecting those of subsidiary importance. In addition, parameters used to model underlying physical quantities should correspond to experimentally established values, and should be appropriate to conditions of interest. Finally, predictions should be quantitative, experimentally testable, and clearly interpretable in terms of the biochemical processes at hand. To illustrate, the validation of a pair of simple models, as applied to encode an orthonormal set of 300 23-base DNA sequences, is now considered.
A two-state model is often employed to predict the melting and renaturation of DNAs shorter than about 15 bps, due to the low likelihood of stable loop-formation. For purposes of experimentally validating this model for predicting the melting of longer oligonucleotides, melting temperatures ($T_m$’s) were estimated for 300 dsDNA 23-mers, by observing the fluorescence intensity change with $T$, in the presence of SYBR Green I. As shown in Figure 1, although modestly higher due to the presence of melting-intermediates, experimental values demonstrated a good correlation with model values, obtained using parameters in (SantaLucia and Hicks, 2004). The model was therefore concluded to be adequate for predicting gross melting behavior, but inadequate for predicting finer details. This is consistent with the reported presence of melting intermediates for dsDNAs longer than about 15 bps, as indicated by a substantial deviation between calorimetric and van’t Hoff estimates of the standard enthalpy of duplex formation, $\Delta H^\circ$, and also with theoretical considerations, which indicate an SZM to be more appropriate for modeling long oligonucleotides (Rose et al., 2001).
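For reference, a two-state (all-or-none) melting temperature can be computed as in the Python sketch below. It uses the standard two-state expression for non-self-complementary strands present at equal concentrations, $T_m = \Delta H^\circ / (\Delta S^\circ + R \ln(C_T/4))$, where $C_T$ is the total strand concentration. The enthalpy, entropy, and concentration values are hypothetical placeholders rather than nearest-neighbor sums for any of the 300 sequences, and no salt correction is applied.

```python
import math

R_CAL = 1.987  # molar gas constant, cal/(mol K)


def two_state_tm(delta_h, delta_s, total_conc):
    """Two-state Tm in kelvin.

    delta_h in cal/mol, delta_s in cal/(mol K), total_conc (C_T) in mol/L.
    Assumes non-self-complementary strands at equal concentrations.
    """
    return delta_h / (delta_s + R_CAL * math.log(total_conc / 4.0))


def two_state_k(delta_h, delta_s, temp_k):
    """Two-state equilibrium constant of duplex formation at temp_k."""
    return math.exp(-(delta_h - temp_k * delta_s) / (R_CAL * temp_k))


DELTA_H = -170000.0   # hypothetical duplex enthalpy, cal/mol
DELTA_S = -470.0      # hypothetical duplex entropy, cal/(mol K)
C_TOTAL = 1.0e-6      # hypothetical total strand concentration, mol/L

tm_kelvin = two_state_tm(DELTA_H, DELTA_S, C_TOTAL)
print(f"two-state Tm = {tm_kelvin - 273.15:.1f} deg C")
```

By construction, half of the strands are in duplex form at $T_m$, which corresponds to $K(T_m) = 4/C_T$; this identity provides a convenient self-check on the implementation.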

[Figure 1: scatter plot of observed $T_m$ (°C), y-axis, range 69–76, versus calculated $T_m$ (°C), x-axis, range 63–68.]

Figure 1. Experimentally-measured melting temperatures for 300 dsDNAs of length 23 bps (y-axis), versus values predicted via an all-or-none, nearest-neighbor model. Each point denotes the mean of three independent measurements of Tm for a single dsDNA species, with the accompanying error-bars denoting the standard deviation.

A simulation tool for predicting fatal mishybridization is essential for reliable design. Although a Hamming method (Deaton et al., 1998; Liu et al., 1999) is often employed, this method is insufficiently detailed to detect stable misaligned hybrids, or mishybrids with mismatches and/or single-base bulges (Rose and Deaton, 2001). Use of a shifted, two-state model (Hartemink and Gifford, 1999) is also inadequate, as it neglects the dependence of mishybrid stability on the positional distribution of defects. It is therefore likely that more detailed models of duplex formation will be required to treat larger DNA systems. Unfortunately, a detailed treatment is complicated by the open state of the parameter problem, as no set of thermodynamic parameters has yet been reported which accurately evaluates the stability of complicated mishybrids, and even simple mismatched structures may show thermal stabilities which do not fit easily within a nearest-neighbor picture (SantaLucia and Hicks, 2004). Error-behavior was thus investigated experimentally for a random, 100-strand subset of the 300 member set, by hybridizing each word and its complement with all 100 strands, terminally immobilized on a glass surface. Even under conditions of ten-fold excess input, no significant mishybridization was observed (mean error rate: 2 × 10⁻³).
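The limitation of the Hamming method noted above can be illustrated with a small sketch. It compares a strand against a binding target in perfect register (the Hamming count), and then against all relative shifts, charging one unit per unpaired overhang position. The scoring rule and the names `hamming` and `min_shifted_mismatch` are assumptions made for this illustration, not the method of any cited tool. The sketch counts mismatches only, so it still ignores bulges and the positional dependence of stability discussed in the text.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Watson-Crick reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_shifted_mismatch(a, b):
    """Minimum over all relative shifts of (overlap mismatches + |shift|).

    A shift of k leaves |k| positions unpaired at the ends; each is
    charged as one defect (an illustrative, hypothetical penalty).
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    n = len(a)
    best = n
    for k in range(-(n - 1), n):
        if k >= 0:
            pairs = zip(a[: n - k], b[k:])
        else:
            pairs = zip(a[-k:], b[: n + k])
        mismatches = sum(x != y for x, y in pairs)
        best = min(best, mismatches + abs(k))
    return best


if __name__ == "__main__":
    other = "CATCGACTGT"                 # hypothetical immobilized word
    target = reverse_complement(other)   # sequence that binds it perfectly
    probe = target[1:] + "A"             # same sequence, displaced by one base
    print("aligned Hamming count:", hamming(probe, target))
    print("best shifted count:   ", min_shifted_mismatch(probe, target))
```

In this toy case the aligned comparison reports the probe as maximally dissimilar to the target, while a one-base shift exposes a nearly perfect misaligned hybrid. A design tool relying on the aligned count alone would therefore accept a pair that may hybridize stably out of register.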

4. Conclusion

In this work, the major physical approaches commonly employed to model biomolecular processes were compared, with a focus on the potential of each to simulate various aspects of nucleic acid-based systems. The limitations of each model were considered, along with a discussion of appropriate chemical detail, at the biopolymer, process, and system levels. The relationship between analysis and design was addressed, and formalized via the DNA Strand Design problem. The issue of experimental validation was addressed, via a pair of focus examples.

Acknowledgements

We are grateful to Prof. David H. Wood, of the University of Delaware, and Prof. Max Garzon, of the University of Memphis, for helpful corrections of the manuscript. Financial support was generously provided by Grant-in-Aids for Scientific Research B (15300100 and 15310084) from the Ministry of Education, Culture, Sports, Science, and Technology of Japan and by JST-CREST.

References

Adleman LM (1994) Molecular computation of solutions to combinatorial problems. Science 266: 1021–1024
Andronescu M et al. (2002) Algorithms for testing that sets of DNA words concatenate without secondary structure. In: Hagiya M and Ohuchi A (eds) DNA Computing, LNCS 2568, pp. 182–195. Springer, NY
Arita M and Kobayashi S (2003) DNA sequence design using templates. New Generation Computing 20: 263–277
Ben-Dor A et al. (2000) Universal DNA tag systems: a combinatorial design scheme. Journal of Computational Biology 7: 503–519
Benight AS, Wartell RM and Howell DK (1981) Theory agrees with experimental thermal denaturation of short DNA restriction fragments. Nature 289: 203–205
Blain D et al. (2004) Development, evaluation, and benchmarking of simulation software. Natural Computing, this issue
Blake RD et al. (1999) Statistical mechanical simulation of polymeric DNA melting with MELTSIM. Bioinformatics 15: 370–375
Bonnet G et al. (1998) Kinetics of conformational fluctuations in DNA hairpin-loops. Proceedings of the National Academy of Sciences of the USA 95: 8602–8606
Brooks BR et al. (1983) CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry 4: 187–217
Cantor C and Schimmel P (1980) Biophysical Chemistry, Part III: The Behavior of Biological Macromolecules. W. H. Freeman, New York
Cheatham TE III and Kollman PA (2000) Molecular dynamics simulation of nucleic acids. Annual Review of Physical Chemistry 51: 435–471
Chen J and Reif J (eds) (2004) DNA Computing – 9th International Workshop on DNA Based Computers, LNCS 2943. Springer, New York
Condon A, Corn RM and Marathe A (2001) On combinatorial DNA word design. Journal of Computational Biology 8: 201–220
Deaton RJ et al. (1998) The reliability of DNA-based computing. Physical Review Letters 80: 417–420
Deaton RJ et al. (2002) A software tool for generating non-crosshybridizing libraries of DNA oligonucleotides. In: Hagiya M and Ohuchi A (eds) DNA Computing, LNCS 2568, pp. 252–261. Springer, NY
Eaton B (1995) Let’s get specific: the relationship between specificity and affinity. Chemistry and Biology 2: 635–638
Feig M and Pettitt BM (1998) Structural equilibrium of DNA represented with different force fields. Biophysical Journal 75: 134–149
Garzon M, Blain D and Bobba K (2004a) Simulation environments for biomolecular computing. Natural Computing, this issue
Garzon M, Hyde B and Bobba K (2004b) Efficiency and reliability of semantic retrieval in DNA-based memories. In: Chen J and Reif J (eds) DNA Computing – 9th International Workshop on DNA Based Computers, LNCS 2943, pp. 157–169. Springer, New York
Hagiya M and Ohuchi A (eds) (2003) DNA Computing, LNCS 2568. Springer, NY
Hartemink A and Gifford D (1999) Thermodynamic simulation of deoxyribonucleotide hybridization for DNA computation. In: Rubin H and Wood DH (eds) (2000) DNA Based Computers III, pp. 25–38. American Mathematical Society, Providence
Hofacker IL (2003) Vienna RNA secondary structure server. Nucleic Acids Research 31: 3429–3431
Jensen F (2001) Introduction to Computational Chemistry. Wiley, New York
Leach A (2001) Molecular Modeling: Principles and Applications. Prentice Hall, Upper Saddle River
Liu Q et al. (1999) Progress towards demonstration of a surface based DNA computation: a one word approach to solve a model satisfiability problem. Biosystems 52: 25–33
McCaskill JS (1990) The equilibrium partition function and binding probabilities for RNA secondary structure. Biopolymers 29: 1105–1119
Michalewicz Z (1996) Genetic Algorithms + Data Structures = Programs, 3rd ed. Springer, New York
Nishikawa A, Yamamura M and Hagiya M (2001) DNA computation simulator based on abstract bases. Soft Computing 5: 25–38
Pearlman DA et al. (1995) AMBER, a computer program for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to elucidate the structures and energies of molecules. Computer Physics Communications 91: 1–41
Poland D and Scheraga H (1970) Theory of Helix-Coil Transitions in Biopolymers. Academic Press, New York
Rose JA and Deaton RJ (2001) The fidelity of annealing–ligation: a theoretical analysis. In: Condon A and Rozenberg G (eds) DNA Computing – 6th International Workshop on DNA Based Computers, LNCS 2054, pp. 231–246. Springer, NY
Rose JA et al. (2001) The fidelity of the tag–antitag system. In: Jonoska N and Seeman NC (eds) DNA Computing – 7th International Workshop on DNA Based Computers, LNCS 2340, pp. 302–310. Springer, New York
Rose JA et al. (2002) Equilibrium analysis of the efficiency of an autonomous molecular computer. Physical Review E 65: Article 021910, 1–13
Rose JA, Hagiya M and Suyama A (2003) The fidelity of the tag-antitag system 2: reconciliation with the stringency picture. In: Proceedings of the Congress on Evolutionary Computation, Dec. 2003, Canberra, Australia, pp. 2740–2747
Rose JA, Deaton RJ and Suyama A (2004) Statistical Thermodynamic Analysis and Design of DNA-based Computers. Natural Computing, this issue
Rubin H and Wood DH (eds) (2000) DNA Based Computers III. American Mathematical Society, Providence, RI
SantaLucia J Jr. and Hicks D (2004) The thermodynamics of DNA structural motifs. Annual Review of Biophysics and Biomolecular Structure 33: 415–440
Smellie A, Teig S and Towbin P (1995) Poling: promoting conformational variation. Journal of Computational Chemistry 16: 171–187
Sponer J and Kypr J (1993) Theoretical analysis of the base stacking in DNA. Choice of the force field and a comparison with the oligonucleotide crystal structures. Journal of Biomolecular Structure & Dynamics 11: 277–292
Steger G (1994) Thermal denaturation of double-stranded nucleic acids: prediction of temperatures critical for gradient gel electrophoresis and polymerase chain reaction. Nucleic Acids Research 22: 2760–2768
Takano M et al. (2004) On the model granularity to simulate protein dynamics: A biological physics view on biomolecular computing. Natural Computing, this issue
Uejima H and Hagiya M (2004) Secondary structure design of multi-state DNA machines based on sequential structure transitions. In: Chen J and Reif J (eds) DNA Computing – 9th International Workshop on DNA Based Computers, LNCS 2943, pp. 74–85. Springer, New York
van Holde K, Johnson C and Ho P (1998) Principles of Biophysical Chemistry. Prentice Hall, Upper Saddle River
Voit E (2000) Computational Analysis of Biochemical Systems. Cambridge University Press, New York
von Kitzing E (1992) Modeling DNA structures. Progress in Nucleic Acids Research and Molecular Biology 43: 89–110
Wartell RM and Benight AS (1985) Thermal denaturation of DNA molecules: A comparison of theory with experiment. Physics Reports 126: 67–107
Wetmur JG (1991) DNA probes: applications of the principles of nucleic acid hybridization. Critical Reviews in Biochemistry and Molecular Biology 26: 227–259