A Global Sampling Approach to Designing and Reengineering RNA

Published online ??
Nucleic Acids Research, ??, Vol. ??, No. ?? 1–11
doi:10.1093/nar/gkn000

A global sampling approach to designing and reengineering RNA secondary structures

Alex Levin [1,2,*], Mieszko Lis [1,*], Yann Ponty [3], Charles W. O’Donnell [1], Srinivas Devadas [1], Bonnie Berger [1,2,†], Jérôme Waldispühl [1,2,4,†]

1 Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA 02139, USA
2 Department of Mathematics, MIT, Cambridge, MA 02139, USA
3 Laboratoire d’Informatique, École Polytechnique, Palaiseau, France
4 School of Computer Science, McGill University, Montreal, QC H2A 3A7, Canada

* Equal contribution.
† To whom correspondence should be addressed. Email: [email protected], [email protected]

Received ??; Revised ??; Accepted ??

© ?? The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The development of algorithms for designing artificial RNA sequences that fold into specific secondary structures has many potential biomedical and synthetic biology applications. To date, this problem remains computationally difficult, and current strategies to address it resort to heuristics and stochastic search techniques. The most popular methods consist of two steps: first, a random seed sequence is generated; next, this seed is progressively modified (i.e. mutated) to adopt the desired folding properties. While computationally inexpensive, this approach raises several questions such as (i) the influence of the seed and (ii) the efficiency of single-path directed searches that may be affected by energy barriers in the mutational landscape.

In this paper, we present RNA-ensign, a novel paradigm for RNA design. Instead of taking a progressive adaptive walk driven by local search criteria, we use an efficient global sampling algorithm to examine large regions of the mutational landscape under structural and thermodynamical constraints until a solution is found. When considering the influence of the seeds and the target secondary structures, our results show that, compared to single-path directed searches, our approach is more robust, succeeds more often, and generates more thermodynamically stable sequences. An ensemble approach to RNA design is thus well worth pursuing as a complement to existing approaches. RNA-ensign is available at: http://csb.cs.mcgill.ca/RNAensign.

INTRODUCTION

The design of RNA sequences with specific folding properties is a critical problem in synthetic biology. Solving this problem is an important first step in controlling bio-molecular systems, which can have profound biomedical implications; indeed, it has already proven useful in modifying HIV-1 replication mechanisms (1), reprogramming cellular behavior (2), and designing logic circuits (3).

Here, we aim to design RNA sequences that fold into specific secondary structures (a.k.a. inverse folding). Even in this case, efficient computational formulations remain difficult, with no exact solutions known. Instead, the solutions available today rely on local search strategies and heuristics. Indeed, the computational difficulty of the RNA design problem was proven by M. Schnall-Levin et al. (4).

One of the first and most widely known programs for the RNA inverse folding problem is RNAinverse (5). The search starts with a seed sequence specified by the user. At each step thereafter, RNAinverse compares the minimum free energy (MFE) structure of the current sequence (i.e. the structure computed from a structure prediction algorithm) with the target structure to determine the mutations to perform; it attempts to traverse the mutational landscape in the direction that improves the current MFE structure’s similarity to the target.

Better RNA design tools have been subsequently developed. To our knowledge, the best programs currently available are INFO-RNA (6), RNA-SSD (7, 8) and NUPACK (9). Other programs such as rnaDesign (10) or RNAexinv (11) have also demonstrated improvement over RNAinverse. Conceptually, however, all current approaches rely on the same principle, which can be delineated in two steps: (i) selection of a seed, and (ii) a (stochastic) local search that aims to mutate the seed to fit the target structure.

The traditional single-sequence iterative-improvement approach is simple and computationally fast: at each point only the next possible point mutations need to be computed and evaluated for fitness so that the best one can be chosen. However, the sequences generated by this approach suffer from several shortcomings. Firstly, due to the presence of energy barriers in the mutational landscape, some good sequences (in terms of structure fit and energetic properties) might be difficult to reach from a given seed. Even worse, sometimes arbitrary initial choices made by such methods can irrevocably bias a search to produce ineffectual designs. For example, since it is easy to grow existing stem structures by single-point sequence mutations, the search can initially take off in the direction of “improving” the structural fit by growing stems, only to falter when other structural elements and rearrangements require multiple point mutations. Finally, constraining the search to directions that improve the structural fitness function in the initial phases of the search runs counter to biological reality because it rewards mutations that bring the structure “closer” to the desired shape but do not directly improve function (e.g., the binding affinity for some ligand).

In this paper, we present RNA-ensign, a novel and complementary approach to the RNA design problem that uses global sampling of an energetic ensemble model of the RNA mutation landscape. More precisely, starting from a random seed sequence, our scheme computes the Boltzmann distribution of all k-mutants of the seed and samples from these ensemble sequences (12). RNA-ensign starts by looking at all samples with one mutation (i.e. k = 1) and increments this number k until it finds a mutant whose MFE structure matches the design target’s secondary structure. Unlike the classical RNA design schemes, this approach largely decouples the forces controlling the walk in the mutational landscape from the stopping criterion.

We analyze design choices and show that, compared to local searches, our global sampling approach has advantages. While the importance of the choice of seed is widely acknowledged, to our knowledge, very few exhaustive studies have quantified its importance as precisely as we do here. We also present an analysis of the strengths and weaknesses of the novel global sampling approach introduced here. While it generates more thermodynamically stable sequences at a high success rate, it is computationally more expensive than local search approaches. Nonetheless, our current implementation can be run on structures with sizes up to 200 bp, and thus reaches the current limit of accuracy for base pairing predictions with nearest-neighbour energy models (13, 14).

This study aims to provide a complete comparison of our ensemble-based energy optimization approach with the classical path-directed searches. We compare RNA-ensign with RNAinverse, NUPACK, and, when possible, RNA-SSD and INFO-RNA. Nevertheless, RNAinverse must be seen as the fairest and most instructive comparison, as it is the only path-directed software that decouples the initialization (i.e. the seed) from the optimization strategy and that uses the same stopping criterion as RNA-ensign.

We show that our global search approach has several attractive features: it is successful significantly more often, [...] structures. To conclude this study, we apply our techniques to reengineer riboswitches and show that, with only a few mutations, RNA-ensign enables us to tune the folding properties of RNA molecules. Such applications may prove to be useful for synthetic biology studies (2, 17, 18, 19).

MATERIAL AND METHODS

Overview of algorithm

The low-energy ensemble of a structure. Let $S^*$ be a fixed target structure of length n. The low-energy ensemble of $S^*$ consists of sequences w that can fold to $S^*$, with each such sequence being assigned a certain probability. The probability of a sequence w is proportional to $e^{-E/RT}$, where E is the energy of w when folding to $S^*$. Here, R is the gas constant, and T is the absolute temperature. The constant of proportionality is the sum of the above quantities over all sequences that can fold to $S^*$.

Using our RNAmutants algorithm (12, 20), we can sample, in polynomial time and space, sequences from the low-energy ensemble of a given $S^*$ (a brute force approach would result in an exponential time algorithm). This is done by setting $S^*$ as a structural constraint when invoking the program.

In this paper, we will in fact be concerned with the low-energy ensemble of $S^*$ around a certain seed sequence $a_0$ (which we will also call the mutant ensemble). This involves sampling k-mutants of $a_0$ (i.e., sequences differing from $a_0$ in exactly k places) with probabilities proportional to the quantities above (we get the constant of proportionality by summing only over k-mutants).

The samples from the low-energy ensemble around a given seed will be our candidate sequences in the design algorithm.

Sampling from a structure’s sequence ensemble. To motivate our ensemble-based design approach, we first examine how our sequence search technique (ensemble sampling) differs from sequences sampled uniformly at random from those sequences that can fold to our structure. To this end, we randomly select two RNA secondary structures (of 47 and 61 nucleotides) from the RNA STRAND database (21), and sample one hundred k-mutants of a random seed (i.e., differing from the seed by k point mutations) for each structure, both (a) uniformly from all k-mutants that can fold to the target structure, and (b) with weight corresponding to the probability of the sequence in the ensemble of k-mutants folding to the given structure.
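To make the two sampling schemes (a) and (b) concrete, the following is a minimal sketch on a toy problem. It is not the RNAmutants algorithm: it enumerates k-mutants by brute force (exponential in k, which is exactly what RNAmutants avoids), and the energy function `toy_energy` with its table `TOY_PAIR_ENERGY` is a hypothetical stand-in for the nearest-neighbour model. All function names, the example seed, and the temperature of 37 °C are assumptions made for illustration only.

```python
import itertools
import math
import random

R = 0.0019872   # gas constant in kcal/(mol*K)
T = 310.15      # absolute temperature in K (37 C; assumed for this sketch)

# Hypothetical per-pair energies in kcal/mol; NOT the nearest-neighbour model.
TOY_PAIR_ENERGY = {
    ("G", "C"): -3.0, ("C", "G"): -3.0,
    ("A", "U"): -2.0, ("U", "A"): -2.0,
    ("G", "U"): -1.0, ("U", "G"): -1.0,
}

def base_pairs(structure):
    """Parse a dot-bracket string into a list of (i, j) paired positions."""
    stack, pairs = [], []
    for i, c in enumerate(structure):
        if c == "(":
            stack.append(i)
        elif c == ")":
            pairs.append((stack.pop(), i))
    return pairs

def toy_energy(seq, pairs):
    """Toy energy of seq on the target; None if seq cannot form every pair."""
    total = 0.0
    for i, j in pairs:
        e = TOY_PAIR_ENERGY.get((seq[i], seq[j]))
        if e is None:
            return None
        total += e
    return total

def k_mutants(seed, k):
    """Yield every sequence differing from seed in exactly k positions."""
    for positions in itertools.combinations(range(len(seed)), k):
        choices = [[b for b in "ACGU" if b != seed[p]] for p in positions]
        for subs in itertools.product(*choices):
            s = list(seed)
            for p, b in zip(positions, subs):
                s[p] = b
            yield "".join(s)

def mutant_ensemble(seed, structure, k):
    """Return compatible k-mutants and their Boltzmann probabilities."""
    pairs = base_pairs(structure)
    seqs, weights = [], []
    for m in k_mutants(seed, k):
        e = toy_energy(m, pairs)
        if e is not None:
            seqs.append(m)
            weights.append(math.exp(-e / (R * T)))
    z = sum(weights)  # normalisation: sum over compatible k-mutants only
    return seqs, [w / z for w in weights]

def sample_mutants(seed, structure, k, n, rng, boltzmann=True):
    """Draw n k-mutants: Boltzmann-weighted (b) or uniform (a)."""
    seqs, probs = mutant_ensemble(seed, structure, k)
    if not seqs:
        return []
    return rng.choices(seqs, weights=probs if boltzmann else None, k=n)

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_mutants("GAAAC", "(...)", 1, 5, rng, boltzmann=False))
    print(sample_mutants("GAAAC", "(...)", 1, 5, rng, boltzmann=True))
```

In this toy setting the weighted scheme concentrates samples on mutants that keep the strongest pair, whereas the uniform scheme treats every compatible mutant alike, which is the contrast the comparison above is designed to expose.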
