Arxiv:1603.08952V1 [Q-Bio.BM] 29 Mar 2016 Solution [11, 12]

RNA Folding Pathways in Stop Motion Sandro Bottaro 1;∗, Alejandro Gil-Ley 1, Giovanni Bussi 1;∗ 1 Scuola Internazionale Superiore di Studi Avanzati, International School for Advanced Studies, 265, Via Bonomea I-34136 Trieste, Italy We introduce a method for predicting RNA folding pathways, with an application to the most important RNA tetraloops. The method is based on the idea that ensembles of three-dimensional fragments extracted from high-resolution crystal structures are heterogeneous enough to describe metastable as well as intermediate states. These ensembles are first validated by performing a quantitative comparison against available solution NMR data of a set of RNA tetranucleotides. Notably, the agreement is better with respect to the one obtained by comparing NMR with extensive all-atom molecular dynamics simulations. We then propose a procedure based on diffusion maps and Markov models that makes it possible to obtain reaction pathways and their relative probabilities from fragment ensembles. This approach is applied to study the helix-to-loop folding pathway of all the tetraloops from the GNRA and UNCG families. The results give detailed insights into the folding mechanism that are compatible with available experimental data and clarify the role of intermediate states observed in previous simulation studies. The method is computationally inexpensive and can be used to study arbitrary conformational transitions. I. INTRODUCTION tures, which are then assembled together to form com- plete models. This idea lead to the development of suc- cessful fragment assembly algorithms for RNA [13{16] Despite its simple 4-letters alphabet, RNA exhibits an and proteins [17]. Fragment assembly methods often rely amazing complexity, which is conferred by its ability to on the working hypothesis that frequencies of appearance engage and interconvert between a variety of specific in- in solved structures can be considered as an approxima- teractions with itself as well as with proteins and ions [1]. tion to the underlying Boltzmann distribution. While A delicate balance between many different factors, such this latter statement is in general not true, the statis- as hydrogen bonding, stacking interactions, electrostat- tics obtained from protein structures in the protein data ics and backbone/sugar flexibility, determines structure bank (PDB) were shown to reproduce distributions as and dynamics of RNA. These interactions are explicitly measured by NMR spectroscopy [18] as well as quantum described in atomistic molecular dynamics (MD) simula- mechanical calculations [19, 20]. Additionally, the suc- tions, that represent an important computational tool for cessful applications of statistical potentials to different the investigation of RNA dynamics. Since the first MD RNA structure prediction problems suggest this approx- simulation on tRNA [2], the accuracy of atomistic force imation to be in practice acceptable [21]. fields has steadily improved, to the point that it is nowa- days possible to obtain stable trajectories on the µs time By pushing these assumptions further, one could imag- scale for A-form double helices and tetraloops [3, 4]. MD ine the possibility to use fragment libraries to predict not simulations have also proven useful to aid the interpreta- only the most stable conformations but also intermedi- tion of experimental data for structured RNAs [5, 6] and ate and excited states [22], so as to provide a description protein-RNA complexes [7{9]. However, models used for of reactive pathways. For example, the conformational nucleic acids are still significantly less accurate compared variations observed in crystal structures of RNA triloops to those used for proteins [10]. Recent simulations on sys- have been linked to their internal dynamics [23], and a tems amenable to converged sampling showed that none simple isomerization process occurring in the backbone of the current atomistic force fields correctly reproduces of a small peptide was recently rationalised by analysing the behaviour of single stranded RNA tetranucleotides in high-energy structures trapped in the PDB [24]. arXiv:1603.08952v1 [q-bio.BM] 29 Mar 2016 solution [11, 12]. As a consequence, understanding the A further methodological step is however required folding dynamics of small motifs such as RNA tetraloops to reconstruct the dynamics of systems undergoing is extremely challenging using MD, both because of the large conformational changes, possibly involving multi- force-field limitations and of the high computational cost. ple pathways, when only an ensemble of structures at From a different perspective, structural bioinformatics equilibrium is given. Of particular interest in this con- approaches to study RNA exploit the empirical relation- text is the concept of diffusion map [25], which describes ship between sequence and three-dimensional structure. the long timescale dynamics of a complex system using a These methods typically proceed by first extracting lo- transition matrix directly calculated from the data. This cal conformations from a database of experimental struc- important theoretical tool was applied to characterise the dynamics of proteins systems [26]. Interestingly, the slow eigenmodes of the transition matrix have been shown to be consistent with an analysis based on transition net- ∗To whom correspondence should be addressed. Email: sbot- works of long molecular dynamics simulations [27]. This [email protected], [email protected] result suggests that the reactive pathways of macromolec- 2 ular systems can be obtained from equilibrium ensembles ory [32], we construct the Markov matrix T dividing by P alone. the diagonal degree matrix D, defined as Dii = j Kij. In this Paper we show that the structural ensemble To obtain a T matrix that is simultaneously normalised of RNA fragments extracted from the PDB exhibits re- and symmetric we introduce here an iterative normal- markably good agreement with available NMR experi- −1=2 −1=2 isation procedure T(t+1) = D(t) T(t)D(t) . Here the mental data for five tetranucleotides. We then consider matrix product is implicit, the subscript (t) indicates the the fragment ensembles of the two most important fam- iteration, and T(0) = K. At convergence, this procedure ilies of RNA tetraloops: GNRA and UNCG (R=A/G yields a matrix T that can be interpreted as a transi- and N=any). By using a formalism related to diffu- tion probability matrix. Here Tij is the probability of sion maps, we build a random walk on a graph in which observing a direct transition from state i to state j. T is each node is an experimental three-dimensional fragment P normalised ( j Tij = 1) and has an uniform equilibrium and edges are weighted with an appropriate measure of P distribution over the ensemble of fragments ( i Tij = 1). similarity [28]. The properties of the resulting random This procedure is closely related to the most common walks are then analysed using the standard machinery version of graph Laplacian normalisation. However, we of Markov state modeling [29] to obtain detailed folding notice that in the usual procedure the transition matrix trajectories. is made non symmetric by normalisation, and thus has We call the method for obtaining folding path- an equilibrium distribution that is not necessarily uni- ways from experimental fragments stop-motion modeling form. The advantage of our formulation is that the av- (SMM), in analogy with the animation technique that erages computed from the random walk are by construc- produces the impression of motion through juxtaposition tion identical the to ensemble averages obtained from the of static pictures. SMM is a very general tool for obtain- original set of structures. ing transition pathways, it is computationally inexpen- 3. Transition pathways between states. Dynamical sive, and it is therefore ideally suited to complement MD properties of the system are calculated from the tran- simulations and to aid the interpretation of NMR spec- sition matrix T as described in Ref. [29], and briefly re- troscopy data and kinetic experiments. ported in Supporting Information 1 (SI1) for clarity. The flux calculations are performed using pyEMMA [33]. To facilitate the analysis of the fluxes, nodes are lumped to- II. MATERIALS AND METHODS gether using a standard spectral clustering procedure [34] on the transition matrix T . Notice that the lumping is Stop motion modeling We here describe how to set only used to compute the aggregate fluxes, and does not up the stop motion modeling (SMM) procedure. influence in any way the underlying Markov model. We 1. Pairwise distances. We first calculate all pairwise set the number of clusters to 25 and 45 for UNCG and distances between all 3D structures within an ensemble GNRA tetraloops, respectively. We empirically verified as dij = ERMSD(xi; xj). Here xi and xj are the coor- that the fluxes are robust with respect to the number dinates of structures i and j and ERMSD is an RNA- of clusters. The flux between distant structures (com- specific metric based on the relative orientation of nucle- pared to the Gaussian width σ) depends on the number obases only [28]. This measure has the important prop- of transition pathways connecting them. If the overall erty of being highly related to the temporal distance: this transition requires crossing intermediate states with very means that when two structures are close in ERMSD dis- low population, corresponding to high free-energy barri- tance, then they are also kinetically close. In a previous ers, the resulting flux will be very small. In the limit of work [28], we have shown that this property is satisfied to missing intermediates, the graph becomes disconnected a larger extent by ERMSD compared to standard RMSD and the resulting flux is zero. after optimal alignment [30] as well as distance RMSD 4. Low-dimensional embedding. For the sake of vi- measures. Additionally, we have proven ERMSD to be sualisation, folding pathways and clusters are projected accurate in recognising known RNA motifs within the on a low dimensional space. To this aim we use the dif- structural database.

Arxiv:1603.08952V1 [Q-Bio.BM] 29 Mar 2016 Solution [11, 12]

Free-Energy Landscape of a Hyperstable RNA Tetraloop

Structure and Thermodynamics of a Conserved U2 Snrna Domain from Yeast and Human

RNA Tertiary Structure Energetics Predicted by an Ensemble Model of the RNA Double Helix

Unexpected Dynamics in the UUCG RNA Tetraloop

Anrnafoldingmotif:Gnratetraloop^Receptor Interactions

Tetranucleotide Loops GAAA and UUCG in Fission Yeast

Computational Design of Asymmetric Three-Dimensional RNA Structures and 2 Machines

Independent Evolution of Tetraloop in Enterovirus Oril Replicative Element and Its Putative Binding Partners in Virus Protein 3C

Free-Energy Landscape of a Hyperstable RNA Tetraloop

Free Energy Landscape of GAGA and UUCG RNA Tetraloops” Authors: Sandro Bottaro, Pavel Banas, Jiri Sponer, and Giovanni Bussi

Nucleic Acids: Theory and Computer Simulation, Y2K David L Beveridge* and Kevin J Mcconnell†

A General Module for RNA Crystallization Adrianr.Ferreâ-D'amareâ,Kaihongzhouandjennifera.Doudna*