Viennarna Package
Total Page:16
File Type:pdf, Size:1020Kb
Lorenz et al. Algorithms for Molecular Biology 2011, 6:26 http://www.almob.org/content/6/1/26 SOFTWAREARTICLE Open Access ViennaRNA Package 2.0 Ronny Lorenz1*, Stephan H Bernhart1, Christian Höner zu Siederdissen1, Hakim Tafer2, Christoph Flamm1, Peter F Stadler2,1,3,4,5,6 and Ivo L Hofacker1,3,7* Abstract Background: Secondary structure forms an important intermediate level of description of nucleic acids that encapsulates the dominating part of the folding energy, is often well conserved in evolution, and is routinely used as a basis to explain experimental findings. Based on carefully measured thermodynamic parameters, exact dynamic programming algorithms can be used to compute ground states, base pairing probabilities, as well as thermodynamic properties. Results: The ViennaRNA Package has been a widely used compilation of RNA secondary structure related computer programs for nearly two decades. Major changes in the structure of the standard energy model, the Turner 2004 parameters, the pervasive use of multi-core CPUs, and an increasing number of algorithmic variants prompted a major technical overhaul of both the underlying RNAlib and the interactive user programs. New features include an expanded repertoire of tools to assess RNA-RNA interactions and restricted ensembles of structures, additional output information such as centroid structures and maximum expected accuracy structures derived from base pairing probabilities, or z-scores for locally stable secondary structures, and support for input in fasta format. Updates were implemented without compromising the computational efficiency of the core algorithms and ensuring compatibility with earlier versions. Conclusions: The ViennaRNA Package 2.0, supporting concurrent computations via OpenMP, can be downloaded from http://www.tbi.univie.ac.at/RNA. Background be approximated as a hierarchical process in which sec- A typical single stranded-nucleic acid molecule has the ondary structure is formed before tertiary structure [1]. propensity to form double helical structures causing the The dominance of base pairing and the confinement molecule to fold back onto itself. Simple rules of com- to a single interaction partner makes it possible to plementary base pairing govern this process, which model RNA (and DNA) secondary structures at a purely results in a regular pattern of Watson-Crick and GU combinatorial level, completely ignoring both atom-scale pairings (helices) and intervening stretches of less regu- details and spatial embeddings. Formally, an RNA sec- larly ordered nucleotides (loops), collectively known as ondary structure is a (labeled) graph whose nodes repre- the molecule’s secondary structure. Secondary structure sent nucleotides. The edge set contains edges between elements may be placed in close spatial proximity allow- consecutive nodes (i, i + 1) representing the phosphate ing additional non-covalent interactions. These are not backbone as well as edges between base pairs. For the as frequent and often are energetically less favorable latter, the following conditions must hold: compared to canonical base pairs, thus rendering the 3- dimensional tertiary structure of an RNA to be domi- 1. base pair edges are formed only between nucleo- nated by the underlying scaffold of the secondary struc- tides that form Watson-Crick or GU base pairs; ture. The canonical base pairing governs not only the 2. no two base pair edges emanate from the same thermodynamics but also the folding kinetics, which can vertex, i.e., a secondary structure is a matching; 3. base pair edges span at least three unpaired bases; ’ ’ * Correspondence: [email protected]; [email protected] 4. if the vertices are placed in 5 to 3 order on the 1Institute for Theoretical Chemistry and Structural Biology, University of circumference of a circle and edges are drawn as Vienna, Währingerstraße 17/3, A-1090 Vienna, Austria straight lines, no two edges cross. Full list of author information is available at the end of the article © 2011 Lorenz et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Lorenz et al. Algorithms for Molecular Biology 2011, 6:26 Page 2 of 14 http://www.almob.org/content/6/1/26 The last condition ensures that the graph is outerpla- the thermodynamic and the machine learning approach nar and therefore excludes so-called pseudo-knots. to RNA bioinformatics proposing to learn a parameter Matching problems usually have cost functions deter- set for a SCFG that structurally matches the standard mined by edge-weights. The earliest predictions of RNA energy model [18]. secondary structures in the early 1970s indeed used Several other tools implement dynamic programming such simple energy models [2]. Detailed melting experi- based RNA secondary structures prediction: UNAfold ments, however, soon showed that a different, more [19] is the successor of the original mfold program and complex type of energy function is necessary to properly adds suppport for predicting RNA-RNA hybridization. model the thermodynamics of nucleic acid structures. RNAstructure [20] started as a reimplementation of Instead of individual base pairs, the energy contributions mfold with a graphical user interface in Windows, but are dominated by base-pair stacking and the destabiliz- is now available for other platforms and has added sev- ing entropic effects of unpaired “loops”.Sequence- eral additional algorithms such as partition function dependent energy parameters for these building blocks folding and suboptimal structures. The NUPACK suite contribute to a very good approximation additively to [21] focuses on folding of several interacting RNA the folding energy [3]. Over the last two decades, this strands and design problems. The group around Kiyoshi additive standard energy model has been repeatedly Asai developed several tools focusing the usage of cen- refined and updated, see e.g. [4-9]. troid and maximum expected accuracy (MEA) estima- The RNA folding problem is solvable by means of tors, see e.g. [22]. Ye Ding’s Sfold program [23] was dynamic programming. The simplest version, known as the first to introduce stochastic structure sampling. The maximum circular matching problem, accounts for base group around Robert Giegerich provides several RNA pairing energies only [10,11]. In the early 1980s Nussi- related tools, notably the RNAshapes [24] program. nov and Jacobson [12] and Michael Zuker with colla- The Vienna RNA Package [25]hasitsrootsina borators [13,14] demonstrated that the loop-based series of large-scale simulation studies aiming at an energy model is also amenable to the same algorithmic understanding of adaptive evolution on rugged fitness ideas. Their work made computational RNA structure landscapes [26-28] and the statistical properties of the prediction accurate and efficient enough for practical sequence-structure relationships of RNA [29-31] rather use, resulting in the first versions of mfold. A decade than the detailed analysis of individual RNA molecules later, John McCaskill realized that the dynamic pro- of biological interest. The primary design goals for its gramming recursions can be adapted to compute the implementation in the early 1990s, therefore, were two- partition function of an equilibrium ensemble of RNA fold. First and foremost, the basic folding algorithms molecules [15], paving the way for efficient computa- were to be implemented so as to be as efficient as possi- tional access to accurate thermodynamic modelling ble in their usage of both CPU and memory resources. without exceeding an asymptotic time complexity of The core algorithms are accessible as a C library, which O(n3) . later on was also equipped with Perl bindings to facili- The secondary structure model of RNA perfectly fits tate interoperability with this commonly used scripting together with modern genomics and transcriptomics language. Secondly, the interactive programs were to be since it works at the same level of abstraction, treating used mostly in (shell-script) pipelines, hence they use a nucleotides as basic entities. With the increasing avail- simple command-line interface and, where possible, they ability of RNA sequence data, and the realization that read from and write to a stream. This feature made it many of the functional RNAs have evolutionary well- easy to construct a suite of web services [32] providing conserved secondary structures, many research groups easy access to most functionalities of the Vienna RNA developed a plethora of specialized tools for various Package. With the rising tide of first genomics and aspects of RNA bioinformatics. As an alternative to the then transcriptomics data, the need for both efficient direct measurement of thermodynamic parameters, for implementation and easy incorporation into pipelines instance, machine learning approaches employing sto- remained, even though the focus gradually shifted from chastic context free grammars (SCFG) were introduced large-scale simulation to large-scale data analysis. Little e.g. in the infernal suite [16,17]. The algorithmic has changed in the core folding algorithms in the 17 work horses of the SCFG approach, the Cocke-Younger- years since the first publication [25] of the package. On Kasami (CYK), the inside and the outside algorithms, the other hand,