CoFold: an RNA structure prediction method that takes co-transcriptional folding into account

by

JEFFREY RYAN PROCTOR

BSEng, The University of Victoria, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES

(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

September, 2012

© Jeffrey Ryan Proctor, 2012

Abstract

RNA has a diverse array of cellular functions, and relies on molecular structure to carry them out. The vast majority of current methods for prediction of RNA secondary structure (i.e. the set of base pairs in the molecule) consider the minimum free energy structure (i.e. the most thermodynamically stable structure), and thus disregard the process of structural formation. There exists substantial evidence that the process of structure formation is important, and that it does impact the resulting functional RNA structure. Several methods currently exist that explicitly simulate the kinetic folding pathway as a time-ordered sequence of structural changes. However, these methods not only suffer from a long list of limiting assumptions about the cellular environment, but also are restricted to short sequences. In this thesis, we explore the idea of capturing the effects of kinetic folding rather than simulating in detail the process over time, and propose that accounting for kinetic effects of structural formation is crucial to further improve non-comparative RNA secondary structure prediction methods. During transcription, RNA structure begins to form immediately as the molecule emerges from the polymerase (i.e. co-transcriptionally). Long-range base pairs suffer a disadvantage during this process, as quickly-forming short-range base pairs act to block their formation (i.e. due to kinetic barriers). We propose a novel method, CoFold, that captures the reachability of potential pairing partners during co-transcriptional folding. We show that it significantly improves prediction accuracy over free energy minimization alone, particularly for long sequences. CoFold depends on only two free parameters that are highly correlated, and we demonstrate robust training.
Furthermore, the resulting structures predicted by CoFold have a free energy measurement that does not significantly differ from that of the respective RNAfold prediction, indicating that they are indeed stable secondary structures. We propose that consideration of kinetics in RNA secondary structure prediction is crucial, and we hope that this work encourages further exploration of its effect on biological RNA structure.

Contents

Abstract
Contents
List of Figures
List of Tables
Acknowledgements

1 Introduction
  1.1 RNA secondary structure
    1.1.1 Secondary vs. tertiary structure
    1.1.2 Components of RNA structure
  1.2 Experimental determination of RNA structure
    1.2.1 X-ray crystallography
    1.2.2 Nuclear magnetic resonance spectroscopy
    1.2.3 Chemical and enzymatic probing
  1.3 Computational non-comparative RNA Secondary Structure Prediction
    1.3.1 Nussinov algorithm
    1.3.2 Zuker-Stiegler algorithm
    1.3.3 Prediction of pseudoknotted structures
    1.3.4 Suboptimal folding
    1.3.5 Equilibrium partition function
    1.3.6 Sfold
    1.3.7 Prediction guided by chemical probing data
    1.3.8 CONTRAfold
  1.4 Computational comparative RNA Secondary Structure Prediction
    1.4.1 Covariance models
    1.4.2 Pfold
    1.4.3 RNA-Decoder
    1.4.4 RNAalifold
    1.4.5 Alignment-free methods
  1.5 Thermodynamics alone does not tell the whole story
    1.5.1 Importance of kinetics
    1.5.2 Co-transcriptional Folding
    1.5.3 Kinetic folding methods
  1.6 Goals of this thesis

2 CoFold
  2.1 Motivation
  2.2 CoFold Algorithm
  2.3 Implementation
  2.4 Compilation of the long and combined data sets
    2.4.1 The long data set
    2.4.2 The combined data set
  2.5 Parameter training
    2.5.1 Training procedure
    2.5.2 Clade-specific parameter training
  2.6 CoFold Performance
  2.7 Calculation of free energy differences
  2.8 Case studies
  2.9 CoFold web server

3 Discussion

4 Future Work

References

Appendices
  Definition of covariation metric
  Data set statistics
  Detailed free energy difference distribution

List of Figures

1 RNA secondary structure components
2 Diagram of an RNA pseudoknot
3 Example of base pair covariation
4 Scaling function γ of CoFold
5 Visualization of the indices involved in CoFold recursion formulas
6 Heatmap of performance measurements for all (α, τ) combinations
7 Difference in predictive accuracy between CoFold and RNAfold (TPR vs PPV)
8 Difference in predictive accuracy between CoFold and RNAfold (TPR vs FPR)
9 Distribution of relative free energy difference between predicted structures and the respective RNAfold-predicted minimum free energy structures in the long data set
10 Distribution of absolute free energy difference (in kcal/mol) between predicted structures and the respective RNAfold-predicted minimum free energy structures in the long data set
11 RNA secondary structures predicted by RNAfold and CoFold-A for the 23S rRNA of the gamma-proteobacteria Pseudomonas aeruginosa
12 RNA secondary structures predicted by RNAfold and CoFold-A for the 16S rRNA of the fresh-water algae Cryptomonas sp. (species unknown)
A1 Distributions of relative free energy difference between predicted structures and the respective minimum free energy structures predicted by RNAfold
A2 Differences in prediction accuracy versus relative free energy changes between the predicted structures and the MFE structures predicted by RNAfold

List of Tables

1 Definition of method names used throughout the text
2 Composition of the long and combined datasets
3 Performance metrics for CoFold and RNAfold
4 Summary of relative free energy difference between predicted structures and the respective minimum free energy structures predicted by RNAfold
5 Details of the linear fits to the ∆MCC versus %∆∆G distributions
A1 RNA families of the long and the combined data sets
A2 Alignment quality and phylogenetic support for the reference RNA secondary structures

Acknowledgements

I would like to thank my supervisor, Irmtraud Meyer, as well as my committee members, Joerg Gsponer and Anne Condon. I would also like to thank all the members of the Meyer group, particularly Daniel Lai for the invaluable R code he has written as part of the R4RNA package, and Adi Steif for immensely helpful advice regarding statistics. My sources of funding include the MSFHR/CIHR Bioinformatics Training Program, and the Natural Sciences and Engineering Research Council (NSERC) Canada Graduate Scholarship.

1 Introduction

1.1 RNA secondary structure

Ribonucleic acids (RNA) perform a wide variety of essential and well-defined roles in the cell. Whether it is protein-coding mRNA, or the myriad non-coding RNAs, RNA molecules exert their function by assuming structural features. Transfer RNA (tRNA), ribosomal RNA (rRNA), and messenger RNA (mRNA) act in concert to carry out faithful translation of the genetic code. The structure of transfer RNA is vital not only to interact with codons within messenger RNA, but also for robust recognition by aminoacyl-tRNA synthetases, the enzymes that attach the appropriate amino acid to each tRNA [107]. The complex structure of ribosomal RNAs acts not only as structural scaffolding of the ribosome; rRNA is in fact also responsible for its catalytic peptidyl transferase activity [124]. The untranslated regions (UTR) at the ends of messenger RNAs are repositories of RNA structural features responsible for regulation of the respective protein product. Riboswitches are often found in the UTRs of metabolic genes, where they modulate expression by actively binding to small metabolites [89, 115, 119]. Conserved hairpins in the UTRs of genes involved in iron metabolism are essential for regulation of cellular iron levels in mammals [47]. MicroRNA response elements, commonly found in UTRs of protein-coding genes, are short sequences with near-perfect complementarity to microRNAs, short RNA molecules which cause repression upon binding [105]. In the 1980s, it was discovered that RNA, in addition to proteins, can play a catalytic role. For instance, group I and II introns contain RNA structure that catalyzes its own excision from the transcript [10, 14]. Ribonuclease P (RNase P) is a catalytic RNA that cleaves tRNA precursors during their maturation pathway [32]. Viral genomes have been found to depend highly on RNA structural features, ostensibly due to the need to fit many functions into a compact genome.
The hepatitis delta virus genome has a complex RNA structure that is necessary for viral replication [97]. In the HIV genome, RNA structure has been implicated in transcription initiation, genome dimerization, virion packaging, and a regulated frame-shift mechanism [8, 19, 35]. Thus, RNA structure is widespread, functionally diverse, and biologically crucial to all organisms. RNA is a linear, covalently-linked polymer that is composed of four ribonucleotide monomers: adenine (A), cytosine (C), guanine (G), and uracil (U). DNA is similarly composed of monomers called deoxyribonucleotides, which lack a hydroxyl (OH) chemical group attached to the sugar component. Because of this chemical difference, RNA and DNA typically assume a different three dimensional helical shape: RNA assumes the so-called A-form helix, while DNA is in the B-form helix. Base pairs are the fundamental building blocks of RNA structure, and form through the interface of two complementary nucleotides (G with C, A with U, or G with U). Much like DNA, RNA molecules have directionality (i.e. a distinct 5’ and 3’ end). Base pairs always form in an anti-parallel configuration, where the two strands are parallel to one another, but run in opposite directions. In the nucleus of eukaryotic cells, the genome consists of a double stranded complex of DNA, comprising two long complementary molecules that form a contiguous zipper of base pairs. Conversely, RNA molecules are typically single stranded when they are synthesized, and form local base-pairing interactions with other regions of the same molecule (intramolecular interactions), and base-pairing interactions with other RNA molecules (intermolecular interactions). Intermolecular interactions will be entirely disregarded for the purpose of this thesis, but certainly play an important role in formation of RNA structure (see [78] for a review of this topic).

Figure 1. RNA secondary structure components. The circles represent nucleotides, and the black line is the linear sugar-phosphate backbone. Base pairs are indicated by blue lines. [A] A simple RNA hairpin comprising a helix and hairpin loop (degree 1). [B] An RNA structure containing both a bulge and internal loop (degree 2). [C] A branched RNA structure comprising three helices and a multiloop (degree 3). Diagrams generated using the VARNA software package [21].

1.1.1 Secondary vs. tertiary structure

The sequence of RNA nucleotides in the molecule (i.e. the primary sequence) determines which intramolecular base pairs are possible, as a consecutive set of complementary nucleotides is required to form a helix. However, a nucleotide can only form a base pair with a single nucleotide at a time, and there are typically many potential pairing partners for any given nucleotide. A mutually compatible set of base pairs in an RNA molecule is referred to as the secondary structure. At this level of detail, we consider only the set of nucleotides that form canonical base pairs, and not the three-dimensional positions. The secondary structure in turn defines the three-dimensional conformations (i.e. the tertiary structures) of the RNA molecule that are sterically possible. Tertiary structures typically involve many interactions that differ from canonical base pairs (A-U, G-C, G-U) [110]. RNA tertiary structure prediction is a difficult problem, for which state-of-the-art methods are restricted to very short sequences. Most modern methods are evaluated using only sequences less than 70 nucleotides in length, e.g. [93]. RNA secondary structure prediction is a far simpler and more tractable problem than that of tertiary structure. Additionally, there is evidence that RNAs fold hierarchically, first forming the secondary structure elements (e.g. helices, loops), which then guide the tertiary contacts [11]. Experimental evidence indicates that tertiary interactions are indeed considerably weaker than secondary interactions, and thus are unlikely to alter the secondary structure once formed [108]. Therefore, the secondary structure can often be predicted without knowledge of tertiary structure. For most analyses, the RNA secondary structure provides a sufficient level of detail. Some RNA molecules, most notably tRNA, have highly modified, non-standard nucleotides that are key in formation of tertiary interactions.
In these RNAs, unusual non-standard base pairing will likely play an important role in their structure and function, and secondary structure analysis will have trouble modeling this. However, RNA secondary structure allows us to determine the overall domain composition of the molecule, the accessibility of each nucleotide, and the position of features that are likely to be functionally important. Furthermore, many tertiary structure prediction methods first perform secondary structure analysis, then search for tertiary structures consistent with that prediction. This thesis focuses entirely on the problem of RNA secondary structure prediction.

1.1.2 Components of RNA structure

RNA structure can be decomposed into basic structural components. The most essential component is the double helix formed by the interface of two complementary stretches. Adjacent sets of base pairs in a helix participate in an energetically favourable interaction called stacking, and thus base pairs are often considered in overlapping groups of two called stacks. Unpaired regions that are enclosed by an RNA base pair are referred to as loops, where the number of base pairs enclosing the loop is referred to as the degree. Loops with degree 1 are hairpin loops (Figure 1A). Interior loops and bulges have degree 2 (Figure 1B), where bulges have unpaired bases on only one of the two strands. Multiloops have degree 3 or greater, and result from complex, branched RNA structure (Figure 1C).

Figure 2. Diagram of an RNA pseudoknot. Black circles represent nucleotides, green lines indicate the linear RNA sugar-phosphate backbone, and red lines indicate base pairs. This configuration is a pseudoknot, because there exist two base pairs (i, l) and (k, j) such that i < k < l < j. Figure from [99].

A particular class of RNA secondary structure feature called pseudoknots plays important roles in structured RNAs in vivo. A number of structured RNAs contain pseudoknots, including the hepatitis delta virus replication mechanism [97], various self-splicing introns [1], the telomerase protein-RNA complex responsible for extension of chromosome tips [57], and a number of RNA structural features in viruses that induce frame shift during protein translation [109]. Pseudoknots are characterized by two or more RNA helices in a non-nested configuration, where the loop of one helix (i.e. the bases contained within its inner base pair) pairs with a region completely outside that helix. More precisely, there must exist two base pairs (i, l) and (k, j) such that i < k < l < j, where i, l, k, and j indicate sequence positions. A pseudoknotted structure is illustrated in Figure 2. Pseudoknots are of particular relevance for RNA secondary structure prediction as many methods disregard them for the sake of computational tractability.
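The crossing condition i < k < l < j can be checked directly on a list of base pairs. The following is a minimal sketch, assuming that a structure is given as a list of position tuples; the function name `is_pseudoknotted` and the example pairs are illustrative and do not come from any existing package.

```python
from itertools import combinations


def is_pseudoknotted(pairs):
    """Return True if any two base pairs (i, l) and (k, j) satisfy i < k < l < j."""
    # Normalise each pair so that the 5' position comes first.
    normalised = [tuple(sorted(p)) for p in pairs]
    for first, second in combinations(normalised, 2):
        # Order the two pairs by their 5' position before testing for a crossing.
        (i, l), (k, j) = sorted([first, second])
        if i < k < l < j:
            return True
    return False


# Hypothetical example structures (sequence positions are illustrative).
nested = [(1, 10), (2, 9), (3, 8)]   # a simple helix, fully nested
crossing = [(1, 10), (5, 15)]        # the second pair starts inside the first and ends outside
print(is_pseudoknotted(nested))      # False
print(is_pseudoknotted(crossing))    # True
```

The pairwise test is quadratic in the number of base pairs, which is adequate for checking a single structure.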

1.2 Experimental determination of RNA structure

1.2.1 X-ray crystallography

Experimental determination of RNA structure is a difficult process. X-ray crystallography, a technique widely used in protein structure determination, was used as early as the late 1960s to determine the structure of transfer RNA [58, 59]. It begins with a precarious process of amplification and purification of an RNA sample containing many copies of the same molecule. Some purification methods impose denaturation of the RNA structure, potentially resulting in a non-native conformation following renaturation [51]. The crystallization procedure then produces a solid RNA crystal from the homogeneous RNA sample using a buffer solution with high ionic concentration and low temperatures. Diffraction patterns are produced from the RNA crystal using a rotating x-ray generator, and are then analyzed to generate a model of the RNA molecule. Because x-ray crystallography requires RNA in a solid crystal with fewer water molecules than in vivo aqueous solution, its structure may not always reflect the in vivo functional structure, nor any structural features requiring dynamic movement of the molecule, i.e. alternative or transient features. The primary challenge in application of x-ray crystallography to RNAs is producing the RNA crystal itself.

1.2.2 Nuclear magnetic resonance spectroscopy

Nuclear magnetic resonance (NMR) spectroscopy is the second main method for RNA structure determination. Because the RNA molecule is analyzed in aqueous solution in which it can freely move, NMR provides insight into structural dynamics. NMR spectroscopy is based on the physical property of atomic nuclei with a so-called spin, i.e. those with an odd number of protons and/or neutrons. This spin causes the nucleus to absorb or emit electromagnetic radiation when in the presence of a magnetic field. Such atomic nuclei that are found in nucleic acids include $^{1}$H, $^{13}$C, $^{15}$N, and $^{31}$P [37]. Each atomic nucleus type has a specific electromagnetic frequency at which resonance occurs, and importantly, this frequency is affected by all nearby electron shells. That is, the nucleus will resonate differently according to which atoms and chemical bonds surround it. Thus, NMR spectroscopy can probe the chemical structure by analyzing the variation in resonance frequencies (i.e. the chemical shift). Because RNA molecules typically consist of many regions of similar structure (e.g. A-form RNA helices), it is often difficult to disentangle the signal contributions from different nuclei [37], and multi-dimensional NMR experiments (e.g. 2D-NOESY) are often used to resolve the overlapping signal. Importantly, NMR spectroscopy has a stricter length limitation than x-ray crystallography due to the complexity of the signal (roughly 100 nucleotides) [37]. However, because the RNA molecule is in aqueous solution, and is freely mobile, NMR spectroscopy tends to produce structural annotations that more accurately reflect functional, dynamic RNA structures in vivo. Furthermore, real-time NMR studies can be used to investigate the kinetics and mechanisms of RNA folding due to the dynamic nature of NMR spectroscopy [36], and its time resolution is roughly on the scale of seconds to minutes [36].

1.2.3 Chemical and enzymatic probing

Structural probing is a low resolution technique for RNA structure determination whereby the RNA molecule is treated with a chemical that preferentially modifies nucleotides according to the structural features in which they participate (and often according to the primary sequence as well). These methods employ a wide variety of chemical and enzymatic probes, including electrophiles, oxidants, nuclease mimics, and biological nucleases, each with different structural target sites [12]. For example, many nuclease enzymes only cleave at unpaired positions of the RNA, often at a specific subset of the four possible nucleotides, and thus many identical RNA molecules must be simultaneously analyzed using a variety of probes. In 2005, a refined method called SHAPE (Selective 2’-Hydroxyl Acylation analyzed by Primer Extension) was developed with greater sensitivity to nucleotide flexibility and broad recognition of all nucleotides [77]. This method relies on an electrophilic reagent that reacts with the 2’ hydroxyl of the RNA ribose backbone, only at positions with nucleotide flexibility. The reaction sites are subsequently detected by primer extension and gel resolution. This method was recently used to analyze structural features of the entire HIV RNA genome, which is roughly 9000 nucleotides in length [116]. Because most probes cannot cross the cell membrane, chemical probing techniques are mostly performed in vitro on renatured RNA molecules. Dimethyl sulfate (DMS) is an exception, as it readily crosses into the cell [12]. Furthermore, probing experiments only provide clues into which regions are structured or unstructured, and thus are typically used in conjunction with computational prediction (see Section 1.3.7), or to provide supporting evidence for a proposed structure.

1.3 Computational non-comparative RNA Secondary Structure Prediction

Because experimental determination of RNA structure is expensive and time-consuming, computational prediction is a valuable tool for RNA biologists and bioinformaticians. The number of potential secondary structures grows exponentially with the length of the sequence, and thus efficient algorithms are essential for feasible exploration of the search space. Moreover, the various existing prediction methods arrive at their predictions through a variety of criteria and hypotheses on how RNA structure forms. Methods for prediction of RNA secondary structure roughly fall into two categories: comparative and non-comparative. Comparative methods take as input a set of aligned sequences and typically aim to predict base pairs that are conserved through evolution. Non-comparative methods take only a single sequence to predict its functional secondary structure. The following section outlines existing methods of non-comparative RNA secondary structure prediction.
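The growth of the search space can be made concrete by counting the pseudoknot-free secondary structures of a sequence with a standard counting recurrence. The sketch below is illustrative only: the function name `count_structures`, the restriction to canonical pairs, and the minimum hairpin loop size of 3 unpaired nucleotides are assumptions made for this example and are not taken from the thesis.

```python
from functools import lru_cache

# Canonical base pairs (Watson-Crick plus the G-U wobble pair).
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases in a hairpin loop


def count_structures(seq):
    """Count pseudoknot-free secondary structures of seq, including the unfolded chain."""

    @lru_cache(maxsize=None)
    def count(i, j):
        # Subsequences too short to contain a base pair have exactly one structure.
        if j - i <= MIN_LOOP:
            return 1
        total = count(i, j - 1)  # case 1: position j is unpaired
        for k in range(i, j - MIN_LOOP):  # case 2: position j pairs with k
            if (seq[k], seq[j]) in CANONICAL:
                left = count(i, k - 1) if k > i else 1
                total += left * count(k + 1, j - 1)
        return total

    return count(0, len(seq) - 1) if seq else 1


# The count rises steeply as an (artificial) alternating sequence gets longer.
for length in (10, 20, 40):
    print(length, count_structures("GC" * (length // 2)))
```

Because the count itself is obtained by dynamic programming, it can be computed in polynomial time even though enumerating the structures it counts would not be feasible for long sequences.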

1.3.1 Nussinov algorithm

Developed in 1980, the Nussinov algorithm [90] was one of the first attempts at RNA secondary structure prediction. It is a dynamic programming algorithm that efficiently calculates the secondary structure with the greatest number of base pairs in $O(L^3)$ time, where $L$ denotes the length of the input sequence. The algorithm solves the problem recursively by determining the optimal structure for subsequences, and using these solutions to derive optimal structures for successively larger subsequences. The output structure is the optimal solution for the full sequence. This algorithm, however, has shortcomings. First, base pairs vary in their contribution to the overall stability of the molecule; for example, G-C pairs are energetically more favourable than A-U pairs. The Nussinov algorithm weights all pairs equally. Second, the stability of a base pair depends highly on its neighbouring base pairs due to so-called stacking interactions between adjacent pairs, and this contextual effect is also ignored by the algorithm.
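The recursion over successively larger subsequences can be sketched in a few lines. This is a minimal illustration of base-pair maximization rather than a reproduction of the published formulation: the function name `nussinov`, the `min_loop` parameter, and the dot-bracket output are assumptions made for this example.

```python
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=0):
    """Return (max number of nested base pairs, one optimal structure in dot-bracket form)."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # best[i][j] holds the maximum number of pairs within seq[i..j].
    best = [[0] * n for _ in range(n)]

    def score_with_pair(i, j, k):
        # Score when j pairs with k: pairs left of k, the pair itself, pairs enclosed by it.
        left = best[i][k - 1] if k > i else 0
        inner = best[k + 1][j - 1] if k + 1 < j else 0
        return left + 1 + inner

    # Fill the table for subsequences of increasing length: O(L^3) overall.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # j unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in CANONICAL:
                    score = max(score, score_with_pair(i, j, k))
            best[i][j] = score

    # Traceback: recover one structure that achieves best[0][n - 1].
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in CANONICAL and score_with_pair(i, j, k) == best[i][j]:
                structure[k], structure[j] = "(", ")"
                stack.append((i, k - 1))
                stack.append((k + 1, j - 1))
                break
    return best[0][n - 1], "".join(structure)


pairs, structure = nussinov("GGGAAACCC", min_loop=3)
print(pairs, structure)  # 3 (((...)))
```

Every pair contributes the same score of 1 regardless of its type or its neighbours, which is exactly the simplification criticized in the paragraph above.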

1.3.2 Zuker-Stiegler algorithm

The Zuker-Stiegler algorithm [126] is an advancement of the Nussinov algorithm, and was published soon after its predecessor in 1981. Rather than predicting the structure with the greatest number of pairs, the Zuker-Stiegler algorithm predicts the thermodynamically most favourable pseudoknot-free RNA structure. It relies on the concept of Gibbs free energy, a measurement of chemical potential of a system at constant temperature and pressure. The Gibbs free energy is minimized when the system reaches thermodynamic equilibrium, and the free energy in kcal/mol is defined as $G(T, p) = H - TS$, where $p$ is the pressure in pascals, $T$ is the temperature in Kelvin, $S$ is the entropy in kilocalories per Kelvin (kcal/K), and $H$ is the enthalpy in kilocalories (kcal). Note that the SI unit for energy is the joule (J), and thus the SI units for free energy, entropy, and enthalpy are J/mol, J/K, and J, respectively. However, due to common practice of tools for RNA structure analysis, we use only kilocalories throughout this text. The Zuker-Stiegler algorithm calculates the RNA secondary structure that minimizes the Gibbs free energy according to a set of pre-specified free energy parameters that are typically experimentally determined. This structure is denoted the minimum-free-energy (MFE) structure. The Zuker-Stiegler algorithm employs dynamic programming that assigns a sequence-specific free energy value to various structural building blocks, such as stacking interactions between adjacent base pairs, unpaired nucleotides, and hairpin loops. Some of these energy contributions are stabilizing (i.e. they have negative free energy) such as stacking interactions, and others are destabilizing (i.e. they have positive free energy) such as loops and bulges. The overall energy of the secondary structure is calculated as the sum of these small energies.
The algorithm utilises dynamic programming similarly to the Nussinov algorithm, but calculates two energy values for all subsequences $S_{i,j}$ of a given input sequence $S$, where $1 \le i < j \le L$, and $L$ denotes the sequence length:

• $C_{i,j}$: minimum free energy of subsequence $S_{i,j}$ given that nucleotides $i$ and $j$ form a base pair

• $FML_{i,j}$: minimum free energy of subsequence $S_{i,j}$

$$C_{i,j} = \min \begin{cases} \mathrm{hairpin}_{i,j} & \text{open a helix with hairpin loop} \\ \min_{i \ldots} \ldots \end{cases}$$

$C_{i,j}$ and $FML_{i,j}$ are calculated for each subsequence $S_{i,j}$ as the minimum of a well-defined set of rules, each corresponding to various structural elements (Equation (1), Equation (2)). The minimum free energy can be retrieved from the value at $FML_{1,L}$. The corresponding MFE structure is retrieved by backtracking through the $C_{i,j}$ and $FML_{i,j}$ matrices. The Zuker-Stiegler algorithm is implemented in many RNA secondary structure prediction tools, including RNAfold [50], and Mfold [125], which are both frequently used today. By assigning an estimated thermodynamic energy value to secondary structure predictions, the Zuker-Stiegler prediction more closely models physical attributes of RNA structure. However, to produce the thermodynamic estimate, the algorithm requires a large set of thermodynamic parameters. In 1999, the Turner group published one such model, which included a combination of roughly 7600 experimentally measured and estimated energy values [74]. This parameter set (called the Turner 1999 parameter set) is widely used by many modern implementations of the Zuker-Stiegler algorithm, including RNAfold [50] and Mfold [125]. Andronescu et al. improved estimated values in the Turner 1999 parameter set by applying sophisticated machine learning techniques to train 363 independent free parameter values [5]. The initial Turner parameters were adjusted using a training set of 3439 reference structures (average length 178 nucleotides), and 946 thermodynamic measurements by optical melting (average sequence length 17 nucleotides), which resulted in the Andronescu 2007 parameter set. They observed an average performance increase of 7% in F-measure on a test set of 1660 sequences that have an average length of 295 nucleotides, and that contain several biological classes, including transfer RNA (tRNA), RNase P, ribosomal RNA (rRNA), and signal recognition particle (SRP) RNA.
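The F-measure quoted above is commonly computed as the harmonic mean of sensitivity and positive predictive value over predicted and reference base pairs. The following sketch assumes that definition and represents structures as sets of position tuples; the function name `f_measure` and the example pairs are hypothetical and are not taken from [5].

```python
def f_measure(predicted, reference):
    """Harmonic mean of sensitivity and positive predictive value over base pairs."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    sensitivity = true_positives / len(reference)  # fraction of reference pairs recovered
    ppv = true_positives / len(predicted)          # fraction of predicted pairs that are correct
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Hypothetical base pairs for illustration only.
reference = {(1, 20), (2, 19), (3, 18), (4, 17)}
predicted = {(1, 20), (2, 19), (5, 16)}
print(round(f_measure(predicted, reference), 4))  # 0.5714
```

In this example two of four reference pairs are recovered (sensitivity 0.5) and two of three predicted pairs are correct (PPV 0.667), giving an F-measure of 4/7.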

1.3.3 Prediction of pseudoknotted structures

A significant limitation of the Zuker-Stiegler algorithm is its technical inability to model pseudoknots (see Figure 2 and Section 1.1.2 for a detailed explanation of pseudoknots). Two major issues underlie this limitation. First, it is the Zuker-Stiegler algorithm's assumption that all base pairs are nested (i.e. the structure is non-pseudoknotted: there are no two base pairs (i, l) and (k, j) such that i < k < l < j) that reduces the secondary structure search space, resulting in time complexity of $O(N^3)$ and space complexity of $O(N^2)$. Second, thermodynamic energy parameters are not well known for pseudoknotted structures. In the most general case, prediction of pseudoknotted structures has been shown to be an NP-complete problem [18]. A number of methods have been developed that model a restricted subset of all possible pseudoknotted structures. While these methods are more computationally costly than the Zuker-Stiegler algorithm (which is $O(N^3)$ in run time), they are tractable with run time complexity ranging from $O(N^4)$ to $O(N^6)$. One subclass of pseudoknotted RNA structures is the so-called bi-secondary structures defined by Haslinger and Stadler [42]. This class includes those structures that can be partitioned into two disjoint sets of non-pseudoknotted base pairs. Importantly, they found only a single pseudoknotted RNA structure that is not bi-secondary among a diverse set of examples. No methods currently predict only this class of pseudoknotted structures. In 1999, Rivas and Eddy published a dynamic programming algorithm that models pseudoknotted base pairs with time complexity $O(N^6)$ and space complexity $O(N^4)$ [102]. The class of pseudoknotted structures predicted by their method is neither contained in nor contains the class of bi-secondary structures [103], but is the most general class of all current methods [18]. Thermodynamic energy parameters for non-nested base pairs (i.e. pseudoknotted base pairs) were roughly estimated (e.g. by multiplying the nested energy values by a parameter g > 1) [102]. Because of the high time complexity of the algorithm, it can only be realistically used for prediction of very short sequences. A number of algorithms for pseudoknotted structure prediction that improve on the run time efficiency followed this publication, including those that further restrict the class of pseudoknotted structures that can be modeled (e.g. NUPACK [24], pknotsRG-mfe [98]), and heuristic algorithms which are not guaranteed to return the minimum free energy structure (e.g. STAR package [41], ILM [104], HotKnots [100]). In 2005, Ren et al. developed a heuristic algorithm called HotKnots [100], against which they empirically evaluated all the above methods using 43 sequences. For short sequences (mean length 58 nucleotides), HotKnots and pknotsRG-mfe performed best, and were roughly equal in both sensitivity and specificity [100]. STAR out-performed all other methods for long sequences (average length 300 nucleotides) [100].
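Following the description of bi-secondary structures given above (base pairs that can be partitioned into two sets, each free of crossing pairs), membership in this class can be tested by two-colouring the graph whose edges connect crossing base pairs. The sketch below is an illustration under that reading; the function names `crosses` and `bisecondary_partition` are hypothetical, and the input is assumed to be a list of distinct position tuples.

```python
from itertools import combinations


def crosses(p, q):
    """True if base pairs p and q are in a non-nested (pseudoknotted) configuration."""
    (i, l), (k, j) = sorted([tuple(sorted(p)), tuple(sorted(q))])
    return i < k < l < j


def bisecondary_partition(pairs):
    """Split pairs into two pseudoknot-free sets, or return None if that is impossible."""
    pairs = [tuple(sorted(p)) for p in pairs]
    # Conflict graph: one node per base pair, one edge per crossing.
    conflicts = {p: [] for p in pairs}
    for p, q in combinations(pairs, 2):
        if crosses(p, q):
            conflicts[p].append(q)
            conflicts[q].append(p)
    # Two-colour the conflict graph; crossing pairs must receive different colours.
    colour = {}
    for start in pairs:
        if start in colour:
            continue
        colour[start] = 0
        stack = [start]
        while stack:
            p = stack.pop()
            for q in conflicts[p]:
                if q not in colour:
                    colour[q] = 1 - colour[p]
                    stack.append(q)
                elif colour[q] == colour[p]:
                    return None  # an odd cycle of crossings: not bi-secondary
    first = [p for p in pairs if colour[p] == 0]
    second = [p for p in pairs if colour[p] == 1]
    return first, second


print(bisecondary_partition([(1, 10), (5, 15)]))           # ([(1, 10)], [(5, 15)])
print(bisecondary_partition([(1, 10), (5, 15), (8, 20)]))  # None
```

The second example contains three mutually crossing base pairs, so no split into two pseudoknot-free sets exists.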

1.3.4 Suboptimal folding

Minimum free energy predictions (e.g. those made by the Zuker-Stiegler algorithm) may differ signif- icantly from experimentally verified structure annotations. There are several factors that may influence this disparity. First, many thermodynamic energy parameters are estimated by in vitro experiments, and may not accurately reflect actual RNA chemistry in vivo. Second, RNA secondary structure prediction methods fail to model numerous aspects of the cellular environment, including protein and RNA interac- tion partners, concentrations of ions beyond just Mg2+, metabolites and small molecules. Third, many RNA molecules may not actually reach their minimum free energy state in a realistic cellular time frame due to slow reaction kinetics (discussed later in detail in section 1.5). Fourth, many RNA molecules have alternative functional structures, wherein more than one conformation is involved in the functional role of the molecule. In 1999, Wuchty, et al. tried to address this discrepancy with so-called suboptimal folding [121], which generates an ensemble of secondary structures near minimum free energy. The algorithm calculates all RNA secondary structures with free energy less than an arbitrary user-provided threshold from the minimum free energy value. The output is an extensive list of secondary structures and respective free energy estimations that includes the minimum free energy structure and other potential meta-stable conformations (i.e. conformations stuck in a local free energy minimum from which the transition to a globally stable structure is slow). The first step determines the minimum free energy of the input sequence using the Zuker-Stiegler algorithm with run time O(N 3), but the run time of the suboptimal backtrack step depends highly on the specified energy threshold. Because the number of potential RNA structures scales roughly exponentially with the length of the sequences, a sufficiently large threshold will result in exponential run time. 
The user is typically not looking for the minimum free energy structure when using suboptimal folding, but the algorithm gives no indication of which structures are most likely to form aside from their free energy estimations. Determining which structures are important remains the main limitation of the suboptimal folding method. This algorithm is implemented in the ViennaRNA package under the name RNAsubopt [50].
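To make the effect of the energy threshold concrete, the following Python sketch enumerates every pseudoknot-free structure of a short sequence and keeps those within a threshold of the minimum. It is an illustration only: it uses brute-force enumeration rather than the backtracking scheme of Wuchty et al., and it assumes a toy energy model in which every base pair contributes a fixed energy (`pair_energy`) and a hairpin needs at least `min_loop` unpaired nucleotides. The function names are hypothetical and are not part of RNAsubopt.

```python
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}  # canonical and wobble pairs


def structures(seq, i, j, min_loop=3):
    """Yield every pseudoknot-free set of base pairs on seq[i..j] (inclusive)."""
    if j - i < min_loop + 1:  # too short to hold even one hairpin
        yield frozenset()
        return
    # Case 1: position i stays unpaired.
    yield from structures(seq, i + 1, j, min_loop)
    # Case 2: position i pairs with k, which splits the problem in two.
    for k in range(i + min_loop + 1, j + 1):
        if seq[i] + seq[k] in PAIRS:
            for inner in structures(seq, i + 1, k - 1, min_loop):
                for outer in structures(seq, k + 1, j, min_loop):
                    yield inner | outer | {(i, k)}


def suboptimal(seq, delta, pair_energy=-1.0):
    """Return (energy, pairs) for all structures within delta of the minimum.

    Toy assumption: the energy is pair_energy times the number of base pairs.
    """
    scored = [(pair_energy * len(s), s)
              for s in structures(seq, 0, len(seq) - 1)]
    mfe = min(energy for energy, _ in scored)
    kept = [(energy, sorted(s)) for energy, s in scored if energy <= mfe + delta]
    return sorted(kept, key=lambda item: item[0])


if __name__ == "__main__":
    for threshold in (0, 1, 2, 3):
        print(threshold, len(suboptimal("GGGAAACCC", threshold)))
```

Even for this nine-nucleotide example, the number of reported structures grows quickly as the threshold is relaxed, which mirrors the run time behaviour described above.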

1.3.5 Equilibrium partition function

Free energy minimization methods provide no indication of which structures are more probable aside from the free energy estimations. In 1990, McCaskill employed the equilibrium partition function to do exactly this [76]. The algorithm defines the partition function in terms of the Gibbs free energy of potential RNA secondary structures. The Gibbs free energy is a measure of the chemical potential of a system at constant temperature and pressure, and it is minimized when the system reaches thermodynamic equilibrium. The partition function Z(T) is specified as follows, where T is the temperature in Kelvin, k_B is the Boltzmann constant, and G(s) is the Gibbs free energy in kcal/mol of a given structure s in the set S of potential pseudoknot-free structures:

\[ Z(T) = \sum_{s \in S} e^{-G(s) / (k_B T)} \tag{3} \]

The partition function is thus a weighted sum of the potential states of the system (i.e. the RNA secondary structures). McCaskill’s partition function algorithm efficiently calculates Z(T) for pseudoknot-free structures only, and operates similarly to the Zuker-Stiegler algorithm. It recursively processes all subsequences, starting from the smallest, but performs summation rather than minimization. Importantly, the probability P(s) of observing a given RNA secondary structure at thermodynamic equilibrium can be derived from the value of the partition function according to equation (4).

\[ P(s) = \frac{1}{Z(T)} e^{-G(s) / (k_B T)} \tag{4} \]

Thus the McCaskill algorithm can efficiently generate an estimated probability of observing a conformation, assuming that the population of RNA molecules has reached thermodynamic equilibrium. The run time complexity is equivalent to that of the Zuker-Stiegler algorithm: O(N^3). Also like the Zuker-Stiegler algorithm, it entirely disregards pseudoknot formation. Because the number of conformations for an RNA molecule grows exponentially with its length, the probability of any one conformation is typically very small. These numbers are not useful in isolation, but are typically used to compare the relative prominence of two or more conformations. Furthermore, a modified traceback of the dynamic programming matrix produces base pairing probabilities (i.e. the probability of observing a given base pair across the ensemble of structural conformations) [76]. The McCaskill partition function algorithm is implemented in the ViennaRNA package [48]. Dirks and Pierce later developed a modified implementation of the McCaskill algorithm that includes a subclass of pseudoknotted structures, at a higher run time complexity of O(N^5) [25].
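The relationship between minimization and summation can be sketched in a few lines of Python. The example below is not McCaskill's algorithm with the nearest-neighbour energy model; it assumes a toy model in which each base pair contributes a fixed energy `pair_energy` (in kcal/mol), and it uses the gas constant R in place of k_B because the energies are expressed per mole. The function name `fold` and its parameters are hypothetical. The same recursion returns either the minimum free energy or the partition function, depending only on whether alternatives are combined by `min` or by a Boltzmann-weighted sum.

```python
import math
from functools import lru_cache

RT = 0.0019872 * 310.15  # gas constant in kcal/(mol*K) times 37 degrees C in K
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def fold(seq, pair_energy=-1.0, min_loop=3, mode="mfe"):
    """Toy O(N^3) recursion over all subsequences.

    mode="mfe"       -> minimum free energy (alternatives combined by min)
    mode="partition" -> partition function Z (alternatives combined by sum)
    """
    weight = math.exp(-pair_energy / RT)  # Boltzmann factor of one base pair

    @lru_cache(maxsize=None)
    def f(i, j):
        if j - i < min_loop + 1:  # only the empty structure fits
            return 0.0 if mode == "mfe" else 1.0
        acc = f(i + 1, j)  # position i unpaired
        for k in range(i + min_loop + 1, j + 1):
            if seq[i] + seq[k] in PAIRS:  # position i paired with k
                inner, outer = f(i + 1, k - 1), f(k + 1, j)
                if mode == "mfe":
                    acc = min(acc, pair_energy + inner + outer)
                else:
                    acc += weight * inner * outer
        return acc

    return f(0, len(seq) - 1)


if __name__ == "__main__":
    seq = "GGGAAACCC"
    mfe = fold(seq, mode="mfe")
    z = fold(seq, mode="partition")
    # Equation (4) applied to a structure whose energy equals the minimum.
    print(mfe, z, math.exp(-mfe / RT) / z)
```

Setting `pair_energy=0.0` in partition mode gives every structure a weight of one, so the function then simply counts the structures in the search space.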

1.3.6 Sfold

McCaskill’s 1990 partition function [76] thus defined a useful statistic for the analysis of RNA secondary structure, but did not explicitly present a method for structural prediction. The suboptimal folding algorithm of Wuchty et al. [121] presented a method of investigating an ensemble of low energy structures without assigning statistical relevance to the resulting list of structures. Ding and Lawrence [23] synthesized these concepts in developing a randomized algorithm, Sfold, which performs statistical sampling of the Boltzmann distribution using McCaskill’s partition function with worst case time and space complexity of O(N^4). Specifically, the Boltzmann distribution refers to the ensemble of states that exist in a physical system at thermodynamic equilibrium, and in the case of RNA, the probability of a state (i.e. a secondary structure) can be estimated by equation (4). Sfold samples an RNA secondary structure from the Boltzmann distribution such that the sampling probability is exactly the probability of observing that state according to equation (4). Thus, the method not only returns an ensemble of low energy structures like the previous suboptimal folding methods, but also returns them at frequencies that correspond to the relative abundance predicted by the Boltzmann distribution. Sfold, like its predecessors, makes the assumption that the population of RNA molecules has indeed reached thermodynamic equilibrium, as structures with lower energy are sampled with greater frequency. Morgan and Higgs showed that phylogenetically supported RNA structural annotations often have significantly higher estimated free energies than the predicted minimum free energy structures for the same molecule, particularly for long sequences [86]. Their study suggests that functional RNAs may not reach their minimum free energy state, and higher energy conformations may occur with greater frequency in vivo. 
Sfold is provided as a web server, which in addition to the basic Sfold functionality, provides visualization of and statistics on RNA secondary structure clustering [15, 22].
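The idea of sampling structures with exactly their Boltzmann probability can be illustrated with a stochastic traceback through a partition function table. The sketch below is a simplification and not the Sfold implementation: it assumes a toy energy model with a fixed energy per base pair, and the names `make_sampler` and `draw` are hypothetical. At each step, the traceback chooses between leaving a position unpaired and pairing it with a partner, with probabilities proportional to the contribution of each choice to the partition function.

```python
import math
import random
from functools import lru_cache

RT = 0.0019872 * 310.15  # kcal/mol at 37 degrees C
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def make_sampler(seq, pair_energy=-1.0, min_loop=3):
    """Return (draw, Z), where draw(rng) samples one structure from the
    Boltzmann distribution of a toy fixed-energy-per-pair model."""
    weight = math.exp(-pair_energy / RT)

    @lru_cache(maxsize=None)
    def z(i, j):
        """Partition function of seq[i..j]."""
        if j - i < min_loop + 1:
            return 1.0
        total = z(i + 1, j)
        for k in range(i + min_loop + 1, j + 1):
            if seq[i] + seq[k] in PAIRS:
                total += weight * z(i + 1, k - 1) * z(k + 1, j)
        return total

    def traceback(i, j, rng, pairs):
        if j - i < min_loop + 1:
            return
        # Each option is weighted by its share of z(i, j).
        partners, weights = [None], [z(i + 1, j)]
        for k in range(i + min_loop + 1, j + 1):
            if seq[i] + seq[k] in PAIRS:
                partners.append(k)
                weights.append(weight * z(i + 1, k - 1) * z(k + 1, j))
        k = rng.choices(partners, weights=weights)[0]
        if k is None:
            traceback(i + 1, j, rng, pairs)
        else:
            pairs.add((i, k))
            traceback(i + 1, k - 1, rng, pairs)
            traceback(k + 1, j, rng, pairs)

    def draw(rng):
        pairs = set()
        traceback(0, len(seq) - 1, rng, pairs)
        return frozenset(pairs)

    return draw, z(0, len(seq) - 1)


if __name__ == "__main__":
    draw, z_total = make_sampler("GGGAAACCC")
    rng = random.Random(0)
    sample = [draw(rng) for _ in range(1000)]
    print(len(set(sample)), "distinct structures in 1000 draws")
```

Because every draw is independent, the observed frequency of a structure in a large sample approximates its probability under equation (4) for the toy model.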

1.3.7 Prediction guided by chemical probing data

As discussed in Section 1.2.3, hints about the functional secondary structure of an RNA molecule can be gleaned through chemical probing methods, but often constructing the RNA structure from these hints is ambiguous. In 2004, Mathews, et al. [72] developed a dynamic programming method similar to the classical Zuker-Stiegler algorithm which explicitly takes into account chemical probing data to inform the prediction. This algorithm solves a slightly different problem than the other methods discussed previously, as it requires additional experimental input data. It is implemented as a minor modification of the Zuker-Stiegler algorithm, whereby any base pairs that are inconsistent with the probing data are simply disallowed. The authors evaluate the performance of their method on 14 sequences with known secondary structures and chemical probing data, and demonstrate an increase from 67% to 76% accuracy, which is mostly due to a large performance increase in a small number of sequences.

1.3.8 CONTRAfold

CONTRAfold abandons the thermodynamic physics-based approach to non-comparative RNA structure prediction, and instead relies on machine learning to establish a trained model of RNA secondary structure [26]. The CONTRAfold algorithm employs a probabilistic model called a conditional log-linear model (CLLM) to assign a probability measurement to a secondary structure annotation, and maximum expected accuracy parsing to efficiently calculate the annotation with the greatest probability. Its flexible model accounts for most features typically found in energy minimization methods, including stacking interactions, dangling nucleotides, and loop energies, all features that are not easily incorporated into a probabilistic model. Whereas thermodynamic methods rely on thousands of biochemical measurements for parameter allocation, CONTRAfold learns its probabilistic parameters through maximum likelihood estimation on known RNA structures. The authors present a 5.0% increase in sensitivity and 7.6% increase in specificity over RNAfold from the ViennaRNA package [50, 69].

1.4 Computational comparative RNA Secondary Structure Prediction

While non-comparative methods take as input a single sequence, and typically predict the RNA secondary structure with the lowest estimated free energy, comparative methods investigate the signals left behind by thousands of years of evolutionary history. They need as input a set of orthologous sequences from related organisms (aligned or unaligned depending on the algorithm), and return as output the predicted conserved RNA secondary structure. Comparative methods rely on evolution to reveal which structural features are functionally important, and thus rely on fewer assumptions about the cellular environment. Specifically, it does not matter if the RNA molecule has protein binding partners, or does not reach thermodynamic equilibrium, because comparative methods do not need to model any physical process inside the cell. Rather, if a feature is functionally important to the organism, it is likely to be retained throughout evolution, and we can observe this pattern of conservation among its close relatives. However, most comparative methods are highly sensitive to input data quality: they require known sequences from numerous related species and intensive data curation. Misalignment can result in poor predictions, because the method can only reliably predict a base pair if the participating nucleotide is in the same alignment column for all alignment species. A crucial consideration in comparative RNA analysis is that base pairing pattern tends to be more conserved than the primary sequence itself. The presence of secondary structure imposes long-range sequence correlations, wherein distant nucleotides are highly dependent through evolution. Moreover, the presence or absence of a potential base pair tends to trump the specific nucleotides at those positions. That is, secondary structure is often more important than the primary sequence, and furthermore, they can often be in conflict. 
For instance, two compensatory mutations from an A-U base pair to a G-C base pair may have no detrimental effect on the function of the RNA (they both form a base pair), whereas mutation from A-U to G-G would change the overall base pairing pattern. In both cases, the primary sequence is significantly altered, but only in the second case is the secondary structure changed. Thus, base pairing columns can retain secondary structure, while having very low primary sequence conservation. This phenomenon is referred to as covariation (or covariance), and is illustrated in detail in Figure 3. That primary sequence and secondary structure are at odds poses an issue for sequence alignment of structural RNAs. Methods that consider only primary sequence tend to misalign covarying columns, as they are oblivious to the long-range correlation between the participating columns. Furthermore, covariation is a strong signal of structural conservation as it is unlikely to occur by chance, and thus it is the focus of many comparative RNA secondary structure prediction methods.

Figure 3. Example of base pair covariation. The semi-circular arc indicates two base pairing alignment columns. Each row indicates a sequence within the alignment. Conserved base pairs are illustrated in green, invalid base pairs in orange, and covarying base pairs in blue. [A] A non-conserved base pair. Most sequences in the alignment have not retained this base pair. This is evidence that it is either not functionally important, or the base pair is only present in the first sequence. [B] A conserved base pair. All sequences in the alignment retain the pairing potential, but the alignment has no variation. The lack of variation may either be due to additional sequence constraints or little evolutionary divergence. [C] A covarying base pair. This typically provides the strongest evidence of a functional base pair. The sequences have significant variation, but they all retain base pairing potential. Diagrams generated using the R4RNA R package [65].
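The three cases distinguished in Figure 3 can be expressed as a small classification rule. The Python sketch below is a simplification made for illustration: the function name `classify_pair_columns` is hypothetical, the rule treats any single invalid pair as loss of conservation (whereas panel [A] describes a column pair in which most sequences have lost the pair), and any change between valid pair types is counted as covariation.

```python
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}  # canonical and wobble pairs


def classify_pair_columns(col_i, col_j):
    """Classify two alignment columns as a candidate base pair.

    col_i and col_j hold one nucleotide per aligned sequence (same row order).
    Returns "non-conserved", "conserved", or "covarying".
    """
    if not col_i or len(col_i) != len(col_j):
        raise ValueError("columns must be non-empty and of equal length")
    observed = [a + b for a, b in zip(col_i, col_j)]
    if any(pair not in PAIRS for pair in observed):
        return "non-conserved"   # at least one sequence cannot form the pair
    if len(set(observed)) == 1:
        return "conserved"       # pairing retained, but no sequence variation
    return "covarying"           # pairing retained despite sequence variation


if __name__ == "__main__":
    print(classify_pair_columns("GAAA", "CGGG"))
    print(classify_pair_columns("GGGG", "CCCC"))
    print(classify_pair_columns("GACU", "CUGA"))
```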

1.4.1 Covariance models

Sequence search and alignment tools based purely on primary sequence conservation, such as the local alignment tool BLAST [3,4], are typically insufficient to identify and align structured RNAs due to secondary structure covariance patterns. That is, two sequences with highly differing primary sequence may in fact retain the same overall base pairing pattern. In 1994, Eddy and Durbin developed covariance models to perform RNA sequence database searching and alignment construction that is aware of both primary sequence and secondary structure conservation [30]. As input, a covariance model accepts a seed alignment with a reference structure from which to construct a probabilistic model. This model is conceptually similar to profile hidden Markov models (HMM), which capture the primary sequence conservation of a single protein family [29]. However, due to the long-range sequence correlation imposed by base pairing positions in structural RNA, a more complex model is required: the stochastic context free grammar (SCFG). SCFGs can model the long-range interactions of RNA base pairs via the production rules of their underlying grammar. The grammar is parameterized by transition and emission probabilities which are trained from the seed alignment. SCFGs are quite flexible, as adjusting the grammar can capture different hypotheses on RNA structure formation. Using this infrastructure, a covariance model can be used to perform rapid sequence matching to the seed alignment, and alignment construction from the search hits. The authors describe a rudimentary consensus structure prediction algorithm that utilizes an iterative pairwise implementation of a modified Zuker-Stiegler algorithm. However, this functionality ignores the evolutionary history of the sequences, and it is evaluated on a small handful of sequences. The strength of their work lies in database searching and alignment construction from a seed alignment of known secondary structure. A modern implementation of RNA covariance models is described in [88].

1.4.2 Pfold

In 1999, Knudsen and Hein [60] developed the first comparative RNA secondary structure prediction method that fully takes into account the evolutionary relationships of the sequences: Pfold. Like the covariance models of Eddy and Durbin [30], Pfold utilizes stochastic context free grammars as a probabilistic model capable of capturing the long-range dependencies of structured RNAs. Rather than building a model that specifically describes a single structured RNA alignment (as with covariance models), Pfold’s grammar is general enough to produce any valid pseudoknot-free RNA secondary structure. Whereas the probability of any column in a covariance model arises from its match to the seed in both sequence and structure, Pfold utilizes the Felsenstein peeling algorithm [33] to calculate the probability of paired and unpaired columns according to the evolutionary model, and uses the values as the SCFG emission probabilities. The Felsenstein algorithm ascends the input evolutionary tree from the leaves (i.e. observed sequences from species) to the root, and calculates its likelihood using a rate of mutation matrix trained from known structured RNAs. The algorithm requires both a 4x4 rate matrix that describes the evolution of independent unpaired columns, and a 16x16 rate matrix that describes the evolution of paired columns. Importantly, the paired evolutionary rate matrix captures covariation patterns in the training RNA set (e.g. AU → GU has a higher rate than AU → AG as the former retains base pairing potential). The likelihood of the alignment is thus the product of all unpaired and paired column likelihoods according to a structural annotation. The secondary structure prediction algorithm utilizes the Cocke-Younger-Kasami (CYK) algorithm [28] to efficiently explore the search space of potential RNA secondary structures. 
The CYK algorithm is conceptually similar to the Zuker-Stiegler algorithm in that it employs dynamic programming to find the secondary structure with the optimal score in O(N^3) time; however, it uses the Felsenstein and SCFG transition probabilities in place of thermodynamic energy values. Pfold was implemented as a web server described in [61], and later as a downloadable Java application that supports multi-threading, and explicitly takes into account chemical probing data [111, 112].
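The leaf-to-root likelihood calculation can be sketched for a single unpaired alignment column as follows. This is an illustration under stated assumptions rather than Pfold's implementation: it substitutes the Jukes-Cantor substitution model for the trained 4x4 rate matrix, assumes a uniform nucleotide distribution at the root, ignores gaps, and represents the tree as nested tuples of the hypothetical form `(left, left_branch_length, right, right_branch_length)` with single-character strings as leaves. Paired columns would follow the same scheme over the 16 possible nucleotide pairs.

```python
import math

NUC = "ACGU"


def jc_prob(a, b, t):
    """Jukes-Cantor probability that nucleotide a becomes b along a branch of
    length t (expected substitutions per site). Stand-in for a trained matrix."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node):
    """Return {x: P(leaves below node | node has nucleotide x)}."""
    if isinstance(node, str):  # leaf: the observed nucleotide is certain
        return {x: 1.0 if x == node else 0.0 for x in NUC}
    left, t_left, right, t_right = node
    p_left, p_right = partial(left), partial(right)
    return {
        x: sum(jc_prob(x, y, t_left) * p_left[y] for y in NUC)
        * sum(jc_prob(x, y, t_right) * p_right[y] for y in NUC)
        for x in NUC
    }


def column_likelihood(tree):
    """Likelihood of one alignment column, with a uniform root distribution."""
    return sum(0.25 * p for p in partial(tree).values())


if __name__ == "__main__":
    conserved = (("A", 0.1, "A", 0.1), 0.05, "A", 0.2)
    variable = (("A", 0.1, "C", 0.1), 0.05, "G", 0.2)
    print(column_likelihood(conserved), column_likelihood(variable))
```

With short branches, a conserved column receives a higher likelihood than a variable one, which is the behaviour the evolutionary model is meant to capture.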

1.4.3 RNA-Decoder

RNA-Decoder is a method closely related to Pfold, but is intended specifically to model the complex evolutionary constraints of protein-coding RNAs [95], such as pre-mRNAs, mRNAs, and viral genomes. Due to the degeneracy of the genetic code, it is known that the second codon position has the tightest constraint, followed by the first, and finally the third position. These varying constraints result in differences in the evolutionary substitution process, and RNA-Decoder explicitly takes this into account when calculating the column likelihood. Whereas Pfold has one evolutionary model for base paired columns and one model for unpaired columns, RNA-Decoder has separate models depending on the codon position (i.e., 1, 2, or 3) of both the 5’ and 3’ columns for both unpaired and paired columns, resulting in a total of 15 phylogenetic models. The evolutionary models of RNA-Decoder are described in detail in [94]. The authors show that RNA-Decoder predicts fewer false positives compared to the predictions of competing methods that do not take protein-coding regions into account (including Mfold, Pfold, and RNAalifold) for several structured, protein-coding RNAs.

1.4.4 RNAalifold

In 2002, RNAalifold [49], which was developed as a part of the ViennaRNA package, tackled the problem of comparative structure prediction by considering both thermodynamic stability and sequence covariation simultaneously. The authors define a heuristic scoring function for base paired alignment columns as the linear combination of (1) the average pairing energy of all sequences and (2) a covariation score. The covariation score is higher when the columns contain compensatory mutations that retain the pairing potential, and lower when there is no variation, or the columns contain many invalid base pairs. The covariation measure does not explicitly model the evolutionary relationship of the input sequences (as Pfold does), but rather weighs all sequences equally. RNAalifold uses a familiar dynamic programming scheme to determine the secondary structure with the optimal score according to the above heuristic function. The authors later published an improvement of the RNAalifold algorithm by adjusting the evaluation of gaps and covariation scores [7].
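A covariation score with the qualitative behaviour described above can be sketched as follows. This is a simplified illustration and not the RNAalifold scoring function: the function names, the `penalty` and `weight` parameters, and the exact way the two terms are combined are assumptions made for this example. The score averages, over all pairs of sequences that both form a valid base pair, the number of positions at which the two base pairs differ, and subtracts a penalty proportional to the fraction of sequences that cannot pair. All sequences are weighted equally.

```python
from itertools import combinations

PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def covariation_score(col_i, col_j, penalty=1.0):
    """Toy covariation score for two alignment columns (higher is better)."""
    rows = list(zip(col_i, col_j))
    n = len(rows)
    if n == 0 or len(col_i) != len(col_j):
        raise ValueError("columns must be non-empty and of equal length")
    valid = [a + b in PAIRS for a, b in rows]
    total = 0
    for (x, x_ok), (y, y_ok) in combinations(list(zip(rows, valid)), 2):
        if x_ok and y_ok:
            # Count the positions (0, 1, or 2) at which the two pairs differ.
            total += (x[0] != y[0]) + (x[1] != y[1])
    n_comparisons = n * (n - 1) / 2
    covariation = total / n_comparisons if n_comparisons else 0.0
    invalid_fraction = valid.count(False) / n
    return covariation - penalty * invalid_fraction


def combined_score(mean_pair_energy, covariation, weight=1.0):
    """Hypothetical linear combination; lower values are more favourable."""
    return mean_pair_energy - weight * covariation


if __name__ == "__main__":
    for cols in (("GGGG", "CCCC"), ("GACU", "CUGA"), ("GAAA", "CGGG")):
        print(cols, covariation_score(*cols))
```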

1.4.5 Alignment-free methods

Comparative RNA structure prediction methods are highly sensitive to the quality of the input alignment. Conserved base pairing positions do not necessarily retain the specific nucleotides participating in those pairs due to covariation patterns. Alignments that are constructed by only taking into account the primary sequence conservation are thus highly susceptible to misalignment. Any conserved base pairs that are not properly aligned in the input alignment cannot be easily detected. A chicken and egg problem exists: one needs structural information to properly align the sequences, but one also needs an accurate alignment to predict the secondary structure. To address this problem, a number of methods exist that take unaligned sequences as input (so-called alignment-free methods). A number of methods simultaneously fold and align several RNA sequences: Stemloc [52], Consan [27], Dynalign [73, 75], Foldalign [43, 44], CARNAC [96, 114], ComRNA [55], and SimulFold [80]. Dynalign and Foldalign both employ a thermodynamic energy minimization approach to predict pseudoknot-free structure shared between two sequences. SimulFold uses a Bayesian Markov chain Monte Carlo (MCMC) approach to non-deterministically explore the distribution of potentially pseudoknotted RNA structure, alignment, and tree of the input sequences simultaneously. The others use the so-called N-SCFG probabilistic model first described by Sankoff [106]. The N-SCFG is a complex extension of the typical SCFG model, and accepts several sequences rather than just a single sequence (where N is the number of sequences). Due to the complexity of determining pairing positions and gap positions simultaneously, N-SCFGs deterministically solve a more challenging computational problem requiring O(L^{3N}) time and O(L^{2N}) space, where L is the length of the sequences (in some cases, heuristics improve on these bounds). ComRNA is the only N-SCFG method which allows for pseudoknotted structures. Consan accepts only two sequences, while ComRNA, CARNAC, and Stemloc accept several sequences. Alignment-free methods are useful when an alignment is difficult to establish, typically when sequence identity is extremely low. Moreover, they have an inherent performance advantage over non-comparative methods because they simply have more signal to work with, while requiring no alignment curation effort. For instance, Dynalign (pairwise thermodynamic folding) reports an increase from 59.7% to 86.1% over non-comparative free energy minimization for tRNA sequences [75]. These advantages come at a substantial computational cost due to the complexity of alignment-free structure prediction.

1.5 Thermodynamics alone does not tell the whole story

1.5.1 Importance of kinetics

An RNA molecule, like all molecules, proceeds towards its minimum free energy given infinite time, and this can involve either internal changes, or formation of favourable interactions with other molecules. Once this occurs, the system is in thermodynamic equilibrium where there is no net flow between states, and entropy is maximized. This is typically a dynamic state where change is occurring, but the rates of opposing changes are equal. Once in thermodynamic equilibrium, the system is dominated by the state with the lowest Gibbs free energy, a measure of entropic potential, and the same measure employed by non-comparative energy minimization methods for RNA secondary structure prediction. Specifically, the state at thermodynamic equilibrium is determined by the difference between free energy of the reactants and free energy of the products (i.e. ∆G), and thus is entirely independent of the path of the transformation. The increase in free energy from the reactants to the transient states along the pathway (i.e. ∆G‡) determines the rate at which the reaction proceeds. When a significant kinetic barrier exists between reactant and product, the system may exist in a meta-stable state, which is not at thermodynamic equilibrium, but from which transitions towards the minimum energy state are prohibitively slow. Furthermore, thermodynamic equilibrium implies that the relative amounts of each molecule remain constant over time. The living cell is an extremely dynamic environment that would not function if all processes were at equilibrium, and therefore equilibrium may not be a realistic assumption. Thus, kinetics may play an important role in the formation of functional RNA structures in vivo. Morgan and Higgs [86] investigated this idea in 1996 by examining discrepancies between estimated free energy of phylogenetically supported reference structures and free energy of structures predicted by energy minimization. They found that these two energies are roughly the same for short sequences (i.e. less than 250 nucleotides), but the reference structures of longer sequences have significantly higher estimated energy. This is evidence that the functional structures of the RNAs in vivo are not in fact the minimum free energy structure. The authors posit that small domains form swiftly during RNA folding, and any conformational shift resulting in long-range base pairs naturally suffers from a large energy barrier, as it would require dissolution of small domains. They hypothesize that the free energy differences they observe cannot be due solely to errors in the free energy parameters utilized by energy minimization methods, but instead arise because these methods disregard kinetic considerations. Existing non-comparative methods for RNA secondary structure prediction largely disregard kinetic considerations, and rely fully on free energy minimization. This assumes that RNAs will always reach their minimum energy structure, and that this is the functional structure of the molecule. Comparative methods circumvent this issue, as evolutionary signals naturally guide the prediction towards conserved, functional structures.

1.5.2 Co-transcriptional Folding

Transcription in vivo is a well-defined process that synthesizes the RNA molecule from a DNA template starting at the 5’ end and proceeding toward the 3’ end. There is now substantial evidence that RNA structural formation occurs during transcription (i.e. co-transcriptionally), and that the transcription process has a significant impact on the resulting functional structure of the molecule. Here we present some of the mechanisms by which RNA structures are affected by co-transcriptional folding.

Transcription speed It has been shown that RNA structure formation can occur within the time scale of RNA synthesis [9, 63], and thus occurs in a directional, co-transcriptional manner. RNA structural changes occur at rates several orders of magnitude faster than typical transcription rates [2]. Several in vitro studies demonstrated that modulation of the transcription speed can alter the resultant final structure of the RNA folding pathway, indicating that a kinetic effect can play an important role in RNA structure formation [45,46,91]. It has been shown in several cases that synthesis of structured RNAs with a non-host polymerase (i.e. a polymerase with a transcription rate different from the native polymerase) results in an inactive transcript [16, 67].

Transcriptional pausing Pausing during transcription is known to play an important role in the RNA secondary structure formation process. Several experiments show that consistent pausing sites are necessary to ensure efficient folding [113,118,120]. Pausing is caused in bacteria by interactions between the nascent RNA molecule and the polymerase protein complex [66, 68, 85].

Transient RNA structures Due to the directional nature of transcription, RNA structure formation has likewise been shown to be directional [70, 71]. Base pairs near the 5’ end of the molecule are able to form early on, and long-range base pairs or those near the 3’ end of the molecule can only form later during transcription. Structure elements that appear temporarily during the folding process (i.e. transient features) play a role in guiding the pathway [63, 101]. Tertiary structural features have been shown to determine the folding pathway that is followed [17]. Flanking sequences, in addition to transcription speed, were shown to be an important factor in reaching the functional folding pathway [62]. Meyer and Miklos investigated asymmetries that result from directional structure formation in structured RNAs [79]. They predicted several classes of potentially transient RNA helices that may compete with the known reference structure of that RNA: those that may prevent formation of the functional structure, and those that may cede to it. The former were found to be statistically underrepresented, while the latter were overrepresented. This suggests that, through evolution, certain types of potentially transient competing helices are suppressed while others are encouraged, and that structural RNA genes encode some information on their own co-transcriptional folding pathway.

Interactions with other molecules RNA structure formation is affected significantly by interactions with other molecules during folding. Binding to small metabolites can induce important structural changes necessary for the function of the molecule [39, 56, 87]. Proteins, including chaperones [117] and sequence- or structure-specific RNA binding proteins (e.g. [83,84]), can prevent misfolding of structured RNAs. Two separate RNA molecules can form base pairs in the same way that two regions of the same RNA form base pairs with each other. These RNA-RNA interactions are essential to the function of several RNAs, including microRNA-mRNA interaction responsible for repression of mRNA expression [64], snRNA-mRNA interaction which guides splicing [53], and snoRNA-rRNA interaction responsible for editing of rRNAs [6]. RNA structure prediction methods typically disregard the effect of interactions with other molecules due to lack of information on interaction partners in addition to the sheer complexity of the problem.

1.5.3 Kinetic folding methods

A number of existing computational methods explicitly simulate the co-transcriptional folding pathway as a series of structural changes over time. These methods require a single sequence as input (i.e. they are non-comparative), and return as output a list of structural configurations constituting a predicted folding pathway. Most kinetic simulation methods employ stochastic simulation, including RNAkinetics [20, 81, 82], Kinfold [34], Kinefold [54, 122, 123], and the method described in [41]. These methods explicitly simulate the physical process of RNA folding by modeling the reaction kinetics of helix formation and disruption. RNAkinetics was the first, published in 1985, and allows pseudoknot-free structures containing helices of contiguous base pairs (i.e. no bulges or internal loops). The transition probability from any given RNA structure to another is defined as the kinetic rate constant associated with that chemical change, giving the model a logical physical interpretation. Kinfold models structural changes on the level of individual base pairs rather than full helices, and the resulting pathway predictions are very detailed. Like RNAkinetics, it predicts only pseudoknot-free structures. Kinefold differs most significantly from the others in allowing pseudoknotted structures. This requires a more sophisticated energy model that accounts for the structural rigidity imposed by pseudoknots. These stochastic methods simulate co-transcriptional folding by progressively extending the sequence at regular intervals during simulation and assuming a constant transcription speed. Randomized changes are made to the RNA secondary structure throughout the simulation, and the probability of choosing each change is roughly proportional to the rate at which the physical process would happen. All of the above methods are subject to a strong length limitation: they can reliably accept sequences up to roughly 200 nucleotides in length. Kinwalker [38] is a deterministic algorithm that uses free energy minimization along with a heuristic that disallows transitions deemed kinetically infeasible. It first calculates the minimum free energy structure of all subsequences of the input sequence. The algorithm then begins the folding simulation at the 5’ end, and progressively extends the sequence. At regular intervals, it approximates a merger of the current structure and the minimum free energy structure for that subsequence using a heuristic. This chemical change is allowed if the theoretical time required for the transition is less than the time between transcription events. Kinwalker is designed to handle sequences up to 1000 nucleotides in length. 
The program returns as output the list of all structures encountered during the deterministic simulation. Due to the lack of experimentally confirmed RNA folding pathways, kinetic folding methods are evaluated on a very small number of cases, mostly comprising only the final structure. Thus, the correspondence between kinetic folding predictions and in vivo folding pathways is largely unknown. Moreover, all above methods make a wide range of simplifying assumptions about the cellular environment. They impose constant transcription speed, which does not occur in vivo as we know from experimental studies [113, 118, 120]. They are unable to model interaction partners due to the lack of knowledge on interacting proteins and RNA, and the sheer intractability of incorporating all cellular components into a computational model. This could have a significant impact on the reliability of pathway predictions, as even small effects early in the pathway could have a large effect on the resulting final structures. Additionally, kinetic folding simulations return as output a predicted folding pathway rather than a predicted functional structure, often making the output difficult to interpret. Overall, kinetic folding methods are useful tools for analysis of folding pathways, but suffer from significant limitations for RNA secondary structure prediction.
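The core loop of a stochastic kinetic simulation, in which each change is chosen with probability proportional to its rate, can be sketched as follows. This is a toy illustration and does not reproduce any of the methods above: it assumes a hypothetical three-state landscape (an open chain, a quickly formed local hairpin, and a more stable long-range structure that can only be reached from the open chain), uses the Metropolis rule as the rate model with arbitrary time units, and omits sequence extension during transcription. All names and energies are invented for the example.

```python
import math
import random

RT = 0.0019872 * 310.15  # kcal/mol at 37 degrees C

# Hypothetical energies (kcal/mol) and allowed transitions between states.
ENERGY = {"open": 0.0, "hairpin": -2.0, "long_range": -5.0}
NEIGHBOURS = {
    "open": ["hairpin", "long_range"],
    "hairpin": ["open"],        # the hairpin must dissolve before refolding
    "long_range": ["open"],
}


def rate(src, dst):
    """Metropolis rate: downhill moves at rate 1, uphill moves slowed by the
    Boltzmann factor of the energy increase (arbitrary time units)."""
    dg = ENERGY[dst] - ENERGY[src]
    return min(1.0, math.exp(-dg / RT))


def simulate(start, t_max, rng):
    """Gillespie-style simulation; returns a list of (time, state) events."""
    t, state = 0.0, start
    trajectory = [(t, state)]
    while True:
        moves = NEIGHBOURS[state]
        rates = [rate(state, m) for m in moves]
        t += rng.expovariate(sum(rates))  # waiting time until the next change
        if t > t_max:
            return trajectory
        state = rng.choices(moves, weights=rates)[0]  # pick change by its rate
        trajectory.append((t, state))


if __name__ == "__main__":
    for time, state in simulate("open", 200.0, random.Random(1))[:10]:
        print(f"{time:8.2f}  {state}")
```

In this landscape, a molecule that first falls into the hairpin has to cross an energy barrier back to the open chain before it can reach the more stable long-range structure, which is the kind of kinetic trap discussed in section 1.5.1.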

1.6 Goals of this thesis

We have thus far argued that kinetics and co-transcriptional folding are important considerations that have not yet been adequately addressed in RNA secondary structure prediction methods. Methods currently exist that predict kinetic folding pathways for an input sequence by explicitly simulating the co-transcriptional folding trajectory. By doing so, they have to make various simplifying assumptions about the cellular environment, and are currently limited to the study of molecules less than 1000 nucleotides in length. In this study, we want to explore whether it is feasible to capture the effects of co-transcriptional folding, rather than simulating the physical process itself. We aim to accomplish this using the well-established, deterministic energy minimization methods for RNA secondary structure prediction that can technically handle sequences of arbitrary length. Furthermore, by considering the overall effects of in vivo RNA structure formation rather than simulation of the physical process itself, we aim to generate more accurate secondary structure predictions without relying on complex assumptions of the cellular environment. These predictions, while guided by free energy minimization, would also be influenced by kinetic considerations, and thus we expect they will more accurately reflect in vivo structures. The aim is not to abandon energy minimization as a paradigm, but to incorporate additional considerations that help to guide the prediction toward a stable structure that also has a feasible kinetic folding pathway. To explore these ideas, we modify RNAfold from the ViennaRNA package [50,69], an implementation of the Zuker-Stiegler algorithm.

RNA secondary structure prediction methods suffer from particularly poor accuracy for longer sequences. In their 1996 study, Morgan and Higgs [86] hypothesize that this is due primarily to the disregard for kinetic effects rather than poor estimation of the free energy parameters. They suggest that, during transcription, local structures begin to form rapidly as the sequence emerges from the polymerase, and that formation of long-range base pairs requires disruption of these local structures.
Longer sequences are more greatly affected by this kinetic effect simply because they require disruption of a larger number of base pairs, thus resulting in higher energy barriers. In our study, we focus primarily on long sequences for this reason. Currently, methods for kinetic simulation of RNA folding are mostly limited to very short sequences (roughly 200 nucleotides), with the exception of Kinwalker which allows sequences up to roughly 1000 nucleotides in length. In section two we present our novel method, CoFold, that improves predictive accuracy, particularly for long RNA sequences. CoFold takes into consideration one aspect of the co-transcriptional folding pathway: the overall reachability of potential base pairing positions. We propose that long-range base pairs form with lower probability due to greater energy barriers involved in reorganization of the secondary structure. We present the dedicated long data set that was constructed for evaluation of CoFold, and for training of its two easily-interpretable parameters. We then investigate interesting case studies that

showcase the effectiveness of capturing kinetic effects in RNA secondary structure prediction.

Goals:

• Capture the effects of the co-transcriptional folding process in a deterministic, energy minimization algorithm

• Construct an adequate data set that specifically includes very long and diverse RNA sequences

• Examine specifically the performance of CoFold on long RNA sequences compared to energy minimization alone

• Ensure robust parameter training

• Demonstrate the effectiveness of accounting for kinetics in RNA secondary structure prediction

2 CoFold

2.1 Motivation

Free energy minimization methods for RNA secondary structure prediction typically consider only the change in free energy for the most stable RNA secondary structure conformation, but do not consider the process of RNA structure formation. This implicitly assumes that the input RNA sequence will always be able to reach the RNA structure that minimises the overall free energy of the molecule in vivo, irrespective of kinetic barriers that may occur during the co-transcriptional folding pathway. Given infinite time, any molecule will proceed toward its minimum free energy structure; however, slow rates of change may trap the molecule in meta-stable conformations due to kinetic considerations. We know from several early experiments that RNA secondary structure formation occurs during transcription, i.e. co-transcriptionally (see section 1.5). From these experiments, we know that RNA molecules do not necessarily assume the structural conformation with the lowest possible free energy in vivo, and that they may become kinetically trapped. That is, they may reach a local minimum in the structural folding landscape, and energy barriers prevent them from shifting to the globally minimum free energy structure.

Morgan and Higgs noted an important discrepancy between the phylogenetically supported RNA secondary structures and their corresponding predicted minimum free energy structures [86]. Specifically, phylogenetic structures for long sequences are predicted to have significantly higher free energy than the respective predicted minimum free energy structure. The authors hypothesize that this difference cannot be simply due to errors in the thermodynamic parameters, but is likely due to kinetic effects that are not accounted for. Specifically, they suggest that co-transcriptional folding naturally results in a domain folding effect wherein local structures form rapidly during transcription, and impose kinetic barriers for formation of long-range base pairs. Thus, long sequences tend to have predominantly local structure, with long-range base pairs filling in the spaces in between.

We propose a new method, CoFold, to predict RNA secondary structure while taking into account this effect of the co-transcriptional folding process. It is based largely on existing free energy minimization methods and the Zuker-Stiegler algorithm, but adjusts the weight of potential secondary structures according to the overall reachability of potential pairing partners during co-transcriptional folding (i.e. a kinetic effect). The run time and space complexity of CoFold remain the same as the Zuker-Stiegler algorithm: O(N^3) in time, and O(N^2) in space, where N is the length of the input sequence.

2.2 CoFold Algorithm

CoFold is a modification of the Zuker-Stiegler algorithm that takes into account some effects of co-transcriptional folding in addition to thermodynamics. The key effect that we aim to model is that during co-transcriptional folding in vivo, it matters to a given sequence position whether a potential pairing partner is available for base-pairing or not. Distant bases are less likely to pair with one another as local structures that form during transcription may act to block their formation. Under this model, there indeed could be a lower energy structure that requires reorganization of these local secondary structures, but it is unlikely to form due to kinetics. To capture this, we model the distance along the sequence between base pairing positions, and scale thermodynamic energy scores according to the distance. We therefore expect that the resulting predicted secondary structure is stable, as it is guided by free energy values, but that it is also kinetically feasible, as it is guided by the kinetic notion of domain folding. CoFold was implemented using the RNAfold source code from the ViennaRNA package [50,69], a widely used implementation of the Zuker-Stiegler algorithm [126]. See section 1.3.2 for a detailed explanation of the Zuker-Stiegler algorithm.

The Zuker-Stiegler algorithm efficiently explores the search space of RNA secondary structure conformations available to the input sequence, and returns the structure that overall minimizes the estimated free energy value according to the thermodynamic model. It assigns an energy value to a given secondary structure by deconstructing it into its elementary components, each with a free energy contribution assigned by the thermodynamic model. These components include hairpin loops, base pair stacks, bulges, unpaired nucleotides, etc. The energy value of a secondary structure is the summation of energy contributions from all of its structural components.

CoFold calculates energies in the same general manner as in RNAfold, but the energy contributions of all components associated with a base pair are modified by a scaling function according to the number of nucleotides between the pair (i.e. the distance d). This scaling function γ(d) models the exponential decay in reachability of pairing partners as a function of the nucleotide distance d between the two potential pairing partners and depends on two parameters α and τ (Equation (5), Figure 4). Both parameters have a straightforward interpretation. The value of α specifies the range of the scaling function (e.g. when α is 0.2, the affected free energies will range from 80% to 100% of their original values). Thus, γ(d) guarantees that no energy contributions are scaled below 1 − α. The value of τ determines the rate of the exponential decay, where low values of τ result in a steep decay function.

\gamma(d) := \alpha \cdot \left( e^{-d/\tau} - 1 \right) + 1 \qquad (5)
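As a quick numerical illustration of Equation (5), the scaling function can be sketched in a few lines of Python. This is only a sketch: the function name `gamma` and its default arguments (set here to the trained values α = 0.5, τ = 640 quoted for Figure 4) are choices made for this example, and it is not taken from the CoFold source code.

```python
import math


def gamma(d, alpha=0.5, tau=640.0):
    """Scaling factor of Equation (5): alpha * (exp(-d / tau) - 1) + 1.

    d     -- nucleotide distance between the two pairing positions (>= 0)
    alpha -- range of the scaling; gamma never drops below 1 - alpha
    tau   -- decay rate; small tau gives a steep decay
    """
    if d < 0:
        raise ValueError("distance d must be non-negative")
    if not 0.0 <= alpha <= 1.0 or tau <= 0:
        raise ValueError("expected 0 <= alpha <= 1 and tau > 0")
    return alpha * (math.exp(-d / tau) - 1.0) + 1.0


if __name__ == "__main__":
    # Print the scaling factor at a few illustrative distances.
    for d in (0, 100, 640, 3000):
        print(d, round(gamma(d), 3))
```

At d = 0 the factor is exactly 1, and for very large d it approaches 1 − α, which matches the lower bound described in the text.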

Definitions:

• Ci,j: minimum free energy of subsequence Si,j given nucleotides i and j form a base pair (RNAfold only)

• FMLi,j: minimum free energy of subsequence Si,j (RNAfold and CoFold)

• C′i,j: minimum free energy of subsequence Si,j given nucleotides i and j form a base pair, adjusted according to the reachability of the two pairing positions i and j (CoFold only)

The Zuker-Stiegler algorithm calculates two energy values for all subsequences of the input sequence: Ci,j and FMLi,j. Both RNAfold recursion formulas are shown in equation (6) and equation (7). Ci,j stores the energy of a subsequence enclosed by a base pair. The algorithm stores this energy separately in order to efficiently calculate stacking energies between adjacent base pairs (i.e. it must immediately have available the energy for all base pairs nested between i and j). Ci,j has a meaningful energy value

[Figure 4 plot: γ on the vertical axis (0.0 to 1.0, with the lower bound 1 − α marked) against d [nt] on the horizontal axis (0 to 4000), for α = 0.5, τ = 640.]

Figure 4. Scaling function γ of CoFold. Along the horizontal axis is the distance (in nucleotides) between the two positions involved in the base pair (i.e. the distance d). The value of the scaling function is along the vertical axis. γ is an exponential decay function governed by two parameters (α and τ) and its formula is defined in equation 5. The γ curve pictured above is plotted using the trained parameter values α = 0.5, τ = 640 (See section 2.5 for explanation of training procedure). The value of γ ranges from 1 − α to 1, and thus no energy contributions are scaled below this lower bound.

if positions i and j can form a canonical base pair (i.e. G-C, A-U, G-U), and otherwise a large positive number so that invalid base pairs are never incorporated into minimum energy substructures. The various cases of the recursion formulas correspond to the addition of elementary structural components (i.e. hairpin loop, unpaired nucleotide, etc.) to a previously computed value of Ci,j or FMLi,j (i.e. a subsequence), and in this lies the efficiency of the Zuker-Stiegler algorithm, as the energy of each subsequence is calculated once, and reused many times. The recursion formulas shown in equation (6) and equation (7) are simplified in terms of the thermodynamic energy parameters, which are highly sequence-specific, and in total they amount to roughly 7600 values in the Turner model [5]. Figure 5 provides a visual depiction of the indices i, j, p, and q.

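C_{i,j} = \min \left\{ \mathrm{hairpin}_{i,j} \ \text{(opens a helix with hairpin loop)},\ \min_{i\,\ldots} \ldots \right\}

\min_{i\,\ldots} \ldots

C'_{i,j} = \min \left\{ \gamma(d_{i,j}) \cdot \mathrm{hairpin}_{i,j} \ \text{(opens a helix with hairpin loop)},\ \min_{i\,\ldots} \ldots \right\}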

In CoFold, the RNAfold recursion formulas are modified by the scaling function γ(d). This function is only applied to energy values in the Ci,j calculation because these correspond to predicted base pairs. CoFold employs an adjusted recursion C′i,j shown in equation (8), which precisely defines how the scaling function affects the energy contributions in C′i,j. γ(d) is not applied to the recursive energy values of subsequences in order to avoid multiple applications to the same energy contributions. The function is applied both to elements with positive energy, such as loops and bulges, as well as to those with negative energy, such as stacking interactions. Primarily, the purpose of the function is to scale down the energy of long-range stacking interactions in order to give less weight to long-range base pairs; however, it is necessary to scale all related contributions, such as the loops and bulges, to preserve the relative magnitude of the energy from the overall structural feature. The FMLi,j calculation remains the same as in RNAfold.

Aside from scaling the energy contributions as described above, CoFold generates its predicted structure using the same algorithm with which RNAfold calculates the minimum free energy structure. It stores the C′i,j and FMLi,j values in a two-dimensional dynamic programming matrix. Consequently, the recursive values must be computed only once, from shortest subsequences to the longest. This results in an efficient run time bound of O(N^3) and space bound of O(N^2). The output of CoFold is an RNA secondary structure which promotes local base pairs according to the γ(d) scaling function, and is therefore not the minimum free energy prediction. This RNA secondary structure captures both thermodynamic contributions as well as effects due to co-transcriptional structure formation. Neither CoFold nor RNAfold is technically capable of predicting pseudoknotted structures.
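The idea of scaling only the contribution added by a base pair, while leaving the already computed energy of the enclosed subsequence untouched, can be illustrated with a deliberately simplified toy. The sketch below is not the Zuker-Stiegler recursion and does not use the Turner model: it assumes a Nussinov-style recursion in which every canonical pair contributes a flat, hypothetical energy of −1, takes d as the number of nucleotides between the pair (j − i − 1), and uses illustrative names throughout.

```python
import math

# Canonical pairs as stated in the text: G-C, A-U, G-U.
CANONICAL = {"GC", "CG", "AU", "UA", "GU", "UG"}


def gamma(d, alpha, tau):
    """Scaling function of Equation (5)."""
    return alpha * (math.exp(-d / tau) - 1.0) + 1.0


def toy_fold_energy(seq, alpha=0.5, tau=640.0, pair_energy=-1.0, min_loop=3):
    """Minimum toy energy of seq under a distance-scaled pair score.

    Toy model (an assumption of this sketch): each canonical pair (i, j)
    contributes gamma(d) * pair_energy, with d = j - i - 1. The energy of
    the enclosed subsequence is added unscaled, so each contribution is
    scaled exactly once.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    # energy[i][j]: best toy energy of seq[i..j]; short spans stay at 0.
    energy = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # shortest subsequences first
        for i in range(n - span):
            j = i + span
            best = min(energy[i + 1][j], energy[i][j - 1])  # i or j unpaired
            if seq[i] + seq[j] in CANONICAL:
                d = j - i - 1                    # nucleotides between the pair
                scaled = gamma(d, alpha, tau) * pair_energy
                best = min(best, scaled + energy[i + 1][j - 1])
            for k in range(i + 1, j):            # split into two substructures
                best = min(best, energy[i][k] + energy[k + 1][j])
            energy[i][j] = best
    return energy[0][n - 1]


if __name__ == "__main__":
    print(toy_fold_energy("GGGAAACCC", alpha=0.0))            # unscaled toy energy
    print(toy_fold_energy("GGGAAACCC", alpha=0.5, tau=10.0))  # scaled toy energy
```

With α = 0 the scaling is switched off and the toy reduces to plain pair counting; with α > 0 the same pairs contribute less the further apart the two positions are. Like the algorithm described above, the table is filled from the shortest subsequences to the longest, giving O(N^3) time and O(N^2) space.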

Figure 5. Visualization of the indices involved in CoFold recursion formulas. The indices i and j indicate the bounds of the current subsequence Si,j. On the left, the hairpin case of Ci,j is illustrated where i and j denote the pairing positions immediately adjacent to the hairpin loop. On the right, the case of Ci,j defining base pair stacks, bulges, and internal loops is illustrated. The indices p and q indicate the position of the inner base pair with which i and j stack. For base pair stacks, p = i + 1, and q = j − 1. For bulges and internal loops, one or both of the statements p > i + 1 or q < j − 1 are true.

2.3 Implementation

RNAfold is a program written in the C programming language, and is made freely available from the ViennaRNA website (http://www.tbi.univie.ac.at/~ivo/RNA/) [50,69]. CoFold was first implemented as a modification of ViennaRNA 1.8, but was later ported to ViennaRNA 2.0 because this version provides functionality to specify the thermodynamic energy parameters. This version includes a file comprising all energy values for the parameter sets we used in our study: the Turner 1999 set [74], and the Andronescu 2007 set [5]. These parameter sets are introduced in section 1.3.2. This feature allowed us to easily run both RNAfold and CoFold using either parameter set. We refer to each of these combinations as described in table 1.

We apply two major changes to the RNAfold code: (1) command line arguments that accept the two parameters of the CoFold scaling function (α and τ), and (2) incorporation of the CoFold scaling function into the main recursion loop of the Zuker-Stiegler algorithm. This loop calculates the two energies (Ci,j and FMLi,j) for all valid subsequences Si,j of the input sequence S where i < j − 3.

The value of γ(d) is calculated once for each subsequence Si,j only where i and j form a canonical base pair, and thus imposes a fairly minor constant factor on the run time. The run time complexity therefore remains the same as in RNAfold: O(N^3). Additionally, we use the ’–dangle 1’ RNAfold option, which specifies that dangling energies are calculated optimally, and without coaxial stacking, and is the RNAfold default. The default RNAfold output includes the free energy value calculated for the minimum energy prediction. Because the γ(d) scaling function modifies free energy values, CoFold recalculates the free energy of the output structure using the same rules and thermodynamic model as RNAfold. CoFold source code will be made available at http://www.e-rna.org/cofold.

                                Turner parameters    Andronescu parameters
unmodified RNAfold algorithm    RNAfold              RNAfold-A
adjusted CoFold algorithm       CoFold               CoFold-A

Table 1. Definition of method names used throughout the text in terms of the underlying algorithm, and thermodynamic energy model

2.4 Compilation of the long and combined data sets

RNA secondary structure prediction methods suffer from particularly poor accuracy for longer sequences. In their 1996 study, Morgan and Higgs [86] hypothesize that this is due primarily to the disregard for kinetic effects rather than poor estimation of the free energy parameters. They suggest that, during transcription, local structures begin to form rapidly as the sequence emerges from the polymerase, and that formation of long-range base pairs requires disruption of these local structures. Longer sequences are more greatly affected by this kinetic effect simply because they require disruption of a larger number of base pairs, thus resulting in higher energy barriers. For this reason, we require a data set of long sequences that are biologically meaningful for performance evaluation of CoFold. Our goals in establishing a data set were to ensure (1) that there is strong support for the structure, preferably phylogenetic evidence, (2) that the structures are non-trivial and biologically important, (3) that the sequences are evolutionarily and functionally diverse, (4) that they are non-redundant, and (5) that they are sufficiently long.

Rfam is the leading source of phylogenetically supported RNA structure families [40], including non-coding RNAs, structured regions of mRNAs, and self-splicing RNAs. Each family is established with a seed alignment and consensus structure that is hand curated. Covariance models (described in section 1.4.1) are then used to search a collection of RNA sequences for additional putative members of the family, resulting in the so-called full alignment. As of June 2011, Rfam includes a total of 1973 RNA families. In general, Rfam contains high quality structural alignments. However, most alignments contain very short sequences. Various specialized databases provide RNA structures for specific classes of RNA molecules, but these also offer sequences that are too short for our purposes.
Gutell lab’s Comparative RNA Web site (CRW) [13] is the leading data source for ribosomal RNA structures. The CRW provides multiple sequence alignments of ribosomal RNA (rRNA) for various evolutionary clades, the most coarse of which include bacteria, eukaryotes, archaea, mitochondria, and chloroplasts. However, data quality is variable, as some sequences are partial, and some alignments contain a high proportion of gaps. Thus, a dedicated data set was compiled for performance evaluation of CoFold from the above data sources. To ensure high data quality and sequence diversity, we performed extensive data curation on sequences retrieved from the CRW and Rfam databases.

We focus specifically on very long sequences for the purposes of performance evaluation. Ribosomal RNAs are the longest and most complex RNA structures studied to date, and energy minimization methods for prediction of RNA secondary structure typically struggle in predicting them [86]. 16S rRNAs are roughly 1500 nucleotides in length, while 23S rRNAs are roughly 3000 nucleotides in length, making them substantially longer than the majority of annotated RNA structures currently available. For this reason, we constructed a highly curated data set of 16S and 23S ribosomal RNA sequences to evaluate the performance of CoFold.

In addition to specifically zooming into the longest RNA sequences available, we aimed to assemble a more comprehensive data set of shorter sequences. This data set would allow us to observe the performance of CoFold on sequences of varied length, and to perform parameter training of CoFold’s two free parameters (α and τ) on a biologically and functionally diverse data set. For this purpose we established a curated data set from the Rfam database including the longest, most complex, and phylogenetically supported RNA families.

2.4.1 The long data set

The long data set was established to investigate CoFold performance on the longest available annotated RNA structures: 16S and 23S ribosomal RNAs. Bacteria, eukaryote, archaea, and chloroplast multiple sequence alignments of 16S and 23S sequences were retrieved from the Comparative RNA Web site (CRW) [13]. The quality of the mitochondria alignments was too low, so these were excluded from the analysis. The remaining alignments were not annotated with a consensus RNA structure, but a subset of the individual ungapped sequences were annotated with a structure. These individual structures were each projected onto the alignments, and the structure with the lowest mismatch score (defined in Equation (9)) was chosen as the consensus RNA structure for each alignment. For the calculation of the mismatch score, base pairs with a gap in one base position, and a non-gap in the other are considered one-sided gaps. Base pairs with gaps on both sides are considered two-sided gaps. Non-canonical pairs are those other than G-C, A-U, G-U. The length of the alignment A is denoted by |A|. This function considers one-sided gaps as more highly deleterious than two-sided gaps. A one-sided gap results from addition of a bulge to the helix, while two-sided gaps result from lengthening or shortening a helix.

\mathrm{mismatch} := \frac{\sum_{seq \in A} \left( 2 \cdot (\#\ \text{one-sided gaps}) + (\#\ \text{two-sided gaps}) + (\#\ \text{non-canonical pairs}) \right)}{|A|} \qquad (9)

Sequences with large indels, many ambiguous nucleotides, or a poor fit to the consensus RNA structure were removed from the alignment using a combination of automated processing, and data visualization using the R4RNA R package [65]. This step ensured that all sequences in the resulting alignment strongly reflect the consensus structure, and that all alignments show a high degree of covariation indicating strong evolutionary support (see table A2 for covariation metrics). Unpaired regions of the alignment were realigned using MUSCLE [31]. Individual sequences were extracted from each resulting alignment such that no pair of extracted sequences has a pairwise percent sequence identity greater than an alignment-specific threshold. The exact threshold varies to ensure no biological class or evolutionary clade is over-represented in the long data set (max 85%, Table A1). Because no two sequences are similar, we guarantee that the long data set is highly diverse and without redundant sequences. The consensus alignment structure was projected onto each extracted sequence by removing base pairs at gap positions, and removing any non-canonical base pairs. The resulting 61 sequences have a mean sequence length of 2397 nucleotides. The long data set thus contains all annotated sequences longer than 1000 nucleotides that meet our quality criteria for uniqueness and evolutionary support. See table 2 for composition of the long dataset in terms of evolutionary clades and sequence length.
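The mismatch score of Equation (9) can be sketched as follows. This is an illustrative reading rather than the code used in the study: the function name, the representation of a structure as a list of alignment-column pairs, and the gap characters are assumptions, and |A| is taken literally as the alignment length in columns, following the wording of the text.

```python
# Canonical pairs as stated in the text: G-C, A-U, G-U.
CANONICAL = {"GC", "CG", "AU", "UA", "GU", "UG"}
GAPS = {"-", "."}  # assumed gap characters


def mismatch_score(alignment, pairs):
    """Mismatch of a candidate structure against an alignment (Equation (9)).

    alignment -- list of equal-length gapped sequences (strings)
    pairs     -- list of (i, j) alignment columns that are base paired
    One-sided gaps count 2; two-sided gaps and non-canonical pairs count 1.
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])  # |A|, read here as the number of columns
    total = 0
    for seq in alignment:
        for i, j in pairs:
            left_gap = seq[i] in GAPS
            right_gap = seq[j] in GAPS
            if left_gap and right_gap:
                total += 1                      # two-sided gap
            elif left_gap or right_gap:
                total += 2                      # one-sided gap
            elif (seq[i] + seq[j]).upper() not in CANONICAL:
                total += 1                      # non-canonical pair
    return total / length


if __name__ == "__main__":
    toy_alignment = ["GGAAACC", "G-AAACC", "GAAAAUC", "G-AAA-C", "GCAAACC"]
    print(mismatch_score(toy_alignment, [(0, 6), (1, 5)]))
```

Under this reading, each candidate structure would be scored against the alignment and the one with the lowest score kept as the consensus structure.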

2.4.2 The combined data set

In addition to investigating CoFold’s performance for long biological sequences, we also aimed to construct a more comprehensive data set to observe its behaviour on sequences with a wide variety of lengths and functions. Furthermore, it enables us to effectively train CoFold’s two simple free parameters on diverse biological sequences, and to ensure that accounting for a simple kinetic effect does not detrimentally impact the structural predictions of shorter sequences. Thus we constructed the combined dataset that includes both the long ribosomal RNAs described above and a wide array of well-supported structures from shorter sequences. For this, we turn to Rfam [40]. Because the families in Rfam vary greatly in phylogenetic support, structural complexity and sequence length, we aimed to choose a subset of families that meet our strict quality criteria:

• mean sequence length greater than 115 nucleotides

• covariation (defined in Equation (24)) greater than 0.18

• minimum of 5 sequences

• mean percentage of canonical base pairs greater than 80%

• diverse biological classes and evolutionary clades

Sequences were extracted from the Rfam alignments using the same protocol as for the CRW alignments described above. Specifically, no pair of sequences extracted from the same alignment share a pairwise percent sequence identity above an alignment-specific threshold (max 85%, Table A1). Consensus RNA structures were projected onto individual sequences by removing base pairs at gap positions, and by removing any non-canonical base pairs. The mean sequence length of the resulting 187 Rfam sequences is 247 nucleotides, and the combined dataset has an average sequence length of 776 nucleotides (Table 2).
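The projection step described above (dropping base pairs at gap positions and any non-canonical pairs) can be sketched as below. The function name, the pair-list representation of the consensus structure, and the gap characters are assumptions made for this illustration, not details given in the text.

```python
# Canonical pairs as stated in the text: G-C, A-U, G-U.
CANONICAL = {"GC", "CG", "AU", "UA", "GU", "UG"}
GAPS = {"-", "."}  # assumed gap characters


def project_structure(aligned_seq, consensus_pairs):
    """Project a consensus structure onto one aligned sequence.

    aligned_seq     -- gapped sequence taken from the alignment
    consensus_pairs -- list of (i, j) alignment columns that are paired
    Returns the ungapped sequence and the surviving pairs, renumbered to
    ungapped positions.
    """
    column_to_position = {}
    residues = []
    for column, char in enumerate(aligned_seq):
        if char not in GAPS:
            column_to_position[column] = len(residues)
            residues.append(char.upper())

    kept = []
    for i, j in consensus_pairs:
        if i not in column_to_position or j not in column_to_position:
            continue                            # pair touches a gap: removed
        pi, pj = column_to_position[i], column_to_position[j]
        if residues[pi] + residues[pj] in CANONICAL:
            kept.append((pi, pj))               # canonical pair: kept
    return "".join(residues), kept


if __name__ == "__main__":
    seq, pairs = project_structure("GG-CAAAGUC", [(0, 9), (1, 8), (2, 7), (3, 6)])
    print(seq, pairs)
```

In the small usage example, the pair at columns (2, 7) is dropped because column 2 is a gap, and the pair at columns (3, 6) is dropped because C-A is not canonical.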

                        long data set    combined data set
clade                   > 1000 nt        all (≤ 1000 nt)
Bacteria                15               69 (54)
Eukaryotes              15               112 (97)
Virus                   0                20 (20)
Archaea                 17               33 (16)
Chloroplast             14               14 (0)
sum                     61               248 (187)

sequence length (nt)
average                 2397             776 (247)
minimum                 1245             110 (110)
maximum                 3578             3578 (628)

Table 2. Composition of the long and combined datasets. The table shows the distribution of broad evolutionary clades to which the sequences belong, and the distribution of sequence lengths for each data set. Both data sets contain sequences from a wide variety of evolutionary groupings.

Appendix table A1 contains detailed information regarding biological classes of all alignments in the combined data set, as well as sequence extraction details (e.g. number of sequences extracted from each alignment, and max pairwise percent identity of extracted sequences). Appendix table A1 lists detailed alignment statistics (such as number of base pairs, average sequence length), and quality metrics (e.g. percentage canonical base pairs, percentage of gaps). Table 2 describes the composition of the two datasets in terms of sequence length and evolutionary clades. Both data sets contain sequences from all domains of life. Together with the strict pairwise percent identity threshold we demand of extracted sequences, this ensures that the data sets are diverse and non-redundant. While the Rfam sequences we curated are significantly shorter than the ribosomal RNAs, they are longer than the typical Rfam alignment, most of which are below 100 nucleotides in length. The average length of the extracted Rfam sequences is 247 nucleotides.

2.5 Parameter training

The CoFold scaling function γ(d) models the decay in reachability of potential base pairing partners as a function of the distance between them along the sequence (Equation (5)). The scaling function has only two free parameters: α and τ. Both parameters have a straightforward interpretation. The value of α specifies the range of the scaling function (e.g. when α is 0.2, the affected free energies will range from 80% to 100% of their original values). Thus, γ(d) guarantees that no energy contributions are scaled below 1 − α. The value of τ determines the rate of the exponential decay, where low values of τ result in a steep decay function.

The two parameters α and τ have an important relationship: both values affect the rate of decay of the exponential curve. α is a vertical scaling factor (i.e. a smaller value of α results in a curve that is compressed in the vertical dimension). τ scales the function in the horizontal dimension. Both increasing α and decreasing τ will result in a steeper γ(d) function. The key functional difference between the two parameters is that α adjusts the minimum value of the scaling function, while τ does not. Because of their close relationship, we expect a tight correlation between the values of α and τ. Thus certain (α, τ) parameter combinations result in a severe decay in reachability as a function of the distance between pairing partners, while others are more gentle.

We hypothesize that the effect of reachability could vary depending on context. The most obvious potential factor is the transcription speed of the polymerase, which varies significantly between distant evolutionary groups [92]. The rate is fastest in phages, followed by bacteria, and slowest in eukaryotes. When transcription is faster, we expect that local structures have less time to form while the molecule emerges from the polymerase, and thus reachability may be less impeded for long-range interactions. When transcription is slow, short range interactions take precedence as they have more time to form during transcription, and thus reachability is overall decreased.

2.5.1 Training procedure

We aimed to choose effective values for α and τ through learning from biological sequences. To do this, we employed an extremely simple training scheme. Complex parameter training methods that employ expectation maximization algorithms are typically used for models with a large number of free parameters that are difficult to directly interpret. However, CoFold relies on an extremely small number of free parameters, and thus a brute force scheme is computationally tractable. Furthermore, by training through manual enumeration of parameter combinations, we have greater freedom to investigate patterns in the resulting training data (e.g. the ability to produce figure 6).

We opted to use the combined data set for training purposes. Because there are only two free parameters in the CoFold model, the possibility of overfitting is extremely low. In general, the number of observations we make from the data vastly outnumbers the inferences of parameter values. Moreover, training only on the short sequences could miss important aspects of the signal, as they are not expected to be as strongly affected by the reachability of pairing partners. Cross validation was used to further mitigate the possibility of overfitting the parameters to the data. This involves a large number of randomized trials where the optimal parameters are determined for subsets of the data set. We investigate the pattern of parameter combinations that result from these trials.

Simple parameter training was performed by enumerating the performance metrics of CoFold over a search space of parameter combinations for all sequences of the combined data set. Performance metrics were calculated for each (α, τ) combination in set P (defined in Equation (10)). The Turner 1999 thermodynamic parameter set [74] was used for (α, τ) parameter training. We use the Matthews correlation coefficient (MCC) to measure the overall performance. This metric is defined and discussed in section 2.6.

P := \{0.05, 0.10, \ldots, 0.90, 0.95\} \times \{40, 80, \ldots, 1160, 1200\} \qquad (10)

\mathrm{MCC}^{S}_{\alpha,\tau} := \frac{\sum_{s \in S} \mathrm{MCC}^{s}_{\alpha,\tau}}{|S|} \qquad (11)

where (α, τ) ∈ P and S is the sequence set

S S ∆MCCα,τ := MCCα,τ − MCCRNAfold (12) Performance metrics were found to be highly correlated in α and τ (Figure 6 (right)). To demonstrate this, linear regression was performed on the ∆MCC matrix (Figure 6 (left)). We first compiled a set of th triples Q = {(α, τ, ∆MCCα,τ )}, for which ∆MCCα,τ is in the 97 quantile of the performance matrix. Weighted linear regression was performed with α and τ as dimensions, and ∆MCC as the weight. The regression line fits the data with an R2 value of 98.4%, indicating that variability in τ highly accounts for the variability in α. Regression line (solid) and its 95% confidence region (dotted) are plotted in Figure 6 (left). The values of α and τ are related by the linear equation α = m · τ + b, where the slope m is 6.1 · 10−4 ± 2 · 10−5 and the intercept b is 0.105 ± 0.016. As the estimated standard errors of the fit line’s slope and intercept are extremely low, and the R2 metric is close to 100%, we have very high confidence that near-optimal (α, τ) parameter combinations cluster along the linear fit line. This result agrees strongly with our expectation of the dependence between α and τ. Adjustments to one of the parameters can be roughly mitigated by tweaking the other. Thus anywhere along the linear fit line, the rate of decay of γ(d) is approximately equal. However, the minimum value of the scaling function will be different along the fit line, as this depends solely on α. This has an impact upon the performance of CoFold, as the performance markedly decreases when α is very small. Twenty trials of five-fold cross validation were performed to determine robustness of parameter train- ing. In each trial, the combined data set D is randomly divided into five partitions Di. For each partition, the optimal parameter combination is determined for the remaining sequences (Equation (13)). 
The cross validation results are plotted in Figure 6 (right), where the integer in each cell indicates the number of trials in which that parameter combination was optimal. The optimal parameter values cluster tightly around the linear regression line shown in Figure 6 (left). This confirms the reproducibility of (α, τ) parameter training.

$$(\alpha_{opt}, \tau_{opt}) := (\alpha, \tau) \ \text{s.t.}\ \Delta\mathrm{MCC}^{T}_{\alpha,\tau} = \max(\Delta\mathrm{MCC}^{T}) \tag{13}$$

where $T := D \setminus D_i$
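The grid of Equation (10) and the cross validation procedure around Equation (13) can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the function names are ours, and `toy_delta_mcc` is a hypothetical stand-in that fabricates a smooth performance surface with a single peak, since the real per-sequence MCC values are not reproduced here.

```python
import random

# Parameter grid P of Equation (10): 19 alpha values x 30 tau values.
ALPHAS = [round(0.05 * i, 2) for i in range(1, 20)]
TAUS = [40 * j for j in range(1, 31)]
GRID = [(alpha, tau) for alpha in ALPHAS for tau in TAUS]


def best_parameters(delta_mcc, sequences):
    """Grid point with the highest mean delta-MCC over `sequences` (Equation 13)."""
    return max(
        GRID,
        key=lambda p: sum(delta_mcc[s][p] for s in sequences) / len(sequences),
    )


def cross_validate(delta_mcc, n_trials=20, n_folds=5, seed=0):
    """Count how often each grid point is optimal on T = D \\ D_i."""
    rng = random.Random(seed)
    names = sorted(delta_mcc)
    counts = {}
    for _ in range(n_trials):
        shuffled = names[:]
        rng.shuffle(shuffled)
        for i in range(n_folds):
            held_out = set(shuffled[i::n_folds])  # partition D_i
            training = [s for s in shuffled if s not in held_out]
            winner = best_parameters(delta_mcc, training)
            counts[winner] = counts.get(winner, 0) + 1
    return counts


def toy_delta_mcc(n_sequences=10):
    """Hypothetical per-sequence results (NOT thesis data), peaked at (0.5, 640)."""
    table = {}
    for k in range(n_sequences):
        offset = 0.1 * k  # sequence-specific shift that does not move the optimum
        table[f"seq{k}"] = {
            (a, t): offset - 100 * (a - 0.5) ** 2 - ((t - 640) / 400) ** 2
            for a, t in GRID
        }
    return table


counts = cross_validate(toy_delta_mcc())
print(counts)
```

Because the toy surface has one peak shared by all sequences, every fold selects the same grid point. With real data the counts would spread over several cells, as in the right panel of Figure 6.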

The default parameter combination for CoFold is α = 0.5, τ = 640. This parameter set maximises MCC for the combined data set. The default parameter combination is marked with an "X" in Figure 6 (left), which shows that it lies directly on the linear regression line.
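As a quick arithmetic check, the reported fit line can be evaluated at the default τ. The sketch below uses only the slope, intercept, and standard errors quoted in the text; the variable and function names are ours.

```python
# Fit line reported in the text: alpha = m * tau + b.
SLOPE, SLOPE_SE = 6.1e-4, 2e-5
INTERCEPT, INTERCEPT_SE = 0.105, 0.016
ALPHA_STEP = 0.05  # grid spacing of alpha in Equation (10)


def alpha_on_fit_line(tau):
    """Value of alpha predicted by the regression line at a given tau."""
    return SLOPE * tau + INTERCEPT


default_alpha, default_tau = 0.5, 640
predicted_alpha = alpha_on_fit_line(default_tau)
residual = default_alpha - predicted_alpha

print(f"alpha on the fit line at tau = {default_tau}: {predicted_alpha:.4f}")
print(f"distance of the default alpha from the line: {residual:.4f}")
```

The line gives α ≈ 0.4954 at τ = 640, so the default α = 0.5 differs from it by about 0.005, which is well below both the grid spacing of 0.05 and the standard error of the intercept.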

[Figure 6: two heatmaps over the (α, τ) grid, with τ (80 to 1200) on the horizontal axis and α (0.1 to 0.9) on the vertical axis; colour legend: ∆MCC (%).]

Figure 6. Heatmap of performance measurements for all (α, τ) combinations. Each square on the heatmap represents one triple from Q = {(α, τ, ∆MCCα,τ)}, where τ is along the horizontal axis, and α is along the vertical axis. The colour of the square indicates the value of ∆MCCα,τ, the mean difference in MCC between CoFold and RNAfold for the combined data set, where white is highest, red is lowest, and gray is not plotted. The exact legend for heatmap cell colour values is provided at the bottom of the figure, which also displays a histogram of heatmap values as a cyan line. The left heatmap depicts the linear fit line (solid line) and its 95% confidence region (dotted lines). The X indicates the parameter combination that maximizes ∆MCCα,τ. The right heatmap shows cross validation results, where integers indicate the number of trials in which that parameter combination is optimal. It illustrates the agreement between the cross validation results and the optimal region of the heatmap.

2.5.2 Clade-specific parameter training

Because the transcription speeds in each domain of life are vastly different [92], the kinetic pathway may be affected differently in terms of reachability of distant base pairs. We explored this notion by partitioning the training set into clade-specific bins, and training a separate (α, τ) pair for each bin. As seen in table 2, this partitioning results in some bins with a very small number of sequences. We found that there was little statistical support for clade-specific differences between the trained parameters. Therefore, we decided to forego this analysis until a more extensive clade-specific data set is available.

Nonetheless, viruses stand out as an outlier in that the optimal parameter combination tends to have a slow rate of decay of reachability. This seems to agree with the hypothesis that fast transcription rate results in slower decay in reachability. However, the number of viral sequences examined was small, and the data set has a significant bias regarding viruses: there is no viral ribosomal RNA sequence, and thus no long viral sequences. The fact that we observed them as an outlier could be due simply to this bias in the data. This observation is inconclusive, but could be investigated further with a more extensive viral data set.

2.6 CoFold Performance

We measure structure prediction accuracy at the base pair level. Under this scheme, the presence or absence of a base pair is the elemental unit of measurement, and both pairing partners must be correct. One alternative is evaluation at the nucleotide level, which only considers whether a given nucleotide is predicted as paired or not, regardless of its predicted pairing partner. This scheme is very crude, and often does not accurately portray the true correspondence between predictions and reference structures. Yet another alternative considers entire helices as the basic units of evaluation. A helix in the known structure is typically considered correctly predicted when some percentage of its base pairs are included in the predicted structure, which allows for a margin of error. Helix-level evaluation tends to be plagued by subtle complicating factors, such as the definition of a helix and the choice of helix similarity cutoff. For this reason, we opt for the simple and easily interpretable base pair-level evaluation.

True positives (TP) are correctly predicted base pairs. False positives (FP) are incorrectly predicted base pairs that are not part of the reference structure. True negatives (TN) are hypothetical base pairs that are neither predicted nor part of the reference structure. False negatives (FN) are reference base pairs missed by the prediction. The performance metrics true positive rate (TPR), false positive rate (FPR), positive predictive value (PPV), and Matthews correlation coefficient (MCC) are defined as follows:

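$$\mathrm{TPR} = 100 \cdot \frac{TP}{TP + FN} \tag{14}$$

$$\mathrm{FPR} = 100 \cdot \frac{FP}{FP + TN} \tag{15}$$

$$\mathrm{PPV} = 100 \cdot \frac{TP}{TP + FP} \tag{16}$$

$$\mathrm{MCC} = 100 \cdot \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP) \cdot (TP + FN) \cdot (TN + FP) \cdot (TN + FN)}} \tag{17}$$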

$$\Delta\mathrm{TPR} = \mathrm{TPR}_{\mathrm{CoFold}} - \mathrm{TPR}_{\mathrm{RNAfold}} \tag{18}$$

$$\Delta\mathrm{FPR} = \mathrm{FPR}_{\mathrm{CoFold}} - \mathrm{FPR}_{\mathrm{RNAfold}} \tag{19}$$

$$\Delta\mathrm{PPV} = \mathrm{PPV}_{\mathrm{CoFold}} - \mathrm{PPV}_{\mathrm{RNAfold}} \tag{20}$$

$$\Delta\mathrm{MCC} = \mathrm{MCC}_{\mathrm{CoFold}} - \mathrm{MCC}_{\mathrm{RNAfold}} \tag{21}$$

True positive rate is a measurement of sensitivity, and indicates the proportion of reference base pairs that were predicted (e.g. if the method predicts 7 out of 10 known base pairs, TPR = 70%). False positive rate and positive predictive value are both measurements of specificity. FPR indicates the proportion of false positives among all potential pairs that are absent from the reference structure (e.g. in a sequence of length N = 50 with 15 reference pairs and 100 false positive predictions, there are N·(N−1)/2 = 1225 potential pairs, of which FP + TN = 1225 − 15 = 1210 are not in the reference, so FPR = 100 · 100/1210 ≈ 8.3%). PPV indicates the proportion of predictions that are correct (e.g. if the method predicts 20 base pairs, and 15 of those are in the known structure, PPV = 75%). Matthews correlation coefficient (MCC) is a measurement of overall prediction quality, taking into account both sensitivity and specificity. It can be interpreted as the overall correlation between the predicted base pairs and the known base pairs. TPR and PPV range from 0% to 100%, where a higher value indicates a better prediction. FPR also ranges from 0% to 100%, but lower values indicate a better prediction. MCC ranges from −100% to 100%, where a higher number indicates better correlation between predictions and known structures.

CoFold shows significantly improved performance for the long data set (see table 3). Using the Turner 1999 energy parameters, CoFold MCC is 49.1%, while RNAfold MCC is 42.8% (an increase of 6.3 percentage points). Using the Andronescu 2007 energy parameters, CoFold-A MCC is 53.7%, while RNAfold-A MCC is 48.2% (an increase of 5.5 percentage points). By applying both the Andronescu 2007 energy parameters and the CoFold scaling function, we achieve an MCC improvement of 10.9 percentage points, indicating that the effect of the Andronescu 2007 parameters (5.4 points) and the effect of CoFold (6.3 points) are almost additive.

CoFold takes into account the reachability of potential pairing partners during the transcription process.
The more distant two positions are, the more likely it is that local structures will block their pairing. We expect long sequences to be most strongly affected, due to the higher energy barriers that must be overcome to break the local structures. We thus expect to observe a much greater performance increase for long sequences by using CoFold, and the effect is indeed significantly more pronounced for the long data set.

As described in section 2.5, we used the combined data set to ensure CoFold does not result in decreased performance among short sequences. Table 3 indicates a 1.1 percentage point MCC increase (CoFold vs. RNAfold) and a 1.4 percentage point MCC increase (CoFold-A vs. RNAfold-A) for short sequences. Thus, applying the CoFold scaling function in fact results in a slight increase in performance for short sequences.

Not only is the average performance gain greater for long sequences, but we observe a performance improvement for almost every sequence in the long data set. Figure 7 illustrates the change in performance between RNAfold and CoFold for all long sequences, and shows that only four sequences out of 61 experience a significant decrease in performance. The improvement for the remaining sequences is due in roughly equal parts to improvement in true positive rate and positive predictive value. This result illustrates the wide applicability of CoFold to nearly all long sequences in our data set. Figure 8 illustrates the change in TPR and the change in FPR for all sequences in the long data set, and shows a minor decrease in FPR for the vast majority of sequences.

long data set          TPR (%)   FPR (%)   PPV (%)   MCC (%)
RNAfold                46.30     0.0176    39.74     42.81
RNAfold-A              52.02     0.0160    44.76     48.17
CoFold                 52.83     0.0159    45.79     49.10
CoFold-A               57.80     0.0145    50.06     53.70

combined data set      TPR (%)   FPR (%)   PPV (%)   MCC (%)
RNAfold                57.87     0.1132    45.27     50.86
RNAfold-A              58.98     0.1152    46.16     51.83
CoFold                 60.38     0.1097    47.56     53.26
CoFold-A               61.51     0.1112    48.42     54.22

short sequences only   TPR (%)   FPR (%)   PPV (%)   MCC (%)
RNAfold                61.64     0.1444    47.08     53.48
RNAfold-A              61.25     0.1475    46.61     53.02
CoFold                 62.84     0.1403    48.14     54.62
CoFold-A               62.72     0.1428    47.88     54.39

Table 3. Performance metrics for CoFold and RNAfold using both energy parameter sets. CoFold-A and RNAfold-A refer to the respective programs run with the Andronescu 2007 energy parameters. Each metric is defined in equations 14, 15, 16, and 17. Metrics are presented for the long data set (which shows the greatest performance improvement), the combined data set, and the short Rfam sequences.
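The base pair-level metrics of equations 14 to 17 can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the evaluation code used for Table 3: base pairs are represented as (i, j) tuples with i < j, every one of the N·(N−1)/2 position pairs counts as a potential pair, and the function names and the toy structures are ours.

```python
from math import sqrt


def confusion_counts(predicted, reference, length):
    """TP, FP, TN, FN at the base pair level for a sequence of `length` nt."""
    potential = length * (length - 1) // 2  # all position pairs i < j
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    tn = potential - tp - fp - fn
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """TPR, FPR, PPV and MCC in percent (equations 14 to 17)."""
    tpr = 100 * tp / (tp + fn) if tp + fn else 0.0
    fpr = 100 * fp / (fp + tn) if fp + tn else 0.0
    ppv = 100 * tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = 100 * (tp * tn - fp * fn) / denom if denom else 0.0
    return tpr, fpr, ppv, mcc


# Toy example (hypothetical structures): 10 reference pairs, of which the
# prediction recovers 7 and adds 3 pairs that are not in the reference.
reference = {(i, 49 - i) for i in range(10)}
predicted = {(i, 49 - i) for i in range(7)} | {(10, 20), (11, 19), (12, 18)}

tp, fp, tn, fn = confusion_counts(predicted, reference, length=50)
tpr, fpr, ppv, mcc = metrics(tp, fp, tn, fn)
print(f"TPR={tpr:.2f} FPR={fpr:.4f} PPV={ppv:.2f} MCC={mcc:.2f}")
```

Because true negatives vastly outnumber the other three counts for any realistic sequence length, FPR values are small in absolute terms, which is consistent with the magnitudes in the FPR column of Table 3.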

[Figure 7: scatter plot of ∆TPR (vertical axis) against ∆PPV (horizontal axis), both axes ranging from −30 to 30.]

Figure 7. Difference in predictive accuracy between CoFold and RNAfold. Each point shows the change in performance for one sequence in the long data set in terms of change in true positive rate (∆TPR) and change in positive predictive value (∆PPV). The metrics are defined in equations 18 and 20.

[Figure 8: scatter plot of ∆TPR (vertical axis, −30 to 30) against ∆FPR (horizontal axis, −0.04 to 0.04).]

Figure 8. Difference in predictive accuracy between CoFold and RNAfold. Each point shows the change in performance for one sequence in the long data set in terms of change in true positive rate (∆TPR) and change in false positive rate (∆FPR). The metrics are defined in equations 18 and 19.

2.7 Calculation of free energy differences

CoFold is guided by thermodynamics, as it finds a structure with a low overall Gibbs free energy, but it is also guided by kinetics via pairing partner reachability. If the method indeed returns a predicted structure that is stable, but is also feasible to fold kinetically, the free energy of the CoFold prediction must naturally be higher than the minimum free energy. This is simply the definition of a kinetically trapped structure. However, structures whose free energy is very high become physically unstable, and are less likely to remain in that state. This section explores the shift induced by the CoFold scaling function in the free energy of predicted structures according to the Turner 1999 model.

We define ∆∆G as the difference between the free energy (∆G) of a given prediction and that of the corresponding RNAfold minimum free energy prediction. We calculate these values for RNAfold-A, CoFold, and CoFold-A. Because the Andronescu 2007 parameters use modified free energy values, we use RNAeval from the ViennaRNA package [50,69] to calculate the free energy of each predicted structure on an equal footing (i.e. the free energy under the Turner 1999 model). Unlike RNAfold, which predicts a minimum free energy structure from a sequence, RNAeval calculates the free energy of an input RNA secondary structure according to the provided thermodynamic parameters. For a prediction program X, which corresponds to RNAfold-A, CoFold, or CoFold-A, we define the absolute energy shift ∆∆GX and the relative energy shift %∆∆GX as follows.

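$$\Delta\Delta G_X = \Delta G_X - \Delta G_{\mathrm{RNAfold}} \tag{22}$$

$$\%\Delta\Delta G_X = 100 \cdot \frac{\Delta G_X - \Delta G_{\mathrm{RNAfold}}}{|\Delta G_{\mathrm{RNAfold}}|} \tag{23}$$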
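Equations (22) and (23) amount to a two-line calculation once both free energies have been evaluated under the same energy model. The sketch below is illustrative: the function name is ours and the two energies are made-up values, not results from the thesis.

```python
def energy_shift(dg_x, dg_rnafold):
    """Absolute (kcal/mol) and relative (%) shift of equations (22) and (23)."""
    if dg_rnafold == 0:
        raise ValueError("relative shift is undefined when the MFE is zero")
    ddg = dg_x - dg_rnafold
    percent_ddg = 100 * ddg / abs(dg_rnafold)
    return ddg, percent_ddg


# Hypothetical energies in kcal/mol, both under the same thermodynamic model.
dg_rnafold = -1000.0  # minimum free energy structure
dg_x = -982.1         # structure predicted by program X

ddg, percent_ddg = energy_shift(dg_x, dg_rnafold)
print(f"ddG = {ddg:+.1f} kcal/mol, %ddG = {percent_ddg:.2f}%")
```

Dividing by the absolute value of the minimum free energy keeps the sign of the relative shift equal to the sign of the absolute shift, so a positive percentage always means a less stable structure than the RNAfold prediction.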

Summary of %∆∆G distributions

long data set (> 1000 nt)   av. ± stdev   min     max
RNAfold-A                   4.7 ± 1.9     1.4     11.1
CoFold                      1.8 ± 1.0     0.2     4.4
CoFold-A                    6.8 ± 2.4     1.7     13.1

combined data set (all)     av. ± stdev   min     max
RNAfold-A                   5.0 ± 3.5     -2.3    15.4
CoFold                      0.5 ± 1.2     -5.0    4.4
CoFold-A                    5.9 ± 3.8     -2.3    18.2

≤ 1000 nt                   av. ± stdev   min     max
RNAfold-A                   5.1 ± 3.9     -2.3    15.4
CoFold                      0.1 ± 0.9     -5.0    3.8
CoFold-A                    5.6 ± 4.1     -2.3    18.2

Table 4. Summary of relative free energy difference between predicted structures and the respective minimum free energy structures predicted by RNAfold.

The distribution of %∆∆G for each of RNAfold-A, CoFold, and CoFold-A is summarized in table 4 for all data sets, and illustrated for the long data set in figure 9. Detailed histograms of energy shifts are provided in appendix figure A1. Applying the CoFold scaling function results in free energies close to that of RNAfold (%∆∆G = 1.8% for the long data set). Using the Andronescu 2007 parameters in place of the Turner 1999 parameters results in a greater difference in free energy (%∆∆G = 4.7% for the long data set). In all cases, the effects of the Andronescu parameters and the CoFold scaling function are roughly additive. Moreover, CoFold causes a significantly greater energy shift in long sequences (%∆∆G = 1.8%) than in short sequences (%∆∆G = 0.1%), whose free energies remain mostly the same.

Figure 10 illustrates the distribution of absolute free energy differences in kcal/mol as defined in equation (22). CoFold-predicted structures differ from the MFE structure on average by +17.9 kcal/mol, CoFold-A-predicted structures differ by +64.0 kcal/mol, and RNAfold-A-predicted structures differ by +43.6 kcal/mol. For comparison, the average energy contribution from stacking interactions in 138 sequences extracted from Rfam is −2.12 kcal/mol. The average increase in energy associated with CoFold-predicted structures is therefore roughly equivalent to 8 stacking interactions (17.9 / 2.12 ≈ 8.4). Note that this approximation disregards other energy contributions, such as loops and bulges. Furthermore, these energy differences are for the long sequences, which have an average length of 2397 nucleotides; the 16S sequences contain roughly 450 known base pairs, while the 23S sequences contain roughly 850 known base pairs.

We investigated the relationship between %∆∆G and ∆MCC (i.e. the change in performance between RNAfold and CoFold) to determine whether the sequences with large performance increases experience a greater shift in free energy. Scatter plots of these two values are shown in appendix figure A2 for the long, short, and combined sequences, as well as for CoFold, CoFold-A, and RNAfold-A. No substantial correlation was found between %∆∆G and ∆MCC. The coefficient of determination (R2) measures the proportion of variability in the data that can be explained by the relationship (R2 = 1 indicates perfect correlation, while R2 = 0 indicates no correlation). All R2 values are less than 1%, except for CoFold with long and short sequences (see table 5).
With long sequences, for which R2 = 6.1%, the sequences with a large performance increase tend to have a slightly greater free energy shift. The opposite relationship is observed in short sequences, for which R2 = 4.4%: the sequences with a high performance increase tend to have less free energy shift. While this correlation is fairly minor, it tends to agree with our assumptions. The cases where CoFold causes the greatest improvement in predictive accuracy also tend to result in structures further from the minimum free energy. This agrees with the hypothesis of Morgan and Higgs [86]. They posit that inaccuracies in predicting the structures of long sequences occur because the molecule is unable to reach its minimum free energy structure due to kinetics. Further support for this claim lies in the fact that energy shifts for long sequences are higher than those for short sequences.

In summary, CoFold significantly improves the predictive performance without drastically shifting the free energy values of the predicted structures as calculated in the Turner 1999 model. The energies of CoFold-predicted structures remain close to the minimum free energy, and thus the structures are likely to be stable, but they are indeed consistently greater than the minimum free energy. Free energy differences are more pronounced in long sequences, where we expect more substantial kinetic effects. These data are consistent with our hypothesis that CoFold's performance improvements are a direct result of accounting for kinetic folding.
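For readers who want to reproduce this kind of analysis, the slope, intercept, and coefficient of determination of a simple linear fit can be computed as sketched below. This is an ordinary least-squares fit written out by hand; the function name and the four data points are ours and purely illustrative, and the sketch does not compute the standard deviations reported alongside the fits.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, with R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical (%ddG, dMCC) values for four sequences, not thesis data.
percent_ddg = [1.0, 2.0, 3.0, 4.0]
delta_mcc = [2.0, 1.0, 4.0, 3.0]

slope, intercept, r_squared = linear_fit(percent_ddg, delta_mcc)
print(f"slope={slope:.2f} intercept={intercept:.2f} R2={100 * r_squared:.1f}%")
```

An R2 of a few percent, as reported for CoFold on the long and short sequences, means that the fitted line explains only a small share of the variance in ∆MCC, which is why the text treats the trend as suggestive rather than conclusive.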

[Figure 9: box plot of %∆∆G (vertical axis, 0 to 12) for RNAfold-A, CoFold, and CoFold-A.]

Figure 9. Distribution of relative free energy difference between predicted structures and the respective RNAfold-predicted minimum free energy structures in the long data set. The metric is defined in equation 23. The distributions for RNAfold-A, CoFold, and CoFold-A are each shown. The solid line denotes the mean, and the boxed area indicates the region between the 1st and 3rd quartiles. Dotted lines delineate minimum and maximum values.

[Figure 10: box plot of ∆∆G in kcal/mol (vertical axis, 0 to 120) for RNAfold-A, CoFold, and CoFold-A.]

Figure 10. Distribution of absolute free energy difference (in kcal/mol) between predicted structures and the respective RNAfold-predicted minimum free energy structures in the long data set. The metric is defined in equation 22. The distributions for RNAfold-A, CoFold, and CoFold-A are each shown. The solid line denotes the mean, and the boxed area indicates the region between the 1st and 3rd quartiles. Dotted lines delineate minimum and maximum values.

Linear fit to ∆MCC versus %∆∆G distributions

long data set (> 1000 nt)          intercept ± stdev   slope ± stdev    R2 (%)
RNAfold-A                          7.0 ± 2.4           −0.34 ± 0.48     0.85
CoFold                             3.5 ± 1.6           1.52 ± 0.78      6.06
CoFold-A                           9.2 ± 3.1           0.25 ± 0.43      0.56

combined data set                  intercept ± stdev   slope ± stdev    R2 (%)
RNAfold-A                          1.0 ± 1.4           0.0008 ± 0.23    5.6 · 10^−6
CoFold                             2.1 ± 0.6           0.59 ± 0.47      0.64
CoFold-A                           2.1 ± 1.6           0.21 ± 0.23      0.34

short sequences only (≤ 1000 nt)   intercept ± stdev   slope ± stdev    R2 (%)
RNAfold-A                          −0.8 ± 1.6          0.06 ± 0.25      0.03
CoFold                             1.3 ± 0.7           −2.21 ± 0.75     4.44
CoFold-A                           0.7 ± 1.7           0.03 ± 0.25      0.01

Table 5. Details of the linear fits to the ∆ MCC versus % ∆∆G distributions. Intercept and slope of each fit line are shown with their respective standard deviations. The R2 is a measurement of the proportion of variability that one variable explains in the other. A value of R2 close to 100% indicates that the two variables describe each other perfectly, while 0% indicates that there is no relationship.

2.8 Case studies

In this section we explore a few interesting cases where CoFold makes a significant improvement in the RNA secondary structure prediction. We visually present the predictions of RNAfold and CoFold as diagrams generated with the R4RNA R package [65]. The horizontal axis represents the sequence positions along the RNA sequence, and each semi-circular arc represents a base pair. In the top portion of the figures we show the RNAfold prediction, and in the bottom portion the CoFold prediction. Correct base pairs are coloured in blue (true positives), incorrect base pairs in red (false positives), and missed base pairs in black (false negatives). Because neither RNAfold nor CoFold is technically capable of predicting pseudoknotted structures, we colour any pseudoknotted base pairs that are missed by the prediction in orange.

Figure 11 shows the RNAfold and CoFold predictions for the 23S ribosomal RNA of the gamma-proteobacterium Pseudomonas aeruginosa. The sequence is 2893 nucleotides in length. RNAfold predicts the reference structure with 43.1% MCC, while CoFold predicts it with 58.6% MCC (an increase of 15.5).

By comparison, the CoFold-A prediction has an MCC of 58.4%. The performance increase of CoFold is due to an almost equal improvement in true positive rate (increase of 15.1) and positive predictive value (increase of 15.9). RNAfold predicts many long-range base pairs, but the reference structure has very few. This effect contributes greatly to the low performance of RNAfold. Although CoFold tends to predict fewer long-range base pairs, it can still correctly predict a key long-range reference helix that spans nearly the entire sequence (see the large blue arc on the lower side of figure 11). This is because CoFold does not simply disregard long-range base pairs, but instead gives them less weight. Overall, CoFold greatly improves upon the RNAfold prediction by removing the many long-range base pairs that are unlikely to form.

Figure 12 shows the RNAfold and CoFold predictions for the 16S ribosomal RNA of the freshwater alga Cryptomonas sp. (species unknown). This sequence is 1493 nucleotides in length. RNAfold predicts the reference structure with 31.6% MCC, while CoFold predicts it with 50.5% MCC (an increase of 18.9). By using the Andronescu 2007 parameters for this sequence, we observe an even greater performance increase. RNAfold-A and CoFold-A achieve an MCC of 47.4% (increase of 15.8) and 72.7% (increase of 41.1), respectively. This demonstrates that the effects of CoFold and the Andronescu 2007 parameters are roughly additive in some cases (i.e., applying both effects results in greater performance than either method alone). The performance improves by removing many incorrect long-range base pairs.

These examples demonstrate the predictive power of CoFold. Although the scaling function underlying CoFold is extremely simple, it is capable of causing vast changes in the predicted RNA secondary structure.
An important pattern can be observed in these two case studies: RNAfold predictions have no particular preference for long-range or short-range pairs, but the known structures do have a strong tendency. They mostly consist of short-range base pairs, with a handful of long-range pairs filling in the gaps between them. This provides significant support for the hypothesis that reachability of pairing partners has a strong effect on RNA structural formation. RNAfold likely did determine a secondary structure with a lower free energy, which would thus be more stable if the molecule were able to reach this state. However, many of these base pairs involve positions with poor reachability, imposing insurmountable energy barriers. Simply by considering this phenomenon, much of the reference structure rises to the top of the search space. Thus the cases we present not only illustrate the significant performance improvements that CoFold can achieve, but also provide cogent evidence for its kinetic hypothesis.
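The contrast between short-range and long-range pairs described above can be quantified by the sequence distance j − i of each base pair (i, j). The following sketch is one simple way to do so; the function name, the cutoff of 100 nucleotides, and the two toy structures are assumptions made for illustration and are not taken from the thesis.

```python
def span_summary(pairs, long_range_cutoff):
    """Summarise sequence distances j - i for a collection of base pairs (i, j)."""
    spans = [j - i for i, j in pairs]
    n_long = sum(1 for d in spans if d > long_range_cutoff)
    return {
        "n_pairs": len(spans),
        "max_span": max(spans),
        "long_range_fraction": n_long / len(spans),
    }


CUTOFF = 100  # hypothetical threshold separating short-range from long-range

# Toy structures (hypothetical): the reference is mostly local, with one
# enclosing long-range pair, while the prediction favours distant partners.
reference = [(0, 400), (1, 10), (2, 9), (20, 30), (21, 29)]
predicted = [(0, 400), (1, 399), (2, 398), (20, 30), (50, 350)]

ref_summary = span_summary(reference, CUTOFF)
pred_summary = span_summary(predicted, CUTOFF)
print("reference:", ref_summary)
print("predicted:", pred_summary)
```

Comparing the long-range fraction of a prediction with that of its reference structure gives a single number for the tendency that the arc diagrams show visually.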

[Figure 11: arc diagram; horizontal axis: sequence position, 0 to 2500.]

Figure 11. RNA secondary structures predicted by RNAfold and CoFold-A for the 23S rRNA of the gamma-proteobacteria Pseudomonas aeruginosa. The horizontal line represents the positions along the RNA sequence of 2893 nucleotides in length. Arcs indicate a base pair between the corresponding positions along the sequence. The top portion of the diagram shows the RNAfold-predicted structure, and the bottom portion shows the CoFold-A-predicted structure. Blue arcs denote correctly predicted base pairs (true positives). Red arcs indicate incorrectly predicted base pairs (false positives). Black arcs denote known base pairs missed in the prediction (false negatives). Because neither method is technically capable of predicting pseudoknotted structures, we colour known pseudoknotted base pairs that are missed by the prediction in orange. Figure made using R-chie [65].

[Figure 12: arc diagram; horizontal axis: sequence position, 0 to 1400.]

Figure 12. RNA secondary structures predicted by RNAfold and CoFold-A for the 16S rRNA of the fresh-water algae Cryptomonas sp. (species unknown). The horizontal line represents the positions along the RNA sequence of 1493 nucleotides in length. Arcs indicate a base pair between the corresponding positions along the sequence. The top portion of the diagram shows the RNAfold-predicted structure, and the bottom portion shows the CoFold-A-predicted structure. Blue arcs denote correctly predicted base pairs (true positives). Red arcs indicate incorrectly predicted base pairs (false positives). Black arcs denote known base pairs missed in the prediction (false negatives). Because neither method is technically capable of predicting pseudoknotted structures, we colour known pseudoknotted base pairs that are missed by the prediction in orange. Figure made using R-chie [65].

2.9 CoFold web server

The CoFold web server is available at http://www.e-rna.org/cofold, where individual queries can be submitted online and the source code is available for download. The interface allows the user to select either the Andronescu 2007 or Turner 1999 thermodynamic parameters, and either default or custom-chosen values of the CoFold parameters α and τ. The result page provides the CoFold-predicted structure in dot bracket format, and allows the user to download it in a variety of formats, including connect, helix, or bpseq. Additionally, the result page provides a default arc plot illustrating the predicted structure (similar to those in figure 11 and figure 12). These diagrams represent the RNA sequence as a horizontal line, and base pairs as semi-circular arcs. The two ends of an arc indicate the two sequence positions involved in the base pair. The arc plots are generated using the freely available R4RNA R package [65]. The CoFold web server only provides a simple default plot, but arc plots can be generated online with extensive customization options at the R-chie website http://www.e-rna.org/r-chie.
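A structure in dot bracket format can be converted into an explicit list of base pairs with a simple stack, which is the form needed for arc plots or base pair-level evaluation. The sketch below assumes the plain notation with only `(`, `)`, and `.` characters and 1-based positions; the function name is ours, and it is not part of the web server or of R4RNA.

```python
def parse_dot_bracket(structure):
    """Return the base pairs (i, j), 1-based with i < j, of a dot bracket string."""
    stack = []
    pairs = []
    for position, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(position)  # opening partner waits for its match
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {position}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


pairs = parse_dot_bracket("((..)).")
print(pairs)
```

Each closing bracket pairs with the most recent unmatched opening bracket, so the notation can only express nested structures, which matches the fact that neither RNAfold nor CoFold predicts pseudoknots.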

3 Discussion

We have thus argued the crucial importance of kinetics and co-transcriptional folding to RNA secondary structure prediction, and the lack of existing methods that effectively take this into consideration. We proposed a novel method, CoFold, for deterministic RNA secondary structure prediction that is an adjustment of the currently prevailing free energy minimization paradigm. We capture one key effect of the co-transcriptional folding pathway by modeling the distance between pairing partners along the nucleotide sequence. By considering kinetic effects, we are able to greatly improve RNA secondary structure predictions for long sequences, and show that it is indeed feasible and beneficial to do so. CoFold outperforms free energy minimization alone, and the performance gains are comparable to those achieved by Andronescu et al. with their 2007 thermodynamic parameter set [5], which involved a readjustment of 363 free energy parameters. Furthermore, CoFold achieves performance on long sequences which is comparable to that on short sequences in our data set. CoFold's underlying model is simple and straightforward to interpret, relying on only two free parameters that were trained on a diverse set of 248 sequences of various function and length.

Importantly, the predicted structures that result from CoFold are not expected to be the minimum free energy conformation. The prediction is indeed guided in part by free energy minimization, and thus is expected to be a stable conformation, but it simultaneously considers the kinetic feasibility of the secondary structure. Existing methods, such as Sfold, explore structures that deviate from the minimum free energy. Sfold samples structures with probability equal to the estimated frequency of that conformation at thermodynamic equilibrium, using the McCaskill partition function and a specialized statistical sampling technique. It thus assumes a system at thermodynamic equilibrium.
We argue that this is not sufficient, because RNA molecules often become kinetically trapped, and the equilibrium assumption may not be justified in vivo.

In their 1996 study, Morgan and Higgs [86] investigated the estimated free energy of phylogenetically-supported reference structures and of structures predicted by energy minimization. The reference structures were found to have significantly higher free energy for longer sequences. They hypothesize that co-transcriptional folding involves successive formation of local structures as the molecule emerges from the polymerase. Transition toward long-range structures requires breaking the local base pairs, a change that may be very slow. Our results agree strongly with this finding, as CoFold is based on the reachability of distant base pairing positions. We achieve a substantial performance increase for long sequences, the case Morgan and Higgs show as having the greatest free energy discrepancy. Furthermore, we show that the free energy of CoFold-predicted structures is close to, but greater than, the minimum free energy value. We show that CoFold's performance increase is likely due to the domain folding effect of long RNAs hypothesized by Morgan and Higgs, thus providing significant evidence for this hypothesis.

We have demonstrated a significant improvement in predictive performance using a simple, computationally novel modification to the thermodynamic paradigm. Moreover, we have shown that ample experimental evidence supports the importance of kinetics for RNA structural formation. Consideration of the folding process is important in RNA secondary structure prediction, and we hope this may inspire other similar methods to explore prediction of kinetically favourable structures.

4 Future Work

In this thesis we have demonstrated that by capturing the effects of kinetic folding, we can significantly improve the performance of energy minimization methods. Importantly, we have shown that it is indeed feasible and beneficial to consider aspects of the process of structural formation itself, rather than just the end product. The method we present in this thesis captures only one relatively simple effect of the co-transcriptional folding pathway. We hope that it inspires further work to address more complex aspects of structural formation in the context of practical RNA secondary structure prediction.

There are many factors affecting structure formation during the co-transcriptional folding pathway (these are discussed in section 1.5.2). Transient RNA helices are thought to play a crucial role in guiding the RNA folding pathway by preventing or encouraging specific structural elements [79]. They have important 5' to 3' biases due to the directional nature of RNA synthesis. This effect has not yet been captured in an RNA secondary structure prediction method, and may help to guide predictions toward the true in vivo structure. Additionally, RNA-RNA and protein-RNA interactions have been shown to affect the RNA structural formation process. Modeling these effects is difficult due to the lack of information regarding interaction partners, and thus interactions are typically disregarded by structure prediction methods. A novel secondary structure prediction method could predict structures closer to those in vivo by compiling and capturing known interaction sites.

In section 2.5.2, we explored clade-specific parameter training for CoFold's two parameters α and τ. Due to significant differences in the transcription process between distant evolutionary groups (e.g. transcription speed), the pairing partner reachability effect may also be significantly different.
We endeavored to train separate parameters for each evolutionary subgroup, but concluded that there is not enough statistical power within each subgroup to draw confident conclusions about differences in the parameter combinations between evolutionary clades. However, further analysis of this aspect could reveal biologically interesting differences, and potentially further increase predictive performance. The main limitation preventing this analysis right now is the size of the current data set.

As CoFold is based on the Zuker-Stiegler algorithm, it is technically incapable of predicting pseudoknotted secondary structures. There are many previously published methods that can predict pseudoknots, but they incur significant computational costs (discussed in section 1.3.3). Many of these methods are based on the same thermodynamic assumptions as the Zuker-Stiegler algorithm, and thus could benefit from consideration of kinetic effects. The concept of pairing partner reachability, or even more complex related ideas, could be readily implemented in currently existing methods for prediction of pseudoknotted RNA structures.

References

1. P. Adams, M. Stahley, A. Kosek, J. Wang, and S. Strobel. Crystal structure of a self-splicing group I intron with both exons. Nature, 430(6995):45–50, Jul 2004.

2. H. M. Al-Hashimi and N. G. Walter. RNA dynamics: it is about time. Current Opinion in Structural Biology, 18:321–329, 2008.

3. S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.

4. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–3402, Sep 1997.

5. M. Andronescu, A. Condon, H. H. Hoos, D. H. Mathews, and K. P. Murphy. Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics, 23(13):I19–I28, Jul 1 2007. 15th Conference on Intelligent Systems for Molecular Biology/6th European Conference on Computational Biology, Vienna, Austria, Jul 21-25, 2007.

6. J. Bachellerie, J. Cavaille, and A. Hüttenhofer. The expanding snoRNA world. Biochimie, 84(8):775–790, 2002.

7. S. H. Bernhart, I. L. Hofacker, S. Will, A. R. Gruber, and P. F. Stadler. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9, Nov 2008.

8. P. Biswas, X. Jiang, A. Pacchia, J. Dougherty, and S. Peltz. The human immunodeficiency virus type 1 ribosomal frameshifting site is an invariant sequence determinant and an important target for antiviral therapy. Journal of Virology, 78(4):2082–2087, Feb 2004.

9. J. Boyle, G. Robillard, and S. Kim. Sequential folding of transfer RNA. A nuclear magnetic resonance study of successively longer tRNA fragments with a common 5’ end. Journal of Molecular Biology, 139:601–625, 1980.

10. S. Brehm and T. Cech. Fate of an intervening sequence ribonucleic-acid: excision and cyclization of the Tetrahymena ribosomal ribonucleic-acid intervening sequence in vivo. Biochemistry, 22(10):2390–2397, 1983.

11. P. Brion and E. Westhof. Hierarchy and dynamics of RNA folding. Annual Review of Biophysics and Biomolecular Structure, 26:113–137, 1997.

12. C. Brunel and P. Romby. Probing RNA structure and RNA-ligand complexes with chemical probes. In RNA-Ligand Interactions, Part B, volume 318 of Methods in Enzymology, pages 3–21. 2000.

13. J. Cannone, S. Subramanian, M. Schnare, J. Collett, L. D’Souza, Y. Du, B. Feng, N. Lin, L. Madabusi, K. Muller, N. Pande, Z. Shang, N. Yu, and R. Gutell. The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3:2, 2002.

14. T. R. Cech, A. J. Zaug, and P. Grabowski. In vitro splicing of the ribosomal RNA precursor of Tetrahymena: involvement of a guanosine nucleotide in the excision of the intervening sequence. Cell, 27:487–496, 1981.

15. C. Chan, C. Lawrence, and Y. Ding. Structure clustering features on the Sfold Web server. Bioinformatics, 21(20):3926–3928, Oct 2005.

16. M. Y. Chao, M. Kan, and S. Lin-Chao. RNAII transcribed by IPTG-induced T7 RNA polymerase is non-functional as a replication primer for ColE1-type plasmids in Escherichia coli. Nucleic Acids Research, 23:1691–1695, 1995.

17. S. Chauhan and S. Woodson. Tertiary interactions determine the accuracy of RNA folding. Journal of the American Chemical Society, 130(4):1296–1303, 2008.

18. A. Condon, B. Davy, B. Rastegari, F. Tarrant, and S. Zhao. Classifying RNA pseudoknotted structures. Theoretical Computer Science, 320(1):35–50, 2004.

19. C. Damgaard, E. Andersen, B. Knudsen, J. Gorodkin, and J. Kjems. RNA interactions in the 5’ region of the HIV-1 genome. Journal of Molecular Biology, 336(2):369–379, Feb 2004.

20. L. Danilova, D. Pervouchine, A. Favorov, and A. Mironov. RNAkinetics: a web server that models secondary structure kinetics of an elongating RNA. Journal of Bioinformatics and Computational Biology, 4(2):589–596, 2006.

21. K. Darty, A. Denise, and Y. Ponty. VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics, 25(15):1974–1975, Aug 2009.

22. Y. Ding, C. Chan, and C. Lawrence. Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Research, 32(2):W135–W141, Jul 2004.

23. Y. Ding and C. Lawrence. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Research, 31(24):7280–7301, Dec 2003.

24. R. Dirks and N. Pierce. A partition function algorithm for nucleic acid secondary structure including pseudoknots. Journal of Computational Chemistry, 24(13):1664–1677, 2003.

25. R. Dirks and N. Pierce. An algorithm for computing nucleic acid base-pairing probabilities including pseudoknots. Journal of Computational Chemistry, 25(10):1295–1304, 2004.

26. C. B. Do, D. A. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):E90–E98, Jul 2006.

27. R. D. Dowell and S. R. Eddy. Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 7:400, 2006.

28. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis. Cambridge University Press, 1998.

29. S. Eddy. Hidden Markov models. Current Opinion in Structural Biology, 6(3):361–365, Jun 1996.

30. S. R. Eddy and R. Durbin. RNA sequence analysis using covariance models. Nucleic Acids Research, 22(11):2079–2088, 1994.

31. R. Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–1797, Mar 2004.

32. D. Evans, S. Marquez, and N. Pace. RNase P: interface of the RNA and protein worlds. Trends in Biochemical Sciences, 31(6):331–341, June 2006.

33. J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol, 17(6):368–376, 1981.

34. C. Flamm, W. Fontana, I. L. Hofacker, and P. Schuster. RNA folding at elementary step resolution. RNA, 6(3):325–38, Mar 2000.

35. A. Frankel and J. Young. HIV-1: Fifteen proteins and an RNA. Annual Review of Biochemistry, 67:1–25, 1998.

36. B. Fuertig, J. Buck, V. Manoharan, W. Bermel, A. Jaeschke, P. Wenter, S. Pitsch, and H. Schwalbe. Time-resolved NMR studies of RNA folding. Biopolymers, 86(5-6):360–383, Aug 2007.

37. B. Furtig, C. Richter, J. Wohnert, and H. Schwalbe. NMR spectroscopy of RNA. Chembiochem, 4(10):936–962, Oct 2003.

38. M. Geis, C. Flamm, M. T. Wolfinger, A. Tanzer, I. L. Hofacker, M. Middendorf, C. Mandl, P. F. Stadler, and C. Thurner. Folding kinetics of large RNAs. Journal of Molecular Biology, 379(1):160–173, May 2008.

39. A. Giuliodori, F. D. Pietro, S. Marzi, B. Masquida, R. Wagner, P. Romby, C. Gualerzi, and C. Pon. The cspA mRNA Is a Thermosensor that Modulates Translation of the Cold-Shock Protein CspA. Molecular Cell, 37(1):21–33, 2010.

40. S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Research, 33:D121–D124, 2005.

41. A. Gultyaev, F. van Batenburg, and C. Pleij. The computer-simulation of RNA folding pathways using a genetic algorithm. Journal of Molecular Biology, 250(1):37–51, 1995.

42. C. Haslinger and P. F. Stadler. RNA structures with pseudo-knots: Graph-theoretical, combinatorial, and statistical properties. Bull. Math. Biol., 61:437–467, 1999.

43. J. H. Havgaard, R. B. Lyngsø, and J. Gorodkin. The FOLDALIGN web server for pairwise structural RNA alignment and mutual motif search. Nucleic Acids Research, 33:W650–W653, 2005.

44. J. H. Havgaard, R. B. Lyngsø, G. D. Stormo, and J. Gorodkin. Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics, 21:1815–1824, 2005.

45. S. Heilman-Miller and S. Woodson. Effect of transcription on folding of the Tetrahymena ribozyme. RNA, 9(6):722–733, 2003.

46. S. Heilman-Miller and S. Woodson. Perturbed folding kinetics of circularly permuted RNAs with altered topology. Journal of Molecular Biology, 328(2):385–394, 2003.

47. M. Hentze, M. Muckenthaler, and N. Andrews. Balancing acts: Molecular control of mammalian iron metabolism. Cell, 117(3):285–297, Apr 2004.

48. I. Hofacker. The Vienna RNA Secondary Structure Server. Nucleic Acids Research, 31:3429–3431, 2003.

49. I. Hofacker, M. Fekete, and P. Stadler. Secondary Structure Prediction for Aligned RNA Sequences. Journal of Molecular Biology, 319:1059–1066, 2002.

50. I. Hofacker, W. Fontana, P. Stadler, S. Bonhoeffer, M. Tacker, and P. Schuster. Fast Folding and Comparison of RNA Secondary Structures. Monatshefte für Chemie (Chemical Monthly), 125:167–188, 1994.

51. S. Holbrook and S. Kim. RNA crystallography. Biopolymers, 44(1):3–21, 1997.

52. I. Holmes. Accelerated Probabilistic Inference of RNA Structure Evolution. BMC Bioinformatics, 6:73, 2005.

53. D. S. Horowitz. The mechanism of the second step of pre-mRNA splicing. Wiley interdisciplinary reviews RNA, 3:331–350, 2012.

54. H. Isambert and E. D. Siggia. Modeling RNA folding paths with pseudoknots: application to hepatitis delta virus ribozyme. Proceedings of the National Academy of Science of the USA, 97(12):6515–20, Jun 2000.

55. Y. Ji, X. Xu, and G. Stormo. A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences. Bioinformatics, 20(10):1591–1602, 2004.

56. J. Johansson, P. Mandin, A. Renzoni, C. Chiaruttini, M. Springer, and P. Cossart. An RNA thermosensor controls expression of virulence genes in Listeria monocytogenes. Cell, 110(5):551–561, 2002.

57. C. Kelleher, M. Teixeira, K. Forstemann, and J. Lingner. Telomerase: biochemical considerations for enzyme and substrate. Trends in Biochemical Sciences, 27(11):572–579, Nov 2002.

58. S. Kim, G. Quigley, F. Suddath, and A. Rich. High-resolution x-ray diffraction patterns of crystalline transfer RNA that show helical regions. Proceedings of the National Academy of Science of the USA, 68(4):841–&, 1971.

59. S. Kim and A. Rich. Single crystals of transfer RNA - an x-ray diffraction study. Science, 162(3860):1381–&, 1968.

60. B. Knudsen and J. Hein. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics, 15(6):446–454, Jun 1999.

61. B. Knudsen and J. Hein. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Research, 31(13):3423–3428, 2003.

62. S. Koduvayur and S. Woodson. Intracellular folding of the Tetrahymena group I intron depends on exon sequence and promoter choice. RNA, 10(10):1526–1532, 2004.

63. F. Kramer and D. Mills. Secondary structure formation during RNA-synthesis. Nucleic Acids Research, 9(19):5109–5124, 1981.

64. M. Lagos-Quintana, R. Rauhut, W. Lendeckel, and T. Tuschl. Identification of novel genes coding for small expressed RNAs. Science, 294(5543):853–858, 2001.

65. D. Lai, J. R. Proctor, J. Y. Zhu, and I. M. Meyer. R-CHIE: a web server and R package for visualizing RNA secondary structures. Nucleic Acids Research, 40(12):e95, 2012.

66. R. Landick. RNA polymerase slides home: Pause and termination site recognition. Cell, 88(6):741–744, 1997.

67. B. Lewicki, T. Margus, J. Remme, and K. Nierhaus. Coupling of rRNA transcription and ribosomal assembly in vivo – formation of active ribosomal-subunits in Escherichia coli requires transcription of RNA genes by host RNA polymerase which cannot be replaced by T7 RNA polymerase. Journal of Molecular Biology, 231:581–593, 1993.

68. K. Liu, Y. Zhang, K. Severinov, A. Das, and M. Hanna. Role of Escherichia coli RNA polymerase alpha subunit in modulation of pausing, termination and anti-termination by the transcription elongation factor NusA. EMBO Journal, 15(1):150–161, 1996.

69. R. Lorenz, S. H. Bernhart, C. H. Z. Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, and I. L. Hofacker. ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6, Nov 24 2011.

70. E. Mahen, J. Harger, E. Calderon, and M. Fedor. Kinetics and thermodynamics make different contributions to RNA folding in vitro and in yeast. Molecular Cell, 19(1):27–37, 2005.

71. E. Mahen, P. Watson, J. Cottrell, and M. Fedor. mRNA Secondary Structures Fold Sequentially But Exchange Rapidly In Vivo. PLoS Biology, 8(2):e1000307, 2010.

72. D. Mathews, M. Disney, J. Childs, S. Schroeder, M. Zuker, and D. Turner. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences, 101(19):7287–7292, May 2004.

73. D. H. Mathews. Predicting a Set of Minimal Free Energy RNA Secondary Structures Common to Two Sequences. Bioinformatics, 21(10):2246–2253, 2005.

74. D. H. Mathews, J. Sabina, M. Zuker, and D. H. Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol, 288(5):911–940, May 1999.

75. D. H. Mathews and D. H. Turner. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology, 317(2):191–203, 2002.

76. J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29:1105–1119, 1990.

77. E. Merino, K. Wilkinson, J. Coughlan, and K. Weeks. RNA structure analysis at single nucleotide resolution by selective 2’-hydroxyl acylation and primer extension (SHAPE). Journal of the American Chemical Society, 127(12):4223–4231, Mar 2005.

78. I. Meyer. Predicting novel RNA-RNA interactions. Current Opinion in Structural Biology, 18(3):387–393, 2008.

79. I. M. Meyer and I. Miklós. Statistical evidence for conserved, local secondary structure in the coding regions of eukaryotic mRNAs and pre-mRNAs. Nucleic Acids Research, 33(19):6338–6348, 2005.

80. I. M. Meyer and I. Miklós. Simulfold: Simultaneously Inferring an RNA Structure Including Pseudo-Knots, a Multiple Sequence Alignment and an Evolutionary Tree Using a Bayesian Markov Chain Monte Carlo Framework. PLoS Computational Biology, 3(8):e149, 2007.

81. A. Mironov, L. Dyakonova, and A. Kister. A kinetic approach to the prediction of RNA secondary structures. Journal of Biomolecular Structure & Dynamics, 2(5):953–962, 1985.

82. A. Mironov and V. Lebedev. A kinetic model of RNA folding. Biosystems, 30(1-3):49–56, 1993.

83. G. Mohr, M. Caprara, Q. Guo, and A. Lambowitz. A tyrosyl-transfer-RNA synthetase can function similarly to an RNA structure in the Tetrahymena ribozyme. Nature, 370(6485):147–150, 1994.

84. G. Mohr, A. Zhang, J. Gianelos, M. Belfort, and A. Lambowitz. The Neurospora CYT-18 protein suppresses defects in the phage-T4 intron by stabilizing the catalytic active structure of the intron core. Cell, 69(3):483–494, 1992.

85. R. Mooney, I. Artsimovitch, and R. Landick. Information processing by RNA polymerase: Recognition of regulatory signals during RNA chain elongation. Journal of Bacteriology, 180(13):3265–3275, 1998.

86. S. Morgan and P. Higgs. Evidence for kinetic effects in the folding of large RNA molecules. Journal of Chemical Physics, 105(16):7152–7157, 1996.

87. F. Narberhaus. Translational control of bacterial heat shock and virulence genes by temperature-sensing mRNAs. RNA Biology, 7(1):84–89, 2010.

88. E. P. Nawrocki, D. L. Kolbe, and S. R. Eddy. Infernal 1.0: inference of RNA alignments. Bioinformatics, 25(10):1335–1337, May 2009.

89. E. Nudler and A. Mironov. The riboswitch control of bacterial metabolism. Trends Biochem. Sci., 29(1):11–17, 2004.

90. R. Nussinov and A. Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc. Natl. Acad. Sci. USA, 77:6309–6313, 1980.

91. T. Pan, X. Fang, and T. Sosnick. Pathway modulation, circular permutation and rapid RNA folding under kinetic control. Journal of Molecular Biology, 286(13):721–731, 1999.

92. T. Pan and T. Sosnick. RNA folding during transcription. Annual Review of Biophysics and Biomolecular Structure, 35:161–175, 2006.

93. M. Parisien and F. Major. The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data. Nature, 452(7183):51–55, 2008.

94. J. Pedersen, R. Forsberg, I. Meyer, and J. Hein. An evolutionary model for protein-coding regions with conserved RNA structure. Mol. Biol. Evol., 21(10):1913–1922, 2004.

95. J. Pedersen, I. Meyer, R. Forsberg, P. Simmonds, and J. Hein. A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res., 32(16):4925–4936, 2004.

96. O. Perriquet, H. Touzet, and M. Dauchet. Finding the common structure shared by two homologous RNAs. Bioinformatics, 19:108–116, 2003.

97. A. Perrotta and M. Been. A pseudoknot-like structure required for efficient self-cleavage of hepatitis delta-virus RNA. Nature, 350(6317):434–436, Apr 1991.

98. J. Reeder and R. Giegerich. Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5:104, 2004.

99. J. Reeder, P. Steffen, and R. Giegerich. pknotsRG: RNA pseudoknot folding including near-optimal structures and sliding windows. Nucleic Acids Research, 35(S):W320–W324, Jul 2007.

100. J. Ren, B. Rastegari, A. Condon, and H. Hoos. HotKnots: Heuristic prediction of RNA secondary structures including pseudoknots. RNA, 11(10):1494–1504, Oct 2005.

101. D. Repsilber, S. Wiese, M. Rachen, A. Schroder, D. Riesner, and G. Steger. Formation of metastable RNA structures by sequential folding during transcription: Time-resolved structural analysis of potato spindle tuber viroid (-)-stranded RNA by temperature-gradient gel electrophoresis. RNA, 5:574–584, 1999.

102. E. Rivas and S. R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol, 285(5):2053–2068, Feb 1999.

103. E. A. Rodland. Pseudoknots in RNA secondary structures: Representation, enumeration, and prevalence. Journal of Computational Biology, 13(6):1197–1213, Jul 2006.

104. J. Ruan, G. Stormo, and W. Zhang. An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots. Bioinformatics, 20(1):58–66, 2004.

105. L. Salmena, L. Poliseno, Y. Tay, L. Kats, and P. P. Pandolfi. A ceRNA hypothesis: the rosetta stone of a hidden RNA language? Cell, 146(3):353–358, Aug 2011.

106. D. Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal of Applied Mathematics, 45:810–825, 1985.

107. P. Schimmel, R. Giege, D. Moras, and S. Yokoyama. An operational RNA code for amino acids and possible relationship to genetic code. Proceedings of the National Academy of Science of the USA, 90(19):8763–8768, Oct 1993.

108. S. Silverman and T. Cech. Energetics and cooperativity of tertiary hydrogen bonds in RNA structure. Biochemistry, 38(27):8691–8702, Jul 1999.

109. D. Staple and S. Butcher. Pseudoknots: RNA structures with diverse functions. PLoS Biology, 3(6):956–959, Jun 2005.

110. J. Stombaugh, C. L. Zirbel, E. Westhof, and N. B. Leontis. Frequency and isostericity of RNA base pairs. Nucleic Acids Research, 37(7):2294–2312, Apr 2009.

111. Z. Sükösd, B. Knudsen, J. Kjems, and C. N. Pedersen. PPfold 3.0: fast RNA secondary structure prediction using phylogeny and auxiliary data. Bioinformatics, 2012.

112. Z. Sükösd, B. Knudsen, M. Vaerum, J. Kjems, and E. S. Andersen. Multithreaded comparative RNA secondary structure prediction using stochastic context-free grammars. BMC Bioinformatics, 12, Apr 2011.

113. F. Toulme, C. Mosrin-Huaman, I. Artsimovitch, and A. Rahmouni. Transcriptional pausing in vivo: A nascent RNA hairpin restricts lateral movements of RNA polymerase in both forward and reverse directions. Journal of Molecular Biology, 351(1):39–51, 2005.

114. H. Touzet and O. Perriquet. CARNAC: folding families of related RNAs. Nucleic Acids Research, 32:W142–145, 2004.

115. B. Tucker and R. Breaker. Riboswitches as versatile gene control elements. Current Opinion in Structural Biology, 15(3):342–348, 2005.

116. J. M. Watts, K. K. Dang, R. J. Gorelick, C. W. Leonard, J. W. Bess, Jr., R. Swanstrom, C. L. Burch, and K. M. Weeks. Architecture and secondary structure of an entire HIV-1 RNA genome. Nature, 460(7256):711–U87, Aug 2009.

117. K. M. Weeks. Protein-facilitated RNA folding. Current Opinion in Structural Biology, 7(3):336–342, 1997.

118. J. Wickiser, W. Winkler, R. Breaker, and D. Crothers. The speed of RNA transcription and metabolite binding kinetics operate an FMN riboswitch. Molecular Cell, 18(1):49–60, 2005.

119. W. Winkler. Riboswitches and the role of noncoding RNAs in bacterial metabolic control. Current Opinion in Chemical Biology, 9(6):594–602, 2005.

120. T. Wong, T. Sosnick, and T. Pan. Folding of noncoding RNAs during transcription facilitated by pausing-induced nonnative structures. Proceedings of the National Academy of Science of the USA, 104(46):17995–18000, 2007.

121. S. Wuchty, W. Fontana, I. Hofacker, and P. Schuster. Complete Suboptimal Folding of RNA and the Stability of Secondary Structures. Biopolymers, 49:145–165, 1999.

122. A. Xayaphoummine, T. Bucher, and H. Isambert. Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots. Nucleic Acids Res, 33(Web Server issue):W605–10, Jul 2005.

123. A. Xayaphoummine, T. Bucher, F. Thalmann, and H. Isambert. Prediction and statistics of pseudoknots in RNA structures using exactly clustered stochastic simulations. Proceedings of the National Academy of Science of the USA, 100(26):15310–5, Dec 2003.

124. M. Yusupov, G. Yusupova, A. Baucom, K. Lieberman, T. Earnest, J. Cate, and H. F. Noller. Crystal structure of the ribosome at 5.5 Å resolution. Science, 292(5518):883–896, 2001.

125. M. Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research, 31(13):3406–3415, July 2003.

126. M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamic and auxiliary information. Nucleic Acids Research, 9:133–148, 1981.

Appendices

Definition of covariation metric

For a given multiple-sequence alignment, the covariation is defined as:

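\[
\mathrm{covariation} = \frac{\sum_{a=1,\,b=1,\,a<b}^{M} \; \sum_{S_{ij}} \left( \Pi^{ab}_{ij} - \Omega^{ab}_{ij} \right) H(a_i a_j,\, b_i b_j)}{\binom{M}{2}\,\lvert S_{ij} \rvert} \tag{24}
\]

where: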

• $S_{ij}$ is the set of base pairs $(i, j)$ in the consensus secondary structure.

• $M$ is the number of sequences in the alignment.

• $H(a_i a_j, b_i b_j)$ is the Hamming distance between the strings $a_i a_j$ and $b_i b_j$, where $a_i$ denotes the symbol of sequence $a$ at alignment column $i$.

• $\Pi^{ab}_{ij}$ is an indicator function such that $\Pi^{ab}_{ij} = 1$ if $a_i$ and $a_j$ can form a canonical base-pair and $b_i$ and $b_j$ can also form a canonical base-pair (otherwise $\Pi^{ab}_{ij} = 0$).

• $\Omega^{ab}_{ij}$ is an indicator function such that $\Omega^{ab}_{ij} = 1$ if $a_i$ and $a_j$ and/or $b_i$ and $b_j$ cannot form a canonical base-pair (otherwise $\Omega^{ab}_{ij} = 0$).
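
The quantities defined above can be combined as in the following minimal Python sketch. It is an illustration and not the code used in this thesis. It assumes that the sum of signed Hamming distances is normalized by the number of sequence pairs and the number of consensus base pairs, and that the canonical pairs are A-U, G-C and G-U in either orientation. The names `covariation`, `hamming` and `CANONICAL` are hypothetical.

```python
from itertools import combinations

# Assumption: canonical pairs are Watson-Crick plus G-U wobble pairs.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))


def covariation(alignment, pairs):
    """Average signed Hamming distance over sequence pairs and base pairs.

    alignment: list of equal-length aligned sequences (M sequences).
    pairs: list of 0-based column pairs (i, j) of the consensus structure.
    """
    # Treat DNA-style T as U so that both alphabets are handled alike.
    seqs = [s.upper().replace("T", "U") for s in alignment]
    seq_pairs = list(combinations(seqs, 2))
    if not seq_pairs or not pairs:
        raise ValueError("need at least two sequences and one base pair")
    total = 0
    for a, b in seq_pairs:
        for i, j in pairs:
            pair_a, pair_b = a[i] + a[j], b[i] + b[j]
            h = hamming(pair_a, pair_b)
            if pair_a in CANONICAL and pair_b in CANONICAL:
                total += h  # Pi = 1: both sequences can pair
            else:
                total -= h  # Omega = 1: at least one cannot pair
    # Assumed normalization: per sequence pair and per base pair.
    return total / (len(seq_pairs) * len(pairs))
```

Under these assumptions the value lies between -2 and 2. Three sequences that all pair at columns 0 and 4 with different nucleotides, e.g. `covariation(["GAAAC", "AAAAU", "CAAAG"], [(0, 4)])`, give the maximum of 2.0.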

Data set statistics

Here we provide detailed statistics for each alignment used in construction of both the long and combined data sets (described in section 2.4). Table A1 provides detailed information regarding the sequences extracted from each alignment. Table A2 provides quality statistics.

Table A1. RNA families of the long and the combined data sets. All sequences of the long data set originate from alignments of the CRW data base (top), whereas the short sequences from the combined data set all originate from alignments of the Rfam data base (bottom). For each alignment, i.e. each row in this table, we specify the alignment length in nucleotides (A.len), the broad evolutionary clade of its sequences (clade; A: Archaea, B: Bacteria, C: Chloroplast, V: Virus, E: Eukaryotes), the number of sequences (N.seq), the data base (source) and the identifier in that data base (ID). We also specify, for each original alignment, how many sequences we extracted (N.ext) and their maximum pairwise percent identity in terms of primary sequence conservation (max. ppid). IRES stands for internal ribosomal entry site. Data set curation is described in section 2.4.

biological class              A.len (nt)  clade  N.seq  N.ext  max ppid (%)  source  ID
16S rRNA (archaea)            1545        A      40     8      85            CRW     16S archaea
23S rRNA (archaea)            3153        A      40     9      85            CRW     23S archaea
16S rRNA (bacteria)           1661        B      144    7      70            CRW     16S bacteria
23S rRNA (bacteria)           3046        B      40     8      85            CRW     23S bacteria
16S rRNA (chloroplast)        1558        C      40     5      85            CRW     16S chloroplast
23S rRNA (chloroplast)        3722        C      40     9      80            CRW     23S chloroplast
16S rRNA (eukaryote)          1867        E      40     7      85            CRW     16S eukaryote
23S rRNA (eukaryote)          4105        E      40     8      85            CRW     23S eukaryote
snRNA                         184         E      87     14     80            Rfam    RF00003
U2 spliceosomal RNA           270         E      181    10     50            Rfam    RF00004
Nuclear RNase P               622         E      77     11     45            Rfam    RF00009
snoRNA                        236         E      14     9      85            Rfam    RF01256
snoRNA                        394         E      4      1      85            Rfam    RF01267
snoRNA                        373         E      18     9      85            Rfam    RF01296
U4 spliceosomal RNA           273         E      160    11     50            Rfam    RF00015
U5 spliceosomal RNA           178         E      153    9      45            Rfam    RF00020
ciliate telomerase RNA comp.  270         E      19     11     80            Rfam    RF00025
RNase MRP                     903         E      40     12     50            Rfam    RF00030
RNase P                       511         B      88     8      60            Rfam    RF00011
CsrB RNA                      391         B      11     7      85            Rfam    RF00018
lysine riboswitch             232         B      37     14     65            Rfam    RF00168
Mg riboswitch - Ykok leader   216         B      85     14     65            Rfam    RF00380
Ornate extremophilic RNA      676         B      7      6      85            Rfam    RF01071
Pestivirus IRES               286         V      23     5      85            Rfam    RF00209
Tombusvirus 5’ UTR            180         V      7      5      85            Rfam    RF00171
Aphthovirus IRES              471         V      87     4      85            Rfam    RF00210
Cripavirus IRES               208         V      6      6      80            Rfam    RF00458
tRNA-like structures          137         V      5      5      80            Rfam    RF01084
Archaeal RNase P              533         A      25     16     80            Rfam    RF00373

Table A2. Alignment quality and phylogenetic support for the reference RNA secondary structures. For each alignment, i.e. each row in this table, we specify the alignment length in nucleotides (A.len), the average length of each non-gapped sequence in that alignment (av. seq. length), the average pairwise percent identity between pairs of sequences in the alignment in terms of primary sequence conservation (av. ppid), the average fraction of gaps per sequence in the alignment (gaps), the average fraction of ambiguous (not A,C,G,T,U,-) nucleotide symbols per sequence in the alignment (n), the number of base pairs in the reference RNA secondary structure for that alignment (bpairs), the average fraction of sequences in the alignment that have a canonical base-pair per reference base pair (canonical bpairs) and the covariation (covar.) as defined in Rfam [40] (defined above in equation 24), which measures how well the base pairs of the reference RNA secondary structure are supported by co-variation (higher values indicate stronger support).

alignment (ID)   A.len (nt)  av. seq. length (nt)  av. ppid (%)  gaps (%)  n (%)     bpairs  canonical bpairs (%)  covar.
16S archaea      1545        1477                  81.8          4.4       5·10^-7   458     95.2                  0.343
23S archaea      3153        2945                  74.9          6.6       6·10^-7   852     95.0                  0.408
16S bacteria     1661        1520                  76.7          8.5       2·10^-2   453     93.4                  0.352
23S bacteria     3046        2904                  79.2          4.6       6·10^-7   868     94.3                  0.358
16S chloroplast  1558        1490                  90.2          4.4       5·10^-7   440     93.9                  0.113
23S chloroplast  3722        2941                  74.8          21.0      0         869     90.1                  0.253
16S eukaryote    1867        1708                  73.3          8.5       1·10^-7   370     84.3                  0.162
23S eukaryote    4105        3476                  78.7          15.3      1·10^-7   998     88.1                  0.084
RF00003          184         162                   63.1          11.8      0         40      93.2                  0.493
RF00004          270         193                   58.4          28.4      1·10^-2   45      92.8                  0.496
RF00009          622         315                   40.7          49.3      8·10^-5   62      89.3                  0.397
RF01256          236         208                   60.6          11.9      0         54      92.7                  0.457
RF01267          394         384                   72.5          2.6       0         128     94.3                  0.295
RF01296          373         325                   63.7          13.0      0         57      91.4                  0.339
RF00015          273         147                   52.2          46.2      5·10^-5   31      91.5                  0.604
RF00020          178         117                   51.7          34.1      4·10^-5   30      94.0                  0.694
RF00025          270         186                   42.5          31.1      0         39      86.4                  0.395
RF00030          903         303                   34.7          66.5      0         74      88.3                  0.470
RF00011          511         373                   63.0          27.1      0         105     95.1                  0.500
RF00018          391         350                   62.4          10.5      0         49      96.8                  0.368
RF00168          232         183                   46.1          21.2      0         53      90.3                  0.580
RF00380          216         170                   59.6          21.4      0         47      94.5                  0.471
RF01071          676         609                   59.9          9.9       0         159     90.2                  0.378
RF00209          286         275                   89.2          3.9       0         75      98.8                  0.191
RF00171          180         159                   67.3          11.4      0         34      97.9                  0.403
RF00210          471         461                   85.4          2.1       0         122     98.3                  0.181
RF00458          208         201                   55.4          3.5       0         60      95.0                  0.757
RF01084          137         128                   51.0          6.9       0         43      97.2                  0.795
RF00373          533         311                   49.4          41.7      0         87      90.0                  0.537
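
As an illustration of how some of the per-alignment statistics in Table A2 could be derived from a gapped alignment, the following Python sketch computes the alignment length, the average ungapped sequence length, the average pairwise percent identity and the average gap percentage. The exact conventions are not spelled out in this appendix, so the sketch assumes that identity is measured only over columns in which neither sequence has a gap and that `-` is the only gap symbol. The function names and dictionary keys are hypothetical.

```python
from itertools import combinations

GAP = "-"  # assumption: '-' is the only gap symbol


def pairwise_identity(a, b):
    """Percent identity over columns where neither sequence has a gap."""
    cols = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not cols:
        return 0.0
    return 100.0 * sum(x == y for x, y in cols) / len(cols)


def alignment_stats(alignment):
    """Summary statistics for a list of equal-length aligned sequences."""
    if len(alignment) < 2:
        raise ValueError("need at least two aligned sequences")
    n_cols = len(alignment[0])
    # One identity value per unordered pair of sequences.
    ppids = [pairwise_identity(a, b) for a, b in combinations(alignment, 2)]
    # Percentage of gap symbols and ungapped length, per sequence.
    gaps = [100.0 * s.count(GAP) / n_cols for s in alignment]
    lengths = [n_cols - s.count(GAP) for s in alignment]
    return {
        "A.len": n_cols,
        "av_seq_length": sum(lengths) / len(lengths),
        "av_ppid": sum(ppids) / len(ppids),
        "gaps": sum(gaps) / len(gaps),
    }
```

Other identity conventions (for example, dividing by the full alignment length) would give different values, so the sketch should be adapted before comparing against the numbers in the table.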

Detailed free energy difference distribution

[Figure A1: histograms arranged in a 3 × 3 grid. Columns: long data set, combined data set, short sequences only. Rows: RNAfold-A, CoFold, CoFold-A. Horizontal axis: %∆∆G.]

Figure A1. Distributions of relative free energy difference between predicted structures and the respective minimum free energy structures predicted by RNAfold (the MFE structures). Results are shown for the long data set (left column), the combined data set (middle column) and the short sequences of the combined data set (right column). For each data set, three histograms show the relative free energy differences between the RNAfold-A-predicted structures and the RNAfold MFE structures (top row), between the CoFold-predicted structures and the RNAfold MFE structures (middle row), and between the CoFold-A-predicted structures and the RNAfold MFE structures (bottom row). The free energies of all structures are calculated using the Turner 1999 energy parameters.
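
To make the horizontal axis of these histograms concrete, the relative free energy difference can be written as below. The definition is not restated in this appendix, so the formula is an assumed form in which a predicted structure that is less stable than the MFE structure receives a positive value. The example energies are hypothetical.

```latex
% Assumed definition: relative free energy difference of a predicted
% structure s with respect to the RNAfold MFE structure.
\[
\%\Delta\Delta G(s) = 100 \cdot
  \frac{\Delta G(s) - \Delta G_{\mathrm{MFE}}}{\lvert \Delta G_{\mathrm{MFE}} \rvert}
\]
% Hypothetical example: \Delta G_{\mathrm{MFE}} = -500 kcal/mol and
% \Delta G(s) = -480 kcal/mol.
\[
\%\Delta\Delta G(s) = 100 \cdot \frac{-480 - (-500)}{\lvert -500 \rvert} = 4
\]
```

Under this assumed form, a value of 4 means that the predicted structure is 4% less stable than the MFE structure when both are evaluated with the same energy parameters.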

[Figure A2: scatter plots arranged in a 3 × 3 grid. Columns: long data set, combined data set, short sequences only. Rows: RNAfold-A, CoFold, CoFold-A. Horizontal axis: %∆∆G; vertical axis: %∆MCC.]

Figure A2. Differences in prediction accuracy versus relative free energy changes between the predicted structures and the MFE structures predicted by RNAfold. Results are shown for the long data set (left column), the combined data set (middle column) and the short sequences of the combined data set (right column). For each data set, three figures show the change in prediction accuracy in terms of MCC versus the relative free energy change (1) between the RNAfold-A-predicted structures and the RNAfold-predicted structures, (2) between the CoFold-predicted structures and the RNAfold-predicted structures, and (3) between the CoFold-A-predicted structures and the RNAfold-predicted structures. The free energies of all structures are calculated using the Turner 1999 energy parameters. Linear regression lines are shown in blue with their 95% confidence regions (dotted).
