CoFold: an RNA structure prediction method that takes co-transcriptional folding into account
by
JEFFREY RYAN PROCTOR
BSEng, The University of Victoria, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
in
THE FACULTY OF GRADUATE STUDIES (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
September, 2012
© Jeffrey Ryan Proctor, 2012

Abstract

RNA has a diverse array of cellular functions, and relies on molecular structure to carry them out. The vast majority of current methods for prediction of RNA secondary structure (i.e. the set of base pairs in the molecule) consider the minimum free energy structure (i.e. the most thermodynamically stable structure), and thus disregard the process of structural formation. There exists substantial evidence that the process of structure formation is important, and that it does impact the resulting functional RNA structure. Several methods currently exist that explicitly simulate the kinetic folding pathway as a time-ordered sequence of structural changes. However, these methods not only suffer from a long list of limiting assumptions about the cellular environment, but also are restricted to short sequences. In this thesis, we explore the idea of capturing the effects of kinetic folding rather than simulating in detail the process over time, and propose that accounting for kinetic effects of structural formation is crucial to further improve non-comparative RNA secondary structure prediction methods.

During transcription, RNA structure begins to form immediately as the molecule emerges from the polymerase (i.e. co-transcriptionally). Long-range base pairs suffer a disadvantage during this process, as quickly forming short-range base pairs act to block their formation (i.e. due to kinetic barriers). We propose a novel method, CoFold, that captures the reachability of potential pairing partners during co-transcriptional folding.
We show that it significantly improves prediction accuracy over free energy minimization alone, particularly for long sequences. CoFold depends on only two free parameters that are highly correlated, and we demonstrate robust training. Furthermore, the resulting structures predicted by CoFold have a free energy measurement that does not significantly differ from that of the respective RNAfold prediction, indicating that they are indeed stable secondary structures. We propose that consideration of kinetics in RNA secondary structure prediction is crucial, and we hope that this work encourages further exploration of its effect on biological RNA structure.

Contents

Abstract
Contents
List of Figures
List of Tables
Acknowledgements
1 Introduction
  1.1 RNA secondary structure
    1.1.1 Secondary vs. tertiary structure
    1.1.2 Components of RNA structure
  1.2 Experimental determination of RNA structure
    1.2.1 X-ray crystallography
    1.2.2 Nuclear magnetic resonance spectroscopy
    1.2.3 Chemical and enzymatic probing
  1.3 Computational non-comparative RNA Secondary Structure Prediction
    1.3.1 Nussinov algorithm
    1.3.2 Zuker-Stiegler algorithm
    1.3.3 Prediction of pseudoknotted structures
    1.3.4 Suboptimal folding
    1.3.5 Equilibrium partition function
    1.3.6 Sfold
    1.3.7 Prediction guided by chemical probing data
    1.3.8 CONTRAfold
  1.4 Computational comparative RNA Secondary Structure Prediction
    1.4.1 Covariance models
    1.4.2 Pfold
    1.4.3 RNA-Decoder
    1.4.4 RNAalifold
    1.4.5 Alignment-free methods
  1.5 Thermodynamics alone does not tell the whole story
    1.5.1 Importance of kinetics
    1.5.2 Co-transcriptional Folding
    1.5.3 Kinetic folding methods
  1.6 Goals of this thesis
2 CoFold
  2.1 Motivation
  2.2 CoFold Algorithm
  2.3 Implementation
  2.4 Compilation of the long and combined data sets
    2.4.1 The long data set
    2.4.2 The combined data set
  2.5 Parameter training
    2.5.1 Training procedure
    2.5.2 Clade-specific parameter training
  2.6 CoFold Performance
  2.7 Calculation of free energy differences
  2.8 Case studies
  2.9 CoFold web server
3 Discussion
4 Future Work
References
Appendices
  Definition of covariation metric
  Data set statistics
  Detailed free energy difference distribution

List of Figures

1 RNA secondary structure components
2 Diagram of an RNA pseudoknot
3 Example of base pair covariation
4 Scaling function γ of CoFold
5 Visualization of the indices involved in CoFold recursion formulas
6 Heatmap of performance measurements for all (α, τ) combinations
7 Difference in predictive accuracy between CoFold and RNAfold (TPR vs PPV)
8 Difference in predictive accuracy between CoFold and RNAfold (TPR vs FPR)
9 Distribution of relative free energy difference between predicted structures and the respective RNAfold-predicted minimum free energy structures in the long data set
10 Distribution of absolute free energy difference (in kcal/mol) between predicted structures and the respective RNAfold-predicted minimum free energy structures in the long data set
11 RNA secondary structures predicted by RNAfold and CoFold-A for the 23S rRNA of the gamma-proteobacteria Pseudomonas aeruginosa
12 RNA secondary structures predicted by RNAfold and CoFold-A for the 16S rRNA of the fresh-water algae Cryptomonas sp. (species unknown)
A1 Distributions of relative free energy difference between predicted structures and the respective minimum free energy structures predicted by RNAfold
A2 Differences in prediction accuracy versus relative free energy changes between the predicted structures and the MFE structures predicted by RNAfold

List of Tables

1 Definition of method names used throughout the text
2 Composition of the long and combined datasets
3 Performance metrics for CoFold and RNAfold
4 Summary of relative free energy difference between predicted structures and the respective minimum free energy structures predicted by RNAfold
5 Details of the linear fits to the ∆MCC versus %∆∆G distributions
A1 RNA families of the long and the combined data sets
A2 Alignment quality and phylogenetic support for the reference RNA secondary structures

Acknowledgements

I would like to thank my supervisor, Irmtraud Meyer, as well as my committee members, Joerg Gsponer and Anne Condon. I would also like to thank all the members of the Meyer group, particularly Daniel Lai for the invaluable R code he has written as part of the R4RNA package, and Adi Steif for immensely helpful advice regarding statistics. My sources of funding include the MSFHR/CIHR Bioinformatics Training Program, and the Natural Sciences and Engineering Research Council (NSERC) Canada Graduate Scholarship.
1 Introduction

1.1 RNA secondary structure

Ribonucleic acids (RNA) perform a wide variety of essential and well-defined roles in the cell. Whether protein-coding mRNAs or the myriad non-coding RNAs, RNA molecules exert their function by assuming structural features. Transfer RNA (tRNA), ribosomal RNA (rRNA), and messenger RNA (mRNA) act in concert to carry out faithful translation of the genetic code. The structure of transfer RNA is vital not only for interaction with codons within messenger RNA, but also for robust recognition by aminoacyl-tRNA synthetases, the enzymes that attach the appropriate amino acid to each tRNA [107]. The complex structure of ribosomal RNAs not only provides the structural scaffolding of the ribosome, but is also responsible for its catalytic peptidyl transferase activity [124].

The untranslated regions (UTRs) at the ends of messenger RNAs are repositories of RNA structural features responsible for regulation of the respective protein product. Riboswitches are often found in the UTRs of metabolic genes, where they modulate expression by directly binding small metabolites [89, 115, 119]. Conserved hairpins in the UTRs of genes involved in iron metabolism are essential for regulation of cellular iron levels in mammals [47]. MicroRNA response elements, commonly found in UTRs of protein-coding genes, are short sequences with near-perfect complementarity to microRNAs, short RNA molecules which cause gene repression upon binding [105].

In the 1980s, it was discovered that RNA, in addition to proteins, can play a catalytic role. For instance, group I and II introns contain RNA structure that catalyzes its own excision from the transcript [10, 14]. Ribonuclease P (RNase P) is a catalytic RNA that cleaves tRNA precursors during their maturation pathway [32]. Viral genomes have been found to depend highly on RNA structural features, ostensibly due to the