RNA Regulatory Elements Play a Significant Role in Gene
Total Page:16
File Type:pdf, Size:1020Kb
Ab initio IDENTIFICATION OF REGULATORY RNAS USING INFORMATION-THEORETIC UNCERTAINTY by AMIRHOSSEIN MANZOUROLAJDAD (Under the Direction of Dr. Jonathan Arnold) ABSTRACT RNA regulatory elements play a significant role in gene regulation. Riboswitches are regulatory elements which function by forming a ligand-induced alternative fold that controls access to ribosome binding sites or other regulatory sites in RNA. Traditionally, riboswitches have been identified based on sequence and structural homology. In this work, in an attempt to devise an ab initio method for identification of regulatory elements, mainly riboswitches, we derive and implement Shannon’s entropy of the SCFG ensemble on an RNA sequence in polynomial time for both structurally ambiguous and unambiguous grammars. We then evaluate the significance of this new measure of structural entropy in identifying riboswitches. Finally, sim- ple lightweight stochastic context-free grammar folding models assign significant values to long extensive secondary structures in Bacillus subtilis. INDEX WORDS: Riboswitch, RNA Secondary Structure, Stochastic Context-free Grammars, Information Theory Ab initio IDENTIFICATION OF REGULATORY RNAS USING INFORMATION-THEORETIC UNCERTAINTY by AMIRHOSSEIN MANZOUROLAJDAD BS, The University of Guilan, Rasht, Iran, 2004 MS, Isfahan University of Technology, Isfahan, Iran, 2007 A Dissertation Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY ATHENS,GEORGIA 2014 ⃝c 2014 Amirhossein Manzourolajdad All Rights Reserved Ab initio IDENTIFICATION OF REGULATORY RNAS USING INFORMATION-THEORETIC UNCERTAINTY by AMIRHOSSEIN MANZOUROLAJDAD Approved: Major Professor: Jonathan Arnold Committee: Sidney Kushner Russell L. Malmberg Jan Mrazek Electronic Version Approved: Maureen Grasso Dean of the Graduate School The University of Georgia May 2014 I dedicate this work to my mother, Roya iv ACKNOWLEGEMENTS I will be eternally indebted to my major advisor Dr. Jonathan Arnold and also to my committee—Dr. Sidney Kushner, Dr. Russell L. Malmberg, and Dr. Jan Mrazek—without whom this work would not have been possible. I would like to thank Dr. Jamal Golestani whose information theory course at Isfahan Univ. of Technology first inspired me to explore the field. I also would like to thank my first algorithm coach Abolfazl Javan. I would like to thank all my fellow colleagues and the professors at the University of Georgia whom I met and learned a great deal from in various labs. Finally, I would like to acknowledge my family and friends for standing beside me throughout this research. v Contents 1 Introduction and Literature Review 1 1.1 Riboswitches . 2 1.2 Structural Homology . 2 1.3 Ab initio Identification of Riboswitches . 3 2 Information-Theoretic Uncertainty of SCFG-Modeled Folding Space of The Non-coding RNA 6 2.1 Abstract . 7 2.2 Introduction . 7 2.3 Stochastic Context-Free Grammar (SCFG) ensemble of RNA secondary structure . 10 2.4 Structural entropy over SCFG ensembles . 11 2.5 Evaluating the Structural Entropy of ncRNAs . 14 2.6 Methodology . 15 2.7 Results . 21 2.8 Discussion . 34 2.9 Conclusion . 44 3 Ab initio Riboswitch Identification Based on The Secondary Structure Folding Space 45 3.1 Introduction . 45 3.2 Results . 50 3.3 Discussion . 67 3.4 conclusion . 72 3.5 Materials and Methods . 72 vi 4 Conclusion 77 4.1 Future Work . 80 Appendix A Structural Entropy Derivations 91 A.1 Structural Entropy of Structurally Ambiguous Non-stacking Grammars . 91 A.2 Data Collection . 97 A.3 Sensitivity and Specificity (PPV) of models to Annotated Secondary Structure of ncRNAs . 98 A.4 Generating random structures . 98 A.5 Short ncRNA Sequence Clustering . 99 A.6 Structural Entropy P-values of Short Non-coding RNAs Against Random Sequences with Structure . 100 A.7 P-value Stability Test for Dinucleotide Shuffling . 102 A.8 Structural Entropy Empirical P-values for single nucleotide composition and dinucleotide shuffling tests . 103 A.9 Structural Entropy and Total Base-pairing Entropy . 104 Appendix B Riboswitch Classification 116 B.1 Data and Classification Results . 116 B.2 Bacillus subtilis Classification Results . 120 B.3 Bacillus subtilis Genome-wide Scan Results . 120 B.4 Escherichia coli Genome-wide Scan Results . 130 B.5 Positive-Control-Set Sequence Segments . 136 vii List of Figures 1.1 Energy Landscape of The TPP Riboswitch . 4 2.1 Entropy vs. Length . 23 2.2 Structural Entropy p-Value Distribution . 24 2.3 Structural Entropy p-Value Average . 26 2.4 Base-pairing Entropy p-Value Average . 28 2.5 Structural Entropy p-Values of RNase P . 29 2.6 Structural Entropy p-Values of Riboswitches . 29 2.7 Structural Entropy p-Values of Structural Randomization . 31 2.8 Structural Entropy p-Values of miRNAs vs. Model Sensitivity . 32 2.9 Structural Entropy p-Values of miRNAs vs. Model Specificity . 33 2.10 Structural Entropy p-Values of Riboswitches vs. Model Sensitivity . 35 2.11 Structural Entropy p-Values of Riboswitches vs. Model Specificity . 36 3.1 Structural Entropy vs. GC-comp. and MFE . 55 3.2 Structural Entropy Genomic Distribution . 63 3.3 Riboswitch Identification Pipeline . 76 A.1 Structural Entropy p-Values Under Inaccurate Modeling . 101 A.2 P-Value Stability Test . 102 A.3 Structural and Base-pairing Entropy p-Values under Single-nucleotide Randomization . 103 A.4 Structural and Base-pairing Entropy p-Values under Di-nucleotide Randomization . 104 A.5 Structural Entropy p-Value vs. Length . 107 viii A.6 Structural Entropy p-Value vs. UA-comp. 108 A.7 Structural Entropy p-Value of RNase P vs. Model Sensitivity . 109 A.8 Structural Entropy p-Value of Low-GC Riboswitches vs. Model Sensitivity . 110 A.9 Structural Entropy p-Value of Low-GC Riboswitches vs. Model Specificity . 111 A.10 Structural Entropy p-Value of Ave.-GC Riboswitches vs. Model Sensitivity . 112 A.11 Structural Entropy p-Value of Ave.-GC Riboswitches vs. Model Specificity . 113 B.1 Classification ROC Curve . 117 B.2 Sense-antisense Differential Entropy of Shorter Riboswitches . 118 B.3 Sense-antisense Differential Entropy of Longer Riboswitches . 119 B.4 Structural Entropy vs. Uracil-comp. in B. subtilis ....................... 121 ix List of Tables 3.1 Data Collection . 58 3.2 Classification Performance . 59 3.3 Classification Performance Using Centroid Free Energy . 59 3.4 Logistic Regression Coefficients of Classifiers . 59 3.5 Classification Performance in B. subtilis ............................ 59 3.6 Classification Performance in B. subtilis Under Constant Length . 60 3.7 B. subtilis Riboswitches Ranking Under Actual-Length Test . 60 3.8 B. subtilis Riboswitches Ranking Under Constant-Length Test . 60 3.9 Top Entropy Hits in B. subtilis Filtered for GC-comp. and Uracil-comp. 61 3.10 Top Entropy Hits of E. coli Filtered for GC- and Uracil-comp. 62 3.11 Mutagenesis . 65 3.12 Mutagenesis Results . 67 3.13 Ribex Performance . 67 A.1 Downloaded sequences from Rfam 10.0 . 97 A.2 Structural Entropy p-Value vs. Accuracy for Bralibase Predictions . 98 A.3 Structural Entropy p-Value vs. Accuracy for Rfam Predictions . 98 A.4 Structural Entropy p-Values of miRNA and Riboswitches with Different GC-comp. Under All Models . 100 A.5 Structural Entropy p-Values Average Under All Models . 105 A.6 Structural Entropy p-Values Average Under a Random Model . 105 A.7 Kolmogorov-Smirnov Distance of p-Values Under All Models . 106 x A.8 Kolmogorov-Smirnov Distance of p-Values Under a Random Model . 106 A.9 Structural Entropy p-Value Correlation . 114 B.1 Genomic locations of collected sequences . 116 B.2 Classification Performance Using Cross Validation . 120 B.3 Short UTR Collection . 122 B.4 Riboswitch Statistics . 122 B.5 Classification Performance for Different Choices of Length . 123 B.6 Classification Performance for Different Choices of Length in B. subtilis ........... 124 B.7 Top Classification Hits in B. subtilis .............................. 125 B.8 Top Classification Hits in B. subtilis Uracil-comp. Constrained . 127 B.9 Top Entropy Hits in B. subtilis Forward Strand . 129 B.10 Top Entropy Hits in B. subtilis Reverse Strand . 131 B.11 Top Classification Hits in E. coli ................................ 132 B.12 Top Classification Hits in E. coli Uracil-comp. Constrained . ..