UC San Diego Electronic Theses and Dissertations

Title: Computational methods for genome-wide non-coding RNA discovery and analysis
Permalink: https://escholarship.org/uc/item/5qc2h8tf
Author: Zhang, Shaojie
Publication Date: 2007
Peer reviewed | Thesis/dissertation

eScholarship.org, powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational Methods for Genome-Wide Non-Coding RNA Discovery and Analysis

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science

by

Shaojie Zhang

Committee in charge:
Professor Vineet Bafna, Chair
Professor Sanjoy Dasgupta
Professor Pavel Pevzner
Professor Glenn Tesler
Professor Steven Wasserman

2007

Copyright Shaojie Zhang, 2007. All rights reserved.

The dissertation of Shaojie Zhang is approved, and it is acceptable in quality and form for publication on microfilm:

Chair

University of California, San Diego
2007

To my parents.

TABLE OF CONTENTS

Signature Page
Dedication
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita, Publications, and Fields of Study
Abstract

1 Introduction
  1.1 Non-coding RNAs
  1.2 RNA secondary structure
  1.3 The Challenge of ncRNA Discovery and Analysis
    1.3.1 RNA Homolog Search
    1.3.2 RNA Consensus Folding for ncRNA Discovery
  1.4 Dissertation Outline

2 FastR: Fast RNA Search Using Structure-based Filters
  2.1 Introduction
  2.2 Methods
    2.2.1 Structure-based Filters
    2.2.2 Structure-based Filter Design
    2.2.3 Optimal Structure-based Filter Design
    2.2.4 Structure-based Filtering Algorithms
    2.2.5 Computing RNA Sequence Structure Alignment
    2.2.6 P-value Computation
  2.3 Testing Results
    2.3.1 Filtering for ncRNA
    2.3.2 Alignment
    2.3.3 Search Riboswitches using FastR
  2.4 Summary

3 PFastR: Profile-based Fast RNA search using sequence-based filters
  3.1 Introduction
  3.2 Formalizing ncRNA Filters
  3.3 Sequence-based Filters
    3.3.1 Multiple Keyword (Chain) Filtering
    3.3.2 Accuracy of Chain Filters
    3.3.3 Implementing Chain Filters
  3.4 RNA-Profile Scoring and Alignment
    3.4.1 Choosing the Scoring Functions
    3.4.2 The Alignment Procedure
  3.5 Experimental Results
    3.5.1 Filter Efficiency and Accuracy
    3.5.2 Discovering Novel Riboswitches
    3.5.3 Mining Environmental Sequence Data
  3.6 PFastR Web Server
  3.7 Summary

4 RNAscf: Consensus folding of unaligned RNA sequences
  4.1 Introduction
  4.2 RNA Secondary Structure and Stack Configurations
    4.2.1 Predicting Putative Stacks
    4.2.2 Stack Configurations
  4.3 Stack-based Consensus Folding
    4.3.1 Computing Optimal Stack Configuration in Two RNA Sequences
    4.3.2 Consensus Fold Computation for Multiple RNA Sequences
    4.3.3 Implementation Details
  4.4 Testing Results
  4.5 Summary

5 Conclusions
  5.1 Summary of Contribution
  5.2 Future Work

Bibliography

LIST OF FIGURES

Figure 1.1 MicroRNA block protein formation
Figure 1.2 RNA secondary structure
Figure 2.1 Alignment of two tRNA sequences.
Figure 2.2 An RNA structure with various structural elements including stacked base-pairs, bulges, hairpin, and multi-loops.
Figure 2.3 A (k, w⃗, 4)-multiloop stack for tRNA with distance constraints.
Figure 2.4 Procedure to create a binary tree for s with structure S, having O(m) nodes such that each node has at most 2 children.
Figure 2.5 An algorithm for aligning a query RNA s of length m with a database string t of length n.
Figure 2.6 ROC plots for the alignments generated by RSEARCH and FastR.
Figure 2.7 Representative riboswitch secondary structures derived from the alignments of the top novel hits for each query.
Figure 3.1 A plot of log(eF) versus m, when L = 150, l = 8 and δ = 20. Different lines correspond to different values of sK.
Figure 3.2 An algorithm for aligning an RNA profile R with m columns against a database string t of length n.
Figure 3.3 ROC curves for selected families with accurate filter and alignment.
Figure 4.1 Two stack configurations match each other in both unpaired regions and paired regions.
Figure 4.2 Statistics of the stacks in Rfam database.
Figure 4.3 The procedure for computing anchor configuration.
Figure 4.4 The procedure RNAscf for computing consensus folds.
Figure 4.5 Sensitivity and accuracy of RNA secondary structure prediction on 12 RNA families.
Figure 4.6 Improved sensitivity and accuracy of RNAscf as the number of input sequences grows for the thiamine family.
Figure 4.7 A comparison of predicted stack configurations by different programs.
Figure 5.1 RNAz classifies alignments using a support vector machine
Figure 5.2 Evofold scores on the alignments
Figure 5.3 Shifted stacks on a multispecies alignment

LIST OF TABLES

Table 2.1 Expected number of hits in a random string in a (k, w)-filter.
Table 2.2 The results of applying nested and multiloop filters to random databases that contain true positives.
Table 2.3 Comparison of FastR and RSEARCH.
Table 2.4 Summary of the FastR riboswitch search.
Table 2.5 Description of the 18 most promising candidates discovered by FastR.
Table 3.1 Riboswitch sub-families in Rfam database
Table 3.2 Filtering performance of chain filters (CF), HMM filters (HMM), and composite filters (CF·HMM) on synthetic sequences.
Table 3.3 Comparison of RNA profile alignment (PAln) and CMsearch (CM) on synthetic sequences.
Table 3.4 Filtering performance of chain filters (CF), HMM filters (HMM), and composite filters (CF·HMM) on two real genomes.
Table 3.5 Summary of searching riboswitches against the whole bacterial and archaeal genomes.
Table 3.6 Summary of searching riboswitch elements against GOS data.
Table 3.7 Summary of predicted functions of the confident ORFs downstream of riboswitch predictions.
Table 3.8 Statistics for accurate option and efficient option.
Table 4.1 Effect of parameters k, w and s on the probability of predicting conserved stacks at random.
Table 4.2 A complete list of the comparison of sensitivity and accuracy of RNA secondary structure prediction on 12 RNA families shown in Figure 4.5.

ACKNOWLEDGEMENTS

I am very grateful to my advisor, Dr. Vineet Bafna, for his guidance and support throughout my Ph.D. studies. I feel fortunate to work with him. The work presented in this dissertation has benefited most from his advice.

I also would like to thank Dr. Pavel Pevzner for kindly supporting me during my first two years and for his guidance throughout my Ph.D. studies.

I would like to thank Dr. Haixu Tang, Dr. Roded Sharan, and Dr. Eleazar Eskin for all the successful collaborations.

I wish to thank Dr. Vineet Bafna, Dr. Sanjoy Dasgupta, Dr. Pavel Pevzner, Dr. Glenn Tesler, and Dr. Steven Wasserman for taking the time and patience to review my dissertation and serve on my defense committee.

I would like to thank all CSE bioinformatics lab members. All of them have made my Ph.D. study a very precious and unique experience. The science presented in this dissertation greatly benefited from interactions with Max Alekseyev, Nuno Bandeira, Vikas Bansal, Ali Bashir, Fjola Bjornsdottr, Mark Chaisson, Banu Dost, Ari Frank, Neil Jones, Julio Ng, Qian Peng, Alkes Price, Ben Raphael, Stephen Tanner, Jeffrey Wang and Degui Zhi.

My dissertation work was supported by a grant from the National Science Foundation (NSF-DBI:0516440). I extensively used computers through the UCSD FWGrid Project (NSF Research Infrastructure Grant Number EIA-0303622).

Finally, I am deeply indebted to my family for their everlasting support and love.
Chapter 2, in part, is a reprint of the paper "Searching Genomes for non-coding RNA using FastR", co-authored with Brian Haas, Eleazar Eskin and Vineet Bafna in IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 2, Issue 4, pp. 366–379, 2005. The dissertation author was the primary investigator and author of this paper.

Chapter 3, in part, is a reprint of the paper "A sequence-based filtering method for ncRNA identification and its application to searching for riboswitch elements", co-authored with Ilya Borovok, Yair Aharonowitz, Roded Sharan, and Vineet Bafna in Bioinformatics (ISMB 2006), Vol. 22, pp. e557–e565, 2006. The dissertation author was the primary investigator and author of this paper.

Chapter 4, in part, is a reprint of the paper "Consensus folding of unaligned RNA sequences revisited", co-authored with Vineet Bafna and Haixu Tang in Journal of Computational Biology, Vol. 13, Issue 2, pp. 283–295, 2006. The dissertation author was the primary investigator and author of this paper.

VITA

1997 B.S. in Computer Science, Peking University, Beijing, P.R. China
2001 M.Eng. in Information Engineering, Nanyang Technological University, Singapore
2001–2007 Graduate Research Assistant, University of California, San Diego
2005 C.Phil., University of California, San Diego
2007 Ph.D. in Computer Science, University of California, San Diego

PUBLICATIONS

Jeffrey C Wang, Roded Sharan, Vineet Bafna, and Shaojie Zhang, "PFastR: a web-based fast RNA family identification tool", in preparation, 2007.
