UC San Diego Electronic Theses and Dissertations

Title Computational methods for genome-wide non-coding RNA discovery and analysis

Permalink https://escholarship.org/uc/item/5qc2h8tf

Author Zhang, Shaojie

Publication Date 2007

Peer reviewed|Thesis/dissertation

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational Methods for Genome-Wide Non-Coding RNA Discovery and Analysis

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Computer Science

by

Shaojie Zhang

Committee in charge:

Professor Vineet Bafna, Chair
Professor Sanjoy Dasgupta
Professor Pavel Pevzner
Professor Glenn Tesler
Professor Steven Wasserman

2007

Copyright
Shaojie Zhang, 2007
All rights reserved.

The dissertation of Shaojie Zhang is approved, and it is acceptable in quality and form for publication on microfilm:

Chair

University of California, San Diego

2007

To my parents.

TABLE OF CONTENTS

Signature Page
Dedication
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita, Publications, and Fields of Study
Abstract

1 Introduction
  1.1 Non-coding RNAs
  1.2 RNA secondary structure
  1.3 The Challenge of ncRNA Discovery and Analysis
    1.3.1 RNA Homolog Search
    1.3.2 RNA Consensus Folding for ncRNA Discovery
  1.4 Dissertation Outline

2 FastR: Fast RNA Search Using Structure-based Filters
  2.1 Introduction
  2.2 Methods
    2.2.1 Structure-based Filters
    2.2.2 Structure-based Filter Design
    2.2.3 Optimal Structure-based Filter Design
    2.2.4 Structure-based Filtering Algorithms
    2.2.5 Computing RNA Sequence Structure Alignment
    2.2.6 P-value Computation
  2.3 Testing Results
    2.3.1 Filtering for ncRNA
    2.3.2 Alignment
    2.3.3 Search Riboswitches using FastR
  2.4 Summary

3 PFastR: Profile-based Fast RNA search using sequence-based filters
  3.1 Introduction
  3.2 Formalizing ncRNA Filters
  3.3 Sequence-based Filters
    3.3.1 Multiple Keyword (Chain) Filtering
    3.3.2 Accuracy of Chain Filters
    3.3.3 Implementing Chain Filters
  3.4 RNA-Profile Scoring and Alignment
    3.4.1 Choosing the Scoring Functions
    3.4.2 The Alignment Procedure
  3.5 Experimental Results
    3.5.1 Filter Efficiency and Accuracy
    3.5.2 Discovering Novel Riboswitches
    3.5.3 Mining Environmental Sequence Data
  3.6 PFastR Web Server
  3.7 Summary

4 RNAscf: Consensus folding of unaligned RNA sequences
  4.1 Introduction
  4.2 RNA Secondary Structure and Stack Configurations
    4.2.1 Predicting Putative Stacks
    4.2.2 Stack Configurations
  4.3 Stack-based Consensus Folding
    4.3.1 Computing Optimal Stack Configuration in Two RNA Sequences
    4.3.2 Consensus Fold Computation for Multiple RNA Sequences
    4.3.3 Implementation Details
  4.4 Testing Results
  4.5 Summary

5 Conclusions
  5.1 Summary of Contribution
  5.2 Future Work

Bibliography

LIST OF FIGURES

Figure 1.1 MicroRNA block protein formation
Figure 1.2 RNA secondary structure

Figure 2.1 Alignment of two tRNA sequences.
Figure 2.2 An RNA structure with various structural elements including stacked base-pairs, bulges, hairpin, and multi-loops.
Figure 2.3 A (k, ~w, 4)-multiloop stack for tRNA with distance constraints.
Figure 2.4 Procedure to create a binary tree for s with structure S, having O(m) nodes such that each node has at most 2 children.
Figure 2.5 An algorithm for aligning a query RNA s of length m with a database string t of length n.
Figure 2.6 ROC plots for the alignments generated by RSEARCH and FastR.
Figure 2.7 Representative riboswitch secondary structures derived from the alignments of the top novel hits for each query.

Figure 3.1 A plot of log(eF) versus m, when L = 150, l = 8 and δ = 20. Different lines correspond to different values of sK.
Figure 3.2 An algorithm for aligning an RNA profile R with m columns against a database string t of length n.
Figure 3.3 ROC curves for selected families with accurate filter and alignment.

Figure 4.1 Two stack configurations match to each other for both unpaired regions and paired regions.
Figure 4.2 Statistics of the stacks in Rfam database.
Figure 4.3 The procedure for computing anchor configuration.
Figure 4.4 The procedure RNAscf for computing consensus folds.
Figure 4.5 Sensitivity and accuracy of RNA secondary structure prediction on 12 RNA families.
Figure 4.6 Improved sensitivity and accuracy of RNAscf as the number of input sequences grows for the thiamine family.
Figure 4.7 A comparison of predicted stack configurations by different programs.

Figure 5.1 RNAz classifies alignments using a support vector machine
Figure 5.2 Evofold scores on the alignments
Figure 5.3 Shifted stacks on a multispecies alignment

LIST OF TABLES

Table 2.1 Expected number of hits in a random string in a (k, w)-filter.
Table 2.2 The results of applying nested and multiloop filters to random databases that contain true positives.
Table 2.3 Comparison of FastR and RSEARCH.
Table 2.4 Summary of the FastR riboswitch search.
Table 2.5 Description of the 18 most promising candidates discovered by FastR.

Table 3.1 Riboswitch sub-families in Rfam database
Table 3.2 Filtering performance of chain filters (CF), HMM filters (HMM), and composite filters (CF·HMM) on synthetic sequences.
Table 3.3 Comparison of RNA profile alignment (PAln) and CMsearch (CM) on synthetic sequences.
Table 3.4 Filtering performance of chain filters (CF), HMM filters (HMM), and composite filters (CF·HMM) on two real genomes.
Table 3.5 Summary of searching riboswitches against the whole bacterial and archaeal genomes.
Table 3.6 Summary of searching riboswitch elements against GOS data.
Table 3.7 Summary of predicted functions of the confident ORFs downstream of riboswitch predictions.
Table 3.8 Statistics for accurate option and efficient option.

Table 4.1 Effect of parameters k, w and s on the probability of predicting conserved stacks at random.
Table 4.2 A complete list of the comparison of sensitivity and accuracy of RNA secondary structure prediction on 12 RNA families shown in Figure 4.5.

ACKNOWLEDGEMENTS

I am very grateful to my advisor, Dr. Vineet Bafna, for his guidance and support throughout my Ph.D. studies. I feel fortunate to work with him. The work presented in this dissertation benefited the most from his advice. I would also like to thank Dr. Pavel Pevzner for kindly supporting me during my first two years and for his guidance throughout my Ph.D. studies.

I would like to thank Dr. Haixu Tang, Dr. Roded Sharan, Dr. for all the successful collaborations.

I wish to thank Dr. Vineet Bafna, Dr. Sanjoy Dasgupta, Dr. Pavel Pevzner, Dr. Glenn Tesler, and Dr. Steven Wasserman for taking the time and patience to review my dissertation and serve on my defense committee.

I would like to thank all CSE lab members. All of them have made my Ph.D. study a very precious and unique experience. The science presented in this dissertation greatly benefited from interactions with Max Alekseyev, Nuno Bandeira, Vikas Bansal, Ali Bashir, Fjola Bjornsdottr, Mark Chaisson, Banu Dost, Ari Frank, Neil Jones, Julio Ng, Qian Peng, Alkes Price, Ben Raphael, Stephen Tanner, Jeffrey Wang and Degui Zhi.

My dissertation work was supported by a grant from the National Science Foundation (NSF-DBI:0516440). I extensively used computers through the UCSD FWGrid Project (NSF Research Infrastructure Grant Number EIA-0303622).

Finally, I am deeply indebted to my family for their everlasting support and love.

Chapter 2, in part, is a reprint of the paper “Searching Genomes for non-coding RNA using FastR”, co-authored with Brian Haas, Eleazar Eskin and Vineet Bafna in IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 2, Issue 4, pp. 366–379, 2005. The dissertation author was the primary investigator and author of this paper.

Chapter 3, in part, is a reprint of the paper “A sequence-based filtering method for ncRNA identification and its application to searching for riboswitch elements”, co-authored with Ilya Borovok, Yair Aharonowitz, Roded Sharan, and Vineet Bafna in Bioinformatics (ISMB 2006), Vol. 22, pp. e557–e565, 2006. The dissertation author was the primary investigator and author of this paper.

Chapter 4, in part, is a reprint of the paper “Consensus folding of unaligned RNA sequences revisited”, co-authored with Vineet Bafna and Haixu Tang in Journal of Computational Biology, Vol. 13, Issue 2, pp. 283–295, 2006. The dissertation author was the primary investigator and author of this paper.

VITA

1997 B.S. in Computer Science, Peking University, Beijing, P.R. China
2001 M.Eng. in Information Engineering, Nanyang Technological University, Singapore
2001–2007 Graduate Research Assistant, University of California, San Diego
2005 C.Phil., University of California, San Diego
2007 Ph.D. in Computer Science, University of California, San Diego

PUBLICATIONS

Jeffrey C Wang, Roded Sharan, Vineet Bafna, and Shaojie Zhang, “PFastR: a web-based fast RNA family identification tool”, in preparation, 2007.

Ilya Borovok, Shaojie Zhang, Roded Sharan, Vineet Bafna, Yair Aharonowitz, “Riboswitches in Streptomyces genomes: novel mechanisms controlling essential functions”, in preparation, 2007.

Shaojie Zhang, Ilya Borovok, Yair Aharonowitz, Roded Sharan, Vineet Bafna, “A sequence-based filtering method for ncRNA identification and its application to searching for riboswitch elements”, Bioinformatics (ISMB 2006), 22:e557-e565, 2006.

Banu Dost, Buhm Han, Shaojie Zhang and Vineet Bafna, “Structural alignment of pseudoknotted RNA”, In RECOMB 2006: Proceedings of the eighth annual international conference on Research in computational molecular biology, 143-158, 2006.

Vineet Bafna, Haixu Tang and Shaojie Zhang, “Consensus folding of unaligned RNA sequences revisited”, Journal of Computational Biology, 13(2), 283-295, 2006.

Vineet Bafna, Haixu Tang and Shaojie Zhang, “Consensus folding of unaligned RNA sequences revisited”, In RECOMB 2005: Proceedings of the seventh annual international conference on Research in computational molecular biology, 172-187, 2005.

Shaojie Zhang, Brian Haas, Eleazar Eskin and Vineet Bafna, “Searching genomes for non-coding RNA using FastR”, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(4), 366-379, 2005.

Vineet Bafna and Shaojie Zhang, “FastR: Fast database search tool for non-coding RNA”, In Proceedings of IEEE Computational Systems Bioinformatics (CSB) Conference, 52-61, 2004.

FIELDS OF STUDY

Major Field: Computer Science
Studies in Bioinformatics
Professors Vineet Bafna and Pavel Pevzner

ABSTRACT OF THE DISSERTATION

Computational Methods for Genome-Wide Non-Coding RNA Discovery and Analysis

by

Shaojie Zhang

Doctor of Philosophy in Computer Science

University of California, San Diego, 2007

Professor Vineet Bafna, Chair

The discovery of novel non-coding RNAs has been among the most exciting recent developments in biology, yet many more remain undiscovered. It has been hypothesized that there is in fact an abundance of functional non-coding RNAs (ncRNAs) with various catalytic and regulatory functions. Computational methods tailored specifically for ncRNA discovery are being actively developed. As the inherent signal for ncRNA is weaker than that for protein coding genes, comparative methods offer the most promising approach.

In this dissertation, we address several open problems in computational methods for genome-wide non-coding RNA discovery and analysis:

(1) We first consider the following problem: given an RNA sequence with a known secondary structure, efficiently detect all structural homologs in a genomic database by computing the sequence and structure similarity to the query. Our approach is based on structural filters that eliminate a large portion of the database while retaining the true homologs, and it allows us to search a typical bacterial genome in minutes on a standard PC. This result is two orders of magnitude faster than currently available software for the problem.

(2) We formalize the concept of a filter and provide figures of merit that allow comparison between filters. We design efficient sequence-based filters that dominate the current state-of-the-art HMM filters. We provide a new formulation of the covariance model that allows speeding up RNA alignment. We demonstrate the power of our approach on both synthetic data and real bacterial genomes. We then apply our algorithm to the detection of novel riboswitch elements in whole bacterial and archaeal genomes and in environmental sequence data. Our results point to a number of novel riboswitch candidates, and include genomes that were not previously known to contain riboswitches.

(3) We propose a novel framework to predict the common secondary structure for unaligned RNA sequences. By matching putative stacks in RNA sequences, we make use of both primary sequence information and thermodynamic stability for prediction at the same time. We show that our method can predict the correct common RNA secondary structures even when we are given only a limited number of unaligned RNA sequences, and that it outperforms current algorithms in sensitivity and accuracy.

Together, these contributions advance genome-wide ncRNA discovery for exploring the modern RNA world.

1 Introduction

“Structural genes encode proteins, and regulatory genes produce ncRNA” - Jacob F. & Monod J., 1961.

1.1 Non-coding RNAs

A surprising conclusion of the human genome research was the relatively low number of protein coding genes [55,105]. Recent estimates put that number at around 20,000 to 25,000 [44]. By comparison, even the worm C. elegans has around 19,500 genes. On the other hand, transcriptional activity has been detected on a much larger portion of the human genome [13, 48]. Many explanations have been given, but the one that most intrigues us is the notion that these studies were looking primarily for protein coding genes, and may have missed another class of genes: non-coding RNA (ncRNA) genes, which are transcribed into functional RNAs but not translated into proteins.

With recent discoveries of many novel non-coding RNA families [4,95], RNA is rapidly regaining importance as a molecule of interest [27, 96]. It has been hypothesized that there is in fact an abundance of undiscovered, functional non-coding RNAs with various catalytic and regulatory functions (the Modern RNA World [27]).

Novel families of functional RNAs or RNA elements are continually being discovered, in addition to existing textbook examples such as transfer RNA (tRNA), ribosomal RNA (rRNA), and spliceosomal RNA, which are involved in translating mRNAs into proteins. All these ncRNAs, with their various catalytic and regulatory functions, play very important roles in the cell. Through a base-pairing mechanism, ncRNAs are particularly suited to regulating specific genes.

The discovery of endogenous small interfering RNA (RNAi) genes, including siRNA and miRNA, has generated much excitement in this field (see, for example, [70]). In 2006, the Nobel Prize in Physiology or Medicine was awarded to Andrew Fire and Craig Mello for their discovery of “RNA interference - gene silencing by double-stranded RNA” [69]. Figure 1.1 shows that the precursor of microRNA forms the hairpin structure of double-stranded RNA. The mature microRNAs activate the RNA interference machinery by binding with target mRNA regions through base-pairing complementarity. In addition to being an important player in gene regulation, RNAi has also been implicated in chromatin remodeling and other epigenetic phenomena [14, 121].

Figure 1.1 MicroRNA block protein formation [69].

Also consider the recent discovery of riboswitches [6, 108, 116]. Riboswitches are RNA motifs that are present in the upstream UTR regions of many genes that synthesize essential metabolites (vitamins, proteins, nucleic acids). The riboswitches regulate metabolite synthesis by directly binding metabolites and undergoing a change in conformation to inhibit transcription of the downstream gene. Thus, biochemical activity that was once thought to be the exclusive domain of proteins has now found another, equally important, player in ncRNA. As with any other class of novel genes, these ncRNAs are also being considered as therapeutic and antibacterial targets, and as genetic tools [9]. For example, 2% of the genomic content of B. subtilis (69 genes) is under the control of riboswitches [116]. Likewise, RNAi has become a key tool for functional genomics [70].

The following is a list of ncRNA categories with catalytic or regulatory functions:

• Transfer RNA (tRNA), ribosomal RNA (rRNA), and spliceosomal RNA: These ncRNAs serve as catalytic or structural parts of RNA-protein machines [63].

• Ribozyme: Ribozymes are RNA molecules with catalytic activity. They can be separated into three groups: self-splicing introns, RNase P RNAs, and small catalytic RNAs [79].

• Micro RNA (miRNA): Micro RNAs play important regulatory roles by targeting mRNAs for cleavage or translational repression [70].

• Small Interfering RNA (siRNA): siRNAs help to inhibit certain genes, a process called RNA interference or RNA silencing [17,29,38].

• Riboswitch element: Riboswitches are genetic control elements that are located in the 5’-untranslated region (5’-UTR) of mRNA [116]. Riboswitch elements regulate downstream gene translation by binding with small metabolites.

• Small Nucleolar RNA (snoRNA): snoRNAs direct the modification of rRNAs [49]. Most vertebrate snoRNAs are processed from introns of pre-mRNAs, which was the first hint that introns could code functional ncRNAs.

• Translational Control Element (TCE): Bergsten and Gavis [7] found that the translation of unlocalized nanos mRNA is repressed by a regulatory element (with conserved secondary structure [24]) in the 3’ UTR region.

• tmRNA: tmRNA is named for having both tRNA and mRNA properties [126], and it directs the tagging of abnormal proteins for degradation.

Despite the growing importance of ncRNA, most, though not all, of the recent ncRNA discoveries have come from experimental screening. While that used to be the case with protein coding genes, the situation there is now the opposite: most novel genes are validations of hits from computational screens of genomic sequences. It is fair to say that computational methods for discovering non-coding genes are not yet mature, and development of such methods will fulfill a currently unmet need. In this dissertation, we describe novel algorithms for ncRNA discovery and analysis.

As with proteins, the structure of an RNA is often more important for its function than its sequence. RNAs with similar functions often have similar secondary structures but distinct primary sequences. Therefore, understanding the structures of these RNA molecules will help elucidate their functions. Before addressing these computational challenges, let us look at the properties of RNA secondary structure.

1.2 RNA secondary structure

Recall that RNA (ribonucleic acid) is a single-stranded molecule composed of four different nucleotides (bases): adenine, cytosine, guanine, and uracil. In DNA, thymine replaces uracil. RNA can be considered as a sequence over the 4-letter alphabet {A, C, G, U}.

Most functional RNAs tend to fold into a particular base-paired secondary structure after they are transcribed from DNA sequences. The complementary nucleotides A-U and G-C form stable base pairs with each other through hydrogen bonds between two sites on the bases. These are called Watson-Crick base pairs, since they are analogous to the A-T and G-C bonds that hold the two strands of a DNA molecule together. In addition, G-U wobble base pairs may be found in RNA secondary structure. As in DNA, the G-C pair contributes the greatest energetic stability to the structure, with the A-U pair contributing less stability, and the G-U pair contributing the least.

Figure 1.2 RNA secondary structure. [Figure not reproduced; its labels mark a multi-loop, a stack, a hairpin loop, a bulge, and a pseudoknot.]

When an RNA molecule folds into its secondary structure, it tends to form energetically favorable nested base pairs; such a run of stacked pairs is called a stack. As shown in Fig. 1.2, the secondary structure has a tree-like shape and can be decomposed into various non-base-paired loop regions (interior loops, bulges, multi-loops) and base-paired stack regions. Each stem in this tree contains energetically favorable stacked base pairs. Usually these nested base pairs do not cross each other, which means that for any two base pairs (i, j) and (i', j'), where i < j, i' < j', and i < i', either i < i' < j' < j (nested) or i < j < i' < j' (disjoint).

An RNA pseudoknot is a set of base pairs that violates the non-crossing convention, also shown in Fig. 1.2. RNA pseudoknots are functionally important in several known RNAs. For example, comparative analysis shows that RNA pseudoknots are conserved in ribosomes [81], self-splicing introns [1], and telomerase RNAs [101]. Because pseudoknots make the problem much more complicated, in this dissertation we assume there are no pseudoknots in a given RNA sequence unless otherwise noted.

Most functional RNAs (ncRNAs) appear to be selected more for maintenance of a particular base-paired structure than for conservation of primary sequence. It is relatively common to find examples of homologous RNAs that have a common secondary structure without sharing significant sequence similarity [35]. Drastic changes in sequence may often be tolerated as long as compensatory mutations maintain base-paired complementarity: for example, a G-C pair changing into an A-U pair through a double substitution, or a G-C pair changing into a G-U pair through a single substitution.
RNA sequence analysis therefore should consider secondary structure conservation in addition to sequence conservation.
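The non-crossing condition and the notion of a compensatory mutation can be made concrete with a short sketch. The following Python fragment is an illustration only, not code from this dissertation; the function names `is_non_crossing` and `classify_change` and the labels they return are assumptions introduced here. Base pairs are given as 0-based index pairs.

```python
from typing import List, Tuple

# Watson-Crick pairs plus the G-U wobble pair, as described in the text.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def is_non_crossing(pairs: List[Tuple[int, int]]) -> bool:
    """Return True if every two base pairs are either nested or disjoint."""
    ordered = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for a in range(len(ordered)):
        i, j = ordered[a]
        for b in range(a + 1, len(ordered)):
            i2, j2 = ordered[b]          # sorting guarantees i <= i2
            nested = i < i2 < j2 < j     # (i2, j2) lies inside (i, j)
            disjoint = i < j < i2 < j2   # (i2, j2) lies after (i, j)
            if not (nested or disjoint):
                return False             # crossing pair, i.e. a pseudoknot
    return True


def classify_change(old_pair: Tuple[str, str], new_pair: Tuple[str, str]) -> str:
    """Label a change at one paired position (hypothetical labels)."""
    if new_pair not in CANONICAL:
        return "disruptive"              # pairing is lost
    changed = sum(o != n for o, n in zip(old_pair, new_pair))
    return {0: "unchanged", 1: "single", 2: "double"}[changed]
```

For instance, `classify_change(("G", "C"), ("A", "U"))` reports a double substitution that keeps the pair intact, while `is_non_crossing([(0, 5), (3, 8)])` is false because the two pairs cross.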

1.3 The Challenge of ncRNA Discovery and Analysis

Some of the early efforts on ncRNA discovery were attempts at de novo prediction, looking for signals that might detect functional RNA in a genomic database. Since ncRNA stabilizes by forming hydrogen bonds among complementary nucleotides, the early methods used low energy secondary structure formation as a signal to discover RNA [15, 57]. These efforts were only partially successful. Recent reports [83, 118] have concluded that the signal is not strong enough to distinguish RNA from other sequence. In fact, random sequence with high GC composition, or with a di-nucleotide composition similar to a true RNA sequence, often folds into energetically favorable secondary structures. Recent de novo approaches, which include looking for the transcription start and similar signals, have had success mainly when combined with a comparative approach [4,61]. The consensus appears to be that the ncRNA signals in a genome are not as strong as the signals for protein coding genes.

Consequently, we focus on a comparative approach to finding ncRNA genes and predicting their secondary structures. One way to identify ncRNAs is to find subsequences in the genomic database that are similar in structure and sequence to a query, which is called RNA homolog search.

1.3.1 RNA Homolog Search

Sequence similarity searching tools for genomic sequences and protein sequences, such as BLAST [3] and Fasta [74], are very well developed. However, they cannot be used to find ncRNA homologs with conserved secondary structure and relatively low sequence identity. Similarity-based methods for RNA structural homolog search have been published earlier [61,64]. This idea has also been extended to searching for homologs of ncRNA families by querying with a statistical representation of a multiple alignment of the family. Examples include CMSEARCH [35] using the covariance model (CM) [26] and ERPIN using secondary structure profiles [31,54]. Recently, Klein and Eddy developed a tool, RSEARCH [50], for searching a database with a single ncRNA query. This method depends upon existing algorithms for computing alignments between an RNA sequence and substrings of a database, where the alignment score is a function of sequence and structural similarity. Known algorithms for computing such alignments are computationally intensive, with running time approximately O(m w^2 n), where m is the length of the query sequence, n is the length of the database sequence, and w is the maximum length of a database substring that is aligned to the query. For a test run on a 2.8 GHz Intel/Linux PC with 1 GB of memory, a microbial database of size 1.67 Mb, and a 5S rRNA query sequence, RSEARCH took over 6.5 hours to run. This makes it intractable for a large genome database.

These RNA homolog search tools (both CMSEARCH and RSEARCH) are too computationally expensive to use to search a genomic database for novel ncRNA genes. The challenge is how to speed up RNA homolog search. One way to speed up the search is to use a much faster scanning procedure to remove most of the unrelated regions (we call it a “filter”). BLAST succeeds in this task by an efficient keyword match filter. The challenge for RNA homolog search is whether we can design a fast and efficient filter (using structure information or sequence information) for ncRNA search.
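To make the idea of a filter concrete, the following is a minimal sketch of a BLAST-style keyword filter. It is an illustration under stated assumptions, not the structure-based or chain filters developed in later chapters: the function names `kmers` and `keyword_filter`, and the parameters `k` (keyword length) and `flank` (context kept around each hit), are hypothetical. The filter keeps only database intervals that contain an exact keyword from the query, so the expensive alignment step needs to run on those intervals alone.

```python
from typing import List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """All length-k substrings (keywords) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def keyword_filter(query: str, database: str, k: int, flank: int) -> List[Tuple[int, int]]:
    """Return merged half-open [start, end) database intervals around keyword hits."""
    words = kmers(query, k)
    intervals: List[Tuple[int, int]] = []
    for pos in range(len(database) - k + 1):
        if database[pos:pos + k] in words:
            start = max(0, pos - flank)
            end = min(len(database), pos + k + flank)
            if intervals and start <= intervals[-1][1]:
                # Overlaps or touches the previous interval: extend it.
                intervals[-1] = (intervals[-1][0], max(intervals[-1][1], end))
            else:
                intervals.append((start, end))
    return intervals


if __name__ == "__main__":
    db = "A" * 20 + "GGGAUCC" + "A" * 20
    kept = keyword_filter("GGGAUCC", db, k=5, flank=3)
    retained = sum(end - start for start, end in kept) / len(db)
    print(kept, f"fraction of database retained: {retained:.2f}")
```

The scan is linear in the database length, which is why a filter of this kind can be much cheaper than the O(m w^2 n) alignment it precedes. An exact-keyword filter relies on sequence conservation only, which is the limitation for ncRNA noted above.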

1.3.2 RNA Consensus Folding for ncRNA Discovery

Another problem related to ncRNA discovery is RNA secondary structure prediction, which has been extensively studied since the 1970s. There are two distinct approaches to predicting RNA secondary structure. The RNA folding approach, initiated by [103], assigns free energies to the components of RNA secondary structure, and then computes the RNA secondary structure with the minimum energy. Dynamic programming algorithms have been developed to compute minimum energy secondary structures [71, 72, 93, 112, 124], and implemented in software packages such as MFOLD [41, 43] and ViennaRNA [122–125]. However, RNA folding via energy minimization has its shortcomings. First, fold prediction depends critically upon correct values of the energy parameters, as shown by [45], which are hard to obtain experimentally. Also, RNA folding in a real cell is mediated by interactions with other molecules, and the absence of knowledge of these interactions may cause mis-folding in silico.

A different approach attempts to resolve these shortcomings by using evolutionary conservation of structure as the basis for structure prediction. Most RNA sequences are selected more for maintenance of the structure than for conservation of primary sequence. The presence of compensatory mutations that preserve structure even as primary sequence diverges is a signal for base-pairing structure. However, aligning multiple and divergent RNA sequences so as to preserve their conserved structures is not easy, because many compensatory mutations decrease the overall sequence similarity. For unaligned sequences, one must compute the structure and alignment simultaneously. Sankoff proposed an algorithm that can simultaneously align RNA sequences and find the optimal common fold [34,66,91]. However, the complexity of this algorithm is O(l^6), where l is the length of the RNA sequences, which is too high to be practical even for two sequences.
This motivates another challenge: how to predict the common secondary structure for multiple unaligned RNA sequences.

Recently, there has been a lot of work on aligning the genomes of related species (such as human and other vertebrate genomes [8,53,92], and 12 Drosophila genomes and other insect genomes [92]). These multiple genome-wide alignment regions can be used to screen for conserved structured ncRNAs by checking the potential consensus secondary structures and identifying compensatory mutations (RNAz [110], EvoFold [75]). Despite the divergent comparative sequence data in the alignment regions, both RNAz and EvoFold still show very high false positive rates (50%-70%) [111]. This also shows the challenge of ncRNA discovery.
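As a concrete illustration of the dynamic-programming folding approach described at the start of this section, the sketch below computes the maximum number of nested base pairs in a single sequence. It is a deliberately simplified stand-in: it maximizes the count of base pairs (a Nussinov-style recurrence) rather than minimizing free energy, so it is not the thermodynamic model used by MFOLD or ViennaRNA. The names `can_pair` and `max_base_pairs` and the `min_loop` parameter (minimum number of unpaired bases in a hairpin loop) are assumptions made for this example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(x: str, y: str) -> bool:
    """Watson-Crick or G-U wobble pairing."""
    return (x, y) in PAIRS


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of non-crossing base pairs in seq (O(n^3) time)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j], inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is unpaired.
            score = best[i][j - 1]
            # Case 2: j pairs with k, leaving at least min_loop bases between.
            for k in range(i, j - min_loop):
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1]
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    # A three-pair stem closed by a four-base hairpin loop.
    print(max_base_pairs("GGGAAAACCC"))
```

The cubic cost for one sequence gives some intuition for why folding and aligning two sequences simultaneously, as in Sankoff's algorithm, becomes so expensive.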

1.4 Dissertation Outline

We dedicate this dissertation to addressing the aforementioned challenges: how to speed up ncRNA homolog search and how to solve the RNA consensus secondary structure prediction problem.

To speed up ncRNA homolog search, we have constructed two prototype tools: FastR (using structure-based filters) to speed up single ncRNA sequence homolog search, and PFastR (using sequence-based filters) to speed up ncRNA family homolog search. We have also applied FastR and PFastR to the discovery of novel ncRNAs and detected novel riboswitch elements in bacterial genomes and environmental sequence data. Our results point to a number of novel riboswitch candidates, and include genomes that were not previously known to contain riboswitches.

The PFastR web server is built on the sequence-based filtering idea. We have trained more than 570 families from the Rfam database to determine the optimal parameters for the filtering procedure. The PFastR server provides the first web-based fast RNA homolog search.

For RNA consensus folding, we proposed a novel framework to predict the common secondary structure for unaligned RNA sequences and constructed a prototype tool: RNAscf (RNA stack-based consensus folding). By matching putative stems in RNA sequences, we make use of both primary sequence information and thermodynamic stability for prediction at the same time. We show that our method can predict correct common RNA secondary structures even when we are given only a limited number of unaligned RNA sequences, and that it outperforms current algorithms in both sensitivity and accuracy.

The remaining part of the dissertation is organized as follows:

• Chapter 2 describes how to incorporate structure-based filters to speed up the search for homologs in the genomes.

• Chapter 3 formalizes the concept of a filter and presents efficient sequence-based filters that dominate the current state-of-the-art HMM filters.

• Chapter 4 presents our new approach for RNA consensus folding.

• Chapter 5 concludes the whole dissertation with a brief summary of contribution and discusses possible directions of future work.

In the area of computational RNA research, we also did some work on a pseudoknotted RNA alignment algorithm, which is not described in this thesis [22]. Pseudoknots are base pairs in RNA secondary structure that violate the non-crossing rule. While not as common as other sub-structures, they are often critically important to the function of ncRNAs. However, understanding the extent and importance of these ncRNAs is partially handicapped by the difficulty of identifying them computationally. We presented algorithms that align a pseudoknotted RNA to a target subsequence by computing a local structural alignment. By scanning a genome, we can identify novel (homologous) pseudoknotted ncRNAs and infer the secondary structure of the aligned target sequence. Many definitions of pseudoknots have been postulated [2,21,82], and recent research investigates the power of these definitions in describing real pseudoknots [16]. We started with Akutsu’s formalism (simple pseudoknots) [2], which has a clean recursive structure and encompasses a majority of the known cases [16, 80]. We also proposed algorithms that extend this class of allowed pseudoknots (standard pseudoknots), which included a chaining procedure to speed up the alignment algorithm.

2 FastR: Fast RNA Search Using Structure-based Filters

2.1 Introduction

In this chapter, we consider the following problem: Given an RNA sequence with a known secondary structure, efficiently detect all structural homologs in a genomic database by computing the sequence and structure similarity to the query. Our approach, FastR, is based on structural filters that eliminate a large portion of the database while retaining the true homologs, which allows us to search a typical bacterial genome in minutes on a standard PC. FastR is about two orders of magnitude faster than currently available software for the problem. We applied FastR to the discovery of novel riboswitches, which are a class of RNA domains found in the untranslated regions.

Various computational approaches to detecting non-coding genes are under investigation. Some of these are attempts at de novo prediction, looking for signals that might suggest a functional RNA in the molecule. The most promising approach seemed to be the use of secondary structure as a signal [15, 40, 57] to discover RNA. This approach builds upon extensive earlier research into predicting the secondary structure of an RNA molecule [45, 124]. However, recent reports [83, 118] have concluded that the secondary structure signal is not sufficient to detect ncRNA. Random sequences with a biased GC composition, or with a di-nucleotide composition similar to true RNA sequences, usually allow folding into energetically favorable secondary structures. Other de novo approaches include looking for the transcription start and similar signals, but have had limited success. The consensus is that the ncRNA signals in a genome are not as strong as the signals for protein coding genes.

Therefore, a natural way to solve this problem is based on comparative methods. One approach is to consider the evidence for RNA structure in sequences that are conserved through evolution. QRNA [84] tries to find ncRNA genes by scanning the conserved region alignments from two distant species. The program has been used to find ncRNAs in E. coli [85] and in Saccharomyces cerevisiae [67]. Other programs, such as ddbRNA [20], MSARI [18] and alignfold [109], use multiple alignments as input to detect conserved RNA secondary structures. However, if the sequences have diverged, constructing an accurate multiple alignment is itself a challenging problem. Furthermore, selecting appropriate genomic subsequences to align is also challenging because of the divergence in primary sequences.

Here, instead of trying to identify novel ncRNA families, we address the relatively easier problem of identifying subsequences that are similar in structure and sequence to a query. This approach has been used to find homologs of a specific RNA, such as tRNA [64]. It has also been extended to searching for homologs of ncRNA families by querying with a statistical representation of a multiple alignment of the family. Examples include CMsearch [35], which uses a covariance model (CM) [26], and ERPIN, which uses secondary structure profiles [31, 54]. Recently, Klein and Eddy developed a tool, RSEARCH [50], for searching a database with a single ncRNA query. This method depends upon existing algorithms for computing alignments between an RNA sequence and substrings of a database, where the alignment score is a function of sequence and structural similarity.
Known algorithms for computing such alignments are computationally intensive, with a running time of approximately $O(mw^2n)$, where m is the length of the query sequence, n is the length of the database sequence, and w is the maximum length of a database substring that is aligned to the query. For a test run on an Intel/Linux PC with a 2.8 GHz processor and 1 GB of memory, a microbial database of size 1.67 Mb, and a query 5S rRNA sequence, RSEARCH took over 6.5 hours to run. This makes it intractable for a large genome database.

In this chapter, we describe FastR, an efficient database search tool for ncRNA. FastR is generally two orders of magnitude faster; as an example, FastR reduces the compute time of the previously mentioned query to 103 seconds. Are such algorithmic improvements worth investigating? An analogy can be made with the BLAST [3] algorithm, which has had tremendous influence on the growth of sequence databases such as Genbank, and on bioinformatics as a discipline. While tools for sequence alignment based on the Smith-Waterman algorithm had been available for a long time, BLAST changed the landscape largely by its speed and accuracy in searching for sequence homologs. The main idea here is the development of filters that efficiently prune most of the database while retaining the true homologs. This has also been tried for ncRNAs. For example, to improve speed, Rfam employs an initial BLAST search to filter genomic sequences before running CMsearch [35]. Weinberg and Ruzzo [115] described filters based on Markov models, which can provably retain all hits that a covariance model could find. Because these two filters are based on conservation of the primary sequence, compensatory mutations in ncRNA sequences, which lower sequence similarity, may reduce their sensitivity or speed. Other approaches to filters have also been studied (see for example [23]); these search for simple motifs that might be shared by many ncRNA families. In contrast, the idea in FastR is to use RNA structural features as filters, where the filters are specific to a family. Most ncRNAs appear to be selected more for maintenance of a particular base-paired secondary structure than for conservation of primary sequence.
After filtering, we compute the alignments between the query ncRNA and all possible hits to find the true homologs.

We apply FastR to the discovery of novel riboswitches, which are a class of RNA domains found in the UTRs. They are of interest because they regulate metabolite synthesis by directly binding metabolites. We searched all available eubacterial and archaeal genomes (508 megabases) for riboswitches from the purine, lysine, thiamin, and riboflavin sub-families. Our results point to a number of novel candidates for each of these sub-families, and include genomes that were not previously known to contain riboswitches. As an example, a search with the purine riboswitch (Z99107.2/14363-14264) took 19 hours on a standard PC and resulted in the discovery of 180 homologs, including 33 of 35 known riboswitches. Nine of these are of interest as they lie less than 500 bases upstream of a gene involved in purine metabolism. Thus, FastR is a viable tool for discovering novel homologs of ncRNA.

The rest of the chapter is organized as follows. We describe details of the FastR algorithm in Section 2.2. In Section 2.3, the algorithm is validated by testing its speed and accuracy on known ncRNA sub-families. In that section, we also describe our findings from a search of the entire microbial database for novel riboswitches. The chapter ends with a summary in Section 2.4.

2.2 Methods

FastR solves the following problem: Given an RNA sequence with known secondary structure, efficiently compute all structural homologs (computed as a function of both sequence and structural similarity) in a genomic database. There are two stages in FastR. In the first stage, the database is filtered to identify substrings which have structural features similar to the query. In the second stage, the selected substrings are locally aligned to the query using a sequence structure alignment. Finally, p-values are assigned to the top hits.

2.2.1 Structure-based Filters

Before introducing our structure-based filtering method, we first address the question of whether sequence similarity with the query string is sufficient to get an initial set of candidate regions. To test this, we queried the whole genome of A. pernix (GenBank NC_000854.1) with an Asn-tRNA sequence. With default parameters, BLASTN selected 4 hits with an E-value < 0.001, and 24 hits with an E-value < 10. Only 3 of the 4, and 10 of the 24, matched the 43 hits produced by RSEARCH. Most of the alignments were less than 20 bp in length and would have been discarded. Another example is presented in Figure 2.1. The alignment of two tRNA sequences (Acc#: X07778.1/115-45 and AF200843.1/3014-3079) from Drosophila shows complete conservation of structure, but low sequence similarity. From this, and similar tests not included here, we do not anticipate a tool based on sequence similarity to be effective in finding RNA homologs. Therefore, we turn to the secondary structure of the query RNA sequence as the basis for our filter design. We will continue to use sequence similarity in computing the final alignments.

GCAUCGGUGGUUCAGUGGUAGAAUGCUCGCCUGCCACGCGGGCG
<<<<<<<..<<<<...... >>>>.<<<<<...... >>>>>..
UCUAAUAUGGCAGAUU...AGUGCAAUAGAUUUAAGCUCUAUAU

GCCCGGGUUCGAUUCCCGACCGAUGCA
..<<<<<...... >>>>>>>>>>>>.
AUAAAGU.AUUUU.ACUUUUAUUAGAA

Figure 2.1 Alignment of two tRNA sequences from Drosophila melanogaster (tRNA Gly, Acc#: X07778.1/115-45) (top) and Drosophila simulans (tRNA Leu, Acc#: AF200843.1/3014-3079) (bottom). The two molecules have identical secondary structure (there are 4 stacks, and two same-colored blocks form a stack), but very low sequence similarity (only 4 bases are matched in the stacked regions). Note that these are diverged members of a large super-family. However, they underscore the need for structure-based alignments.

As shown in Figure 2.2 (a), the secondary structure of an RNA has a tree-like shape and can be decomposed into various loops (interior loops, bulges, multi-loops) and stack regions. Each stem in this tree contains energetically favorable stacked base-pairs. The stacks are stabilized by hydrogen bonds between base-pairs. The Watson-Crick base-pairing (A↔U, C↔G) is energetically the most favorable, but other pairings such as the wobble base-pair (G↔U) are possible as well. Figure 2.2 (b) provides a 'stretched' view of the RNA structure.

[Figure 2.2 image. Panel (a) labels: multi-loop, stack, hairpin loop, bulge, pseudo-knot. Panel (b): substrings a, a', b, b', ..., h, h' laid out along the sequence, with a and a' separated by w' <= w.]

Figure 2.2 (a) An RNA structure with various structural elements including stacked base-pairs, bulges, hairpins, and multi-loops. (b) An alternative view. The set of bases in (a, a') forms a (k, w)-stack. The two substrings a and a' are w' bases apart, where w' ≤ w.

Each stack corresponds to a pair of sub-strings. These pairs are typically non-interleaving. While interleaved stacks, or pseudo-knots (such as the pairs (f, f') and (h, h') in Figure 2.2), do occur, they can be ignored for filtering purposes. Consider a nucleotide string s, with |s| = n. We define a (k, w)-stack as a pair of indices (i, j), i < j, such that (j − i) ≤ w and the substrings s[i . . . i + k − 1] and s[j . . . j + k − 1] can form an energetically favorable base-pair stack. As an example, the indices of the substrings (a, a') in Figure 2.2 form a (5, w)-stack if they are at most w bases apart. A simple filter choice for an RNA structure is the set of all starting positions i which contain a (k, w)-stack for appropriately chosen k and w. Let p be the probability that a pair of randomly chosen bases is part of a stack. The probability that a pair of indices (i, j) with (j − i) ≤ w forms a (k, w)-stack is $p^k$. Define $X_{i,j}$ as the indicator variable with $X_{i,j} = 1$ if and only if (i, j) forms a (k, w)-stack. Using linearity of expectation, the expected number of hits in a random string of length n is

$$E\Big(\sum_{i=1}^{n} \sum_{j=i+k}^{i+w} X_{i,j}\Big) = \sum_{i=1}^{n} \sum_{j=i+k}^{i+w} E(X_{i,j}) \le n w p^k.$$

See Table 2.1 for the expected number of hits per starting position ($\simeq w p^k$).

Table 2.1 Expected number of hits in a random string in a (k, w)-filter.

w \ k    4        5        6        7        8        9        10
20       0.3955   0.1483   0.0556   0.0208   0.0078   0.0029   0.001
40       0.791    0.2966   0.1112   0.0417   0.0156   0.0058   0.0021
60       1.1865   0.4449   0.1668   0.0625   0.0234   0.0087   0.0032
80       1.582    0.5932   0.2224   0.0834   0.0312   0.0117   0.0043
100      1.9775   0.7415   0.278    0.1042   0.0391   0.0146   0.0054

Obviously, for large k and small w, even this simple filter can be quite powerful. Assume for exposition purposes that the base-pairing is limited to the Watson-Crick base-pairs (A↔U, C↔G) and the wobble base-pair (G↔U). For a randomly and uniformly chosen pair of bases, the probability p of pairing is $p = 3/8$. As an example, typical tRNA structures have a clover-leaf shape with the outermost stem having a 7 base-pair stack separated by about 70 bases. The (7, 70) filter would eliminate over 90% of the starting positions from consideration. In fact, we can do better, as this base-pair unit is separated by at least 50 bases in all tRNA, therefore making w effectively 20 (50 ≤ w ≤ 70) and eliminating 98% of the starting positions. Note that the assumption that the bases are independent and identically distributed (i.i.d.) is not valid for a real genomic sequence. However, the same principle applies and similar results are observed in practice.

2.2.2 Structure-based Filter Design

We will use the (k, w)-stack as the basis for our filter design. However, we need to design more sophisticated filters, as indels may sometimes disrupt base-pair stacks (decreasing the effective value of k), and variability in separation may increase the effective value of w. We quantify some design goals for filters in order to evaluate different designs and to spur further research in this area. A good filter must be efficient. The time to filter should be no more than the time to align and score the filtered hits, and preferably as small as possible. Additionally, the filters must have high sensitivity and specificity. Sensitivity is defined as the fraction of all members of the ncRNA family that is admitted by the filter, and should be as close to 1 as possible. It may be acceptable to work with lower sensitivities, for example, to look for members in a sub-family. We define specificity as the expected number of hits per base-pair, which should be as small as possible. Finally, the filters must be general and simply described, so as to be applicable (with appropriate parameter tuning) to every ncRNA family. We propose the following Nested and Multiloop filters:

Nested filters: Considering the RNA secondary structure as a tree, and going depth first down a path (see Figure 2.2), we have many nested (k, w)-stacks. Consider (k, w)-stacks $s_1 = (i_1, j_1)$ and $s_2 = (i_2, j_2)$. Stack $s_1$ is nested in stack $s_2$ if $i_1 \ge i_2 + k$ and $j_2 \ge j_1 + k$. A (k, w, l)-nested stack is a collection of l (k, w)-stacks $s_1, s_2, \ldots, s_l$ such that for all $i \in [1, l-1]$, $s_{i+1}$ is nested in $s_i$. For example, in Figure 2.2, the configuration (a, a'), (c, c'), (d, d'), (e, e') is a (k, w, 4)-nested stack.

Parallel and Multiloop stacks: Yet another way of looking at RNA structural elements is to locate non-nested, non-overlapping (k, w)-stacks. Consider stacks $s_1 = (i_1, j_1)$ and $s_2 = (i_2, j_2)$. Stack $s_1$ is parallel to stack $s_2$ if $j_1 < i_2$ or $j_2 < i_1$. A (k, w, l)-parallel stack is a set of stacks $s_1, s_2, \ldots, s_l$ such that any pair of stacks is parallel to each other. This definition can be extended to a multiloop stack. A (k, w, l)-multiloop stack is a configuration that is a (k, w, l − 1)-parallel stack in which each of the stacks is nested in a (k, w)-stack. The units (b, b'), (d, d'), (f, f'), and (g, g') in Figure 2.2 form a (k, w, 4)-parallel stack. Correspondingly, {(a, a'), (b, b'), (d, d'), (f, f'), (g, g')} is a (k, w, 5)-multiloop stack.
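The definitions above translate directly into small predicates. The sketch below is illustrative only: the names are hypothetical, a stack is represented by the start indices (i, j) of its two k-mers, and "nested" is read as the inner stack lying between the two arms of the outer one ($i_1 \ge i_2 + k$ and $j_2 \ge j_1 + k$), which is an assumption about the intended condition.

```python
from itertools import combinations
from typing import List, Tuple

Stack = Tuple[int, int]  # (i, j): start indices of the two k-mers of a (k, w)-stack


def is_nested(inner: Stack, outer: Stack, k: int) -> bool:
    """True if `inner` lies between the two arms of `outer` (assumed reading)."""
    (i1, j1), (i2, j2) = inner, outer
    return i1 >= i2 + k and j2 >= j1 + k


def is_parallel(s1: Stack, s2: Stack) -> bool:
    """True if one stack starts its second arm before the other stack begins."""
    (i1, j1), (i2, j2) = s1, s2
    return j1 < i2 or j2 < i1


def is_nested_chain(stacks: List[Stack], k: int) -> bool:
    """(k, w, l)-nested stack: each stack is nested in the one before it."""
    return all(is_nested(stacks[t + 1], stacks[t], k) for t in range(len(stacks) - 1))


def is_multiloop(outer: Stack, inner: List[Stack], k: int) -> bool:
    """Multiloop stack: pairwise-parallel stacks that are all nested in `outer`."""
    pairwise = all(is_parallel(a, b) for a, b in combinations(inner, 2))
    return pairwise and all(is_nested(s, outer, k) for s in inner)
```

For instance, with k = 4 the stacks (5, 15), (22, 32) and (40, 50) are pairwise parallel and all nested in (0, 60), so together with (0, 60) they form a multiloop configuration with l = 4.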

The nested, parallel, and multiloop stacks are all generalizations of the (k, w)-stack, and are therefore applicable to all families of ncRNA. There are conserved structural elements in every ncRNA family that enforce the correct folding, so it should be possible to find multiloop and nested structures with high sensitivity. Also, the simple description allows us to compute specificity using combinatorial techniques.

[Figure 2.3 image: stem diagram annotated with the distance ranges 50−75, 3−7, 3−15, 0−7, 5−15, 0−7, 5−15, 3−7.]

Figure 2.3 A $(k, \vec{w}, 4)$-multiloop stack for tRNA with distance constraints, with $\vec{w} = [(50, 75), (3, 7), (3, 15), (0, 7), (5, 15), (0, 7), (5, 15), (3, 7)]$.

To increase the specificity of these filters, we need to extend the design to include distance constraints (number of base-pairs) in between the various (k, w)-stacks. For a filter with l (k, w)-stacks, there are 2l substrings of length k each, with 2l − 1 distances between adjacent substrings. To this, we add an additional distance between the first and the last substring, and we have a vector of 2l distances. We constrain the distances by a 2l-dimensional vector $\vec{w}$ containing the allowed ranges for each of these distances. Choose $w_0$ to be the range of distances between the first and last substring, and $w_j$, $j \ge 1$, to be the ranges of distances between adjacent substrings ordered from left to right. A (multiloop/nested) filter satisfying these constraints is a $(k, \vec{w}, l)$-filter. Note that a (k, w, l)-multiloop stack can be redefined by choosing $\vec{w}$ such that $w_j = (0, w)$ for all j. A $(4, \vec{w}, 4)$-multiloop stack for tRNA with appropriate distance constraints is shown in Figure 2.3.

To compute the specificity of a (multiloop or nested) filter, we address the following combinatorial problem: What is the probability that an arbitrary position in the random database is the start of a $(k, \vec{w}, l)$-multiloop stack, or nested stack? In general, this is hard to compute because of the various dependencies between overlapping units, so we approach it indirectly. Consider a 2l-dimensional vector $\vec{v}$. If the distances in $\vec{v}$ are within the range specified by $\vec{w}$, then $\vec{v}$ denotes a configuration of a $(k, \vec{w}, l)$-multiloop stack obtained by fixing the 2l positions of the l (k, w)-stacks using the distances in $\vec{v}$. The probability of occurrence of an arbitrary configuration is exactly $p^{kl}$.
For an arbitrary starting position and a configuration $\vec{v}$, define an indicator variable

$$X_{\vec{v}} = \begin{cases} 1 & \text{if a } (k, \vec{w}, l)\text{-multiloop stack occurs with configuration } \vec{v}, \\ 0 & \text{otherwise.} \end{cases}$$

Let $Y = \sum_{\vec{v}} X_{\vec{v}}$. We are interested in computing $\Pr[Y > 0]$. By linearity of expectation,

$$E(Y) = \sum_{\vec{v}} E(X_{\vec{v}}) = n_{k,\vec{w},l}\, p^{kl},$$

where $n_{k,\vec{w},l}$ is the number of possible configurations of a $(k, \vec{w}, l)$-multiloop stack. $n_{k,\vec{w},l}$ can be computed using standard combinatorial arguments. We consider two special cases:

1. Let $0 \le w_j \le w$ for all j. Then,
$$n_{k,\vec{w},l} = \binom{w - 2(k-1)l - 1}{2l - 1}.$$

2. Let $0 \le w_0 \le \infty$, and for all $j > 0$, let $0 \le w_j < x$. Then, $n_{k,\vec{w},l} = x^{2l-1}$.

Ideally, we choose the distance constraints so that $n_{k,\vec{w},l}\, p^{kl} \ll 1$. For those values, we can use the Markov inequality ($\Pr[Y > 0] \le E(Y)$) to get the desired bound. For higher values of E(Y), we need other techniques to bound the probability. These computations allow us to quantify the sensitivity-specificity trade-off due to a change in the distance constraints. Further increases in specificity are obtained by using intersections of nested and multiloop stacks. In Section 2.3, we describe our filtering results on various test cases. Informally, making a filter restrictive increases specificity at the cost of sensitivity. However, in most families of interest, we can design effective filters that reduce the database size by two orders of magnitude.
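As a rough numerical companion to the two special cases, the following sketch evaluates the configuration counts and the resulting Markov bound. It assumes the i.i.d. pairing probability p = 3/8 from Section 2.2.1; the function names are illustrative and not part of FastR.

```python
from fractions import Fraction
from math import comb

P_PAIR = Fraction(3, 8)  # pairing probability under the i.i.d. model (assumption)


def configs_uniform_window(k: int, w: int, l: int) -> int:
    """Case 1: every distance lies in [0, w]."""
    top = w - 2 * (k - 1) * l - 1
    return comb(top, 2 * l - 1) if top >= 0 else 0


def configs_bounded_gaps(x: int, l: int) -> int:
    """Case 2: w_0 unconstrained, each of the other 2l - 1 distances in [0, x)."""
    return x ** (2 * l - 1)


def expected_matches(n_configs: int, k: int, l: int) -> float:
    """E(Y) = n_configs * p**(k*l); by the Markov inequality Pr[Y > 0] <= E(Y)."""
    return float(n_configs * P_PAIR ** (k * l))


if __name__ == "__main__":
    n = configs_bounded_gaps(x=5, l=4)        # 5**7 configurations
    print(n, expected_matches(n, k=4, l=4))   # bound on Pr[Y > 0] per start position
```

In this toy setting (k = 4, l = 4, x = 5) the bound is a little over 0.01, whereas widening each gap range to x = 10 pushes E(Y) above 1, at which point the Markov bound is no longer informative.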

2.2.3 Optimal Structure-based Filter Design

Given a family R of ncRNA sequences, an ideal (nested or multiloop) filter would seek to minimize $n_{k,\vec{w},l}\, p^{kl}$ (increase specificity) while admitting a large fraction of the members (sensitivity) and allowing efficient filtering. Initial tests on the purine filters resulted in a ten-fold improvement in total running time with no loss of sensitivity. We will describe our results on optimal filter design elsewhere.

As the input to FastR is a single query ncRNA, we employ a dynamic programming algorithm that automatically generates nested and multiloop filters with high specificity. The algorithm takes advantage of the tree-like structure of RNA. It iterates over every value of k and l. For each such pair of values, and every node v in the tree, it checks if a (k, l)-nested (multiloop) filter is possible. The final filter chosen is one that maximizes kl, while keeping k as low as possible. The software then allows users to tweak the computed parameters to get the desired sensitivity while retaining specificity. However, our test results show that the automatically generated filters have sensitivity that is comparable to the fine-tuned filters.

2.2.4 Structure-based Filtering Algorithms

Filtering speed is critical to fast homolog computation. We use a combination of string matching and dynamic programming techniques (see for example [37]) to filter databases with multiloop and nested filters.

1. Hash: Build a hash table to compute all k-mer positions in the database. The time taken is O(m), where m is the size of the database.

2. Identify (k, w)-stacks: Let $s_i$ denote the k-mer at an arbitrary position i in the database. For each $s_i$, compute a neighborhood $N(s_i)$ of all 'complementary' k-mers. To identify (k, w)-stacks efficiently, we use the hash table to compute all positions j such that $s_j \in N(s_i)$ and j − i satisfies the distance constraints. The time taken is linear in the number of (k, w)-stacks, which is typically smaller than the size of the database.

3. Filters: Note that multiloop and nested filters are combinations of (k, w)-stacks. We scan the database with a moving window of size w. An 'active' list of (k, w)-stacks within the window is maintained, and a dynamic programming technique is used to compute filters from this list. The total computation is bounded by $O(m_k w)$, where $m_k$ is the number of (k, w)-stacks. Typically $m_k < m/w$.

In general, any k-mer that can form an energetically favorable stack with $s_i$ should be in $N(s_i)$. In our current implementation, we do not allow indels, and allow at most 2 G↔U pairs. To check that this restriction is reasonable, we scanned all of the structures in Rfam 5.0 [35] and found that at least 93% of all stacks contain an ungapped base-pairing of size at least 4. Note that even the absence of an ungapped stack does not preclude the formation of a filter using other stacks in the same molecule. Therefore, this is a reasonable choice that does not affect sensitivity too much. The current filter time is a few seconds per Mb of sequence, which is easily dominated by the time for computing alignments. Also, the filters are very effective in eliminating a large fraction of the database while retaining most of the true hits.

2.2.5 Computing RNA Sequence Structure Alignment

After filtering, we need to align the filtered regions to the query. There are three types of alignments for RNA sequences: (1) RNA plain sequence alignment, which takes into account the secondary structures in the sequences [66, 91], (2) RNA structure structure alignment, which aligns tree-like secondary structures together [39, 119], and (3) RNA sequence structure alignment, which aligns a plain sequence to a secondary structure or a structure profile [5, 25, 59]. In this paper, we are dealing with the third type of alignment: the filtered database substrings must be structurally aligned to the query to identify true homologs. This problem has been well studied in the literature, with scoring based on a Nussinov-like counting model [5, 47, 91] and probabilistic models such as Covariance Models and Stochastic Context-Free Grammars for RNA [25, 90]. It is also possible to extend the Zuker-Turner thermodynamic model [45, 124] for scoring sequence structure alignments.

Here, we extend the approach from Bafna et al. [5] to include a new binarizing procedure, banded alignment for efficient computation, and more realistic score functions. We use the scoring matrix (RIBOSUM) from Klein and Eddy [50] and empirically generated affine gap penalties to score the alignments. We note that our filtering approach generates candidates which can be used in conjunction with any alignment method. However, we use the extra information from the filter match to speed up alignment computation using banding techniques.

Consider two RNA sequences s[1, ..., m] and t[1, ..., n]. We know the secondary structure of s, which is a set of base-pairs, S, where (i, j) ∈ S implies that s[i] bonds with s[j]. The alignment A of two RNA strings s and t can be described by a matrix of two rows. The first row A[1, ∗] contains the string s with interspersed gaps, and the second row A[2, ∗] contains the string t with interspersed gaps. Each column has at most one gap in it. If A[1, i] and A[1, j] form a base-pair, we score for both sequence and structure using the function δ(A[1, i], A[1, j], A[2, i], A[2, j]). As long as A[2, i] and A[2, j] also form a base-pair, we give a high score to capture complementary mutations. Additionally, we score each column that does not participate in base-pairing by a function γ(A[1, i], A[2, i]) that measures sequence conservation. Alignments are scored by summing up the contributions of sequence and structural alignments.

A naive algorithm would iterate over all pairs of intervals in s and t. We can do better by exploiting the structure of s. Ignoring pseudo-knots, each base-pair has a unique enclosing base-pair; thus S can be shown to be a tree with each node denoting a base-pair, and the obvious parent-child relation. First, we augment the tree (see the algorithm and an illustration in Figure 2.4) by adding spurious base-pairs so that each nucleotide (originally base-paired or not) is in some base-pair, each node has at most two children, and the number of nodes is O(m), where |s| = m. For any unpaired base, a spurious edge is added between this base and the leftmost base that can be reached without crossing real base-pairing edges. Additionally, each node v ∈ S has at most one child in the augmented structure, which is denoted by S'.

[Figure 2.4, panels (a) and (b): diagrams of the augmented structure and its binary tree, with nodes labeled a to k.]

procedure Binarize(i, j)    (* Binarize the interval (i, j). *)
    if (i = j)
        return (create_node(i, j, dotted, Nil));    (* A dotted node with 0 children. *)
    if (i, j) ∈ S
        v = Binarize(i+1, j-1);
        return (create_node(i, j, solid, v));    (* A solid node with 1 child v. *)
    if (k, j) ∈ S for some i < k < j
        vl = Binarize(i, k-1);
        vr = Binarize(k, j);
        return (create_node(i, j, dotted, vl, vr));    (* A dotted node with 2 children, vl and vr. *)
    if (i < j)
        v = Binarize(i, j-1);
        return (create_node(i, j, dotted, v));    (* A dotted node with 1 child v. *)
    end if
(c)

Figure 2.4 Procedure to create a binary tree for s with structure S, having O(m) nodes such that each node has at most 2 children. (a) Nodes on the horizontal line represent the sequence s. a, f, g and h are paired bases. i, j, k and b are unpaired bases. c, d and e represent the branches. The solid edges correspond to base-pairs in S, while the dotted edges correspond to augmented spurious edges. (b) A binary tree representation for (a), obtained by changing solid edges into solid nodes and dotted edges into void nodes. (c) The Binarize procedure.
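The Binarize procedure can be written as a short recursive function. The sketch below is one possible rendering under stated assumptions: the structure is given in dot-bracket notation without pseudo-knots, indices are 0-based, the left branch of a split covers (i, k − 1), and an empty interval (directly adjacent paired bases) yields no child. The `Node` class and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    left: int                              # l_v: leftmost query index of the node
    right: int                             # r_v: rightmost query index of the node
    solid: bool                            # True for a real base-pair, False if spurious
    child: Optional["Node"] = None         # only child, or left child when branching
    right_child: Optional["Node"] = None   # second child of a branching node


def pairs_from_dot_bracket(structure: str) -> Dict[int, int]:
    """Map every paired position to its mate, in both directions."""
    stack, partner = [], {}
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            mate = stack.pop()
            partner[mate], partner[pos] = pos, mate
    return partner


def binarize(i: int, j: int, partner: Dict[int, int]) -> Optional[Node]:
    """Build the binary tree for query positions i..j (inclusive)."""
    if i > j:                                  # empty interval, e.g. inside "()"
        return None
    if i == j:
        return Node(i, j, False)               # dotted leaf
    k = partner.get(j)
    if k == i:                                 # (i, j) is a real base-pair
        return Node(i, j, True, binarize(i + 1, j - 1, partner))
    if k is not None and i < k < j:            # j pairs inside: split into two branches
        return Node(i, j, False, binarize(i, k - 1, partner), binarize(k, j, partner))
    return Node(i, j, False, binarize(i, j - 1, partner))   # j is unpaired


def count_nodes(v: Optional[Node]) -> int:
    """Number of nodes in the (sub)tree rooted at v."""
    return 0 if v is None else 1 + count_nodes(v.child) + count_nodes(v.right_child)
```

Calling `binarize(0, len(s) - 1, pairs_from_dot_bracket(s))` on a structure string `s` returns the root. Each call consumes at least one position of the interval, which keeps the number of nodes linear in the query length.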

procedure alignRNA
(* S is the set of base-pairs in the RNA structure of s. S' is the augmented set. *)
for all intervals (i, j), 1 ≤ i < j ≤ n, all nodes v ∈ S'
    if v ∈ S
$$A[i, j, v] = \max \begin{cases} A[i+1, j-1, \mathrm{child}(v)] + \delta(t[i], t[j], s[l_v], s[r_v]), \\ A[i, j-1, v] + \gamma(\text{'−'}, t[j]), \\ A[i+1, j, v] + \gamma(\text{'−'}, t[i]), \\ A[i+1, j, \mathrm{child}[v]] + \gamma(s[l_v], t[i]) + \gamma(s[r_v], \text{'−'}), \\ A[i, j-1, \mathrm{child}[v]] + \gamma(s[l_v], \text{'−'}) + \gamma(s[r_v], t[j]), \\ A[i, j, \mathrm{child}[v]] + \gamma(s[l_v], \text{'−'}) + \gamma(s[r_v], \text{'−'}) \end{cases}$$
    else if v ∈ S' − S, and v has one child
$$A[i, j, v] = \max \begin{cases} A[i, j-1, \mathrm{child}[v]] + \gamma(s[r_v], t[j]), \\ A[i, j, \mathrm{child}[v]] + \gamma(s[r_v], \text{'−'}), \\ A[i, j-1, v] + \gamma(\text{'−'}, t[j]), \\ A[i+1, j, v] + \gamma(\text{'−'}, t[i]) \end{cases}$$
    else if v ∈ S' − S, and v has two children
$$A[i, j, v] = \max_{i \le k \le j} \{ A[i, k-1, \mathrm{left\_child}[v]] + A[k, j, \mathrm{right\_child}[v]] \}$$
    end if
end for

Figure 2.5 An algorithm for aligning a query RNA s of length m with a database string t of length n. The query structure S has been Binarized to get S'. The index pair in s corresponding to each node v ∈ S' is denoted by $(l_v, r_v)$.

A schematic algorithm for aligning an RNA query against a sequence is given in Figure 2.5. Note that this algorithm uses linear gap penalties. In our implementation, we use a slightly more sophisticated affine gap function (omitted in Figure 2.5 for exposition). Our alignment is local in the subject sequence (there is no penalty for aligning ends of the sequence), but global in the query sequence (the entire query must be aligned). We limit the intervals in s to nodes v ∈ S', which are bounded by O(m). Figure 2.5 describes the algorithm for aligning sequence t against sequence s (with known structure). Each node v in the tree structure of s is aligned against each interval (i, j) of t. Suppose v ∈ S and let $l_v$ and $r_v$ denote the indices of the left and right end-points of v. If, for example, $s[l_v] = t[i]$ and $s[r_v] = t[j]$, then clearly

$$A[i, j, v] = A[i+1, j-1, \mathrm{child}(v)] + \delta(t[i], t[j], s[l_v], s[r_v]).$$

If, on the other hand, v ∈ S' − S and v has 2 children, then we need to iterate over all k such that right_child(v) can align with the interval (k, j). The procedure alignRNA (Figure 2.5) describes the dynamic programming algorithm to handle all the cases. Let $m_1$ and $m_2 = m - m_1$ denote the number of nodes with 1 and 2 children respectively. The complexity of alignRNA, with a query of length m and a target of length n, is $O(n^2 m_1 + n^3 m_2)$. This parameterization is useful because in typical structures $m_2 \ll m$. In our case, the complexity can be further reduced. The sequence pairs that need to be aligned have been filtered for an underlying sub-structure. The preliminary alignment obtained by this filter allows us to limit the nodes in S' that can be applied to a position i in t, based on the left end-point of v and the width. This banding reduces the number of nodes to a constant, effectively making the complexity $O(n^2 \delta_m^2)$, where $\delta_m \ll m$ is the size of the banded region. The banding forces a tradeoff. Overlapping hits from the filter can either be aligned independently with a tight band, or merged and aligned once with a larger band size.

2.2.6 P-value Computation

For an effective database search, we need p-values for the probability that a hit was obtained by chance. Klein and Eddy make the argument that the distribution of scores of RNA structural alignments follows the Gumbel distribution. This is a strong assumption, and determination of a true p-value is a challenging research problem. Therefore, we choose to express the p-value by using the non-parametric Chebyshev's inequality. To obtain the mean and variance, after each query the query is aligned against randomly generated sequence with a GC-content similar to that of the database. The bound provided by this inequality is conservative, and over-estimates the probability of obtaining a similar score by chance. We have found that a cut-off of 0.03 is a reasonable value in practice.
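The Chebyshev-based bound can be illustrated as below. This is a sketch only: it assumes the two-sided form of the inequality, $\Pr[|X - \mu| \ge a] \le \sigma^2 / a^2$, applied with $a$ equal to the distance of the hit score from the null mean, and the function name is hypothetical. The text does not specify which form FastR uses.

```python
from statistics import mean, pvariance
from typing import Sequence


def chebyshev_p_value(score: float, null_scores: Sequence[float]) -> float:
    """Conservative bound on the chance of a score at least this far from the null mean.

    `null_scores` are alignment scores of the query against random sequence.
    """
    mu = mean(null_scores)
    var = pvariance(null_scores, mu)
    if score <= mu:                 # no evidence against the null: trivial bound
        return 1.0
    return min(1.0, var / (score - mu) ** 2)


if __name__ == "__main__":
    null = [10, 12, 8, 10, 11, 9]        # toy scores against random sequence
    print(chebyshev_p_value(20, null))   # bound for a hit scoring 20
```

With the 0.03 cut-off mentioned above, a hit is reported under this form of the bound when its score lies roughly 5.8 standard deviations or more above the null mean, since 1/5.8² ≈ 0.03.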

2.3 Testing Results

We describe the results on filtering and alignment independently before giving combined results. To test our algorithms, we worked with ncRNA sub-families of known/predicted structure from Rfam [35] and the 5S Ribosomal RNA database [99]. Four sub-families are considered here: tRNA, 5S rRNA, the hammerhead ribozyme, and 4 riboswitches (purine, lysine, thiamin, and riboflavin [107, 116]). Of these, tRNA and rRNA are well-studied sub-families. Most genome annotations include screening and annotation for tRNA. The different riboswitches are of great interest because they regulate metabolite (nucleic-acids, amino-acids, vitamins) synthesis by direct binding to metabolites. In subsequent tests, we search the entire complement of eubacterial and archaeal genomes for novel riboswitches.

For every sub-family, we chose representative members, inserted them in a random database of size 1 Mb, and tested our algorithms on the composite sequence. The probability of finding stacks at random depends on the GC-content, so in some cases, the random database was created by first choosing the GC-content, and subsequently generating bases with the appropriate fixed probability. G+C probabilities of 0.35, 0.5, and 0.75 were chosen to study the effect of GC-content. All experiments were performed on an Intel PC (3.4 GHz, 1 GB RAM) running Linux.
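A test database of this kind can be generated along the following lines. The sketch assumes an i.i.d. model in which G and C each occur with probability GC/2 and A and U each with probability (1 − GC)/2, and it plants the members at uniformly random positions; the function names and the seeded generator are illustrative choices, not a description of the original experimental setup.

```python
import random
from typing import List, Sequence, Tuple


def random_rna(length: int, gc: float, rng: random.Random) -> str:
    """i.i.d. sequence in which G and C each have probability gc / 2."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]   # A, C, G, U
    return "".join(rng.choices("ACGU", weights=weights, k=length))


def plant_members(background: str, members: Sequence[str],
                  rng: random.Random) -> Tuple[str, List[Tuple[int, int]]]:
    """Insert each member at a random position; return composite and planted spans."""
    cuts = sorted(rng.randrange(len(background) + 1) for _ in members)
    pieces, spans, prev, offset = [], [], 0, 0
    for cut, member in zip(cuts, members):
        pieces.append(background[prev:cut])
        start = cut + offset            # shift by the members already inserted
        pieces.append(member)
        spans.append((start, start + len(member)))
        offset += len(member)
        prev = cut
    pieces.append(background[prev:])
    return "".join(pieces), spans


if __name__ == "__main__":
    rng = random.Random(0)
    db = random_rna(1_000_000, gc=0.35, rng=rng)
    composite, spans = plant_members(db, ["GGGAAACCC", "AUAUAU"], rng)
    print(len(composite), spans)
```

Keeping the planted spans makes it straightforward to count true positives and false negatives for a filter run on the composite sequence.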

2.3.1 Filtering for ncRNA

Table 2.2 describes the results of applying various filters. As expected, as the filters become more stringent (higher k, l, less variable distances), the number of false negatives increases. However, for each family, there exist appropriate filters that filter out a large portion of the database while retaining most of the members of the family. Also, as the GC-content is biased away from 0.5, the number of false hits increases. The false negatives are all explained by one of 3 possibilities: (a) The proposed structure contains non-canonical base-pairs, which are not allowed by FastR. For example, 10/100 tRNA sequences contain A↔G, A↔A, or A↔C base-pairs. (b) One of the (k, w)-stacks is missing due to indels, mismatches or short stacks. (c) Distance constraints are not satisfied. In ongoing work, we plan to change the neighborhood computation to include all k-mers that form low energy stacks according to the Zuker-Turner [45] thermodynamic considerations.

With respect to varying distance constraints, one has to choose the correct speed/sensitivity trade-off. The filters we have selected are all fast, and lead to very few hits in the random database. This number will increase as the distance constraints are relaxed. The running time for filtering increases linearly with the size of the database. Clearly, the time to filter is small enough to be dominated by the alignment, so in principle, one could try more complex filters. However, the advantage of the simpler nested and multi-loop filters is their universality over various families of ncRNA. Designing an appropriate filter for a sub-family is then just a matter of choosing appropriate values for k, l, and $\vec{w}$.

Table 2.2 The results of applying nested and multiloop filters (with various parameters) to random databases that contain true positives. As the filters become more stringent, the number of hits decreases, and the number of false negatives increases. In all but the (*) cases, at most 2 G↔U base-pairs are allowed in a stack. (*) refers to cases where only 1 G↔U base-pair is allowed by the multiloop filter. The false negatives are due to non-canonical base-pairing, small stacks (k = 3), or the distance constraints being out of range, as described in the last 3 columns.

ncRNA           True Pos./Tot.   Non-canonical pairing   Missing (k, w)-stack   Deviant w
tRNA            89/100           10                      1                      0
tRNA            89/100           10                      1                      0
tRNA            89/100           10                      1                      0
5S rRNA         80/100           0                       20                     0
5S rRNA         80/100           0                       20                     0
Hammerhead      50/57            1                       0                      6
Purine-Rs       33/35            0                       2                      0
Lysine-Rs       28/32            0                       4                      0
Thiamin-Rs      84/115           0                       18                     13
Riboflavin-Rs   38/41            0                       2                      1

Remaining column values (their assignment to rows is uncertain):
GC: 0.7, 0.7, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.50, 0.35
k: 4, 4, 4, 5, 5, 4, 5, 4(*), 4(*), 3(*)
l: 3, 3, 3, 2, 2, 2, 2, 2, 4, 0
#Hits (/Mb): 558, 7502, 3307, 6250, 2749, 21120, 29379, 37208, 10263, 10822

2.3.2 Alignment

To test alignment quality, we computed alignments on a randomly generated 300 kb database sequence with a set of ncRNA sequences for each family inserted in it. No filtering was used for FastR. Figure 2.6 shows ROC plots for the two alignment algorithms. The two are comparable, with RSEARCH performing better for distant homologs. The time taken for FastR (banded) tRNA alignment is 3 minutes 48 seconds, compared to 20 minutes 42 seconds for RSEARCH.

Finally, we evaluate FastR after combining filtering and alignment. We randomly select the query for each family from the Rfam seed alignment and search the random sequences using FastR and RSEARCH. Table 2.3 summarizes the results of our search. Similar results are achieved when repeating the tests with different queries. As can be seen, FastR is close to two orders of magnitude faster than RSEARCH while maintaining comparable sensitivity. As seen in the previous section, FastR alignments and scores are good for the high quality hits, but decrease thereafter, leading to some loss of sensitivity. As expected, however, much of the loss of sensitivity can be attributed to filtering. For 5S rRNA, the filter allows 80 of the 100 true positives, which are almost completely retrieved by FastR. In contrast, RSEARCH gets the top 97 but needs two orders of magnitude more time, making it much harder to conduct large scale searches. It should also be pointed out that many of the true positives were initially discovered using covariance models, which are not unlike the model used by RSEARCH.

As a final validation of the FastR algorithm, we apply it to the discovery of novel riboswitches. Our results point to a number of interesting findings.
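The ROC plots used in this comparison count true positives against false positives as one walks down the ranked hit list. A minimal sketch of that bookkeeping follows; the function name and the boolean input encoding are illustrative assumptions, not part of FastR or RSEARCH.

```python
def roc_points(ranked_hits):
    """Cumulative (#FP, #TP) after each hit.

    ranked_hits: iterable of booleans, best-scoring hit first,
    True when the hit overlaps an inserted (true) ncRNA.
    """
    points, tp, fp = [(0, 0)], 0, 0
    for is_true in ranked_hits:
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points


print(roc_points([True, True, False, True]))
# [(0, 0), (0, 1), (0, 2), (1, 2), (1, 3)]
```

A method whose curve climbs higher before moving right retrieves more true homologs at the same number of false positives.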

Figure 2.6 ROC plots for the alignments generated by RSEARCH and FastR, with one panel for each of tRNA, 5S rRNA, Hammerhead, Purine-Rs, Lysine-Rs, Thiamin-Rs, and Riboflavin-Rs. Alignments were tested using a 300 kb random sequence with a set of true ncRNAs inserted in it. The x-axis (#FP) represents the number of false positives and the y-axis (#TP) represents the number of true positives. The horizontal line represents the number of true hits in the random sequences.

Table 2.3 Comparison of FastR and RSEARCH. A p-value cutoff for FastR, 0.05, was chosen that approximately matched the total number of hits in RSEARCH with a cutoff E-value of 10. The hits column refers to the number of true positives out of the total hits found. The filtered hits column represents the number of true positives that passed the filters. No filtering is used for RSEARCH.

Program   Query                                             Hits (TP/Tot)   Filtered Hits   Time (s)
RSEARCH   Asn-tRNA (AE001087.1/4936-5008)                   85/93           100             3411
FastR     Asn-tRNA (AE001087.1/4936-5008)                   77/93           82              52
RSEARCH   5S rRNA (AE016770.1/210436-210555)                97/97           100             14939
FastR     5S rRNA (AE016770.1/210436-210555)                80/80           80              44
RSEARCH   Hammerhead (M83545.1/56-3)                        50/58           50              2741
FastR     Hammerhead (M83545.1/56-3)                        47/47           47              34
RSEARCH   Purine-Rs (Z99107.2/14363-14264)                  34/35           35              5461
FastR     Purine-Rs (Z99107.2/14363-14264)                  33/33           33              77
RSEARCH   Lysine-Rs (Z75208.1/54883-55062)                  32/39           32              26581
FastR     Lysine-Rs (Z75208.1/54883-55062)                  28/28           28              159
RSEARCH   Thiamin-Rs (Z99110.2/31833-31942)                 109/116         115             7850
FastR     Thiamin-Rs (Z99110.2/31833-31942)                 71/81           84              234
RSEARCH   Riboflavin-Rs (L09228.1/7992-8136)                41/45           41              14385
FastR     Riboflavin-Rs (L09228.1/7992-8136)                31/31           38              79

2.3.3 Search Riboswitches using FastR

Riboswitches are cis-regulatory elements typically found in the 5' untranslated region of the gene they regulate. To date, six such motifs have been identified that control the anabolism of three vitamins (riboflavin, thiamin, cobalamin), as well as the biosynthesis of methionine, lysine and purine [68, 86, 107, 116]. Similar to previously characterized RNA regulatory structures, each riboswitch is capable of folding into a consensus structure which may result in either transcription attenuation or translation inhibition. However, the riboswitch element is unique in that it binds directly to ligands and is therefore able to sense the level of cellular metabolites without the need of trans-acting protein factors. It is believed that this class of ncRNA appeared early in evolution, and accordingly, riboswitch elements have been found in a wide range of bacterial species.

The vitamin riboswitches are the most diverse and can be identified in archaea and eubacteria. In particular, the thiamin riboswitch has been characterized in fungi and plants such as rice and Arabidopsis [107]. Conversely, the methionine, lysine, and purine riboswitches are more common in gram-positive bacteria. The repression mechanism is also biased by a bacterium's phylogeny. Gram-positive bacteria typically prefer transcription termination, whereas gram-negative micro-organisms tend to mediate gene repression by inhibiting translation.

While riboswitches are ubiquitous, homologs show little sequence similarity. Even in the most conserved regions, typically those for ligand binding, sequence identity may extend over fewer than 7 nucleotides.
We used FastR to search both plus and minus strands of bacterial and archaeal genomes with queries from purine, thiamin, lysine and riboflavin riboswitches. A data set of non-redundant, known riboswitches existing within our genome files was assembled from the Rfam database. This data set was used to determine the p-value cutoffs, and a single member was used as the query sequence. A total of 245 genomes comprising 508 Mb were searched in both strands. Candidate riboswitch sequences generated by FastR were filtered in order to find the best predictions. First, known riboswitches from the Rfam database, low-complexity predictions and AT-rich predictions were discarded. The remaining predictions were filtered by their distance from the 5' start of an exon. Finally, the predictions were manually examined to determine if the downstream gene was biologically relevant. The results are summarized in Table 2.4.

Table 2.4 Summary of the FastR riboswitch search. (a) The p-value cutoffs were determined from alignments of known riboswitches. (b) Number of novel hits returned by FastR after removing the annotated hits in the Rfam database. (c) Number of novel hits after removing annotated hits in the Rfam database and filtering for low-complexity, AT-content, and distance from a gene.

Riboswitch   Query
Purine       Z99107.2/14363-14264
Lysine       Z75208.1/54883-55062
Thiamin      Z99110.2/31833-31942
Riboflavin   L09228.1/7992-8136

Remaining column values (their assignment to rows is uncertain):
Time, hrs (secs/Mb): 19.12 (67.7), 22.86 (80.94), 45.82 (162.23), 42.00 (148.71)
P-value cutoff (a): 0.03, 0.03, 0.04, 0.03
Novel hits (b): 10, 3350, 2190, 1592
Filtered hits (c): 3, 85, 180, 200

We focus on the 18 most promising hits, even though the remaining hits are likely to contain many interesting candidates. See Table 2.5. These predictions represent either those elements which are upstream from genes involved in the metabolic pathway under regulation, or predictions with strong sequence similarity in regions thought to mediate ligand binding. Of the 9 putative purine riboswitches, 6 are found in the 5' UTR of either the xanthine transport protein, xanthine phosphoribosyltransferase, purine nucleoside phosphorylase, adenine deaminase, or GMP synthase. Moreover, prediction 2 (gi|42519879) lies upstream from a hypothetical protein with homology to the xanthine permease family. This observation highlights a hidden value in the identification of riboswitches: the possibility of assigning annotations to genes of unknown function. Similarly, of the 7 reported lysine riboswitch predictions, 5 are located upstream of genes encoding an amino acid permease, diaminopimelate decarboxylase, dihydrodipicolinate synthase, or lysine specific permease. The final predictions for the riboflavin and thiamin riboswitches are found upstream of genes encoding diaminohydroxyphosphoribosylaminopyrimidine deaminase and phosphomethylpyrimidine kinase, respectively.

Of the 16 novel purine and lysine riboswitch predictions, 13 are from gram-positive bacteria, supporting earlier conclusions. Of the 9 novel purine hits, 4 are to Lactobacillus johnsonii and Lactobacillus plantarum, which have no previously identified purine riboswitches. Likewise, 4 lysine predictions and 1 riboflavin prediction are from genomes with no previous riboswitches from that family. While none of these predictions appears in Rfam, it has been brought to our attention that some of these predictions overlap with the predictions in [86]. As these were made using completely different techniques, they provide additional validation of our approach.

Free energy minimization approaches to secondary structure prediction are not well suited to riboswitches, because the repressing structure is contingent upon ligand binding. FastR offers an advantage for such RNA motifs in that the biologically significant structure can be inferred from the alignment. The secondary structures derived from the top predictions in each riboswitch family in Table 2.5 can be seen in Figure 2.7.

Table 2.5 Description of the 18 most promising candidates from the 468 putative riboswitches discovered by FastR. Each predicted riboswitch is either upstream from a biologically relevant gene, or contains strong sequence similarity in regions thought to mediate ligand binding. (a) Location: Genome coordinates of the riboswitches. '+' and '−' refer to the strand. (b) D: Distance between the start of the riboswitch and the 5' end of an exon. (*) Genomes with no previously identified riboswitches from that family.

Family       Genome                        Location (a)       p-value   D (b)
Purine       Bacillus anthracis            794079-178(+)      0.016     264
Purine       Lactobacillus johnsonii*      1949385-485(+)     0.018     181
Purine       Lactobacillus plantarum*      2410480-573(+)     0.019     156
Purine       Lactobacillus plantarum*      339446-540(−)      0.020     175
Purine       Lactobacillus johnsonii*      1729531-628(−)     0.021     168
Purine       Bacillus anthracis            4574871-970(+)     0.021     319
Purine       Clostridium perfringens       512985-3085(+)     0.024     417
Purine       Bdellovibrio bacteriovorus    1778017-113(−)     0.026     93
Purine       Bacillus cereus               2435592-693(−)     0.026     50
Lysine       Bacillus subtilis             794406-582(−)      0.010     282
Lysine       Bacillus halodurans           1619231-417(+)     0.011     296
Lysine       Fusobacterium nucleatum*      1813295-475(−)     0.015     277
Lysine       Onion yellows phytoplasma*    191443-627(−)      0.022     200
Lysine       Lactococcus lactis*           2276234-412(+)     0.026     284
Lysine       Lactococcus lactis*           699673-852(−)      0.026     273
Lysine       Shewanella oneidensis         1689148-335(−)     0.027     276
Riboflavin   Thermus thermophilus*         1210336-467(−)     0.016     123
Thiamin      Streptococcus pneumoniae      1469400-498(−)     0.029     181

Gene annotations, grouped by family (the order within the purine and lysine groups is uncertain):
Purine: GMP synthase; Hypothetical protein (xanthine permease family); Xanthine / uracil transport protein; Adenine deaminase; Xanthine phosphoribosyltransferase; Conserved hypothetical protein; Purine nucleoside phosphorylase; Hypothetical protein; Adenine deaminase
Lysine: Diaminopimelate decarboxylase; Hypothetical protein; Lysine specific permease; Dihydrodipicolinate synthase; Hypothetical Na+/H+ antiporter; ABC transporter (amino acid permease); ABC-type amino acid transport system
Riboflavin: Diaminohydroxyphosphoribosylaminopyrimidine deaminase family
Thiamin: Phosphomethylpyrimidine kinase

2.4 Summary

Our test results show that FastR is an effective tool for finding novel homologs of query ncRNA sequences. In general, the development of fast filtering and searching tools for ncRNA is a natural area of research, analogous to the development of sequence similarity tools like BLAST and Fasta. However, as the discussion above shows, the underlying structure and diversity of ncRNA makes this problem quite different in character. Consequently, the filters must be more complex than the (approximate) keyword matches used for sequence similarity.

The ideas presented here open many lines of research, which we are actively pursuing. The first concerns sensitivity. For diverged families, the filters miss a few true homologs. Our analysis showed that in many cases, this was due to a stem loop not being recognized as a (k, w)-stack. This can be due to too few base-pairs, bulges, or non-canonical base-pairing. However, the stem must still have a low energy that allows it to maintain that conformation. Therefore, we plan to generalize the definition of a (k, w)-stack to allow all pairs that form energetically favorable structures. While non-canonical base-pairs are easy to handle, bulges and interior loops are computationally more challenging. It will be interesting to see how these changes affect sensitivity. Some homologs are filtered out because they do not satisfy distance constraints. Relaxing the distance constraints could decrease specificity. One approach to increasing sensitivity without compromising specificity is to relax the distance constraints, but employ multiple nested and multiloop filters. Another direction is the design of optimal multiloop and nested filters.

Figure 2.7 Representative riboswitch secondary structures derived from the alignments of the top novel hits for each query. (a) The secondary structure prediction for the top purine hit. (b) The secondary structure prediction for the top lysine hit. (c) The secondary structure prediction for the top thiamin hit. (d) The secondary structure prediction for the top riboflavin hit.

Currently, the filters are constructed by changing parameters empirically. We are working to automate the design of optimal (multi-loop and nested) filters for an ncRNA family. Preliminary results on the purine riboswitch show a 10-fold speedup with no loss of sensitivity, and we are testing the methodology on other families. These filters are constructed when the secondary structure is known for all members of the ncRNA family. In general, the secondary structure might not be known, and it would be interesting to automate the filter design after inferring common structural elements. This is a simpler and more tractable version of the well-studied problem of RNA multiple alignment because it only requires an alignment of the stacks involved in the filters.

In many cases, the FastR results themselves need to be filtered to remove obvious false positives. However, the advantage of using the tool is that good candidates can be found with relatively little effort. If the "Modern RNA world" hypothesis is true, many ncRNA sequences will be discovered in the coming years. Our tool can be used to rapidly identify novel homologs of these ncRNA. Finally, many RNA motifs, including riboswitches, fold into the correct structure only in combination with other molecules. Programs that predict structure based on de novo energy minimization are challenged in their ability to find the correct structure for these molecules. In contrast, comparative tools such as ours can be used to infer structure relatively easily.

This chapter, in part, is a reprint of the paper "Searching Genomes for non-coding RNA using FastR" co-authored with Brian Haas, Eleazar Eskin and Vineet Bafna in IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 2, Issue 4, pp. 366–379, 2005. The dissertation author was the primary investigator and author of this paper.

3 PFastR: Profile-based Fast RNA search using sequence-based filters

3.1 Introduction

In the previous chapter, we built filters based on secondary structure information. What if we have a good alignment for an RNA family? In this case, we can also use the sequence conservation within the RNA family to build a filter. In addition, profile-based alignment is better at detecting homologs. In this chapter, we make several contributions toward this goal. First, we formalize the concept of a filter and provide figures of merit that allow comparing between filters. Second, we design efficient sequence based filters that dominate the current state-of-the-art HMM filters. Third, we provide a new formulation of the covariance model that allows speeding up RNA alignment. We demonstrate the power of our approach on both synthetic data and real bacterial genomes. We then apply our algorithm to the detection of novel riboswitch elements from whole bacterial and archaeal genomes and environmental sequence data. Our results point to a number of novel riboswitch candidates, and include genomes that were not previously known to contain riboswitches.

A database filter is a computational procedure that takes a database as input, and outputs a subset of the database. The goal is to ensure that the object being searched for remains in the database after filtering, the filtered database is significantly smaller, and the filtering operation is very fast. Filters have played a central role in bioinformatics. BLAST is the prototypical example, with a keyword match filter greatly improving the search for remote homologs. Indeed, improving the filters for sequence similarity search remains an intensively researched area, with many recent publications. Filtering is also being applied in other bioinformatics domains, including structural genomics [58], proteomics (mass-spectrometry) [30,100], and non-coding RNA (ncRNA) [114,115]. We have discussed how to use stack-configurations (secondary structure information) as filters in the previous chapter. Here, we revisit the notion of filtering, focusing on applications to detecting ncRNAs using sequence-based filters.

ncRNAs are genomic sequences that are transcribed, but not translated, and function as RNA molecules. Recent discoveries of many novel families and sub-families of ncRNA have underscored their importance, and hint at an RNA world, where coding and non-coding genes play equally important roles [27, 96, 108]. However, computational tools for detecting novel ncRNA are not yet mature, and the signal for ncRNA is considerably weaker than that for protein coding genes. Therefore, a comparative approach to discovering novel homologs of a query ncRNA is also increasing in importance, much like BLAST is often used to identify novel homologs of coding genes. While viable, this approach poses a technical challenge since the known algorithms for aligning ncRNA are at least an order of magnitude slower than sequence alignment [50] (see Section 2.2.5), and even slower when other secondary structures (such as pseudoknots) are allowed [22]. Indeed, using a search based on a covariance model (CM) [25], it would take 54 hours to query two bacterial genomes, E. coli K12 and Staphylococcus aureus MW2 (7.5 Mb), for a sub-family such as the FMN riboswitch (145 bp).

This makes the filtering problem both easier and harder. On the one hand, the alignment is so expensive (cubic time) that even a computationally intensive filter (quadratic time) could be useful. At the same time, since the alignment is so expensive, the filtering itself must be very efficient in removing a large portion of the database while retaining the true hits. For example, a filter that removes 50% of the database is still not sufficient to make CM searches tractable for large genomic sequences.

Algorithms that align ncRNA are expensive because they score for both sequence and structure conservation, and the latter task is computationally intensive. Filtering for RNA was systematically explored by Weinberg and Ruzzo [114, 115], who used a pigeonhole argument to show that it is enough to scan for sequence similarity, expressed by a hidden Markov model, leaving the more expensive structural alignment for the filtered sequence. Henceforth, we refer to their filter as the HMM-filter. Subsequently, they and we, independently, used partial structure conservation for the filtering [114,120]. Even after applying these filters, the problem remains computationally expensive, and it is worthwhile to ask if one can do better.

Here, we make several contributions in this regard. First, we formalize the concept of a filter and provide figures of merit that allow comparison between filters. Second, we design novel filters and show that they dominate the HMM filters of Weinberg and Ruzzo [115] (we defer a formal definition of the notion of dominance to Section 3.2). In practice, this leads to a 1-2 orders of magnitude decrease in search time. However, our main point is not that we can build better filters, but that it is relatively easy to do so.
Indeed, the filters we design are very simple conceptually, indicating perhaps that we have only scratched the surface on this problem. The main contribution of this chapter is a principled approach to combining filters that have different performance characteristics to achieve dominance (Section 3.3).

We also revisit the issue of alignment by aligning an RNA-profile to a filtered substring. We emphasize that there is a strong (practically, 1-1) correspondence with CMs in both the alignment algorithm, and the observed results. Indeed, the advantage of the CMs is that their parameters can be trained using the same formalization. However, our reformulation helps us take advantage of simple tricks, such as banding, which help speed up the alignment without appreciable loss in accuracy (Section 3.4). Similar extensions would require a departure from the formalism of stochastic context free grammars that support CMs. This also has an impact on filtering. Unlike previous approaches, we do not tie the accuracy of our filtering procedure to the accuracy of an existing alignment procedure. Thus, it is relatively easy to use our filtering procedure in conjunction with other alignment algorithms. For example, in recent work, we used the filtering to search genomes for pseudoknotted RNA [22].

Within ncRNA, we focus our attention on riboswitches. Riboswitches are ncRNA elements that often occur in the 5' untranslated regions (UTRs) of genes [65,68,88,98,107,108]. The riboswitches have a mode of action that one normally associates with proteins: they directly sense the levels of specific metabolites with a structurally conserved aptamer domain to regulate expression of downstream genes. Riboswitches respond to a wide range of metabolites including coenzymes, purines, amino acids and some others. Most riboswitches are predicted to be within UTRs of mRNAs that encode biosynthetic enzymes or metabolite and metal transporters.
Novel members are continuously being discovered. Rfam [36], version 7.0, has members from 12 sub-families of riboswitches. Due to their widespread and exclusive occurrence in bacteria, they are attractive anti-microbial targets. Our results point to a number of novel candidates for each of these sub-families, and include genomes that were not previously known to contain riboswitches.

The rest of the chapter is organized as follows. We first formalize the concept of a filter in Section 3.2. Section 3.3 describes our new idea for sequence-based filtering. Section 3.4 describes a new profile-based RNA alignment procedure. The testing and experimental results are presented in Section 3.5. We also implemented a web server for PFastR, which is described in Section 3.6. Section 3.7 concludes this chapter.

3.2 Formalizing ncRNA Filters

Covariance Models (CMs) are probabilistic context-free grammar models that describe both structure and sequence information of an RNA family [25, 28]. The score of an RNA sequence t against a CM model M is roughly the sum of two components: its sequence similarity to the modeled family, measured using a position specific scoring matrix (PSSM) of nucleotides, and its structural similarity, measured against the distribution of nucleotide pairs in aligned positions. Formally,

\[ S(M, t) = \mathrm{SeqScore}(M, t) + \mathrm{StructScore}(M, t) \]

where SeqScore is the score of the PSSM part of M against t. For ungapped alignments, this would simply be the sum over all columns

\[ \mathrm{SeqScore}(M, t) = \sum_j \mathrm{SeqScore}(M_j, t_j) \]

If gaps are allowed, we must compute an alignment that optimizes $S(M, t)$. The SeqScore computation is an order of magnitude faster than an optimum StructScore computation. Weinberg and Ruzzo [115] use this as the basis of their sequence based HMM filter¹. For a given threshold T for M, they compute a threshold $T_{ps}$ as

\[ T_{ps} = \min\{\mathrm{SeqScore}(M, t) : S(M, t) \ge T\} \]

This choice of $T_{ps}$ ensures that each "true homolog" ($S(M, t) \ge T$) will pass the filter. Moreover, much of the database will be rejected by this filter, and will not undergo the more expensive CM alignment.

¹ They use HMMs (not PSSMs) to describe the filter, but that technical difference does not change the argument.

In order to improve upon this filter, we start by formalizing the definitions of a filter and its quality. A filter F takes a sequence as input and outputs sub-sequences. We assume the operating parameters (such as a threshold) to be part of the filter definition. To make the notion of performance independent of the database, we measure it on a suitably defined random database sequence D, with a set of true sequences A embedded in D. The performance of the filter is measured with the following:

1. Running Time: The running time $T_F(|D|, n)$ is a function of the query length n and the database length |D|.

2. Efficiency: Let $O_F(D)$ be the output of filter F. Define efficiency as $e_F = |O_F(D)| / |D|$. The lower the better ($0 \le e_F \le 1$).

3. Accuracy: Let $A_F$ denote the subset of true sequences that are accepted by the filter. Then accuracy is defined as $|A_F| / |A|$. The higher the better ($0 \le |A_F| / |A| \le 1$).

Filter F1 dominates F2 if it is faster, more accurate, and more efficient than F2. Often, filters perform well in one or two but not all of these aspects. In many cases, they can be combined for further improvement. The two obvious ways to combine filters are:

• Union $F_1 + F_2$: in which $O_{F_1 + F_2}(D) = O_{F_1}(D) \cup O_{F_2}(D)$. Union helps if both $F_1$ and $F_2$ are fast and efficient, but not accurate.

• Composition $F_1 \cdot F_2$: $O_{F_1 \cdot F_2}(D) = O_{F_2}(O_{F_1}(D))$. Composition helps when the two filters are accurate but not very efficient, and $F_1$ is faster than $F_2$. Note that composition is always better than intersection, as its running time $T_{F_1}(D, n) + T_{F_2}(O_{F_1}(D), n)$ is better than that of the intersection, with identical accuracy.
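The two combinators and the figures of merit can be illustrated with a toy model. The sketch below makes simplifying assumptions that are not in the text: the database is a set of window identifiers, a filter is any function from a set of windows to a subset of it, and the two example filters are arbitrary arithmetic predicates chosen only so that the numbers are easy to check.

```python
def union(f1, f2):
    """F1 + F2: accept whatever either filter accepts."""
    return lambda items: f1(items) | f2(items)


def compose(f1, f2):
    """F1 . F2: run F2 only on the output of F1."""
    return lambda items: f2(f1(items))


def efficiency(f, database):
    """Fraction of the database that survives the filter (lower is better)."""
    return len(f(database)) / len(database)


def accuracy(f, database, true_items):
    """Fraction of the true items that survive the filter (higher is better)."""
    return len(f(database) & true_items) / len(true_items)


# Toy database of 100 windows with three embedded true hits.
database = set(range(100))
true_items = {3, 9, 27}
keep_odd = lambda items: {x for x in items if x % 2 == 1}
keep_mult3 = lambda items: {x for x in items if x % 3 == 0}

both = compose(keep_odd, keep_mult3)
print(efficiency(keep_odd, database), efficiency(both, database))  # 0.5 0.17
print(accuracy(both, database, true_items))  # 1.0
```

In this toy case both filters are fully accurate but individually inefficient, which is the situation where composition helps: the composite keeps every true hit while passing a smaller fraction of the database than either filter alone.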

We will use both of these operations in designing better filters. The following result shows that it is not essential to be able to compute efficiency directly in order to prove dominance.

Theorem 1 Filter F can be dominated if there exists a filter $F_1$ with $A_F \subseteq A_{F_1}$ and

\[ \frac{T_{F_1}(D, n)}{T_F(D, n)} \le 1 - e_{F_1}. \]

Proof: We simply use the composition $F_1 \cdot F$ as the filter. Clearly, it dominates in accuracy and efficiency. For running time, we note that

\[
\begin{aligned}
T_{F_1}(D, n) + T_F(O_{F_1}(D), n) &\le T_{F_1}(D, n) + e_{F_1} T_F(D, n) \\
&\le (1 - e_{F_1}) T_F(D, n) + e_{F_1} T_F(D, n) \\
&\le T_F(D, n).
\end{aligned}
\]

□

Theorem 1 is useful because instead of trying to compute efficiency exactly, we can look for a constant θ such that $T_{F_1}(D, n) / T_F(D, n) \le \theta$ and $e_{F_1} \le 1 - \theta$. As an application of the theorem, we can think of the CM itself as a filter F. F is very accurate (gets all the true hits) and efficient (random sequences do not score high), but slow ($T_F(D, n) = \Omega(|D| n^2)$) [50, 120]. On the other hand, the HMM filter $F_1$ is accurate ($A_F = A_{F_1}$), and an order of magnitude faster ($T_{F_1}(D, n) = O(|D| n)$), but not as efficient. Can the composite filter dominate? Note that $T_{F_1}(D, n) / T_F(D, n) \le 1/n$. From Theorem 1, the composite filter $F_1 \cdot F$ dominates F if $e_{F_1} \le \frac{n-1}{n}$. As this condition is relatively easy to achieve, Weinberg and Ruzzo show improvements for most families [115]. In the following, we will describe sequence based filters that run in time c|D|, where c is a small constant. By the previous argument, we only need to show marginal efficiency $\frac{n-c}{n}$ to dominate. Thus, the filters we design will dominate the HMM filters of [115].
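The running-time condition of Theorem 1 is easy to check numerically. The sketch below is illustrative only: the function names are hypothetical, times are expressed in arbitrary units per database base, and, as in the proof, the time of F is assumed to scale linearly with the amount of sequence it is given.

```python
def composite_time(t_f1, t_f, e_f1):
    """Running time of F1 . F: F1 on everything, F on the fraction e_f1."""
    return t_f1 + e_f1 * t_f


def satisfies_theorem1(t_f1, t_f, e_f1):
    """Running-time condition of Theorem 1: T_F1 / T_F <= 1 - e_F1."""
    return t_f1 / t_f <= 1 - e_f1


# Query of length n: CM-like cost n^2 per base, HMM-like cost n per base.
n = 145
t_cm, t_hmm = n ** 2, n

print(satisfies_theorem1(t_hmm, t_cm, 0.5))    # True
print(composite_time(t_hmm, t_cm, 0.5))        # 10657.5
print(satisfies_theorem1(t_hmm, t_cm, 0.999))  # False
```

With these units the CM alone costs 21025 per base, so a pre-filter that passes half the database roughly halves the total cost, whereas one that passes 99.9% of it fails the condition and makes the search slightly slower.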

3.3 Sequence-based Filters

Let $F_P$ denote a sequence based filter, which computes a gapped SeqScore and uses a threshold T, chosen so that the accuracy of $F_P$ is identical to the CM. We will define a sequence based filter $F_S$ that matches the accuracy of $F_P$, but is faster. The idea is based on an application of the pigeonhole principle, and the fact that text search using a dictionary of words is fast. For a sequence to score T against a profile of length L, each column must score T/L on the average. In fact, every sequence that scores T against the profile contains an l-mer w that scores $Tl/L$ or better against the profile. $F_S$ proceeds by computing all subsequences that match at least one keyword in the keyword set K. We use the following procedure:

1. Generate a set of keywords K, each of length l (for a fixed parameter l), by selecting all words that score at least $Tl/L$ in an ungapped region of the profile. Label each such keyword w so that label(w) is the profile position where it occurs.

2. Search D for exact matches to keywords from K.

3. For each position i that matches a keyword with label p, identify D[i − p ... i − p + L] as a candidate sequence.

4. Merge significantly overlapping candidate sequences.
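The four steps above can be sketched as follows. This is a minimal illustration under several assumptions that are not in the text: the profile is a list of per-column score dictionaries over the alphabet ACGU, keywords are enumerated by brute force over all $4^l$ words, matching uses a hash table rather than an Aho-Corasick trie, candidate windows are half-open intervals clipped to the database, and any two overlapping candidates are merged (the text does not define "significantly overlapping"). All function names are hypothetical.

```python
from itertools import product


def build_keywords(profile, l, threshold):
    """Step 1: map each l-mer scoring >= threshold * l / L at some
    ungapped profile offset to the list of offsets (labels)."""
    L = len(profile)
    cutoff = threshold * l / L
    keywords = {}
    for p in range(L - l + 1):
        for word in product("ACGU", repeat=l):
            score = sum(profile[p + j][base] for j, base in enumerate(word))
            if score >= cutoff:
                keywords.setdefault("".join(word), []).append(p)
    return keywords


def keyword_filter(database, profile, l, threshold):
    """Steps 2-4: exact keyword matches -> candidate windows -> merge."""
    L = len(profile)
    keywords = build_keywords(profile, l, threshold)
    candidates = []
    for i in range(len(database) - l + 1):
        for p in keywords.get(database[i:i + l], ()):
            # Window of length L aligned so the keyword sits at offset p.
            candidates.append((max(0, i - p), min(len(database), i - p + L)))
    candidates.sort()
    merged = []
    for start, end in candidates:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy profile: consensus GGAUCC, +2 for a match and -1 for a mismatch.
profile = [{b: (2 if b == c else -1) for b in "ACGU"} for c in "GGAUCC"]
print(keyword_filter("AAAAGGAUCCAAAA", profile, l=3, threshold=12))  # [(4, 10)]
```

In the toy run the threshold equals the best possible profile score, so the only keywords are the four exact 3-mers of the consensus; all four matches point at the same window, which is reported once after merging.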

By the pigeonhole principle, the accuracy of $F_S$ is high ($A_{F_P} \subseteq A_{F_S}$). The filtering can be done in O(|D|) time through the use of Aho-Corasick tries, or hashing, so the filter time is an order of magnitude faster. It remains to evaluate the efficiency of this filter. For any position i to be selected, one of the keywords in K must match at a specific position (given by its label) relative to i. Therefore, assuming a uniform distribution of words along the sequence, the efficiency of this filter is given by $|K| / 4^l$. By Theorem 1, we only require $|K| / 4^l < \frac{n-1}{n}$ for dominance, and can often find single keyword filters that suffice. In the following we improve upon this simple filter by considering multiple keywords.

3.3.1 Multiple Keyword (Chain) Filtering

We define an (l, m, δ, K)-chain filter as follows: sequence D[i, ..., i + L] is accepted by an (l, m, δ, K)-chain filter if m words w_1, w_2, ..., w_m ∈ K, each of length l, match at positions i + i_1, i + i_2, ..., i + i_m, s.t. for all j, i_j ≥ i_{j−1} + l (i.e., words are ordered and non-overlapping) and |i_j − label(w_j)| ≤ δ. For ungapped alignments, δ = 0, but otherwise, δ must be chosen carefully to maximize accuracy. We have the following result:

Theorem 2 Consider an (l, m, δ, K)-chain filter. If s_K is the maximum number of keywords with an identical label in K, then the efficiency on a uniform random database is given by

$$e_F(l, m, \delta, s_K) = \binom{L - m(l-1)}{m} \left( \frac{2\delta s_K}{4^l} \right)^m. \tag{3.1}$$

Proof: Consider a random position i in the database D. By definition,

$$e_F(l, m, \delta, s_K) = \Pr\big[\, D[i, \ldots, i + L] \text{ is accepted} \,\big]$$

Define a configuration w.r.t. a position i as an m-tuple C(i) = (i_1, i_2, ..., i_m), such that i ≤ i_1 ≤ i_2 ≤ ... ≤ i_m ≤ i + L and i_j ≥ i_{j−1} + l for all j. Then i is accepted by the filter if there exists a configuration C(i) such that for all i_j ∈ C(i), D[i_j, ..., i_j + l − 1] = w_j for some w_j ∈ K with |label(w_j) − i_j| ≤ δ. Thus, the probability for i_j to match up by chance is 2δs_K/4^l. It follows that the efficiency of the (l, m, δ, K)-chain filter is C_m (2δs_K/4^l)^m, where C_m is the number of possible configurations. To compute this number, consider a binary string b with exactly m ones and L − lm zeros. For 1 ≤ j ≤ m, let b_j be the position of the j-th '1' from the left. Define i_j = b_j + (j − 1)l. Then each binary string corresponds to a unique m-tuple (i_1, i_2, ..., i_m), and i_{j+1} − i_j = b_{j+1} − b_j + l ≥ l for all j < m. The number of configurations is equal to the number of distinct binary strings, given by

$$C_m = \binom{L - m(l-1)}{m}. \qquad \blacksquare$$

Figure 3.1 shows (as expected) that the efficiency of a chain filter F_C decreases exponentially with increasing m. The slightly faster than exponential decay is due to the fact that L − ml also decreases with increasing m. Likewise, higher values of s_K decrease the rate of decay. However, for multiple keywords, selecting the set K of keywords becomes a challenging problem. The pigeonhole principle guarantees the existence of m words that score at least mTl/L, but does not bound the minimum score on any single word. If we were to choose K to be the set of all keywords, s_K could be prohibitively large. On the other hand, any choice of a lower bound will reduce accuracy (A_{F_P} ⊈ A_{F_C}). In practice, there are many reasonable choices that ensure that the accuracy remains 1 and high efficiency is maintained. Indeed, the features we deploy use empirically chosen cut-offs for keyword scores. However, there is a principled way to get around this obstacle by using an appropriately chosen union of filters.

[Figure 3.1: plot of log(e_F) (vertical axis) against m = 1, ..., 10 (horizontal axis), with one curve for each of s_K = 5, 10, 15, 20.]

Figure 3.1 A plot of log(e_F) versus m, when L = 150, l = 8 and δ = 20. Different lines correspond to different values of s_K.

3.3.2 Accuracy of Chain Filters

To control the accuracy of chain filters we extend their definition to allow a score threshold S, such that a sequence is accepted by the filter if, in addition to satisfying the above conditions, the total score of the matched keywords exceeds S. Let θ = Tl/L. We are interested in computing a chain of words that score mθ. We illustrate the approach using a parameter θ' = θ/2. Any subsequence that is accepted by the chain filter must have some k (1 ≤ k ≤ m) words w_1, ..., w_k that each score at least θ'. Let w_{k+1}, ..., w_m denote the remaining words in the chain filter, each of which scores less than θ'. We have

$$m\theta \le \sum_{j=1}^{k} \mathrm{score}(w_j) + \sum_{j=k+1}^{m} \mathrm{score}(w_j) \le \sum_{j=1}^{k} \mathrm{score}(w_j) + (m-k)\theta'$$

Thus $\sum_{j=1}^{k} \mathrm{score}(w_j) \ge m\theta - (m-k)\theta' = (m+k)\theta/2$.

For all 1 ≤ k ≤ m, define an extended chain filter F_k of k words in which each word scores at least θ/2, and the chain must score at least (m + k)θ/2. Observe that F_1 + ... + F_m accepts every chain that scores above mθ, implying that A_{F_C} ⊆ A_{F_1+...+F_m}. In the next section, we show that chain filters can be computed efficiently, in time that is often o(|D|n). The search time of the union filter grows linearly with m, and so an efficiency/speed trade-off must be considered in selecting an appropriate m. Once again, Theorem 1 can be used to ensure dominance, but we must do it in an empirical setting since the running time depends upon the score distribution of keywords in K, which in turn, depends upon the alignment. Our results in Section 3.5 show that dominating filters are easy to find.
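The union-of-filters argument can be checked mechanically. The sketch below is an illustration under stated assumptions, not the dissertation's code: a chain is represented only by its list of word scores, `accepting_filter` is a hypothetical helper, and k is taken to be the number of words scoring at least θ/2, which is the witness used in the derivation above. The exhaustive loop confirms on a small grid of scores that every chain with total score at least mθ is accepted by some F_k.

```python
from itertools import product

def accepting_filter(word_scores, m, theta):
    """Return k if the extended filter F_k accepts the chain, otherwise None.

    F_k requires k words that each score >= theta/2 and together score >= (m + k) * theta/2.
    """
    half = theta / 2
    strong = [s for s in word_scores if s >= half]
    k = len(strong)
    if 1 <= k <= m and sum(strong) >= (m + k) * half:
        return k
    return None

# Exhaustive check on a grid of exactly representable scores (illustrative values).
m, theta = 3, 2.0
grid = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
missed = [chain for chain in product(grid, repeat=m)
          if sum(chain) >= m * theta and accepting_filter(chain, m, theta) is None]
```

An empty `missed` list is what the inequality predicts: dropping the weak words costs less than (m − k)θ/2, so the strong words alone clear the threshold of F_k.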

3.3.3 Implementing Chain Filters

We wish to filter substrings that match an extended (l, m, δ, K, S)-chain filter (where S is the score threshold). Our goal is to improve upon the profile search time of O(L|D|). As chain filters are based on matches with l-mers, we can improve the speed by using string matching techniques. The algorithm is as follows:

1. Build an Aho-Corasick trie T_K with K (alternatively, if l is small, construct a hash table for occurrences of l-mers in D).

2. Initialize a set of active intervals I = ∅.

3. Scan D with T_K. For each hit of word w ∈ K at position i, add the interval π = [i − label(w) − δ, i − label(w) + δ] to I. The score of the interval sc[π] is set to the score of w against the profile. Also, set the position as pos[π] = i.

4. For each position j ∈ D, let I_j = {π | j ∈ π} be the subset of intervals that overlap with j. For most choices of parameters, |I_j| ≪ L. Select position j if there exist m intervals that are disjoint and have net score exceeding the threshold S. This is done as follows:

(a) Sort the intervals in I_j according to pos[π]. For each π ∈ I_j, let p1(π) be the largest interval with pos[π] − pos[p1(π)] > l, and p2(π) be the predecessor of π.

(b) For all intervals π ∈ I_j, score[j, π] = max{sc[π] + score[j, p1(π)], score[j, p2(π)]}. Output j if score[j, π] exceeds the threshold.

The entire computation takes time Σ_j |I_j| = o(L|D|). Also, the computation is done only if the depth of coverage at position j exceeds a threshold. The depth of coverage can be computed in linear time.

This discussion hides an important problem. Insertions and deletions make the profile length significantly longer than any sequence. For example, the average length of cobalamin riboswitches is 200, while the profile length is closer to 600. A simple way around this is to discard columns with many gap characters, but that entails deciding which columns are dominated by gaps. Instead, we revise the definition of the label of a position. Recall that the label of a keyword is its position in the profile, and should match its position in the query sequence. Instead, define the label as the expected position in the query sequence. Let p_i denote the probability that the i-th position of the profile is not a gap (in other words, p_i = P[i, A] + P[i, C] + P[i, G] + P[i, T]). Then define

$$\mathrm{label}_i = \begin{cases} p_i & \text{if } i = 1; \\ \mathrm{label}_{i-1} + p_i & \text{otherwise.} \end{cases}$$

Each keyword that appears at position i in the profile is assigned label_i as its label.
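Two pieces of this section lend themselves to a short Python sketch: the expected-position labels and the recurrence of step 4(b). Both functions below are hypothetical illustrations. `best_chain_score` takes the hits whose intervals cover one database position as (pos, score) pairs and applies the recurrence as written, so it maximizes the total score over chains of any length and does not by itself enforce exactly m words. As a further assumption it treats two hits as compatible when their positions are at least l apart, following the chain filter definition, whereas step 4(a) states a strict inequality.

```python
from bisect import bisect_right
from itertools import accumulate

def expected_labels(non_gap_prob):
    """label_i = p_1 + ... + p_i: expected sequence position of profile column i."""
    return list(accumulate(non_gap_prob))

def best_chain_score(hits, l):
    """Best total score of a chain of keyword hits whose positions are >= l apart.

    hits: iterable of (pos, score) pairs for the intervals covering one position j.
    """
    hits = sorted(hits)
    positions = [pos for pos, _ in hits]
    best = [0.0] * (len(hits) + 1)            # best[t]: optimum over the first t hits
    for t, (pos, sc) in enumerate(hits, start=1):
        # number of earlier hits that end before this word starts (role of p1)
        prev = bisect_right(positions, pos - l, 0, t - 1)
        best[t] = max(sc + best[prev], best[t - 1])   # take this hit, or skip it (p2)
    return best[-1]

labels = expected_labels([1.0, 0.5, 0.5, 1.0])     # two half-gapped columns
score = best_chain_score([(0, 2.0), (3, 1.0), (8, 4.0), (10, 3.0)], l=8)
```

With the illustrative hits above, the best chain pairs the hits at positions 0 and 8; position j would be reported if this score exceeds the threshold S.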

3.4 RNA-Profile Scoring and Alignment

In this section we describe our algorithm for scoring a sequence against a structural alignment of an RNA family, where we score for conservation of both sequence and structure. The algorithm is very similar to the Covariance Model [25, 28]. However, we provide our own implementation to allow for faster banded scoring. Also, our filter design can be more effectively tied to the scoring. Formally, we treat the RNA-profile alignment as a filter, and compose it with the chain filter. Finally, our algorithm can be extended to include more complex RNA models, such as pseudoknots, which will be explored in future work.

The structural alignment of an RNA family is a (gapped) multiple alignment R of its sequences with structure described by a set M of pairs of positions (i, j), such that for a majority of sequences in the family, the nucleotides aligning to these positions form base-pairs. The alignment of the RNA family against a target sequence t is described by a 2 × m matrix A, in which row 1 contains column positions of the profile interspersed with spaces (insertion of aligned sequence), and row 2 contains the sequence, also interspersed with spaces (deletion of profile columns). For all columns j, A[1, j] ≠ '−' or A[2, j] ≠ '−'. For r ∈ {1, 2}, define

$$\rho_r[j] = j - \big|\{\, l < j \ \text{s.t.}\ A[r, l] = \text{'−'} \,\}\big|.$$

In other words, if A[1, j] ≠ '−', it contains the position ρ_1[j] of R. The score of alignment A is given by

$$\sum_{j} \gamma(A[1, j], A[2, j]) \;+ \sum_{i, j \ \text{s.t.}\ (\rho_1[i], \rho_1[j]) \in M} \delta(\rho_1[i], \rho_1[j], \rho_2[i], \rho_2[j])$$

The function γ scores for sequence similarity, and δ scores for structure conservation. Our goal is to find an alignment that maximizes this score. While this formulation encodes a linear gap penalty, we note here that alignments of RNA molecules may contain large gaps, particularly in the loop regions, and we implement affine penalties for gaps (details omitted).

3.4.1 Choosing the Scoring Functions

Consider an alignment of n RNA sequences from a family. Let n_i(a) be the number of sequences with a ∈ {A, C, G, U, '−'} in the i-th column of the multiple alignment. The probability of observing a in the i-th position can be estimated by

$$P_i(a) = \frac{C_a + n_i(a)}{\sum_{a'} C_{a'} + n}$$

where C_a are pseudo-counts, chosen so that $p_a = C_a / \sum_{a'} C_{a'}$, where p_a is the probability of occurrence of a in the family. These probabilities are used to construct a position specific scoring matrix. Then for all positions i, and all symbols a ∈ {A, C, G, U, '−'},

$$\gamma(i, a) = \sum_{a' \in \{A, C, G, U, -\}} S(a', a) \times P_i(a') \tag{3.2}$$

where S(a', a) is the score of substituting a' with a. We use a nucleotide substitution scoring matrix [50]. We model insertions and deletions with the gap penalties γ('−', a) and γ(i, '−'), respectively.

Likewise, to score for structure conservation we look at the probabilities of specific base-pairs that occur in each pair of positions. For each (i, j) ∈ M, let n_{i,j}(a, b) describe the number of sequences in the alignment that contain a in position i, and b in position j. As before,

$$P_{i,j}(a, b) = \frac{C_{a,b} + n_{i,j}(a, b)}{\sum_{a', b' \in \{A, C, G, U, -\}} C_{a',b'} + n}$$

and the score for conserved structure is given by

$$\delta(i, j, a, b) = \sum_{a', b' \in \{A, C, G, U\}} P_{i,j}(a', b') \times S_p(a', b', a, b), \quad \forall (i, j) \in M,\ a, b \in \{A, C, G, U\} \tag{3.3}$$

where S_p is the scoring matrix for substituting (a', b') with (a, b), and rewards both sequence and structure conservation. Note that δ is only defined when (i, j) ∈ M, and a, b ∈ {A, C, G, U}. In other cases, the structure is obviously not conserved, and the appropriate score is given by γ.
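The column probabilities and the sequence score γ of Equation (3.2) can be written out in a few lines of Python. This is a minimal sketch with invented inputs: the uniform pseudo-counts and the +1/−1 substitution matrix are placeholders rather than the matrix of [50], and the function names are hypothetical.

```python
from collections import Counter

SYMBOLS = "ACGU-"

def column_probabilities(column, pseudo):
    """P_i(a) = (C_a + n_i(a)) / (sum of pseudo-counts + n) for one alignment column."""
    counts = Counter(column)                  # missing symbols count as 0
    total = sum(pseudo[a] for a in SYMBOLS) + len(column)
    return {a: (pseudo[a] + counts[a]) / total for a in SYMBOLS}

def gamma(probs, S, a):
    """Equation (3.2): expected substitution score of symbol a against one column."""
    return sum(S[(a_prime, a)] * probs[a_prime] for a_prime in SYMBOLS)

# Hypothetical inputs: a column of four sequences, uniform pseudo-counts summing to 1,
# and a toy substitution matrix (+1 for identity, -1 otherwise).
pseudo = {a: 0.2 for a in SYMBOLS}
S = {(x, y): (1.0 if x == y else -1.0) for x in SYMBOLS for y in SYMBOLS}
probs = column_probabilities("AAAC", pseudo)
score_A = gamma(probs, S, "A")
```

The structure score δ of Equation (3.3) follows the same pattern with pairs of symbols in place of single symbols.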

3.4.2 The Alignment Procedure

We make the assumption that the base-pairs are non-crossing. For each base-pair (i, j) ∈ M, there is a unique (parent) base-pair (i', j') such that i' < i < j < j', and there is no base-pair (i'', j'') such that i' < i'' < i and j < j'' < j'. Thus the alignment can be done by recursing on the nodes of the tree. However, the tree can have high degree and not all columns of the profile participate in it. To this end, we binarize the tree using the procedure given in Section 2.2.5. Specifically, we add spurious nodes (base-pairs) to the tree so that every column participates as a tree node, the degree of any node is at most 3, and the number of nodes is O(m), where m is the number of columns in the profile. Further, a node corresponding to a true base-pair (i, j) ∈ M has at most one child.

Figure 3.2 describes a dynamic programming algorithm for aligning a sequence to an RNA profile. The RNA profile is described by a tree. Each node v in the tree corresponds to a base-pair (l_v, r_v) ∈ M' of the profile, where M' is the augmented list of base-pairs. The alignment of the sequence to the RNA profile is done by recursing on the tree-like structure of RNA. Each node in the binarized tree either represents a base-pair/unpaired base (and has its own PSSM), or represents a branching point in a pair of parallel loops. The algorithm maintains the sequence interval being aligned and the current node in the structure tree.
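The parent relation among non-crossing base-pairs can be recovered with a single pass and a stack, as in the Python sketch below. It covers only the nesting tree: the binarization step relies on the procedure of Section 2.2.5 and is not reproduced. The function name `parent_map` and the example pairs are hypothetical.

```python
def parent_map(pairs):
    """Map each base-pair to its innermost enclosing base-pair (None for outermost pairs)."""
    parents = {}
    stack = []                                # currently open pairs, outermost first
    for i, j in sorted(pairs):                # visit pairs by opening position
        while stack and stack[-1][1] < i:
            stack.pop()                       # that pair closed before position i
        if stack and j > stack[-1][1]:
            raise ValueError("crossing base-pairs: %r and %r" % (stack[-1], (i, j)))
        parents[(i, j)] = stack[-1] if stack else None
        stack.append((i, j))
    return parents

# Two nested stems inside one enclosing pair (illustrative coordinates).
tree = parent_map([(1, 20), (2, 9), (3, 8), (11, 19), (12, 18)])
```

Pairs that share a parent, such as (2, 9) and (11, 19) here, are the branching points that give the tree its high degree before binarization.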

procedure PAln
(* M is the set of base-pairs in RNA profile R. M' is the augmented set. *)
for all intervals (i, j), 1 ≤ i < j ≤ n, all nodes v ∈ M'
    if v ∈ M
        A[i, j, v] = max {
            A[i + 1, j − 1, child[v]] + δ(l_v, r_v, t[i], t[j]),
            A[i, j − 1, v] + γ('−', t[j]),
            A[i + 1, j, v] + γ('−', t[i]),
            A[i + 1, j, child[v]] + γ(l_v, t[i]) + γ(r_v, '−'),
            A[i, j − 1, child[v]] + γ(l_v, '−') + γ(r_v, t[j]),
            A[i, j, child[v]] + γ(l_v, '−') + γ(r_v, '−') }
    else if v ∈ M' − M, and v has one child
        A[i, j, v] = max {
            A[i, j − 1, child[v]] + γ(r_v, t[j]),
            A[i, j, child[v]] + γ(r_v, '−'),
            A[i, j − 1, v] + γ('−', t[j]),
            A[i + 1, j, v] + γ('−', t[i]) }
    else if v ∈ M' − M, and v has two children
        A[i, j, v] = max over i ≤ k ≤ j of { A[i, k − 1, left_child[v]] + A[k, j, right_child[v]] }
    end if
end for

Figure 3.2 An algorithm for aligning an RNA profile R with m columns against a database string t of length n. The query consensus structure M has been binarized to get M'. Each node v in the tree corresponds to a base-pair (l_v, r_v) ∈ M'.

3.5 Experimental Results

We implemented the chain filtering and the profile alignment algorithms as described above. All tests reported herein were performed on a 2.8 GHz Intel PC (genomic searches were done on a 1.6 GHz AMD Opteron grid). For chain filtering, we chose the parameters l, m, δ and the score threshold (which affects s_K) so as to optimize efficiency while maintaining optimal accuracy. The chain filtering was also composed with HMM filtering (from the RAVENNA package [115]) to further improve the filtering efficiency. For the alignment of the filtered sequences to an RNA model, we used both our profile alignment tool and the CMsearch tool from the INFERNAL suite (http://infernal.wustl.edu) [28, 35]. Both the HMM filters (using expanded HMM filters) and CMsearch were applied in the following with their default parameters or recommended parameters from the Rfam database website.

We applied these algorithms to search for riboswitch elements. We chose to focus on riboswitches both due to their importance and due to their unique properties that make them an ideal test case: many ncRNA families show strong sequence similarity, which makes sequence based filtering very efficient, and relatively trivial. In contrast, the riboswitches, with 12 distinct sub-families (and new sub-families being continuously discovered) are quite diverse, and relatively difficult to filter. Table 3.1 summarizes known riboswitches from Rfam v.7.0 [35, 36].

Table 3.1 Riboswitch sub-families in Rfam database (version 7.0). Average length and "%identity" are based on the information in Rfam database. "#seed" is the number of sequences in the seed alignment. "#total" is the number of full family sequences.

Rfam Id   Name        Average length   %id   #seed   #total
RF00050   FMN         145              66    48      136
RF00059   TPP         110              52    237     382
RF00080   yybP-ykoY   128              45    74      127
RF00162   SAM         110              67    71      219
RF00167   Purine      100              56    37      100
RF00168   Lysine      182              49    60      98
RF00174   Cobalamin   204              46    171     249
RF00234   glmS        184              58    14      37
RF00379   ydaO-yuaA   158              54    35      74
RF00380   ykoK        168              60    39      53
RF00442   ykkC-yxkD   106              62    16      21
RF00504   gcvT        101              51    117     163

3.5.1 Filter Efficiency and Accuracy

To systematically test our filters, we downloaded data on 12 riboswitch sub-families from the Rfam database, version 7.0 [35, 36]. These data contain for each family a "seed" alignment, which is a hand-curated alignment of known members, and a "full" collection of family sequences, which contains known and predicted (by CMsearch) members. In the following we refer to a member of the seed alignment as a seed sequence, and to a member of the full collection as a family sequence.

As a first test of our method we synthesized several test sequences. For each sub-family, we created a random genomic sequence of size 1 Mb with G+C content of 0.5, and randomly planted all the family sequences therein. We tested the filter's performance on the composite sequence. Table 3.2 summarizes the results of the chain filter (CF) in comparison to the HMM filters and to a combined filter. In addition to the efficiency measure we also report a second measure, efficiency2, which is computed exclusively on the random sequence. While actual genomic sequence will have some true hits as well, it is unlikely to have more than a few members per Mb, so efficiency2 is a better approximation to the true efficiency. Recall from Theorem 1 that high gains in filter speed at the cost of efficiency are desirable because filter composition can be used to achieve dominance. Thus, the key statistic in Table 3.2 is search time. The sequence based chain filter is much faster (on average, 9 sec/Mb) than the HMM filter (71 sec/Mb). Interestingly, even the efficiency of a CF filter remains very high on the average (0.036) while maintaining optimal accuracy. The faster speed and the optimal accuracy of the CF filter make the composite filter (CF·HMM), which applies the CF filter first and the HMM filter later on the database, dominate the HMM filter. In Table 3.2, CF·HMM further improves the efficiency significantly (0.029), and it is still much faster (on average, 14 sec/Mb) than the HMM filter.

Table 3.2 Filtering performance of chain filters (CF), HMM filters (HMM), and composite filters (CF·HMM) on synthetic sequences. "eff." is the efficiency on synthetic sequences, "eff2." is the efficiency exclusively on random sequences, "acc." is the accuracy on synthetic sequences, and "time" is the running time on synthetic sequences. (∗) Note that these filters only miss one hit.

Family      CF eff.   HMM time (m:s)   CF·HMM eff.   CF·HMM time (m:s)
FMN         1.3e-2    1:10             1.3e-2        0:11
TPP         6.3e-2    0:59             5.8e-2        0:14
yybP-ykoY   1.5e-1    1:07             1.4e-1        0:28
SAM         1.8e-2    0:55             1.7e-2        0:09
Purine      3.8e-2    0:52             7.4e-3        0:10
Lysine      1.5e-2    1:34             1.5e-2        0:13
Cobalamin   6.3e-2    1:42             6.2e-2        0:26
glmS        1.3e-2    1:25             2.4e-3        0:17
ydaO-yuaA   1.2e-2    1:11             6.9e-3        0:10
ykoK        1.2e-2    1:32             5.9e-3        0:12
ykkC-yxkD   1.7e-3    0:53             1.7e-3        0:07
gcvT        3.7e-2    0:51             1.6e-2        0:10
Average     0.036     1:11             0.029         0:14

Averages over the 12 families: CF eff. 0.036, eff2. 0.024, time 0:09; HMM eff. 0.361, eff2. 0.347, time 1:11; CF·HMM eff. 0.029, eff2. 0.017, time 0:14.

The filtering is followed by alignment with RNA-Profile. In the appendix we include a direct comparison between profile alignment and the CM approach. As can be seen from Table 3.3, profile alignment attains very similar accuracies but is much faster.

Next, we tested the performance of our filter on two genomes with biased G+C content, previously used by Weinberg and Ruzzo [115]: E. coli K12 and Staphylococcus aureus MW2. We searched for the 12 riboswitch families on these genomes whose total length is 7.5 Mb. Table 3.4 presents a comparison to the HMM filter. As expected, the chain filter is much faster. On the average, its efficiency is also very high (0.017), outperforming that of the HMM filter (0.34). Note that all true hits in these two genomes were recovered by every filtering method with the corresponding alignment algorithm.
Obviously, the composite filter, CF·HMM, still provides the fastest filtering solution.

Table 3.3 Comparison of RNA profile alignment (PAln) and CMsearch (CM) on synthetic sequences. RNA profile alignment uses p-value cut-off 0.05 to get the top ranking hits (one hit in cobalamin family is marginal), and CMsearch uses the same cutoff bit score as the Rfam data website. "retrieval rate" is defined as the percentage of true positive (#TP) hits over the possible true hits (#true) after filtering (either chain filtering (CF) or HMM filtering (HMM)).

Family      PAln #TP   PAln #true   PAln retrieval rate   CM #TP   CM #true   CM retrieval rate
FMN         136        136          1                     136      136        1
TPP         373        382          0.98                  382      382        1
yybP-ykoY   119        127          0.94                  127      127        1
SAM         218        219          1                     219      219        1
Purine      99         99           1                     100      100        1
Lysine      97         98           0.99                  98       98         1
Cobalamin   242        249          0.97                  248      249        1
glmS        36         37           0.97                  35       37         0.95
ydaO-yuaA   73         74           0.99                  73       74         0.99
ykoK        52         53           0.98                  53       53         1
ykkC-yxkD   21         21           1                     21       21         1
gcvT        138        163          0.85                  163      163        1

Per-family running times in Table 3.3 range from 0:30 to 14:58 (m:s) for CF·PAln, and from 1:56 (m:s) to 27:39:27 (h:m:s) for HMM·CM.

Table 3.4 Filtering performance of chain filters (CF), HMM filters (HMM), and composite filters (CF·HMM) on two real genomes with alignment performance of profile alignment (PAln) and CMsearch (CM). "eff." is the filtering efficiency, and "time" is the running time for the corresponding filters. Note that when HMM filter efficiency is close to 1, the computation time of HMM·CM (from RAVENNA [115] package) is longer than the computation time for CMsearch (from Infernal package [35]). This is because of the differences between the C++ and the C compiler.

Family      CF time (m:s)
FMN         1:28
TPP         1:10
yybP-ykoY   1:20
SAM         1:11
Purine      1:08
Lysine      1:41
Cobalamin   1:55
glmS        2:00
ydaO-yuaA   1:20
ykoK        1:21
ykkC-yxkD   1:00
gcvT        0:54
Average     1:22

The average efficiency over the 12 families is 0.017 for CF and 0.34 for HMM. The listed running times range from 7:45 to 114:00 (m:s) for the HMM filter, from 1:29 to 44:56 for "CF·PAln", from 1:29 to 39:40 for "CF·HMM·PAln", from 1:00 to 69:42 for "CF·HMM·CM", and from 8:23 (m:s) to 177 hours for "HMM·CM". The estimated time for CM alone ranges from 18 to 166 hours.
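The average timings reported for the synthetic data in this section can be related by a back-of-envelope cost model, sketched here in Python. The model is an assumption introduced for illustration and is not taken from the dissertation: it supposes that the second filter of a composition only processes the fraction of the database passed by the first, and that its cost scales linearly with that fraction. The function name `composite_time` is hypothetical; the three inputs are the averages quoted in the text.

```python
def composite_time(t_first, eff_first, t_second):
    """Estimated time per Mb of a two-stage filter under a linear cost model (assumption)."""
    return t_first + eff_first * t_second

# Averages quoted in the text: CF 9 sec/Mb with efficiency 0.036, HMM 71 sec/Mb.
estimate = composite_time(t_first=9.0, eff_first=0.036, t_second=71.0)
speedup_over_hmm = 71.0 / estimate
```

Under these assumptions the estimate is about 11.6 sec/Mb, the same order as the 14 sec/Mb reported for CF·HMM. The gap is unsurprising for so crude a model, which ignores per-candidate overheads and the fact that candidate windows are longer than the keyword hits that trigger them.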

3.5.2 Discovering Novel Riboswitches

We applied our sequence based filters, coupled with profile alignment, to search all bacterial and archaeal genomes for the twelve riboswitch families. A total of 254 genomes spanning 818 Mb were searched. Of these, 179 have some ncRNA annotations. Table 3.5 summarizes the search results. In total we identified 463 novel (putative) riboswitches based on a P-value cutoff of 0.04. Interestingly, 413 of these predictions were within 500 bp upstream of an annotated gene. These predictions include hits to genomes that had previously been annotated for ncRNA in Rfam. For the cobalamin riboswitch (as an example), most of the predictions are, indeed, in 5' UTRs of cobalamin-related or cobalamin-associated genes [87, 108] (B12 synthesis, cobalt transporters and alternative cobalamin-independent enzymes). One of the predicted cobalamin riboswitches has been experimentally tested and confirmed (by Ilya Borovok and Yair Aharonowitz, microbiologists at Tel-Aviv University). In the gcvT (glycine-dependent riboswitch) family, we found 28 novel hits, of which 12 occur as proximal pairs, which is known to be a preferred mechanism of action for this family [65]. Detailed information on these discoveries is presented in supplementary data (http://www.cse.ucsd.edu/~shzhang/paper/ISMB2006 or http://ribozyme.ucsd.edu/fastr).

3.5.3 Mining Environmental Sequence Data

Efforts are underway to study the diversity of life by random sequencing of environmental samples [106]. This has resulted in a large volume of genome sequence data. An initial sampling of the Sargasso Sea resulted in 1 Gb of Global Ocean Sample (GOS) sequence. The complete expedition will greatly extend this data, and is expected to double the number of known protein sequences [94]. We mined the data for 8 major riboswitches and discovered a number of novel members, with interesting distribution across geographic locations. These families show an interesting distribution that differs from known microbial genomes, but is consistent with the phylogenetic distribution of GOS sequences.

Table 3.5 Summary of searching riboswitches against the whole bacterial and archaeal genomes. "#known" is the number of known riboswitches in the whole bacterial and archaeal genomes, "#TP" is the number of predicted known hits, "#new" is the number of new predictions in these genomes, and "#new*" is the number of new predictions from the genomes that had previously been annotated for ncRNA in Rfam.

Family      CF eff.
FMN         8.5e-4
TPP         7.9e-3
yybP-ykoY   7.7e-2
SAM         6.7e-4
Purine      5.7e-2
Lysine      5.7e-3
Cobalamin   3.6e-2
glmS        1.4e-3
ydaO-yuaA   2.3e-2
ykoK        3.9e-3
ykkC-yxkD   1.4e-5
gcvT        4.2e-2

Across the 12 families, the CF·PAln search time ranges from 2.8 to 65.1 hours, against an estimated 82.6 to 794.0 days for CM. The "#new" column sums to the 463 new predictions reported in Section 3.5.2.

A summary of our results is shown in Table 3.6. A total of 2542 riboswitches were identified. These riboswitches significantly expand the known families, but show a highly variable distribution. Interestingly, we only find a few instances of purine and lysine riboswitches. This is consistent with a recent survey in which purine and lysine riboswitches have not been reported in Cyanobacteria, and are underrepresented (30/155 occurrences) in Proteobacteria [117]. The GOS samples, on the other hand, have an abundance of these phyla. We find multiple occurrences of cobalamin and TPP, which are well represented in all microbial phyla, and even some eukaryotes (plants/fungi).

The riboswitches typically occur in the 5' UTR of the regulated genes, but are also found in the 3' UTR. It is suspected that the 3' UTR occurrence inhibits transcription, while the 5' occurrence inhibits translation [97]. We searched 600 bp downstream of the predicted riboswitches for predicted ORFs, with the exception of the gcvT (glycine-dependent riboswitch) family. The gcvT riboswitch often appears as a pair of ligand-binding domains next to each other; therefore, we extended the search to 1200 bp downstream. As the sampling is relatively light, the contigs were not large enough to accommodate a downstream ORF in many cases. Even so, 1691 (67%) of riboswitches have a downstream confident ORF prediction, of which 1549 are annotated with GO terms.

The GOS ORFs have been grouped into 297,254 clusters. Each cluster was annotated based on similarity to known domains, and is associated with GO process IDs. It is expected that the ORFs downstream of riboswitches would have GO process IDs related to the appropriate metabolic functions, and can help validate the riboswitch predictions. Table 3.6 shows that the downstream genes all group into a very small number of clusters. Typically 3-4 clusters account for over 90% of the downstream genes for each of the riboswitches except cobalamin.
Further, the over-represented GO terms that are common to the major clusters for each riboswitch are strongly suggestive of the expected function. For example, all of the 92 ORFs downstream of FMN (riboflavin binding riboswitch) group into 2 clusters. Moreover, 2 GO terms ("riboflavin synthesis", "transport") cover all of these clusters.

Glycine riboswitch (gcvT) participates in the glycine cleavage system and uses glycine as an energy source [65]. We discovered 1482 members of the gcvT riboswitch, 989 of which have clustered ORFs located downstream. Most of those ORFs with GO mappings contain 2 GO terms: "glycine decarboxylation via glycine cleavage system" (702 ORFs) and "tricarboxylic acid cycle" (224 ORFs). The 1482 predictions contain 213 adjacent pairs, typical for this family. We note here that we chose very conservative parameters in our predictions. By relaxing the P-value, we found 443 pairs of co-occurring gcvT riboswitches. Thus, the GOS data has a significant expansion of this family. As a partial explanation, recall that SAR11 (Pelagibacter ubique) is among the most abundant bacterial organisms in the oceans [32]. The SAR11 has a compact genome with many metabolic pathways (cobalamin, thiamine) missing. A search of this genome (Refseq ID: NC 007205) for riboswitches resulted in 4 (two adjacent pairs) members of gcvT but no other riboswitch. Thus, it is possible that many of the gcvT riboswitches come from "SAR11-like" organisms. We note that the geographic distribution of the gcvT riboswitch is consistent with the distribution of SAR11-like organisms [89].

We also see an expansion in the glmS riboswitch family. The glmS riboswitch controls the production of GlcN6P (glucosamine-6-phosphate), which plays an important role in sugar metabolism. We find 75 hits, 64 of which have clustered ORFs located downstream of these elements. All these ORFs are clustered into one GO term (carbohydrate biosynthesis).
Cobalamin presents an interesting case. A total of 218 ORFs are found downstream of the predictions, but relatively few of them (118 ORFs) have GO annotation. On the other hand, we also found that 92 of the riboswitches also have an upstream ORF with GO mapping within 600 bp. Of these, 23 ORFs have the GO term "cobalamin biosynthesis." This suggests two possibilities: the cobalamin riboswitch regulates upstream genes, and/or that genes involved in cobalamin metabolism are clustered on the genome.

Table 3.6 Summary of searching riboswitch elements against GOS data. "#ORFs (with GO)" refers to the number of riboswitches with a downstream confident ORF prediction, and the number in parenthesis refers to the number of ORFs with GO term annotations. "#ORF clusters" refers to the number of the ORF clusters with GO mappings and having more than two hits in our predictions. "#Major GO terms" refers to the number of GO terms that cover most of the ORFs, and the percentage of total GO mappings is shown in the parenthesis. (∗) Cobalamin predictions that check for upstream ORFs.

Family       #Known in MG   #Predicted in GOS   #ORFs (with GO)   #ORF clusters   #Major GO terms   First listed GO term         Count
FMN          136            118                 92(92)            5(100%)         2(100%)           Riboflavin biosynthesis      87
TPP          382            456                 281(255)          12(95%)         5(98%)            Transport                    107
SAM          219            56                  42(39)            6(95%)          3(87%)            'de novo' IMP biosynthesis   15
Purine       100            9                   0(0)              0(-)            0(-)              -                            -
Lysine       98             8                   5(5)              4(80%)          2(100%)           Amino acid biosynthesis      3
Cobalamin    249            338                 218(118)          11(93%)         4(96%)            Siderophore transport        57
Cobalamin*   249            338                 107(92)           66(72%)         1(25%)            Cobalamin biosynthesis       23
glmS         37             75                  64(64)            1(100%)         1(100%)           Carbohydrate biosynthesis    64
gcvT         163            1482                989(976)          7(99%)          5(100%)           Glycine decarboxylation      702

The further major GO terms listed in Table 3.6 are Transport, Electron transport, SRP-dependent cotranslational protein targeting to membrane, Methionine biosynthesis, Threonine biosynthesis, Thiamin biosynthesis, Siderophore transport, Thiamin DP biosynthesis, Tricarboxylic acid cycle, L-serine biosynthesis and Sodium ion transport.

The downstream ORFs with annotated GO terms provide a validation of our riboswitch predictions. Therefore, we use these predictions to enhance the functional assignment of the remaining ORFs. We analyzed confident ORF clusters with no GO mapping that were downstream of predicted riboswitches. The function of each cluster was inferred based on a BLASTp search of the ORFs against the NCBI nr database. The results, shown in Table 3.7, are generally consistent with the upstream riboswitch. A majority of the 71 ORFs downstream of cobalamin riboswitches are related to cobalamin transport or cobalamin synthesis. We found 22 ORFs downstream of TPP riboswitches that are homologous to ABC thiamin transporter. Interestingly, 11 ORFs downstream of gcvT all group in a single cluster with no functional annotation. It is likely that we have identified a novel protein family regulated by this riboswitch.

In conclusion, the GOS data provides valuable insight into the riboswitch family, and the distribution of riboswitches in oceanic communities. Even with a conservative cutoff, we found over 2500 novel homologs. It is likely that a careful interrogation will validate other candidates. The GOS data should also be valuable in searching for novel families. The under-representation of purine and lysine, and the over-representation of gcvT and glmS riboswitches are indicative of differential organization of metabolic pathways.

3.6 PFastR Web Server

The PFastR web server (http://ribozyme.ucsd.edu/fastr) provides fast sequence and secondary structure search against most RNA families in the Rfam database [35,36]. The PFastR web application allows users to query ncRNA families within the Rfam database against their own genomic sequences. We have automated the process that keeps PFastR up to date with new versions of the Rfam database and ensures that our data is always consistent. The overall accuracy of the PFastR web server remains high while the running time is just 1.39 minutes per Mb, on average, for the efficient search option. Table 3.8 shows the results for each search option, obtained by testing 566 RNA families from the Rfam database. As shown, the average filter accuracy is very high and there is only a slight degradation in overall accuracy after running PFastR on the filtered results. Moreover, the results maintain not only a high level of accuracy, but also a high level of sensitivity. Figure 3.3 shows the ROC curves for selected families in which there are more than 100 sequences with less than 90% identity. The ROC curves indicate that running most families yields a high true positive rate with a minimal false positive rate. Full statistics and ROC curves for each family may be found at http://ribozyme.ucsd.edu/fastr/sup data.

Table 3.7 Summary of predicted functions of the confident ORFs downstream of riboswitch predictions. “#ORFs without GO mapping” refers to the number of clustered confident ORFs downstream of the prediction. “#ORFs” refers to the number of ORFs of the cluster, whose ID is shown in the “Cluster ID” column, downstream of the prediction (“#total” is the total number of ORFs in this cluster).

Riboswitch  #ORFs without  #ORF      Cluster ID  #ORFs      Predicted function
            GO mapping     clusters              (#total)
TPP         26             3(96%)    920         3(631)     TonB-dependent receptor
                                     9291        17(57)     ABC thiamine transporter
                                     9877        5(119)     ABC-type thiamine transport system
glmS        0              0         -           -          -
gcvT        13             1(84%)    5080        11(913)    Conserved hypothetical protein
SAM         3              1(67%)    13428484    2(4)       No hits
FMN         0              0         -           -          -
Lysine      0              0         -           -          -
Purine      0              0         -           -          -
Cobalamin   100            7(87%)    920         10(631)    TonB-dependent vitamin B12 receptor
                                     2986        2(224)     No hits
                                     4352        26(652)    TonB-dependent receptor
                                     4369        35(435)    Outer membrane cobalamin receptor protein
                                     12109       6(30)      Conserved hypothetical protein
                                     11010188    6(90)      Similar to metal-binding protein
                                     11518609    2(2)       No hits

Table 3.8 Statistics for accurate option and efficient option.

                          Accurate Option   Efficient Option
Avg. Filter Accuracy      0.9998            0.9939
Avg. Filter Efficiency    0.0392            0.0213
Avg. Overall Accuracy     0.9902            0.9863
Average Processing Time   2.46 mins/Mb      1.39 mins/Mb

3.7 Summary

We reiterate that the main contribution of this chapter is not simply to provide improved sequence-based filtering, but to formalize the filtering problem, and demonstrate that a simple approach based on combining filters is useful. While our results improve the state-of-the-art and are likely to be useful in discovering novel ncRNAs, many questions remain unanswered. Some of the open problems are directly related to our analysis. First, can we give theoretical bounds on the efficiency vs. speed trade-off for the union filters? This will probably entail some assumptions on the distribution of keyword scores. Second, can we design optimal chain filters, which provably dominate all other sequence-based filters? Indeed the bulk of the results presented here are on filters that are fast, but perhaps not as efficient as they could be. On the other hand, HMMs are efficient, but not always fast, which indicates that there is room for more filters in between. Examples of such filters include subsets of profiles (choose a subset of contiguous conserved columns, and filter based on those), or a hierarchy of compositions instead of a single one. Finally, for the most diverse families, it is likely that sequence-based filters will not be efficient. Fast structure-based filters were shown to be effective in the previous chapter. How can the two kinds of filters be combined to further improve efficiency? It is an important open problem to formalize the efficiency and speed of combinations of structure-based filters and sequence-based filters. We hope that these and related challenges will spur the development of filters, and ultimately lead to better tools for mining biomolecular databases.

Figure 3.3 ROC curves for selected families with accurate filter and alignment.

This chapter, in part, is a reprint of the paper, “A sequence-based filtering method for ncRNA identification and its application to searching for riboswitch elements”, co-authored with Ilya Borovok, Yair Aharonowitz, Roded Sharan, and Vineet Bafna in Bioinformatics (ISMB 2006) Vol. 22, pp. e557–e565, 2006. The dissertation author was the primary investigator and author of this paper.

4 RNAscf: Consensus folding of unaligned RNA sequences

4.1 Introduction

As one of the earliest problems in computational biology, the RNA secondary structure prediction (sometimes referred to as “RNA folding”) problem has attracted attention again, thanks to the recent discoveries of many novel non-coding RNA molecules. The two common approaches to this problem are the de novo prediction of RNA secondary structure based on energy minimization and the “consensus folding” approach (computing the common secondary structure for a set of unaligned RNA sequences). Consensus folding algorithms work well when the correct seed alignment is part of the input to the problem. However, seed alignment itself is a challenging problem for diverged RNA families. In this chapter, we propose a novel framework to predict the common secondary structure for unaligned RNA sequences. By matching putative stacks in RNA sequences, we make use of both primary sequence information and thermodynamic stability for prediction at the same time. We show that our method can predict the correct common RNA secondary structures even when we are given only a limited number of unaligned RNA sequences, and it outperforms current algorithms in sensitivity and accuracy.

As with proteins, the structures of RNAs are more important for function than their sequences: RNAs with similar functions often have similar structures but distinct primary sequences. Therefore, understanding the structures of these RNA molecules will help elucidate their functions. Consider the recent exciting discovery of riboswitches [68, 108] as an example. These control elements, with a conserved secondary structure, are located in the untranslated regions of genes coding for proteins that are involved in various metabolite (nucleic acids, amino acids etc.) synthesis pathways. The riboswitches can turn off the expression of their downstream genes by binding to certain metabolites and subsequently changing their secondary structures and blocking translation initiation.

While there is a resurgence of interest in ncRNA, the problem of RNA secondary structure prediction has been extensively studied since the 1970s. The key idea is that to stabilize its structure, distant bases in the single-stranded RNA molecule must pair through hydrogen bonds. There are two distinct approaches to predicting RNA secondary structure. The RNA folding approach, initiated by [103], assigns free energies to the components of RNA secondary structure, and then computes the RNA secondary structure with the minimum energy. Dynamic programming algorithms have been developed to compute minimum energy secondary structures [71, 72, 93, 112, 124], and implemented in software packages such as MFOLD [41, 43] and ViennaRNA [122–125]. However, RNA folding via energy minimization has its shortcomings. First, fold prediction depends critically upon correct values of the energy parameters, as shown by [45], which are hard to obtain experimentally. Also, RNA folding in a real cell is mediated by interactions with other molecules, and the absence of knowledge of these interactions may cause misfolding in silico. [73] tried to alleviate this problem by comparing minimum energy structures of a set of RNA sequences from the same family to determine conserved secondary structure.
However, it is unclear how the misprediction of secondary structure for a single RNA sequence can affect the accuracy of this approach.

A different approach attempts to resolve these shortcomings by using evolutionary conservation of structure as the basis for structure prediction. It needs as input multiple RNA sequences from an RNA family, which have common secondary structures. Since for divergent sequences, the mutations in base pairing regions must be compensated in the complementary base to preserve structure, the presence of multiple covarying mutations is a strong signal for base pairing. In fact most RNA sequences are selected more for maintenance of the structure than conservation of primary sequence. If the sequence similarity between the given RNA sequences is appropriate, one can first align these sequences using a multiple sequence alignment algorithm and then figure out the potential base pairs in RNA secondary structures by looking at regions with a high number of compensating mutations. Levitt successfully derived the theoretical tRNA secondary structure using this approach, which was largely confirmed by crystallography [60], and various other structures have been determined through such analysis. Computer programs were implemented later on to achieve this goal automatically [42].

However, aligning multiple and divergent RNA sequences so as to preserve their conserved structures is not easy, because many compensatory mutations decrease the overall sequence similarity. For unaligned sequences, one must compute the structure and alignment simultaneously. Sankoff proposed an algorithm that can simultaneously align RNA sequences and find the optimal common fold [34,66,91]. However, the complexity of this algorithm is $O(l^6)$, where $l$ is the length of the RNA sequences, too high to be practical even for two sequences. The complexity can be reduced to $O(l^4)$ [33], but only when the RNA has no multi-loop structure.
Eddy and Durbin, and other groups [26,52,90] pioneered the approach of modeling RNA sequences using stochastic context-free grammars (SCFGs). The rules of the SCFG allow for position-dependent scoring of distant base pairs and primary sequence conservation, and also allow automated estimation of model parameters from unaligned sequences using EM. However, in practice, the extensive divergence of RNA sequences makes it hard to reconstruct structure and alignment with perfect accuracy, and the covariance models work best when supplied with good seed alignments. Much recent work has focused on improving fold prediction for aligned sequences [42,51].

In our approach to this well-researched problem, we are motivated by the idea of constraining allowed folds to make it more likely to reach the final correct structure. This idea has been used with good success in aligning divergent genomic sequences. For diverged DNA sequences, there are not enough signals for probabilistic models such as HMMs to be effective without prior information (in the form of seed alignments). Recent methods [11, 12, 62] identify anchors corresponding to highly conserved orthologous regions, and use these to constrain the multiple alignments. This approach has been used in RNA as well. [113] pioneered this with a statistical approach for choosing conserved stem-loops within a pair of fixed size windows in a set of unaligned RNA sequences. [46] extended this idea considerably. Starting with putative stem-loops, they remove all but the ones conserved in a global sequence alignment of two sequences. These are further culled to retain only those that are present in every sequence in the family. [76] proposed a different anchoring approach to solving the same problem by first determining anchor regions that are highly conserved in the given RNA sequences and then seeking a set of conserved stems crossing the same anchor regions that have minimum folding energy. Both methods use primary sequence conservation extensively, and limit the variability in the length of loop regions, which may lead to accurate but relatively few predicted anchor stacks for diverged families. In contrast, Bouthinon and Soldano [10] and Davydov and Batzoglou [19] each proposed algorithms to select conserved base pairs among the given RNA sequences based solely on the conservation of the structure that they form, considering neither their sequence similarity nor the thermodynamic stability of this structure. As a result, these methods risk selecting wrong base pairs when only a limited number of RNA sequences are given.
In this chapter, we describe a method, RNAscf (RNA Stacks based Consensus Folding), for predicting the consensus fold of an RNA family, given unaligned sequences. Our method (Section 4.2 and Section 4.3) is based on the notion of finding structurally conserved anchors, and an iterative extension constrained by the anchors. With relatively few parameters, and limited training, the method outperforms other competing methods (Section 4.4), detecting 88% of true stacks (sensitivity), and overlapping with a correct stack in 93% of all predictions. The method establishes the validity of this new paradigm for computing consensus structure in RNA. We also mention a possible extension based on an iterative refinement of the predicted structure (Section 4.5).

4.2 RNA Secondary Structure and Stack Configurations

As shown in Figure 4.1(a), the secondary structure of an RNA has a tree-like shape. Assume that there is a dummy base pair between $0$ and $n + 1$. Define a loop as a set of indices $i_1 \le i_2 \le \ldots \le i_k$ such that for all $j$, either $i_{j+1} = i_j + 1$, or $(i_j, i_{j+1})$ form a base pair. Further, if for some $j, j'$, $(i_j, i_{j'})$ form a base pair, then $|j - j'| = 1$. It can be seen that the structure can be decomposed uniquely into a set of loops, and the loops can be classified as a hairpin (containing only one base pair), a stem-loop (two base pairs with no unpaired bases), an interior loop/bulge (two base pairs with unpaired bases), and a multi-loop ($k > 2$ base pairs).

The stability of the RNA structure is determined predominantly by stacks of consecutive stem-loops. The stacks are stabilized by hydrogen bonds between the base pairs, and in general, the longer a stack region, the more energetically favorable it is. Each stack corresponds to a pair of sub-strings. These pairs are typically non-interleaving. While interleaved stacks, or pseudo-knots (such as the pairs $(f, f')$ and $(h, h')$ in Figure 4.1(a)) do occur, they are relatively less common and are ignored here. In our approach for finding anchors, we ignore individual base pairs and work with a slightly generalized notion of a stack that includes unpaired bases. Configurations of stacks that are conserved in multiple sequences will be the anchors in determining consensus structures.
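The loop decomposition described above can be illustrated with a short Python sketch. It is not part of RNAscf: the dot-bracket input format and the names `partners` and `classify_loops` are assumptions made for this illustration, pseudo-knots are not handled, and the exterior loop closed by the dummy base pair is not counted.

```python
from collections import Counter


def partners(dot_bracket):
    """Map each position to its pairing partner, or -1 if unpaired."""
    partner = [-1] * len(dot_bracket)
    open_positions = []
    for pos, symbol in enumerate(dot_bracket):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            left = open_positions.pop()
            partner[left], partner[pos] = pos, left
    return partner


def classify_loops(dot_bracket):
    """Count the loop type closed by each base pair (pseudoknot-free input)."""
    partner = partners(dot_bracket)
    counts = Counter()
    for i, j in enumerate(partner):
        if j <= i:  # unpaired base (-1) or the 3' end of a pair: closes nothing
            continue
        inner_pairs = unpaired = 0
        k = i + 1
        while k < j:
            if partner[k] > k:   # an enclosed pair: count it and jump past it
                inner_pairs += 1
                k = partner[k] + 1
            else:                # an unpaired base inside this loop
                unpaired += 1
                k += 1
        if inner_pairs == 0:
            counts["hairpin"] += 1
        elif inner_pairs == 1 and unpaired == 0:
            counts["stem-loop"] += 1
        elif inner_pairs == 1:
            counts["interior loop/bulge"] += 1
        else:
            counts["multi-loop"] += 1
    return counts
```

For example, `classify_loops("((.(...)))")` counts one stem-loop, one interior loop/bulge, and one hairpin.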

Figure 4.1 (a) An RNA secondary structure with various structural elements including stacked stem-loops, bulges, hairpins, and multi-loops. (b) Two stack configurations match each other in both unpaired regions and paired regions.

4.2.1 Predicting Putative Stacks

The thermodynamic stability of a stack is proportional to the number of hydrogen bonds between the base pairs in the stack. Any pair of strings can be aligned (with gaps), so as to optimize the energy of the paired bases. Therefore, given an RNA string $A$, we construct a local alignment of $A[1, \ldots, n]$ with $A[n, \ldots, 1]$. Let $\delta_h(i, j)$ be the score (number of h-bonds) in an $(A[i], A[j])$ base pairing. Thus, base pairings of G-C, A-U, and G-U are scored 3, 2, and 1 respectively. Let $S[i, j]$ be the optimum score for a stack with left end-point $i$, and right end-point $j$. Then

$$S[i,j] = \max \begin{cases} S[i+1, j-1] + \delta_h(i,j) & (\delta_h(i,j) > 0) \\ S[i+1, j] + g \\ S[i, j-1] + g \\ 0 \end{cases} \qquad (4.1)$$

where $g$ is a gap penalty. In our implementation, we modify this basic approach to include affine gap costs. We select each $(i, j)$ for which $S[i, j]$ is greater than some threshold. In order to avoid predicting overlapping stacks, we sort the stacks by decreasing score values. Each time a stack is picked, all base pairs in it are excluded.

While straightforward, this is an effective procedure. Intuitively, the probability of finding a base pair at random is much higher than the probability of finding a high-scoring stack. Waterman showed that finding a stack of length $k$ within a certain window size in many (even if not all) of the given sequences can be significant [113]. Also, most real base pairs in correct structures should be stabilized by multiple stacked base pairs, implying that limiting consideration to high-scoring stacks does not result in many false negatives.

To test this, we computed statistics on the seed alignments in the Rfam database [35]. All stacks from seed alignments of 379 families (9559 sequences) were selected. The seed alignment contains known representative members of the family; it is hand-curated and annotated with structural information. To correct for annotation errors, the stacks were realigned to each other locally, and extended so as to be locally maximal without unpaired bases. Figure 4.2 plots the cumulative distributions of stacks according to the minimum number of hydrogen bonds (Figure 4.2(a)), or stack length (Figure 4.2(b)). Additionally, we also plotted the number of putative stacks that can be found on Rfam sequences (normalized to a 100bp region).
If all possible base pairs were considered, we see about 900 putative stacks in a 100bp region. This number grows quadratically as the length of the sequence increases. Instead, by limiting attention to stacks with length greater than 4, and at least 8 hydrogen bonds, we have only 34 putative stacks in a 100bp region. At the same time, we miss only a small fraction of the true stacks. This computation shows that in considering anchors, it is reasonable to restrict attention to longer stacks.
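As an illustration of the recurrence in Equation 4.1, the following Python sketch fills the table $S$ and lists end-points whose score exceeds a threshold. It is a minimal sketch and not the RNAscf implementation: it uses a linear gap penalty instead of affine gap costs, 0-based indices, and a minimum loop size `min_loop`; the default values for `gap` and `min_loop` and the function names are assumptions. The greedy exclusion of overlapping stacks is not shown.

```python
# Hydrogen-bond scores delta_h for the allowed base pairings.
H_BONDS = {
    ("G", "C"): 3, ("C", "G"): 3,
    ("A", "U"): 2, ("U", "A"): 2,
    ("G", "U"): 1, ("U", "G"): 1,
}


def stack_scores(seq, gap=-2, min_loop=3):
    """Fill S[i][j] following Equation 4.1 (0-based, linear gap penalty)."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    # Process intervals by increasing width so inner cells are ready first.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(0, S[i + 1][j] + gap, S[i][j - 1] + gap)
            bonds = H_BONDS.get((seq[i], seq[j]), 0)
            if bonds > 0:
                best = max(best, S[i + 1][j - 1] + bonds)
            S[i][j] = best
    return S


def putative_stacks(seq, threshold, gap=-2, min_loop=3):
    """Return (score, i, j) for end-points scoring above threshold, best first."""
    S = stack_scores(seq, gap, min_loop)
    n = len(seq)
    hits = [(S[i][j], i, j)
            for i in range(n) for j in range(i + 1, n)
            if S[i][j] > threshold]
    return sorted(hits, reverse=True)
```

On the toy sequence `GGGGAAAACCCC`, the best-scoring end-points are `(0, 11)` with a score of 12, corresponding to four stacked G-C pairs.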

4.2.2 Stack Configurations

A putative stack of length 4 can still be found by chance. [46] and [104] both use sequence similarity to further select stacks. We propose to select a set of stacks instead of one at a time. We evaluate the set of stacks by both the stability (free energy) of the structure they form and the sequence similarity computed with these common stacks as anchors. To describe the evaluation function of the selected set, we must define a notion of configuration of stacks. Note that two stacks $P_1$ and $P_2$ may have one of the following relations [10]: (1) $P_1$ and $P_2$ are interleaving; (2) $P_1$ is enclosed within $P_2$ (denoted by $P_1$

$R_A$ on $A$ is a collection of non-interleaving stacks on $A$. The stacks have a partial order (

The energies of loops $E_h$ (hairpin), $E_b$ (bulge), $E_m$ (multi-loops), are a function of the length of the sequence and set as described in [45].

Figure 4.2 Statistics of the stacks in Rfam database. (a) One line shows the fraction of annotated stacks which would be missed by using different cutoffs on the minimum number of hydrogen bonds in the stacks. The other line shows the number of predicted stacks per 100 bases from all Rfam seed sequences using the same cutoffs. (b) One line shows the fraction of annotated stacks which would be missed by using the minimum lengths of the stacks as cutoffs. The other line shows the number of predicted stacks per 100 bases from all Rfam seed sequences using the same cutoffs.

Define a consensus structure $P(A, B)$ as a pair of structures $R_A$ and $R_B$ on $A$ and $B$ respectively, with a one-one correspondence between stacks in $R_A$ and $R_B$ such that the corresponding stacks in the two structures maintain identical partial order relationships. We define the free energy $\Phi(P(A, B))$ of the consensus structure similarly to the energy of the individual structures. For each pair of corresponding stacks, or loops, the maximum of the two energy values is chosen.

Given a consensus $P(A, B)$, the sequences $A$ and $B$ can be aligned to be consistent with $P(A, B)$ (see Figure 4.1(b)), so that the sequences in stacks are aligned to each other, and likewise for the sequences in the unpaired regions. This alignment partitions sequences $A$ and $B$ into alternating stack and non-stack regions $A_1, A_2, \ldots, A_k$, and $B_1, B_2, \ldots, B_k$. Each pair of sequences $(A_i, B_i)$ is aligned optimally. We define such an alignment as a configuration. The cost of the configuration $(A, B, P(A, B))$ is defined as a function of sequence similarity and the energy of the consensus structure. Denote $i \in P(A, B)$ if and only if $(A_i, B_i)$ are paired in a stack with some $(A_j, B_j)$ in $P(A, B)$. Let $S(A_i, B_i)$ denote the cost of an optimal global alignment of subsequences $A_i$ and $B_i$. The cost of the configuration $(A, B, P(A, B))$ is denoted by

$$M(P(A,B)) = w_1 \Phi(P(A,B)) + w_2 \sum_{i \in P(A,B)} S(A_i, B_i) + w_3 \sum_{i \notin P(A,B)} S(A_i, B_i) \qquad (4.2)$$

where $w_1 + w_2 + w_3 = 1$ are parameters describing the relative weights of the free energy of the configuration, sequence similarity in stack regions, and sequence similarity within loop regions. Ideally, these weights should be adjusted according to the number and divergence of the given sequences. However, in the tests throughout this chapter, we use an identical set of weights ($w_1 = 0.84$, $w_2 = 0.06$, and $w_3 = 0.1$). For a given pair, we compute consensus structures of minimum cost.
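To make the weighting in Equation 4.2 concrete, the sketch below evaluates the cost of a given configuration in Python. It rests on stated assumptions rather than on the RNAscf code: the alignment cost $S$ is taken to be the unit-cost edit distance, the energies of corresponding stacks and loops are supplied by the caller, and the names `edit_distance`, `consensus_energy`, and `configuration_cost` are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost global alignment cost, used here as a stand-in for S."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # gap in b
                            curr[j - 1] + 1,            # gap in a
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def consensus_energy(energy_pairs):
    """Phi: for each pair of corresponding elements, keep the larger energy."""
    return sum(max(e_a, e_b) for e_a, e_b in energy_pairs)


def configuration_cost(energy_pairs, blocks, weights=(0.84, 0.06, 0.10)):
    """Equation 4.2 for blocks given as (A_i, B_i, is_stack_region)."""
    w1, w2, w3 = weights
    stack_cost = sum(edit_distance(a, b) for a, b, in_stack in blocks if in_stack)
    loop_cost = sum(edit_distance(a, b) for a, b, in_stack in blocks if not in_stack)
    return w1 * consensus_energy(energy_pairs) + w2 * stack_cost + w3 * loop_cost
```

With the default weights, a configuration whose consensus energy is $-1.0$ and whose stack and loop blocks each differ by a single substitution has cost $0.84 \cdot (-1.0) + 0.06 + 0.1 = -0.68$.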
The definition of a configuration of stacks for a pair of sequences also extends to multiple sequences: A configuration $P(A_1, A_2, \ldots, A_s)$ is a collection of $s$ RNA structures $\{P^{A_1}, P^{A_2}, \ldots, P^{A_s}\}$, one for each sequence, with the following property: For each pair of structures, there is a one-one correspondence between the stacks that is consistent with the partial orders. A configuration with $l$ stacks partitions each sequence $A_i$ into $2l + 1$ blocks denoted $A_{i,1}, A_{i,2}, \ldots, A_{i,2l+1}$, where each block $A_{*,j}$ is either a stack in the configuration ($j \in P$), or part of the loop region. We modify Equation 4.2 to describe the cost of the configuration $P(A_1, A_2, \ldots, A_s)$, as follows:

$$M(P(A_1,\ldots,A_s)) = w_1 \Phi(P(A_1,\ldots,A_s)) + w_2 \sum_{j \in P(A_1,\ldots,A_s)} S(A_{1,j}, A_{2,j}, \ldots, A_{s,j}) + w_3 \sum_{j \notin P(A_1,\ldots,A_s)} S(A_{1,j}, A_{2,j}, \ldots, A_{s,j}) \qquad (4.3)$$

Here, the function $S$ computes the score of a multiple alignment. The RNA stack based consensus folding problem can be described formally: given $s$ RNA sequences, compute a minimum cost stack configuration. In the following section, we describe algorithms for computing optimal configurations.

4.3 Stack-based Consensus Folding

4.3.1 Computing Optimal Stack Configuration in Two RNA Sequences

We use dynamic programming to compute an optimal configuration. The algorithm is similar to prior work [5, 91] with an important difference being that stacks (instead of individual base pairs) are now used. Given sequences $A$, $B$, we compute all potential stacks in them, using the algorithm from Section 4.2.1. Assume these two sequences have $m$ and $n$ stacks respectively. Let $\mathcal{P}^A = P^A_1, P^A_2, \ldots, P^A_m$ and $\mathcal{P}^B = P^B_1, P^B_2, \ldots, P^B_n$ denote the stacks, ordered according to increasing values of the right-most base pair. Denote the index of the first and last base pair of a stack $P$ as $P_b$, $P_e$, and the length as $P_l$. Define the following terms:

Seq($P^A$): The sub-sequence covered by the stack $P^A$, given by $A[P^A_b \ldots P^A_b + P^A_l - 1]$, and $A[P^A_e - P^A_l + 1 \ldots P^A_e]$.

Loop($P^A$): The sub-sequence covered by the first and last positions of the stack $P^A$ after excluding the bases in Seq($P^A$). In other words, the sequence $A[P^A_b + P^A_l \ldots P^A_e - P^A_l]$.

Loop($P^A$, $P^{A'}$): If $P^{A'}$ is enclosed within $P^A$, then the loop region corresponds to the sequence in between the two stacks (i.e., the subsequences $A[P^A_b + P^A_l \ldots P^{A'}_b - 1]$, and $A[P^{A'}_e + 1 \ldots P^A_e - P^A_l]$). If $P^{A'}$ is to the left of $P^A$, the loop region corresponds to $A[P^{A'}_e + 1 \ldots P^A_b - 1]$. Otherwise, the term is undefined.

M[$P^A$, $P^B$]: The cost of an optimum configuration of $A$ and $B$ over all consensus structures, given that stacks $P^A$ and $P^B$ are in the consensus structure and aligned to each other.
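The index arithmetic in the definitions of Seq and Loop can be checked with a small Python sketch. The `Stack` record and the helper names are hypothetical and introduced only for illustration; positions are 1-based and inclusive, as in the definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    b: int  # P_b: first (left-most) position of the stack
    e: int  # P_e: last (right-most) position of the stack
    l: int  # P_l: number of bases on each strand


def sub(seq, lo, hi):
    """Return seq[lo..hi] with 1-based inclusive bounds (empty if lo > hi)."""
    return seq[lo - 1:hi] if lo <= hi else ""


def seq_of(seq, p):
    """Seq(P): the two strands covered by the stack."""
    return sub(seq, p.b, p.b + p.l - 1), sub(seq, p.e - p.l + 1, p.e)


def loop_of(seq, p):
    """Loop(P): everything between the two strands of the stack."""
    return sub(seq, p.b + p.l, p.e - p.l)


def loop_between(seq, p, q):
    """Loop(P, Q): regions between P and a stack Q enclosed by, or left of, P."""
    if p.b + p.l <= q.b and q.e <= p.e - p.l:   # Q is enclosed within P
        return sub(seq, p.b + p.l, q.b - 1), sub(seq, q.e + 1, p.e - p.l)
    if q.e < p.b:                               # Q lies to the left of P
        return (sub(seq, q.e + 1, p.b - 1),)
    return None                                 # undefined otherwise
```

For the toy sequence `GGGACCUUUGGACCC` with an outer stack `Stack(1, 15, 3)` and an inner stack `Stack(5, 11, 2)`, `loop_between` returns the two single-base regions `("A", "A")`.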

Clearly, it is sufficient to compute $M[P^A, P^B]$ for all pairs in $\mathcal{P}^A \times \mathcal{P}^B$, which would need $O(m^2 n^2)$ time. In computing $M[P^A, P^B]$, we have 3 choices for the subsequences Loop($P^A$) and Loop($P^B$), as they could either form a hairpin, an interior loop/bulge, or a multi-loop. Therefore,

$$M[P^A, P^B] = M_s[P^A, P^B] + \min \begin{cases} M_h[P^A, P^B] & \text{(hairpin loop)} \\ M_b[P^A, P^B] & \text{(interior loop/bulge)} \\ M_m[P^A, P^B] & \text{(multi-loop)} \end{cases} \qquad (4.4)$$

Here, $M_s[P^A, P^B]$ is the score of matching stacks $P^A$ and $P^B$, based on sequence and structure conservation, and can be computed by

$$M_s[P^A, P^B] = w_1 \max\{E_s(P^A), E_s(P^B)\} + w_2 S(\mathrm{Seq}(P^A), \mathrm{Seq}(P^B)) \qquad (4.5)$$

$M_h[P^A, P^B]$ is the score of the loop regions of $P^A$ and $P^B$ given that no other matched stack pair is included by $P^A$ and $P^B$, i.e. these regions form matched hairpin loops.

$$M_h[P^A, P^B] = w_1 \max\{E_h(|\mathrm{Loop}(P^A)|), E_h(|\mathrm{Loop}(P^B)|)\} + w_3 S(\mathrm{Loop}(P^A), \mathrm{Loop}(P^B)) \qquad (4.6)$$

$M_b[P^A, P^B]$ represents the matching score when $P^A$ and $P^B$ are followed by an interior loop, or bulge. Consider all stacks $P^x$, $P^y$ that are enclosed by $P^A$ and $P^B$, respectively. Then, $M_b[P^A, P^B]$ is the minimum free energy of any matching of $P^x$, $P^y$.

$$M_b[P^A, P^B] = \min_{P^x, P^y} \Big\{ w_1 \max\{E_b(|\mathrm{Loop}(P^x, P^A)|), E_b(|\mathrm{Loop}(P^y, P^B)|)\} + w_3 S(\mathrm{Loop}(P^x, P^A), \mathrm{Loop}(P^y, P^B)) + M[P^x, P^y] \Big\} \qquad (4.7)$$

sequence of stacks $P_1, P_2, \ldots$ form a chain if $P_1$

included by $P^A$ and $P^B$ which form the multi-loop. Let $P^A_1, P^A_2, \ldots$ (respectively $P^B_1, P^B_2, \ldots$) denote stacks enclosed by $P^A$ ($P^B$, respectively), and ordered according to increasing values of the last coordinate. Denote $P^A_{i_1} \in F(P^A_{i_2})$ if $P^A_{i_1}$

Here, $M_c[P^A_i, P^B_j]$ is defined as the minimum energy of a chain that ends at $P^A_i$ and $P^B_j$, and begins at some $P^A_{i'}$

where $M_o[P^A_x, P^A_i; P^B_y, P^B_j]$ is the minimum free energy of the matching between the loops $(P^A_x, P^A_i)$ and $(P^B_y, P^B_j)$,

$$M_o[P^A_x, P^A_i; P^B_y, P^B_j] = w_1 \max\{E_b(|\mathrm{Loop}(P^A_x, P^A_i)|), E_b(|\mathrm{Loop}(P^B_y, P^B_j)|)\} + w_3 S(\mathrm{Loop}(P^A_x, P^A_i), \mathrm{Loop}(P^B_y, P^B_j)) \qquad (4.10)$$

4.3.2 Consensus Fold Computation for Multiple RNA Sequences

The optimal configuration of a randomly chosen pair of sequences from a family already shows high sensitivity (data not shown). It is likely that an optimal configuration of structures conserved in diverse multiple sequences will be very accurate. Recall the cost of the configuration as Equation 4.3, where S denotes the score of a multiple alignment of the block. Clearly, the problem of computing an optimal configuration is hard, given the discussion for the pairwise case. Here, we use a heuristic principle based on the notion of a star-alignment, with a seed configuration chosen from an optimal configuration of a random pair of sequences. To understand why our approach should work, we describe a back-of-the-envelope calculation. Consider a stack x of length k from the seed structure defined on sequence

A1 that is in fact incorrect (does not overlap with a true stack). For x to be retained in the final anchored configuration, it must match with stacks in a large fraction of the other sequences. Let p be the probability that a random pair of bases can form base pairs. Given an interval (i, j) in some sequence, the probability that (i, j) is the end of a stack of length k is pk. Thus, the probability that x is matched up with some random set of base pairs defined by the end-points (i, j) is no more than pk, even after ignoring sequence similarity. However, x cannot be matched to any other arbitrary stack. As we also score for primary sequence conservation, and the match should maintain the partial order of the configuration, (i, j) and x must be ‘similarly situated’. To model this, we introduce a parameter w. Define w as the number of distinct pairs (i, j) to which x is allowed to be matched. Then, the

k w probability that x finds a match by chance is given by px = 1−(1−p ) . Allowing a more flexible definition, we say that x is f-conserved in the configuration if it finds a match in at least (1 − f)s of the s sequences. µ ¶ X s P r[x is f-conserved] = P (x) = pl (1 − p )s−l (4.11) c l x x l≥(1−f)s

Fix some parameters as follows. Let p = 3/8 (corresponding to G-C, G-U, A-U) and f = 0.7. Table 4.1 describes the impact on Pc(x) of varying k, s, and w. The probability of getting an incorrect conserved stack depends critically on the parameter w. If w is too large, there is a high probability that random stacks match up. This effect might be offset by increasing k, but then we risk losing many true (shorter) stacks, which may cause incorrect pairs to be matched. The effect is also offset (to a lesser degree) by increasing the number of sequences, but that may not always be possible. Our approach works because the choice of a conserved configuration restricts the possible stacks that x can match to, effectively keeping the value of w low.

Table 4.1 Effect of parameters k, w and s on the probability of predicting conserved stacks at random. A large w greatly increases the probability of an incorrect prediction. [113] has performed similar statistical analysis.

k   w    s    Pc(x)
4   10   20   0.0008
4   40   20   0.13
4   80   20   0.91
5   80   20   0.02
6   80   20   1.8e-6
5   80   40   0.001
5   80   60   7.35e-05

Before describing the approach, we must first modify the formulation to allow stacks to be partially conserved, and therefore absent from some sequences. For 0 < f ≤ 1, define an f-configuration as a configuration with the following property: for every set of s corresponding stacks (one from each sequence), at most (1 − f)s can be absent. In Figure 4.3, we describe a procedure for computing an anchor configuration. The anchor configuration consists of stacks that optimize the cost of the configuration and are conserved across the family. Thus, the stacks are highly likely to be correct. However, the procedure might also miss some true stacks because of the high initial value of k and the requirement of conservation.

Procedure ComputeAnchorConfiguration(k, f)

1. Pick a pair of sequences (A, B) at random from the set ℛ of RNA sequences.

2. Compute putative stacks P_A from A, and P_B from B, with minimum length k. k is chosen according to the lengths of the sequences in ℛ, and typically k = 4.

3. Compute the optimum configuration. Reduce P_A to retain only the stacks from the optimum configuration. Denote the reduced set as P′_A.

4. For each sequence R ∈ ℛ, compute the optimum pairwise configuration of (A, R) using the reduced set P′_A. Denote by M[(A, B), ℛ] the sum of the configuration costs.

5. Repeat Steps 1-4 for various random choices of (A, B), and pick the pair (A, B) with minimum configuration cost M[(A, B), ℛ].

6. Retain only the stacks in P′_A that appear in a 1 − f fraction of the sequences in ℛ. Denote the subset as P″_A. Output P″_A as the anchored structure of ℛ.

Figure 4.3 The procedure for computing anchor configuration.

To increase sensitivity, we now search for less conserved and shorter stacks. However, the new stacks are forced to be consistent with the anchor configuration. Recall from the definition of loops in Section 4.2.1 that all unpaired bases can be uniquely assigned to a loop. Additionally, a stack does not interleave with the anchor configuration if and only if it is defined on unpaired bases within a single loop. This forms the basis of the final procedure RNAscf. See Figure 4.4.
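The control flow of Figure 4.3 can be outlined as follows. This is only a sketch under stated assumptions: the three callables (`find_stacks`, `optimal_pair`, `align_to`) are hypothetical stand-ins for the putative-stack computation and the pairwise configuration optimization described earlier in the chapter, and `min_fraction` and `trials` are illustrative parameters rather than RNAscf options.

```python
import random


def compute_anchor_configuration(sequences, find_stacks, optimal_pair, align_to,
                                 min_fraction, trials=10, rng=None):
    """Outline of Figure 4.3 with the heavy steps injected as (hypothetical) callables.

    find_stacks(seq)          -> collection of putative stacks of seq        (Step 2)
    optimal_pair(p_a, p_b)    -> set of stacks of A kept in the optimum      (Step 3)
    align_to(reduced_a, seq)  -> (cost, set of stacks of A matched in seq)   (Step 4)
    """
    rng = rng or random.Random()
    best = None
    for _ in range(trials):                                    # Step 5: several random pairs
        a, b = rng.sample(range(len(sequences)), 2)            # Step 1
        p_a = find_stacks(sequences[a])                        # Step 2
        p_b = find_stacks(sequences[b])
        reduced_a = optimal_pair(p_a, p_b)                     # Step 3
        results = [align_to(reduced_a, r) for r in sequences]  # Step 4
        total_cost = sum(cost for cost, _ in results)
        if best is None or total_cost < best[0]:
            best = (total_cost, reduced_a, results)
    _, reduced_a, results = best
    needed = min_fraction * len(sequences)
    # Step 6: keep the stacks matched in enough sequences.
    return {
        stack for stack in reduced_a
        if sum(stack in matched for _, matched in results) >= needed
    }
```

Passing the optimization steps in as functions keeps the outline independent of how configuration costs are actually computed.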

4.3.3 Implementation Details

The program (RNAscf) is implemented in C, and is available upon request. In the default setting, we limit ourselves to two iterations. For the first iteration, the default parameters are k = 4, h = 8 (k is the minimum length and h is the minimum number of hydrogen bonds in the putative stacks) and g = 0 (no unpaired bases allowed in stacks). For the second iteration, the default settings are changed to k = 3, h = 6.
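To make the roles of k, h, and g = 0 concrete, here is a small Python sketch that enumerates maximal runs of contiguous base pairs (no unpaired bases inside a stack). It is not the RNAscf implementation, which is written in C. The per-pair hydrogen-bond counts and the minimum hairpin loop size `min_loop` are assumptions introduced for illustration, since the text does not list them.

```python
# Assumed hydrogen-bond counts per base pair (not given in the text).
PAIR_BONDS = {
    ("G", "C"): 3, ("C", "G"): 3,
    ("A", "U"): 2, ("U", "A"): 2,
    ("G", "U"): 2, ("U", "G"): 2,
}


def putative_stacks(seq, k=4, h=8, min_loop=3, bonds=PAIR_BONDS):
    """Return (i, j, length) for maximal contiguous stacks whose outermost pair is (i, j).

    A stack is kept if it has at least k base pairs and at least h hydrogen bonds.
    Contiguity (g = 0) means pair t of the stack is (i + t, j - t).
    """
    n = len(seq)
    stacks = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if (seq[i], seq[j]) not in bonds:
                continue
            # Skip pairs that merely continue a run starting further out.
            if i > 0 and j + 1 < n and (seq[i - 1], seq[j + 1]) in bonds:
                continue
            a, b, length, total = i, j, 0, 0
            # Extend inward while bases pair and at least min_loop bases stay enclosed.
            while b - a > min_loop and (seq[a], seq[b]) in bonds:
                total += bonds[(seq[a], seq[b])]
                length += 1
                a += 1
                b -= 1
            if length >= k and total >= h:
                stacks.append((i, j, length))
    return stacks
```

For example, `putative_stacks("GGGGAAAACCCC")` reports a single stack of four G-C pairs closing the AAAA loop, and lowering k to 3 admits additional shorter candidates.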

Procedure RNAscf(k, f)

1. P = ComputeAnchorConfiguration(k, f)

2. In each sequence, partition the unpaired bases according to their loop region in P.

3. For every loop region that has a minimum number of unpaired bases, predict additional putative stacks with k′ < k. Each 'arm' of the stack is constrained to have contiguous base pairs.

4. For each stack in the optimal configuration that was not present in every member of the family, recompute the alignment with the additional putative stacks to retrieve less conserved stacks.

5. For each set of loop regions and potential stacks, recurse using RNAscf(k′, f) to compute additional stacks in the loop regions.

Figure 4.4 The procedure RNAscf for computing consensus folds.
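Step 2 of Figure 4.4, together with the single-loop consistency condition, can be illustrated with the sketch below. It assumes the anchor configuration is given as a list of nested (non-crossing) base pairs and identifies each loop by its closing pair, with `None` standing for the exterior region. These representations and function names are assumptions made for the example, not the data structures used in RNAscf.

```python
def assign_loops(n, anchor_pairs):
    """Map every unpaired position in 0..n-1 to the innermost anchor pair enclosing it.

    anchor_pairs must be nested (no crossing pairs). Exterior bases map to None.
    """
    partner = {}
    for i, j in anchor_pairs:
        partner[i] = j
        partner[j] = i
    loop_of = {}
    open_pairs = []  # pairs whose 5' base has been seen but not yet the 3' base
    for pos in range(n):
        if pos in partner:
            if partner[pos] > pos:
                open_pairs.append((pos, partner[pos]))
            else:
                open_pairs.pop()
        else:
            loop_of[pos] = open_pairs[-1] if open_pairs else None
    return loop_of


def fits_single_loop(stack_pairs, loop_of):
    """True if all bases of the candidate stack are unpaired and lie in the same loop."""
    loops = set()
    for i, j in stack_pairs:
        if i not in loop_of or j not in loop_of:
            return False  # uses a base already paired in the anchor configuration
        loops.add(loop_of[i])
        loops.add(loop_of[j])
    return len(loops) == 1
```

A candidate stack that passes `fits_single_loop` cannot interleave with the anchor stacks, so it can be added without breaking the nesting of the anchor configuration.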

4.4 Testing Results

To test the performance of RNAscf, we chose a set of 12 RNA families from the Rfam database [35]. Twenty sequences were chosen for each family, except for CRE (RF00220) and glmS (RF00234), for which we chose 10 sequences each. Stacks were retrieved from the annotated structures for each of these sequences. In all, there are 953 stacks. We chose three other programs against which to compare RNAscf, each the best representative of a different methodology: RNAfold, which is an implementation of energy-based minimization [45] from the Vienna package [41]; COVE, which is an implementation of the covariance model [26]; and comRNA [46], which is based on computing anchors in multiple sequence alignment. Only comRNA predicts stacks explicitly. COVE and RNAfold do not explicitly predict stacks, but most of their base pairs appear in stacks. For best results, we ignored unstacked base pairs (with unpaired bases on either side) for RNAfold and COVE. Larger stack length cutoffs were also tried, but this choice gave the best balance of sensitivity and accuracy. Sensitivity is defined as the fraction of true stacks that overlapped with predicted stacks. A sensitivity of 1 would imply that all true stacks overlapped with some predicted stacks. Correspondingly, accuracy is the fraction of predicted stacks that overlapped with a true stack. As COVE expects aligned sequences, we aligned the sequences using ClustalW [102]. The alignment was used to train the covariance model, and the model was then used to align sequences and predict structure. We also ran COVE on unaligned sequences, but the performance in that case was inferior to the performance on ClustalW-aligned sequences. Figure 4.5 shows plots of the sensitivity and accuracy of all programs on the test set (detailed numbers are shown in Table 4.2).
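The two measures defined above can be computed as in the following sketch. It assumes each stack is represented as a collection of base pairs (i, j) and that two stacks "overlap" when they share at least one base pair. The text does not spell out the overlap criterion, so that choice is an assumption of this example.

```python
def overlaps(stack_a, stack_b):
    """Assumed criterion: two stacks overlap if they share at least one base pair."""
    return not set(stack_a).isdisjoint(stack_b)


def sensitivity_accuracy(true_stacks, predicted_stacks):
    """Return (sensitivity, accuracy) following the definitions in the text.

    sensitivity = fraction of true stacks overlapping some predicted stack
    accuracy    = fraction of predicted stacks overlapping some true stack
    """
    true_hits = sum(
        any(overlaps(t, p) for p in predicted_stacks) for t in true_stacks
    )
    pred_hits = sum(
        any(overlaps(p, t) for t in true_stacks) for p in predicted_stacks
    )
    sensitivity = true_hits / len(true_stacks) if true_stacks else 0.0
    accuracy = pred_hits / len(predicted_stacks) if predicted_stacks else 0.0
    return sensitivity, accuracy
```

Under this criterion, a program that predicts few but well-supported stacks scores high accuracy and low sensitivity, while a program that predicts many stacks shows the opposite tendency.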
As can be seen in Table 4.2, RNAscf is at the top or near the top in every family, and maintains high sensitivity and accuracy throughout, with an average accuracy of 0.926 and an average sensitivity of 0.884. Only comRNA shows consistently high accuracy, because it predicts very few stacks (leading to low sensitivity), and these are well conserved in both sequence and structure. COVE occasionally shows very poor sensitivity, possibly because of an incorrect seed alignment. RNAfold predicts many stacks, and therefore has good sensitivity, but the extra predictions lead to a loss of accuracy. While our method shows robust performance for a limited number of given RNA sequences, its performance improves as the number of given sequences increases. Figure 4.6 shows the sensitivity and accuracy as the number of sequences increases for the thiamine sub-family. Both the sensitivity and accuracy exceed 0.9 when s = 80. Similar results were obtained for other large families.

[Figure 4.5: bar charts of (a) sensitivity and (b) accuracy, on a scale from 0 to 1, for RNAscf, RNAfold, COVE, and comRNA on each of the 12 RNA families.]

Figure 4.5 Sensitivity and accuracy of RNA secondary structure prediction on 12 RNA families. The default parameters for RNAscf are f = 0.7, k = 4, and h = 8 for the first iteration. RNAfold is run under default parameters. COVE is run under the default parameters, using the multiple alignment from ClustalW as input. comRNA is run under the recommended parameters (p = 0.7 and s = 0.56). (*) There are only 10 sequences in these families. (+) RNAscf is run under k = 3 and h = 6 on these families, due to their small size.

Table 4.2 A complete list of the comparison of sensitivity and accuracy of RNA secondary structure prediction on the 12 RNA families shown in Figure 4.5.

                                  Sensitivity                       Accuracy
Name (Rfam id)            Stacks  RNAscf  RNAfold  COVE   comRNA   RNAscf  RNAfold  COVE   comRNA
5s rRNA (RF00001)         100     0.86    0.67     0.75   0.35     0.9     0.558    0.658  1
Rhino CRE (RF00220)       20      0.8     0.85     0.5    0.5      0.941   0.72     0.941  1
ctRNA pGA1 (RF00236)      51      0.882   0.882    0.333  0.313    1       1        0.5    1
glmS (RF00234)            60      0.983   0.683    0.483  0.283    0.983   0.552    0.906  0.944
Hammerhead 3 (RF00008)    60      1       0.883    0.833  0.25     1       0.828    0.943  1
Intron gpII (RF00029)     56      0.91    0.946    0.821  0.339    0.782   0.731    0.754  1
Lysine (RF00168)          120     0.925   0.775    0.783  0.142    0.958   0.633    0.984  0.531
Purine (RF00167)          60      0.917   0.917    0.983  0.6      0.917   0.753    0.797  1
Sam riboswitch (RF00162)  113     0.903   0.858    0.425  0.168    0.966   0.813    0.613  1
Thiamine (RF00059)        80      0.858   0.690    0.354  0        0.824   0.654    0.606  -
tRNA (RF00005)            113     0.912   0.625    0.975  0.5      0.973   0.567    0.910  1
ykok (RF00380)            180     0.656   0.817    0.606  0.3      0.863   0.762    0.727  0.692
Average                           0.884   0.8      0.654  0.312    0.926   0.714    0.778  0.924

[Figure 4.6: line plot of sensitivity and accuracy (vertical axis, 0.8 to 0.94) against the number of input sequences (horizontal axis, 0 to 80).]

Figure 4.6 Improved sensitivity and accuracy of RNAscf as the number of input sequences grows for the thiamine family.

Finally, we emphasize that even though we sometimes cannot get all the stacks in all the given sequences, the consensus structure obtained by RNAscf is always the right configuration; the prediction errors in a few input sequences are usually due to an incorrect stack that is very close to a correct one. Since this cannot be quantified, we use one example to demonstrate that a minor prediction error in a few given sequences does not affect the prediction of the common structure. Figure 4.7 shows the configurations predicted by the four programs on the SAM riboswitch. Clearly, RNAfold predicts the correct configuration for some of the RNA sequences, but makes the wrong prediction for others. This is not surprising, because it analyzes each RNA sequence separately and does not presume that they share a common structure. However, it is difficult to derive their common structure from these results. comRNA tends to miss many real stacks, although the ones it predicts are often correct. COVE predicts some correct stacks, but it may miss other correct stacks and also predict some wrong ones. Similar results were seen for all families.

[Figure 4.7: stack-configuration diagrams in panels (a), (b), (c), (d1)-(d3), and (e1)-(e3).]

Figure 4.7 A comparison of predicted stack configurations by different programs. (a) The true consensus stack configuration for the SAM riboswitch (RF00162). (b) RNAscf prediction. (c) comRNA prediction. (d1)-(d3) The first three RNAfold predictions. (e1)-(e3) The first three COVE predictions. Note that RNAfold and COVE are not limited to predicting conserved stack configurations and, therefore, potentially give a different answer for each sequence. Thick, dashed, and thin lines represent true stacks, missed stacks, and wrong stacks, respectively, in the corresponding predicted configurations.

4.5 Summary

In conclusion, RNAscf establishes the principle that selecting anchored stacks based on seed configurations, and predicting the consensus structure subject to the anchored constraints, is a valid approach to RNA structure prediction. Our future work will be aimed at correcting errors by using a stochastic iterative scheme such as Gibbs sampling [56]. In each step, we will remove a stack from the consensus structure, and add a new stack sampled from the possibilities that are consistent with the remaining configuration, weighted according to the energy. Early experiments have shown the promise of such refinement.

This chapter, in part, is a reprint of the paper "Consensus folding of unaligned RNA sequences revisited", co-authored with Vineet Bafna and Haixu Tang, in Journal of Computational Biology, Vol. 13, Issue 2, pp. 283–295, 2006. The dissertation author was the primary investigator and author of this paper.

5 Conclusions

5.1 Summary of Contribution

In this dissertation, we have focused on how to speed up ncRNA homolog search and how to predict the consensus secondary structure from multiple unaligned RNAs.

To speed up ncRNA homolog search, we have constructed two prototype tools: FastR (using structure-based filters) to speed up single ncRNA sequence homolog search, and PFastR (using sequence-based filters) to speed up ncRNA family homolog search. State-of-the-art methods for the problem, like covariance models, suffer from high computational cost, underscoring the need for efficient filtering approaches that can identify promising sequence segments and speed up the detection process. Our approach, based on structural and sequence filters that eliminate a large portion of the database while retaining the true homologs, allows us to search a typical bacterial database in minutes on a standard PC with high sensitivity and specificity.

We have also applied FastR and PFastR to the discovery of novel ncRNAs and detected novel riboswitch elements in bacterial genomes and environmental sequence data. Our results point to a number of novel riboswitch candidates, and include genomes that were not previously known to contain riboswitches. Some of the predicted cobalamin riboswitches have been experimentally tested and confirmed by our microbiologist collaborators.

The PFastR web server is implemented based on the sequence-based filtering idea. We have trained on more than 570 families from the Rfam database to determine the optimal parameters for the filtering procedure. The PFastR server provides the first web-based fast RNA homolog search.

For RNA consensus folding, we proposed a novel framework to predict the common secondary structure of unaligned RNA sequences and constructed a prototype tool: RNAscf (RNA stack-based consensus folding). By matching putative stems in RNA sequences, we make use of both primary sequence information and thermodynamic stability at the same time. We show that our method can predict correct common RNA secondary structures even when we are given only a limited number of unaligned RNA sequences, and that it outperforms current algorithms in both sensitivity and accuracy.

5.2 Future Work

Structured ncRNA genes or elements potentially play very important roles in various forms of post-transcriptional regulation, such as alternative splicing, mRNA degradation, and translational regulation. Recently, Haussler's group found a novel neuron-specific RNA gene (forming a stable secondary structure) in the human genome that shows significant mutations when compared with other mammals (including chimpanzee) [77]. This finding indicates that ncRNA genes in the human genome may harbor most of the vital changes that separate humans from other mammals [78]. The systematic discovery and identification of these functional ncRNA genes will give us a better understanding of our genomes.

A number of interesting questions are left open for future research. The first open question is how to find structure-conserved RNA elements de novo. Many programs (such as RNAz [110] and EvoFold [75]) have been proposed to identify RNA consensus structures from multispecies alignments. Despite the divergent comparative sequence data in the alignment regions, both RNAz and EvoFold still have very high false positive rates (50%-70%) [111]. Figure 5.1 and Figure 5.2 show that the scores that RNAz and EvoFold compute for known RNA alignments (native) are still hard to distinguish from those for shuffled alignments (random).

Figure 5.1 RNAz classifies alignments using a support vector machine [111].

Figure 5.2 EvoFold scores on the alignments [111].

We know that aligning multiple, divergent genomic sequences so as to preserve their conserved structures is not easy. This is because many compensatory mutations decrease the overall sequence similarity. We have found many misaligned columns in the multiple genome alignments that destroy the RNA signal. We also found that some RNA stacks may "shift" along the alignment (see Figure 5.3). Neither RNAz nor EvoFold picks up this additional signal. We are currently trying to incorporate these two features into our future ncRNA identification tool.

Figure 5.3 Shifted stacks on a multispecies alignment. Base pairs on the boundary of stacks are not conserved across all species (base pairs in light color).

Another interesting but algorithmically challenging topic related to ncRNA genes is the genome-wide discovery of RNA-RNA interactions. RNA-RNA interaction prediction has become more and more important since microRNAs were discovered. MicroRNAs usually bind to their targets with imperfect complementarity. Many computational methods have been developed for microRNA target prediction, and all have relatively high false-positive rates. Another potential application of RNA-RNA interaction algorithms is to study the targets of riboregulators and the alternative splicing mechanism.

The study of ncRNAs and their functions is still at an early stage. It is estimated that there are as many ncRNA genes as protein-coding genes. Discoveries about ncRNAs will push our understanding of genomes forward, and these new findings will serve as a source of insight into our own genomes.

Bibliography

[1] P. L. Adams, M. R. Stahley, A. B. Kosek, J. Wang, and S. A. Strobel. Crystal structure of a self-splicing group I intron with both exons. Nature, 430(6995):45–50, Jul 2004.
[2] T. Akutsu. Dynamic programming algorithm for RNA secondary structure prediction with pseudoknots. Disc. Appl. Math., 104:45–62, 2000.
[3] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
[4] L. Argaman et al. Novel small RNA-encoding genes in the intergenic regions of Escherichia coli. Current Biology, 11:941–950, 2001.
[5] V. Bafna, S. Muthukrishnan, and R. Ravi. Computing similarity between RNA strings. In Z. Galil and E. Ukkonen, editors, Proceedings of the 6th Annual Symposium on Combinatorial Pattern Matching, pages 1–16, Espoo, Finland, 1995. Springer-Verlag, Berlin.
[6] J. E. Barrick and R. R. Breaker. The power of riboswitches. Scientific American, 296(1):50–57, Jan 2007.
[7] S. E. Bergsten and E. R. Gavis. Role for mRNA localization in translational activation but not spatial restriction of nanos RNA. Development, 126(4):659–669, Feb 1999.
[8] M. Blanchette, W. J. Kent, C. Riemer, L. Elnitski, A. F. A. Smit, K. M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green, D. Haussler, and W. Miller. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res, 14(4):708–715, Apr 2004.
[9] K. F. Blount and R. R. Breaker. Riboswitches as antibacterial drug targets. Nat Biotechnol, 24(12):1558–1564, Dec 2006.
[10] D. Bouthinon and H. Soldano. A new method to predict the consensus secondary structure of a set of unaligned RNA sequences. Bioinformatics, 15(10):785–798, Oct 1999.


[11] N. Bray and L. Pachter. MAVID: Constrained Ancestral Alignment of Multiple Sequences. Genome Res., 14(4):693–699, 2004.
[12] M. Brudno, A. Poliakov, A. Salamov, G. Cooper, A. Sidow, E. Rubin, V. Solovyev, S. Batzoglou, and I. Dubchak. Automated whole-genome multiple alignment of rat, mouse, and human. Genome Res., 14(4):685–692, Apr 2004.
[13] S. Cawley, S. Bekiranov, H. H. Ng, P. Kapranov, E. A. Sekinger, D. Kampa, A. Piccolboni, V. Sementchenko, J. Cheng, A. J. Williams, R. Wheeler, B. Wong, J. Drenkow, M. Yamanaka, S. Patel, S. Brubaker, H. Tammana, G. Helt, K. Struhl, and T. R. Gingeras. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell, 116(4):499–509, Feb 2004.
[14] S. W.-L. Chan, D. Zilberman, Z. Xie, L. K. Johansen, J. C. Carrington, and S. E. Jacobsen. RNA silencing genes control de novo DNA methylation. Science, 303(5662):1336, Feb 2004.
[15] J.-H. Chen, S.-Y. Lee, and B. Shapiro. A computational procedure for assessing the significance of RNA secondary structure. CABIOS, 6:7–18, 1990.
[16] A. Condon, B. Davy, B. Rastegari, F. Tarrant, and S. Zhao. Classifying RNA Pseudoknotted Structures. Theoretical Computer Science, 320(1):35–50, 2004.
[17] J. Couzin. Breakthrough of the year: small RNAs make big splash. Science, 298(5602):2296–2297, 2002.
[18] A. Coventry, D. Kleitman, and B. Berger. MSARI: Multiple sequence alignments for statistical detection of RNA secondary structure. Proceedings of the National Academy of Sciences, 101(33):12102–12107, 2004.
[19] E. Davydov and S. Batzoglou. A computational model for RNA multiple structural alignment. Combinatorial Pattern Matching, 3109:254–269, 2004.
[20] D. di Bernardo, T. Down, and T. Hubbard. ddbRNA: detection of conserved secondary structures in multiple alignments. Bioinformatics, 19(13):1606–1611, 2003.
[21] R. M. Dirks and N. A. Pierce. A partition function algorithm for nucleic acid secondary structure including pseudoknots. J Comput Chem, 24(13):1664–1677, Oct 2003.
[22] B. Dost, B. Han, S. Zhang, and V. Bafna. Structural alignment of pseudoknotted RNA. In Proceedings of the Annual Intl. Conference on Computational Biology (RECOMB), 2006.

[23] M. Dsouza, N. Larsen, and R. Overbeek. Searching for patterns in genomic data. Trends Genet, 13(12):497–498, 1997.
[24] H. K. Duchow, J. L. Brechbiel, S. Chatterjee, and E. R. Gavis. The nanos translational control element represses translation in somatic cells by a Bearded box-like motif. Dev Biol, 282(1):207–217, Jun 2005.
[25] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis, chapter 10.3 Covariance models: SCFG-based RNA profiles. Cambridge University Press, 1998.
[26] S. Eddy and R. Durbin. RNA sequence analysis using covariance models. Nucleic Acids Research, 22:2079–2088, 1994.
[27] S. R. Eddy. Non-coding RNA genes and the modern RNA world. Nature Reviews in Genetics, 2:919–929, 2001.
[28] S. R. Eddy. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3:18, Jul 2002.
[29] A. Fire, S. Xu, M. Montgomery, S. Kostas, S. Driver, and C. Mello. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature, 391:806–811, 1998.
[30] A. Frank, S. Tanner, V. Bafna, and P. Pevzner. Peptide sequence tags for fast database search in mass-spectrometry. J Proteome Res, 4(4):1287–1295, Jul 2005.
[31] D. Gautheret and A. Lambert. Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. Journal of Molecular Biology, 313(5):1003–1011, 2001.
[32] S. J. Giovannoni, H. J. Tripp, S. Givan, M. Podar, K. L. Vergin, D. Baptista, L. Bibbs, J. Eads, T. H. Richardson, M. Noordewier, M. S. Rappe, J. M. Short, J. C. Carrington, and E. J. Mathur. Genome streamlining in a cosmopolitan oceanic bacterium. Science, 309(5738):1242–1245, Aug 2005.
[33] J. Gorodkin, L. J. Heyer, and G. D. Stormo. Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucl. Acids Res., 25(18):3724–32, Sep 1997.
[34] J. Gorodkin, S. L. Stricklin, and G. D. Stormo. Discovering common stem-loop motifs in unaligned RNA sequences. Nucl. Acids Res., 29(10):2135–2144, May 2001.
[35] S. Griffiths-Jones, A. Bateman, M. Marshall, A. Khanna, and S. Eddy. Rfam: an RNA family database. Nucleic Acids Research, 31(1):439–441, 2003.

[36] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33(Database issue):121–124, Jan 2005.
[37] D. Gusfield. Algorithms on strings, trees, and sequences. Cambridge University Press, 1997.
[38] G. Hannon. RNA interference. Nature, 418:244–251, 2002.
[39] M. Höchsmann, T. Töller, R. Giegerich, and S. Kurtz. Local similarity in RNA secondary structures. In 2nd IEEE Computer Society Bioinformatics Conference (CSB 2003), pages 159–168, 2003.
[40] I. Hofacker, B. Priwitzer, and P. Stadler. Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20(2):186–190, 2004.
[41] I. L. Hofacker. Vienna RNA secondary structure server. Nucl. Acids Res., 31(13):3429–3431, Jul 2003.
[42] I. L. Hofacker, M. Fekete, and P. F. Stadler. Secondary structure prediction for aligned RNA sequences. J. Mol. Biol., 319(5):1059–1066, Jun 2002.
[43] I. L. Hofacker, W. Fontana, P. F. Stadler, L. S. Bonhoeffer, M. Tacker, and P. Schuster. Fast folding and comparison of RNA secondary structures. Monatsh. Chem., 125:167–188, 1994.
[44] H. G. S. C. International. Finishing the euchromatic sequence of the human genome. Nature, 431(7011):931–945, Oct 2004.
[45] J. A. Jaeger, D. H. Turner, and M. Zuker. Improved prediction of secondary structures for RNA. Proceedings of the National Academy of Sciences, 86:7706–7710, 1989.
[46] Y. Ji, X. Xu, and G. D. Stormo. A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences. Bioinformatics, 20(10):1591–1602, Jul 2004.
[47] T. Jiang, G. Lin, B. Ma, and K. Zhang. A general edit distance between RNA structures. Journal of Computational Biology, 9:371–388, 2002.
[48] D. Kampa, J. Cheng, P. Kapranov, M. Yamanaka, S. Brubaker, S. Cawley, J. Drenkow, A. Piccolboni, S. Bekiranov, G. Helt, H. Tammana, and Gingeras. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res., 14(3):331–342, Mar 2004.
[49] T. Kiss. Small nucleolar RNAs: an abundant group of noncoding RNAs with diverse cellular functions. Cell, 109:145–148, 2002.

[50] R. Klein and S. Eddy. Rsearch: Finding homologs of single structured RNA sequences. BMC Bioinformatics, 4(1):44, 2003.
[51] R. Knight, A. Birmingham, and M. Yarus. BayesFold: rational 2 degrees folds that combine thermodynamic, covariation, and chemical data for aligned RNA sequences. RNA, 10(9):1323–1336, Sep 2004.
[52] B. Knudsen and J. Hein. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucl. Acids Res., 31(13):3423–3428, 2003.
[53] R. M. Kuhn, D. Karolchik, A. S. Zweig, H. Trumbower, D. J. Thomas, A. Thakkapallayil, C. W. Sugnet, M. Stanke, K. E. Smith, A. Siepel, K. R. Rosenbloom, B. Rhead, B. J. Raney, A. Pohl, J. S. Pedersen, F. Hsu, A. S. Hinrichs, R. A. Harte, M. Diekhans, H. Clawson, G. Bejerano, G. P. Barber, R. Baertsch, D. Haussler, and W. J. Kent. The UCSC genome browser database: update 2007. Nucleic Acids Res, 35(Database issue):668–673, Jan 2007.
[54] A. Lambert et al. The ERPIN server: an interface to profile-based RNA motif identification. Nucl. Acids Res., 32(s2):W160–165, 2004.
[55] E. Lander et al. Initial sequencing and analysis of the human genome. Nature, 409:860–921, 2001.
[56] C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262(5131):208–214, Oct 1993.
[57] S. Le, J. Chen, and J. Maizel. Structure and Methods: Human Genome Initiative and DNA Recombination, volume 1, pages 127–136. Adenine Press, 1990.
[58] N. Leibowitz, Z. Y. Fligelman, R. Nussinov, and H. J. Wolfson. Multiple structural alignment and core detection by geometric hashing. Proc Int Conf Intell Syst Mol Biol, pages 169–177, 1999.
[59] H. P. Lenhof, K. Reinert, and M. Vingron. A polyhedral approach to RNA sequence structure alignment. Journal of Computational Biology, 5(3):517–530, 1998.
[60] M. Levitt. Detailed molecular model for transfer ribonucleic acid. Nature, 224(221):759–763, Nov 1969.
[61] L. Lim, N. Lau, E. Weinstein, A. Abdelhakim, S. Yekta, M. W. Rhoades, C. Burge, and D. Bartel. The microRNAs of Caenorhabditis elegans. Genes and Development, 17:991–1008, 2003.
[62] R. A. Lippert, X. Zhao, L. Florea, C. Mobarry, and S. Istrail. Finding anchors for genomic sequence comparison. RECOMB, pages 233–241, 2004.

[63] H. Lodish. Molecular cell biology. W. H. Freeman and Company, 1999.
[64] T. Lowe and S. Eddy. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research, 25:955–964, 1997.
[65] M. Mandal, M. Lee, J. E. Barrick, Z. Weinberg, G. M. Emilsson, W. L. Ruzzo, and R. R. Breaker. A glycine-dependent riboswitch that uses cooperative binding to control gene expression. Science, 306(5694):275–279, Oct 2004.
[66] D. Mathews and D. Turner. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology, 317(2):191–203, 2002.
[67] J. McCutcheon and S. Eddy. Computational identification of non-coding RNAs in Saccharomyces cerevisiae by comparative genomics. Nucleic Acids Research, 31(14):4119–4128, 2003.
[68] A. Nahvi, N. Sudarshan, M. Ebert, X. Zou, K. Brown, and R. Breaker. Genetic control by a metabolite binding mRNA. Chemical Biology, 9:1043–1049, 2003.
[69] Nobel prize in physiology or medicine for 2006. http://nobelprize.org/nobel prizes/medicine/laureates/2006/info en.pdf, 2006.
[70] C. Novina and P. Sharp. The RNAi revolution. Nature, 430(6996):161–4, 2004.
[71] R. Nussinov and A. B. Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc. Natl. Acad. Sci. USA, 77(11):6309–6313, Nov 1980.
[72] R. Nussinov, G. Pieczenik, J. R. Griggs, and D. J. Kleitman. Algorithms for loop matchings. SIAM J. Appl. Math., 35(1):68–82, 1978.
[73] G. Pavesi, G. Mauri, M. Stefani, and G. Pesole. RNAProfile: an algorithm for finding conserved secondary structure motifs in unaligned RNA sequences. Nucl. Acids Res., 32(10):3258–3269, 2004.
[74] W. Pearson and D. Lipman. Improved Tools for Biological Sequence Comparison. Proceedings of the National Academy of Sciences, 85:2444–2448, 1988.
[75] J. S. Pedersen, G. Bejerano, A. Siepel, K. Rosenbloom, K. Lindblad-Toh, E. S. Lander, J. Kent, W. Miller, and D. Haussler. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol, 2(4):e33, Apr 2006.

[76] O. Perriquet, H. Touzet, and M. Dauchet. Finding the common structure shared by two homologous RNAs. Bioinformatics, 19(1):108–116, Jan 2003.
[77] K. S. Pollard, S. R. Salama, N. Lambert, M.-A. Lambot, S. Coppens, J. S. Pedersen, S. Katzman, B. King, C. Onodera, A. Siepel, A. D. Kern, C. Dehay, H. Igel, M. J. Ares, P. Vanderhaeghen, and D. Haussler. An RNA gene expressed during cortical development evolved rapidly in humans. Nature, 443(7108):167–172, Sep 2006.
[78] C. P. Ponting and G. Lunter. Evolutionary biology: human brain gene wins genome race. Nature, 443(7108):149–150, Sep 2006.
[79] E. Puerta-Fernandez, C. Romero-Lopez, A. Barroso-delJesus, and A. Berzal-Herranz. Ribozymes: recent advances in the development of RNA tools. FEMS Microbiol Rev., 27:75–97, 2003.
[80] B. Rastegari and A. Condon. Linear time algorithm for parsing RNA secondary structure. In 5th Workshop on Algorithms in Bioinformatics (WABI), 2005.
[81] T. Rastogi, T. L. Beattie, J. E. Olive, and R. A. Collins. A long-range pseudoknot is required for activity of the Neurospora VS ribozyme. EMBO J, 15(11):2820–2825, Jun 1996.
[82] E. Rivas and S. Eddy. A Dynamic Programming Algorithm for RNA Structure Prediction Including Pseudoknots. Journal of Molecular Biology, 285:2053–2068, 1999.
[83] E. Rivas and S. Eddy. Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics, 16(7):583–605, 2000.
[84] E. Rivas and S. Eddy. Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics, 2:8–26, 2001.
[85] E. Rivas, R. Klein, T. Jones, and S. Eddy. Computational identification of noncoding RNAs in E. coli by comparative genomics. Current Biology, 11:1369–1373, 2001.
[86] D. A. Rodionov, A. G. Vitreschak, A. A. Mironov, and M. S. Gelfand. Regulation of lysine biosynthesis and transport genes in bacteria: yet another RNA riboswitch? Nucleic Acids Research, 31(23):6748–6757, 2003.
[87] D. A. Rodionov, A. G. Vitreschak, A. A. Mironov, and M. S. Gelfand. Comparative genomics of the vitamin B12 metabolism and regulation in prokaryotes. J Biol Chem, 278(42):41148–41159, Oct 2003.
[88] D. A. Rodionov, A. G. Vitreschak, A. A. Mironov, and M. S. Gelfand. Regulation of lysine biosynthesis and transport genes in bacteria: yet another RNA riboswitch? Nucleic Acids Res, 31(23):6748–6757, Dec 2003.

[89] D. Rusch, A. Halpern, G. Sutton, K. Heidelberg, S. Williamson, S. Yooseph, D. Wu, J. Eisen, J. Hoffman, K. Remington, K. Beeson, B. Tran, H. Smith, H. Baden-Tillson, C. Stewart, J. Thorpe, J. Freeman, C. Andrews-Pfannkoch, J. Venter, K. Li, S. Kravitz, J. Heidelberg, T. Utterback, Y. Rogers, L. Falcon, V. Souza, G. Bonilla-Rosso, L. Eguiarte, D. Karl, S. Sathyendranath, T. Platt, E. Bermingham, V. Gallardo, G. Tamayo-Castillo, M. Ferrari, R. Strausberg, K. Nealson, R. Friedman, M. Frazier, and J. Venter. The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol, 5(3):e77, Mar 2007.
[90] Y. Sakakibara, M. Brown, R. Hughey, I. S. Mian, K. Sjölander, R. Underwood, and D. Haussler. Recent methods for RNA modeling using Stochastic Context Free Grammars. In Combinatorial Pattern Matching Conference. Lecture Notes in Computer Science, volume 807, 1994.
[91] D. Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 45(5):810–825, 1985.
[92] A. Siepel, G. Bejerano, J. S. Pedersen, A. S. Hinrichs, M. Hou, K. Rosenbloom, H. Clawson, J. Spieth, L. W. Hillier, S. Richards, G. M. Weinstock, R. K. Wilson, R. A. Gibbs, W. J. Kent, W. Miller, and D. Haussler. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res, 15(8):1034–1050, Aug 2005.
[93] T. F. Smith and M. S. Waterman. RNA Secondary structure. Math. Biosci., 42:257–266, 1978.
[94] Sorcerer II Expedition Press Release. http://www.venterinstitute.org /press/news /news 2006 01 17.php, Jan. 2006.
[95] G. D. Stormo. New tricks for an old dogma: riboswitches as cis-only regulatory systems. Mol. Cell, 11(6):1419–1420, Jun 2003.
[96] G. Storz. An expanding universe of noncoding RNAs. Science, 296(5571):1260–1263, May 2002.
[97] N. Sudarsan, J. E. Barrick, and R. R. Breaker. Metabolite-binding RNA domains are present in the genes of eukaryotes. RNA, 9(6):644–647, Jun 2003. Letter.
[98] N. Sudarsan, J. K. Wickiser, S. Nakamura, M. S. Ebert, and R. R. Breaker. An mRNA structure in bacteria that controls gene expression by binding lysine. Genes Dev, 17(21):2688–2697, Nov 2003.
[99] M. Szymanski, M. Barciszewska, V. Erdmann, and J. Barciszewski. 5S ribosomal RNA database. Nucleic Acids Research, 28(1):166–167, 2002.

[100] S. Tanner, H. Shu, A. Frank, L.-C. Wang, E. Zandi, M. Mumby, P. A. Pevzner, and V. Bafna. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal Chem, 77(14):4626–4639, Jul 2005.

[101] C. A. Theimer, C. A. Blois, and J. Feigon. Structure of the human telomerase RNA pseudoknot reveals conserved tertiary interactions essential for function. Mol Cell, 17(5):671–682, Mar 2005.

[102] J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res., 22(22):4673–4680, Nov 1994.

[103] I. Tinoco, O. C. Uhlenbeck, and M. D. Levine. Estimation of secondary structure in ribonucleic acids. Nature, 230(5293):362–367, Apr 1971.

[104] H. Touzet and O. Perriquet. CARNAC: folding families of related RNAs. Nucl. Acids Res., 32(Web Server issue):142–145, Jul 2004.

[105] J. Venter et al. The sequence of the human genome. Science, 291(5507):1304–51, 2001.

[106] J. C. Venter et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304(5667):66–74, Apr 2004.

[107] A. G. Vitreschak, D. A. Rodionov, A. A. Mironov, and M. S. Gelfand. Regulation of the vitamin B12 metabolism and transport in bacteria by a conserved RNA structural element. RNA, 9(9):1084–1097, Sep 2003.

[108] A. G. Vitreschak, D. A. Rodionov, A. A. Mironov, and M. S. Gelfand. Riboswitches: the oldest mechanism for the regulation of gene expression? Trends Genet, 20(1):44–50, Jan 2004.

[109] S. Washietl and I. Hofacker. Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. Journal of Molecular Biology, 342(1):19–30, 2004.

[110] S. Washietl, I. L. Hofacker, M. Lukasser, A. Huttenhofer, and P. F. Stadler. Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat Biotechnol, 23(11):1383–1390, Nov 2005.

[111] S. Washietl, J. S. Pedersen, J. O. Korbel, C. Stocsits, A. R. Gruber, J. Hackermuller, J. Hertel, M. Lindemeyer, K. Reiche, A. Tanzer, C. Ucla, C. Wyss, S. E. Antonarakis, F. Denoeud, J. Lagarde, J. Drenkow, P. Kapranov, T. R. Gingeras, R. Guigo, M. Snyder, M. B. Gerstein, A. Reymond, I. L. Hofacker, and P. F. Stadler. Structured RNAs in the ENCODE selected regions of the human genome. Genome Res, 17(6):852–864, Jun 2007.

[112] M. S. Waterman. Secondary structure of single stranded nucleic acids. Adv. Math. Suppl. Stud., I:167–212, 1978.

[113] M. S. Waterman. Consensus methods for folding single-stranded nucleic acids. Mathematical methods for DNA Sequences, pages 185–224, 1989.

[114] Z. Weinberg and W. L. Ruzzo. Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics, 20 Suppl 1:I334–I341, Aug 2004.

[115] Z. Weinberg and W. L. Ruzzo. Faster genome annotation of non-coding RNA families without loss of accuracy. In RECOMB ’04: Proceedings of the eighth annual international conference on Research in computational molecular biology, pages 243–251, New York, NY, USA, 2004. ACM Press.

[116] W. C. Winkler and R. R. Breaker. Genetic control by metabolite-binding riboswitches. Chembiochem, 4(10):1024–1032, 2003.

[117] W. C. Winkler and R. R. Breaker. Regulation of bacterial gene expression by riboswitches. Annu Rev Microbiol, 59:487–517, Oct 2005.

[118] C. Workman and A. Krogh. No evidence that mRNA have lower folding free energy than random sequences with the same dinucleotide distribution. Nucleic Acids Research, 27(24):4816–4822, 1999.

[119] K. Zhang, L. Wang, and B. Ma. Computing similarity between RNA structures. In Combinatorial Pattern Matching, pages 281–293, 1999.

[120] S. Zhang, B. Haas, E. Eskin, and V. Bafna. Searching genomes for non-coding RNA using FastR. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Oct-Dec 2005.

[121] D. Zilberman, X. Cao, and S. E. Jacobsen. ARGONAUTE4 control of locus-specific siRNA accumulation and DNA and histone methylation. Science, 299(5607):716–719, Jan 2003.

[122] M. Zuker. Prediction of RNA secondary structure by energy minimization. Methods Mol. Biol., 25:267–294, 1994.

[123] M. Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucl. Acids Res., 31(13):3406–3415, 2003.

[124] M. Zuker and D. Sankoff. RNA secondary structures and their prediction. Bull. Math. Biol., 46:591–621, 1984.

[125] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res., 9(1):133–148, Jan 1981.

[126] C. Zwieb, I. Wower, and J. Wower. Comparative sequence analysis of tmRNA. Nucl. Acids Res., 27(10):2063–2071, 1999.