Structural RNA Homology Search and Alignment Using Covariance Models Eric Nawrocki Washington University in St
Total Page:16
File Type:pdf, Size:1020Kb
Washington University in St. Louis Washington University Open Scholarship All Theses and Dissertations (ETDs) January 2009 Structural RNA Homology Search and Alignment Using Covariance Models Eric Nawrocki Washington University in St. Louis Follow this and additional works at: https://openscholarship.wustl.edu/etd Recommended Citation Nawrocki, Eric, "Structural RNA Homology Search and Alignment Using Covariance Models" (2009). All Theses and Dissertations (ETDs). 256. https://openscholarship.wustl.edu/etd/256 This Dissertation is brought to you for free and open access by Washington University Open Scholarship. It has been accepted for inclusion in All Theses and Dissertations (ETDs) by an authorized administrator of Washington University Open Scholarship. For more information, please contact [email protected]. WASHINGTON UNIVERSITY IN ST. LOUIS Division of Biology and Biomedical Sciences (Computational Biology) Dissertation Examination Committee: Sean Eddy, Chair Michael Brent Jeremy Buhler Justin Fay Jeff Gordon Rob Mitra Gary Stormo STRUCTURAL RNA HOMOLOGY SEARCH AND ALIGNMENT USING COVARIANCE MODELS by Eric Paul Nawrocki A dissertation presented to the Graduate School of Arts and Sciences of Washington University in partial fulfillment of the requirements for the degree of Doctor of Philosophy December 2009 Saint Louis, Missouri copyright by Eric Paul Nawrocki 2009 ABSTRACT OF THE DISSERTATION Structural RNA Homology Search and Alignment Using Covariance Models by Eric Paul Nawrocki Doctor of Philosophy in Biology and Biomedical Sciences (Computational Biology) Washington University in St. Louis, 2009 Sean R. Eddy, Chairman Functional RNA elements do not encode proteins, but rather function directly as RNAs. Many different types of RNAs play important roles in a wide range of cellular processes, including protein synthesis, gene regulation, protein transport, splicing, and more. Because important sequence and structural features tend to be evolutionarily conserved, one way to learn about functional RNAs is through comparative sequence analysis - by collecting and aligning examples of homologous RNAs and comparing them. Covariance models (CMs) are powerful computational tools for homology search and alignment that score both the conserved sequence and secondary structure of an RNA family. However, due to the high computational complexity of their search and alignment algorithms, searches against large databases and alignment of large RNAs like small subunit ribosomal RNA (SSU rRNA) are prohibitively slow. Large-scale alignment of SSU rRNA is of particular utility for environmental survey studies of microbial diversity which often use the rRNA as a phylogenetic marker of microorganisms. In this work, we improve CM methods by making them faster and more sensitive to remote homology. To accelerate searches, we introduce a query-dependent banding (QDB) technique that makes scoring sequences more efficient by restricting the possible lengths of structural elements based on their probability given the model. We combine QDB with a ii complementary filtering method that quickly prunes away database subsequences deemed unlikely to receive high CM scores based on sequence conservation alone. To increase search sensitivity, we apply two model parameterization strategies from protein homology search tools to CMs. As judged by our benchmark, these combined approaches yield about a 250-fold speedup and significant increase in search sensitivity compared with previous implementations. To accelerate alignment, we apply a method that uses a fast sequence- based alignment of a target sequence to determine constraints for the more expensive CM sequence- and structure-based alignment. This technique reduces the time required to align one SSU rRNA sequence from about 15 minutes to 1 second with a negligible effect on alignment accuracy. Collectively, these improvements make CMs more powerful and practical tools for RNA homology search and alignment. iii Acknowledgements I gratefully acknowledge financial support from the Howard Hughes Medical Institute (HHMI) and a National Institutes of Health National Human Genome Research Initiative (NIH NHGRI) Institutional Training Grant in Genomic Science, T32-HG000045. Sean Eddy’s lab in Washington University at St. Louis, where this work took place from 2004 to 2006, was supported by NIH NHGRI grant RO1-HG01363, HHMI, and Alvin Goldfarb. Since then, my research in the Eddy lab at Janelia Farm Research Campus in Ashburn, Virginia has been financially supported by HHMI. This dissertation is based on and benefits from important research from many other scientists. Zasha Weinberg and Larry Ruzzo’s excellent work on RNA homology search, Michael Brown and David Haussler’s clever methods for accelerating structural RNA align- ment, and the wealth of freely available information at Robin Gutell’s Comparative RNA Website (crw) have been particularly crucial. In addition, I have been extremely fortunate to work in the lab of Sean Eddy. Sean’s tireless work ethic and depth of knowledge in the biological and computational sciences are staggering. The past and present members of his lab bring dedication, skill and enthusiasm to their work that is unmatched. It has been an honor to learn from and work with each of them. Finally, I want to thank my wife, Michael Hopkins, for her unwavering support and confidence, and for adding so much fun to the last six years of my life - none of this would have been possible without her. iv This work is dedicated to my parents, Kathy and Ed Nawrocki. Thank you for all the love, support, and opportunities that you have provided for me. v Contents Acknowledgements iv Part 1: RNA homology search 1 1 Introduction to RNA homology search 2 1.1 Functional RNA elements . 3 1.2 Comparative sequence analysis . 9 1.3 Computational methods for sequence homology search . 15 1.4 Exploiting conserved structure in RNA homology searches . 34 1.5 Stochastic context-free grammars for RNA sequence and structure modeling 41 1.6 Covariance models: profile stochastic context-free grammars . 46 1.7 Addressing the limitations of covariance models . 57 1.8 Outline of Part 1 of this work . 63 2 QDB for faster RNA similarity searches 65 2.1 Abstract . 65 2.2 Introduction . 66 2.3 Results . 69 2.4 Discussion . 91 2.5 Materials and Methods . 95 3 Infernal 1.0: inference of RNA alignments 96 vi 3.1 Abstract . 96 3.2 Introduction . 96 3.3 Usage . 97 3.4 Performance . 98 4 A filter pipeline for accelerating covariance model searches 102 4.1 Previous work on accelerating CM searches using filters . 103 4.2 Optimized implementations of the CYK and Inside algorithms . 104 4.3 Infernal’s two-stage filter acceleration pipeline . 106 4.4 Evaluation . 118 4.5 Implementation . 127 4.6 Conclusion and future directions . 130 5 Computational identification of functional RNA homologs in metage- nomic data 132 5.1 Exploiting conserved structure in RNA similarity searches . 134 5.2 Infernal: software for RNA homology search and alignment . 138 5.3 Rfam: high-throughput RNA homology search and annotation . 141 5.4 Limitations of CMs . 145 5.5 Conclusion . 148 6 Conclusion to RNA homology search 149 Part 2: RNA alignment 151 7 Introduction to small subunit ribosomal RNA alignment 152 7.1 SSU and the tree of life . 153 7.2 Rapid culture-independent SSU sequence determination . 154 7.3 Environmental sequencing surveys . 155 7.4 Comparative SSU analysis for environmental surveys . 161 vii 7.5 SSU alignment methods . 169 7.6 Dedicated SSU databases and alignment tools . 175 7.7 Developing a new SSU alignment tool . 181 8 HMM banding for faster structural RNA alignment 184 8.1 Faster alignment using banded dynamic programming . 184 8.2 HMM banded alignment in Infernal . 186 8.3 Comparison of HMM and query-dependent banding . 194 8.4 Benchmarking . 195 8.5 Constructing an HMM that is maximally similar to a CM . 201 9 SSU-align: a tool for structural alignment of SSU rRNA sequences 206 9.1 Aligning SSU sequences with SSU-align . 207 9.2 Automated probabilistic alignment masking . 208 9.3 Implementation . 214 9.4 SSU-align’s SSU rRNA sequence and structure models . 217 Bibliography 236 Vita 265 viii List of Figures 1.1 RNA secondary structure elements. 8 1.2 Predicted secondary structure of SSU rRNA from E. coli (a bacterium) and M. vannielii (an archaeon). 10 1.3 Predicted secondary structure of RNase P RNA from E. coli (a bacterium) and M. vannielii (an archaeon). 11 1.4 Toy example of a multiple sequence alignment. 12 1.5 Dynamic programming matrix filled during a Needleman-Wunsch-Sellers align- ment of two sequences. 19 1.6 A banded pairwise alignment dynamic programming matrix. 21 1.7 A toy hidden Markov model. 27 1.8 A simple profile HMM. 29 1.9 Additional information gained from modeling structure . 39 1.10 Information in a sequence-only versus a sequence and structure profile. 42 1.11 A parse tree for a dual-stem loop grammar and the corresponding RNA secondary structure . 45 1.12 Input alignment, CM guide tree, CM state graph and parse trees. 48 1.13 Examples of CM local begin and end transitions. 53 1.14 Histogram of number of sequences in rfam 7.0 seed alignments . 56 1.15 A pattern defining the E-loop RNA structural motif used by the rnamotif program. 62 ix 2.1 An example RNA family and corresponding CM . 71 2.2 Effect of transition priors on band calculation . 78 2.3 Effect of varying the β parameter on sensitivity, specificity, and speedup . 88 2.4 CPU time required by CM searches with and without QDB . 89 2.5 ROC curves for the benchmark (QDB chapter) . 90 3.1 ROC curves for the benchmark (infernal 1.0 chapter) . 99 4.1 Average QDB CYK and HMM Forward scores versus Inside scores for various RNA families. 111 4.2 CM score histograms of random, known, and sampled sequences for three RNA families. 115 4.3 MER versus time for the benchmark.