Dissertation / Doctoral Thesis
Title of the Doctoral Thesis
Unsupervised construction, evaluation and visualisation of RNA family models

submitted by
Mag. rer. nat. Florian Eggenhofer

in partial fulfillment of the requirements for the degree of
Doctor of Philosophy (PhD)

Vienna, 2016

Degree programme code as it appears on the student record sheet: A 794 685 490
Field of study as it appears on the student record sheet: Molekulare Biologie (Molecular Biology)
Supervisor: Univ.-Prof. Dipl.-Phys. Dr. Ivo L. Hofacker

Acknowledgments

I would like to acknowledge some outstanding persons and institutions that supported me and my work for this thesis.

Ivo, my sincere thanks for sharing your wisdom and knowledge, but also for your good-will and patience. Your enthusiasm for science has set a shining example for me of what it means to be a scientist. It has been a privilege to have you as an advisor, for which I will always be grateful.

Christian, I want to express my deepest gratitude for the countless instances where you have supported me. Your expertise and dedication have been truly inspiring for me. A hearty thank you for introducing me to functional programming and Haskell. I consider myself very fortunate to have you as co-advisor and collaborator.

I want to thank my collaborators from the ViennaNGS team:
Michael, for initiating and leading the ViennaNGS project, for guiding me through the intricacies of hub construction (and for Speck).
Jörg, for hacking, thinking and discussing with me, instantly finding the most elusive of bugs and being the truest of friends.
Fabian, for sharing my fascination with small RNAs, discussing with me and being symbadisch.

I want to thank the members of my PhD committee, Peter F. Stadler and Renée Schroeder, for their advice, encouragement and suggestions.
I want to express my deepest gratitude for the administrative support from:
Judith, for being the kind and compassionate soul of the institute and the excellent way she manages the administration.
Richard, for being a real-life Scotty and for supporting me with technical and personal advice over the years. Hack on!
Gerlinde, for her support in the PhD program and the overcoming of bureaucratic odds.

I want to thank Sita Saunders for her time and help in improving the text quality of the foreword and the discussion.

Furthermore I want to thank:
The reviewers of the publications included in this thesis, for their contributions to the manuscripts and the tools.
The reviewers of this thesis, for consenting to review this manuscript and for their corresponding investment of effort and time.
The three awesome "Musketeers", Stefan, Peter and Jörg (again), for sharing the PhD experience with me since the PhD selection.
The crowd of friendly people at the TBI who were not directly involved in my thesis. I really appreciated the time with you.
Rolf, for his generosity and for giving me a new scientific home at the University of Freiburg.

I also want to thank the funding agencies DFG - Deutsche Forschungsgemeinschaft, SNF - Schweizer Nationalfond and FWF - Der Wissenschaftsfond, and the University of Vienna for funding.

Finally, my Mum and Dad, for their unwavering support and for believing in me.

Abstract

RNA performs important functions in all organisms, for example mediating gene expression. RNAs are often evolutionarily conserved over large sets of species, giving rise to families of homologous RNA genes. These RNA families exhibit not only sequence similarity, but are often characterized by strong conservation of the RNA structure.

Computationally, RNA families are represented by RNA-family models, also known as covariance models. Covariance models capture structure and sequence of the family in a probabilistic model.
They enable the prediction of additional, previously unknown members of the RNA family from genomic sequences. This allows a knowledge transfer between organisms and helps in designing experiments.

Up to now, RNA-family models were constructed by manual collection and curation, or by automatic solutions for a few specific RNA families. The peer-reviewed publication "RNAlien - Unsupervised RNA-family model construction" introduces a novel method to construct such models automatically, in principle for any RNA sequence. Starting from a single input sequence, RNAlien collects potential family member sequences by multiple iterations of homology search. RNA-family models are constructed fully automatically for the found sequences.

The quality of RNA-family models and their performance in homology search depend on several factors. RNAlien evaluates both the models and the aligned sequences used to build them, to provide as much information about the model as possible. However, this takes only the novel model itself into consideration and does not investigate it in the context of other models.

The following manuscript, with the title "CMCompare webserver: comparing RNA families via covariance models", addresses the comparison between models. This allows the identification of models with poor specificity and the exploration of the relationships between models. Visualisation of family relationships helps in identifying candidates for clans, groups of biologically related families.

Moreover, the thesis presents a novel tool to visualise and compare the taxonomy of found RNA-family members, called TaxonomyTools.

Family member sequences found by RNAlien during the model construction process are also a useful starting point for investigating families. UCSC genome browser hubs visualise the found family members in their genetic context, showing traits like orthology.
Methods to construct such hubs were contributed to the publication "ViennaNGS: A toolbox for building efficient next-generation sequencing analysis pipelines" and are also presented in the thesis.

Zusammenfassung (Summary)

In computer science, RNA families are represented by RNA-family models, also known as covariance models. Covariance models capture the structure and sequence of the family as a statistical model. They make it possible to identify additional, previously unknown members of the RNA family in genomic sequences. This process allows existing knowledge and experimental results to be transferred from one organism to another and simplifies the design of new experiments.

In the past, RNA-family models were constructed by manual collection and refinement, or by automatic solutions for a few specific RNA families. The publication "RNAlien - Unsupervised RNA-family model construction" presents a new method for constructing such models automatically, in principle for any RNA sequence. Starting from a single input sequence, RNAlien collects potential family members through multiple iterations of homology search. RNA-family models are built automatically for the sequences found.

The quality of RNA-family models and their performance in homology search depend on various factors. RNAlien evaluates both the models and the aligned sequences used to build them, in order to provide as much information as possible. However, this considers only the newly constructed model and does not relate it to other models.

The following publication, with the title "CMCompare webserver: comparing RNA families via covariance models", addresses the comparison between models. This allows the identification of models with poor specificity and the investigation of relationships between models. Visualisation of these relationships helps in identifying candidates for clans, groups of biologically related families.

In addition, a software package named TaxonomyTools is presented, which enables the visualisation and comparison of the taxonomy of found RNA-family members.

Sequences of family members identified by RNAlien during the construction process are a starting point for further investigation of the family. UCSC genome browser hubs visualise the found family members in their genomic context, which reveals properties such as orthology. Methods for building such hubs were published as a contribution to the publication "ViennaNGS: A toolbox for building efficient next-generation sequencing analysis pipelines" and are presented here.

Contents

List of Figures
List of Tables
1 Foreword
2 Theoretical Background
  2.1 RNA biology
    2.1.1 Sequence
    2.1.2 Secondary structure
    2.1.3 Tertiary structure
    2.1.4 Quaternary structure
    2.1.5 RNA function
    2.1.6 Homology
    2.1.7 Phylogenetics
    2.1.8 Taxonomy
    2.1.9 RNA groups
  2.2 Sequence alignment
    2.2.1 Pairwise sequence alignment
    2.2.2 Multiple sequence alignment
  2.3 Probabilistic models
    2.3.1 Hidden Markov models
    2.3.2 Stochastic Context Free Grammar
  2.4 RNA-family models
    2.4.1 Infernal
    2.4.2 Rfam - RNA-family database
  2.5 Homology search
    2.5.1 BLAST
    2.5.2 Expected value
    2.5.3 nhmmer
    2.5.4 cmsearch
3 RNA-family model construction
  3.1 Construction of RNA families and Clans
    3.1.1 Seed alignment
    3.1.2 Consensus