Modelling Binding Preferences of RNA-Binding Proteins

Dissertation for the attainment of the academic degree Doctor rerum naturalium (Dr. rer. nat.)
Submitted to the Council of the Faculty of Engineering (Technische Fakultät) of the Albert-Ludwigs-Universität Freiburg im Breisgau
9 February 2017
by Diplom-Informatiker Daniel Maticzka

Dean: Prof. Dr. Oliver Paul
Reviewers: Prof. Dr. Rolf Backofen, Prof. Dr. Ivo Hofacker
Date of doctoral examination: 4 April 2017

Acknowledgements

I would like to thank my supervisor Prof. Backofen for his advice and support during the time of my PhD. Much of this work would have been impossible without his requisition of data when there was none and providing the means to process it when there was plenty. I also would like to express my gratitude to Prof. Ivo Hofacker and the members of my PhD committee for taking the time to evaluate this work.

All of this was only ever possible because of the companionship, contributions and criticism of many people. This page is dedicated to you. You lent the shoulders to stand on, you are my giants! In particular, I would like to thank

- Martin Mann, Sita J. Saunders, Milad Miladi and Torsten Houwaart for proofreading parts of this thesis,
- Christina Otto, Robert Kleinkauf, Fabrizio Costa, Milad Miladi, Michel Uhl and Stefan Mautner for making our office a lively place in all its iterations,
- Sita J. Saunders for still rocking our boat,
- Martin Mann for all the advice over the years, be it big or small,
- Fabrizio Costa for discussing and sharing his many ideas and schemes,
- Ibrahim Ilik for his longtime collaboration and freely sharing his RNA-seq expertise,
- Anke Busch, Martin Mann, Mathias Möhl and Sebastian Will for their guidance in all the things that fold,
- Monika Degen-Hellmuth for being the heart of the lab and for all her efforts in guiding me through pesky administrative details,
- all past and present group members for being a great community, and
- my family and friends for bearing with all that crazy talk about that woman called "Erna".

Finally, I would like to express my deepest gratitude to Kim and Emil, who traveled this long journey with me and supported me with their relentless love. Thank you for being the amazing people you are!

Contents

Summary
Zusammenfassung
Publications
1 Introduction
  1.1 RNA
    1.1.1 Global and local RNA structure
    1.1.2 Prediction of RNA secondary structure
  1.2 RNA-binding proteins
    1.2.1 Properties of RNA-binding proteins
    1.2.2 Experimental identification of RNA-protein interactions
  1.3 Performance evaluation of predictive methods
2 Predicting the local structure of mRNAs
  2.1 Introduction
  2.2 Prediction and evaluation of local RNA structures
    2.2.1 Algorithms and performance measures
    2.2.2 Exploratory evaluation of local folding parameters
    2.2.3 A bias in windowed local folding
  2.3 Performance evaluation of local folding algorithms
    2.3.1 Prediction of cis-regulatory structures
    2.3.2 Prediction of accessible regions
  2.4 Conclusion
3 Detecting binding sites of RNA-binding proteins with iCLIP
  3.1 Introduction
  3.2 iCLIP processing pipeline
    3.2.1 Alignment to the reference genome
    3.2.2 Identification of crosslinking events
    3.2.3 Identification of binding sites
  3.3 Identification of MLE and MSL2 binding sites
    3.3.1 Identification of MLE- and MSL-bound sites
  3.4 Conclusion
4 GraphProt: Modelling RBP binding preferences
  4.1 Introduction
  4.2 The flexible GraphProt framework
    4.2.1 Graph encoding of RNA sequence and structure
    4.2.2 Graph kernel
    4.2.3 Application of predictive models
  4.3 GraphProt performance evaluation
    4.3.1 Learning binding preferences from high-throughput data
    4.3.2 GraphProt sequence-and-structure motifs
    4.3.3 Benefits of modelling local RNA structure
    4.3.4 Learning binding affinities from categorical data
    4.3.5 Genome-wide prediction of binding sites
  4.4 Conclusion
5 Model-based validation of RBP binding sites
  5.1 Introduction
    5.1.1 PTB mediates expression of ANXA7 isoforms
  5.2 Prediction and validation of binding sites
    5.2.1 Prediction of PTB-bound sites
    5.2.2 Designing mutations for probing predicted sites
    5.2.3 Experimental validation of predicted binding sites
  5.3 Completeness of CLIP-seq-derived binding sites
    5.3.1 Influence of peak calling
    5.3.2 Influence of sequencing depth
    5.3.3 Influence of mappability
  5.4 Conclusion
6 Conclusions
Bibliography
A Detailed statement of contributions
B Supplementary material
  B.1 Chapter 2
  B.2 Chapter 4
  B.3 Chapter 5

Summary

This is a dissertation about modelling the binding preferences of RNA-binding proteins (RBPs), proteins with the capability to bind ribonucleic acids (RNAs). RNAs and proteins are versatile macromolecules that are employed in a plethora of cellular functions and that are essential for the synthesis of proteins according to their genetic blueprints. However, they not only provide the machinery that creates RNA and protein products from DNA-encoded genes but are also involved in regulating gene expression. RBPs are mainly involved in post-transcriptional regulation, the regulation of gene expression at the RNA level.

RBPs can be categorized as single-stranded RNA-binding proteins (ssRBPs) and double-stranded RNA-binding proteins (dsRBPs). ssRBPs recognize and bind the nucleobase parts of RNA nucleotides. If a target region is sequestered in a structure, it may not be readily available for binding by an ssRBP. dsRBPs, on the other hand, recognize structured parts of RNAs. Consequently, modelling RBP binding preferences generally requires knowledge about the structure of the targeted regions.
Since data on biochemically determined RNA structures is scarce, efficient methods for the in-silico prediction of RNA structures are required. In this work, we discuss structure prediction for regulatory structure elements located on messenger RNAs, the main targets of RBPs.

The vast majority of the currently known RBP binding sites were determined by CLIP-seq. With CLIP-seq, interacting RBPs and RNAs are fused via irradiation with ultraviolet light. The RNA regions fused to a protein of interest are then extracted and their sequences are determined by high-throughput sequencing. In this work, we discuss the appropriate processing of sequenced reads stemming from the iCLIP protocol, a variant of CLIP-seq that allows individual RBP binding events to be determined at nucleotide resolution.

We present GraphProt, a flexible framework for training computational models of RBP binding preferences. Here, the combination of large numbers of RBP target sites determined by CLIP-seq approaches and efficient algorithms for predicting the structure of these sites allows the creation of models of RBP sequence and structure binding preferences. GraphProt predictions can be used to score the likelihood that an RBP binds any given RNA sequence. When applied at the nucleotide level, GraphProt models can be used to create motif visualizations of RBP sequence and structure preferences. GraphProt predictions at the nucleotide level also allow transcriptome-wide scanning for binding sites, for example to detect binding sites that were missed by a given CLIP-seq experiment. We present an exemplary analysis of one such case, where biologically relevant binding sites were missing in a published set of CLIP-seq-determined binding sites. Using a GraphProt model trained on these mostly non-functional sites, we were able to determine a genome-wide set of sites that better correspond to target genes known to be regulated by the RBP.
The final topic covered by this thesis is the experimental validation of RBP binding sites. For predicted binding sites, the actual capability to be bound by an RBP has to be shown. Even for experimentally determined binding sites, the mere event of binding does not necessarily result in regulation of the targeted gene. Accordingly, a method for experimentally validating both binding to and efficacy of target sites is required. For this purpose, we used GraphProt to design mutations of binding sites meant to weaken or disable binding of an RBP. The efficacy of a bound and regulatorily active site could then be shown experimentally by evaluating the effects of the loss of binding caused by the introduced mutations, thus validating both binding to and efficacy of the original binding sites.

Zusammenfassung

This dissertation addresses the modelling of the binding preferences of RNA-binding proteins (RBPs), a class of proteins that can bind to ribonucleic acids (RNAs). RNAs and proteins are versatile macromolecules that are used in a multitude of cellular processes. They are essential for the synthesis of proteins, yet they not only provide the machinery for producing the RNA and protein products based on DNA-encoded genes but are also involved in the regulation of gene expression. Here, RBPs are mainly involved in post-transcriptional regulation, the regulation of gene expression at the RNA level.

RBPs can be categorized as single-stranded RNA-binding proteins (ssRBPs) and double-stranded RNA-binding proteins (dsRBPs). ssRBPs recognize and bind the nucleobase part of RNA nucleotides. If a target region is involved in an RNA structure, it may not be available for binding by the ssRBP. dsRBPs, in contrast, recognize structured parts of RNAs. This means that the creation of models of RBP binding preferences depends on information about the structure of the targeted RNA regions.
