Modelling binding preferences of RNA-binding proteins

Dissertation submitted for the academic degree of Doctor rerum naturalium (Dr. rer. nat.)

submitted to the Council of the Faculty of Engineering of the Albert-Ludwigs-Universität Freiburg im Breisgau

9 February 2017

by Diplom-Informatiker Daniel Maticzka

Dean:

Prof. Dr. Oliver Paul

Reviewers:

Prof. Dr. Rolf Backofen
Prof. Dr. Ivo Hofacker

Date of the doctorate: 4 April 2017

Acknowledgements

I would like to thank my supervisor Prof. Backofen for his advice and support during the time of my PhD. Much of this work would have been impossible without his requisition of data when there was none and providing the means to process it when there was plenty. I also would like to express my gratitude to Prof. Ivo Hofacker and the members of my PhD committee for taking the time to evaluate this work.

All of this was only ever possible because of the companionship, contributions and criticism of many people. This page is dedicated to you. You lent the shoulders to stand on, you are my giants!

In particular, I would like to thank ...

... Martin Mann, Sita J. Saunders, Milad Miladi and Torsten Houwaart for proofreading parts of this thesis,

... Christina Otto, Robert Kleinkauf, Fabrizio Costa, Milad Miladi, Michel Uhl and Stefan Mautner for making our office a lively place in all its iterations,

... Sita J. Saunders for still rocking our boat,

... Martin Mann for all the advice over the years, be it big or small,

... Fabrizio Costa for discussing and sharing his many ideas and schemes,

... Ibrahim Ilik for his longtime collaboration and for freely sharing his RNA-seq expertise,

... Anke Busch, Martin Mann, Mathias Möhl and Sebastian Will for their guidance in all the things that fold,

... Monika Degen-Hellmuth for being the heart of the lab and for all her efforts in guiding me through pesky administrative details,

... all past and present group members for being a great community, and

... my family and friends for bearing with all that crazy talk about that woman called "Erna".

Finally, I would like to express my deepest gratitude to Kim and Emil, who traveled this long journey with me and supported me with their relentless love. Thank you for being the amazing people you are!

Contents

Summary
Zusammenfassung
Publications

1 Introduction
1.1 RNA
1.1.1 Global and local RNA structure
1.1.2 Prediction of RNA secondary structure
1.2 RNA-binding proteins
1.2.1 Properties of RNA-binding proteins
1.2.2 Experimental identification of RNA-protein interactions
1.3 Performance evaluation of predictive methods

2 Predicting the local structure of mRNAs
2.1 Introduction
2.2 Prediction and evaluation of local RNA structures
2.2.1 Algorithms and performance measures
2.2.2 Exploratory evaluation of local folding parameters
2.2.3 A bias in windowed local folding
2.3 Performance evaluation of local folding algorithms
2.3.1 Prediction of cis-regulatory structures
2.3.2 Prediction of accessible regions
2.4 Conclusion

3 Detecting binding sites of RNA-binding proteins with iCLIP
3.1 Introduction
3.2 iCLIP processing pipeline
3.2.1 Alignment to the reference genome
3.2.2 Identification of crosslinking events
3.2.3 Identification of binding sites
3.3 Identification of MLE and MSL2 binding sites
3.3.1 Identification of MLE- and MSL-bound sites
3.4 Conclusion

4 GraphProt: Modelling RBP binding preferences
4.1 Introduction
4.2 The flexible GraphProt framework
4.2.1 Graph encoding of RNA sequence and structure
4.2.2 Graph kernel
4.2.3 Application of predictive models
4.3 GraphProt performance evaluation
4.3.1 Learning binding preferences from high-throughput data
4.3.2 GraphProt sequence-and-structure motifs
4.3.3 Benefits of modelling local RNA structure
4.3.4 Learning binding affinities from categorical data
4.3.5 Genome-wide prediction of binding sites
4.4 Conclusion

5 Model-based validation of RBP binding sites
5.1 Introduction
5.1.1 PTB mediates expression of ANXA7 isoforms
5.2 Prediction and validation of binding sites
5.2.1 Prediction of PTB-bound sites
5.2.2 Designing mutations for probing predicted sites
5.2.3 Experimental validation of predicted binding sites
5.3 Completeness of CLIP-seq-derived binding sites
5.3.1 Influence of peak calling
5.3.2 Influence of sequencing depth
5.3.3 Influence of mappability
5.4 Conclusion

6 Conclusions

Bibliography

A Detailed statement of contributions

B Supplementary material
B.1 Chapter 2
B.2 Chapter 4
B.3 Chapter 5

Summary

This dissertation is about modelling the binding preferences of RNA-binding proteins (RBPs), proteins with the capability to bind ribonucleic acids (RNAs). RNAs and proteins are versatile macromolecules that are employed in a plethora of cellular functions and that are essential for the synthesis of proteins according to their genetic blueprints. They do not merely provide the machinery that creates RNA or protein products according to DNA-encoded genes, however, but are also involved in the regulation of this expression. RBPs are mainly involved in post-transcriptional regulation, the regulation of gene expression at the RNA level.

RBPs can be categorized as single-stranded RNA-binding proteins (ssRBPs) and double-stranded RNA-binding proteins (dsRBPs). ssRBPs recognize and bind the nucleobase parts of RNA nucleotides. If a target region is sequestered in a structure, it may not be readily available for binding by an ssRBP. dsRBPs, on the other hand, recognize structured parts of RNAs. In consequence, the creation of models of RBP binding preferences in general requires knowledge about the structure of the targeted regions. Since data on biochemically determined RNA structures are scarce, efficient methods for the in-silico prediction of RNA structures are required. In this work, we discuss structure prediction for regulatory structure elements located on messenger RNAs, the main targets of RBPs.

The vast majority of the currently known RBP binding sites were determined by CLIP-seq. With CLIP-seq, interacting RBPs and RNAs are fused via irradiation with ultraviolet light. The RNA regions fused to a protein of interest are then extracted and their sequences are determined by high-throughput sequencing. In this work, we discuss the appropriate processing of sequenced reads stemming from the iCLIP protocol, a variant of CLIP-seq that allows the determination of individual RBP binding events at nucleotide resolution.

We present GraphProt, a flexible framework for training computational models of RBP binding preferences. Here, the combination of large numbers of RBP target sites determined by CLIP-seq approaches and efficient algorithms for predicting the structure of these sites allows the creation of models of RBP sequence and structure binding preferences. GraphProt predictions make it possible to score the likelihood of an RBP binding any RNA sequence. When applied at the nucleotide level, GraphProt models can be used to create motif visualizations of RBP sequence and structure preferences. GraphProt predictions at the nucleotide level also allow transcriptome-wide scanning for binding sites, for example to detect binding sites that were missed by a given CLIP-seq experiment. We present an exemplary analysis of one such case, where biologically relevant binding sites were missing in a published set of CLIP-seq-determined binding sites. Using a GraphProt model trained on these mostly non-functional sites, we were able to determine a genome-wide set of sites that better corresponds to the target genes known to be regulated by the RBP.

The final topic covered by this thesis is the experimental validation of RBP binding sites. For predicted binding sites, the actual capability to be bound by an RBP has to be shown. Even for experimentally determined binding sites, the mere event of binding does not unconditionally cause a regulation of the targeted gene. Accordingly, a method for experimentally validating both binding to and efficacy of target sites is required. For this purpose, we used GraphProt to design changes to binding sites that are meant to weaken or disable binding of an RBP. The efficacy of a bound and regulatorily active site could then be shown experimentally by evaluating the effects of a loss of binding caused by the introduced mutations, thus validating binding to and efficacy of the original binding sites.

Zusammenfassung

This dissertation addresses the modelling of the binding preferences of RNA-binding proteins (RBPs), a class of proteins that can bind ribonucleic acids (RNAs). RNAs and proteins are versatile macromolecules that are used in a multitude of cellular processes. They are essential for the synthesis of proteins, yet they do not only provide the machinery for producing RNA and protein products based on DNA-encoded genes but are also involved in the regulation of gene expression. Here, RBPs mainly participate in post-transcriptional regulation, the regulation of gene expression at the level of RNA.

RBPs can be categorized as single-stranded RNA-binding proteins (ssRBPs) and double-stranded RNA-binding proteins (dsRBPs). ssRBPs recognize and bind the nucleobase part of the RNA nucleotides. If a target region is involved in an RNA structure, it may not be available for binding by the ssRBP. dsRBPs, in contrast, recognize structured parts of RNAs. This means that the creation of models of RBP binding preferences depends on information about the structure of the targeted RNA regions. Since experimentally determined data on RNA structures are rarely available, efficient methods for the in-silico prediction of RNA structures are necessary. In this work, we analyse algorithms for the structure prediction of regulatory structure elements in messenger RNAs, the main targets of RBPs.

The large majority of the currently known RBP binding sites were determined by CLIP-seq methods. Here, interacting RBPs and RNAs are firmly bound to each other by irradiation with ultraviolet light. The regions connected to the RBP can thereby be extracted and their base sequence determined. In this work, we discuss the necessary processing steps for RNA sequences generated with the iCLIP protocol. iCLIP is a variant of CLIP-seq and enables the determination of individual binding events at nucleotide resolution.

We present GraphProt, a flexible system for training models of RBP binding preferences. The creation of models of RBP sequence and structure binding preferences is made possible by combining a large number of RBP binding sites determined with CLIP-seq methods with efficient algorithms for predicting the structure of these binding sites. GraphProt predictions make it possible to predict the tendency of an RBP to bind to an arbitrary binding site. GraphProt predictions at the nucleotide level enable the creation of graphical representations of RBP sequence and structure binding preferences. In addition, these predictions enable the transcriptome-wide search for binding sites, for example to predict binding sites that could not be detected by a given CLIP-seq experiment. We present an analysis of one such case, in which biologically relevant binding sites are largely missing from a published set of CLIP-seq binding sites. Using a GraphProt model trained on that same incomplete set of binding sites, we were able to predict an extended set of binding sites that agrees considerably better with the genes regulated by the RBP than the experimentally determined binding sites do.

Finally, we address the experimental validation of RBP binding sites. For predicted binding sites, the actual potential to be bound by the corresponding RBP has to be shown. But even for experimentally determined binding sites, the demonstrated event of binding does not yet mean that this binding brings about a regulation of the target gene. Accordingly, a method is required with which both binding and the regulatory effect of this binding can be demonstrated. For this purpose, we use a GraphProt model to introduce mutations that weaken or inactivate binding of the RBP into the binding sites to be tested. The efficacy of a previously bound and regulatorily active binding site can then be determined experimentally by evaluating the effect of a loss of binding caused by the introduced mutations. In this way, both binding and regulatory activity of the binding sites we predicted could be demonstrated.

Publications

Publications covered by this thesis

• Maticzka, D., Lange, S. J., Costa, F., and Backofen, R. GraphProt: modeling binding preferences of RNA-binding proteins. Genome Biology, 15(1):R17, 2014.

• Ferrarese, R., Harsh IV, G. R., Yadav, A. K., Bug, E., Maticzka, D., Reichardt, W., Dombrowski, S. M., Miller, T. E., Masilamani, A. P., Dai, F., Kim, H., Hadler, M., Scholtens, D. M., Yu, I. L. Y., Beck, J., Srinivasasainagendra, V., Costa, F., Baxan, N., Pfeifer, D., v. Elverfeldt, D., Backofen, R., Weyerbrock, A., Duarte, C. W., He, X., Prinz, M., Chandler, J. P., Vogel, H., Chakravarti, A., Rich, J. N., Carro, M. S., and Bredel, M. Lineage-Specific Splicing of a Brain-Enriched Alternative Exon Promotes Glioblastoma Progression. Journal of Clinical Investigation, 124(7):2861–2876, 2014.

• Ilik, I. A., Quinn, J. J., Georgiev, P., Tavares-Cadete, F., Maticzka, D., Toscano, S., Wan, Y., Spitale, R. C., Luscombe, N., Backofen, R., Chang, H. Y., and Akhtar, A. Tandem Stem-Loops in roX RNAs Act Together to Mediate X Dosage Compensation in Drosophila. Molecular Cell, 51(2):156–73, 2013.

• Lange, S. J.∗, Maticzka, D.∗, Möhl, M., Gagnon, J. N., Brown, C. M., and Backofen, R. Global or local? Predicting secondary structure and accessibility in mRNAs. Nucleic Acids Research, 40(12):5215–26, 2012.

Further publications

• Aktas, T., Ilik, I. A., Maticzka, D., Bhardwaj, V., Rodrigues, C. P., Mittler, G., Manke, T., Backofen, R., and Akhtar, A. DHX9 suppresses spurious RNA processing defects originating from the Alu invasion of the human genome. Nature, accepted

∗Joint first authors


• Ilik, I. A., Maticzka, D., Georgiev, P., Backofen, R., Akhtar, A. A hidden RNA-affinity switch provides a rare glimpse into RNA helicase function in vivo. submitted

• Rehfeld, F., Maticzka, D., Grosser, S., Eravci, M., Vida, I., Backofen, R., Wulczyn, F.G. miR-128 and its host gene Arpp21 functionally interact as antagonistic regulators of post-transcriptional gene expression in neurons. submitted

• Preusse, M., Marr, C., Saunders, S., Maticzka, D., Lickert, H., Backofen, R., and Theis, F. SimiRa: A tool to identify coregulation between microRNAs and RNA-binding proteins. RNA Biology, 12(9):998–1009, 2015.

• Richter, H., Zoephel, J., Schermuly, J., Maticzka, D., Backofen, R., and Randau, L. Characterization of CRISPR RNA processing in Clostridium thermocellum and Methanococcus maripaludis. Nucleic Acids Research, 40(19):9887–96, 2012.

Chapter 1

Introduction

We are but whirlpools in a river of ever-flowing water. We are not stuff that abides, but patterns that perpetuate themselves.

Norbert Wiener, The Human Use of Human Beings: Cybernetics and Society

The patterns of life

Life on Earth exhibits complex traits such as reproduction, appropriate reaction to environmental stimuli or organisation into multicellular organisms. These traits require a system to store, process and reproduce information. In 1952 the molecular basis of this system was discovered by Alfred D. Hershey and Martha Chase with the identification of deoxyribonucleic acid (DNA) as the genetic material of the T2 phage [Hershey and Chase, 1952]. In the following year, James D. Watson and Francis Crick proposed a structure for DNA where two helical chains of nucleotides are held together by complementing pairs of nucleobases, the DNA double helix [Watson and Crick, 1953]. Assuming that only two specific pairs of bases can be formed, the sequence of bases in one chain would determine the sequence of bases in the other chain, suggesting "a copying mechanism for the genetic material" [Watson and Crick, 1953]. In the same year, James D. Watson hypothesized that protein synthesis may be mediated by ribonucleic acid (RNA) intermediates, subsequently called messenger RNAs (mRNAs), that form the connecting agent between genes encoded in DNA and the corresponding proteins [Watson, 1963; Gesteland et al., 2006]. The validation of his hypothesis confirmed RNA as a central but mostly passive part in the information processing of cells. A more prominent role of RNA arose after the discovery of its catalytic properties [Kruger et al., 1982; Guerrier-Takada et al., 1983]. Since RNA is able to catalyse the synthesis of new RNA molecules, early self-replicating systems could have been composed solely of RNA. In 1986 Walter Gilbert coined the term "RNA world" [Gilbert, 1986] for such a precursor system.

Sequencing of the human genome was concluded in 2002, enabling for the first time a near complete view of the human genetic material. Analysis of the human genome set the number of protein-coding genes a little higher than 25,000 [Venter et al., 2001; International Human Genome Sequencing Consortium, 2001]. Estimates made one decade earlier had been as high as 100,000 genes [Fields et al., 1994]. The number of different proteins encoded by the human genome, however, is currently estimated at about 80,000 [Wilhelm et al., 2014]. This surplus is brought about by widespread alternative splicing [Black, 2003], a process that increases the number of different proteins that can be built from a single gene. The number of genes or encoded proteins, however, may be very similar between organisms of apparently very different complexities, indicating that it is not sufficient to account for differing complexities on the organismal level [Nilsen and Graveley, 2010]. Given that the number of protein-coding genes is not a good indicator of complexity, an alternative mechanism had to be found. The degree of regulation of the gene products was identified as a promising candidate for explaining differences in complexity.

Regulation of gene expression, or gene regulation for short, covers the orchestration of all steps of the DNA-mRNA-protein pathway. Regulation at the level of DNA, transcriptional regulation, determines which genes, or groups of genes, are expressed and to what extent [Kornberg, 1999]. Regulation at the level of RNA, post-transcriptional regulation, governs all steps of the RNA life cycle, including splicing, transport, translation and eventual degradation [Morris et al., 2010].

Interest in the regulatory roles of RNA increased after the discovery that large parts of the human genome are transcribed to RNA. The vast majority of these RNAs, about 98%, do not encode proteins [Mattick and Makunin, 2005]. Initially thought by many to be just noise, many of these non-coding RNAs (ncRNAs) were found to be just as stable as mRNAs [Mattick and Makunin, 2005]. The discovery of a large number of regulatory non-coding RNAs pointed to large layers of post-transcriptional control that were previously hidden [Mattick, 2003]. It has been shown that the amount of non-coding sequence contained in total genomic DNA increases consistently with the complexity of an organism [Prasanth and Spector, 2007]. The number of ncRNAs, and concomitantly the degree and complexity of regulation, is now thought to be the main determinant of life's complexity.

Currently, there are at least 45 known classes of regulatory ncRNAs [Cech and Steitz, 2014]. Many ncRNAs convey their biological function via their structure [Cech and Steitz, 2014]. The functions of these structures are manifold: structured RNA may act as a catalytic core [Noller et al., 1992], scaffold [Deng and Meller, 2006] or recognition element for other molecules [Cech and Steitz, 2014]. The presence of RNA structure may also obstruct interactions with other molecules [Warf and Berglund, 2010]. Most ncRNAs do not operate in isolation but as complexes of RNAs and proteins [Cech and Steitz, 2014]. A large fraction of human proteins has been found to have an RNA-binding capability [Ule, 2014] required for the formation of these complexes. RNA-binding proteins (RBPs) are known to govern all aspects of the RNA life cycle [Morris et al., 2010]. While target recognition may be mediated by additional interactions with non-coding RNAs such as sRNAs [Møller et al., 2002] or miRNAs [Chi et al., 2009], many RBPs bind to their targets without additional RNAs [Ray et al., 2013].

These insights finally relegated DNA to the role of a mostly passive information storage system which is merely acted upon and in turn moved RNA, together with its structure, to the centre of attention as a versatile and active participant in the majority of regulatory processes. The discovery and description of the regulatory networks governing life has become an important task of current biological and medical research.

Overview of this thesis

Most of my work presented in this thesis was performed in collaboration with various lab members and external partners and spans the domains of bioinformatics, biology and medicine. The utility of my contributions was established during many hours of "ex-silico" experiments performed by my collaboration partners. To take into account the team-based nature of my work and to enable a consistent reading experience, I use the pronoun "we" throughout the remainder of this thesis. A detailed statement of contributions is provided in Appendix A.

Regulatory networks manifest via a multitude of interactions within and between molecules. The detection of these interactions on a genome-wide scale can be achieved with modern high-throughput methods. These methods rely on bioinformatics at all stages of an experiment, starting with its initial conception and ending with the interpretation of the experimental data within the biological context. The need for efficient bioinformatics methods, however, does not end there. Since individual experiments are specific to the tested conditions and cell types, they can only yield snapshots of the dynamic regulatory processes under consideration. Here, computational methods that are able to generalize from a set of detected interactions can be an efficient alternative to a large number of experiments under varying conditions.

In this thesis, we present methods both for the interpretation of high-throughput data and for their integration into computational models.

• We investigated algorithms for folding long RNAs. To this end, we performed a thorough analysis of folding algorithms for predicting structured regulatory elements and showed that fast local folding algorithms yield the best general predictions. We created an improved algorithm for folding long RNAs. [Lange et al., 2012]

• We created an integrated pipeline for analysing iCLIP experiments. The improved processing of unique molecular identifiers achieved a pronounced reduction of false positives. This allowed the precise detection of targets and binding patterns of the tested RBPs, which in turn enabled further experiments showing the impact of specific RNA structural elements on binding of the tested RBPs. [Ilik et al., 2013]

• We created GraphProt, a framework for training computational models of RBP binding preferences that allows modelling the impact of explicit RNA secondary structures on RBP binding. We benchmarked GraphProt on a large set of published in-vivo binding sites, positioning GraphProt as the new state of the art. We performed the first large-scale analysis of structural preferences of RBPs based on in-vivo data and showed that models of RBP binding preferences are suitable and necessary for the discovery of target sites missed by an experiment. [Maticzka et al., 2014]

• We created an algorithm for model-based RBP binding-site validation. Using a GraphProt model trained on in-vivo binding sites, we first predicted missing binding sites and then designed minimal sequence alterations meant to specifically weaken or disable binding to these sites. Reduced protein activity for the modified sites was then shown experimentally, providing evidence of direct binding of the protein to the predicted targets. [Ferrarese et al., 2014]

In the remaining parts of this chapter, we introduce the two molecules at the centre of post-transcriptional regulation: RNA and RNA-binding proteins. We conclude this chapter with an overview of the methods used to estimate and compare the performance of predictive methods throughout this thesis.

1.1 RNA

Ribonucleic acid (RNA) is a polymeric molecule consisting of covalently bound monomers, the nucleotides. Each RNA nucleotide is composed of a ribose sugar, a phosphate group and one of the four nucleobases adenine (A), cytosine (C), guanine (G) and uracil (U). The polymerization of RNA results in a thin non-branching chain of nucleotides connected via a sugar-phosphate backbone. In consequence, an RNA molecule can be described via its sequence of nucleotides, given as the string of nucleobase identifiers A, C, G and U. This sequence, by convention given in the 5'-3' direction of the ribose backbone, constitutes the RNA primary structure [Madison, 1968]. The RNA sequence determines the spatial conformation of an RNA molecule, which is described as RNA secondary and tertiary structure. RNA secondary structure is a set of base pairs, pairs of nucleobases connected via hydrogen bonds [Cox and Littauer, 1959; Cox, 1966]. Common base pairs formed by RNA are C-G, A-U, and G-U. The secondary structure is the basis of the precise spatial conformation of the molecule, the tertiary structure [Tinoco and Bustamante, 1999; Cruz and Westhof, 2009].
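The description of a secondary structure as a set of base pairs over a sequence can be made concrete with a small sketch. It assumes the widely used dot-bracket notation as input format, which is not introduced in the text above; the function names are hypothetical and chosen for illustration only.

```python
# Illustrative sketch: the dot-bracket input format and all names are
# assumptions made for this example, not definitions from the thesis.
CANONICAL_PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A"),
                   ("G", "U"), ("U", "G")}


def parse_dot_bracket(structure):
    """Return the set of base pairs (i, j) with i < j (0-based positions)."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)          # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {pos}")
            pairs.add((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError("unbalanced '('")
    return pairs


def is_valid_secondary_structure(sequence, pairs):
    """Check that each nucleotide pairs at most once and that every pair
    is one of the common pairs C-G, A-U or G-U."""
    used = set()
    for i, j in pairs:
        if i in used or j in used:
            return False
        if (sequence[i], sequence[j]) not in CANONICAL_PAIRS:
            return False
        used.update((i, j))
    return True


sequence = "GGGAAAUCC"    # primary structure, 5' to 3'
structure = "(((...)))"   # a hairpin: three stacked pairs, three unpaired bases
pairs = parse_dot_bracket(structure)
print(sorted(pairs))
print(is_valid_secondary_structure(sequence, pairs))
```

Storing the structure as a set of index pairs keeps it independent of any particular notation, so the same validity check applies to structures obtained from a prediction tool or from experimental data.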

1.1.1 Global and local RNA structure

The biological functions of RNAs are enabled by both their sequence and structure. In this thesis we distinguish between global and local RNA structure. Global structure refers to settings where sequence and structure of the whole molecule are relevant for its function. In contrast to this, local RNA structure is only concerned with the structure of localized parts of a molecule.

Examples of RNAs whose functionality depends on the sequence and structure of the whole molecule are ribosomal RNAs and tRNAs. Ribosomal RNAs (rRNAs) form large complexes, the ribosomes, that translate mRNAs to proteins. As such, ribosomal RNAs constitute a molecular assembler, a complex molecular machine that depends on a specific three-dimensional shape to accomplish its function.

With other RNAs, most notably mRNAs and long non-coding RNAs (lncRNAs), sequence and structure of localized parts of the molecule may have specific functions. Examples of these structural elements are self-splicing elements [Kruger et al., 1982], bistable elements such as riboswitches [Mandal and Breaker, 2004] and RNA thermometers [Chowdhury et al., 2006], the Shine-Dalgarno sequence [Shine and Dalgarno, 1974] and miRNA, siRNA and RBP binding sites [Cech and Steitz, 2014]. These elements are referred to as cis-regulatory elements, the Latin prefix cis meaning "on this side of".

1.1.2 Prediction of RNA secondary structure Computational prediction of RNA structure is a necessary and workable tool in uncovering the function of RNAs. Recent experimental methods can determine RNA structure on a genome-wide scale. Accessibility of individual nucleotides can be determined in-vitro and in-vivo [Wan et al., 2011]. More recent experi- mental methods also allow determination of RNA structure on the level of base pairs [Lu et al., 2016]. In both cases, these data require computational predic- tion to derive probable structures from the measurements [Lorenz et al., 2016]. Prediction of RNA secondary structures, however, is feasible on a genome-wide scale. Since computational prediction does not rely on the existence of actual molecules it can be efficiently used to design novel RNAs [Busch and Backofen, 2006; Honer Zu Siederdissen et al., 2013; Kleinkauf et al., 2015]. In this thesis we cover the prediction of secondary structures given a sequence. A secondary structure consists of a set of base pairs. These have the beneficial property that each nucleotide can pair with exactly one other nucleotide. In this context RNA structure prediction can be framed as a

5 1. Introduction discrete optimization problem. Here, two challenges arise: First, unconstrained prediction of RNA secondary structure is NP-complete [Akutsu and Tatsuya, 2000; Lyngsø and Pedersen, 2000]. Second, the number of possible RNA secondary structures grows exponentially with sequence length [Stein and Waterman, 1979; Hofacker et al., 1998]. The most widely used methods for determining RNA secondary structure given its sequence are thermodynamic approaches [Schroeder, 2009], selecting for structures with low Gibbs free energy, assumed to be the most stable. Early approaches incorporated stabilizing contributions of base pairs and destabilizing contributions of loops and bulges [Tinoco et al., 1971] to determine the free energy of RNA secondary structures. Later models replaced the contributions of individual base pairs with stabilizing contributions of stacks of base pairs, found to contribute the major stabilizing effect for the RNA helices [DeVoe and Tinoco, 1962]. This type of energy model, the nearest neighbour energy models [DeVoe and Tinoco, 1962; Tinoco et al., 1973; Borer et al., 1974], is used by all structure prediction algorithms presented in this work. Early algorithms that automated structure prediction were prohibitively slow [Pipas and McMahon, 1975] or could not include the stabilizing and destabilizing effects of stacks and loops [Nussinov et al., 1978;Nussinov and Jacobson, 1980]. The first fast algorithm able to incorporate these effects was developed by Michael Zuker and colleagues [Zuker and Stiegler, 1981]. The Zuker algorithm determines the minimum free energy (MFE) structure according to a nearest neighbour energy model. Implementations of this algorithm are widely used today [Markham and Zuker, 2008; Lorenz et al., 2011]. In many cases it may not be sufficient to determine a single optimal struc- ture. 
Determining lists of suboptimal structures [Wuchty et al., 1999] does not solve this issue because of the abundance of structures near the optimal energy, even for short sequences [Zuker and Stiegler, 1981; Zuker and Sankoff, 1984; McCaskill, 1990]. Furthermore, the assumption of a single correct solution may not be valid for many RNAs. An elegant solution to this issue was developed by McCaskill and colleagues [McCaskill, 1990]. The McCaskill algorithm can be used to calculate the probabilities of RNA secondary structures at thermodynamic equilibrium. This is achieved via the efficient calculation of the partition function, which allows the probabilities of individual secondary structures to be determined according to the Boltzmann distribution. In addition, features concerning the whole ensemble of secondary structures can be calculated. A powerful application of the McCaskill algorithm is the visualization of structure probabilities via a dot-plot, enabling a quick overview of probable secondary structures. The dot-plot visualizes the pairing probabilities of all pairs of nucleotides, where the probability of a base pair is the sum of the probabilities of all valid structures that contain it. An even more condensed structural measure is the accessibility of nucleotides: the total probability of all structures that do not form any base pair involving a given nucleotide or stretch of nucleotides.

The kind of allowed secondary structures is a crucial decision for RNA secondary structure prediction that determines both the time and space complexity of the algorithm. Both the Zuker and the McCaskill algorithm achieve a complexity of O(n^3) time and O(n^2) space by restricting the set of structures under consideration: secondary structures must be nested and the size of internal loops is restricted [Lyngsø et al., 1999]. Also, a simplified energy model is used to calculate the energy contributions of multi-loops.
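The Boltzmann weighting behind the partition function can be illustrated with a minimal sketch. It assumes a hand-made toy ensemble: the three structures and their free energies are invented for illustration and are not the output of any folding tool. The ensemble is also enumerated explicitly, whereas the McCaskill algorithm obtains the same quantities through dynamic programming without enumeration.

```python
import math

# Gas constant in kcal/(mol*K) times the temperature 37 degrees C in Kelvin.
RT = 0.0019872 * 310.15

# Hypothetical toy ensemble: each entry is (set of base pairs, free energy in
# kcal/mol). The structures and energies are invented for illustration only.
ensemble = [
    (frozenset(), 0.0),                      # open chain
    (frozenset({(1, 10), (2, 9)}), -1.5),
    (frozenset({(2, 9), (3, 8)}), -2.0),
]


def boltzmann_probabilities(ensemble, rt=RT):
    """Return P(S) = exp(-E(S)/RT) / Z for every structure S in the ensemble."""
    weights = [math.exp(-energy / rt) for _, energy in ensemble]
    partition_function = sum(weights)
    return [weight / partition_function for weight in weights]


def base_pair_probabilities(ensemble, rt=RT):
    """Sum the probabilities of all structures that contain a given base pair."""
    probabilities = boltzmann_probabilities(ensemble, rt)
    pair_probability = {}
    for (pairs, _), probability in zip(ensemble, probabilities):
        for pair in pairs:
            pair_probability[pair] = pair_probability.get(pair, 0.0) + probability
    return pair_probability


if __name__ == "__main__":
    for pair, probability in sorted(base_pair_probabilities(ensemble).items()):
        print(pair, round(probability, 3))
```

In this toy ensemble the base pair (2, 9) occurs in two structures, so its probability is the sum of their Boltzmann probabilities, which is exactly the quantity shown in a dot-plot.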

1.2 RNA-binding proteins

Cis-regulatory elements may be recognized by trans-acting factors that are not part of the recognized molecule (the Latin prefix trans meaning "on the other side of"). A large and important class of trans-acting factors are RNA-binding proteins (RBPs). Recent studies indicate that about 15% of human cellular proteins have RNA-binding capability [Ule, 2014]. RBPs are known to regulate a plethora of post-transcriptional processes including splicing, localization, translation, degradation and stability of RNAs [Re et al., 2014]. Current efforts in analysing RBPs are largely enabled by the increased availability of high-throughput methods [Cook et al., 2015]. In this context, microarrays [Ray et al., 2009; Ray et al., 2013], high-throughput sequencing [Licatalosi et al., 2008] as well as quantitative mass spectrometry [Baltz et al., 2012; Castello et al., 2012] have been applied to determine RNA-protein interactions and to elucidate their various binding mechanisms.

1.2.1 Properties of RNA-binding proteins

Over 1,500 of the proteins encoded in the human genome are known RBPs or contain annotated RNA-binding domains [Ascano et al., 2013]. Recently, two studies detected 860 mRNA-bound proteins in HeLa cells [Baltz et al., 2012] and 797 mRNA-bound proteins in HEK293 cells [Castello et al., 2012]. Among these, 554 proteins were identified by both approaches. Surprisingly, 341 of the detected proteins were not previously known or annotated to have RNA-binding capability [Ascano et al., 2013], adding over 300 proteins to the repertoire of human RBPs. More recently, a manually curated census of 1,542 RBPs has been presented [Gerstberger et al., 2014]. In total, about 7.5% of the approximately 20,500 protein-coding genes in humans encode proteins that bind to RNA [Gerstberger et al., 2014]. The similar numbers of about 1,500 human RBPs [Gerstberger et al., 2014] and 1,400 human transcription factors [Vaquerizas et al., 2009] indicate that the complexity of post-transcriptional regulation could match that of transcriptional regulation.

Most RBPs derive their RNA-binding capability from a set of RNA-binding domains (RBDs) that are reused by different RBPs in a modular fashion [Lunde et al., 2007]. For that reason RBPs can be identified by scanning for known RBDs. The PFAM database [Finn et al., 2010] contains models of at least 800 distinct RBDs [Gerstberger et al., 2013]. Some of the most common RNA-binding domains are the RRM, KH, SAM and dsRBD domains [Lunde et al., 2007]. While each of these domains confers different binding preferences, it is thought to be the combination of multiple RBDs that allows the binding mode of RBPs to be tuned to diverse tasks. In effect, one RBP might recognize a particular target with high specificity while another may be tuned to bind a large number of sequences [Lunde et al., 2007].

Binding of RBPs to their targets can be influenced by sequence-specific as well as sequence-unspecific components [Draper, 1999; Gupta and Gribskov, 2011]. The majority of sequence-specific RBPs tend to prefer single-stranded binding sites [Messias and Sattler, 2004; Auweter et al., 2006; Ray et al., 2009]. Binding to double-stranded RNA is mostly mediated by the double-stranded RNA-binding domain (dsRBD), carried by at least 645 proteins [Re et al., 2014]. Although dsRBDs were initially thought to bind independently of RNA sequence, recent experiments show that they can also bind directly in a sequence-specific manner [Stefl et al., 2010; Masliah et al., 2013].

Many studies analysing the binding of RBPs have focused on proteins binding to specific sequence or structure motifs. Many RBPs, however, lack specific patterns of sequence or structure in their binding sites. These proteins are generally considered non-specific RBPs [Guenther et al., 2013]. Here, specificity to certain targets can be conveyed by mechanisms such as cell type-specific expression or localization of the RBPs and target RNAs, clusters of bound sites on the RNA or the combination of multiple RBDs within a single RBP [Singh and Valcárcel, 2005]. These results indicate that in order to comprehensively understand how an RBP conveys its function, the regulatory context of an RBP and its targets must be considered [Singh and Valcárcel, 2005; Jens and Rajewsky, 2015].

1.2.2 Experimental identification of RNA-protein interactions

Ultraviolet crosslinking and immunoprecipitation (CLIP) is a method for detecting in-vivo binding sites of RBPs. CLIP was introduced in 2003 by Jernej Ule and colleagues [Ule et al., 2003; Ule et al., 2005]. Here, irradiation of cells with ultraviolet light leads to the formation of covalent bonds between RNAs and proteins in direct contact, allowing for rigorous purification of protein-RNA complexes. An extension of the CLIP protocol, high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation (HITS-CLIP), allows the determination of genome-wide maps of in-vivo protein-RNA interactions [Licatalosi et al., 2008].


Crosslinking-based methods combined with high-throughput sequencing (CLIP-seq) are today the most widely employed systems for the detection of in-vivo binding sites. Since 2008 the number of CLIP-seq data sets submitted yearly to the GEO database [Barrett et al., 2013] has steadily risen to over 40 in 2013 [Reyes-Herrera and Ficarra, 2014]. The two most widely used extensions of the original CLIP-seq protocol [Reyes-Herrera and Ficarra, 2014], PAR-CLIP [Hafner et al., 2010] and iCLIP [König et al., 2010], allow for more exact determination of binding sites. In addition, a variety of different crosslinking-based methods have been developed, often with specialized domains of application. Examples include iPAR-CLIP [Jungkamp et al., 2011], gPAR-CLIP [Freeberg et al., 2013], CLASH [Helwak et al., 2013], hiCLIP [Sugimoto et al., 2015] and eCLIP [Van Nostrand et al., 2016]. In the remainder of this thesis we will use CLIP-seq as a general term for any CLIP-based method employing high-throughput sequencing. For the discussion of specific methods we will use the specific terms HITS-CLIP, PAR-CLIP and iCLIP.

A complementary approach to the detection of in-vivo binding sites is the identification of in-vitro binding affinities. A microarray-based approach, RNAcompete [Ray et al., 2009; Ray et al., 2013], measures relative binding affinities of RBPs in the context of a library of designed RNAs. More recently, sequencing-based methods for the determination of affinities have been introduced [Guenther et al., 2013; Zykovich et al., 2009; Buenrostro et al., 2014]. These methods are able to use arbitrary RNA libraries instead of fixed microarray designs, offering the possibility to eventually determine affinities in-vivo [Buenrostro et al., 2014].

1.3 Performance evaluation of predictive methods

In this work, we evaluate the performance of predictive methods by comparison to reference or gold standard datasets. With the exception of the bp-accuracy, defined and used in Chapter 2, this task is achieved by evaluating predictions in the context of classification tasks, more specifically in the context of binary classification. In binary classification, a predictive method is tasked to assign one of two class labels, usually named positive and negative, to the tested instances.

In this section we first present the basic measures of classifier performance: Precision, True Positive Rate (TPR) and False Positive Rate (FPR). These measures are also called single-threshold measures: given a real-valued score such as a probability, a single threshold for deciding between positive and negative labels has to be selected. We conclude with the presentation of two threshold-free measures of classifier performance, the Receiver Operating Characteristic (ROC) curve and the Precision-Recall Curve (PRC).

We evaluate both the performance of RNA structure predictions and the performance of models of RBP binding preferences in binary classification


                        Reference
                        Positive    Negative
Prediction   Positive   TP          FP
             Negative   FN          TN

Figure 1.1: Template of a confusion matrix. A confusion matrix is a table with two rows and two columns that reports the number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).

[Figure 1.2 shows two annotated confusion matrices. a) ROC: TPR (recall) = TP / (TP + FN) and FPR = FP / (FP + TN). b) PR: TPR (recall) = TP / (TP + FN) and PPV (precision) = TP / (TP + FP).]

Figure 1.2: Classification scores used for the calculation of ROC and Precision-Recall curves. a) ROC employs True Positive Rate (Equation 1.1) and False Positive Rate (Equation 1.2). b) Precision-Recall (PR) employs Precision (Equation 1.3) and Recall (Equation 1.1).

settings. Given a reference set with known labels, one can distinguish four different outcomes. When the predicted label agrees with the known one, the prediction is a True Positive (TP) or a True Negative (TN). A mismatch between the predicted and reference labels represents a classification error. A known negative instance that is wrongly classified as positive is called a False Positive (FP). Conversely, a known positive instance that is classified as negative is called a False Negative (FN). The outcomes for the tested instances from the reference set are commonly summarized in a confusion matrix (Figure 1.1) by counting the occurrences of each case.

A variety of scores reflecting various aspects of classifier performance can be derived from the counts summarized in the confusion matrix. Here, we define the classification scores that are used by ROC and PRC: Precision, False Positive Rate (FPR) and True Positive Rate (TPR).

True Positive Rate (TPR), also called Recall or Sensitivity, is the fraction of known positives that are correctly identified as such.

\[ \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{1.1} \]

False Positive Rate (FPR) is the fraction of known negatives that are misidentified as positives.

\[ \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} \tag{1.2} \]

Precision is the fraction of instances correctly identified as positive among all instances predicted as positive.

\[ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \tag{1.3} \]

Figure 1.3: ROC and Precision-Recall (PR) curves. a) The ROC curve plots True Positive Rate (Equation 1.1, y-axis) versus False Positive Rate (Equation 1.2, x-axis) in the context of varying discrimination thresholds. These ROC curves compare the classification performances of GraphProt, RNAcontext and MatrixREDUCE models for PTB HITS-CLIP (Chapter 4). b) The Precision-Recall curve plots Precision (Equation 1.3, y-axis) versus Recall (Equation 1.1, x-axis) in the context of varying discrimination thresholds. These Precision-Recall curves compare the classification performances of GraphProt and RNAcontext models for the ELAVL1 RNAcompete data (Chapter 4).

Many classifiers return real-valued scores, for example probabilities or SVM margins, instead of labelling instances as one of two classes. The performance of these classifiers can be evaluated using the basic scores defined above by selecting a threshold that serves as the boundary between positive and negative instances. The choice of appropriate thresholds is usually motivated by the envisioned application scenario of the classifier. It is, however, not clear how to set thresholds in order to compare the performance of different classifiers. A broader view of classifier performance over the full range of discrimination thresholds can be achieved by calculating ROC and PRC. These also enable the performance comparison of different classifiers. A classifier can be said to outperform another if the corresponding curve dominates, meaning that all other curves are beneath it or equal to it. By calculating the area under the curves one can generate single scores, the Area Under the ROC curve (AUC) and the Average Precision (APR), that indicate the overall performance of the classifier.

The Receiver Operating Characteristic (ROC) curve plots False Positive Rate (FPR) on the x-axis and True Positive Rate (TPR) on the y-axis for all decision thresholds (Figures 1.2 a and 1.3 a). With ROC, the overall performance of a classifier is summarised by the Area Under the ROC curve (AUC). The AUC score has a probabilistic interpretation. Given a randomly chosen positive instance and a randomly chosen negative instance, the AUC corresponds to the probability that the classifier will rank the positive instance higher than the negative instance [Fawcett, 2006]. For that reason, only AUC scores > 0.5 correspond to classifiers with performance better than a random choice of labels.

ROC curves are widely used to evaluate the performance of diagnostic procedures [Swets, 1988], allowing for straightforward interpretation by a general audience. On the downside, ROC curves are known to be less reliable than other performance measures when working with datasets that are imbalanced with regard to the number of positive and negative instances [Saito and Rehmsmeier, 2015]. For that reason, we use the Precision-Recall (PR) curve and its associated score, the Average Precision (APR), to evaluate classifier performance on imbalanced data.
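The single-threshold measures of Equations 1.1 to 1.3 and the ranking interpretation of the AUC can be illustrated with a short sketch. The scores and labels below are invented toy data, the function names are hypothetical, and the code assumes that both classes occur in the data. The AUC is computed directly as the fraction of positive-negative pairs that are ranked correctly (ties counted as one half), which is a simple quadratic-time illustration rather than the way evaluation tools usually compute it.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN and TN when scores >= threshold are called positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def single_threshold_measures(tp, fp, fn, tn):
    """Return (TPR, FPR, Precision) as defined in Equations 1.1 to 1.3."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Precision is undefined when nothing is predicted as positive.
    precision = tp / (tp + fp) if tp + fp > 0 else float("nan")
    return tpr, fpr, precision


def auc_by_ranking(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                wins += 1.0
            elif positive_score == negative_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented toy data: classifier scores and reference labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

if __name__ == "__main__":
    counts = confusion_counts(scores, labels, threshold=0.5)
    print(counts, single_threshold_measures(*counts))
    print(auc_by_ranking(scores, labels))
```

Moving the threshold changes the confusion counts and therefore the point on the ROC and PR curves, whereas the ranking-based AUC does not depend on any threshold.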
The Precision-Recall curve plots Precision (Equation 1.3) on the y-axis versus Recall, i.e. the True Positive Rate (Equation 1.1), on the x-axis for all decision thresholds (Figures 1.2 b and 1.3 b). The Precision-Recall curve is closely related to the ROC curve: there is a one-to-one mapping between points in ROC and PR space. Based on this result it was shown that a curve dominates (meaning that all other curves are beneath or equal to it) in ROC space if and only if it dominates in PR space [Davis and Goadrich, 2006].

As described above, estimating the performance of a binary classifier requires the evaluation of a reference set with known class labels. To yield an accurate estimate of the performance on unknown data, different data must be used during training and evaluation of a model. If the same instances were used during both phases, the model could simply repeat the class labels already seen during the training phase. Such a model may achieve high scores during evaluation but may fail to generalize and thus not achieve good classification performance on novel data.

One possible solution is to create independent datasets for training and testing as part of the experimental design. This strategy is realized by RNAcompete [Ray et al., 2009; Ray et al., 2013]. Here, two independent experiments using different sequence libraries were conducted for each RBP. Consequently, the results of one experiment can serve for training and the results of the other for testing, and vice versa. The results of the two tests can then be averaged to gain a more stable estimate. Since this approach only uses half of the available data for training, it is not suitable for cases where training data is scarce.

In many cases, there is only one set of data available for training and testing. If this data can be split arbitrarily into training and test sets, performance can be estimated using k-fold cross-validation. Here, the data is divided into k segments of similar size. A model is then trained on the instances from k − 1 segments and evaluated on the remaining segment. After conducting k train-and-test experiments, testing a different segment each time, the results are averaged.
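The k-fold procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `train` and `evaluate` are hypothetical callables supplied by the user, the toy threshold classifier and the data are invented, and the random assignment of instances to folds is only one of several possible splitting schemes.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Shuffle the instance indices and deal them into k folds of similar size."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def cross_validate(instances, labels, k, train, evaluate, seed=0):
    """Train on k-1 folds, evaluate on the held-out fold, average the k scores."""
    folds = k_fold_indices(len(instances), k, seed)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        model = train([instances[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        scores.append(evaluate(model,
                               [instances[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / k


# Hypothetical toy model: a threshold halfway between the two class means.
def train_threshold(xs, ys):
    positives = [x for x, y in zip(xs, ys) if y == 1]
    negatives = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def accuracy(threshold, xs, ys):
    return sum((x >= threshold) == (y == 1) for x, y in zip(xs, ys)) / len(xs)


# Invented, well-separated toy data.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

if __name__ == "__main__":
    print(cross_validate(xs, ys, k=4, train=train_threshold, evaluate=accuracy))
```

Every instance is used for testing exactly once and for training k − 1 times, which is why this scheme makes better use of scarce data than a single fixed split.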


Chapter 2

Predicting the local structure of mRNAs

2.1 Introduction

In recent years, our perception of RNA has seen a strong shift from its role as a messenger to its roles in the regulation of a plethora of cellular processes. Here, RNA regulatory functions are often guided by its structural conformation. For example, local structures in messenger RNA (mRNA) can regulate gene expression. In this chapter, we mainly focus on determining and enhancing the performance of computational approaches for the prediction of local structural elements of mRNAs.

Many existing methods of experimental and computational structure determination have concentrated on regulatory non-coding RNA (ncRNA) [Gorodkin and Hofacker, 2011; Griffiths-Jones et al., 2005; Andronescu et al., 2008]; notable examples are transfer RNA, ribosomal RNA, small nucleolar RNA, microRNA, and small interfering RNA. In comparison, little research was dedicated to the more challenging task of elucidating the structural properties of mRNA. This is surprising, since a vast number of cis-regulatory structures [Jacobs et al., 2009], e.g. riboswitches [Breaker, 2008], iron response elements (IRE) [Stevens et al., 2011], internal ribosome entry sites (IRES) [Mokrejs et al., 2010], and selenocysteine insertion sequences (SECIS) [Walczak et al., 1996], are located on mRNA transcripts, predominantly in the untranslated regions. Recently, experimental approaches for transcriptome-wide enzymatic structural probing were introduced [Kertesz et al., 2010; Underwood et al., 2010]. Going beyond individual structures, more general metrics such as folding energy or accessibility were associated with translational efficiency [Kudla et al., 2009; Tuller et al., 2010], the viability of protein-binding sites [Hiller et al., 2007; Li et al., 2010], and the efficacy of small ncRNA target sites [Kertesz et al., 2007; Tafer et al., 2008; Hausser et al., 2009; Hong et al., 2009; Richter et al., 2010; Kiryu et al., 2011]. These metrics are also the basis of many current algorithms for the detection of mRNA targets of small ncRNAs [Busch et al., 2008; Marín and Vaníček, 2011; Kertesz et al., 2007] and RNA-binding proteins [Hiller et al., 2006; Kazan et al., 2010].

As experimental data on mRNA structure are scarce, research into post-transcriptional regulation is greatly enhanced by the use of predicted mRNA structures. The classical algorithms for RNA secondary structure prediction are global approaches that determine the minimum free energy (MFE) structure [Zuker and Stiegler, 1981] or the Boltzmann ensemble of all possible structures calculated by the partition function method [McCaskill, 1990]. In global folding there is no restriction on the span of base pairs and structures are considered for the entire RNA molecule. This approach is implemented in e.g. RNAfold [Hofacker et al., 1994], UNAfold (formerly known as mfold) [Markham and Zuker, 2008], and RNAstructure [Reuter and Mathews, 2010]. A major challenge in global folding is the correct prediction of long-ranging base pairs [Doshi et al., 2004]. Furthermore, the global folding approach is cubic in time, reduced to quadratic on average for MFE predictions [Backofen et al., 2009]. Therefore, it is too slow for genome-wide applications. Moreover, the mRNA is translated and regulated by a plethora of molecules binding to it; these can influence its global conformation. Hence, probable local structures might be more relevant for regulatory function.

Some local folding approaches have been proposed to account for these challenges: (i) Structures are kept local by restricting the maximum distance allowed between the two nucleotides that form a base pair, e.g. in RNALfold [Hofacker et al., 2004], Rfold [Kiryu et al., 2008] and Raccess [Kiryu et al., 2011]. (ii) A window-based approach that further accommodates the uncertainty in global structure caused by multiple stabilising and destabilising factors was developed and implemented in RNAplfold [Bernhart et al., 2006; Bernhart et al., 2011].
The runtime of all local folding algorithms is linear with respect to sequence length, so they are easily applicable on a genome-wide scale.

In this work, we focused on three major unresolved problems involving the secondary structure prediction of mRNAs: (i) No comprehensive comparison of the performance of global versus local folding exists. (ii) Local approaches require the user to set additional parameters such as the base-pair span and window size, which cannot be easily determined from experimental data or biophysical principles. Moreover, an in-depth qualitative investigation of the locality parameters is still required. (iii) To detect cis-regulatory elements in predicted base-pair probabilities, a quality measure for the stability of the structural element within a greater context is needed.

The comparison of methods requires data of high-quality structures. For benchmarking accessibility (i.e. single-strandedness of nucleotides), we used recently available transcriptome-wide structural probing data [Kertesz et al., 2010]. This data, however, does not provide explicit information on base pairs, which is required to locate structured cis-regulatory elements. Structural information on these elements is stored in the Rfam database [Griffiths-Jones et al., 2005; Gardner et al., 2011], which we filtered and processed to optimise structural integrity. As a result, we had two benchmarking datasets covering both aspects of secondary structure, namely base pairing and single-strandedness.

We introduced suitable measures for determining and comparing the quality of structure prediction. Subsequently, we used our benchmark datasets to perform the first comprehensive study of the qualitative differences between global and local approaches. For local folding, we assessed the two parameters of locality: the maximum base-pair span and the window size. We identified optimal parameter settings for our benchmark data and analysed the relation between the parameters. We identified artefacts introduced by window borders and present a new method to reduce these effects.

Previous investigations of the locality parameters were centred around specific applications. For example, Tafer and colleagues evaluated effects of accessibility on the efficacy of small interfering RNA interactions [Tafer et al., 2008]. The folding parameters that achieved the most significant results, a window size of 80 nucleotides and a maximum base-pair span of 40 nucleotides, were subsequently used as standard values for local secondary structure predictions [Kazan et al., 2010; Li et al., 2010; Marín and Vaníček, 2011]. Similar analyses were performed in [Shao et al., 2006; Kiryu et al., 2011]. A window size equal to the maximum base-pair span was used in [Kiryu et al., 2008], and it is also the default setting in RNAplfold. Our benchmark analysis showed that these previously used parameters performed poorly.

As an additional result of this research, we introduce LocalFold, a method that reduces the detrimental effects of artificial window borders and that produced more robust mRNA secondary structure predictions on curated benchmark datasets compared to other available tools.

2.2 Prediction and evaluation of local RNA structures

In this section we first summarize current global and local folding methods for the prediction of secondary structures. We define structure accuracy, the performance measure subsequently used to evaluate the predictive performance of the selected algorithms. We then perform an exploratory evaluation of local folding parameters and investigate biases introduced by windowed local folding. We conclude by presenting LocalFold, a windowed local folding algorithm designed to mitigate these biases. This lays the foundation for the performance evaluation of local folding algorithms presented in Section 2.3.

2.2.1 Algorithms and performance measures

Here, we first give a general overview of previously published methods for secondary structure prediction to enable a better comprehension of the results presented in this chapter. Due to their broad usage, we concentrated on partition function based approaches that, given an RNA sequence, produce probabilities or average probabilities for base pairs. Among these approaches we made a careful selection of algorithms that reflect the current status of secondary structure prediction for subsequent use to benchmark structure prediction methods (see Table 2.1). Finally, we present structure accuracy, the primary measure used for evaluating predicted local RNA structures with respect to known reference structures.

Table 2.1: Summary of the prediction methods and the benchmark datasets used in this work. L is the maximum base-pair span, W is the window size and b is the number of outermost window positions for which base pairs are ignored.

Method         Locality Parameters   Type     Output
RNAfold        -                     Global   Base-pair probabilities
Rfold          L                     Local    Base-pair probabilities
Raccess        L                     Local    Accessibilities
RNAplfold*     L, W                  Local    Average base-pair probabilities and accessibilities
LocalFold*     L, W, b               Local    Average base-pair probabilities and accessibilities

Dataset        Description
CisReg         2,500 cis-regulatory elements in 95 Rfam families, filtered and processed in this work
YeastUnpaired  Data on the single-strandedness of single positions for 3,196 Saccharomyces cerevisiae mRNAs from [Kertesz et al., 2010]

*Window-based approach

Global folding

Folding an RNA globally means that structures are predicted for the entire input sequence, for which all possible base pairs are considered. Global base-pair probabilities are predicted via a partition function algorithm that considers the entire ensemble of all possible structures weighted by their free energies [McCaskill, 1990]. These free energies are calculated with a nearest neighbour energy model using thermodynamic parameters determined by the Turner group [Mathews et al., 1999; Turner and Mathews, 2010]. The algorithm is implemented in RNAfold [Hofacker et al., 1994], RNAstructure [Reuter and Mathews, 2010], and UNAfold (formerly known as mfold) [Markham and Zuker, 2008] and has a complexity of O(n^3) time and O(n^2) space. Each implementation has different additional features that are not relevant in the scope of this work; however, they are all based on the same algorithm for computing base-pair probabilities.

We chose RNAfold from the ViennaRNA Package Version 1.8.4 as a representative of global folding. The options used in this study are RNAfold -d2 -p -noLP. For folding under the constraint of the consensus structure, we used the additional option -C. RNAfold calculates base-pair probabilities but not accessibilities. Position-wise accessibilities can be computed from the base-pair probabilities as follows:

\[ \mathit{pu}(i) = 1 - \sum_{j=1}^{n} p(i, j), \tag{2.1} \]

where pu(i) is the probability for base i to be unpaired (i.e. its accessibility); n is the length of the RNA sequence; (i, j) is a base pair between base i and base j; and p(i, j) is the probability for the base pair (i, j) according to the McCaskill algorithm [McCaskill, 1990].
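Equation 2.1 can be applied directly to a table of base-pair probabilities. The following sketch assumes that the probabilities are available as a dictionary keyed by 1-based pairs (i, j) with i < j; this data layout, the function name and the toy values are assumptions made for illustration and do not correspond to the output format of RNAfold.

```python
def unpaired_probabilities(n, pair_probability):
    """Position-wise accessibilities pu(i) = 1 - sum_j p(i, j) (Equation 2.1).

    n is the sequence length; pair_probability maps 1-based base pairs (i, j)
    with i < j to their probability. Returns a list with pu(1), ..., pu(n).
    """
    pu = [1.0] * (n + 1)  # index 0 is unused so that positions are 1-based
    for (i, j), probability in pair_probability.items():
        # A base pair (i, j) reduces the unpaired probability of both partners.
        pu[i] -= probability
        pu[j] -= probability
    return pu[1:]


# Invented toy base-pair probabilities for a sequence of length 6.
toy_bpp = {(1, 6): 0.7, (2, 5): 0.6, (1, 5): 0.2}

if __name__ == "__main__":
    for position, value in enumerate(unpaired_probabilities(6, toy_bpp), start=1):
        print(position, round(value, 3))
```

Positions 3 and 4 take part in no base pair of the toy table and therefore keep an accessibility of 1.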

Local folding

The first known approach for the prediction of stable local secondary structures was presented in [Hofacker et al., 2004]. Compared to the global approach, it introduces a maximal base-pair span L such that the predicted structure does not contain any base pair (i, j) with bp-span(i, j) > L, where

\[ \text{bp-span}(i, j) = j - i + 1, \quad i < j. \tag{2.2} \]

As a result, the predicted structures are local in the sense that they do not contain any long-range base pairs that connect distant parts of the sequence. Since this approach still folds the entire input sequence simultaneously and merely restricts the base-pair spans of the predicted structures, it can be considered semi-local. In the following analysis, we used Rfold [Kiryu et al., 2008] (base-pair probabilities) and Raccess [Kiryu et al., 2011] (accessibilities) as representatives of the local folding approach.

Windowed local folding

Windowed local folding, in addition to imposing a maximum base-pair span, predicts structures in sliding windows. The results of the windows are then averaged. This window-based approach is local in the sense that each window is folded independently of the rest of the sequence. Nevertheless, a single window is folded semi-locally as before. Approaches that predict true local structures, without the use of fixed windows, currently do not exist.

RNAplfold [Bernhart et al., 2006; Bernhart et al., 2011] is currently the most cited method for computing base-pair probabilities of local secondary structures. It also includes the maximum base-pair span parameter L. The algorithm has the same time and space complexity as Rfold, but depends on the size of W instead of L. The partition function, however, is not computed for the entire sequence (as for Rfold or Raccess), but independently for all subsequences (windows) of size W. The average probability of a base pair is derived by averaging over all windows that contain both bases:

\[ p_{avg}(i, j) = \frac{1}{|\mathcal{W}(i, j)|} \sum_{w \in \mathcal{W}(i, j)} p_w(i, j), \tag{2.3} \]

where p_w(i, j) is the base-pair probability of (i, j) in the window w and W(i, j) is the set of all windows that include the base pair (i, j). Under the assumption that windows are randomly chosen with equal probabilities, the average probability of a base pair can also be understood as an expected value. The average accessibility of a base i is

\[ \mathit{pu}_{avg}(i) = \frac{1}{|\mathcal{W}(i)|} \sum_{w \in \mathcal{W}(i)} \mathit{pu}_w(i), \tag{2.4} \]

where pu_w(i) is the position-wise accessibility of base i in window w and W(i) is the set of windows that contain base i.

LocalFold

We developed LocalFold for the purpose of investigating possible biases from window borders in windowed local folding. LocalFold is a modification of the window-based RNAplfold approach that ignores predictions at window borders. To this end, we modify the average base-pair probability of RNAplfold (Equation 2.3) to

\[ p^{b}_{avg}(i, j) = \frac{1}{|\mathcal{W}_b(i, j)|} \sum_{w \in \mathcal{W}_b(i, j)} p_w(i, j), \tag{2.5} \]

where W_b(i, j) ⊂ W(i, j) is the set of windows where base i and base j are not within the first or last b positions of the window. Window borders that coincide with the input sequence ends are exempt from the modification and are calculated as in RNAplfold. The LocalFold algorithm is applicable to all parameter combinations of W, L, and b satisfying W − L ≥ 2b. The LocalFold method is thus limited to a W that is sufficiently larger than L. The b parameter does not exclude any parts of the sequence; the filtering induced by b merely ignores the outliers in the averaging calculation (Equation 2.3). The time and space complexity stays the same as for RNAplfold [Bernhart et al., 2006; Bernhart et al., 2011]. LocalFold is available at www.bioinf.uni-freiburg.de/Software/LocalFold/.
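The averaging of Equations 2.3 and 2.5 can be sketched as follows. This is an illustrative reading of the definitions, not the RNAplfold or LocalFold implementation: it assumes that windows are all stretches of W consecutive positions sliding by one nucleotide, that per-window base-pair probabilities have already been computed and are stored in a dictionary keyed by the window start (a hypothetical layout), and that the toy probabilities are invented. Setting b = 0 gives the plain window average of Equation 2.3.

```python
def windows_containing(i, j, n, W, b=0):
    """Start positions of all windows that count for the base pair (i, j).

    Windows cover positions s..s+W-1 (1-based) within a sequence of length n.
    A window counts if neither base lies in its first or last b positions;
    window borders that coincide with a sequence end are exempt from this rule.
    """
    starts = []
    for s in range(1, max(n - W + 1, 1) + 1):
        end = min(s + W - 1, n)
        leftmost = s if s == 1 else s + b
        rightmost = end if end == n else end - b
        if leftmost <= i and j <= rightmost:
            starts.append(s)
    return starts


def average_pair_probability(i, j, n, W, window_bpp, b=0):
    """Average p_w(i, j) over the windows selected by windows_containing."""
    starts = windows_containing(i, j, n, W, b)
    if not starts:
        return 0.0
    return sum(window_bpp[s].get((i, j), 0.0) for s in starts) / len(starts)


# Invented per-window probabilities for a sequence of length 10 and W = 6.
toy_window_bpp = {
    1: {},
    2: {(4, 7): 0.2},   # pair sits at the right border of this window
    3: {(4, 7): 0.8},
    4: {(4, 7): 0.5},   # pair sits at the left border of this window
    5: {},
}

if __name__ == "__main__":
    print(average_pair_probability(4, 7, 10, 6, toy_window_bpp, b=0))
    print(average_pair_probability(4, 7, 10, 6, toy_window_bpp, b=1))
```

In the toy data the base pair (4, 7) has a span of 4, so with W = 6 and b = 1 the condition W − L ≥ 2b holds for L = 4 and exactly one window remains after the border positions are ignored.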


Structure accuracy

We required a measure to compare probabilities, as calculated by RNAfold (global) and Rfold/Raccess (local), to average probabilities, as calculated by RNAplfold and LocalFold (also local, but windowed). This comparison is non-trivial and, to the best of our knowledge, has not been previously addressed in the literature. In addition, these methods generate probabilities for individual base pairs, whereas we required a measure for a complete structure, i.e. a cis-regulatory element. Previous approaches for comparing predictions were based on individual base pairs and not on entire structures [Kiryu et al., 2008]. In the investigation of cis-regulatory elements, however, we required a measurement for the stability of a local structured element within a greater context. More precisely, we needed to determine the accuracy of the prediction of the entire element based on individual base-pair scores. In the literature, there was no consistent measure for this purpose; however, structure accuracy measures have been applied to global structures [Do et al., 2006; Carvalho and Lawrence, 2008; Lu et al., 2009]. We generalised the measure of structure accuracy to local structure prediction.

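Let R be an RNA sequence, and $S_l$ be a local structured element in R. The accuracy A is the expected overlap of a local reference structure $S_l$ with the predicted global structures S:

$$
\begin{aligned}
A(S_l \mid R) &= \sum_{S \in Q_R} |S_l \cap S| \cdot \Pr[S \mid R] \\
&= \sum_{S \in Q_R} \sum_{(i,j) \in S_l} \mathbf{1}\{(i,j) \in S\} \, \Pr[S \mid R] \\
&= \sum_{(i,j) \in S_l} \sum_{S \in Q_R} \mathbf{1}\{(i,j) \in S\} \, \Pr[S \mid R] = \sum_{(i,j) \in S_l} p(i,j).
\end{aligned}
\tag{2.6}
$$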

QR is the ensemble of all possible structures of sequence R, p(i, j) is the probability for the base pair (i, j) and 1{(i, j) ∈ S} is an indicator function that is 1 if (i, j) ∈ S and 0 otherwise.

In the context of windowed folding we define W(Sl) to be the set of windows that contain the complete structure Sl, similar to the previous definition in the case of a base pair. Then we define the average accuracy as:

$$A_{\mathrm{avg}}(S_l) = \frac{1}{|W(S_l)|} \sum_{w \in W(S_l)} A(S_l \mid w) = \frac{1}{|W(S_l)|} \sum_{w \in W(S_l)} \sum_{(i,j) \in S_l} p_w(i,j).$$

21 2. Predicting the local structure of mRNAs

If we had the same windows for each base pair in $S_l$, i.e. for all $(i,j) \in S_l$, $W(i,j) = W(S_l)$, then analogously to Equation 2.6, we could continue with

$$A_{\mathrm{avg}}(S_l) = \sum_{(i,j) \in S_l} \frac{1}{|W(i,j)|} \sum_{w \in W(i,j)} p_w(i,j) = \sum_{(i,j) \in S_l} p_{\mathrm{avg}}(i,j). \tag{2.7}$$

Having the same set of windows for each base pair, however, could only be enforced if the location of the element was known in advance. Since this is not the case when searching for local structures, we used Equation 2.7 as an approximation of the average accuracy of the local structure $S_l$. For the comparison of accuracies for structure elements of different sizes, we normalised them by the number of base pairs within the respective local structure $S_l$:

$$\text{bp-accuracy}(S_l) = \frac{A_{\mathrm{avg}}(S_l)}{|S_l|}, \tag{2.8}$$

and analogously we substituted $A_{\mathrm{avg}}(S_l)$ with $A(S_l)$ for the non-averaged base-pair probabilities. Intuitively, the bp-accuracy is the mean base-pair probability (or average probability) of all base pairs within the reference structure (i.e. cis-regulatory element); it measures the thermodynamic stability of the structure within its global context.

The bp-accuracy, however, does not consider false positive base-pair predictions. No gold standard for negative base pairing exists and it was unclear when a base pair that is not part of the local structure should be regarded as negative, or incorrect. For example, one could consider all possible conflicting base pairs, i.e. all base pairs involving one and only one base from a correct base pair, to be incorrect (in a secondary structure, a base can only be paired to one other). This is problematic for three reasons: (i) there are about 2L more incorrect than correct base pairs; (ii) a different number of negative base pairs would occur for different L values, hence it is difficult to compare global and local folding methods; and (iii) it is unknown to which extent the mRNA folds into different conformations, or refolds. Alternative structures do exist in vivo, e.g. in riboswitches [Breaker, 2008]; some conflicting base pairs could be true variants. Kiryu and colleagues proposed a way to calculate specificity by considering all base pairs predicted in random sequences to be incorrect [Kiryu et al., 2008]. Randomly designed RNA sequences, however, could also form stable structures [Rivas and Eddy, 2000].
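As a small illustration of Equations 2.6 to 2.8, the bp-accuracy reduces to the mean probability of the reference base pairs. The sketch below assumes the probabilities are available as a dictionary keyed by (i, j) tuples; this data layout and the function name are assumptions made for the example, and the numbers are made up.

```python
def bp_accuracy(pair_probs, reference_pairs):
    """Mean probability of the reference base pairs (Equation 2.8).

    pair_probs maps (i, j) tuples to p(i, j) or p_avg(i, j); pairs that
    are missing from the mapping are treated as probability 0.
    """
    if not reference_pairs:
        raise ValueError("reference structure contains no base pairs")
    total = sum(pair_probs.get(pair, 0.0) for pair in reference_pairs)
    return total / len(reference_pairs)


# Toy example with made-up probabilities (not data from this study).
probs = {(1, 10): 0.9, (2, 9): 0.7, (3, 8): 0.2}
hairpin = [(1, 10), (2, 9), (4, 7)]
print(round(bp_accuracy(probs, hairpin), 3))  # prints 0.533
```

The pair (3, 8) is predicted but not part of the reference and therefore does not enter the score, which mirrors the statement that the bp-accuracy ignores false positive predictions.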

2.2.2 Exploratory evaluation of local folding parameters

For the following analyses we selected RNAfold to represent global folding methods. The semi-local folding approach, in which only the maximum base-pair span is restricted to L, is represented by Rfold [Kiryu et al., 2008] for base-pair probabilities and Raccess [Kiryu et al., 2011] for accessibilities. For local window-based folding we used the frequently cited RNAplfold [Bernhart et al., 2006; Bernhart et al., 2011]. In addition to the maximum base-pair span L, this method introduces the window size parameter W.

CisReg: structured cis-regulatory elements

The ability to detect and accurately predict known cis-regulatory elements is an important benchmark of new mRNA structure discovery methods. These known elements are characterised in several databases, of which the largest is the RNA families database (Rfam) [Griffiths-Jones et al., 2005; Gardner et al., 2011]. The major release 10.0 contains 1446 covariance models, mostly for non-coding RNA genes, but also for structured mRNA elements [Gardner et al., 2011]. Each model consists of a set of published “Seed” and computationally extended “Full” alignments. Sequences within the structural alignments consist of only the structured element, and usually lack the flanking sequence from the mRNA needed to assess structure prediction. For this study, we developed a new benchmark for mRNA cis-regulatory elements. We extracted and individually re-examined a set of 95 families of cis-regulatory elements from Rfam that were correctly classified and adopted secondary structures without pseudoknots. Of these, 24 were from eukaryotic mRNAs and 71 from prokaryotic or viral genomes. The eukaryotic mRNA elements have diverse functions (e.g. mRNA localisation, translation efficiency or mRNA stability) and most were located within 3’UTRs. A large number of the genomic elements were from RNA viral genomes or from bacterial mRNAs. To investigate the effects of the context these elements were embedded in, we extracted for each element three different lengths of flanking regions from the mRNAs (including coding regions and 5’UTRs), or from the genomes when these were not available: 100, 200, and 500 nucleotides, or otherwise to the sequence ends. Subsequently, we filtered and processed the elements to maximise structural integrity, and a small proportion of sequences were excluded as they did not match sequences in the EMBL Nucleotide Sequence Database.

The CisReg dataset used in this study consists of 2500 individual elements (95 families) with over 85,000 base pairs and we propose it as a reference set to test future prediction algorithms. We provide a website for the data including additional information and statistics: http://lancelot.otago.ac.nz/CisRegRNA/.

YeastUnpaired: single-strandedness

For the evaluation of the accessibility predictions we used the set of in-vitro secondary structure profiles from [Kertesz et al., 2010]. This set, referred to as YeastUnpaired in the following, consists of nucleotide-wise measurements for 3,196 mRNAs from Saccharomyces cerevisiae. These profiles were derived by parallel analysis of RNA structure (PARS). With PARS the single-strandedness (as well as double-strandedness) of a set of sequences is inferred using a combination of RNase digestion and deep sequencing. Kertesz and colleagues report that they covered approximately 100-fold more transcribed bases than all previously published footprints combined, making this dataset uniquely suited for a comprehensive analysis of prediction performance. In the case of accessibility predictions, we compared the methods according to their ability to correctly classify paired and unpaired bases. Classification performance was measured using the Receiver Operating Characteristic (ROC), summarised to the Area Under the ROC Curve (AUC). As discussed in Section 1.3, this measure is independent of the types of outputs of the different algorithms. The accessibility of a base is the complement of the sum of all base-pairing probabilities that involve that base (see Equation 2.1); thus, the base-pairing distribution is implicitly taken into account. Therefore, the performance comparisons of accessibility should indicate which method produces the more accurate base-pair distributions.
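The relation between base-pair probabilities, accessibility and the AUC can be sketched in a few lines of Python. The sketch assumes base-pair probabilities stored in a dictionary keyed by 1-based (i, j) tuples and computes the AUC by direct pairwise comparison (equivalent to the rank-sum formulation); both functions are illustrative helpers and not the evaluation code used in this study.

```python
def accessibility(pair_probs, n):
    """Probability of each of the n bases being unpaired: one minus the
    sum of the probabilities of all base pairs involving that base."""
    acc = [1.0] * n
    for (i, j), p in pair_probs.items():   # 1-based positions, i < j
        acc[i - 1] -= p
        acc[j - 1] -= p
    return acc


def auc(scores, is_unpaired):
    """Area under the ROC curve via pairwise comparison: the fraction of
    (unpaired, paired) pairs in which the unpaired base scores higher,
    counting ties as one half. Quadratic time, for illustration only."""
    pos = [s for s, y in zip(scores, is_unpaired) if y]
    neg = [s for s, y in zip(scores, is_unpaired) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Because the AUC depends only on the ranking of the scores, it can be applied unchanged to accessibilities from any of the compared methods, regardless of how their outputs are scaled.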

Algorithms performed best for spans between 100 and 150 nucleotides

For local folding approaches, the main question was which degree of locality to use. Current methods introduced locality by restricting the maximum base-pair span (bp-span, Equation 2.2) to L. We compared Rfold predictions with L between 40 and 400 nucleotides to (the global) RNAfold results using the CisReg data. Local folding was represented by Rfold, because the introduction of the base-pair restriction is the only conceptual difference to global folding, whereas the window-based approaches introduced the window size (W) as an additional parameter. The lowest median bp-accuracy of 0.46 was achieved using Rfold with L = 40 (Figure 2.1a). The accuracy increased with greater L values until a maximum of 0.59 was achieved at L = 150, after which accuracies decreased slightly to approximately 0.57. Rfold outperformed RNAfold at L ≥ 60. The difference between the bp-accuracy distributions of Rfold (L = 150) and RNAfold was significant ($p = 1.2 \times 10^{-7}$, two-sample Wilcoxon Rank Sum Test). The cis-regulatory structures in Figure 2.1a were situated within a context of up to 500 nucleotides to either side; the folded RNA sequence was thus only approximately 1,000 nucleotides long and often not the full-length mRNA. Therefore, we compared Rfold (L = 150) to RNAfold on the 179 available full-length mRNA sequences (Figure 2.1b). Restriction to the set of full-length mRNA sequences resulted in lower base-pair accuracies for both Rfold and RNAfold compared to the full set of sequences. The difference of base-pair accuracies between the two methods increased from 0.07 to 0.13 when using the full-length sequences, indicating that local folding performance is less affected by long sequences than global folding performance.

When investigating the degree of locality L suitable for the YeastUnpaired data, we observed similar results to the CisReg data, see Figure 2.8 (the main discussion of this figure follows in Section 2.3). For accessibility, Rfold outperformed RNAfold at L ≥ 50 and the performance increased up to the optimum at L = 100. L > 100 exhibited only a minor decrease in AUC; thus, the performance was robust to larger L values. Nevertheless, the quality of prediction eventually decreases to the level of RNAfold for both datasets; the greater the span L, the more global the prediction becomes until it is global when L equals the sequence length.

Figure 2.1: Comparison of global versus local folding using the methods RNAfold and Rfold. The median base-pair accuracy (y-axis) is given for the CisReg dataset. (a) Comparison of RNAfold and Rfold using different L values. (b) A subset of the CisReg dataset comprising 179 full-length mRNAs.

Most base pairs have short spans

Our results on the best value for L reflected the distribution of base-pair spans within known structures: we observed that 83% of all base pairs had a bp-span less than 100 nucleotides (85% less than 150 nucleotides) for all the cis-regulatory elements in the CisReg dataset (Figure 2.2). Thereafter, the increase in the number of base pairs with a larger span is very slow. Although we specifically chose local regulatory structures located on the mRNA, the distribution was similar to previously published data. Doshi and colleagues showed an exponential distribution for base-pair spans of 496 16S rRNAs, with 75% of all base pairs with bp-span ≤ 100 nucleotides [Doshi et al., 2004]. In 151 ncRNA structures from 151 seed alignments in Rfam, 85% of the base pairs had a bp-span ≤ 100 nucleotides [Kiryu et al., 2011]. The latter two analyses looked at global structures that form long-range base pairs. Due to the exponential distribution of base-pair spans in native RNA structures, the majority of base pairs have short spans, i.e. are local, and thus smaller L values (L ≤ 100) still performed comparably well. Because of the good correlation of our results to the distribution of base-pair spans, we suggest that local folding with restricted base-pair spans could perform better for other classes of long RNA sequences, such as ribosomal RNA and long non-coding RNA. Note that although long non-coding RNA may be largely unstructured, local structured domains or regulatory target sites could be located on these molecules, making a structure prediction interesting, for example for determining the accessibility of miRNA target sites [Cesana et al., 2011].

Base-pair prediction accuracy decreased with span length

The choice of the locality parameter also depends on the prediction accuracy of base pairs with respect to their span lengths. For this evaluation, we used RNAfold as it allows all base-pair spans. The influence of the base-pair span length on the sensitivity of the predictions is illustrated in Figure 2.2b. We defined sensitivity as the fraction of all true base pairs within each bp-span interval that were predicted with probability p(i, j) > 0.5. Base pairs with a probability greater than 0.5 are called high-frequency base pairs and are contained in the centroid structure [Ding et al., 2006; Carvalho and Lawrence, 2008; Jenkins et al., 2010]. Base-pair prediction accuracy decreased with respect to span length; this was also reported in [Doshi et al., 2004; Konings and Gutell, 1995; Fields and Gutell, 1996]. The highest sensitivity of approximately 0.6 was achieved for bp-span < 30 nucleotides, after which it dropped to 0.45, and at bp-span ≥ 100 nucleotides the sensitivity decreased further to around 0.35 (except an outlier of 0.5). This decrease suggests that either (i) the current nearest neighbour energy model [Mathews et al., 1999; Turner and Mathews, 2010] is unsuited to the prediction of long-range base pairs or (ii) the multi-loop energies are incorrect [Mathews et al., 1999; Diamond et al., 2001; Mathews and Turner, 2002]. Our results indicated that L = 150 represents a good balance between maximising the number of base pairs included in the predictions and minimising the effect of reduced accuracy for longer base-pair spans. A larger span L did not increase the performance, probably because very few extra base pairs could be predicted and the quality of these predictions became increasingly poor.
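The sensitivity per bp-span interval described above can be sketched as follows. The example assumes that the span of a base pair (i, j) is counted as j − i + 1 (Equation 2.2 is not reproduced in this section, so this convention is an assumption), that intervals are half-open on the left as in the axis labels of Figure 2.2b, and that probabilities are stored in a dictionary keyed by (i, j); the function name is hypothetical.

```python
def sensitivity_by_span(reference_pairs, pair_probs, edges, threshold=0.5):
    """Fraction of reference base pairs predicted with probability above
    the threshold, reported per span interval (edges[k], edges[k + 1]].
    Intervals without any reference pair yield None."""
    n_bins = len(edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for (i, j) in reference_pairs:
        span = j - i + 1                   # assumed span convention
        for k in range(n_bins):
            if edges[k] < span <= edges[k + 1]:
                totals[k] += 1
                if pair_probs.get((i, j), 0.0) > threshold:
                    hits[k] += 1
                break
    return [h / t if t else None for h, t in zip(hits, totals)]
```

Choosing the interval edges as quantiles of the observed spans gives bins with roughly equal numbers of base pairs, as was done for Figure 2.2b.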

[Figure 2.2, plots: (a) cumulative distribution (y-axis, 0.0 to 1.0) against BP-span (x-axis, 0 to 500); (b) sensitivity (y-axis, 0.0 to 0.6) per interval of BP-span (x-axis), with intervals ranging from [5,7] to (319,551].]

Figure 2.2: The distribution of base-pair spans and the quality of prediction with respect to span length. (a) The bp-span (x-axis) distribution for the CisReg dataset with the cumulative distribution given on the y-axis. (b) The sensitivity of base pairs (y-axis) for each base-pair span interval (x-axis). The intervals were distributed such that they contain roughly an equal number of base pairs.

Structures are locally stable

The success of local folding approaches is based on the assumption that, in most cases, structures with short base-pair spans are locally stable and do not need the global influence of long-ranging base pairs to stabilise their formation. This assumption is supported by the fact that small values for L performed only slightly worse than their more global counterparts (see Figures 2.1 and 2.8). In the search for cis-regulatory elements, maximum base-pair spans much smaller than the real spans still predicted the local parts of the structure. The structural stability of local substructures was also stated in [Nussinov and Tinoco, 1981; Doshi et al., 2004]. These authors illustrated that in predicted sub-optimal structures, most of the rearrangement occurs in the form of long-range connections, whereas the local substructures remain the same. Moreover, Morgan and Higgs have shown that, due to kinetics, short-range base pairs form more quickly [Morgan and Higgs, 1996]. Finally, the hierarchical evolution hypothesis, introduced in [Bokov and Steinberg, 2009], could further support the initial formation of locally stable structures with short base-pair spans and the subsequent addition of longer-range connections.

2.2.3 A bias in windowed local folding

RNAplfold computes base-pairing probabilities by averaging over subsequences (windows) of length W. On the one hand, the use of independent windows removes dependencies between two local structures with a distance greater than W; on the other hand, each window introduces two artificial RNA ends at the window borders. As the ends do not correspond to any real features of the RNA, this can lead to undesirable effects.

Window borders were biased towards higher accessibilities

To investigate a possible bias introduced by folding independent (short) subsequences, we computed the average accessibility per position of the respective windows using RNAplfold. Mean accessibilities for over 500,000 sequence windows from 400 mRNAs, selected randomly from four species, are depicted in Figure 2.3. Nucleotides at the window borders showed considerably higher accessibilities than nucleotides near the window centres. This effect is preserved for the full range of observed GC-contents (Supplementary Figure B.2) and is not particular to mRNAs (Supplementary Figure B.3).

Windows affected base-pairing predictions

The accessibility bias towards window borders affected the probabilities of base pairs with at least one end in this region. Consequently, long-range base pairs with both ends within the outer regions were affected most (Figure 2.4a). Two issues arise from window-based folding: (i) The number of windows in the calculation of a base-pair probability is dependent on its span, i.e. probabilities of a base pair with bp-span = l occur in W − l + 1 windows. Hence, the number of windows being averaged decreases linearly with increasing bp-span. (ii) Strong secondary structures tend to form in the central part of a window, leaving the remaining unpaired bases at the window borders available to pair with each other; crossing base pairs with internal unpaired bases are not allowed in secondary structure prediction, so the ends pair up (if possible), because each additional base pair minimises the overall free energy. In combination, when L is close to W , long-range base pairs within the borders resulted in skewed pairing probabilities, as they were not compensated by averaging over many windows.
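The window count in point (i) can be checked with a short Python sketch. It assumes 1-based positions, one window of length W starting at every position of a sequence of length n ≥ W, and a span counted as l = j − i + 1, which is the convention that reproduces the W − l + 1 count; the function name is illustrative.

```python
def windows_containing(i, j, n, W):
    """Number of length-W windows in a sequence of length n that contain
    both ends of base pair (i, j), with i < j (1-based positions)."""
    first_start = max(1, j - W + 1)        # window must reach position j
    last_start = min(i, n - W + 1)         # window must begin at or before i
    return max(0, last_start - first_start + 1)


# A pair with span l = 100, far from the sequence ends:
print(windows_containing(400, 499, 1000, 150))  # W - l + 1 = 51
print(windows_containing(400, 499, 1000, 100))  # W = l: exactly one window
```

Near the sequence ends fewer window positions exist, so the count can fall below W − l + 1 there.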


Figure 2.3: High accessibilities at window borders. Average accessibilities were computed per window position for 400 randomly chosen mRNAs from four species. Computations were done with RNAplfold, L = 100 and (a) W = 100 and (b) W = 150. Positions beyond approximately 10 nucleotides at the window borders have equivalent average accessibilities.


Figure 2.4: Illustration of folding-windows. Regions affected by the border effect are shaded. (a) Same Window size and maximum span. Long-range base pairs can be affected by both window borders. The base pair of maximal span is part of exactly one window. (b) Window larger than maximum span. Base pairs can only be influenced by one window end. Base pairs of maximal span can be part of multiple windows.


Appropriate parameter choice to reduce border effects

The negative effect of having only few windows representing long-range base pairs was mitigated by setting a suitable window size W with respect to the maximum base-pair span L. When W ≥ L, base-pair probabilities are averaged for at least W − L + 1 windows (Figure 2.4b). In Figure 2.5, the dot plots from RNAplfold of a cis-regulatory element exemplify the border effect on long-range base pairs. For visualisation purposes, the sequences were folded with L = 70. For W = L, many base pairs with spans near L were assigned high probabilities while located in very short stems (Figure 2.5a). For W = L + 50, most of the long-range base pairs either disappeared or were assigned much smaller probabilities (Figure 2.5b). The base-pair probabilities for the target structure were not influenced by the parameter settings, due to their shorter base-pair spans. In our evaluations of different window sizes on both the CisReg and the YeastUnpaired datasets, W had little effect on the prediction performance as long as it was sufficiently larger than L. The current default parameter setting of RNAplfold is W = L = 70. In general, the default settings of computational tools are frequently used and in the case of RNAplfold the default, W = L, was applied in e.g. [Kiryu et al., 2008]. Note that on the other extreme, window sizes much larger than L diminish the positive effects of the window-based approach, namely to avoid dependencies between distant local structures. When W is equal to the sequence length, the window-based approach is the same as the approach for Rfold and Raccess. Varying the window sizes from L + 50 to 3L did not influence the results significantly, however, the best results for RNAplfold were achieved using W = L + 50 (Supplementary Figure B.4). For all further evaluations we set the window size to W = L + 50, which allowed each base pair to be present in at least 51 windows.

LocalFold diminished border effects

While an appropriate choice of the window size mitigated some of the adverse effects of windowed approaches, the borders still affected the accessibilities up to the ten outer nucleotides of each folding window (Figure 2.3b). Therefore, we developed LocalFold (see Section 2.2.3) that reduced these border effects and we quantified the improvement of predictions performed on our datasets. In short, the biased regions at the window borders were not considered for the computation of accessibilities or base-pair probabilities. As the border effect was mostly independent of window size and maximum base-pair span, in LocalFold the first and last ten nucleotides in each artificial window (excluding real ends of the input sequence) were removed from the calculations. Note that LocalFold only removes the bias outliers from the window average calculations and still produces average probabilities for all positions of the nucleotide sequence (any length).

30 2.3. Performance evaluation of local folding algorithms

[Figure 2.5, dot plots: (a) W = L, border effect; (b) W = L + 50, no border effect. Dot-size scale for p(i,j): 0.2, 0.4, 0.6, 0.8. Sequence shown along the diagonal of both panels: AUUUUUAGCGUGCCGCGACAAGCGGUCCGGGCGCCCUUCGGGGGCCCGGCGGAGACGGGCGCCGGAGGUGUCCGACGCCUGCUCGUACCCAUCUUGCUCAGUGGAGGAUUUGGCUAUGAGGACCACCUAC]

Figure 2.5: Probability bias for long-ranged base pairs close to the window size and their reduced effect. We see cropped dot plots of the base-pairing matrices for positions 5,180-5,291 of RF00435-U55047-1 in the CisReg dataset, which is a heat shock gene expression (ROSE) element. Base pairs of the target structure are marked in red. The size of each dot is relative to the probability of the base pair it represents and the nucleotides can be read by following the diagonal lines to the left and right. The incorrect long-range base pairs are much more likely when (a) W = L instead of (b) W = L + 50.

2.3 Performance evaluation of local folding algorithms

Having developed suitable comparison measures and tests designed to identify and elucidate the optimal degree of locality and having investigated the effects of artificial window borders and sizes, we now evaluate the performance of LocalFold and current methods available for folding mRNA. We compared the performance of the following secondary structure prediction methods applied to mRNA sequences: RNAfold (global), Rfold (restricted bp-span, base-pair probabilities), Raccess (restricted bp-span, accessibilities), RNAplfold (window-based), and our method LocalFold (reduced border effects). We investigated their performance using a large curated set of 2500 cis-regulatory elements (CisReg) and a position-wise structural probing dataset with the single-strandedness of over 3000 yeast mRNAs (YeastUnpaired); hence, we quantified their predictions of both paired and unpaired bases, respectively. For the local folding methods, we applied the best parameter combinations (for each dataset) according to the previous analyses. Prediction methods and datasets are summarised in Table 2.1.

2.3.1 Prediction of cis-regulatory structures

We compared the accuracies each method achieved for the base pairs of the CisReg dataset. For folding, we used sequences of up to 500 nucleotides context to either side of the elements. Although many mRNA sequences are longer than 1000 nt, we chose this length because resource demands of RNAfold were too high for longer sequences. For the local folding methods we applied the optimal values determined previously: maximum base-pair span L = 150 and window size W = 200. To fairly compare RNAfold to the local folding methods, we used a subset of the CisReg dataset in which the elements had a maximum bp-span of 150 nucleotides. This subset included most elements (2158 out of 2500) across 90 different Rfam families. This meant L did not exclude base pairs in the dataset from being predicted. In Figure 2.6 we summarised the bp-accuracies (Equation 2.8) resulting from each method. The median bp-accuracy in part (a) increased from 0.55 (RNAfold), through 0.6 (RNAplfold) and 0.62 (LocalFold), to a maximum of 0.65 (Rfold). These accuracies indicate that the target structures were clearly predicted, as illustrated in Figure 2.5, in which the cis-regulatory element achieved a bp-accuracy of 0.65. Although Rfold achieved the highest median bp-accuracy, the method, together with RNAfold, exhibited a much greater variation in results than the window-based approaches RNAplfold and LocalFold. While the boxplot indicated similar distributions for the latter two approaches, the accuracies for LocalFold were significantly higher than for RNAplfold (p = 0.017, two-sided, two-sample Wilcoxon Rank Sum Test). Both window-based approaches produced the most robust predictions; LocalFold and RNAplfold made fewer predictions in the lower bp-accuracy range, i.e. they were more sensitive (Figure 2.6b).

We considered a bp-accuracy ≤ 0.2 to mean the structure was not predicted: Rfold and RNAfold failed to predict 15% and 22%, respectively, whereas both RNAplfold and LocalFold failed in only 11% of all instances. To show that these results were not biased by redundancies in the dataset, we evaluated the median accuracy per Rfam family (Supplementary Figure B.1). With some exceptions, the above trends remain the same for the individual families. Only for two families with large base-pair spans of 338 and 551 nucleotides did global folding show a substantial improvement over the local folding methods.

Figure 2.6: Comparison of structure prediction methods for the identification of cis-regulatory elements. Computations were performed with L = 150 and W = 200 (when applicable) on the subset of the CisReg data that have a maximum base-pair span of 150 nucleotides, including 2158 elements assigned to 90 Rfam families. (a) Comparison of the achieved accuracies as boxplots. (b) Cumulative distributions of the bp-accuracy up to 0.5 (y-axis) to highlight the prediction sensitivity. Base pairs with probabilities above 0.5 are contained in the centroid structure [Ding et al., 2006; Carvalho and Lawrence, 2008; Jenkins et al., 2010] and thus a bp-accuracy above this threshold implies a well defined target structure. The p-value was calculated with a two-sample Wilcoxon Rank Sum test.

Rfold performance decreases at sequence ends

In the investigation of different context lengths for the local folding methods, Rfold exhibited a decreased performance for smaller contexts (Figure 2.7); the context length was defined by the number of nucleotides to either side of the regulatory element, see part (b). Although the median bp-accuracy for Rfold was higher for the contexts of 200 and 500 nt, it performed worst for 100 nucleotides. This, in combination with the greater variance for all Rfold predictions, indicated that the prediction of correct structures at sequence ends is poor. A similar trend was observed in [Kiryu et al., 2011], where the authors reported decreased prediction for the ends of sequences up to four times the maximum base-pair span, i.e. a context of 600 nucleotides for L = 150. Most cis-regulatory elements are situated within the untranslated regions (UTRs) of mRNAs and thus are frequently located at the sequence ends. Hence, poor prediction performance at sequence ends is detrimental for the prediction of cis-regulatory elements.

Figure 2.7: Rfold is more sensitive to the context length and thus has increased problems predicting correct structures at sequence ends, also reported in [Kiryu et al., 2011]. (a) A comparison of the median bp-accuracy (y-axis) achieved by the local folding methods (Rfold, RNAplfold, LocalFold) on sequences where the regulatory element is situated within contexts of 100, 200 and 500 nucleotides (CisReg dataset). (b) When the regulatory element is located at the sequence ends, a context larger than 100 nucleotides is often unavailable. Thus, methods performing poorly for shorter contexts are not appropriate to identify those elements.

2.3.2 Prediction of accessible regions

In the previous analysis, we inspected the accuracy at which each method predicted a given secondary structure. The extent of wrongly predicted base pairs was not explored. Here, we compared the performance of all methods on their ability to predict the accessibility of individual bases. As the accessibility of a base is defined as its probability of being unpaired, the probabilities of all possible base pairs involving this nucleotide are taken into account. Thus, wrongly predicted base pairs can have a detrimental effect on this measure. We first computed accessibilities for each folding method. For the local folding methods we used maximum base-pair spans (L) between 25 and 200 nucleotides. The window size W = L + 50 was used for the two window-based approaches. The quality of predictions for the YeastUnpaired dataset was evaluated by computing AUC values for discriminating high- and low-rated nucleotides according to the PARS score; these nucleotides achieved the clearest evidence for being paired or unpaired, respectively. Figure 2.8a shows the results for 1% of the highest- and 1% of the lowest-ranking nucleotides, comprising a set of approximately 80,000 measurements. In most cases, an AUC greater than 0.8 was achieved. Folding globally with RNAfold resulted in the third lowest performance; only the predictions of Raccess and RNAplfold using span L = 25 performed worse. LocalFold outperformed the other methods for all values of L. Even the worst result for LocalFold at L = 25 was significantly higher than for RNAfold ($p = 8.055 \times 10^{-8}$, Wilcoxon Signed Rank test using AUCs derived from 1,000 bootstrap samples). The best prediction result was attained by LocalFold using L = 100 with an AUC of 0.85. Larger L values resulted in comparable AUCs; hence, the prediction of accessibility was stable for different parameter settings. The fact that Raccess was clearly outperformed by the window-based approaches on the YeastUnpaired data provides further evidence that the greater variance in its base-pair prediction performance (Figure 2.6) is detrimental.
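The selection of the highest- and lowest-ranking nucleotides can be sketched as below. The sketch assumes that the PARS scores of all nucleotides are pooled into one flat list before ranking; whether the ranking was pooled or done per transcript is not stated here, so this is an assumption, as is the function name.

```python
def select_extremes(pars_scores, fraction=0.01):
    """Return (lowest, highest): index lists of the given fraction of
    nucleotides with the lowest and highest scores. Following the text,
    low scores are taken as evidence for unpaired bases, high as paired."""
    k = max(1, int(len(pars_scores) * fraction))
    order = sorted(range(len(pars_scores)), key=lambda idx: pars_scores[idx])
    return order[:k], order[-k:]


# Toy scores (made up): with fraction 0.25, two nucleotides per class.
scores = [3.0, -2.0, 0.5, -4.0, 1.0, 2.0, -1.0, 0.0]
unpaired_idx, paired_idx = select_extremes(scores, fraction=0.25)
print(unpaired_idx, paired_idx)  # [3, 1] [5, 0]
```

The two index lists then serve as the unpaired and paired classes when an AUC is computed from the predicted accessibilities at those positions.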

Relative performance was independent of transcript length

Finally, we investigated the influence of transcript lengths on the performance of the algorithms. For the analysis shown in Figure 2.8b, we split the data into sequence length intervals and the AUC for L = 100 was computed for each interval separately. The intervals were chosen to include roughly an equal number of sequences. We used 10% of the highest- and 10% of the lowest-ranking nucleotides so that each interval contained a sufficient number of sequences. While predictive performance fluctuated slightly for the intervals, we observed the same ranking of methods as seen in the previous analysis: global folding scored worst, the window-based approaches best. LocalFold scored marginally better than RNAplfold for most intervals and both consistently outperformed Raccess. Overall, performance dropped slightly for sequences longer than 2,000 nucleotides. The fluctuations in performance were mirrored by all methods, probably due to the quality or properties of the underlying data.
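One simple way to obtain length intervals with roughly equal numbers of sequences is to cut the sorted lengths at evenly spaced ranks. The sketch below is an illustration under that assumption; the thesis does not state how its interval boundaries were derived.

```python
def equal_count_intervals(lengths, n_bins):
    """Return (min_length, max_length) per bin, cutting sorted lengths at even ranks.

    Transcripts of identical length may end up on both sides of a boundary;
    a real analysis would need a rule for such ties.
    """
    ordered = sorted(lengths)
    n = len(ordered)
    intervals = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        if lo < hi:  # skip empty bins when n_bins exceeds the number of transcripts
            intervals.append((ordered[lo], ordered[hi - 1]))
    return intervals
```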


[Figure 2.8, plots not reproduced. Panel (a): AUC (y-axis) against the maximum base-pair span L (25, 50, 100, 150, 200, NA) for Raccess, RNAplfold, LocalFold and RNAfold. Panel (b): AUC (y-axis) against transcript length (x-axis) in 20 intervals from [71,151] to [3186,8145].]

Figure 2.8: Comparison of AUC values for separating high- and low-scoring nucleotides of the YeastUnpaired dataset. (a) Effect of the parameter L, evaluated for W = L + 50 and including only the 1% highest- and 1% lowest-scoring nucleotides. (b) Using the best parameter combination (L = 100, W = 150), we show the dependency of the prediction quality on the transcript length. Here the 10% highest- and 10% lowest-scoring nucleotides were included. Each interval contains roughly the same number of sequences.


CisReg and YeastUnpaired data showed similar results

We observed similar results for both of the analysed datasets. The YeastUnpaired dataset was generated under in-vitro conditions, whereas the CisReg dataset consists of experimentally verified structured cis-regulatory elements with post-transcriptional functions in-vivo. The fact that the results are comparable between two independent datasets supports their overall quality and highlights their validity and generality.

2.4 Conclusion

To benchmark the performance of mRNA secondary structure prediction, we generated a large curated set of cis-regulatory elements and introduced bp-accuracy to measure how accurately a local structure was predicted. Furthermore, we evaluated accessibility predictions using transcript-wide structural probing data. Prediction accuracy was affected by the following algorithmic assumptions and parameter combinations:

(i) The optimal base-pair span parameters were dataset dependent, but similar, at L = 150 for the CisReg dataset and L = 100 for the YeastUnpaired dataset. Within a range of 100-150, differences in performance were minimal. This range reflects the distribution of base-pair spans for known structures.

(ii) The use of sliding windows allows for more locality than the mere restriction of base-pair spans. Windows, however, introduced a prediction bias at each artificial border. Windows with W = L caused unusually high base-pairing probabilities of long-range base pairs. This was resolved by setting W = L + 50.

(iii) Setting the larger window size (W = L + 50) did not completely remove the bias of high accessibilities (single-strandedness) at the window borders. Therefore, LocalFold was developed to diminish this bias, which resulted in a consistent improvement compared to the other methods. The greater improvement in results was observed for the CisReg data (base pairs) in comparison to the YeastUnpaired data (single-strandedness).

In addition to having much faster run times, local folding methods outperformed the global approach, as supported by clear quantitative and qualitative evidence. The advantage of local folding is that the majority of base pairs have short base-pair spans and that local structure can be predicted without the stabilising effects of long-range connections. Moreover, the reduced accuracy in the prediction of these long-range base pairs meant that local folding was better than global folding at determining secondary structure in long RNAs.


Chapter 3

Detecting binding sites of RNA-binding proteins with iCLIP

3.1 Introduction

In this chapter, we present a computational pipeline for the analysis of iCLIP experiments for inferring genome-wide in-vivo binding sites of RNA-binding proteins. This pipeline includes novel processing steps that improve the sensitivity of iCLIP, namely the extension to paired-end sequencing and improved handling of random sequence tags for inferring individual crosslinking events.

We used the pipeline to analyse iCLIP experiments for two RBPs involved in the approximately twofold upregulation of the male Drosophila X-chromosomal gene output caused by the Male-Specific Lethal (MSL) complex. This complex, composed of the five proteins MSL1, MSL2, MSL3, MOF, and MLE, incorporates one of the two long non-coding RNAs (lncRNAs) roX1 and roX2. To elucidate complex formation and roX RNA inclusion, iCLIP was employed to determine genome-wide binding sites of two members of this complex, the RNA helicase MLE and the ubiquitin ligase MSL2.

Both iCLIP experiments exhibited two exceptional properties: (1) the majority of reads mapped to only two RNAs; and (2) the sequenced library contained a large number of PCR duplicates. In consequence, the dynamic range was not sufficient to fully capture the crosslinking events on the main targets. At the same time, the large number of PCR duplicates resulted in a large number of spurious crosslinking events. Employing the additional processing steps of our pipeline, we were able to increase the dynamic range to fully capture all crosslinking events and to remove a large number of spurious crosslinking events.

In the next section, we focus on the preliminary processing steps for iCLIP reads resulting in the set of unique alignments required for the identification of crosslinking events, extending the standard iCLIP analysis workflow to the use of paired-end reads. We continue with a description of the processing required to derive individual crosslinking events from a given set of unique genomic alignments. The main focus of this section is the improved handling of random tags and its evaluation via the MLE and MSL2 iCLIP experiments. We conclude this chapter with a summary of the biological insights gained from the analysis of the MLE and MSL2 iCLIP experiments. For a detailed description of the follow-up experiments enabled by the iCLIP experiments we refer to the original publication [Ilik et al., 2013].

3.2 iCLIP processing pipeline

In this section, we describe the processing steps for generating a set of uniquely mapped genomic alignments required for the calculation of genome-wide crosslinking events, starting from a paired-end sequenced iCLIP library as prepared for our MLE and MSL2 iCLIP experiments. The sequenced reads include metadata (multiplexing and random sequence tags) as well as readthroughs into the adapter sequences required for sequencing. These non-genomic parts of the reads have to be removed carefully. The processed reads are then mapped to a reference genome, identifying their genomic origins. This procedure must reliably select reads for which unique genomic origins can be identified and provide exact genomic locations of the read ends.

3.2.1 Alignment to the reference genome

iCLIP cDNAs contain two tags adjacent to the 5’ sequencing adapter (Figure 3.1 A): multiplexing and random tags. Multiplexing refers to the simultaneous sequencing of multiple cDNA libraries. Here, the different libraries are labelled with unique multiplexing tags that make it possible to identify the library from which a given read originated. With the Illumina paired-end sequencing mode, both forward and reverse template strands are read. This results in pairs of reads that start from opposite ends of the cDNA and are reverse complementary to each other. The read originating from the forward template is called first mate and the read originating from the reverse template is called second mate (Figure 3.1 B). With this setup, first mate reads contain multiplexing and random sequence tags at their 5’-ends and, in many cases, readthroughs into adapters (Figure 3.1 B). Second mate reads may contain readthroughs into adapters and sequence tags (Figure 3.1 B).

All non-genomic portions of the reads have to be removed before mapping to the reference genome. The precision of this preprocessing is vital because the downstream analysis of crosslinking events will be restricted to reads for which unique and high-quality alignments to the reference genome can be found. Any non-genomic elements left after preprocessing will reduce the quality of the alignments due to the resulting mismatches. At the same time, removal of genomic sequence increases the likelihood that a read is falsely aligned to multiple genomic locations and consequently omitted from the analysis.

Figure 3.1: Schematic overview of the preparation of sequenced cDNA constructs prior to mapping. A) iCLIP cDNAs contain non-genomic elements. B) Paired-end sequencing generates two reads starting from opposite ends of the cDNA. Depending on the length of the genomic insert, both reads may contain readthroughs into adapter and sequence tags. C) Sequence tags from the first mate read are used to create a dictionary of random sequence tags and to demultiplex the replicates. D) Adapters are removed from the 3’-ends of both reads. Second mate reads may contain readthroughs into sequence tag regions. E) Second mate reads are truncated by the tag length of 9 nucleotides to reliably remove any nucleotides left over from the sequence tags. While this may remove genomic sequence from the second mate, no information is lost since these nucleotides are guaranteed to be part of the first mate.

At this stage of the processing, first mate reads contain random nucleotides (5 nucleotides) and multiplexing tags (4 nucleotides) at their 5’-ends. Inserts shorter than the read length of 75 nucleotides contain readthroughs into the adapter regions used for reverse transcription, PCR and sequencing. In the case of second mate reads, readthroughs may also contain random nucleotides and multiplexing tags in addition to readthroughs into adapter regions (Figure 3.1 B).

The first 9 nucleotides of first mate reads contain multiplexing and random tags in the order XXXNNNNXX (X: random nucleotide, NNNN: multiplexing tag). To facilitate subsequent demultiplexing according to the sequence tag, the random nucleotides of the first mate reads, located at positions 1-3 and 8-9, are extracted and used later for PCR duplicate removal after mapping. Second mate reads may contain readthroughs into these regions; removal of these sequences is deferred until after demultiplexing and adapter removal (Figure 3.1 C). After extraction of the random tags, reads were split into different libraries according to the multiplexing tags (sabre: http://github.com/najoshi/sabre).

Most genomic inserts were shorter than the 75 nucleotides sequenced for each mate (data not shown), leading to readthroughs into adapter regions. Readthroughs into 3’-adapters are located at the 3’-ends of first mates. Readthroughs into 5’-adapters are located at the 3’-ends of second mates. The readthroughs into adapter sequences at the 3’-ends of first and second mates were removed using cutadapt [Martin, 2011]. Since PCR-duplicate removal requires exact start and end coordinates of the alignments, sequences at the 5’-ends were not changed. The sequences were also clipped according to their PHRED quality score, using cutadapt parameter ’-q 34’.

After adapter removal, the second mate reads may still contain parts of the random and multiplexing tags (Figure 3.1 D). Since the random nature of these tags makes sequence-based identification impossible, we decided to assume the worst case of readthroughs into the full sequence tag regions for all reads. Accordingly, we removed the last 9 nucleotides from all second mate 3’-ends, corresponding to the length of the whole tag region. Any genomic sequence removed by this step is guaranteed to be contained in the corresponding first mate read, so no information is lost by this procedure (Figure 3.1 E).

The processed reads, now split according to the multiplexing tags, have to be aligned to the genome. For this purpose, we used bowtie2 [Langmead and Salzberg, 2012], a fast and memory-efficient tool built for aligning sequencing reads to long reference sequences. The processed MLE and MSL2 iCLIP reads were aligned to the Drosophila melanogaster reference genome dm3.

In the iCLIP protocol, reverse transcription is expected to terminate after transcription of the nucleotide directly downstream of the crosslinked protein residue [König et al., 2010]. Based on this assumption, the genomic crosslinking position can be determined given the proper alignment of the 5’-end of the first mate read. Mapping of iCLIP reads requires specific parameter settings and post-processing. Here we used bowtie2 parameters "--end-to-end", "--maxins=200" and "--very-sensitive".
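The tag handling described in this section can be sketched as follows. This is a minimal illustration that assumes reads are given as plain sequence and quality strings; the function names are hypothetical, and in the actual pipeline demultiplexing and adapter clipping were performed with sabre and cutadapt.

```python
TAG_LEN = 9  # layout XXXNNNNXX: 3 random + 4 multiplexing + 2 random nucleotides


def split_first_mate(seq, qual):
    """Split a first mate into (random_tag, multiplexing_tag, insert_seq, insert_qual).

    Random nucleotides sit at positions 1-3 and 8-9 (1-based), the
    multiplexing tag at positions 4-7.
    """
    if len(seq) < TAG_LEN:
        raise ValueError("read is shorter than the tag region")
    random_tag = seq[0:3] + seq[7:9]
    multiplexing_tag = seq[3:7]
    return random_tag, multiplexing_tag, seq[TAG_LEN:], qual[TAG_LEN:]


def trim_second_mate(seq, qual):
    """Remove the last TAG_LEN nucleotides from an adapter-clipped second mate.

    This applies the worst-case assumption that every second mate reads
    through the complete tag region.
    """
    keep = max(0, len(seq) - TAG_LEN)
    return seq[:keep], qual[:keep]
```

The random tag returned by `split_first_mate` would be stored alongside the read name so that it is available for PCR duplicate removal after mapping.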


To determine the crosslinking positions from the aligned reads, the first genomic nucleotide of the reads has to be mapped to the genome. In addition, genomic positions of both alignment ends are used during removal of PCR duplicates using the random tags. This precludes use of the bowtie2 local alignment mode; accordingly, we set the "--end-to-end" parameter to require the end-to-end alignment of the mated reads.

Usually, only reads that can be unambiguously mapped to the genome are used for the downstream analysis of CLIP-seq alignments. All reads that can be mapped to multiple genomic locations are excluded from further evaluation. Exclusion of these reads is essential because only for uniquely mapped reads can one be reasonably sure about the true location of the binding sites. Removing reads that can be mapped to multiple genomic locations can thus be seen as a straightforward measure for removing false positives. This filtering is especially important for motif search. If peaks from non-uniquely mapped reads were used, a single point of binding could be misconceived as multiple binding sites. In that case the corresponding sequence would be erroneously recognized as being over-represented during motif search, effectively inventing a fallacious binding motif. To make sure that the majority of reads with multiple possible genomic alignments are identified, we use the "--very-sensitive" preset. This preset increases the effort for finding matching alignments and thus improves detection of multiply alignable reads at the cost of longer runtime.

Allowing alignments with mismatches to the reference genome can compensate for errors introduced during library preparation or sequencing and thus increases the number of usable alignments. However, the possibility of suboptimal alignments complicates the notion of uniquely aligned reads. Because a strict definition of uniqueness may be too stringent for many applications, bowtie2 assigns mapping qualities to all alignments. For cases where multiple alignments of a read are possible, a high mapping quality corresponds to a large difference between the alignment scores of the best and second best alignments. For this analysis, however, we stick to a strict definition of uniqueness and rather remove additional alignments than attribute reads to wrong locations of origin. This filtering is done in an additional processing step after mapping. Alignments having edit distance > 2 to the reference genome are removed as well.
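A post-mapping filter of this kind could look like the following sketch. It relies on two optional SAM fields written by bowtie2: XS:i, which is only present when a second alignment was found for the read, and NM:i, the edit distance. Treating the presence of XS:i as the criterion for non-uniqueness is an assumption made for this illustration; the thesis does not specify how the strict uniqueness filter was implemented.

```python
def optional_fields(sam_line):
    """Parse the optional TAG:TYPE:VALUE columns of one SAM record into a dict."""
    columns = sam_line.rstrip("\n").split("\t")
    tags = {}
    for column in columns[11:]:  # the first 11 SAM columns are mandatory
        tag, value_type, value = column.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return tags


def keep_alignment(sam_line, max_edit_distance=2):
    """Keep an alignment only if it is strictly unique and close to the reference."""
    tags = optional_fields(sam_line)
    if "XS" in tags:  # assumption: any alternative alignment makes the read non-unique
        return False
    edit_distance = tags.get("NM")
    if edit_distance is None:  # unaligned or incomplete record
        return False
    return edit_distance <= max_edit_distance
```

For paired-end data the check would have to pass for both mates of a pair before the pair is retained.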

3.2.2 Identification of crosslinking events

One distinguishing feature of iCLIP compared to other CLIP-seq methods is the ability to identify individual crosslinking events. After mapping to the reference genome, each crosslinking event detected by iCLIP is represented by a number of sequences created during PCR amplification of the cDNA sequences, the PCR duplicates. These copies can be identified using the aforementioned random sequence tags. The resulting set of crosslinking events may contain events derived from non-specific background. While iCLIP employs stringent purification to remove non-crosslinked sequences, some sequences not specifically bound by the protein under consideration will remain. For that reason we use peak calling to filter out non-specific background and only retain high-confidence sites.

The number of sequenced PCR duplicates depends on the amount of starting material (here: the amount of crosslinked immunoprecipitated RNA), the number of target sites and the number of sequenced reads [Sims et al., 2014]. If a large number of PCR cycles is used to amplify the library for sequencing, the number of sequenced PCR duplicates can be quite substantial.

iCLIP employs random sequence tags to identify PCR duplicates. The random sequence tags are part of the adapters ligated prior to reverse transcription, thus all PCR duplicates originating from the same crosslinking event are labelled with the same random tag. After identifying the genomic origins of the reads, the random tags can thus be used to identify PCR duplicates and merge them into individual crosslinking events.

Adapters used for MLE and MSL2 iCLIP library preparation contained random sequence tags of 5 nucleotides and can thus be used to distinguish 4^5 = 1,024 crosslinking events. With single-end sequencing one can be reasonably sure that the aligned position of the 5’-end of the read corresponds to the nucleotide adjacent to the crosslinking event. However, the 3’-end of the sequenced read does not necessarily correspond to the other end of the insert. Characterization of each crosslinking event by the combination of the 5 nucleotide random tag and the genomic position of the reverse transcription truncation site (the 5’-end of the read) would allow for the distinction of up to 1,024 crosslinking events per genomic position.

Whenever multiple crosslinking events at the same nucleotide are assigned the same random sequence tag, this procedure will underestimate the actual number of crosslinking events. The chance of such hidden crosslinking events increases with utilization of the address space: the number of crosslinking events will be underestimated more severely for positions approaching the maximum number of detectable events than for positions with only a few events.

The MLE and MSL2 iCLIP libraries were sequenced using paired-end sequencing, as opposed to the single-end sequencing used with the original iCLIP protocol [König et al., 2010]. Since sequencing starts from both ends of the insert, aligned coordinates of both ends reliably correspond to the ends of the sequenced inserts. Any parts not sequenced, for example because they exceed the lengths of the sequenced reads, are restricted to the inner parts of the sequenced inserts. The number of crosslinking events that can be distinguished per genomic position can thus be increased by using the additional high-confidence coordinate gained by paired-end sequencing. Using this scheme, the number of detectable crosslinking events per genomic position depends on the expected number of different read lengths in addition to the length of the random tags.
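The effect of address space utilization can be made concrete with a small calculation. The sketch below assumes that random tags are drawn uniformly and independently, which is an idealisation not claimed in the thesis; under this assumption the expected number of distinct tags among n events follows the classical occupancy formula N · (1 − (1 − 1/N)^n).

```python
def address_space(tag_length, alphabet_size=4):
    """Number of distinct random tags of the given length."""
    return alphabet_size ** tag_length


def expected_distinct_tags(n_events, n_tags):
    """Expected number of distinct tags observed for n_events at one position.

    Assumes uniform, independent tag assignment (an idealisation).
    """
    return n_tags * (1.0 - (1.0 - 1.0 / n_tags) ** n_events)


if __name__ == "__main__":
    n_tags = address_space(5)
    for n_events in (10, 100, 500, 1024):
        observed = expected_distinct_tags(n_events, n_tags)
        print(n_events, round(observed, 1), round(observed / n_events, 3))
```

The last column printed is the expected fraction of events that remain distinguishable; it decreases as the number of events at a position approaches the size of the address space.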


The maximum of 1,024 detectable crosslinking sites has not been an issue with iCLIP experiments so far. With the MLE and MSL2 iCLIP experiments, however, the majority of reads aligned to only two targets, roX1 and roX2, leading to exceptionally large numbers of crosslinking events for these RNAs (a detailed analysis of MLE and MSL2 targets will follow in Section 3.3.1; see Figure 3.6 for the genomic distribution of crosslinked nucleotides). Using both ends of the mapped reads for PCR deduplication, we were able to detect a maximum of 1,421 crosslinking events per genomic position for MSL2 iCLIP replicate 2 (data not shown). Using only the single 5’-coordinate for PCR deduplication would have allowed us to account for at most 1,024 different crosslinking events per nucleotide, thus trimming the highest observed peak to about 70% of its actual height. Using coordinates of both alignment ends, the number of detected crosslinking events is well below the number of detectable events per nucleotide given the selected cDNA size fractions, ensuring that only a low number of crosslinking events will be hidden due to multiple utilization of random sequence tags.

This analysis showed that the use of the random sequence tags in combination with a single genomic position derived from a single-end read is not sufficient to adequately represent the full range of crosslinking events seen for the MLE and MSL2 iCLIP experiments. The extension of PCR deduplication by using random tags in combination with coordinates of both ends of the alignments is necessary to determine crosslinking counts that accurately reflect the actual number of binding events occurring at each crosslinked nucleotide.
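The difference between the two deduplication schemes comes down to the key used for grouping read pairs. The following sketch assumes that each aligned read pair has already been reduced to a tuple of chromosome, strand, coordinate of the first mate's 5’-end, coordinate of the opposite insert end, and random tag; this record layout and the function name are hypothetical.

```python
from collections import Counter


def count_events(read_pairs, use_both_ends=True):
    """Merge PCR duplicates and return a Counter mapping event key -> read pairs.

    read_pairs: iterable of (chrom, strand, start, end, random_tag), where
    start is the crosslink-side coordinate and end the other insert end.
    """
    events = Counter()
    for chrom, strand, start, end, tag in read_pairs:
        if use_both_ends:
            key = (chrom, strand, start, end, tag)
        else:
            key = (chrom, strand, start, tag)  # single-end style deduplication
        events[key] += 1
    return events
```

Two inserts that share the truncation site and the random tag but differ in length are merged when only the 5’-coordinate is used and kept apart when both ends are used. With the single-coordinate key a position can never show more than 1,024 events, so a site with 1,421 events would be reported at 1,024/1,421, roughly 72% of its height.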

Spurious crosslinking events

As described above, the number of crosslinking events is underestimated when the address space available for detecting crosslinking events is fully utilized. The converse effect, the overestimation of the number of crosslinking events, can be caused by PCR deduplication using random sequence tags when the sequenced library contains many PCR duplicates per crosslinking event. When an error is introduced into a random sequence tag, e.g. during library preparation or sequencing, the corresponding read is not identified as a PCR duplicate of the corresponding crosslinking event. It is then either assigned to another valid crosslinking event or interpreted as an additional event. The latter case introduces an additional, spurious crosslinking event.

Differences between the actual RNA and the sequenced reads are a known and quantified issue. Faircloth and colleagues analysed the effect of these errors on the correct identification of sequence tags [Faircloth and Glenn, 2012]. Two of the identified sources of errors are relevant to the analysis of iCLIP data employing random tags: changes introduced during sequencing and changes introduced during PCR amplification. Errors introduced during sequencing were found to be specific to the sequencing technology in use. The Illumina platform used for sequencing the MLE and MSL2 iCLIP libraries was found to mainly produce substitution errors. Most of these substitutions occur at the starts of the reads, where the random and multiplexing tags are located, and at the ends of the reads. The number of errors introduced during PCR amplification was found to depend on the templates, on the DNA polymerase and on the number of PCR cycles.

A saturated iCLIP library contains at least a small number of PCR duplicates per crosslinking event. In that case all crosslinking events can be detected and a further increase of the sequencing depth would not lead to the identification of additional crosslinking events. If the sequencing depth is increased beyond that point, the library becomes over-saturated. The analysis of an over-saturated library will identify, if at all, only very few additional crosslinking events compared to a saturated library. The large number of PCR duplicates per crosslinking event in an over-saturated iCLIP library, however, increases the number of sequenced reads with erroneous random tags, which in turn increases the number of spurious crosslinking events.

The level of saturation depends on the number of PCR cycles, the number of sequenced reads, the amount of crosslinked RNA and the number of RBP targets. The latter two properties are not known prior to the evaluation of the iCLIP experiment and can, in the best case, only be approximated beforehand.

Both the MLE and MSL2 iCLIP libraries were sequenced with a greater depth than required for saturation and are consequently over-saturated: sequenced reads from both libraries contain exceptionally large numbers of PCR duplicates per crosslinking event. The average number of PCR duplicates was 503 for MLE and 95 for MSL2. In contrast, the average number of PCR duplicates per crosslinking event reported by König and colleagues with the initial description of the iCLIP protocol was 6.5 [König et al., 2010].

Identification of spurious crosslinking events

The expected number of spurious crosslinking events depends on the probability of erroneous random tags. To get an estimate of this probability, we analysed the multiplexing tags of our MLE and MSL2 iCLIP experiments, reasoning that those tags should exhibit error rates comparable to the random tags due to their similarity in length and positioning at the start of the sequenced reads. We restricted this analysis to multiplexing tags deviating by a single nucleotide substitution from the known sequences, as these were very likely derived from one of the known tags. 5% of the sequenced multiplexing tags for MLE and 4% of the sequenced multiplexing tags for MSL2 contained single nucleotide substitutions. Because random tags consist of 5 nucleotides and the multiplexing tags only contain 4 nucleotides, we expect the fraction of random sequence tags containing errors to be slightly larger than 5%. This error rate is comparable to the rates seen for sequence tags located near 5’-ends of reads analysed by Faircloth and colleagues [Faircloth and Glenn, 2012].

Assuming a lower bound of 5% for the rate of erroneous random sequence tags, this translates to an average of 25 reads with erroneous tags, and thus up to 25 spurious crosslinking events, for each MLE iCLIP crosslinking event (based on an average of 503 PCR duplicates). In the worst case, this could result in a 26-fold overestimation of the number of crosslinking events for MLE. The number of crosslinking events for MSL2 would be overestimated 6-fold.

Figure 3.2 shows a typical example of this effect. 463 alignments taken from MLE replicate 1 corresponding to position 243,144 of chromosome 2L were deduplicated into seven crosslinking events. Six of the seven events account for only 15 of the 463 alignments. In addition, each of these six random sequence tags differs only by a single nucleotide from the sequence tag of the seventh event, which accounts for the remaining 448 reads. Accordingly, the number of crosslinking events at this position would have been overestimated 7-fold.

random barcode   #reads   class   frac_top
TATTG            448      top     1
TGTTG            6        1MM     0.013
CATTG            3        1MM     0.007
AATTG            2        1MM     0.004
TTTTG            2        1MM     0.004
GATTG            1        1MM     0.002
TATAG            1        1MM     0.002

Figure 3.2: Errors introduced into random tags during library preparation or sequencing can cause spurious crosslinking events when crosslinking events are represented by many PCR duplicates. As a typical example, we show the crosslinking events for MLE replicate 1 position 243,144 on chromosome 2L. At this position, 463 paired-end reads were combined into 7 crosslinking events. The majority of these reads were assigned to a single crosslinking event with tag TATTG. The remaining 6 crosslinking events incorporate only 3% of the read pairs assigned to this position and each tag differs by only a single nucleotide from the tag of the top event. Assuming 3% of random tags contain a single error, all but the top event in this example must be categorized as false events and should be removed. This error rate matches the error rate seen for the multiplexing tags: 5% of the MLE multiplexing tags and 4% of the MSL2 multiplexing tags had single nucleotide errors.

Our initial observations indicated that spurious crosslinking events caused by erroneous random sequence tags tend to have low numbers of supporting alignments compared to true crosslinking events. To check if the pattern of spurious crosslinking events supported by few PCR duplicates can be identified on a genome-wide scale, we assigned one of several classes to each crosslinking event on the autosomes. First, within each set of crosslinking events sharing the same end coordinates, we identified the crosslinking event supported by the most PCR duplicates (top). From all other events within a set we selected those with a single mismatch in the random sequence tags compared to the top event (1MM) as candidates for spurious crosslinking events. We normalized the number of supporting PCR duplicates of each event within a set by dividing each count of PCR duplicates by the number of PCR duplicates of the corresponding top event (frac_top).

Figure 3.3 A shows the number of crosslinking events of the top and single mismatch (1MM) classes in relation to the amount of supporting PCR duplicates (frac_top). Most crosslinking events with one mismatch in the random sequence tag (1MM) were supported by fewer than 10% of the alignments seen for the events with the highest number of supporting alignments (top). At the same time, these single mismatch events constituted the majority of events identified. Thus, a valid quantification of crosslinking events requires identification and removal of spurious events.

To estimate the amount of evidence supporting spurious crosslinking events, we counted the number of supporting PCR duplicates for each crosslinking event. The resulting histogram shown in Figure 3.3 B reveals that the spurious single mismatch events, while constituting the majority of identified events, are only supported by a minor fraction of the sequenced PCR duplicates.

These evaluations show that, on a global scale, spurious crosslinking events constitute the majority of crosslinking events but are only supported by a minor fraction of the available evidence, which makes an additional filtering step for removing spurious crosslinking events necessary. The removal of these events amounts to the rejection of only a minor fraction of the experimental data while having a huge impact on the overall quality of the remaining crosslinking events. Having established that spurious events can be identified by the fraction of supporting alignments compared to the best supported events, we decided to use this feature to efficiently separate spurious from true events. Among each set of crosslinking events sharing the same end coordinates, we removed all crosslinking events supported by fewer than 10% of the reads of the top event.

3.2.3 Identification of binding sites

The sequenced samples may contain sequences that do not correspond to target sites of the protein that was pulled down after crosslinking. These background sequences constitute either non-crosslinked RNA or RNA that was crosslinked to another protein but pulled down nonetheless due to antibody cross-reactivity [Uren et al., 2012]. The identification of the true interaction sites is commonly referred to as peak calling. Several available peak calling algorithms rely on specific information that is not available with iCLIP. For example, PARalyzer [Corcoran et al., 2011], a method for data stemming from PAR-CLIP experiments, searches for specific nucleotide substitutions introduced during reverse transcription of PAR-CLIP-specific 4-thiouridine nucleotides as a diagnostic for crosslinked nucleotides. Crosslinking induced mutation site (CIMS) analysis uses the observation that crosslinked nucleotides tend to be skipped during reverse transcription, leading to specific deletions at the crosslinking site [Zhang and Darnell, 2011]. With iCLIP, however, the majority of reads is expected to be truncated at the nucleotide downstream of the crosslinked nucleotide. For that reason, only a minor fraction of cDNAs is expected to extend beyond the crosslinked nucleotide and thus be usable for the CIMS analysis.

Figure 3.3: Separation of true and false crosslinking events. (A) Without compensating for errors in random tags, the majority of crosslinking events are false events. From each set of crosslinking events on the autosomes having the same end coordinates (as used for removal of PCR duplicates), we selected the event incorporating the largest number of read pairs (top) and all events with exactly one mismatch in the random tag as compared to the random tag of the top event (1MM). For each crosslinking event we calculated the fraction of incorporated reads with respect to the number of reads of the corresponding top event (frac top). The histogram shows that there are many more 1MM events than top events. Most of these events, however, incorporate fewer than 10% of the reads of the corresponding top events and should thus be considered false events. For all replicates (except MSL2 replicate 3, which was removed from further analysis), the number of false 1MM events is higher than the number of corresponding top events. (B) Although the number of 1MM events is high, the associated number of reads is low. Same as panel A, but counting the number of incorporated reads instead of the number of crosslinking events. False crosslinking events are supported by a tiny fraction of all reads, showing that removal of spurious crosslinking events will retain most of the available reads.

For the purpose of calling peaks for our MLE and MSL2 iCLIP experiments, we investigated two peak callers applicable to CLIP-seq data in general: Piranha and modFDR. The Piranha peak caller [Uren et al., 2012] uses a fraction of the sites in the CLIP-seq data to fit a count distribution of non-specific background reads (by default, the 99% of sites with the lowest scores, e.g. the number of uniquely aligned reads, are used for this purpose). Thus, Piranha assumes that a large fraction of the CLIP reads constitutes unspecific background. For normalization purposes, Piranha allows the incorporation of transcript abundance into the peak calling, e.g. from RNA-seq experiments. However, no such data is available for the Drosophila melanogaster nuclear extracts that were used for the MLE and MSL2 iCLIP experiments. Another generally applicable method employed by several CLIP-seq studies, commonly referred to as modFDR, approximates false discovery rates (FDR) based on simulations of site-unspecific background reads [König et al., 2010; Xue et al., 2009; Yeo et al., 2009]. Under the assumption of very low rates of background, peak detection has also been omitted from CLIP-seq analyses [Licatalosi et al., 2008; König et al., 2010]. While modFDR was applied by König and colleagues, they report that the filtering would have removed 94% of the crosslinked nucleotides. Since this would have affected about half of the nucleotides found to be reproducible between replicates, the authors chose to present the results based on the full unfiltered sets [König et al., 2010].

Accordingly, the choice between Piranha and modFDR depends on the expected amount of unspecific background. Assuming high levels of background, the peak calling should be done using Piranha; assuming high specificity of the iCLIP protocol, no peak calling should be done at all. Ideally, this choice should be informed by experimental evidence for either one of these assumptions, for example from a negative control experiment. Having no negative control available, we chose to pursue neither of these two extremes. Instead, we used peak detection via a modFDR approach. Significance of the crosslinked nucleotides was determined using simulation-derived FDR cut-offs as described by König et al. [2010]. Peak calling was done independently for each replicate, using our own implementation of the algorithm with an FDR cut-off of 5%.
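To make the idea of simulation-derived FDR cut-offs more concrete, the following Python sketch estimates a height threshold for a single region by randomly redistributing the observed crosslinking events. It is a strongly simplified illustration under stated assumptions and not the implementation used in this work: the function name `modfdr_threshold`, the use of raw per-nucleotide counts without any windowing, and the estimate FDR(h) = (mean + standard deviation of the random tail fraction) / observed tail fraction are choices made for this example.

```python
import random
from statistics import mean, pstdev


def modfdr_threshold(counts, n_perm=100, fdr_cutoff=0.05, seed=0):
    """Return the smallest height whose estimated FDR is below fdr_cutoff.

    counts: crosslinking events per nucleotide of one region.
    Returns None if no height passes the cut-off.
    """
    rng = random.Random(seed)
    length = len(counts)
    total = sum(counts)
    max_height = max(counts) if counts else 0

    # For each height h, collect per permutation the fraction of
    # positions reaching at least h events in the randomized data.
    random_tail = {h: [] for h in range(1, max_height + 1)}
    for _perm in range(n_perm):
        shuffled = [0] * length
        for _event in range(total):
            shuffled[rng.randrange(length)] += 1
        for h in random_tail:
            random_tail[h].append(sum(c >= h for c in shuffled) / length)

    for h in range(1, max_height + 1):
        observed_tail = sum(c >= h for c in counts) / length
        background = mean(random_tail[h]) + pstdev(random_tail[h])
        if background / observed_tail < fdr_cutoff:
            return h
    return None


# Toy region: one strongly crosslinked nucleotide and three singletons.
counts = [0] * 100
counts[98] = 30
for position in (95, 96, 99):
    counts[position] = 1

threshold = modfdr_threshold(counts)
significant = [i for i, c in enumerate(counts)
               if threshold is not None and c >= threshold]
```

In this toy region only the nucleotide carrying 30 events exceeds the estimated threshold, whereas the isolated single events are indistinguishable from the randomized background.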


3.3 Identification of MLE and MSL2 binding sites

The sex of both fruit flies and mammals is determined by the X and Y chromosomes: while females have two X chromosomes, males have only one X and an additional Y chromosome. Despite the different number of X chromosomes, the amount of X chromosome gene products of males and females is roughly the same. This balancing of the X chromosomal output between the sexes, termed dosage compensation, is achieved by different means in mammals and fruit flies. In mammals, one of the female X chromosomes is inactivated [Augui et al., 2011]. In Drosophila, transcription of the single male X chromosome is upregulated to generate approximately double output [Conrad and Akhtar, 2011]. In both systems, dosage compensation is accomplished by ribonucleoproteins that regulate X chromosomal output at the level of chromatin organization. In both mammals and Drosophila, these complexes consist of long non-coding RNAs (lncRNAs) and several proteins [Maenner et al., 2012].

In Drosophila, dosage compensation is mediated by the Male-Specific Lethal (MSL) complex. This complex is composed of the five proteins MLE, MOF, MSL1, MSL2 and MSL3 and the two RNA-on-the-X 1 and 2 (roX1 and roX2) lncRNAs [Lucchesi, 1998]. Despite greatly differing lengths (roX1 4,000 nucleotides, roX2 600 nucleotides), the two roX RNAs are functionally redundant. Only simultaneous removal of both roX RNAs leads to a severe reduction of male viability [Meller and Rattner, 2002]. Both RNAs contain conserved elements called roX boxes that were shown to be important for their function in dosage compensation [Kelley et al., 2008; Park et al., 2007; Park et al., 2008]. After assembly, the MSL complex spreads along the X chromosome using high affinity sites (HAS) as entry points [Straub and Becker, 2011]. MLE, roX1 and roX2 are required for this spreading and subsequent coating of the X chromosome [Meller et al., 2000; Meller and Rattner, 2002]. Histone H4 lysine 16 (H4K16) is then acetylated by MOF. This H4K16 acetylation in turn is linked to the transcriptional upregulation of the X-chromosomal genes [Smith et al., 2000; Conrad et al., 2012].

Since roX RNAs are required for dosage compensation mediated by the MSL complex, it is important to better understand the interactions of the roX RNAs with the protein members of the MSL complex. Among these proteins, MLE and MSL2 are of special interest. MLE was shown to be important for incorporation of the roX RNAs into the MSL complex [Meller et al., 2000]; however, the roX RNAs were also found to be associated with MSL complex members in the absence of MLE [Akhtar et al., 2000; Fauth et al., 2010; Izzo et al., 2008; Meller et al., 2000; Smith et al., 2000]. This indicates that additional members of the MSL complex interact with roX RNAs. MSL2 is a good candidate for a roX-interacting member, since partial MSL complexes lacking MSL3 and MOF still co-immunoprecipitate roX RNAs [Kadlec et al., 2011], but partial MSL complexes lacking MSL2 do not co-immunoprecipitate with roX [Hallacli et al., 2012].

3.3.1 Identification of MLE- and MSL-bound sites

To determine binding sites of MLE and MSL2 on the roX RNAs and furthermore clarify whether they interact with RNAs other than roX1 and roX2 in vivo, we employed iCLIP [König et al., 2010]. iCLIP identifies binding sites with single nucleotide resolution. This property is especially important to determine binding sites on the rather short roX2 (∼500 nucleotides) RNA. In contrast, HITS-CLIP only reaches a resolution of about 30 nucleotides [König et al., 2010]. iCLIP was essentially performed as described by König and colleagues [König et al., 2010]. Instead of whole cell lysates, nuclear extracts of cultured Drosophila Clone8 cells were used. Immunopurification of MLE-RNA and MSL2-RNA complexes was performed using antibodies against MLE (rat1) and MSL2 (d300, Santa Cruz) and Protein G (for MLE) or Protein A (for MSL2) Dynal beads (Invitrogen). Biological triplicates of both iCLIP experiments, annotated with different multiplexing tags, were combined after reverse transcription. Adapters contained 5 random nucleotides for distinguishing PCR duplicates. Paired-end sequencing reads of 75 nucleotides were generated using the Illumina HiSeq 2000 platform.

We used the iCLIP pipeline described in Section 3.2 to identify genome-wide MLE and MSL2 crosslinking events. Total reads, alignments and crosslinking events for the three replicates of each sequencing run are summarized in Figure 3.4. In total, bowtie2 aligned 81% of the 183,918,504 reads of the combined MLE and MSL2 experiments, indicating a high quality of the prepared iCLIP libraries. 61% of these reads aligned to unique positions on the genome and were used for further processing. Pooling of PCR duplicates into distinct crosslinking events according to random sequence tags identified 794,353 and 1,433,366 events for MLE and MSL2. Filtering spurious crosslinking events caused by errors in the random sequence tags as described in Section 3.2.2 discarded 88% of the MLE and 72% of the MSL2 events as spurious events, leaving 92,425 and 395,273 high-confidence crosslinking events for MLE and MSL2. MLE and MSL2 crosslinking events were on average represented by 503 and 95 PCR duplicates. Without controlling for the effect of errors in the random sequence tags, the number of MLE crosslinking events would have been overestimated more than 8-fold and the number of MSL2 crosslinking events more than 3-fold. Subsequent peak calling retained 50% of the MLE and 31% of the MSL2 events. In total, the iCLIP pipeline identified 46,875 crosslinking events for MLE and 120,943 crosslinking events for MSL2.

MSL2 replicate 3 was excluded from the remaining analyses due to the very low number of reads. We found a good agreement between the crosslinking profiles of the replicates on roX1 and roX2 for both MLE and MSL2 (Figure 3.5). For that reason, we combined all MLE replicates and the two remaining MSL2 replicates (Figures 3.7 and 3.8).


MLE iCLIP                                              Replicate 1   Replicate 2   Replicate 3         Total
Total reads                                             49,744,886     8,976,213    34,460,608    93,181,707
Mapped read pairs                                       40,668,737     7,531,060    27,668,364    75,868,161
Uniquely aligned read pairs                             24,954,209     3,800,546    17,743,605    46,498,360
Cross-link events after PCR duplicate removal              301,610        38,697       454,046       794,353
Cross-link events after barcode error compensation          37,117         6,123        49,185        92,425
Cross-link events after shuffling                           15,643         1,764        29,468        46,875
Cross-link nucleotides                                         963           179         1,705         2,847

MSL2 iCLIP                                             Replicate 1   Replicate 2   Replicate 3         Total
Total reads                                             37,745,073    51,701,140     1,290,584    90,736,797
Mapped read pairs                                       28,637,172    36,580,377     1,237,893    66,455,442
Uniquely aligned read pairs                             15,872,442    21,789,940        27,038    37,689,420
Cross-link events after PCR duplicate removal              776,539       652,759         4,068     1,433,366
Cross-link events after barcode error compensation         178,505       213,290         3,478       395,273
Cross-link events after shuffling                           55,176        65,718            49       120,943
Cross-link nucleotides                                       3,205         2,928            14         6,147

Figure 3.4: iCLIP pipeline statistics.
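The percentages and fold changes quoted for the pipeline can be recomputed from the totals in Figure 3.4. The short Python sketch below does this; the dictionary keys and the function name `summarize` are names chosen for this example, and the average number of PCR duplicates per event is assumed here to be the number of uniquely aligned read pairs divided by the number of high-confidence events, which reproduces the reported averages.

```python
# Totals over all replicates, copied from Figure 3.4.
stats = {
    "MLE": {
        "unique_pairs": 46_498_360,   # uniquely aligned read pairs
        "after_dedup": 794_353,       # after PCR duplicate removal
        "after_filter": 92_425,       # after barcode error compensation
        "after_peaks": 46_875,        # after peak calling (shuffling)
    },
    "MSL2": {
        "unique_pairs": 37_689_420,
        "after_dedup": 1_433_366,
        "after_filter": 395_273,
        "after_peaks": 120_943,
    },
}


def summarize(s):
    """Derive the summary figures quoted in the text from one column of totals."""
    return {
        # Share of events discarded as spurious (in percent).
        "discarded_pct": 100 * (1 - s["after_filter"] / s["after_dedup"]),
        # Overestimation of event counts without the spurious-event filter.
        "fold_inflation": s["after_dedup"] / s["after_filter"],
        # Share of high-confidence events retained by peak calling (in percent).
        "retained_pct": 100 * s["after_peaks"] / s["after_filter"],
        # Assumed definition: read pairs per high-confidence event.
        "duplicates_per_event": s["unique_pairs"] / s["after_filter"],
    }


summary = {protein: summarize(s) for protein, s in stats.items()}
for protein, values in summary.items():
    print(protein, {key: round(value, 1) for key, value in values.items()})
```

The recomputed values agree with the text: roughly 88% and 72% of events are discarded, corresponding to an inflation of about 8.6-fold for MLE and 3.6-fold for MSL2.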

Figure 3.5: Biological replicates for MLE and MSL2 iCLIP show good agreement on roX1 and roX2 RNAs. iCLIP profiles for A-B) roX1 (3.7 kb) and C-D) roX2 (1.2 kb).


Subsequent analysis of the crosslinking events from the combined replicates revealed roX1 and roX2 as the principal targets of MLE and MSL2 in vivo (Figure 3.6). While both proteins bind to a number of RNAs in the Drosophila transcriptome, the maximum number of crosslinking events per nucleotide was higher for roX1 and roX2 than for any other RNA (Figure 3.6 A and B). In addition, the majority of crosslinking events were located on these two ncRNAs (Figure 3.6 C). Binding profiles of MLE and MSL2 crosslinking events on roX1 (Figure 3.7 A) and roX2 (Figure 3.8 A) show that binding is restricted to separate domains of the two lncRNAs. On roX1, iCLIP identified three domains with a large number of crosslinking events for both MLE and MSL2. Binding of MLE and MSL2 to roX2 was restricted to exon 3. In summary, the MLE and MSL2 iCLIP experiments established the roX1 and roX2 RNAs as the main targets of MLE and MSL2 (Figure 3.6). Binding of both proteins to the roX RNAs is restricted to distinct domains that are similar for both proteins (Figures 3.7 and 3.8).

To further investigate binding of MLE to roX1 and roX2, RNA structure data and motif analysis were combined to identify regions for further investigation using GRNA chromatography [Czaplinski et al., 2005]. GRNA chromatography is able to detect binding of exogenously expressed and tagged RNAs with endogenously expressed proteins. Here, GRNA chromatography was used to detect interactions between endogenously expressed MLE from nuclear Drosophila extracts and peptide-tagged RNAs. Each of the three roX1 domains with a large number of crosslinking events (Figure 3.7 A, D1-D3), but none of the regions in between, were found to be bound by MLE in vitro by GRNA chromatography [Ilik et al., 2013].

Structural probing of roX1 and roX2 RNAs revealed that the secondary structure of both RNAs is organized in regions with tandem stem-loops connected by unstructured regions. Structure probing data was generated by selective 2’-hydroxyl acylation analysed by primer extension (SHAPE) [Wilkinson et al., 2006] and parallel analysis of RNA structure (PARS) [Kertesz et al., 2010] using full length in vitro transcribed roX1 and roX2. Nucleotide-wise probing scores were then used to create structure models for both RNAs. Cartoon-like representations of the RNA structures for roX1 domain D3 and the roX2 domain on exon 3 are depicted in Figures 3.7 B and 3.8 B. Domain D3 contains three stable loop structures connected by flexible (i.e. accessible) linker regions; roX2 has two clusters of tandem stem-loops. Subsequent analysis of MLE binding to roX1 was restricted to domain D3 because of its similarity in length and architecture to the single roX2 domain. Most stems, but not the loop or linking regions, are evolutionarily conserved [Ilik et al., 2013].

Analysis of the two MLE- and MSL2-bound roX domains revealed a number of relevant sequence motifs. In addition to roX boxes (RB) that were previously shown to be important for the function of roX1 [Kelley et al., 2008] and roX2 [Park et al., 2007; Park et al., 2008], two other relevant motifs could


Figure 3.6: iCLIP reveals that MLE and MSL2 interact with RNA in vivo, roX1 and roX2 being the principal targets. A) MLE iCLIP detects 2,447 crosslinked nucleotides. The genomic distribution of crosslinked nucleotides scored by number of crosslinking events shows no particular bias toward any chromosome. Grey bars indicate chromosome sizes. roX1 (blue) and roX2 (red) nucleotides score considerably higher than most other nucleotides. B) MSL2 iCLIP detects 5,206 crosslinked nucleotides. Scores and distribution of roX1, roX2, and other nucleotides are similar to MLE. C) Distribution of crosslinking events. Each crosslinking event was assigned the first matching target class following the hierarchy roX1, roX2, CR41602, rRNA, snoRNA, snRNA, ncRNA, tRNA, 3’-UTR, 5’-UTR, exon, intron. The majority of crosslinking events fall on roX1 and roX2.


Figure 3.7: MLE and MSL2 Bind to Three Clustered Regions within roX1 RNA. A) The iCLIP data show that MLE and MSL2 interact with three domains (D1–D3) of roX1 (3.7 kb). The red box marks roX1 D3, which shows the highest score of MLE binding. B) A cartoon-like representation of roX1 RNA domain-3 (D3). roX1 region-3 is similar to roX2 exon-3 in its arrangement of stem-loops and roX boxes (see Figure 3.8 B). R1H1 contains an RBL element in its stem (pink box) and is followed by R1H2, which is formed by a long-range interaction between the inverted roX box (IRB, green box) element and roX1 box1 (RB1, red-in-blue box). Another stem-loop (P2) is predicted to form in the bulge separating IRB and roX1 box1. Some of MLE’s top iCLIP scores are indicated on top.


Figure 3.8: MLE and MSL2 Bind Exclusively to roX2 Exon-3. A) MSL2 and MLE interact with the evolutionarily conserved third exon of roX2 (1.2 kb) in vivo. B) A cartoon-like representation of roX2 exon-3 shows that it has tandem helical regions at its 5’-end (R2H1, R2H2/3, and P3, white boxes), forming the first stem-loop cluster with RBL elements (pink boxes), and three roX-box elements at its 3’-end (RB1-3, red-in-blue boxes, indicated on top), which also reside in helical structures that form the second stem-loop cluster. MLE’s top iCLIP scores are indicated on top. Constructs shown in panel B: 1-504nt, 1-280nt, 281-540nt, R2H1 mut, R2H2 mut, P3 mut and R2H1-R2H2 mut.


be identified: inverted roX boxes (IRB) and roX box-like elements (RBL) that resemble roX boxes. The largest and most stable stem-loop structure in roX1 contains one such RBL element. In roX1, an IRB forms a long range base-pairing interaction with roX box 1 (RB1). Two additional roX boxes are located in an unstructured region upstream of RB1 (Figure 3.7 B). The first cluster of tandem stem loops towards the 5’-end of roX2 contains 3 roX box-like elements. The second cluster of roX2 tandem stem-loops contains three roX boxes (Figure 3.8 B). Most occurrences of the RB and RBL motifs correspond to binding hotspots of MLE as determined by iCLIP (Figures 3.7 B and 3.8 B). Functionality of the different domains and structures was determined using GRNA chromatography. For this purpose, roX1 binding domain D3 was split into two fragments that were separately tested for binding of MLE. The 5’ fragment contained stem loops R1H1 and P2, the 3’ fragment contained the 3 roX box elements. Binding of MLE could be detected only for the 5’ half but not the 3’ half containing the three RB elements. Binding of MLE could be established, however, when the 3’ half fragment was extended toward 5’ to include both structure P2 and the IRB, thus restoring the long range interaction between the IRB and RB 1 [Ilik et al., 2013]. iCLIP data suggested interactions of MLE with two separate domains of roX2. For that reason roX2 was split into two fragments (1-280nt and 281-540nt, Figure 3.8 B). GRNA chromatography detected MLE associating with the whole roX2 exon 3 and with the 5’ fragment but not with the 3’ fragment. Binding to the 3’ fragment could be established after adding ATP to the assay. MLE binding to the first tandem stem loop cluster of roX2 was only slightly dependent on ATP [Ilik et al., 2013]. 
To establish the importance of RNA structures for binding of MLE, several transcripts of roX2 exon 3 including mutations designed to disrupt various stem loop structures were tested for MLE binding using GRNA chromatography (Figure 3.8 B). Disruption of stem loop structure R2H1 led to decreased binding of MLE to exon 3; a double mutant disrupting stem loop structures R2H1 and R2H2 led to a further decrease of binding. Further assays showed that the two double-stranded RNA-binding domains (dsRBDs) of MLE mediate its interaction with roX2 [Ilik et al., 2013].

3.4 Conclusion

In this chapter we presented the analysis of two iCLIP experiments for determining in vivo binding sites of MLE and MSL2, two members of the MSL complex in Drosophila.

Analysis of the iCLIP sequencing data revealed two characteristics particular to the MLE and MSL2 experiments that required additional processing steps to ensure proper accounting of genome-wide binding events. First, the large number of crosslinking events on roX1 and roX2 caused a saturation of the address space available for assigning PCR duplicates to distinct crosslinking events. Second, the large number of crosslinking events in combination with errors introduced into random sequence tags caused a severe inflation of crosslinking events. We introduced two novel processing steps that successfully compensated for these effects. The extended address space used by the improved PCR duplicate removal scheme ensured the proper accounting of large numbers of crosslinking events per nucleotide. An additional filtering procedure identified spurious crosslinking events and enabled the correct accounting of crosslinking events. Without filtering for spurious crosslinking events, the number of crosslinking events would have been inflated up to 8-fold.

The iCLIP experiments identified roX1 and roX2 as the most prominent targets of MLE and MSL2. Furthermore, iCLIP binding profiles revealed that MLE and MSL2 bind to common domains on roX1 and roX2. The combined analysis of binding profiles determined by iCLIP and RNA structures determined by structural probing experiments revealed the presence of repeated secondary structures with embedded roX box sequence motifs that are bound by MLE and MSL2. These results were used to guide GRNA chromatography experiments that showed that tandem stem-loops and roX-box motifs form the basis of MLE binding to roX1 and roX2.


Chapter 4

GraphProt: Modelling RBP binding preferences

4.1 Introduction

Recent studies revealed that hundreds of RNA-binding proteins (RBPs) regulate a plethora of post-transcriptional processes in human cells [Baltz et al., 2012; Castello et al., 2012; Ray et al., 2013]. Experimental CLIP-seq (crosslinking immunoprecipitation-high-throughput sequencing) protocols are the gold standard for identifying RBP targets [Licatalosi et al., 2008; König et al., 2010; Hafner et al., 2010]. Despite the great success of these methods, there are still some caveats to overcome: (1) the data may contain many false positives due to inherent noise [Corcoran et al., 2011; Uren et al., 2012]; (2) a large number of binding sites remain unidentified (a high false-negative rate), because CLIP-seq is sensitive to expression levels and is both time and tissue dependent [Blencowe et al., 2009]; (3) limited mappability [Derrien et al., 2012] and mapping difficulties at splice sites lead to further false negatives, even on highly expressed mRNAs.

To analyse the interaction network of the RBPome and thus to find all binding sites of a specific RBP, a CLIP-seq experiment is only the initial step. The resulting data requires non-trivial peak detection to control for false positives [Corcoran et al., 2011; Uren et al., 2012]. Peak detection leads to high-fidelity binding sites; however, it again increases the number of false negatives. Therefore, to complete the RBP interactome, computational discovery of missing binding sites is essential. The following describes a typical biological application of computational target detection: A published CLIP-seq experiment for a protein of interest is available for HEK-293 cells, but the targets of that protein are required for liver cells. The original CLIP-seq targets may miss many correct targets due to differential expression in the two tissues. The costs for a second CLIP-seq experiment in liver cells may not be within the budget. Furthermore, since tissues are not readily accessible to the ultraviolet irradiation required for crosslinking, CLIP-seq mostly depends on cell cultures that may not be readily available for the condition of interest. In fact, most CLIP-seq experiments are performed using immortalized cell lines such as HEK-293 or HeLa where expression patterns may differ significantly from human baseline [Landry et al., 2013]. We provide a solution that involves learning an accurate protein-binding model from the HEK-293 CLIP-seq data, which can be used to identify potential targets in the entire transcriptome. Transcripts targeted in liver cells can be identified with improved specificity when target prediction is combined with tissue-specific transcript expression data. Generating expression data is likely cheaper than a full CLIP-seq experiment.

Computational target detection requires large numbers of highly reliable binding sites to train a binding model. Modern experimental methods such as RNAcompete [Ray et al., 2009; Ray et al., 2013] and CLIP-seq [Licatalosi et al., 2008; König et al., 2010; Hafner et al., 2010] facilitate a better characterization of RBP-binding specificities due to two important aspects: (1) the number of binding sites available for model training is increased from tens to thousands of sequences; (2) detection of exact binding locations is more precise, ranging from about 30 nucleotides for RNAcompete and HITS-CLIP [Licatalosi et al., 2008] to measurements at the nucleotide level for iCLIP [König et al., 2010] and PAR-CLIP [Hafner et al., 2010]. A major qualitative difference between CLIP-seq and RNAcompete data is that the latter determines relative binding affinities in vitro, whereas CLIP-seq detects binding events in vivo.

To date, there is a clear deficit of computational tools suited to detecting RBP binding sites; however, a multitude of sequence-motif discovery tools have been developed to detect DNA-binding motifs of transcription factors [Das and Dai, 2007].
Popular examples are MEME [Bailey et al., 2009], MatrixREDUCE [Foat et al., 2006] and DRIMust [Leibovich et al., 2013]. In the past, some of these methods have also been applied to the analysis of RBP-bound RNAs [Sanford et al., 2009; Kazan et al., 2010; Gupta et al., 2013].

It has been established that not only sequence but also structure is imperative for detecting RBP binding [Hiller et al., 2007; Kazan et al., 2010]. Two of the first tools to introduce structural features into target recognition were BioBayesNet [Pudimat et al., 2005] for transcription factor binding sites and MEMERIS [Hiller et al., 2006] for the recognition of RBP targets. MEMERIS is an extension of MEME using RNA accessibility information to guide the search towards single-stranded regions. A recent approach and the current state of the art for learning models of RBP binding preferences is RNAcontext [Kazan et al., 2010; Kazan and Morris, 2013]. RNAcontext is based on a biophysical energy model and extended accessibility information that includes the type of unpaired regions (external regions, bulges, multiloops, hairpins and internal loops). RNAcontext was shown to outperform MEMERIS and a sequence-based approach, MatrixREDUCE, on an RNAcompete set of 9 RBPs [Kazan et al., 2010].

Available approaches that introduce secondary structure into motif detection have two weaknesses. First, a single-nucleotide-based structure profile is used, i.e., a nucleotide is considered paired or unpaired (or part of a specific loop). Second, the main assumption behind these models is that nucleotide positions are scored independently. While this assumption seems to work well for RBP motifs located within single-stranded regions, positional dependencies arise when structured regions (i.e. base-pairing stems) are involved in binding recognition: binding to double-stranded regions involves dependencies between base pairs, which lead to distant stretches of nucleotides in the sequence that can affect the binding affinity [Lee et al., 2002; Gatignol et al., 1993; Lange et al., 2013; Hatoum-Aslan et al., 2011; Masliah et al., 2013].

General requirements of accurate binding models are thus manifold. First, training data nowadays comprise several thousands of RBP-bound sequences; therefore, identification of sequence-and-structure similarities must be computationally efficient. This excludes the use of conventional alignment-based methods (such as LocaRNA [Will et al., 2012; Will et al., 2007] and RNAalifold [Bernhart et al., 2008]). Second, both sequence and structure interdependencies should be modelled, which cannot be achieved by structure-profile-based approaches [Hiller et al., 2006; Wang et al., 2011; Kazan et al., 2010]. Third, models should be robust with respect to noisy data and be able to take quantitative binding affinities into account.

In this chapter we present GraphProt, a flexible machine-learning framework for learning models of RBP binding preferences from different types of high-throughput experimental data such as CLIP-seq and RNAcompete. Trained GraphProt models are used to predict RBP binding sites and affinities for the entire (human) transcriptome, regardless of tissue-specific expression profiles.
We start with a schematic overview of the GraphProt framework and highlight the advantages of this approach: for the first time, in spite of the huge amount of data, we make use of full secondary structure information by relying on an efficient graph-kernel approach. In an evaluation on 24 sets of CLIP-seq and 9 sets of RNAcompete data, GraphProt showed robust and improved prediction performance in comparison to the state of the art. Prediction performance was clearly improved in comparison to RNAcontext [Kazan et al., 2010; Kazan and Morris, 2013] and even more clearly in comparison to a sequence-only approach, MatrixREDUCE [Foat et al., 2006], which was added to accentuate the importance of considering secondary structure. To gain further insight into the binding preferences learned by GraphProt models, we devised a procedure to extract simplified sequence- and structure-binding motifs that could be visualized as well-known sequence logos. We compared our motifs with current literature on binding specificities and found a substantial agreement.

Finally, we showcase two possible applications that consolidate the biological relevance of GraphProt models. First, we estimated affinities for PTB binding sites when training on CLIP-seq data without access to affinity measurements. As a control, we compared these estimated affinities with additional experimental measurements and observed a significant correlation. Thus, our binding models can learn from simple binding and non-binding information to differentiate between strong and weak binding sites. Second, using a GraphProt model trained on a set of Ago2 HITS-CLIP sites, we verified that predicted Ago2 targets are in agreement with changes in transcript expression levels upon Ago2 knockdown. The same trend was not observed for the original HITS-CLIP-detected sites, clearly indicating that GraphProt identifies binding sites missed by the high-throughput experiment.

4.2 The flexible GraphProt framework

The main application of the GraphProt framework is to learn binding preferences using CLIP-seq data and to apply trained models to (1) detect motifs of sequence- and structure-binding preferences and (2) predict novel RBP target sites within the same organism. Figure 4.1 presents a schematic outline of the GraphProt framework: there are two main phases, a training and an application phase. In the training phase, RBP binding sites and unbound sites are derived from CLIP-seq data; highly probable secondary structures (using RNAshapes) are calculated in the context of each potential target site; each structure is encoded as a hypergraph (see Figure 4.2 B) containing both sequence and full secondary structure information; features are extracted from the hypergraphs using efficient graph kernels; and finally a model is trained using a standard machine-learning approach. In the application phase, the trained models are either (1) processed further to generate sequence and structure logos of learned binding preferences or (2) used in a scanning approach to predict (novel) RBP binding sites. The predictions can be viewed as a profile over the entire transcript from which only high-scoring sites can be selected. Note that when affinity measurements are available for a large set of binding sites, we can train a regression model on these measurements—instead of separating sites into bound and unbound. In this case affinities are learned and predicted directly. In subsequent results, however, we show that GraphProt can also accurately predict binding affinities when no affinity data is available for training. In the following, we highlight special features of GraphProt that are not found in RBP-binding prediction tools in the literature. This initial overview is then followed by a detailed description of the graph encoding and graph kernel used by GraphProt.

A natural encoding for RBP binding sites

Conventional feature encoding in RNA-binding models uses aggregate probabilities per nucleotide to characterize RNA structure, i.e., models integrate a structure profile of the bound sequence [Kazan et al., 2010; Wang et al., 2011; Sturm et al., 2010]. The most common measurement is accessibility, which is the probability that a nucleotide is unpaired [Bernhart et al., 2011; Lange

[Figure 4.1 (diagram). Training: RBP CLIP-seq binding sites and selected unbound sites; secondary structure computation (RNAshapes); graph-based encoding; graph kernel features; GraphProt model. Application: binding profiles at nucleotide resolution; high-affinity target site predictions; motif visualization.]

Figure 4.1: Schematic overview of the GraphProt framework.

et al., 2012]; accessibility is used by MEMERIS [Hiller et al., 2006]. RNAcontext [Kazan et al., 2010] employs accessibility and its complement, the probability of bases being paired. Accessibility is further split into the probabilities that an unpaired nucleotide is located within a specific type of loop (e.g. hairpin, bulge or multiloop). These single-nucleotide structure profiles allow RBP target sites to be encoded in sequential data structures, which keeps computation efficient. The downside of structure profiles is that the original structure information of the RNA molecule is severely compressed: instead of storing exact base-pairing information, only the marginal binding propensity of one nucleotide towards all other nucleotides is considered.

We propose a representation that is more natural and fully preserves base-pairing information (Figure 4.2). The key idea is to use a small set of stable structures to represent probable folding configurations on the mRNA in the surrounding context of RBP binding sites. These structures are then encoded as graphs with additional annotations for the type of substructure, i.e., multiloops, hairpins, bulges, internal loops, external regions and stems (see Figure 4.2 B).
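The information loss of structure profiles can be illustrated with a minimal sketch. It assumes a hypothetical toy representation in which base-pair probabilities are stored in a dictionary keyed by position pairs; the function name `accessibility` and the two toy ensembles are illustrations only and are not taken from any of the cited tools. Two ensembles with different pairing partners yield exactly the same per-nucleotide profile.

```python
def accessibility(pair_probs, length):
    """Unpaired probability per position: 1 minus the summed pairing probabilities."""
    paired = [0.0] * length
    for (i, j), p in pair_probs.items():
        paired[i] += p
        paired[j] += p
    return [1.0 - q for q in paired]

# Hypothetical toy base-pair probabilities on an 8-nt RNA (0-based positions).
ensemble_a = {(0, 7): 0.4, (1, 6): 0.4}  # position 0 pairs with 7, 1 with 6
ensemble_b = {(0, 6): 0.4, (1, 7): 0.4}  # position 0 pairs with 6, 1 with 7

profile_a = accessibility(ensemble_a, 8)
profile_b = accessibility(ensemble_b, 8)
# Both profiles are identical although the pairing partners differ.
```

The profile keeps only the marginal per position, so the identity of the pairing partner cannot be recovered from it.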

Advantages of graph-kernel features

In order to efficiently process RNA structures encoded as graphs, we propose a method based on graph kernels. The main idea is to extend the k-mer similarity notion for strings (i.e. counting the fraction of common small substrings) to graphs and finally to fit a predictive model using algorithms from the Support Vector Machine (SVM) family [Cortes and Vapnik, 1995] for classification problems and Support Vector Regression (SVR) [Drucker et al., 1997] when affinity information is available.

Using a graph-kernel approach, we extract a very large number of features (i.e. small disjoint subgraphs, see Figure 4.2 C and Section 4.2.2) in a combinatorial manner and assess their importance in discriminating between bound and unbound regions. The use of disjoint subgraphs allows a notion of a binding motif that is more expressive than the one offered by traditional Position Specific Scoring Matrices [Gowri et al., 2006] because it takes the simultaneous interdependencies between sequence and structure information at different locations into account. Feature importance information can be used not only to build accurate predictors but also, in a subsequent processing step, to identify sequence- and structure-binding preferences.
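The k-mer similarity notion for strings mentioned above can be sketched as follows. This is one common formulation (a cosine-normalized k-mer spectrum), given here as an assumption for illustration; the function names and the choice k = 3 are hypothetical and do not describe the GraphProt implementation.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k):
    """Count every substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity(a, b, k=3):
    """Cosine-normalized dot product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    dot = sum(ca[m] * cb[m] for m in ca)
    norm = sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

similar = kmer_similarity("UGUAAAUA", "UGUAUAUA")    # shares several 3-mers
unrelated = kmer_similarity("AAAA", "CCCC")          # shares none
```

A graph kernel follows the same recipe, with small subgraphs taking the place of k-mers.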

4.2.1 Graph encoding of RNA sequence and structure

We propose an easy-to-adapt method to encode information about RNA sequence and structure in a natural way. The key idea is to use a generic hypergraph formalism to annotate different types of relations: (1) relations between nucleotides, such as sequence backbone or structure base pairs; and

Figure 4.2: Natural encoding of RBP-bound sites and graph kernel features. A) The region identified in the respective CLIP-seq experiment (yellow) is symmetrically extended by 150 nucleotides in order to compute representative secondary structure information. B) The RNA secondary structure of each RBP-bound context is represented as a graph. Additional information on the type of substructures (i.e. whether a group of nucleotides is located within a stem, or within one of the loop types) is annotated via a hypergraph formalism. C) A very large number of features is extracted from the graphs using a combinatorial approach: a valid feature is a pair of small subgraphs (parametrized by a radius R) that are at a small distance from each other (parametrized by a distance D). The feature highlighted in orange is an example of a feature that can account for the simultaneous interdependencies between sequence and structure information at different locations.

(2) relations between abstract structure annotations, such as loops or stems, and the corresponding subsequences.

For the encoding of RNA sequence and structure, we start from the representation used in GraphClust [Heyne et al., 2012] and provide several useful extensions. In GraphClust, an RNA sequence is encoded, together with its folding structure, as a graph, where vertices are nucleotides and edges represent either a sequence backbone connection or a bond between paired bases. We do not rely on a single best-folding structure (e.g. the one achieving minimum free energy) because this is known to be error prone; instead, one can sample the population of all possible structures and retain highly probable, representative candidates. The sampling strategy was implemented via the shape abstraction technique introduced by RNAshapes [Steffen et al., 2006]. RNAshapes categorizes all secondary structures according to a simplified representation, called the

shape, which abstracts certain structural details. Different abstraction levels, which ignore various structure details, are possible, e.g., ignoring all bulges, or all bulges and all internal loops; stem lengths are always ignored. Out of all possible structures that have identical shapes, RNAshapes considers the one with minimum free energy as the representative, called the shrep. We calculate shreps using shifting windows of 150 nucleotides at a step size of 37 nucleotides and predict, for each window, up to 3 shreps that are required to be within 10% of the minimum free energy of the sequence.

In this work, we extended the representation used in GraphClust [Heyne et al., 2012] in three ways: (1) we added a layer of abstract structure information to the secondary structure representation (see Figure 4.2 B); (2) we considered an oriented version of the graphs; and (3) we imposed a restriction on the graph, termed the viewpoint, so that features are only extracted from the informative part, i.e., the part where RBP-binding is hypothesized to occur (see Figure 4.2 A).
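The basic GraphClust-style encoding described above (vertices are nucleotides, edges are backbone connections or base pairs) can be sketched in a few lines. The sketch assumes that a structure is given in dot-bracket notation, which is an assumption of this example and not stated in the text; the function name `encode_graph` and the edge labels are hypothetical, and the abstract layer, orientation and viewpoint extensions are not included.

```python
def encode_graph(seq, dot_bracket):
    """Return (vertex labels, labelled edges) for one sequence/structure pair."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    labels = list(seq)
    # Backbone edges connect neighbouring nucleotides.
    edges = [(i, i + 1, "backbone") for i in range(len(seq) - 1)]
    stack = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            # Base-pair edge from the 5' partner to the 3' partner.
            edges.append((stack.pop(), i, "basepair"))
    if stack:
        raise ValueError("unbalanced structure")
    return labels, edges

labels, edges = encode_graph("GGGAAACCC", "(((...)))")
```

Each retained shrep of a window would yield one such graph, so a binding site is represented by a small set of graphs rather than by a single structure.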

Encoding abstract structure information

In order to better model high-level characteristics of an RNA structure and increase the capacity of the model to detect distantly related sequences, we considered an additional layer of secondary structure annotations that we call abstract. This layer generalizes the specific nucleotide information and characterizes only the generic shape of a substructure (analogous to the shape abstraction in RNAshapes [Steffen et al., 2006]) such as stems (S), multiloops (M), hairpins (H), internal loops (I), bulges (B), and external regions (E) (see the right-hand side of Figure 4.2 B). This type of annotation is much richer than what could be achieved by merely labelling the corresponding nucleotides (e.g., a nucleotide C within a stem could be labelled as C-S and within a bulge loop as C-B), and it allows dependencies to be extracted at a purely abstract level (i.e. between abstract secondary structure elements) and at a hybrid level (i.e. between abstract secondary structure elements and specific nucleotides). To represent such a rich annotation scheme, we required the expressive power of hypergraphs, which generalize the notion of an edge to that of a relation between many vertices (see Figures 4.2 and 4.3).

Sequence-only encoding

It is possible to use GraphProt in a pure sequence mode, ignoring the RNA secondary structure by discarding base-pairing edges and abstract RNA structures. In this case, GraphProt behaves like an efficient string-kernel machine with gaps in the spirit of [Leslie et al., 2004].

Figure 4.3: Extensions to the graph kernel for GraphProt. A) Transformation of a hypergraph to an equivalent incident graph. B) Mixed abstract–ground level hypergraph features. Two identical occurrences of the subsequence UUC yield two independent features, one that is aware of the internal loop location and the other that is aware of the hairpin loop location. C) Undirected to directed graph transformation: edges are directed following the 5’ to 3’ direction; an additional copy of the graph with inverted edges and relabelled vertices (i.e. using the prefix r) is added. (1) A fragment C(G-C)U is highlighted; in the undirected case, the reversed substructure U(G-C)C generates identical features. (2) The directed treatment allows features that discriminate between the two fragments: the neighbourhood of vertex G generates the feature (G-C)U in the main direction and (rG-rC)rU in the reverse direction. D) Viewpoint extension: using a large window allows the correct folding of the RNA molecule; however, as we are interested in a local phenomenon, we restrict the extraction of features to a smaller subportion that reflects the relevant part of the RNA, i.e. the RBP binding site. The viewpoint area is highlighted in yellow. The portion of the folded RNA molecule that will be accessed to extract features when the parameters for the NSPD Kernel are radius+distance=5 is highlighted in red.

4.2.2 Graph kernel

The graph kernel used by GraphProt is the Neighbourhood Subgraph Pairwise Distance kernel (NSPD Kernel) [Costa and Grave, 2010]. The main idea of the approach is to decompose a graph into a set of small overlapping subgraphs (see Figure 4.2 C); every subgraph is then assigned a numerical identifier via an efficient hash-based technique. The identifier is used to solve the isomorphism detection problem in an approximate but extremely fast way and it is used to

build the final explicit feature encoding. In this way we build representations that can effectively use millions of features. The type of subgraph chosen in NSPD Kernel is the conjunction of two neighbourhood subgraphs at a small distance from each other. Two parameters determine the characteristics of these subgraphs (and are thus related to the complexity and size of the entire feature set): (1) the maximum size of the neighbourhood, called the radius R, and (2) the maximum distance between any two root nodes, called the distance D. Features are extracted for all combinations of values r ≤ R and d ≤ D. The use of pairs of subgraphs effectively captures the correlation between features [Kundu et al., 2013]. For that reason we can use fast linear kernels without significant performance penalty compared to using more costly non-linear kernels (e.g. Gaussian or RBF kernels). In this work, the NSPD Kernel was extended in the following way: (1) we upgraded the encoding from graphs to hypergraphs to be able to annotate the RNA abstract structure elements; (2) we considered directed graphs rather than undirected graphs; and (3) we introduced a way to select subsets of features via the viewpoint notion.
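The decomposition into pairs of neighbourhood subgraphs can be illustrated with a simplified sketch. It is not the NSPD Kernel of [Costa and Grave, 2010]: the neighbourhood identifier used here (vertex labels and edges keyed by their distance to the root, hashed with CRC32) is a deliberately simple stand-in for the original hashing scheme, edge labels are ignored, graphs are undirected with integer vertex identifiers, and all function names are hypothetical.

```python
from collections import Counter, deque
import zlib

def bfs_distances(adj, root, limit):
    """Shortest-path distances from root to all vertices within limit."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == limit:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def neighbourhood_code(adj, labels, root, r):
    """Numerical identifier of the radius-r neighbourhood around root."""
    dist = bfs_distances(adj, root, r)
    nodes = sorted((d, labels[v]) for v, d in dist.items())
    edges = sorted(
        tuple(sorted([(dist[u], labels[u]), (dist[v], labels[v])]))
        for u in dist for v in adj[u] if v in dist and u < v
    )
    return zlib.crc32(repr((nodes, edges)).encode())

def nspd_features(adj, labels, R, D):
    """Count pairs of neighbourhoods for all radii r <= R and distances d <= D."""
    codes = {(v, r): neighbourhood_code(adj, labels, v, r)
             for v in adj for r in range(R + 1)}
    feats = Counter()
    for u in adj:
        for v, d in bfs_distances(adj, u, D).items():
            if v < u:
                continue  # visit each unordered pair of roots once
            for r in range(R + 1):
                pair = tuple(sorted((codes[(u, r)], codes[(v, r)])))
                feats[(r, d, pair)] += 1
    return feats

def linear_kernel(fa, fb):
    """Dot product of two sparse feature count vectors."""
    return sum(count * fb[f] for f, count in fa.items())

# Toy path graph A-C-G, and the same graph with renumbered vertices.
f1 = nspd_features({0: [1], 1: [0, 2], 2: [1]}, {0: "A", 1: "C", 2: "G"}, R=1, D=2)
f2 = nspd_features({0: [2], 2: [0, 1], 1: [2]}, {0: "A", 2: "C", 1: "G"}, R=1, D=2)
f3 = nspd_features({0: [1], 1: [0, 2], 2: [1]}, {0: "A", 1: "C", 2: "U"}, R=1, D=2)
```

Because the identifiers depend only on labels and distances, the renumbered graph produces the same feature counts, and the explicit count vectors can be compared with a plain dot product.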

A kernel for hypergraphs

In the NSPD Kernel of [Costa and Grave, 2010], shortest paths can access all vertices and edges in the graph. When the graph contains vertices with a large degree (i.e. is not sparse), however, the shortest path distance notion becomes degenerate and many vertices become immediate neighbours of each other. Under these conditions, the NSPD Kernel would generate uninformative features corresponding to extremely large subgraphs that are unlikely to occur in more than one instance. Thus, effective learning or generalization would be impossible. This situation would occur if we used the incident graph representation for hypergraphs as shown in Figure 4.3 A (left): hyperedges (i.e. relations) would yield vertices with a large degree, for example, a hairpin-loop relation would produce a vertex connected to all nucleotides belonging to the respective hairpin loop. This would effectively remove the nucleotide order of the RNA sequence, since there would exist a shortest path of length two between any two nucleotides in the original hairpin sequence.

In order to deal with this issue, we extended the NSPD Kernel to work on the incident graph as visualized on the right side of Figure 4.3 A. Additional hyperedge vertices are introduced as an intermediate between corresponding vertices of the ground and abstract levels. These additional relation vertices are considered as non-traversable by paths, resulting in two separate sets of features where one set is based exclusively on ground level vertices and the other set is based exclusively on abstract level vertices. By creating additional features (i.e. pairs of subgraph decompositions), where the root vertices of the two paired neighbourhoods are on the two endpoints of the hyperedge relation (Figure 4.3 B), we create features that are aware of the nucleotide composition of a substructure and,

at the same time, of the position of that substructure in the global abstract structure annotation. Consider Figure 4.3 B: without the abstract structure annotation, the two occurrences of the subsequence UUC (indicated in green) would be indistinguishable. With the abstract annotation, we generate two independent features, one that is aware that UUC is located in an internal loop (the vertex labelled I surrounded by two stems), and another feature that is aware that UUC is located in a hairpin loop (the vertex labelled H, preceded by a stem).

By making the relation vertex non-traversable, we have separated the basic from the abstract part of the graph. The NSPD Kernel features in this case can be divided into three separate sets: one set for the basic part, which corresponds to the features used in GraphClust [Heyne et al., 2012]; a set of novel features for the abstract part; and finally a hybrid set of features that relate nucleotide composition to the abstract part. Note that the features for the abstract part are independent of the exact nucleotide composition of the underlying substructures and therefore allow a better generalization for distantly related RNA sequences.

Directed graphs

Using undirected graphs for RNA sequences (as in GraphClust [Heyne et al., 2012]) means that the order imposed by the 5’→3’ asymmetry is lost. Hence, a sequence and its reversed counterpart (not the complement) would yield the same feature representation. To overcome this limitation, we extended all notions in the NSPD Kernel [Costa and Grave, 2010] to directed graphs. For this, we required an unambiguous definition of edge direction: (1) the sequence backbone edges reflected the natural 5’→3’ direction; (2) the base-pair edges were directed in natural order, i.e. away from the nucleotide closer to the 5’-end and towards the nucleotide closer to the 3’-end; and (3) edges in the abstract part were directed by starting at the sequence ends and travelling from the inner annotations towards the outer limbs, i.e., starting from multiloops and ending at hairpin loops. Finally, to capture all relevant information, while still maintaining the consistency with the chosen direction, we duplicated the graph, relabelled all vertices by adding a distinguishing prefix, and reversed the direction of all edges (see Figure 4.3 C).
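The effect of orientation can be shown on the smallest possible features, single backbone edges. The sketch below is an illustration under simplifying assumptions: only backbone edges are considered, a feature is just the pair of vertex labels at the two ends of an edge, and the function names are hypothetical. The prefix r for the reversed copy follows the convention of Figure 4.3 C.

```python
from collections import Counter

def undirected_edge_features(seq):
    """Backbone edges without orientation: each edge is an unordered label pair."""
    return Counter(tuple(sorted(pair)) for pair in zip(seq, seq[1:]))

def directed_edge_features(seq):
    """Backbone edges in 5'->3' order plus an r-prefixed copy with inverted edges."""
    forward = Counter(zip(seq, seq[1:]))
    reverse = Counter(("r" + b, "r" + a) for a, b in zip(seq, seq[1:]))
    return forward + reverse

same = undirected_edge_features("CGU") == undirected_edge_features("UGC")
different = directed_edge_features("CGU") != directed_edge_features("UGC")
```

In the undirected case the sequence CGU and its reversal UGC are indistinguishable, whereas the directed features, including the relabelled reverse copy, separate them.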

Selection of kernel viewpoints

In the NSPD Kernel [Costa and Grave, 2010] of GraphClust [Heyne et al., 2012], all vertices are considered in the generation of features. This is suitable when global RNA sequences are being compared. In the case of RBP-binding sites on the mRNA, however, only the local target region could be informative and considering all vertices would lead to a substantial amount of noise and decrease the overall predictive performance. Thus, without losing discriminative power,

we reduced the number of vertices considered to a fixed subregion of the sequence called the viewpoint (see Figures 4.2 and 4.3). In a supervised setting, the viewpoint area is selected randomly for negative examples and, for the positive examples, around the region covered by the RBP-bound sequence identified by the respective high-throughput experimental technique. In a genome-wide scanning setting, it would be selected with a moving window approach. Note that we cannot simply reduce the graph encoding to fit exactly this reduced area, since in so doing we would lose the nucleotides located up- and downstream of the selected area that are needed to estimate the folding structure of the RNA. In detail, we require that the root vertex of at least one of the two neighbourhoods is localized in the viewpoint area. This way we still allow an accurate folding of the mRNA, considering 150 nucleotides up- and downstream of the viewpoint [Lange et al., 2012], but we only select features that are local to the area of interest. The other hyper-parameters of the NSPD Kernel, namely the distance D and the radius R, determine the area of influence around the putative target region, i.e., the portion of the mRNA used to extract relevant information for the discriminative task (see Figure 4.3 D). The viewpoint technique was first introduced in [Frasconi et al., 2012].
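The viewpoint criterion (at least one of the two root vertices lies in the viewpoint area) amounts to a simple filter over candidate root pairs. The following sketch assumes that vertices are identified by their 0-based sequence position and that the viewpoint is a half-open interval; both conventions and the function names are hypothetical choices for this example.

```python
def in_viewpoint(pos, viewpoint):
    """True if pos lies in the half-open interval [start, end)."""
    start, end = viewpoint
    return start <= pos < end

def select_root_pairs(root_pairs, viewpoint):
    """Keep pairs where at least one root vertex lies inside the viewpoint."""
    return [(u, v) for u, v in root_pairs
            if in_viewpoint(u, viewpoint) or in_viewpoint(v, viewpoint)]

# A 30-nt site flanked by 150 nt of folding context on each side.
viewpoint = (150, 180)
pairs = [(10, 12), (149, 151), (160, 161), (179, 185), (180, 182)]
kept = select_root_pairs(pairs, viewpoint)
```

Pairs whose second root lies just outside the viewpoint are still kept, which is how the radius and distance parameters extend the area of influence beyond the site itself.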

4.2.3 Application of predictive models

As outlined in Figure 4.1, GraphProt models can be applied to predict novel binding sites and to visualize sequence and structure preferences.

Scoring whole sequences and predicting binding profiles

A trained GraphProt model can be applied to any transcript (or 3’-UTR) to predict (novel) binding sites from the same organism (cross-species compatibility may exist, but was not tested and, as far as we know, has not been reported in the literature). Two options for prediction are available. First, an entire sequence window, representing a potential binding site, is assigned a score that reflects the likelihood of binding: the score is the prediction margin as given by the machine-learning software, e.g. the SVM; positive values indicate a predicted binding site and negative values indicate that no binding is predicted. Second, to generate prediction profiles on a nucleotide level, we process the prediction margins reported by the software per feature (i.e., the importance of that feature for predicting RBP binding), not per window. Profiles are calculated per nucleotide by summing over all features for which the corresponding nucleotide is a root node in the subgraph representing the feature (Figure 4.2 C).
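The per-nucleotide summation can be sketched as follows. The data layout is an assumption made for this example: `feature_roots` maps a feature identifier to the sequence positions at which that feature is rooted, and `feature_margins` maps a feature identifier to its margin contribution. Neither name nor layout is taken from the GraphProt software.

```python
def nucleotide_profile(length, feature_roots, feature_margins):
    """Sum per-feature margins onto every nucleotide that is a root of the feature."""
    profile = [0.0] * length
    for feature, roots in feature_roots.items():
        for pos in roots:
            profile[pos] += feature_margins.get(feature, 0.0)
    return profile

# Hypothetical toy input: feature f1 is rooted at positions 0 and 1, f2 at position 1.
roots = {"f1": [0, 1], "f2": [1]}
margins = {"f1": 0.5, "f2": -0.2}
profile = nucleotide_profile(3, roots, margins)
```

Position 1 receives contributions from both features, while position 2 is not a root of any feature and keeps a score of zero.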

Extracting high-affinity binding sites

High-affinity binding sites can be extracted from prediction profiles as we exemplified for Ago2 (see Section 4.3.5). To predict high-affinity Ago2 target sites, we calculated binding profiles for the 3’-UTRs of genes with corresponding

fold-changes from the Ago2 knockdown experiment in [Schmitter et al., 2006] using the GraphProt sequence-only model, trained on the Ago2 HITS-CLIP set. Since proteins do not only bind to single nucleotides, binding scores were averaged for all 12-mer windows. To obtain high-affinity Ago2 binding sites, we considered the 1% highest-scoring 12-mers and merged overlapping or abutting sites.
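The three steps described above (averaging over 12-mer windows, keeping the highest-scoring fraction, merging overlapping or abutting windows) can be sketched as follows. The sketch makes assumptions that the text does not specify: the top fraction is taken per profile rather than across all 3’-UTRs, at least one window is always kept, and sites are reported as half-open intervals. The function names are hypothetical.

```python
def window_means(profile, width=12):
    """Mean profile score of every window of the given width."""
    return [sum(profile[i:i + width]) / width
            for i in range(len(profile) - width + 1)]

def top_sites(profile, width=12, fraction=0.01):
    """Keep the top-scoring fraction of windows and merge overlapping or abutting ones."""
    means = window_means(profile, width)
    if not means:
        return []
    keep = max(1, int(len(means) * fraction))
    ranked = sorted(range(len(means)), key=means.__getitem__, reverse=True)
    sites = []
    for start in sorted(ranked[:keep]):
        if sites and start <= sites[-1][1]:  # overlaps or abuts the previous site
            sites[-1][1] = max(sites[-1][1], start + width)
        else:
            sites.append([start, start + width])
    return [tuple(site) for site in sites]

# Toy profile: 40 nucleotides with a block of high scores at positions 10..21.
toy_profile = [0.0] * 40
for pos in range(10, 22):
    toy_profile[pos] = 1.0
single = top_sites(toy_profile)                 # one window kept
merged = top_sites(toy_profile, fraction=0.1)   # two overlapping windows merged
```

With the default fraction only the best window survives; with a larger fraction the neighbouring window is selected as well and both are merged into one site.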

Visualising sequence- and structure-binding preferences

To provide visual representations for both sequence and structural preferences encoded by the GraphProt models, we predicted and scored the approximately 25,000 folding hypotheses of up to 2,000 CLIP-seq-derived binding sites. For each folding hypothesis per binding site, we extracted only the highest-scoring 12-mer, where the score is the average prediction margin per nucleotide from the binding profile, analogous to the method of predicting the Ago2 binding sites. To visualize structure preferences, we compressed full secondary structure information into structure profiles: a nucleotide is assigned to the structure element it occurs in, i.e. stems (S), external regions (E), hairpins (H), internal loops (I), multiloops (M) and bulges (B). The 1,000 highest-scoring 12-mer nucleotide sequences and structure profiles were converted into sequence and structure logos, respectively (using WebLogo [Crooks et al., 2004]; all logos are found in Supplementary Section B.2).

4.3 GraphProt performance evaluation

Here, we evaluate GraphProt performance for learning binding preferences from CLIP-seq as well as RNAcompete data. We use the straightforward representation of learned binding preferences via sequence logos to show that GraphProt models capture known binding preferences and analyse the performance improvement gained by modelling RNA structure in addition to RNA sequence. We conclude with the presentation of two application scenarios for GraphProt models: estimation of binding affinities from categorical data and genome-wide prediction of binding sites. Throughout this section we report increases in cross-validation performance using relative error reduction, defined as

\[
\frac{x' - x}{1 - x} \tag{4.1}
\]

where x is the baseline performance and x' is the improved performance. The performance is a function with codomain in the interval [0, 1] and is 1 when all predictions correspond exactly to the desired targets. The (generalized) error notion is consequently defined as e = 1 − x.
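As a small worked example of Equation 4.1, the sketch below computes the relative error reduction for illustrative values that are not taken from the results; the function name is hypothetical.

```python
def relative_error_reduction(baseline, improved):
    """(x' - x) / (1 - x) for performance values in [0, 1]."""
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must lie in [0, 1)")
    return (improved - baseline) / (1.0 - baseline)

# Illustrative values: a baseline of 0.80 has an error of 0.20;
# improving to 0.90 removes half of that error.
example = relative_error_reduction(0.80, 0.90)
```

The measure expresses the gain as a fraction of the error that the baseline left, so the same absolute improvement counts for more when the baseline is already strong.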

4.3.1 Learning binding preferences from high-throughput data

GraphProt learns binding preferences from CLIP-seq

Computational approaches for predicting RBP binding sites require large amounts of training data; the growing number of available CLIP-seq experiments makes these a valuable source of target sites bound by specific RBPs. To benchmark the ability of GraphProt to detect binding preferences of RBPs from human CLIP-seq data, we used 24 sets of HITS-CLIP-, PAR-CLIP- and iCLIP-derived binding sites: 23 were curated by doRiNA [Anders et al., 2012] and an additional set of PTB HITS-CLIP binding sites was retrieved from GEO accession GSE19323 [Xue et al., 2009]. The full listing of publications corresponding to the used CLIP-seq data can be found in Supplementary Section B.2. The Ago1-4 and IGF2BP1-3 sets contain combined binding sites of several proteins; four of the sets consist of ELAVL1 binding sites derived by both HITS-CLIP and PAR-CLIP. Other proteins included are ALKBH5, C17ORF85, C22ORF28, CAPRIN1, EWSR1, FUS, HNRNPC, MOV10, PTB, PUM2, QKI, SFRS1, TAF15, TDP-43, TIA1, TIAL1 and ZC3H7B.

We compared the performance of GraphProt to RNAcontext [Kazan et al., 2010] and MatrixREDUCE [Foat et al., 2006]; MatrixREDUCE was added to the benchmark comparison because it is a sequence-based method that previously displayed promising results in a comparison with RNAcontext [Kazan et al., 2010], the current state of the art. GraphProt uses an extended sequence context for structure prediction, but centres on the CLIP-seq sites using the viewpoint technique (Section 4.2.2 and Figure 4.2 A). To enable a fair comparison, the same context sequences (for structure prediction) and viewpoint information (for target sites) were considered for RNAcontext and MatrixREDUCE. Peaks of more than 75 nucleotides were excluded from all training sets to reduce the number of peaks likely to correspond to multiple binding sites. iCLIP sites were extended by 15 nucleotides up- and downstream as they were generally narrower than HITS-CLIP and PAR-CLIP sites.
For each set of CLIP-seq sites, we created a set of unbound sites by shuffling the coordinates of bound sites within all genes occupied by at least one binding site, thus enabling the training of models using binary classification. To enable an accurate prediction of secondary structures [Lange et al., 2012], we extended the binding sites in both directions by 150 nucleotides or until reaching a transcript end. Core binding-site nucleotides, but not the additional context for folding, were marked as viewpoints. All expansions were done using genomic coordinates. Secondary structure profiles for RNAcontext were calculated using a modified version of RNAplfold [Bernhart et al., 2011] that calculates separate probabilities for stacking base pairs (i.e. stems), external regions, hairpins, bulges, multiloops and internal loops. Profiles for RNAcontext were calculated

using the full sequences; training and testing were performed on the same core binding sites that were marked as viewpoints for GraphProt. This ensures that RNAcontext still has access to the full sequence context required for structure prediction while providing the same concise binding sites as used by GraphProt. MatrixREDUCE was also evaluated using only the viewpoint regions.

The predictive performance of models trained on CLIP-seq data was evaluated by 10-fold cross-validation. This technique assesses the ability of a method to predict RBP target sites that were not seen during training, which is analogous to the prediction of novel sites. Classification performance is given as Area Under the ROC Curve (AUC) using the SVM margins as the diagnostic results of classification.

GraphProt is parametrized in its three main components: the graph encoding part, the graph kernel feature part and the predictive model part. The main parameter in the graph encoding part is the abstraction level of the shape category; in the graph kernel feature part the main parameters are the maximal radius R and the maximal distance D that define the neighbourhood subgraph features; in the predictive model part, in the classification case, the SVM models were trained using a stochastic gradient descent (SGD) approach [Bottou and LeCun, 2004] and the main parameters are the number of training epochs and the parameter λ, which controls the trade-off between the fitting accuracy and the regularization strength (Supplementary Tables B.3 and B.4). The optimal values for all these parameters were determined jointly via a line search strategy, i.e. all parameters were kept fixed with the exception of one, and the parameter subject to optimization was chosen in a round-robin fashion.
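The round-robin line search can be sketched as a simple coordinate-wise search over parameter grids. The sketch is an illustration under assumptions: `evaluate` is a hypothetical stand-in for the performance of a model trained with the given parameters, the grids and the number of rounds are arbitrary, and the function names do not come from the GraphProt software.

```python
def line_search(grids, start, evaluate, rounds=2):
    """Optimize one parameter at a time, holding the others fixed, in round-robin order."""
    best = dict(start)
    best_score = evaluate(best)
    for _ in range(rounds):
        for name, values in grids.items():
            for value in values:
                trial = dict(best, **{name: value})
                score = evaluate(trial)
                if score > best_score:
                    best, best_score = trial, score
    return best, best_score

# Hypothetical toy objective with its optimum at R = 2, D = 4.
def toy_score(params):
    return -((params["R"] - 2) ** 2) - (params["D"] - 4) ** 2

grids = {"R": [1, 2, 3, 4], "D": [1, 2, 3, 4, 5, 6]}
best_params, best_value = line_search(grids, {"R": 1, "D": 1}, toy_score)
```

Compared with a full grid search, this evaluates the sum rather than the product of the grid sizes per round, at the price of possibly missing interactions between parameters.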
Given the amount of computation required for the optimization phase, all GraphProt parameters and RNAcontext motif widths were evaluated on a set of 1,000 sequences or 10% of the available data, whichever was smaller (Supplementary Tables B.3, B.4 and B.5). The sequences used to determine the optimal parameter values were then discarded for the cross-validated performance assessment procedure. MatrixREDUCE automatically selects appropriate motif widths during training; for each fold of the MatrixREDUCE cross-validation, we evaluated a single motif, setting max motif to 1 (Supplementary Table B.7). RNAcontext and MatrixREDUCE were trained using values 1/-1 for positive/negative class sequences and using motif widths ranging from 4 to 12 nucleotides.

GraphProt outperforms RNAcontext for 20 of the 24 sets, showing an average 29% relative error reduction (Figure 4.4, Supplementary Table B.1). RNAcontext only scores marginally better for the remaining 4 sets (only a 6% relative error reduction on average). For 11 sets, the relative error reduction of GraphProt over RNAcontext is above 30%. The largest improvements are a 59% relative error reduction for CAPRIN1 (from AUC 0.65 to 0.86) and a 62% relative error reduction for Ago1-4 (from AUC 0.72 to 0.90). Although MatrixREDUCE scores worse than both GraphProt and RNAcontext for all 24 sets, some sets exist where MatrixREDUCE performs

Figure 4.4: GraphProt shows a high performance in detecting missing binding sites across all RBPs. Prediction performance is measured by the AUC (area under the receiver operating characteristic curve) stemming from a 10-fold cross-validation (y-axis) on 24 CLIP-seq sets (x-axis) for GraphProt, RNAcontext, and MatrixREDUCE. GraphProt and RNAcontext consider sequence and structure information, whereas MatrixREDUCE is only sequence based. MatrixREDUCE results below 0.5 are not shown; see Supplementary Table B.1 for the full table of results.

nearly as well as structure-based methods. Nevertheless, it performs worse than random (AUC < 0.5) for 8 data sets. Overall, GraphProt shows robust prediction accuracies and outperforms existing methods.

GraphProt learns binding preferences from RNAcompete

The affinity of an RBP to its target site is important for the effectiveness of the subsequent regulation. This implies that a classification into bound and unbound sequences is only a coarse approximation. Instead, a regression approach that can distinguish target sites according to their binding strength would be more suitable. To model this binding strength, we require a training set with the affinities for different sequences instead of just a list of bound regions. Such measurements are provided by RNAcompete, an in-vitro assay for the analysis of recognition specificities of RNA-binding proteins [Ray et al., 2009]. To measure affinities, a pool of short RNAs, designed to include a wide range of k-mers in both structured and unstructured contexts, is exposed to a

tagged RBP. The resulting RNA-protein complexes are pulled down and the abundance of bound RNA is measured. Relative binding affinity is then defined as the log ratio between the amount of pulled-down RNA and the amount of RNA in the starting pool. Although a modified version of the RNAcompete protocol was published more recently [Ray et al., 2013], this data was not suited for evaluating GraphProt sequence and structure models as the experiment was designed in such a way that it uses only unstructured sequences.

We evaluated the ability of GraphProt to accurately predict binding affinities in a regression setting using the RNAcompete sets for nine RBPs from the initial RNAcompete assay: Vts1p, SLM2, YB1, RBM4, SFRS1, FUSIP1, ELAVL1, U1A and PTB [Ray et al., 2009]. All sets included both structured and unstructured sequences. The performance of predictions for RNAcompete data was measured by the mean average precision (APR), which is better suited than AUC to the huge class imbalance of the RNAcompete test sets (typically, only a few hundred sequences were considered bound whereas tens of thousands were considered unbound).

Model evaluation for the RNAcompete data essentially followed the procedure published for RNAcontext [Kazan et al., 2010]: models were evaluated by conversion to a binary-classification task using the published thresholds. For each of the nine proteins, models were created for the two independent sets and in each case tested on the corresponding sets. We report the mean score of the two evaluations. For the RNAcompete regressions, the main parameters are c and ε, which control the trade-off between the fitting accuracy and the regularization strength (Supplementary Table B.5).
GraphProt parameters were determined using subsets of 5,000 training sequences (Supplementary Table B.5). Support vector regressions were performed using libSVM [Chang and Lin, 2011]. RNAcontext motif widths were determined using all training sequences (Supplementary Table B.6).

GraphProt outperformed RNAcontext for all proteins except Vts1p, where RNAcontext scored marginally better (Figure 4.5, Supplementary Table B.2). For five of the proteins, the relative error reduction was over 30%; the largest relative error reductions were achieved for FUSIP1 (67%) and SFRS1 (71%). Note that MatrixREDUCE is not shown as it was previously outperformed by RNAcontext using the exact same data and analysis procedure in [Kazan et al., 2010].
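For readers unfamiliar with average precision, the sketch below shows one common definition: the mean of the precision values at the rank of each positive example, after sorting by predicted score. Whether this matches the exact variant used in the evaluation of [Kazan et al., 2010] is an assumption; the function name and the toy data are hypothetical, and ties between scores are not treated specially.

```python
def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical toy data: predicted affinities and binary bound/unbound labels.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
apr = average_precision(toy_scores, toy_labels)  # (1/1 + 2/3) / 2
```

Because only the ranks of the positive examples enter the score, a large number of correctly low-ranked unbound sequences does not inflate it, which is why the measure suits imbalanced test sets.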

4.3.2 GraphProt sequence-and-structure motifs

Kernel-based methods allow the use of more complex features and thus an improved prediction performance. On the downside, kernel approaches usually do not provide an insight into what the model has learned. Since this insight is useful for assessing the biological relevance of the CLIP-seq models, we devised a novel post-processing step in order to identify the sequence and structure


Figure 4.5: GraphProt uses a regression model to predict binding affinities from measurements derived by RNAcompete with an improved precision. We present the mean APRs (y-axis) for two independent RNAcompete sets (x-axis), both comprising nine RBPs, comparing GraphProt and RNAcontext sequence-and-structure-based models.

preferences learned by the models (see Section 4.2.3); note that these logos are merely a visualization aid and do not represent the full extent of the information captured by GraphProt models. A visual comparison with data from the literature (Figure 4.6) revealed that GraphProt motifs for SFRS1, ELAVL1 and PTB closely match known SELEX consensus motifs [Tacke et al., 1997; Gao et al., 1994; Perez et al., 1997]. For TDP43, GraphProt identifies a preference for repeated UG dinucleotides; TDP43 targets, determined by RIP-chip (RNA immunoprecipitation followed by microarray analysis), contained such repeats in 80% of the 3’-UTRs [Colombrita et al., 2012]. GraphProt motifs for PUM2, QKI and IGF2BP1-3 closely resemble the motifs previously identified using the same PAR-CLIP sets [Hafner et al., 2010]. The motifs identified in [Hafner et al., 2010], however, are based on the top sequence read clusters, while the GraphProt model was trained using the full sets of PAR-CLIP sites. FUS was found to bind AU-rich loop structures using electrophoretic mobility shift assays (EMSA) [Hoell et al., 2011]. In accordance with this, the GraphProt structure motif in Figure 4.6 shows a preference for stems at the borders, but not at the centre of the motif. The three members of the FET protein family (FUS, TAF15 and EWSR1) have similar PAR-CLIP binding profiles [Hoell et al., 2011], which explains the striking similarity of the corresponding GraphProt motifs. Three of the GraphProt motifs (HNRNPC, TIA1 and the closely related TIAL1) show a preference


[Figure 4.6: table with the columns Protein, Literature knowledge, Source, GraphProt sequence logo and GraphProt structure logo. The logo images are not reproduced here; the recoverable text entries are listed below.]

Protein     Literature knowledge                               Source
SFRS1       (motif shown as image)                             [40]
ELAVL1      (motif shown as image)                             [41]
PTB         (motif shown as image)                             [42]
TDP43       (motif shown as image)                             [43]
PUM2        (motif shown as image)                             [6]
QKI         (motif shown as image)                             [6]
IGF2BP1-3   (motif shown as image)                             [6]
FUS         AU-rich loop structure                             [44]
TAF15       large overlap of target sites with FUS and EWSR1   [44]
EWSR1       large overlap of target sites with FUS and TAF15   [44]
HNRNPC      uridine tracts                                     [5]
TIA1        U-rich region (3-11 nt)                            [47]
TIAL1       U-rich region (3-11 nt)                            [47]

Figure 4.6: GraphProt sequence and structure motifs capture known binding preferences. We compare knowledge from the literature (left) with visualized GraphProt sequence and structure motifs (right) and a substantial agreement is evident, especially with known sequence specificities. Structure motifs are annotated with the full set of structure elements—stems (S), external regions (E), hairpins (H), internal loops (I), multiloops (M) and bulges (B). The character size correlates with the importance for RBP binding. For ELAVL1, we show the motif for ELAVL1 PAR-CLIP (C).


for U-rich sites. HNRNPC was reported to bind to poly-U tracts in 3’- and 5’-UTRs [Gorlach et al., 1994; Wilusz and Shenk, 1990; König et al., 2010]. TIA1 has been described as an ARE-binding protein and binds both U-rich and AU-rich elements. The preference for U-rich regions was shown using SELEX [Dember et al., 1996], crosslinking and immunoprecipitation [Forch et al., 2000] and ITC [Bauer et al., 2012]. Recently, the high affinity for U-rich RNA could be traced to six amino acid residues in the TIA1 RNA recognition motif 2 (RRM2) [Kim et al., 2013].

4.3.3 Benefits of modelling local RNA structure

Previous benchmarking analyses (Figures 4.4 and 4.5) established that the full GraphProt models (with secondary structure information) outperform state-of-the-art methods; we now assess the importance of secondary structure in RBP binding models. The encoding of RBP target sites is flexible, such that it is easy to remove all structure details and leave only sequence information. This enables a direct comparison of the full structure models to sequence-only models in a controlled setting (i.e. the only difference in the comparison is the encoding of the target site) and thus isolates the added value of structure information for RBP target site prediction. Both the CLIP-seq and RNAcompete sets (from Figures 4.4 and 4.5, respectively) were used to compare models with and without structure information in Figure 4.7 (prediction comparisons were performed analogously to the previous benchmarking analyses). The average relative error reduction for structure models compared to sequence-only models was 27% for the RNAcompete and 14% for the CLIP-seq sets. The addition of structure improves prediction accuracy in many cases and never leads to a significant loss in performance. RNAcompete data is optimal for comparing models, since the initial sequences in the library were designed to be either unstructured or to form a stem-loop structure consisting of a single hairpin; therefore, a clear distinction of structure contribution is possible. Results are plotted in Figure 4.7 A. Three of the four proteins from the RNAcompete set showing significant improvements over the sequence models (PTB, RBM4 and U1A) are known to recognize stem-loop structures [Sharma et al., 2011; Kojima et al., 2007; Law et al., 2006]. For PTB, it was determined by isothermal titration calorimetry (ITC), gel shift assays and NMR studies that the two RRM domains bind a stem-loop structure of U1 snRNA [Sharma et al., 2011]. 
For RBM4, information about possible targets is scarce; in one case, however, it was reported that the target of RBM4 is a cis-regulatory element that was predicted to be a stem-loop structure [Kojima et al., 2007]. This finding was supported by several mutations that were predicted to disrupt the RNA structure and that led to a decreased interaction with RBM4. U1A is also known to bind to a stem-loop structure [Law et al., 2006].


In contrast to RNAcompete, CLIP-seq experiments are performed in vivo and all different types of structure elements could influence binding affinities. Comparisons using the CLIP-seq data are plotted in Figure 4.7 B. For five of the CLIP-seq sets (Ago1-4, CAPRIN1, IGF2BP1-3, MOV10 and ZC3H7B), performance of the structure models was significantly improved over the sequence models (35% average relative error reduction). The structure motif for IGF2BP1-3 shows a preference for the accessible part of stem-loop structures; motifs for MOV10, CAPRIN1, ZC3H7B and Ago1-4 indicate preferences for generally structured regions (Figure 4.8). GraphProt structure models for these proteins also show higher-than-average relative error reduction when compared to RNAcontext (53% versus 29% average relative error reduction), indicating that the full RNA structure representations used by GraphProt are better suited than the structure-profile-based approach used by RNAcontext when modelling binding preferences of RBPs binding to structured regions. Some of the remaining proteins show preferences for structured binding sites in their structure motifs as well as large relative error reductions over RNAcontext, e.g. ALKBH5, C17ORF85, C22ORF28, PTB, PUM2, SFRS1 and TDP43, indicating that structure properties of these binding sites may be captured by GraphProt sequence models via dinucleotide frequencies.

The large-scale analysis of double-stranded-binding RBPs (dsRBPs) is slightly lagging behind that of RBPs binding to accessible regions (i.e. ssRBPs). To the best of our knowledge, the first and only genome-wide studies of dsRBPs performed at the time of this study were done for MLE, MSL2 (see Chapter 3) and Staufen [Laver et al., 2013]. The data resulting from these studies, however, are not suited for training GraphProt models: MLE and MSL2 bind very specifically to only a few sites on the roX1 and roX2 RNAs [Ilik et al., 2013], and for Staufen, only target mRNAs were available instead of exact target sites [Laver et al., 2013]. Therefore, we could not evaluate the performance of GraphProt for dsRBPs binding predominantly to stems. However, the improved performance reported above for RBPs binding to mixed structured and accessible regions indicates that GraphProt is well equipped for learning binding preferences of dsRBPs and should perform well on them.

In summary, for ssRBPs binding to accessible regions, GraphProt sequence models may provide results comparable to the full structure models at increased processing speed. In contrast, proteins binding to structured regions benefit strongly from the full structure models provided by GraphProt, showing larger-than-average increases in performance over structure-profile-based models. Since full structure models never performed significantly worse than sequence-only models, they should be used as the default.
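The comparisons in this section are reported as relative error reductions. As a minimal sketch, and assuming the usual definition in which the error of a model is one minus its performance score (an assumption, since the definition is not restated in this section), the quantity can be computed as follows. The function name and the example scores are hypothetical.

```python
def relative_error_reduction(baseline, improved):
    """Fraction of the baseline error (1 - baseline) removed by the improved model.

    Both arguments are performance scores in [0, 1], e.g. APR or AUC values.
    """
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline score must lie in [0, 1)")
    return (improved - baseline) / (1.0 - baseline)


# Toy scores (made up): (sequence-only model, full structure model).
pairs = [(0.80, 0.86), (0.50, 0.75)]
reductions = [relative_error_reduction(seq, full) for seq, full in pairs]
mean_reduction = sum(reductions) / len(reductions)
print(f"mean relative error reduction = {mean_reduction:.0%}")
```

Under this definition, an improvement from 0.80 to 0.86 removes 30% of the remaining error, whereas the same absolute gain from a weaker baseline would correspond to a smaller reduction.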

[Figure 4.7: two scatter-plot panels, A and B; the plots are not reproduced here. Labelled proteins in panel A: RBM4, FUSIP1, U1A, PTB. Labelled proteins in panel B: IGF2BP1-3, Ago1-4, MOV10, CAPRIN1, ZC3H7B.]

Figure 4.7: The difference in predictive power using RNA structure in comparison to sequence-only models. Full sequence-and-structure models (y-axis) and sequence-only (x-axis) models are trained on RNAcompete (left) and CLIP-seq data (right). Gray ribbons denote the standard deviation of the differences between full structure and sequence-only models.

4.3.4 Learning binding affinities from categorical data

Biologically, it is more important to predict the binding affinity of an interaction than to categorize a potential target site into binding and non-binding. The bottleneck of this computational task is the availability of large data sets of quantitative, experimental measurements of affinities. Although CLIP-seq experiments are becoming increasingly popular, these data do not inherently provide a quantification of the binding affinity. In principle, the number of reads mapping to a binding site could be used as a proxy for its affinity, provided there is suitable expression data to normalize read counts. Even if such data exist, which is often not the case, normalization is non-trivial. We therefore ask whether binding affinities can be predicted while learning from only bound versus unbound information, as can be derived from CLIP-seq data. To test this hypothesis, we compared experimentally derived PTB binding affinities of two sets of sequences with GraphProt prediction margins using the GraphProt model for PTB HITS-CLIP. Perez and colleagues [Perez et al., 1997] report relative affinities derived from competitive titration experiments for 10 sequences of size 20 and 31 nucleotides. Karakasiliotis and colleagues [Karakasiliotis et al., 2010] identified three PTB consensus sequences starting at positions 112 (BS1), 121 (BS2) and 167 (BS3) of the 5’-end of the feline calicivirus genomic RNA and created mutations designed to disrupt PTB binding (mBS1-3) for each site. All combinations of the three modified sites were introduced into probes corresponding to the first 202 nucleotides of the genome, resulting


[Figure 4.8: table with the columns Protein, Sequence logo and Structure logo for IGF2BP1-3, MOV10, CAPRIN1, ZC3H7B and AGO1-4; the logo images are not reproduced here.]

Figure 4.8: Sequence and structure motifs for five CLIP-seq sets showing significant improvement of GraphProt structure over sequence models. In the visualized logos, the size of a character reflects its importance and structure elements are labelled as follows: stems (S), external regions (E), hairpins (H), internal loops (I), multiloops (M) and bulges (B). All motifs show preferences for both stems and unpaired regions simultaneously. Sequence and structure motifs for Ago1-4 and ZC3H7B are very similar; this can be attributed to the large overlap between ZC3H7B and Ago1-4 PAR-CLIP sites (5,752 of the 28,238 ZC3H7B sites overlap Ago1-4 sites).

in one wild type and seven mutant sequences. Affinities were measured by electrophoretic mobility shift assay (EMSA); reported affinities are relative to the wild-type probe. We report results for the sequence-only model because the structure model did not show a significant improvement in cross-validation performance over the sequence-only model. For the eight calicivirus probes, we centred on the region containing the three consensus sequences using the viewpoint mechanism. Prediction margins and measured affinities show significant correlation for both sets of sequences (Perez et al.: Spearman correlation r = 0.93, p < 0.01; Karakasiliotis et al.: Spearman correlation r = 0.76, p < 0.05). Figure 4.9 shows prediction margins and reported affinities for both sets. The set of calicivirus probes contains multiple binding sites; thus, the measured affinities show cooperative effects between binding sites. For example, individual mutations of the first two binding sites (mBS1, mBS2) slightly increase affinity, but the combined mutation of both sites (mBS1+2) leads to a decreased affinity as compared to the wild-type sequence (Figure 4.9 B). Despite the fact that GraphProt does not model cooperative effects, the wild type and the two probes with comparable affinities were assigned positive GraphProt margins, while the probes with reduced PTB affinity were predicted negative. The only notable outlier is mBS1+3, where GraphProt overestimates the combined effect of the disrupted PTB consensus sequences. These results show that, in addition to predicting binding affinities in a regression setting, GraphProt can also be applied to the prediction of binding affinities when only sets of bound sites for a binary classification task are available, as is the case when analysing CLIP-seq data. This allows the evaluation of putative binding sites with a meaningful score that reflects the biological functionality.

Figure 4.9: The certainty of prediction correlates with measured binding affinities. Prediction certainty is given by GraphProt margins on the y-axis and measured affinities for two sets of PTB aptamers on the x-axis. Fitted linear models and 95% confidence intervals are depicted in blue and dark grey. Binding affinities are given by (A) relative association constants from [Perez et al., 1997] and (B) affinities relative to the wild-type (wt) probe from [Karakasiliotis et al., 2010]. Points in panel B are labelled wt, mBS1, mBS2, mBS3, mBS1+2, mBS1+3, mBS2+3 and mBS1+2+3.
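The comparison of prediction margins with measured affinities relies on the Spearman rank correlation. The following is a minimal, dependency-free sketch of that statistic (Pearson correlation of the ranks, with average ranks for ties). The margins and affinities are made-up toy values, not the measurements from the cited studies, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values (made up): classifier margins and relative affinities.
margins = [1.2, 0.8, 0.9, -0.3, -0.7]
affinities = [1.0, 1.1, 0.9, 0.4, 0.2]
rho = spearman(margins, affinities)
print(f"Spearman correlation = {rho:.2f}")
```

A rank-based measure suits this setting because margins of a classifier and measured affinities are on different scales, so only the agreement of their orderings is compared.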

4.3.5 Genome-wide prediction of binding sites

A typical question in post-transcriptional gene regulation is whether a particular observation can be explained by RBP-RNA interactions. Here, we want to explain differential expression upon Ago2 knockdown in comparison to the wild type. Ideally, to obtain RBP-target information, a CLIP-seq experiment should be performed for the cell and condition being analysed, although this is not always feasible. A more economical approach would be to use RBP targets taken from publicly available CLIP-seq data. The problem is that available data are mostly generated by experiments performed in other cells and/or conditions. We show that results from publicly available CLIP-seq data do not explain the observed effect. In contrast, we achieve a highly significant agreement when we apply GraphProt to detect binding sites missed by the CLIP-seq experiment (Figure 4.10). In detail, two independent factors influence the efficiency of downregulating a target mRNA. First, the binding affinity of an RBP to its target site regulates binding frequency and strength. Second, the number of proteins bound to the same target can increase the signal for subsequent steps in the regulation process [Zhang et al., 2013]. The effect of cooperative regulation when the same element binds multiple times has been especially well studied for Ago2-microRNA interactions [Schmitter et al., 2006; Selbach et al., 2008; Schnall-Levin et al., 2011; Grimson et al., 2007]. Here, Ago2 generally associates with a microRNA and other proteins (together forming a miRISC complex) to target mRNAs for degradation and/or translational inhibition. A common observation is that several miRISC complexes bind to the same mRNA and that this cooperative effect strengthens the downregulation [Selbach et al., 2008; Grimson et al., 2007]. The expected effect of a reduction of Ago2 expression is a reduction of its regulatory effect, i.e. reduced downregulation of its target genes. If this hypothesis holds, transcripts regulated by Ago2 can be determined by a knockdown experiment (where the expression of Ago2 is reduced) comparing transcript expression between a wild type and Ago2 knockdown condition. 
Upon Ago2 knockdown, transcripts previously regulated by Ago2 will be less downregulated by Ago2 and accordingly exhibit increased expression compared to the baseline. This hypothesis was confirmed in previous work by Schmitter and colleagues [Schmitter et al., 2006], who established that the mean number of microRNA seed sites per 3’-UTR increased significantly between unchanged and weakly upregulated as well as strongly upregulated mRNAs upon Ago2 knockdown in human HEK293 cells. Accordingly, transcripts upregulated after Ago2 knockdown can be assumed to be regulated by Ago2. Using their expression data and the same fold-change categories, we investigated the influence of both affinity and cooperative effects based on GraphProt predictions of Ago2 binding sites in comparison to the available CLIP-seq data. For this purpose, we prepared a set of 3’-UTRs with associated fold changes for Ago2 knockdown on day 2 by selecting a non-overlapping set of transcripts, preferring longer over shorter UTRs and requiring at least 100 but no more than 3,000 nucleotides. In Section 4.3.4 (Figure 4.9), we established that GraphProt prediction margins correlate with measured affinities; therefore, we estimate high-affinity


Figure 4.10: Targets predicted by the Ago2-HITS-CLIP-model are in agreement with measured fold changes after Ago2 knockdown. Analysis of predicted Ago2 binding events to 3’-UTRs that are upregulated after Ago2 knockdown at day 2 for transcripts falling into the following fold-change categories: downregulated (fold change below 0.7, 804 UTRs), unchanged (fold change between 0.7 and 1.4, 6,893 UTRs), weakly upregulated (fold change between 1.4 and 2.0, 713 UTRs) and strongly upregulated (fold change greater than 2.0, 136 UTRs). 20,613 Ago2 HITS-CLIP sites and 20,707 Ago2 GraphProt sites were located on these 3’-UTRs. A) Number of 3’-UTRs with at least one Ago2 binding-site hit. Asterisks indicate a statistically significant increase (t-test, *: p < 0.05, **: p < 0.001). B) Number of binding-site hits per 3’-UTR. Asterisks indicate a statistically significant increase (Wilcoxon rank sum test, *: p < 0.05, **: p < 0.001). Boxplots do not include outliers; for that reason, we show the full distributions in Supplementary Figure B.5.


Ago2 binding sites by only considering the highest-scoring predictions. The GraphProt sequence-only model was trained on the Ago2-HITS-CLIP set (the use of structure did not improve prediction results for Ago2) and was applied to the set of 3’-UTRs with measured fold-changes to predict high-scoring target sites as described in Section 4.2.3. We compared these predictions to reliable binding sites derived by peak calling on the Ago2-HITS-CLIP read profiles. The overall regulatory effect was investigated by comparing the fraction of 3’-UTRs that contain binding sites between the fold-change categories (Figure 4.10 A). An interaction with higher affinity should cause a greater upregulation upon Ago2 knockdown. In a second analysis, cooperative effects were estimated by counting the number of Ago2 binding sites per 3’-UTR (Figure 4.10 B) in each fold-change category. For binding sites predicted by GraphProt, both the fraction of 3’-UTRs with at least one GraphProt hit (Figure 4.10 A) and the number of GraphProt hits per 3’-UTR (Figure 4.10 B) showed a significant increase between unchanged and weakly upregulated transcripts. While there is no major difference in the fraction of UTRs with at least one hit between unchanged and strongly upregulated transcripts, we see a clear enrichment for the number of hits in UTRs that are highly regulated, indicating the cooperative effect of multiple miRISC target sites (Figure 4.10 B). In contrast, no correlation is observed for binding sites as taken from the Ago2-HITS-CLIP in both cases (Figure 4.10). Since microRNAs guide Ago2 binding, we also looked at computational approaches for detecting microRNA binding sites. To this end, we repeated the analysis from [Schmitter et al., 2006] using the same microRNA seeds found to be over-represented in upregulated transcripts and extracted PicTar 2.0 microRNA target predictions from doRiNA [Anders et al., 2012] to compare against GraphProt (Supplementary Figures B.6 and B.7). 
Both microRNA detection approaches show some agreement with the differential expression upon Ago2 knockdown; however, the differences between fold-change categories are not as significant as for GraphProt. This shows that GraphProt predictions based on CLIP data are superior to using pure miRNA target prediction in this case. These results demonstrate the value of computational target prediction in addition to performing CLIP-seq experiments; they show the capacity of GraphProt to reliably predict RBP target sites and even to detect sites missed by experimental high-throughput methods.
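The two analyses of Figure 4.10 can be illustrated with a short sketch: 3’-UTRs are grouped by the fold-change categories given in the figure caption, and for each category the fraction of UTRs with at least one hit and the mean number of hits per UTR are computed. The handling of values that fall exactly on a category boundary is an assumption, the function names are hypothetical, the toy data are made up, and the significance tests are not included.

```python
from collections import defaultdict


def fold_change_category(fold_change):
    """Assign a fold change to one of the four categories of Figure 4.10."""
    if fold_change < 0.7:
        return "downregulated"
    if fold_change < 1.4:          # boundary handling is an assumption
        return "unchanged"
    if fold_change <= 2.0:
        return "weakly upregulated"
    return "strongly upregulated"


def summarize_hits(utrs):
    """Summarize binding-site hits per category.

    utrs: iterable of (fold_change, number_of_hits) pairs, one per 3'-UTR.
    """
    groups = defaultdict(list)
    for fold_change, n_hits in utrs:
        groups[fold_change_category(fold_change)].append(n_hits)
    summary = {}
    for category, hits in groups.items():
        summary[category] = {
            "n_utrs": len(hits),
            "fraction_with_hit": sum(1 for h in hits if h > 0) / len(hits),
            "mean_hits": sum(hits) / len(hits),
        }
    return summary


# Toy data (made up): (fold change after knockdown, predicted hits in the UTR).
toy_utrs = [(0.5, 0), (1.0, 0), (1.1, 1), (1.5, 1), (1.8, 2), (2.5, 4)]
summary = summarize_hits(toy_utrs)
for category, stats in summary.items():
    print(category, stats)
```

The fraction of UTRs with a hit corresponds to panel A and the number of hits per UTR to panel B; keeping the two statistics separate makes it possible to distinguish an overall regulatory effect from a cooperative one.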

4.4 Conclusion

GraphProt is an accurate method for elucidating binding preferences of RBPs and highly flexible in its range of applications. We use a novel and intuitive representation of RBP binding sites that, in combination with an efficient graph kernel, is able to capture binding preferences of a wide range of RBPs.


Depending on the input data, GraphProt models can solve either a regression or a classification task and are thus suitable for learning binding preferences from the two current major sources of experimental data: RNAcompete and CLIP-seq. Trained models are used to predict functional RBP target sites on any transcript from the same organism. GraphProt displayed a robust and much improved performance in comparison to the existing state of the art. The full RNA structure representations used by GraphProt were shown to be especially suitable for modelling preferences for binding sites within base-pairing regions. For RBPs known not to be influenced by RNA structure, GraphProt provides very fast sequence-only models that perform as well as the full structure models. RBP sequence and structure preferences learned by GraphProt can be visualized using well-known sequence logos. Beyond the mere elucidation of binding preferences, GraphProt models were successfully applied to diverse tasks such as predicting RBP affinities and scanning for RBP target sites. GraphProt is applicable on a genome-wide scale and can thus overcome the limitations of CLIP-seq experiments, which are time and tissue dependent: we show that when GraphProt is applied to all transcripts, missing targets are identified in a setting different to the one where the original CLIP-seq experiment was performed.

Chapter 5

Model-based validation of RBP binding sites

5.1 Introduction

Glioblastoma multiforme is the most common type of brain tumor in adults. With a median survival after initial diagnosis of 15 months, it is also the most lethal form of brain tumor [Stupp et al., 2005]. On the genetic level it can be characterized as a complex disease with disruptions in many signalling pathways due to recurrent mutations [Srivastava et al., 2001; Srivastava et al., 2003]. The Epidermal Growth Factor Receptor (EGFR) pathway is found to be deregulated in most types of glioblastoma [Wong et al., 1987; Bredel et al., 2011; Vivanco et al., 2010; Yadav et al., 2009]. Isolated targeting of EGFR by prospective glioblastoma treatments, however, has shown limited therapeutic effects [Nicholas et al., 2006]. Extensive networking effects in glioblastoma-specific misregulation suggest that successful treatment will require the targeting of multiple factors [Bredel et al., 2009]. In this chapter, we investigate the role of annexin A7 (ANXA7) in glioblastoma multiforme. ANXA7 is a membrane-binding tumor suppressor [Srivastava et al., 2001; Srivastava et al., 2003] that is associated with the deregulation of the EGFR pathway and also with glioblastoma patient prognosis [Bredel et al., 2009; Yadav et al., 2009], making it a possible candidate for targeting by glioblastoma treatments. Preliminary experiments identified an aberrantly spliced isoform of ANXA7 as the cause of deregulation of the EGFR pathway in glioblastoma. Further experiments identified overexpression of the RNA-binding protein PTB as the most likely cause for the production of the aberrant isoform. We used GraphProt to confirm the direct agency of PTB in the alternative splicing of ANXA7. To this end, we first predicted PTB binding sites in the region surrounding the alternatively spliced exon and then designed mutations meant to disable binding of PTB to these sites. Minigene constructs that incorporated the disabled PTB binding sites showed altered expression of ANXA7 isoforms,


providing evidence for the direct agency of PTB binding on alternative splicing of ANXA7. We conclude the chapter with an analysis of the factors preventing the detection of bound sites by CLIP-seq.

5.1.1 PTB mediates expression of ANXA7 isoforms

Alternative splicing is known to generate cell-type specific isoforms. In key regulatory proteins, these isoforms are known to regulate cellular differentiation [Gabut et al., 2011; Ungewitter and Scrable, 2010]. Altered expression of splicoforms may promote the development of malignant transformations by altering signalling pathways ensuring normal cellular function [Yang et al., 2007]. ANXA7 contains a 66-nucleotide alternatively spliced exon (exon 6). Isoforms with this exon are prevalent in the brain, skeletal muscle, and heart [Magendzo et al., 1991; Rick et al., 2005]. To assess the role of ANXA7 splicoforms in glioblastoma multiforme, exclusion of ANXA7 exon 6 was measured in several healthy cell types found in the brain as well as in glioblastoma multiforme [Ferrarese et al., 2014]. ANXA7 isoform 1 (ANXA7-I1), the isoform including exon 6, was found abundantly expressed in normal brain tissue, which was mainly comprised of neurons, but not in glioblastoma tissue or cultures. In contrast, both glioblastoma tissue and cultures showed high expression of ANXA7 isoform 2 (ANXA7-I2), which lacks exon 6 [Ferrarese et al., 2014]. Since ANXA7 is known to be associated with the deregulation of EGFR signalling [Bredel et al., 2009; Yadav et al., 2009], we decided to further assess the role of its splicoforms. To this end, we measured the expression of the EGFR protein in SNB19 glioblastoma cells after overexpressing the two ANXA7 isoforms and an empty control vector. Only re-expression of the exon-6-containing isoform ANXA7-I1 resulted in a reduction of EGFR protein expression, showing that the glioblastoma-specific isoform ANXA7-I2 deregulates EGFR signalling [Ferrarese et al., 2014]. Sequencing of ANXA7 exon 6 and parts of the flanking intron/exon junctions in 35 human glioblastomas did not reveal mutations that could explain the altered composition of ANXA7 splicoforms in glioblastoma [Ferrarese et al., 2014]. 
After excluding mutations on and around exon 6 as the reason for the altered splicing, we sought to identify splicing factors differentially expressed in glioblastoma. Among the splicing factors differentially expressed in glioblastomas [Cheung et al., 2008], only PTB showed increased expression compared to normal brain tissue. Knockdown of PTB using short-hairpin RNAs (shRNA) in cells expressing high levels of PTB and lacking ANXA7-I1 expression resulted in the re-expression of ANXA7-I1 [Ferrarese et al., 2014]. Binding of PTB to the ANXA7 gene could be shown using RNA immunoprecipitation [Ferrarese et al., 2014]. A direct role of PTB in the alternative splicing of ANXA7 was indicated by exon trapping using an ANXA7 minigene comprised of the genomic region from exon 5 to exon 7 [Ferrarese et al., 2014].


Figure 5.1: Schematic overview of the exon structure of ANXA7 between exon 5 and exon 7 with predicted PTB binding sites S1-S11.

5.2 Prediction and validation of binding sites

In the previous section, we described how PTB was found to influence alternative splicing of ANXA7 exon 6. To confirm the direct agency of PTB in ANXA7 exon 6 skipping, we sought to identify prospective PTB binding sites. A search for published PTB binding sites on ANXA7 revealed two sites from a PTB HITS-CLIP experiment [Xue et al., 2009] on the first intron of ANXA7 with distances of about 10 and 13 thousand base pairs to exon 6. Because of the large distances to exon 6, it seemed unlikely that these sites would influence exon 6 skipping. We then hypothesized that the binding sites responsible for the regulation of exon 6 splicing were missed by the HITS-CLIP experiment.

5.2.1 Prediction of PTB-bound sites

We used GraphProt to predict PTB binding sites in the vicinity of ANXA7 exon 6. Prediction of PTB binding-site candidates downstream, within, and upstream of exon 6 (Figure 5.1) was performed using a GraphProt model trained with the PTB HITS-CLIP binding sites from Xue and colleagues [Xue et al., 2009]. A set of bound sequences was created by extending each of the 51,394 binding sites published by Xue and colleagues by 25 nucleotides in each direction. A set of unbound sequences was derived by shuffling the locations of the bound sites within the genome under the following constraints: negative sites had to be located within introns and were not allowed to fall within 125 nucleotides of any CLIP-seq read. We used a preliminary version of the GraphProt pipeline described in Chapter 4. For that reason, the graph encoding of RNA sequence and structure differs from the one described in Section 4.2 as follows: (1) secondary structures were computed using RNAshapes [Giegerich et al., 2004] employing shape abstraction level 3 and sliding windows of sizes 50 and 100; (2) the graph encoding was created as described in Section 4.2 but lacked the additional layer of abstract structure information; and (3) all nucleotides were set as viewpoints. Feature vectors for model training were calculated using subgraph parameters radius 3 and distance 6. The resulting model was used to calculate the nucleotide-wise scores for the ANXA7 minigene as described in Section 4.2.3. We used these scores, indicating the likelihood of a nucleotide being bound by PTB, to extract 11 high-scoring binding-site candidates (Figure 5.1).
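The construction of the training sets described above can be sketched as follows. This is a simplified illustration, not the pipeline used for the thesis: it assumes half-open coordinates on a single sequence, ignores strand and chromosome, and draws each unbound site by rejection sampling. All function names and the toy coordinates are hypothetical.

```python
import random


def extend_site(start, end, flank=25):
    """Extend a binding site by `flank` nucleotides in each direction."""
    return max(0, start - flank), end + flank


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one position."""
    return a_start < b_end and b_start < a_end


def sample_unbound_site(length, introns, reads, rng, min_dist=125, max_tries=10000):
    """Draw a site of `length` inside an intron and away from all reads.

    introns, reads: lists of (start, end) half-open intervals.
    """
    candidates = [(s, e) for s, e in introns if e - s >= length]
    if not candidates:
        raise ValueError("no intron is long enough for the requested site")
    for _ in range(max_tries):
        intron_start, intron_end = rng.choice(candidates)
        start = rng.randint(intron_start, intron_end - length)
        end = start + length
        # Reject the draw if any read lies within min_dist of the site.
        if not any(overlaps(start - min_dist, end + min_dist, r_start, r_end)
                   for r_start, r_end in reads):
            return start, end
    raise RuntimeError("could not place an unbound site under the constraints")


# Toy example (made up): one bound site, one intron, one CLIP-seq read.
bound = extend_site(400, 420)
introns = [(0, 1000)]
reads = [(400, 430)]
rng = random.Random(0)
unbound = sample_unbound_site(bound[1] - bound[0], introns, reads, rng)
print("bound:", bound, "unbound:", unbound)
```

Sampling unbound sites with the same lengths as the bound sites keeps the two classes comparable, so that a model cannot separate them by sequence length alone.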


Figure 5.2: Visualization of candidate mutations for the validation of a predicted PTB binding site. The sequence of predicted binding site S3 is shown on top. The corresponding bar chart shows the nucleotide-wise scores according to the PTB GraphProt model for the wild-type sequence; the bar chart at the bottom shows the scores of the candidate sequence incorporating seven mutations. The heat map visualizes the effect of binding-site mutations according to the GraphProt model. Scores of individual nucleotides are colour-coded. High scores, indicating likely PTB binding, are coloured dark blue. Middle and low scores, indicating a reduced likelihood of PTB binding, are coloured green (midrange) and white (low). Selected mutations are superimposed on the heat map.

5.2.2 Designing mutations for probing predicted sites

Next, we sought to experimentally determine the effect of PTB binding on the splicing of exon 6. To this end, we designed mutations meant to disable binding of PTB to the predicted sites and probed the change in splicing of exon 6 caused by the mutated binding sites. For that purpose, we used the same GraphProt model employed for site prediction. A single mutation can cause changes of secondary structure that in turn affect the score of the whole sequence. To ensure that mutations are introduced only where needed, we chose an iterative approach that evaluates the effect of each additional mutation before introducing another. Favourable mutations were determined using an iterative k-best search. For a given wild-type sequence, we determined the nucleotides having the largest effect on PTB binding according to the GraphProt model by calculating the nucleotide-wise scores as described in Section 4.2.3. The k highest-scoring nucleotides, having the largest influence on PTB binding according to the model, were selected for the introduction of mutations meant to weaken binding of PTB. We then generated candidate sequences by introducing all possible mutations at the selected nucleotides, each mutation forming a new candidate sequence. From this set of candidates, we selected the n lowest-ranking sequences according to the sequence-wise GraphProt score, that is, the sequences least likely to be bound by PTB according to the GraphProt model, for further evaluation in the next iteration. The sequence with the lowest score among the candidates was reported for each iteration. This procedure, using parameters k = 10 and n = 10, was iterated to generate modified binding sites incorporating up to 7 mutations compared to the wild type for each of the 11 predicted binding sites.
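The iterative k-best search can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for this work: the two scoring callables stand in for the nucleotide-wise and sequence-wise GraphProt scores, and the toy scoring functions at the end (which simply count pyrimidines) are hypothetical placeholders that do not model RNA structure. The defaults k = 10, n = 10 and 7 iterations follow the text.

```python
from typing import Callable, List

ALPHABET = "ACGU"

def k_best_mutation_search(
    wild_type: str,
    nucleotide_scores: Callable[[str], List[float]],
    sequence_score: Callable[[str], float],
    k: int = 10,
    n: int = 10,
    iterations: int = 7,
) -> List[str]:
    """Return the lowest-scoring candidate sequence found in each iteration."""
    beam = [wild_type]
    best_per_iteration = []
    for _ in range(iterations):
        candidates = set()
        for seq in beam:
            scores = nucleotide_scores(seq)
            # Positions of the k nucleotides contributing most to the binding score.
            top = sorted(range(len(seq)), key=lambda i: scores[i], reverse=True)[:k]
            for pos in top:
                for base in ALPHABET:
                    if base != seq[pos]:
                        # Each single substitution forms a new candidate sequence.
                        candidates.add(seq[:pos] + base + seq[pos + 1:])
        # Keep the n candidates least likely to be bound; ties broken alphabetically.
        beam = sorted(candidates, key=lambda s: (sequence_score(s), s))[:n]
        best_per_iteration.append(beam[0])
    return best_per_iteration

# Hypothetical toy scores (NOT GraphProt): every pyrimidine counts as binding signal.
def toy_nucleotide_scores(seq: str) -> List[float]:
    return [1.0 if base in "CU" else 0.0 for base in seq]

def toy_sequence_score(seq: str) -> float:
    return sum(toy_nucleotide_scores(seq))

best = k_best_mutation_search(
    "UCUUCUCUGA", toy_nucleotide_scores, toy_sequence_score, k=3, n=2, iterations=3
)
```

Because the scores are recomputed for every sequence in every iteration, a model that takes secondary structure into account would see the effect of each earlier mutation before the next one is chosen, which is the motivation given above for the iterative scheme.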

5.2.3 Experimental validation of predicted binding sites

Reduced interaction of PTB with binding sites that are effective in regulating the splicing of ANXA7 exon 6 will increase the expression of the exon-6-containing isoform ANXA7 I1. Accordingly, increased expression of ANXA7 I1 upon targeted deactivation of a binding site indicates binding of PTB and confirms the direct role of PTB in the alternative splicing of ANXA7 exon 6. We selected mutations designed to disable binding of PTB for eight of the predicted PTB binding sites (Figure 5.2, Supplementary Table B.8 and Supplementary Figures B.9 - B.18) for probing their effect on splicing of ANXA7. The selected binding-site mutations were introduced into ANXA7 minigenes spanning the 8,000 nucleotides from exon 5 to exon 7 using site-directed mutagenesis (QuikChange II Site-Directed Mutagenesis kit (Agilent), applied according to the manufacturer's instructions). This yielded mutated minigenes M2, M3, M5-M7 and M9-M11 incorporating changes to the corresponding binding sites S2, S3, S5-S7 and S9-S11 as shown in Figure 5.1. After transfection of the wild-type and mutant minigenes into SNB19 cells, the mRNA expression ratios of ANXA7 I1/I2 were measured by qRT-PCR. Eight of the mutant minigenes showed increased ANXA7 I1 expression compared to the wild-type minigene (Figure 5.3). The altered splicing of alternative exon 6 upon the targeted deactivation of the predicted PTB binding sites shows a biologically relevant loss of PTB binding at these sites, confirming the direct role of PTB in the alternative splicing of ANXA7 exon 6.

5.3 Completeness of CLIP-seq-derived binding sites

In the previous section, we have shown that GraphProt can be used to detect binding sites missed by CLIP-seq. Now, we use the 11 predicted PTB binding sites as a starting point to analyse the influence of three possible factors contributing to binding sites being missed by CLIP-seq: peak detection, sequencing depth and mappability.

Figure 5.3: PTB Mediates Exon 6 Skipping in ANXA7 pre-RNA. Changes in ANXA7 I1/I2 mRNA expression ratio measured by qRT-PCR after site-directed mutagenesis of in silico predicted PTB-binding sites in the cloned ANXA7 minigene (p-values above the bars according to unpaired t-test for the comparison of each mutation versus wild type; minigene M3 not significant (N.S.); n = 3 independent experiments for all samples; error bars represent mean ± s.d.).

Because only seven of the eleven predicted sites were tested using mutated minigenes, we sought further experimental evidence for binding of PTB to the remaining untested sites. Reid and colleagues applied next-generation SELEX to an oligonucleotide pool consisting of 30-nucleotide sequences, extracted in 10-nucleotide steps from exons and 200 nucleotides of flanking intronic sequence, to measure enrichment of PTB-bound oligonucleotides [Reid et al., 2009]. Genomic enrichment scores were calculated as the log mean enrichment scores of all overlapping oligonucleotides. The three GraphProt candidate sites located in the region probed by PTB next-generation SELEX (S1, S2 and S5) were found to be enriched, showing that these sites are bound in vitro (Figures 5.4 and 5.5).

Figure 5.4: Predicted binding sites correspond to enriched sites from PTB next-generation SELEX. Binding profile for ANXA7 exon 6 (E6) and 200 nucleotides of surrounding intronic sequence as determined by PTB next-generation SELEX. Genomic enrichment scores were calculated as the log mean enrichment scores of all overlapping oligonucleotides. The three GraphProt candidate sites predicted for this region (S1, S2 and S5) correspond to oligonucleotides enriched in PTB next-generation SELEX, indicating binding by PTB.

GraphProt candidate site              S1    S2    S3    S4    S5    S6    S7    S8    S9    S10   S11
Mutated minigenes                     -     OK    fail  -     OK    OK    OK    -     OK    OK    OK
PTB next-generation SELEX             OK    OK    n.a.  n.a.  OK    n.a.  n.a.  n.a.  n.a.  n.a.  n.a.
PTB CLIP v1, published peak clusters  fail  fail  fail  fail  fail  n.a.  n.a.  fail  n.a.  n.a.  n.a.
PTB CLIP v1, uniquely mapped reads    OK    OK    OK    OK    fail  n.a.  n.a.  fail  n.a.  n.a.  n.a.
PTB CLIP v2, uniquely mapped reads    OK    OK    OK    OK    fail  n.a.  n.a.  OK    n.a.  n.a.  n.a.

Legend: - = not tested; OK = binding detected; fail = no evidence for binding; n.a. = method not applicable.

Figure 5.5: Experimental evidence for PTB binding to GraphProt candidate sites S1-S11. Mutated minigenes: Mutations designed to disable PTB binding were introduced into the ANXA7 minigene using site-directed mutagenesis. Of the seven tested sites, six led to a statistically significant increase (unpaired t-test, p=0.05) in the ANXA7 I1/I2 mRNA ratio, indicating loss of PTB binding at these sites. Sites S1, S4 and S8 were not tested. PTB next-generation SELEX: The three predicted binding sites located on exon 6 or the 200 nucleotides of surrounding intronic sequence were detected by next-generation SELEX. The other predicted sites were not covered by the next-generation SELEX library. PTB HITS-CLIP v1 published peak clusters: None of the binding sites are included in the set of published peak clusters for PTB CLIP v1. PTB HITS-CLIP v1 uniquely mapped reads: Several of the PTB HITS-CLIP v1 libraries contain reads aligned to four of the predicted binding sites. PTB HITS-CLIP v2 uniquely mapped reads: The coverage profiles of PTB HITS-CLIP v2 show reads aligned to five of the predicted binding sites. Note that five of the GraphProt candidate sites (S6, S7 and S9-S11) are located in regions of low mappability. Under the constraint of using only uniquely mapped reads, these sites are not detectable by CLIP-seq.

Shortly after we conducted the mutational study described in the previous section, a second PTB HITS-CLIP experiment was published by Xue and colleagues [Xue et al., 2013]. In the following, we will refer to the first PTB HITS-CLIP experiment as PTB HITS-CLIP v1 [Xue et al., 2009] and to the subsequent experiment as PTB HITS-CLIP v2 [Xue et al., 2013]. Both HITS-CLIP experiments were performed as described by Yeo and colleagues [Yeo et al., 2009]. Due to the availability of more modern sequencing equipment, the number of sequenced reads could be increased from 14,101,625 for PTB HITS-CLIP v1 (Illumina Genome Analyzer) to 38,227,488 for PTB HITS-CLIP v2 (Illumina HiSeq 2000). Four of the eleven predicted PTB binding sites were covered by uniquely mapped reads from PTB HITS-CLIP v1. Uniquely mapped reads from PTB HITS-CLIP v2 covered one additional GraphProt candidate site (Figure 5.5).

In combination, our mutational study, the next-generation SELEX, and the HITS-CLIP experiments provide evidence for binding of PTB to all of the GraphProt candidate sites. Binding to sites S4 and S8, however, is only indicated by coverage of uniquely mapped reads from two (site S4) or one (site S8) HITS-CLIP experiments (Figure 5.5). The evidence for PTB binding to sites S4 and S8 must be considered weak due to the absence of called peaks for these sites.
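The genomic enrichment score used for the next-generation SELEX profile, the log mean enrichment of all oligonucleotides overlapping a position, can be illustrated with a short sketch. The oligonucleotide length of 30 and the step size of 10 are taken from the text; the function name, the choice of a base-2 logarithm, the assumption of strictly positive enrichment values and the toy input are assumptions made for illustration.

```python
import math

OLIGO_LEN = 30  # oligonucleotide length (from the text)
STEP = 10       # distance between oligonucleotide start positions (from the text)

def genomic_enrichment(enrichment, region_len, oligo_len=OLIGO_LEN, step=STEP):
    """Per-position log2 of the mean enrichment of all overlapping oligonucleotides.

    enrichment[i] is the (strictly positive) enrichment of the oligonucleotide
    starting at position i * step. Positions covered by no oligonucleotide get None.
    """
    sums = [0.0] * region_len
    counts = [0] * region_len
    for i, value in enumerate(enrichment):
        start = i * step
        for pos in range(start, min(start + oligo_len, region_len)):
            sums[pos] += value
            counts[pos] += 1
    return [math.log2(s / c) if c else None for s, c in zip(sums, counts)]

# Hypothetical toy input: three oligonucleotides covering a 50-nucleotide region.
profile = genomic_enrichment([1.0, 4.0, 1.0], region_len=50)
```

In the toy input, the middle of the region is covered by all three oligonucleotides, so its score reflects the average of their enrichment values rather than the single enriched oligonucleotide alone.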

5.3.1 Influence of peak calling

To determine the influence of peak calling on missing binding sites in CLIP-seq, we compared PTB HITS-CLIP v1 peak clusters to the full set of uniquely mapped PTB HITS-CLIP v1 reads. Xue and colleagues applied a modFDR-like peak calling algorithm as described by Yeo and colleagues [Yeo et al., 2009] to PTB HITS-CLIP v1 [Xue et al., 2009; Xue et al., 2013]. The published peaks and peak clusters of PTB HITS-CLIP v1, the latter obtained by combining all peaks with a distance of fewer than 50 base pairs, are available at the Gene Expression Omnibus (GEO, accession GSE19323; 64,314 peaks and 51,394 peak clusters) [Xue et al., 2009]. No peak calling was performed for PTB HITS-CLIP v2. As previously described, none of the published PTB HITS-CLIP v1 peak clusters were located on ANXA7 exon 6 or the adjacent introns. In contrast, several of the uniquely mapped PTB HITS-CLIP v1 reads overlap binding sites S1, S2, S3 and S4 (Figures 5.5 and 5.6), providing weak evidence for binding of PTB. Direct use of uniquely mapped reads would have resulted in evidence for 4 of the 11 candidate sites. Use of the uniquely mapped reads without peak calling would have resulted in at least 2,348,187 potential binding sites (approximated by merging all alignments within 50 base pairs of each other). While this number seems exceedingly large, it is well below the 5 million PTB binding sites predicted by Xue and colleagues [Xue et al., 2009] for the human genome using a biochemical model of PTB binding [Gama-Carvalho et al., 2006]. Due to the absence of a negative control experiment for PTB HITS-CLIP v1, however, there is insufficient data to assess the expected number of true binding sites among the 2,348,187 potentially bound regions. The lack of published peaks for the sites covered by uniquely mapped reads shows that the sequencing depth of PTB HITS-CLIP v1 in combination with the chosen peak calling procedure was not sufficient to reliably detect these sites.

Figure 5.6: Predicted binding sites have evidence from HITS-CLIP. The blue lane shows the PTB HITS-CLIP profile from Xue and colleagues for ANXA7 exons 5 to 7 [Xue et al., 2013]. Five of the sites selected for mutational analysis are also covered by mapped reads from HITS-CLIP (coverage profile shown in green). Track labels: exons E5, E6 and E7; mutated sites M4, M5, M1, M2, M3, M6, M7, M9-10, M11 and M8.
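The approximation of potential binding sites by merging all alignments within 50 base pairs of each other can be sketched as a simple interval merge. The tuple layout, the function name and the toy alignments are assumptions for illustration, and strand information is ignored here, which may differ from the procedure actually applied.

```python
def merge_alignments(alignments, max_gap=50):
    """Merge (chrom, start, end) intervals on the same chromosome whose gap is <= max_gap."""
    merged = []
    for chrom, start, end in sorted(alignments):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            # Close enough to the previous region: extend it.
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            # Start a new potentially bound region.
            merged.append((chrom, start, end))
    return merged

# Hypothetical toy alignments of 36-nucleotide reads.
toy_alignments = [
    ("chr10", 100, 136),
    ("chr10", 150, 186),
    ("chr10", 300, 336),
    ("chr1", 10, 46),
]
regions = merge_alignments(toy_alignments)
```

The number of merged regions, `len(regions)`, corresponds to the count of potential binding sites in this approximation.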

5.3.2 Influence of sequencing depth

We next compared the published reads and uniquely mapped reads for PTB HITS-CLIP v1 (GEO accession GSE19323) and PTB HITS-CLIP v2 (GEO accession GSE42701) to investigate the influence of sequencing depth on binding sites missed by CLIP-seq. The increased sequencing depth of PTB HITS-CLIP v2 led to a 2.7-fold increase in the number of sequenced reads compared to PTB HITS-CLIP v1 (PTB HITS-CLIP v1: 14,101,625 reads; PTB HITS-CLIP v2: 38,227,488 reads). The number of uniquely mapped reads was increased 5-fold (PTB HITS-CLIP v1: 5 million uniquely mapped reads; PTB HITS-CLIP v2: 23 million uniquely mapped reads). This increase in the number of usable reads is most likely due to the increased read length of PTB HITS-CLIP v2 (HITS-CLIP v1: 36 nucleotides; HITS-CLIP v2: 50 nucleotides) [Xue et al., 2009; Xue et al., 2013]. Despite the 5-fold increase in usable reads of PTB HITS-CLIP v2 over v1, PTB HITS-CLIP v2 uniquely mapped reads only cover one additional GraphProt candidate site on the ANXA7 minigene (site S8; Figures 5.5 and 5.7). PTB HITS-CLIP v2 does not contain any evidence for PTB binding to six of the seven GraphProt candidate sites shown to be effective in regulating ANXA7 alternative splicing. Neither the increased sequencing depth nor the longer reads of PTB HITS-CLIP v2 were sufficient or suited to provide evidence for PTB binding to these sites.
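The fold changes quoted above follow directly from the read counts. The short calculation below reproduces them; the numbers of uniquely mapped reads are the rounded values given in the text, so the derived fractions are approximate, and the variable names are illustrative.

```python
# Read counts as quoted in the text.
sequenced = {"v1": 14_101_625, "v2": 38_227_488}
uniquely_mapped = {"v1": 5e6, "v2": 23e6}  # rounded values from the text

fold_sequenced = sequenced["v2"] / sequenced["v1"]
fold_unique = uniquely_mapped["v2"] / uniquely_mapped["v1"]

# Approximate share of sequenced reads that mapped uniquely in each experiment.
fraction_unique = {v: uniquely_mapped[v] / sequenced[v] for v in sequenced}

print(f"sequenced reads: {fold_sequenced:.1f}-fold")
print(f"uniquely mapped reads: {fold_unique:.1f}-fold")
```

The larger fold change for uniquely mapped reads than for sequenced reads reflects the higher share of uniquely mappable reads in PTB HITS-CLIP v2, in line with the longer read length noted above.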

Figure 5.7: Missed binding sites on low mappability regions are recovered by allowing multiple alignments per read. A) Encode Alignability track (50 nucleotides, http://genomebrowser.wustl.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeMapability) [Derrien et al., 2012]. B) Schematic overview of the GraphProt candidate sites S1-S11 on ANXA7 exon 6 and the surrounding introns. C) Coverage by re-mapped PTB HITS-CLIP v1 reads using up to 20 matches per read. Maximum profile height is 7. D) Coverage by re-mapped PTB HITS-CLIP v1 reads using up to 20 matches per read, drawn at the same scale as C. Maximum profile height is 22. While the alignment profiles at the low-mappability regions show higher peaks than the uniquely mappable regions, this does not indicate increased binding of PTB at these sites. The increased peak heights are a side effect of using multiple alignments per read because reads originating from duplicate sequences elsewhere in the genome also contribute alignments to the ANXA7 low-mappability regions.

5.3.3 Influence of mappability

Next, we checked to what extent mappability can account for the fact that six of the seven sites shown to be bound by PTB in our mutational study lack any evidence in the published PTB HITS-CLIP v1 and v2 data. The mappability of a genomic region refers to the number of times the sequence of this region can be mapped to the reference genome. The ENCODE project provides several mappability tracks for the human reference genome. For this analysis, we used the Encode Alignability track for 50mers available at http://genomebrowser.wustl.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeMapability (Figure 5.7 A). Here, the alignability of a genomic region is defined as the inverse of its mappability, allowing up to 2 mismatches [Derrien et al., 2012]. The k-mer length of 50 corresponds to the read length of PTB HITS-CLIP v2. Reads that were mapped more than once to the reference genome were not included in the published PTB HITS-CLIP v1 and v2 alignments. For that reason, only binding sites located in genomic regions with perfect alignability, corresponding to an alignability score of 1, are included in the published data. The five GraphProt candidate sites lacking evidence in the published data of the PTB HITS-CLIP experiments but shown to be bound by our minigene study (S6, S7, S9, S10 and S11) are located in regions of low alignability (Figure 5.7 B), explaining the absence of evidence in the published PTB HITS-CLIP v1 and v2 alignments.

To analyse to what extent the inclusion of multiply mapped reads could have rescued evidence of binding to these sites, we remapped both PTB HITS-CLIP v1 and v2 libraries with Bowtie2, allowing up to 20 alignments per read using parameters -k 20, --very-sensitive and --local. Bowtie2 aligned 47% of the mapped PTB HITS-CLIP v1 and 40% of the mapped PTB HITS-CLIP v2 reads to multiple genomic locations. The alignment profiles based on both uniquely and multiply mapped reads cover the full ANXA7 minigene including the low-mappability regions devoid of uniquely mapped reads in the published data (Figure 5.7 C and D). Consequently, the PTB HITS-CLIP experiments at least indicate the possibility of PTB binding to these sites. The removal of this evidence by the exclusive use of uniquely mapped reads, however, precluded the detection of valid PTB binding sites by CLIP-seq within these regions.

We investigated the influence of peak calling, sequencing depth and mappability on the ability of two PTB CLIP-seq experiments to detect a set of PTB binding sites located on ANXA7 between exons 5 and 7. For PTB HITS-CLIP v1, we showed that the sequencing depth in combination with the peak calling procedure was not suited for detecting any binding sites on the ANXA7 minigene. Despite a 5-fold increase in the number of uniquely mapped reads, PTB HITS-CLIP v2 only contained evidence for one additional binding site compared to PTB HITS-CLIP v1. Five of the PTB binding sites shown to influence alternative splicing of ANXA7 exon 6 were found to be located in low-mappability regions. Remapping of the published reads showed that both PTB HITS-CLIP experiments contain weak evidence for binding to these sites in the form of multiply mapping reads. We conclude that mappability is a major factor preventing the detection of PTB binding sites relevant for regulating the alternative splicing of ANXA7 exon 6 by PTB HITS-CLIP.
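The notion of alignability used in this section, the inverse of the number of times a k-mer can be mapped to the genome, can be made concrete with a toy sketch. It simplifies the ENCODE track considerably: it counts exact matches only (the track allows up to 2 mismatches), considers a single strand, and uses a short hypothetical sequence with k = 4 instead of 50-mers on the human genome. The function name is an assumption.

```python
from collections import Counter

def alignability(genome: str, k: int):
    """Alignability per k-mer start position: 1 / number of exact occurrences."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]

# Hypothetical toy genome containing the repeated 4-mer ACGT.
toy_genome = "ACGTACGTTTGCA"
scores = alignability(toy_genome, k=4)

# Only positions with a score of 1 can be covered by uniquely mapped reads.
uniquely_mappable = [score == 1.0 for score in scores]
```

Positions with a score below 1 correspond to the low-alignability regions discussed above: a read originating there has more than one equally good genomic location and is discarded when only uniquely mapped reads are retained.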

5.4 Conclusion

In this chapter, we presented the use of GraphProt for the conclusive identification of the RNA-binding protein PTB as the factor responsible for the aberrant splicing of ANXA7, a tumor suppressor associated with the deregulation of the EGFR pathway commonly found in glioblastoma multiforme. Despite circumstantial evidence of the agency of PTB in the alternative splicing of ANXA7 exon 6 from PTB knockdown and immunoprecipitation, no evidence of PTB binding relevant to the alternative splicing of exon 6 was found in a set of PTB CLIP-seq sites. Assuming that the binding sites were missed by the CLIP-seq experiments, we used GraphProt to predict PTB-binding sites in the vicinity of the alternatively spliced exon and to design mutations meant to weaken or disable PTB binding to these sites. Biological relevance of the predicted sites could be shown by measuring the influence of the mutated sites on splicing of a minigene reporter construct.

The set of newly identified PTB binding sites enabled us to analyse, by way of example, three factors responsible for missed binding sites in HITS-CLIP experiments: peak calling, sequencing depth and mappability. These analyses showed that four of the eleven sites were missed due to the combined effects of peak calling and low sequencing depth. We found evidence for one additional site in a second dataset generated using a larger sequencing depth. Analysis of the remaining five sites without any evidence in the published HITS-CLIP data revealed that binding to these sites cannot be detected using uniquely mapped reads. Realignment of the published reads revealed that the published data contains some evidence for binding to the missed sites in the form of non-uniquely mapped reads. Since the true genomic origin of these reads cannot be determined, this evidence must be considered weak.

Improvements in next-generation sequencing technology in the form of increased sequencing depth and read length can be expected to somewhat mitigate the issue of missed binding sites. Large parts of the genome, however, are not suitable for obtaining high-confidence binding evidence in the form of uniquely mapped CLIP-seq reads. Consequently, detection and validation of a considerable fraction of RBP-bound sites require additional experimental evidence. GraphProt-based identification and validation of RBP binding sites can be a viable solution for this task.

Chapter 6

Conclusions

Elucidation of the regulatory networks governing gene regulation is a vital aspect of the scientific quest to understand life. The work presented in this thesis is concerned with RNA-binding proteins (RBPs), a large and versatile class of proteins involved in all stages of gene regulation. In this thesis, we investigated RBPs partaking in the regulation of transcriptional control (Chapter 3), the regulation of mRNA stability (Chapter 4) and RNA processing (Chapter 5). While the regulatory mechanisms involving RBPs are diverse, the activity of RBPs is consistently mediated via interactions with their RNA targets. The work presented in this thesis covers the evaluation of experimental data on these interactions and their integration into computational models of binding preferences.

The basis of these models is the representation of the sequence and structure of the RNA targets. In Chapter 2, we presented a detailed evaluation of algorithms for the prediction of secondary structure of long RNAs, the main targets of RBPs. For this purpose, we focussed on algorithms for determining RNA secondary structure given the nucleotide sequence. Their applicability on a genome-wide scale and their independence of additional data such as measurements of RNA structure or sequence-structure alignments make these algorithms the most versatile means of predicting RNA structure. Our qualitative evaluation of local folding approaches designed for the prediction of structure of long RNAs identified a specific bias at the window borders for the sliding window approach. We proposed parameters to mitigate this effect and introduced LocalFold, a method that removes the bias entirely. Our benchmarks of the selected algorithms on two large data sets identified the sliding window approach to local folding as the algorithm best suited for the prediction of structures in long RNAs. Based on our qualitative and quantitative analyses, we proposed parameters for windowed local folding that should give favourable results for a wide range of long RNAs.

Computational models of RBP binding require experimental data on RBP-RNA interactions for training. In Chapter 3, we discussed the processing required for the evaluation of iCLIP, one of the major experimental methods for determining genome-wide in-vivo binding sites of RNA-binding proteins. The main advantage of iCLIP compared to other CLIP-seq approaches is its ability to detect individual binding events at nucleotide resolution. Detection of these events is accomplished with a barcoding strategy that allows duplicates introduced by PCR amplification to be identified and merged. We showed that a merging strategy not compensating for barcode errors, which are bound to be abundant in low-complexity sequencing libraries, would severely overestimate the number of crosslinking events, and we implemented a straightforward filtering strategy that allows the exact determination of binding events. Furthermore, we provided a detailed description and discussion of the processing required for the evaluation of iCLIP sequencing data. We concluded the chapter with the analysis of two iCLIP experiments that allowed us to determine the structural elements required for binding by these proteins.

Experimental identification of RBP binding sites with CLIP-seq may yield large numbers of bound sites. Measurements of binding affinities yield individual affinity scores for large numbers of sequences. While initial insights into the binding preferences of the analysed proteins may be gained by integration of this data into sequence and structure motifs, further uses of this data require computational models of binding preferences. In Chapter 4, we presented GraphProt, a framework for learning the binding preferences of RNA-binding proteins. GraphProt is the first method that allows the encoding of full RNA secondary structures of RBP binding sites, a technique especially suitable for modelling preferences of binding sites within base-pairing regions. GraphProt can be applied both in regression and classification settings and is thus suited for current experimental data on RBP interactions. GraphProt displayed a robust and improved performance in comparison to the previous state of the art. Sequence and structure preferences learned by GraphProt models can be visualized using sequence logos. We showed that GraphProt is suitable for performing genome-wide target site scans. Sites predicted to be bound by GraphProt, but not the sites determined by the CLIP-seq experiment the model was based on, were found to be relevant for the regulation of targeted transcripts. In summary, we showed that GraphProt is able to capture binding preferences of a wide range of RBPs. The scanning mode can be a powerful tool to complete the RBPome.

We concluded this thesis with Chapter 5, presenting a method for designing experiments for the validation of binding sites. For this purpose, we first predicted prospective binding sites using a GraphProt model trained on CLIP-seq-derived binding sites. We then identified mutations weakening binding of the protein of interest according to the same model. The outcome expected from a reduction of binding by this protein, increased inclusion of an alternatively spliced exon, could then be shown experimentally, thus validating binding to the predicted sites. We then provided a detailed analysis of the various factors that precluded the identification of binding sites by genome-wide high-throughput methods. Taken together, this method provides the means to identify and validate RBP binding sites in a straightforward assay that remains applicable to situations where data from high-throughput methods is hard or impossible to obtain.

In summary, we provide methods for the investigation of binding sites of RNA-binding proteins with a focus on computational models of RBP binding. In this context, we evaluated the suitability of prediction algorithms for the folding of long RNAs and developed methods for the evaluation of experimental data on RBP binding sites. We presented GraphProt, a framework to integrate this data into models of sequence- and structure-binding preferences. GraphProt models allow the straightforward visualisation of RBP sequence and structure preferences and can be used productively to detect and validate novel binding sites, thus enabling the evaluation of binding sites that are hard or impossible to probe with current experimental methods. The research presented in this thesis shows that models of RBP sequence- and structure-binding preferences are useful tools for the research of RBP binding. We hope that this work will be useful in the further investigation of the gene regulation governed by RBPs and thus contribute to the elucidation of the patterns of life.


Bibliography

[Akhtar et al., 2000] Akhtar, A., Zink, D., and Becker, P. B. (2000). Chromodomains are protein-RNA interaction modules. Nature, 407(6802):405–9.

[Akutsu and Tatsuya, 2000] Akutsu, T. and Tatsuya, A. (2000). Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discrete Appl. Math., 104(1-3):45–62.

[Anders et al., 2012] Anders, G., Mackowiak, S. D., Jens, M., Maaskola, J., Kuntzagk, A., Rajewsky, N., Landthaler, M., and Dieterich, C. (2012). doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res, 40(Database issue):D180–6.

[Andronescu et al., 2008] Andronescu, M., Bereg, V., Hoos, H. H., and Condon, A. (2008). RNA STRAND: the RNA secondary structure and statistical analysis database. BMC Bioinformatics, 9:340.

[Ascano et al., 2013] Ascano, M., Gerstberger, S., and Tuschl, T. (2013). Multidisciplinary methods to define RNA-protein interactions and regulatory networks. Curr. Opin. Genet. Dev., 23(1):20–28.

[Augui et al., 2011] Augui, S., Nora, E. P., and Heard, E. (2011). Regulation of X-chromosome inactivation by the X-inactivation centre. Nat Rev Genet, 12(6):429–42.

[Auweter et al., 2006] Auweter, S. D., Oberstrass, F. C., and Allain, F. H.-T. (2006). Sequence-specific binding of single-stranded RNA: is there a code for recognition? Nucleic Acids Res., 34(17):4943–4959.

[Backofen et al., 2009] Backofen, R., Tsur, D., Zakov, S., and Ziv-Ukelson, M. (2009). Sparse RNA folding: Time and space efficient algorithms. In Kucherov, G. and Ukkonen, E., editors, Proc. 20th Symp. Combinatorial Pattern Matching, volume 5577 of LNCS, pages 249–262. Springer.

[Bailey et al., 2009] Bailey, T. L., Boden, M., Buske, F. A., Frith, M., Grant, C. E., Clementi, L., Ren, J., Li, W. W., and Noble, W. S. (2009). MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res, 37(Web Server issue):W202–8.


[Baltz et al., 2012] Baltz, A. G., Munschauer, M., Schwanhausser, B., Vasile, A., Murakawa, Y., Schueler, M., Youngs, N., Penfold-Brown, D., Drew, K., Milek, M., Wyler, E., Bonneau, R., Selbach, M., Dieterich, C., and Landthaler, M. (2012). The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts. Mol. Cell, 46(5):674–690.

[Barrett et al., 2013] Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., Marshall, K. A., Phillippy, K. H., Sherman, P. M., Holko, M., Yefanov, A., Lee, H., Zhang, N., Robertson, C. L., Serova, N., Davis, S., and Soboleva, A. (2013). NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res., 41(D1):D991–D995.

[Bauer et al., 2012] Bauer, W. J., Heath, J., Jenkins, J. L., and Kielkopf, C. L. (2012). Three RNA recognition motifs participate in RNA recognition and structural organization by the pro-apoptotic factor TIA-1. J Mol Biol, 415(4):727–40.

[Bernhart et al., 2006] Bernhart, S. H., Hofacker, I. L., and Stadler, P. F. (2006). Local RNA base pairing probabilities in large sequences. Bioinformatics, 22(5):614–5.

[Bernhart et al., 2008] Bernhart, S. H., Hofacker, I. L., Will, S., Gruber, A. R., and Stadler, P. F. (2008). RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics, 9:474.

[Bernhart et al., 2011] Bernhart, S. H., Mückstein, U., and Hofacker, I. L. (2011). RNA Accessibility in cubic time. Algorithms Mol Biol, 6(1):3.

[Black, 2003] Black, D. L. (2003). Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem., 72:291–336.

[Blencowe et al., 2009] Blencowe, B. J., Ahmad, S., and Lee, L. J. (2009). Current-generation high-throughput sequencing: deepening insights into mammalian transcriptomes. Genes Dev, 23(12):1379–86.

[Bokov and Steinberg, 2009] Bokov, K. and Steinberg, S. V. (2009). A hierarchical model for evolution of 23S ribosomal RNA. Nature, 457(7232):977–80.

[Borer et al., 1974] Borer, P. N., Dengler, B., Tinoco, Jr, I., and Uhlenbeck, O. C. (1974). Stability of ribonucleic acid double-stranded helices. J. Mol. Biol., 86(4):843–853.

[Bottou and LeCun, 2004] Bottou, L. and LeCun, Y. (2004). Large scale online learning. In Thrun, S., Saul, L., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.

[Breaker, 2008] Breaker, R. R. (2008). Complex riboswitches. Science, 319(5871):1795–7.


[Bredel et al., 2009] Bredel, M., Scholtens, D. M., Harsh, G. R., Bredel, C., Chandler, J. P., Renfrow, J. J., Yadav, A. K., Vogel, H., Scheck, A. C., Tibshirani, R., and Sikic, B. I. (2009). A network model of a cooperative genetic landscape in brain tumors. JAMA, 302(3):261–75.

[Bredel et al., 2011] Bredel, M., Scholtens, D. M., Yadav, A. K., Alvarez, A. A., Renfrow, J. J., Chandler, J. P., Yu, I. L. Y., Carro, M. S., Dai, F., Tagge, M. J., Ferrarese, R., Bredel, C., Phillips, H. S., Lukac, P. J., Robe, P. A., Weyerbrock, A., Vogel, H., Dubner, S., Mobley, B., He, X., Scheck, A. C., Sikic, B. I., Aldape, K. D., Chakravarti, A., and Harsh, G. R. t. (2011). NFKBIA deletion in glioblastomas. N Engl J Med, 364(7):627–37.

[Buenrostro et al., 2014] Buenrostro, J. D., Araya, C. L., Chircus, L. M., Layton, C. J., Chang, H. Y., Snyder, M. P., and Greenleaf, W. J. (2014). Quantitative analysis of RNA-protein interactions on a massively parallel array reveals biophysical and evolutionary landscapes. Nat. Biotechnol., 32(6):562–568.

[Busch and Backofen, 2006] Busch, A. and Backofen, R. (2006). INFO-RNA–a fast approach to inverse RNA folding. Bioinformatics, 22(15):1823–1831.

[Busch et al., 2008] Busch, A., Richter, A. S., and Backofen, R. (2008). IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions. Bioinformatics, 24(24):2849–56.

[Carvalho and Lawrence, 2008] Carvalho, L. E. and Lawrence, C. E. (2008). Centroid estimation in discrete high-dimensional spaces with applications in biology. Proc Natl Acad Sci USA, 105(9):3209–14.

[Castello et al., 2012] Castello, A., Fischer, B., Eichelbaum, K., Horos, R., Beckmann, B. M., Strein, C., Davey, N. E., Humphreys, D. T., Preiss, T., Steinmetz, L. M., Krijgsveld, J., and Hentze, M. W. (2012). Insights into RNA biology from an atlas of mammalian mRNA-binding proteins. Cell, 149(6):1393–1406.

[Cech and Steitz, 2014] Cech, T. R. and Steitz, J. A. (2014). The noncoding RNA Revolution—Trashing old rules to forge new ones. Cell, 157(1):77–94.

[Cesana et al., 2011] Cesana, M., Cacchiarelli, D., Legnini, I., Santini, T., Sthandier, O., Chinappi, M., Tramontano, A., and Bozzoni, I. (2011). A long noncoding RNA controls muscle differentiation by functioning as a competing endogenous RNA. Cell, 147(2):358–69.

[Chang and Lin, 2011] Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[Cheung et al., 2008] Cheung, H. C., Baggerly, K. A., Tsavachidis, S., Bachinski, L. L., Neubauer, V. L., Nixon, T. J., Aldape, K. D., Cote, G. J., and Krahe, R. (2008). Global analysis of aberrant pre-mRNA splicing in glioblastoma using exon expression arrays. BMC Genomics, 9:216.

[Chi et al., 2009] Chi, S. W., Zang, J. B., Mele, A., and Darnell, R. B. (2009). Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature, 460(7254):479–486.

[Chowdhury et al., 2006] Chowdhury, S., Maris, C., Allain, F. H.-T., and Narberhaus, F. (2006). Molecular basis for temperature sensing by an RNA thermometer. EMBO J., 25(11):2487–2497.

[Colombrita et al., 2012] Colombrita, C., Onesto, E., Megiorni, F., Pizzuti, A., Baralle, F. E., Buratti, E., Silani, V., and Ratti, A. (2012). TDP-43 and FUS RNA-binding proteins bind distinct sets of cytoplasmic messenger RNAs and differently regulate their post-transcriptional fate in motoneuron-like cells. Journal of Biological Chemistry, 287(19):15635–47.

[Conrad and Akhtar, 2011] Conrad, T. and Akhtar, A. (2011). Dosage compensation in Drosophila melanogaster: epigenetic fine-tuning of chromosome-wide transcription. Nat Rev Genet, 13(2):123–34.

[Conrad et al., 2012] Conrad, T., Cavalli, F. M. G., Vaquerizas, J. M., Luscombe, N. M., and Akhtar, A. (2012). Drosophila dosage compensation involves enhanced Pol II recruitment to male X-linked promoters. Science, 337(6095):742–6.

[Cook et al., 2015] Cook, K. B., Hughes, T. R., and Morris, Q. D. (2015). High-throughput characterization of protein-RNA interactions. Brief. Funct. Genomics, 14(1):74–89.

[Corcoran et al., 2011] Corcoran, D. L., Georgiev, S., Mukherjee, N., Gottwein, E., Skalsky, R. L., Keene, J. D., and Ohler, U. (2011). PARalyzer: definition of RNA binding sites from PAR-CLIP short-read sequence data. Genome Biol, 12(8):R79.

[Cortes and Vapnik, 1995] Cortes, C. and Vapnik, V. (1995). Support-vector networks. In Machine Learning, pages 273–297.

[Costa and Grave, 2010] Costa, F. and Grave, K. D. (2010). Fast neighborhood subgraph pairwise distance kernel. In Proceedings of the 26th International Conference on Machine Learning, pages 255–262. Omnipress.

[Cox, 1966] Cox, R. A. (1966). The secondary structure of ribosomal ribonucleic acid in solution. Biochem. J, 98(3):841–857.

[Cox and Littauer, 1959] Cox, R. A. and Littauer, U. Z. (1959). Secondary structure of ribonucleic acid in solution. Nature, 184(Suppl 11):818–819.

[Crooks et al., 2004] Crooks, G. E., Hon, G., Chandonia, J.-M., and Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Res, 14(6):1188–90.

[Cruz and Westhof, 2009] Cruz, J. A. and Westhof, E. (2009). The dynamic landscapes of RNA architecture. Cell, 136(4):604–609.

[Czaplinski et al., 2005] Czaplinski, K., Kocher, T., Schelder, M., Segref, A., Wilm, M., and Mattaj, I. W. (2005). Identification of 40LoVe, a Xenopus hnRNP D family protein involved in localizing a TGF-beta-related mRNA during oogenesis. Dev Cell, 8(4):505–15.

[Das and Dai, 2007] Das, M. K. and Dai, H.-K. (2007). A survey of DNA motif finding algorithms. BMC Bioinformatics, 8 Suppl 7:S21.

[Davis and Goadrich, 2006] Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 233–240, New York, NY, USA. ACM.

[Dember et al., 1996] Dember, L., Kim, N., Liu, K., and Anderson, P. (1996). Individual RNA recognition motifs of TIA-1 and TIAR have different RNA binding specificities. Journal of Biological Chemistry, 271:2783.

[Deng and Meller, 2006] Deng, X. and Meller, V. H. (2006). Non-coding RNA in fly dosage compensation. Trends Biochem. Sci., 31(9):526–532.

[Derrien et al., 2012] Derrien, T., Estelle, J., Marco Sola, S., Knowles, D. G., Raineri, E., Guigo, R., and Ribeca, P. (2012). Fast computation and applications of genome mappability. PLoS One, 7(1):e30377.

[DeVoe and Tinoco, 1962] DeVoe, H. and Tinoco, I. (1962). The stability of helical polynucleotides: Base contributions. J. Mol. Biol., 4(6):500–517.

[Diamond et al., 2001] Diamond, J. M., Turner, D. H., and Mathews, D. H. (2001). Thermodynamics of three-way multibranch loops in RNA. Biochemistry, 40(23):6971–81.

[Ding et al., 2006] Ding, Y., Chan, C. Y., and Lawrence, C. E. (2006). Clustering of RNA secondary structures with application to messenger RNAs. J Mol Biol, 359(3):554–71.

[Do et al., 2006] Do, C. B., Woods, D. A., and Batzoglou, S. (2006). CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):e90–8.

[Doshi et al., 2004] Doshi, K. J., Cannone, J. J., Cobaugh, C. W., and Gutell, R. R. (2004). Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics, 5:105.

[Draper, 1999] Draper, D. E. (1999). Themes in RNA-protein recognition. J. Mol. Biol., 293(2):255–270.

[Drucker et al., 1997] Drucker, H., Burges, C. J., Kaufman, L., Smola, A., and Vapnik, V. (1997). Support vector regression machines. Advances in neural information processing systems, pages 155–161.

[Faircloth and Glenn, 2012] Faircloth, B. C. and Glenn, T. C. (2012). Not all sequence tags are created equal: Designing and validating sequence identification tags robust to indels. PLoS ONE, 7(8):e42543.

[Fauth et al., 2010] Fauth, T., Muller-Planitz, F., Konig, C., Straub, T., and Becker, P. B. (2010). The DNA binding CXC domain of MSL2 is required for faithful targeting the Dosage Compensation Complex to the X chromosome. Nucleic Acids Res, 38(10):3209–21.

[Fawcett, 2006] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874. ROC Analysis in Pattern Recognition.

[Ferrarese et al., 2014] Ferrarese, R., Harsh, G. R. t., Yadav, A. K., Bug, E., Maticzka, D., Reichardt, W., Dombrowski, S. M., Miller, T. E., Masilamani, A. P., Dai, F., Kim, H., Hadler, M., Scholtens, D. M., Yu, I. L. Y., Beck, J., Srinivasasainagendra, V., Costa, F., Baxan, N., Pfeifer, D., Elverfeldt, D. V., Backofen, R., Weyerbrock, A., Duarte, C. W., He, X., Prinz, M., Chandler, J. P., Vogel, H., Chakravarti, A., Rich, J. N., Carro, M. S., and Bredel, M. (2014). Lineage-specific splicing of a brain-enriched alternative exon promotes glioblastoma progression. J Clin Invest, 124(7):2861–2876.

[Fields et al., 1994] Fields, C., Adams, M. D., White, O., and Venter, J. C. (1994). How many genes in the human genome? Nat. Genet., 7(3):345–346.

[Fields and Gutell, 1996] Fields, D. S. and Gutell, R. R. (1996). An analysis of large rRNA sequences folded by a thermodynamic method. Fold Des, 1(6):419–30.

[Finn et al., 2010] Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, O. L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E. L. L., Eddy, S. R., and Bateman, A. (2010). The Pfam protein families database. Nucleic Acids Res., 38(Database issue):D211–22.

[Foat et al., 2006] Foat, B. C., Morozov, A. V., and Bussemaker, H. J. (2006). Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics, 22(14):e141–9.

[Forch et al., 2000] Forch, P., Puig, O., Kedersha, N., Martinez, C., Granneman, S., Seraphin, B., Anderson, P., and Valcarcel, J. (2000). The apoptosis-promoting factor TIA-1 is a regulator of alternative pre-mRNA splicing. Mol Cell, 6(5):1089–98.

[Frasconi et al., 2012] Frasconi, P., Costa, F., Raedt, L. D., and Grave, K. D. (2012). klog: A language for logical and relational learning with kernels. CoRR, abs/1205.3981.

[Freeberg et al., 2013] Freeberg, M. A., Han, T., Moresco, J. J., Kong, A., Yang, Y.-C., Lu, Z. J., Yates, J. R., and Kim, J. K. (2013). Pervasive and dynamic protein binding sites of the mRNA transcriptome in Saccharomyces cerevisiae. Genome Biol., 14(2):R13.

[Gabut et al., 2011] Gabut, M., Samavarchi-Tehrani, P., Wang, X., Slobodeniuc, V., O’Hanlon, D., Sung, H.-K., Alvarez, M., Talukder, S., Pan, Q., Mazzoni, E. O., Nedelec, S., Wichterle, H., Woltjen, K., Hughes, T. R., Zandstra, P. W., Nagy, A., Wrana, J. L., and Blencowe, B. J. (2011). An alternative splicing switch regulates embryonic stem cell pluripotency and reprogramming. Cell, 147(1):132–46.

[Gama-Carvalho et al., 2006] Gama-Carvalho, M., Barbosa-Morais, N. L., Brodsky, A. S., Silver, P. A., and Carmo-Fonseca, M. (2006). Genome-wide identification of functionally distinct subsets of cellular mRNAs associated with two nucleocytoplasmic-shuttling mammalian splicing factors. Genome Biol, 7(11):R113.

[Gao et al., 1994] Gao, F. B., Carson, C. C., Levine, T., and Keene, J. D. (1994). Selection of a subset of mRNAs from combinatorial 3’ untranslated region libraries using neuronal RNA-binding protein Hel-N1. Proc Natl Acad Sci USA, 91(23):11207–11.

[Gardner et al., 2011] Gardner, P. P., Daub, J., Tate, J., Moore, B. L., Osuch, I. H., Griffiths-Jones, S., Finn, R. D., Nawrocki, E. P., Kolbe, D. L., Eddy, S. R., and Bateman, A. (2011). Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Res, 39(Database issue):D141–5.

[Gatignol et al., 1993] Gatignol, A., Buckler, C., and Jeang, K. T. (1993). Relatedness of an RNA-binding motif in human immunodeficiency virus type 1 TAR RNA-binding protein TRBP to human P1/dsI kinase and Drosophila staufen. Mol Cell Biol, 13(4):2193–202.

[Gerstberger et al., 2013] Gerstberger, S., Hafner, M., and Tuschl, T. (2013). Learning the language of post-transcriptional gene regulation. Genome Biol., 14(8):130.

[Gerstberger et al., 2014] Gerstberger, S., Hafner, M., and Tuschl, T. (2014). A census of human RNA-binding proteins. Nat. Rev. Genet., 15(12):829–845.

[Gesteland et al., 2006] Gesteland, R. F., Cech, T. R., and Atkins, J. F., editors (2006). The RNA World. Cold Spring Harbor Laboratory Press, Plainview, NY, 3rd edition.

[Giegerich et al., 2004] Giegerich, R., Voss, B., and Rehmsmeier, M. (2004). Abstract shapes of RNA. Nucleic Acids Res, 32(16):4843–51.

[Gilbert, 1986] Gilbert, W. (1986). The RNA world. Nature, 319:618.

[Gorlach et al., 1994] Gorlach, M., Burd, C. G., and Dreyfuss, G. (1994). The determinants of RNA-binding specificity of the heterogeneous nuclear ribonucleoprotein C proteins. Journal of Biological Chemistry, 269(37):23074–8.

[Gorodkin and Hofacker, 2011] Gorodkin, J. and Hofacker, I. L. (2011). From structure prediction to genomic screens for novel non-coding RNAs. PLoS Comput Biol, 7(8):e1002100.

[Gowri et al., 2006] Gowri, V. S., Krishnadev, O., Swamy, C. S., and Srinivasan, N. (2006). MulPSSM: a database of multiple position-specific scoring matrices of protein domain families. Nucleic Acids Res, 34(Database issue):D243–6.

[Griffiths-Jones et al., 2005] Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S. R., and Bateman, A. (2005). Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33 Database Issue:D121–4.

[Grimson et al., 2007] Grimson, A., Farh, K. K.-H., Johnston, W. K., Garrett-Engele, P., Lim, L. P., and Bartel, D. P. (2007). MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell, 27(1):91–105.

[Guenther et al., 2013] Guenther, U.-P., Yandek, L. E., Niland, C. N., Campbell, F. E., Anderson, D., Anderson, V. E., Harris, M. E., and Jankowsky, E. (2013). Hidden specificity in an apparently nonspecific RNA-binding protein. Nature, 502(7471):385–388.

[Guerrier-Takada et al., 1983] Guerrier-Takada, C., Gardiner, K., Marsh, T., Pace, N., and Altman, S. (1983). The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell, 35(3 Pt 2):849–857.

[Gupta and Gribskov, 2011] Gupta, A. and Gribskov, M. (2011). The role of RNA sequence and structure in RNA–Protein interactions. J. Mol. Biol., 409(4):574–587.

[Gupta et al., 2013] Gupta, S. K., Kosti, I., Plaut, G., Pivko, A., Tkacz, I. D., Cohen-Chalamish, S., Biswas, D. K., Wachtel, C., Waldman Ben-Asher, H., Carmi, S., Glaser, F., Mandel-Gutfreund, Y., and Michaeli, S. (2013). The hnRNP F/H homologue of Trypanosoma brucei is differentially expressed in the two life cycle stages of the parasite and regulates splicing and mRNA stability. Nucleic Acids Res, 41(13):6577–94.

[Hafner et al., 2010] Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., Rothballer, A., Ascano, M. J., Jungkamp, A.-C., Munschauer, M., Ulrich, A., Wardle, G. S., Dewell, S., Zavolan, M., and Tuschl, T. (2010). Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell, 141(1):129–41.

[Hallacli et al., 2012] Hallacli, E., Lipp, M., Georgiev, P., Spielman, C., Cusack, S., Akhtar, A., and Kadlec, J. (2012). Msl1-mediated dimerization of the dosage compensation complex is essential for male X-chromosome regulation in Drosophila. Mol Cell, 48(4):587–600.

[Hatoum-Aslan et al., 2011] Hatoum-Aslan, A., Maniv, I., and Marraffini, L. A. (2011). Mature clustered, regularly interspaced, short palindromic repeats RNA (crRNA) length is measured by a ruler mechanism anchored at the precursor processing site. Proc Natl Acad Sci USA, 108(52):21218–22.

[Hausser et al., 2009] Hausser, J., Landthaler, M., Jaskiewicz, L., Gaidatzis, D., and Zavolan, M. (2009). Relative contribution of sequence and structure features to the mRNA binding of Argonaute/EIF2C-miRNA complexes and the degradation of miRNA targets. Genome Res, 19(11):2009–20.

[Helwak et al., 2013] Helwak, A., Kudla, G., Dudnakova, T., and Tollervey, D. (2013). Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding. Cell, 153(3):654–665.

[Hershey and Chase, 1952] Hershey, A. D. and Chase, M. (1952). Independent functions of viral protein and nucleic acid in growth of bacteriophage. J Gen Physiol, 36(1):39–56.

[Heyne et al., 2012] Heyne, S., Costa, F., Rose, D., and Backofen, R. (2012). GraphClust: alignment-free structural clustering of local RNA secondary structures. Bioinformatics, 28(12):i224–i232.

[Hiller et al., 2006] Hiller, M., Pudimat, R., Busch, A., and Backofen, R. (2006). Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res, 34(17):e117.

[Hiller et al., 2007] Hiller, M., Zhang, Z., Backofen, R., and Stamm, S. (2007). Pre-mRNA Secondary Structures Influence Exon Recognition. PLoS Genet, 3(11):e204.

[Hoell et al., 2011] Hoell, J. I., Larsson, E., Runge, S., Nusbaum, J. D., Duggimpudi, S., Farazi, T. A., Hafner, M., Borkhardt, A., Sander, C., and Tuschl, T. (2011). RNA targets of wild-type and mutant FET family proteins. Nat Struct Mol Biol, 18(12):1428–31.

[Hofacker et al., 1994] Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, S., Tacker, M., and Schuster, P. (1994). Fast folding and comparison of RNA secondary structures. Monatshefte Chemie, 125:167–188.

[Hofacker et al., 2004] Hofacker, I. L., Priwitzer, B., and Stadler, P. F. (2004). Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20(2):186–190.

[Hofacker et al., 1998] Hofacker, I. L., Schuster, P., and Stadler, P. F. (1998). Combinatorics of RNA secondary structures. Discrete Appl. Math., 88(1):207.

[Honer Zu Siederdissen et al., 2013] Honer Zu Siederdissen, C., Hammer, S., Abfalter, I., Hofacker, I. L., Flamm, C., and Stadler, P. F. (2013). Computational design of RNAs with complex energy landscapes. Biopolymers, 99(12):1124–1136.

[Hong et al., 2009] Hong, X., Hammell, M., Ambros, V., and Cohen, S. M. (2009). Immunopurification of Ago1 miRNPs selects for a distinct class of microRNA targets. Proc Natl Acad Sci USA, 106(35):15085–90.

[Ilik et al., 2013] Ilik, I. A., Quinn, J. J., Georgiev, P., Tavares-Cadete, F., Maticzka, D., Toscano, S., Wan, Y., Spitale, R. C., Luscombe, N., Backofen, R., Chang, H. Y., and Akhtar, A. (2013). Tandem Stem-Loops in roX RNAs Act Together to Mediate X Chromosome Dosage Compensation in Drosophila. Mol Cell, 51(2):156–73.

[International Human Genome Sequencing Consortium, 2001] International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822):860–921.

[Izzo et al., 2008] Izzo, A., Regnard, C., Morales, V., Kremmer, E., and Becker, P. B. (2008). Structure-function analysis of the RNA helicase maleless. Nucleic Acids Res, 36(3):950–62.

[Jacobs et al., 2009] Jacobs, G. H., Chen, A., Stevens, S. G., Stockwell, P. A., Black, M. A., Tate, W. P., and Brown, C. M. (2009). Transterm: a database to aid the analysis of regulatory sequences in mRNAs. Nucleic Acids Res, 37(Database issue):D72–6.

[Jenkins et al., 2010] Jenkins, R. H., Bennagi, R., Martin, J., Phillips, A. O., Redman, J. E., and Fraser, D. J. (2010). A conserved stem loop motif in the 5’ untranslated region regulates transforming growth factor-beta(1) translation. PLoS One, 5(8):e12283.

[Jens and Rajewsky, 2015] Jens, M. and Rajewsky, N. (2015). Competition between target sites of regulators shapes post-transcriptional gene regulation. Nat. Rev. Genet., 16(2):113–126.

[Jungkamp et al., 2011] Jungkamp, A.-C., Stoeckius, M., Mecenas, D., Grun, D., Mastrobuoni, G., Kempa, S., and Rajewsky, N. (2011). In vivo and transcriptome-wide identification of RNA binding protein target sites. Mol. Cell, 44(5):828–840.

[Kadlec et al., 2011] Kadlec, J., Hallacli, E., Lipp, M., Holz, H., Sanchez-Weatherby, J., Cusack, S., and Akhtar, A. (2011). Structural basis for MOF and MSL3 recruitment into the dosage compensation complex by MSL1. Nat Struct Mol Biol, 18(2):142–9.

[Karakasiliotis et al., 2010] Karakasiliotis, I., Vashist, S., Bailey, D., Abente, E. J., Green, K. Y., Roberts, L. O., Sosnovtsev, S. V., and Goodfellow, I. G. (2010). Polypyrimidine tract binding protein functions as a negative regulator of feline calicivirus translation. PLoS One, 5(3):e9562.

[Kazan and Morris, 2013] Kazan, H. and Morris, Q. (2013). RBPmotif: a web server for the discovery of sequence and structure preferences of RNA-binding proteins. Nucleic Acids Res, 41(Web Server issue):W180–6.

[Kazan et al., 2010] Kazan, H., Ray, D., Chan, E. T., Hughes, T. R., and Morris, Q. (2010). RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins. PLoS Comput Biol, 6:e1000832.

[Kelley et al., 2008] Kelley, R. L., Lee, O.-K., and Shim, Y.-K. (2008). Transcription rate of noncoding roX1 RNA controls local spreading of the Drosophila MSL chromatin remodeling complex. Mech Dev, 125(11-12):1009–19.

[Kertesz et al., 2007] Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U., and Segal, E. (2007). The role of site accessibility in microRNA target recognition. Nat Genet, 39(10):1278–84.

[Kertesz et al., 2010] Kertesz, M., Wan, Y., Mazor, E., Rinn, J. L., Nutter, R. C., Chang, H. Y., and Segal, E. (2010). Genome-wide measurement of RNA secondary structure in yeast. Nature, 467(7311):103–7.

[Kim et al., 2013] Kim, H. S., Headey, S. J., Yoga, Y. M. K., Scanlon, M. J., Gorospe, M., Wilce, M. C. J., and Wilce, J. A. (2013). Distinct binding properties of TIAR RRMs and linker region. RNA Biol, 10(4):579–89.

[Kiryu et al., 2008] Kiryu, H., Kin, T., and Asai, K. (2008). Rfold: an exact algorithm for computing local base pairing probabilities. Bioinformatics, 24(3):367–73.

[Kiryu et al., 2011] Kiryu, H., Terai, G., Imamura, O., Yoneyama, H., Suzuki, K., and Asai, K. (2011). A detailed investigation of accessibilities around target sites of siRNAs and miRNAs. Bioinformatics, 27(13):1788–97.

[Kishore et al., 2011] Kishore, S., Jaskiewicz, L., Burger, L., Hausser, J., Khorshid, M., and Zavolan, M. (2011). A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nat Methods, 8(7):559–64.

[Kleinkauf et al., 2015] Kleinkauf, R., Mann, M., and Backofen, R. (2015). antaRNA - ant colony based RNA sequence design. Bioinformatics, 31(19):3114–3121.

[Kojima et al., 2007] Kojima, S., Matsumoto, K., Hirose, M., Shimada, M., Nagano, M., Shigeyoshi, Y., Hoshino, S.-i., Ui-Tei, K., Saigo, K., Green, C. B., Sakaki, Y., and Tei, H. (2007). LARK activates posttranscriptional expression of an essential mammalian clock protein, PERIOD1. Proc Natl Acad Sci USA, 104(6):1859–64.

[König et al., 2010] König, J., Zarnack, K., Rot, G., Curk, T., Kayikci, M., Zupan, B., Turner, D. J., Luscombe, N. M., and Ule, J. (2010). iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat. Struct. Mol. Biol., 17(7):909–915.

[Konings and Gutell, 1995] Konings, D. A. and Gutell, R. R. (1995). A comparison of thermodynamic foldings with comparatively derived structures of 16S and 16S-like rRNAs. RNA, 1(6):559–74.

[Kornberg, 1999] Kornberg, R. D. (1999). Eukaryotic transcriptional control. Trends Cell Biol., 9(12):M46–9.

[Kruger et al., 1982] Kruger, K., Grabowski, P. J., Zaug, A. J., Sands, J., Gottschling, D. E., and Cech, T. R. (1982). Self-splicing RNA: autoexcision and autocyclization of the ribosomal RNA intervening sequence of Tetrahymena. Cell, 31(1):147–157.

[Kudla et al., 2009] Kudla, G., Murray, A. W., Tollervey, D., and Plotkin, J. B. (2009). Coding-sequence determinants of gene expression in Escherichia coli. Science, 324(5924):255–8.

[Kundu et al., 2013] Kundu, K., Costa, F., and Backofen, R. (2013). A graph kernel approach for alignment-free domain-peptide interaction prediction with an application to human SH3 domains. Bioinformatics, 29(13):i335–i343.

[Landry et al., 2013] Landry, J. J. M., Pyl, P. T., Rausch, T., Zichner, T., Tekkedil, M. M., Stütz, A. M., Jauch, A., Aiyar, R. S., Pau, G., Delhomme, N., Gagneur, J., Korbel, J. O., Huber, W., and Steinmetz, L. M. (2013). The genomic and transcriptomic landscape of a HeLa cell line. G3, 3(8):1213–1224.

[Lange et al., 2013] Lange, S. J., Alkhnbashi, O. S., Rose, D., Will, S., and Backofen, R. (2013). CRISPRmap: an automated classification of repeat conservation in prokaryotic adaptive immune systems. Nucleic Acids Res, 41(17):8034–44. SJL, OSA and DR contributed equally to this work.

[Lange et al., 2012] Lange, S. J., Maticzka, D., Möhl, M., Gagnon, J. N., Brown, C. M., and Backofen, R. (2012). Global or local? Predicting secondary structure and accessibility in mRNAs. Nucleic Acids Res, 40(12):5215–26. SJL and DM contributed equally to this work.

[Langmead and Salzberg, 2012] Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods, 9(4):357–9.

[Laver et al., 2013] Laver, J. D., Li, X., Ancevicius, K., Westwood, J. T., Smibert, C. A., Morris, Q. D., and Lipshitz, H. D. (2013). Genome-wide analysis of Staufen-associated mRNAs identifies secondary structures that confer target specificity. Nucleic Acids Res.

[Law et al., 2006] Law, M. J., Rice, A. J., Lin, P., and Laird-Offringa, I. A. (2006). The role of RNA structure in the interaction of U1A protein with U1 hairpin II RNA. RNA, 12(7):1168–78.

[Lebedeva et al., 2011] Lebedeva, S., Jens, M., Theil, K., Schwanhausser, B., Selbach, M., Landthaler, M., and Rajewsky, N. (2011). Transcriptome-wide analysis of regulatory interactions of the RNA-binding protein HuR. Mol Cell, 43(3):340–52.

[Lee et al., 2002] Lee, J. H., Kim, H., Ko, J., and Lee, Y. (2002). Interaction of C5 protein with RNA aptamers selected by SELEX. Nucleic Acids Res, 30(24):5360–8.

[Leibovich et al., 2013] Leibovich, L., Paz, I., Yakhini, Z., and Mandel-Gutfreund, Y. (2013). DRIMust: a web server for discovering rank imbalanced motifs using suffix trees. Nucleic Acids Res, 41(Web Server issue):W174–9.

[Leslie et al., 2004] Leslie, C. S., Eskin, E., Cohen, A., Weston, J., and Noble, W. S. (2004). Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4):467–76.

[Li et al., 2010] Li, X., Quon, G., Lipshitz, H. D., and Morris, Q. (2010). Predicting in vivo binding sites of RNA-binding proteins using mRNA secondary structure. RNA, 16(6):1096–107.

[Licatalosi et al., 2008] Licatalosi, D. D., Mele, A., Fak, J. J., Ule, J., Kayikci, M., Chi, S. W., Clark, T. A., Schweitzer, A. C., Blume, J. E., Wang, X., Darnell, J. C., and Darnell, R. B. (2008). HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature, 456(7221):464–469.

[Lorenz et al., 2011] Lorenz, R., Bernhart, S. H., Höner Zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P. F., and Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms Mol. Biol., 6:26.

[Lorenz et al., 2016] Lorenz, R., Luntzer, D., Hofacker, I. L., Stadler, P. F., and Wolfinger, M. T. (2016). SHAPE directed RNA folding. Bioinformatics, 32(1):145–147.

[Lu et al., 2016] Lu, Z., Zhang, Q. C., Lee, B., Flynn, R. A., Smith, M. A., Robinson, J. T., Davidovich, C., Gooding, A. R., Goodrich, K. J., Mattick, J. S., Mesirov, J. P., Cech, T. R., and Chang, H. Y. (2016). RNA duplex map in living cells reveals Higher-Order transcriptome structure. Cell, 165(5):1267–1279.

[Lu et al., 2009] Lu, Z. J., Gloor, J. W., and Mathews, D. H. (2009). Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA, 15(10):1805–13.

[Lucchesi, 1998] Lucchesi, J. C. (1998). Dosage compensation in flies and worms: the ups and downs of X-chromosome regulation. Curr Opin Genet Dev, 8(2):179–84.

[Lunde et al., 2007] Lunde, B. M., Moore, C., and Varani, G. (2007). RNA-binding proteins: modular design for efficient function. Nat. Rev. Mol. Cell Biol., 8(6):479–490.

[Lyngsø and Pedersen, 2000] Lyngsø, R. B. and Pedersen, C. N. (2000). RNA pseudoknot prediction in energy-based models. J. Comput. Biol., 7(3-4):409–427.

[Lyngso et al., 1999] Lyngso, R. B., Zuker, M., and Pedersen, C. N. (1999). Fast evaluation of internal loops in RNA secondary structure prediction. Bioinformatics, 15(6):440–445.

[Madison, 1968] Madison, J. T. (1968). Primary structure of RNA. Annu. Rev. Biochem., 37:131–148.

[Maenner et al., 2012] Maenner, S., Muller, M., and Becker, P. B. (2012). Roles of long, non-coding RNA in chromosome-wide transcription regulation: lessons from two dosage compensation systems. Biochimie, 94(7):1490–8.

[Magendzo et al., 1991] Magendzo, K., Shirvan, A., Cultraro, C., Srivastava, M., Pollard, H. B., and Burns, A. L. (1991). Alternative splicing of human synexin mRNA in brain, cardiac, and skeletal muscle alters the unique N-terminal domain. Journal of Biological Chemistry, 266(5):3228–32.

[Mandal and Breaker, 2004] Mandal, M. and Breaker, R. R. (2004). Gene regulation by riboswitches. Nat. Rev. Mol. Cell Biol., 5(6):451–463.

[Marín and Vaníček, 2011] Marín, R. M. and Vaníček, J. (2011). Efficient use of accessibility in microRNA target prediction. Nucleic Acids Res, 39(1):19–29.

[Markham and Zuker, 2008] Markham, N. R. and Zuker, M. (2008). UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol, 453:3–31.

[Martin, 2011] Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1).

[Masliah et al., 2013] Masliah, G., Barraud, P., and Allain, F. H.-T. (2013). RNA recognition by double-stranded RNA binding domains: a matter of shape and sequence. Cell. Mol. Life Sci., 70(11):1875–1895.

[Mathews et al., 1999] Mathews, D., Sabina, J., Zuker, M., and Turner, D. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol, 288(5):911–40.

[Mathews and Turner, 2002] Mathews, D. H. and Turner, D. H. (2002). Experimentally derived nearest-neighbor parameters for the stability of RNA three- and four-way multibranch loops. Biochemistry, 41(3):869–80.

[Maticzka et al., 2014] Maticzka, D., Lange, S. J., Costa, F., and Backofen, R. (2014). GraphProt: modeling binding preferences of RNA-binding proteins. Genome Biol, 15(1):R17.

[Mattick, 2003] Mattick, J. S. (2003). Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. Bioessays, 25(10):930–939.

[Mattick and Makunin, 2005] Mattick, J. S. and Makunin, I. V. (2005). Small regulatory RNAs in mammals. Hum. Mol. Genet., 14 Spec No 1:R121–32.

[McCaskill, 1990] McCaskill, J. S. (1990). The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105–19.

[Meller et al., 2000] Meller, V. H., Gordadze, P. R., Park, Y., Chu, X., Stuckenholz, C., Kelley, R. L., and Kuroda, M. I. (2000). Ordered assembly of roX RNAs into MSL complexes on the dosage-compensated X chromosome in Drosophila. Curr Biol, 10(3):136–43.

[Meller and Rattner, 2002] Meller, V. H. and Rattner, B. P. (2002). The roX genes encode redundant male-specific lethal transcripts required for targeting of the MSL complex. EMBO J, 21(5):1084–91.

[Messias and Sattler, 2004] Messias, A. C. and Sattler, M. (2004). Structural basis of single-stranded RNA recognition. Acc. Chem. Res., 37(5):279–287.

[Mokrejs et al., 2010] Mokrejs, M., Masek, T., Vopalensky, V., Hlubucek, P., Delbos, P., and Pospisek, M. (2010). IRESite–a tool for the examination of viral and cellular internal ribosome entry sites. Nucleic Acids Res, 38(Database issue):D131–6.

[Møller et al., 2002] Møller, T., Franch, T., Højrup, P., Keene, D. R., Bächinger, H. P., Brennan, R. G., and Valentin-Hansen, P. (2002). Hfq: a bacterial sm-like protein that mediates RNA-RNA interaction. Mol. Cell, 9(1):23–30.

[Morgan and Higgs, 1996] Morgan, S. R. and Higgs, P. G. (1996). Evidence for kinetic effects in the folding of large RNA molecules. The Journal of Chemical Physics, 105(16):7152.

[Morris et al., 2010] Morris, A. R., Mukherjee, N., and Keene, J. D. (2010). Systematic analysis of posttranscriptional gene expression. Wiley Interdiscip Rev Syst Biol Med, 2(2):162–80.

[Mukherjee et al., 2011] Mukherjee, N., Corcoran, D. L., Nusbaum, J. D., Reid, D. W., Georgiev, S., Hafner, M., Ascano, M. J., Tuschl, T., Ohler, U., and Keene, J. D. (2011). Integrative regulatory mapping indicates that the RNA-binding protein HuR couples pre-mRNA processing and mRNA stability. Mol Cell, 43(3):327–39.

[Nicholas et al., 2006] Nicholas, M. K., Lukas, R. V., Jafri, N. F., Faoro, L., and Salgia, R. (2006). Epidermal growth factor receptor - mediated signal transduction in the development and therapy of gliomas. Clin Cancer Res, 12(24):7261–70.

[Nilsen and Graveley, 2010] Nilsen, T. W. and Graveley, B. R. (2010). Expansion of the eukaryotic proteome by alternative splicing. Nature, 463(7280):457–463.


[Noller et al., 1992] Noller, H. F., Hoffarth, V., and Zimniak, L. (1992). Unusual resistance of peptidyl transferase to protein extraction procedures. Science, 256(5062):1416–1419.

[Nussinov and Jacobson, 1980] Nussinov, R. and Jacobson, A. B. (1980). Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc. Natl. Acad. Sci. U. S. A., 77(11):6309–6313.

[Nussinov et al., 1978] Nussinov, R., Pieczenik, G., Griggs, J. R., and Kleitman, D. J. (1978). Algorithms for loop matchings. SIAM J. Appl. Math., 35(1):68–82.

[Nussinov and Tinoco, 1981] Nussinov, R. and Tinoco, I. J. (1981). Sequential folding of a messenger RNA molecule. J Mol Biol, 151(3):519–33.

[Park et al., 2007] Park, S.-W., Kang, Y. I., Sypula, J. G., Choi, J., Oh, H., and Park, Y. (2007). An evolutionarily conserved domain of roX2 RNA is sufficient for induction of H4-Lys16 acetylation on the Drosophila X chromosome. Genetics, 177(3):1429–37.

[Park et al., 2008] Park, S.-W., Kuroda, M. I., and Park, Y. (2008). Regulation of histone H4 Lys16 acetylation by predicted alternative secondary structures in roX noncoding RNAs. Mol Cell Biol, 28(16):4952–62.

[Perez et al., 1997] Perez, I., Lin, C. H., McAfee, J. G., and Patton, J. G. (1997). Mutation of PTB binding sites causes misregulation of alternative 3’ splice site selection in vivo. RNA, 3(7):764–78.

[Pipas and McMahon, 1975] Pipas, J. M. and McMahon, J. E. (1975). Method for predicting RNA secondary structure. Proc. Natl. Acad. Sci. U. S. A., 72(6):2017–2021.

[Prasanth and Spector, 2007] Prasanth, K. V. and Spector, D. L. (2007). Eukaryotic regulatory RNAs: an answer to the ‘genome complexity’ conundrum. Genes Dev., 21(1):11–42.

[Pudimat et al., 2005] Pudimat, R., Schukat-Talamazzini, E., and Backofen, R. (2005). A multiple-feature framework for modelling and predicting transcription factor binding sites. Bioinformatics, 21(14):3082–8.

[Ray et al., 2009] Ray, D., Kazan, H., Chan, E. T., Pena Castillo, L., Chaudhry, S., Talukder, S., Blencowe, B. J., Morris, Q., and Hughes, T. R. (2009). Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nat. Biotechnol., 27(7):667–670.

[Ray et al., 2013] Ray, D., Kazan, H., Cook, K. B., Weirauch, M. T., Najafabadi, H. S., Li, X., Gueroussov, S., Albu, M., Zheng, H., Yang, A., Na, H., Irimia, M., Matzat, L. H., Dale, R. K., Smith, S. A., Yarosh, C. A., Kelly, S. M., Nabet, B., Mecenas, D., Li, W., Laishram, R. S., Qiao, M., Lipshitz, H. D., Piano, F., Corbett, A. H., Carstens, R. P., Frey, B. J., Anderson, R. A., Lynch, K. W., Penalva, L. O. F., Lei, E. P., Fraser, A. G., Blencowe, B. J., Morris, Q. D., and Hughes, T. R. (2013). A compendium of RNA-binding motifs for decoding gene regulation. Nature, 499(7457):172–7.

[Re et al., 2014] Re, A., Joshi, T., Kulberkyte, E., Morris, Q., and Workman, C. T. (2014). RNA-Protein Interactions: An Overview. Methods Mol Biol, 1097:491–521.

[Reid et al., 2009] Reid, D. C., Chang, B. L., Gunderson, S. I., Alpert, L., Thompson, W. A., and Fairbrother, W. G. (2009). Next-generation SELEX identifies sequence and structural determinants of splicing factor binding in human pre-mRNA sequence. RNA, 15(12):2385–97.

[Reuter and Mathews, 2010] Reuter, J. S. and Mathews, D. H. (2010). RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11:129.

[Reyes-Herrera and Ficarra, 2014] Reyes-Herrera, P. H. and Ficarra, E. (2014). Computational methods for CLIP-seq data processing. Bioinform. Biol. Insights, 8:199–207.

[Richter et al., 2010] Richter, A. S., Schleberger, C., Backofen, R., and Steglich, C. (2010). Seed-based IntaRNA prediction combined with GFP-reporter system identifies mRNA targets of the small RNA Yfr1. Bioinformatics, 26(1):1–5.

[Rick et al., 2005] Rick, M., Ramos Garrido, S. I., Herr, C., Thal, D. R., Noegel, A. A., and Clemen, C. S. (2005). Nuclear localization of Annexin A7 during murine brain development. BMC Neurosci, 6:25.

[Rivas and Eddy, 2000] Rivas, E. and Eddy, S. R. (2000). The language of RNA: a formal grammar that includes pseudoknots. Bioinformatics, 16(4):334–40.

[Robinson et al., 2011] Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., and Mesirov, J. P. (2011). Integrative genomics viewer. Nat. Biotechnol., 29(1):24–26.

[Saito and Rehmsmeier, 2015] Saito, T. and Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One, 10(3):e0118432.

[Sanford et al., 2009] Sanford, J. R., Wang, X., Mort, M., Vanduyn, N., Cooper, D. N., Mooney, S. D., Edenberg, H. J., and Liu, Y. (2009). Splicing factor SFRS1 recognizes a functionally diverse landscape of RNA transcripts. Genome Res, 19(3):381–94.


[Saunders, 2014] Saunders, S. J. (2014). Computational analyses of post-transcriptional regulatory mechanisms. PhD thesis, Albert-Ludwigs-University Freiburg.

[Schmitter et al., 2006] Schmitter, D., Filkowski, J., Sewer, A., Pillai, R. S., Oakeley, E. J., Zavolan, M., Svoboda, P., and Filipowicz, W. (2006). Effects of Dicer and Argonaute down-regulation on mRNA levels in human HEK293 cells. Nucleic Acids Res, 34(17):4801–15.

[Schnall-Levin et al., 2011] Schnall-Levin, M., Rissland, O. S., Johnston, W. K., Perrimon, N., Bartel, D. P., and Berger, B. (2011). Unusually effective microRNA targeting within repeat-rich coding regions of mammalian mRNAs. Genome Res, 21(9):1395–403.

[Schroeder, 2009] Schroeder, S. J. (2009). Advances in RNA structure prediction from sequence: new tools for generating hypotheses about viral RNA structure-function relationships. J. Virol., 83(13):6326–6334.

[Selbach et al., 2008] Selbach, M., Schwanhausser, B., Thierfelder, N., Fang, Z., Khanin, R., and Rajewsky, N. (2008). Widespread changes in protein synthesis induced by microRNAs. Nature, 455(7209):58–63.

[Shao et al., 2006] Shao, Y., Wu, Y., Chan, C. Y., McDonough, K., and Ding, Y. (2006). Rational design and rapid screening of antisense oligonucleotides for prokaryotic gene modulation. Nucleic Acids Res, 34(19):5660–9.

[Sharma et al., 2011] Sharma, S., Maris, C., Allain, F. H.-T., and Black, D. L. (2011). U1 snRNA directly interacts with polypyrimidine tract-binding protein during splicing repression. Mol Cell, 41(5):579–88.

[Shine and Dalgarno, 1974] Shine, J. and Dalgarno, L. (1974). The 3’-terminal sequence of Escherichia coli 16S ribosomal RNA: complementarity to nonsense triplets and ribosome binding sites. Proc. Natl. Acad. Sci. U. S. A., 71(4):1342–1346.

[Sievers et al., 2012] Sievers, C., Schlumpf, T., Sawarkar, R., Comoglio, F., and Paro, R. (2012). Mixture models and wavelet transforms reveal high confidence RNA-protein interaction sites in MOV10 PAR-CLIP data. Nucleic Acids Res, 40(20):e160.

[Sims et al., 2014] Sims, D., Sudbery, I., Ilott, N. E., Heger, A., and Ponting, C. P. (2014). Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet, 15(2):121–32.

[Singh and Valcárcel, 2005] Singh, R. and Valcárcel, J. (2005). Building specificity with nonspecific RNA-binding proteins. Nat. Struct. Mol. Biol., 12(8):645–653.


[Smith et al., 2000] Smith, E. R., Pannuti, A., Gu, W., Steurnagel, A., Cook, R. G., Allis, C. D., and Lucchesi, J. C. (2000). The drosophila MSL complex acetylates histone H4 at lysine 16, a chromatin modification linked to dosage compensation. Mol Cell Biol, 20(1):312–8.

[Srivastava et al., 2001] Srivastava, M., Bubendorf, L., Srikantan, V., Fossom, L., Nolan, L., Glasman, M., Leighton, X., Fehrle, W., Pittaluga, S., Raffeld, M., Koivisto, P., Willi, N., Gasser, T. C., Kononen, J., Sauter, G., Kallioniemi, O. P., Srivastava, S., and Pollard, H. B. (2001). ANX7, a candidate tumor suppressor gene for prostate cancer. Proc Natl Acad Sci USA, 98(8):4575–80.

[Srivastava et al., 2003] Srivastava, M., Montagna, C., Leighton, X., Glasman, M., Naga, S., Eidelman, O., Ried, T., and Pollard, H. B. (2003). Haploinsufficiency of Anx7 tumor suppressor gene and consequent genomic instability promotes tumorigenesis in the Anx7(+/-) mouse. Proc Natl Acad Sci USA, 100(24):14287–92.

[Steffen et al., 2006] Steffen, P., Voss, B., Rehmsmeier, M., Reeder, J., and Giegerich, R. (2006). RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics, 22(4):500–3.

[Stefl et al., 2010] Stefl, R., Oberstrass, F. C., Hood, J. L., Jourdan, M., Zimmermann, M., Skrisovska, L., Maris, C., Peng, L., Hofr, C., Emeson, R. B., and Allain, F. H.-T. (2010). The solution structure of the ADAR2 dsRBM-RNA complex reveals a Sequence-Specific readout of the minor groove. Cell, 143(2):225–237.

[Stein and Waterman, 1979] Stein, P. R. and Waterman, M. S. (1979). On some new sequences generalizing the catalan and motzkin numbers. Discrete Math., 26(3):261–272.

[Stevens et al., 2011] Stevens, S. G., Gardner, P. P., and Brown, C. (2011). Two covariance models for iron-responsive elements. RNA Biol, 8(5).

[Straub and Becker, 2011] Straub, T. and Becker, P. B. (2011). Transcription modulation chromosome-wide: universal features and principles of dosage compensation in worms and flies. Curr Opin Genet Dev, 21(2):147–53.

[Stupp et al., 2005] Stupp, R., Mason, W. P., van den Bent, M. J., Weller, M., Fisher, B., Taphoorn, M. J. B., Belanger, K., Brandes, A. A., Marosi, C., Bogdahn, U., Curschmann, J., Janzer, R. C., Ludwin, S. K., Gorlia, T., Allgeier, A., Lacombe, D., Cairncross, J. G., Eisenhauer, E., and Mirimanoff, R. O. (2005). Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N Engl J Med, 352(10):987–96.


[Sturm et al., 2010] Sturm, M., Hackenberg, M., Langenberger, D., and Frishman, D. (2010). TargetSpy: a supervised machine learning approach for microRNA target prediction. BMC Bioinformatics, 11:292.

[Sugimoto et al., 2015] Sugimoto, Y., Vigilante, A., Darbo, E., Zirra, A., Militti, C., D’Ambrogio, A., Luscombe, N. M., and Ule, J. (2015). hiCLIP reveals the in vivo atlas of mRNA secondary structures recognized by Staufen 1. Nature, 519(7544):491–494.

[Swets, 1988] Swets, J. A. (1988). Measuring the accuracy of diagnostic systems. Science, 240(4857):1285–93.

[Tacke et al., 1997] Tacke, R., Chen, Y., and Manley, J. L. (1997). Sequence-specific RNA binding by an SR protein requires RS domain phosphorylation: creation of an SRp40-specific splicing enhancer. Proc Natl Acad Sci USA, 94(4):1148–53.

[Tafer et al., 2008] Tafer, H., Ameres, S. L., Obernosterer, G., Gebeshuber, C. A., Schroeder, R., Martinez, J., and Hofacker, I. L. (2008). The impact of target site accessibility on the design of effective siRNAs. Nat Biotechnol, 26(5):578–83.

[Tinoco et al., 1973] Tinoco, Jr, I., Borer, P. N., Dengler, B., Levin, M. D., Uhlenbeck, O. C., Crothers, D. M., and Bralla, J. (1973). Improved estimation of secondary structure in ribonucleic acids. Nat. New Biol., 246(150):40–41.

[Tinoco and Bustamante, 1999] Tinoco, Jr, I. and Bustamante, C. (1999). How RNA folds. J. Mol. Biol., 293(2):271–281.

[Tinoco et al., 1971] Tinoco, Jr, I., Uhlenbeck, O. C., and Levine, M. D. (1971). Estimation of secondary structure in ribonucleic acids. Nature, 230(5293):362–367.

[Tollervey et al., 2011] Tollervey, J. R., Curk, T., Rogelj, B., Briese, M., Cereda, M., Kayikci, M., Konig, J., Hortobagyi, T., Nishimura, A. L., Zupunski, V., Patani, R., Chandran, S., Rot, G., Zupan, B., Shaw, C. E., and Ule, J. (2011). Characterizing the RNA targets and position-dependent splicing regulation by TDP-43. Nat Neurosci, 14(4):452–8.

[Tuller et al., 2010] Tuller, T., Waldman, Y. Y., Kupiec, M., and Ruppin, E. (2010). Translation efficiency is determined by both codon bias and folding energy. Proc Natl Acad Sci USA, 107(8):3645–50.

[Turner and Mathews, 2010] Turner, D. H. and Mathews, D. H. (2010). NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res, 38(Database issue):D280–2.


[Ule, 2014] Ule, J. (2014). Gene regulation via protein–RNA interactions. Methods, 65(3):261–262.

[Ule et al., 2005] Ule, J., Jensen, K., Mele, A., and Darnell, R. B. (2005). CLIP: a method for identifying protein-RNA interaction sites in living cells. Methods, 37(4):376–86.

[Ule et al., 2003] Ule, J., Jensen, K. B., Ruggiu, M., Mele, A., Ule, A., and Darnell, R. B. (2003). CLIP identifies Nova-regulated RNA networks in the brain. Science, 302(5648):1212–5.

[Underwood et al., 2010] Underwood, J. G., Uzilov, A. V., Katzman, S., Onodera, C. S., Mainzer, J. E., Mathews, D. H., Lowe, T. M., Salama, S. R., and Haussler, D. (2010). FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing. Nat Methods, 7(12):995–1001.

[Ungewitter and Scrable, 2010] Ungewitter, E. and Scrable, H. (2010). Delta40p53 controls the switch from pluripotency to differentiation by regulating IGF signaling in ESCs. Genes Dev, 24(21):2408–19.

[Uren et al., 2012] Uren, P. J., Bahrami-Samani, E., Burns, S. C., Qiao, M., Karginov, F. V., Hodges, E., Hannon, G. J., Sanford, J. R., Penalva, L. O. F., and Smith, A. D. (2012). Site identification in high-throughput RNA-protein interaction data. Bioinformatics, 28(23):3013–20.

[Van Nostrand et al., 2016] Van Nostrand, E. L., Pratt, G. A., Shishkin, A. A., Gelboin-Burkhart, C., Fang, M. Y., Sundararaman, B., Blue, S. M., Nguyen, T. B., Surka, C., Elkins, K., Stanton, R., Rigo, F., Guttman, M., and Yeo, G. W. (2016). Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nat. Methods, 13(6):508–514.

[Vaquerizas et al., 2009] Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A., and Luscombe, N. M. (2009). A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet., 10(4):252–263.

[Venter et al., 2001] Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., Gocayne, J. D., Amanatides, P., Ballew, R. M., Huson, D. H., Wortman, J. R., Zhang, Q., Kodira, C. D., Zheng, X. H., Chen, L., Skupski, M., Subramanian, G., Thomas, P. D., Zhang, J., Gabor Miklos, G. L., Nelson, C., Broder, S., Clark, A. G., Nadeau, J., McKusick, V. A., Zinder, N., Levine, A. J., Roberts, R. J., Simon, M., Slayman, C., Hunkapiller, M., Bolanos, R., Delcher, A., Dew, I., Fasulo, D., Flanigan, M., Florea, L., Halpern, A., Hannenhalli, S., Kravitz, S., Levy, S., Mobarry, C., Reinert, K., Remington, K., Abu-Threideh, J., Beasley, E., Biddick, K., Bonazzi, V., Brandon, R., Cargill, M., Chandramouliswaran, I., Charlab, R., Chaturvedi, K., Deng, Z., Di Francesco, V., Dunn, P., Eilbeck, K., Evangelista, C., Gabrielian, A. E., Gan, W., Ge, W., Gong, F., Gu, Z., Guan, P., Heiman, T. J., Higgins, M. E., Ji, R. R., Ke, Z., Ketchum, K. A., Lai, Z., Lei, Y., Li, Z., Li, J., Liang, Y., Lin, X., Lu, F., Merkulov, G. V., Milshina, N., Moore, H. M., Naik, A. K., Narayan, V. A., Neelam, B., Nusskern, D., Rusch, D. B., Salzberg, S., Shao, W., Shue, B., Sun, J., Wang, Z., Wang, A., Wang, X., Wang, J., Wei, M., Wides, R., Xiao, C., Yan, C., Yao, A., Ye, J., Zhan, M., Zhang, W., Zhang, H., Zhao, Q., Zheng, L., Zhong, F., Zhong, W., Zhu, S., Zhao, S., Gilbert, D., Baumhueter, S., Spier, G., Carter, C., Cravchik, A., Woodage, T., Ali, F., An, H., Awe, A., Baldwin, D., Baden, H., Barnstead, M., Barrow, I., Beeson, K., Busam, D., Carver, A., Center, A., Cheng, M. L., Curry, L., Danaher, S., Davenport, L., Desilets, R., Dietz, S., Dodson, K., Doup, L., Ferriera, S., Garg, N., Gluecksmann, A., Hart, B., Haynes, J., Haynes, C., Heiner, C., Hladun, S., Hostin, D., Houck, J., Howland, T., Ibegwam, C., Johnson, J., Kalush, F., Kline, L., Koduru, S., Love, A., Mann, F., May, D., McCawley, S., McIntosh, T., McMullen, I., Moy, M., Moy, L., Murphy, B., Nelson, K., Pfannkoch, C., Pratts, E., Puri, V., Qureshi, H., Reardon, M., Rodriguez, R., Rogers, Y. H., Romblad, D., Ruhfel, B., Scott, R., Sitter, C., Smallwood, M., Stewart, E., Strong, R., Suh, E., Thomas, R., Tint, N. N., Tse, S., Vech, C., Wang, G., Wetter, J., Williams, S., Williams, M., Windsor, S., Winn-Deen, E., Wolfe, K., Zaveri, J., Zaveri, K., Abril, J. F., Guigo, R., Campbell, M. J., Sjolander, K. V., Karlak, B., Kejariwal, A., Mi, H., Lazareva, B., Hatton, T., Narechania, A., Diemer, K., Muruganujan, A., Guo, N., Sato, S., Bafna, V., Istrail, S., Lippert, R., Schwartz, R., Walenz, B., Yooseph, S., Allen, D., Basu, A., Baxendale, J., Blick, L., Caminha, M., Carnes-Stine, J., Caulk, P., Chiang, Y. H., Coyne, M., Dahlke, C., Mays, A., Dombroski, M., Donnelly, M., Ely, D., Esparham, S., Fosler, C., Gire, H., Glanowski, S., Glasser, K., Glodek, A., Gorokhov, M., Graham, K., Gropman, B., Harris, M., Heil, J., Henderson, S., Hoover, J., Jennings, D., Jordan, C., Jordan, J., Kasha, J., Kagan, L., Kraft, C., Levitsky, A., Lewis, M., Liu, X., Lopez, J., Ma, D., Majoros, W., McDaniel, J., Murphy, S., Newman, M., Nguyen, T., Nguyen, N., Nodell, M., Pan, S., Peck, J., Peterson, M., Rowe, W., Sanders, R., Scott, J., Simpson, M., Smith, T., Sprague, A., Stockwell, T., Turner, R., Venter, E., Wang, M., Wen, M., Wu, D., Wu, M., Xia, A., Zandieh, A., and Zhu, X. (2001). The sequence of the human genome. Science, 291(5507):1304–51.

[Vivanco et al., 2010] Vivanco, I., Rohle, D., Versele, M., Iwanami, A., Kuga, D., Oldrini, B., Tanaka, K., Dang, J., Kubek, S., Palaskas, N., Hsueh, T., Evans, M., Mulholland, D., Wolle, D., Rajasekaran, S., Rajasekaran, A., Liau, L. M., Cloughesy, T. F., Dikic, I., Brennan, C., Wu, H., Mischel, P. S., Perera, T., and Mellinghoff, I. K. (2010). The phosphatase and tensin homolog regulates epidermal growth factor receptor (EGFR) inhibitor response by targeting EGFR for degradation. Proc Natl Acad Sci USA, 107(14):6459–64.


[Walczak et al., 1996] Walczak, R., Westhof, E., Carbon, P., and Krol, A. (1996). A novel RNA structural motif in the selenocysteine insertion element of eukaryotic selenoprotein mRNAs. RNA, 2(4):367–79.

[Wan et al., 2011] Wan, Y., Kertesz, M., Spitale, R. C., Segal, E., and Chang, H. Y. (2011). Understanding the transcriptome through RNA structure. Nat. Rev. Genet., 12(9):641–655.

[Wang et al., 2011] Wang, X., Juan, L., Lv, J., Wang, K., Sanford, J. R., and Liu, Y. (2011). Predicting sequence and structural specificities of RNA binding regions recognized by splicing factor SRSF1. BMC Genomics, 12 Suppl 5:S8.

[Wang et al., 2010] Wang, Z., Kayikci, M., Briese, M., Zarnack, K., Luscombe, N. M., Rot, G., Zupan, B., Curk, T., and Ule, J. (2010). iCLIP predicts the dual splicing effects of TIA-RNA interactions. PLoS Biol, 8(10):e1000530.

[Warf and Berglund, 2010] Warf, M. B. and Berglund, J. A. (2010). Role of RNA structure in regulating pre-mRNA splicing. Trends Biochem. Sci., 35(3):169–178.

[Watson, 1963] Watson, J. D. (1963). Involvement of RNA in the synthesis of proteins. Science, 140(3562):17–26.

[Watson and Crick, 1953] Watson, J. D. and Crick, F. H. (1953). Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature, 171(4356):737–8.

[Wilhelm et al., 2014] Wilhelm, M., Schlegl, J., Hahne, H., Moghaddas Gholami, A., Lieberenz, M., Savitski, M. M., Ziegler, E., Butzmann, L., Gessulat, S., Marx, H., Mathieson, T., Lemeer, S., Schnatbaum, K., Reimer, U., Wenschuh, H., Mollenhauer, M., Slotta-Huspenina, J., Boese, J.-H., Bantscheff, M., Gerstmair, A., Faerber, F., and Kuster, B. (2014). Mass-spectrometry-based draft of the human proteome. Nature, 509(7502):582–587.

[Wilkinson et al., 2006] Wilkinson, K. A., Merino, E. J., and Weeks, K. M. (2006). Selective 2’-hydroxyl acylation analyzed by primer extension (SHAPE): quantitative RNA structure analysis at single nucleotide resolution. Nat Protoc, 1(3):1610–6.

[Will et al., 2012] Will, S., Joshi, T., Hofacker, I. L., Stadler, P. F., and Backofen, R. (2012). LocARNA-P: Accurate boundary prediction and improved detection of structural RNAs. RNA, 18(5):900–14.

[Will et al., 2007] Will, S., Reiche, K., Hofacker, I. L., Stadler, P. F., and Backofen, R. (2007). Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol, 3(4):e65.


[Wilusz and Shenk, 1990] Wilusz, J. and Shenk, T. (1990). A uridylate tract mediates efficient heterogeneous nuclear ribonucleoprotein C protein-RNA cross-linking and functionally substitutes for the downstream element of the polyadenylation signal. Mol Cell Biol, 10(12):6397–407.

[Wong et al., 1987] Wong, A. J., Bigner, S. H., Bigner, D. D., Kinzler, K. W., Hamilton, S. R., and Vogelstein, B. (1987). Increased expression of the epidermal growth factor receptor gene in malignant gliomas is invariably associated with gene amplification. Proc Natl Acad Sci USA, 84(19):6899–903.

[Wuchty et al., 1999] Wuchty, S., Fontana, W., Hofacker, I. L., and Schuster, P. (1999). Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers, 49(2):145–165.

[Xue et al., 2013] Xue, Y., Ouyang, K., Huang, J., Zhou, Y., Ouyang, H., Li, H., Wang, G., Wu, Q., Wei, C., Bi, Y., Jiang, L., Cai, Z., Sun, H., Zhang, K., Zhang, Y., Chen, J., and Fu, X.-D. (2013). Direct conversion of fibroblasts to neurons by reprogramming PTB-regulated microRNA circuits. Cell, 152(1-2):82–96.

[Xue et al., 2009] Xue, Y., Zhou, Y., Wu, T., Zhu, T., Ji, X., Kwon, Y.-S., Zhang, C., Yeo, G., Black, D. L., Sun, H., Fu, X.-D., and Zhang, Y. (2009). Genome-wide analysis of PTB-RNA interactions reveals a strategy used by the general splicing repressor to modulate exon inclusion or skipping. Mol Cell, 36(6):996–1006.

[Yadav et al., 2009] Yadav, A. K., Renfrow, J. J., Scholtens, D. M., Xie, H., Duran, G. E., Bredel, C., Vogel, H., Chandler, J. P., Chakravarti, A., Robe, P. A., Das, S., Scheck, A. C., Kessler, J. A., Soares, M. B., Sikic, B. I., Harsh, G. R., and Bredel, M. (2009). Monosomy of chromosome 10 associated with dysregulation of epidermal growth factor signaling in glioblastomas. JAMA, 302(3):276–89.

[Yang et al., 2007] Yang, Z., Sui, Y., Xiong, S., Liour, S. S., Phillips, A. C., and Ko, L. (2007). Switched alternative splicing of oncogene CoAA during embryonal carcinoma stem cell differentiation. Nucleic Acids Res, 35(6):1919–32.

[Yeo et al., 2009] Yeo, G. W., Coufal, N. G., Liang, T. Y., Peng, G. E., Fu, X.-D., and Gage, F. H. (2009). An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nat Struct Mol Biol, 16(2):130–7.

[Zhang and Darnell, 2011] Zhang, C. and Darnell, R. B. (2011). Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat Biotechnol, 29(7):607–14.


[Zhang et al., 2013] Zhang, C., Lee, K.-Y., Swanson, M. S., and Darnell, R. B. (2013). Prediction of clustered RNA-binding protein motif sites in the mammalian genome. Nucleic Acids Res.

[Zuker and Sankoff, 1984] Zuker, M. and Sankoff, D. (1984). RNA secondary structures and their prediction. Bull. Math. Biol., 46(4):591–621.

[Zuker and Stiegler, 1981] Zuker, M. and Stiegler, P. (1981). Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res, 9(1):133–48.

[Zykovich et al., 2009] Zykovich, A., Korf, I., and Segal, D. J. (2009). Bind-n-Seq: high-throughput analysis of in vitro protein-DNA interactions using massively parallel sequencing. Nucleic Acids Res., 37(22):e151.

Appendix A

Detailed statement of contributions

The material presented in Chapters 1 and 6 was written by myself and is not published elsewhere. I prepared Figures 1.1 and 1.2. The exemplary ROC and PR curves depicted in Figure 1.3 were taken from the supplementary material of [Maticzka et al., 2014] and can also be found in Supplementary Section B.2.

The material presented in Chapter 2 is based on the publication [Lange et al., 2012], for which I share joint first authorship with Sita J. Saunders née Lange, signifying equal scientific contribution. This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0); the copyright lies with me and my co-authors. While the majority of ideas arose from joint discussions, I focussed on the evaluation of the accessibility data, the analysis of border effects and the implementation of LocalFold; Sita J. Saunders focussed on the evaluation of cis-regulatory elements. Correspondingly, I put more emphasis on the data processing related to Figures 2.3, 2.4, 2.8, B.2, B.3 and B.4, whereas Sita J. Saunders emphasised the data processing related to Figures 2.1, 2.2, 2.6, 2.7 and B.1. Sita J. Saunders and I wrote the majority of the manuscript, large parts of which were also used for this dissertation. As acknowledged in Sita J. Saunders' PhD thesis, this results in some overlap with her dissertation [Saunders, 2014]. Mathias Möhl provided advice on global and local folding algorithms and checked the validity of our theoretical contributions. The set of cis-regulatory structures used in our benchmarks was compiled by Chris Brown; Joshua Gagnon implemented the corresponding webserver. Rolf Backofen devised the structure accuracy measure presented in Section 2.2.

The material presented in Chapter 3 is based on the publication [Ilik et al., 2013]. To this work I contributed the full data analysis for the MLE and MSL2 iCLIP experiments performed by Ibrahim Avsar Ilik. For this purpose, I designed and implemented the iCLIP processing pipeline and the additional processing steps discussed in Section 3.2. In Section 3.3 I report the results of the iCLIP analysis within their biological context and summarise the implications of further experiments performed by Ibrahim Avsar Ilik (GRNA chromatography) and by members of Howard Y. Chang’s lab, namely Jeffrey J. Quinn, Yue Wan and Robert C. Spitale (SHAPE, PARS and corresponding structural analysis). I created Figures 3.1, 3.2, 3.3, 3.4, 3.6, 3.7 A and 3.8 A. Figures 3.7 B and 3.8 B were prepared by Ibrahim Ilik and are included to highlight the structural context of MLE and MSL2 binding. Figures 3.2, 3.3, 3.4, 3.6, 3.7 and 3.8 are reprinted from [Ilik et al., 2013] with permission from Elsevier. I contributed large parts of the figure legends; these are mostly taken from the original publication. The main text of Chapter 3 was written by myself and is not published elsewhere.

The material presented in Chapter 4 is based on the publication [Maticzka et al., 2014], for which I have first authorship. This article is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0); the copyright lies with me and my co-authors. Rolf Backofen and I conceived of the project and designed its overall goals. All co-authors contributed to the writing; however, I wrote significant parts of the manuscript. Figures 4.1, 4.2 and 4.3 contain contributions from Fabrizio Costa and Rolf Backofen; Figures 4.4, 4.5, 4.6, 4.7, 4.8, 4.9 and 4.10 are solely based on my work. I prepared the RNAcompete, CLIP-seq and Ago2 knockdown data sets and conducted all experiments. I developed the GraphProt software and motif representation. The script handling the graph encoding was developed by Steffen Heyne, Sita J. Saunders and me. Sita J. Saunders contributed the implementation for the classification of abstract RNA structure elements. Fabrizio Costa developed the NSPD Kernel and subsequent enhancements. Rolf Backofen conceived of the Ago2 knockdown evaluation and researched literature on RBP binding preferences.

The text of Chapter 5 was written by myself and is not published elsewhere. The material presented in Sections 5.1.1 and 5.2 reports results from the publication [Ferrarese et al., 2014]. The analyses presented in Section 5.3 are not published elsewhere. In Section 5.1.1 I summarize biological experiments and analyses reported in [Ferrarese et al., 2014] to provide additional biological context. These results are solely based on the work of my co-authors; for that reason I refrain from using "we" in this section and cite the source publication where appropriate. I performed all bioinformatics analyses reported in this chapter. Specifically, I collected and processed the PTB CLIP-seq data used for the PTB GraphProt model, I performed all analyses regarding missed binding sites, I remapped the PTB CLIP-seq raw data, I processed and analyzed the PTB next-generation SELEX data, and I developed the algorithm for designing deleterious mutations for binding sites. The idea for the GraphProt-based sequence design arose during joint discussions with Fabrizio Costa, who also gave advice during development. The experimental validation of GraphProt-predicted binding sites presented in Section 5.2.3 was performed by Eva Bug, Roberto Ferrarese and Maria Stella Carro. I prepared all data and conducted the experiments used to prepare Figures 5.2, 5.4, 5.5, 5.6, 5.7, B.9, B.10, B.11, B.12, B.13, B.14, B.15, B.16, B.17 and B.18. The primers used for site-directed mutagenesis shown in Supplementary Table B.8 were designed by Roberto Ferrarese and Maria Stella Carro and are based on the GraphProt-designed sequences prepared by me. Figure 5.1 was prepared by Maria Stella Carro and is included to show the location of GraphProt-predicted binding sites on the ANXA7 minigene. Figure 5.3 shows the effects of GraphProt-designed mutations. The corresponding wet-lab experiments were conducted by Eva Bug, Roberto Ferrarese and Maria Stella Carro. Figures 5.1, 5.2, 5.3, 5.4 and Supplementary Table B.8 were republished with permission of "The Journal of Clinical Investigation", from "Lineage-specific splicing of a brain-enriched alternative exon promotes glioblastoma progression", Ferrarese, R. et al., Volume 124, Issue 7, Copyright 2014 [Ferrarese et al., 2014]; permission conveyed through Copyright Clearance Center, Inc.


Appendix B

Supplementary material

B.1 Chapter 2

Folding parameters

The commands for Rfold and Raccess were

run_rfold -max_pair_dist=L -print_prob=true

and

run_raccess -max_span=L -access_len=1

The execution call for RNAplfold was

RNAplfold -noLP -W W -L L -u 1
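As an illustration only, the RNAplfold call above can be assembled programmatically for a given window size W and span L. The sketch below is not part of the original analysis: the function name `rnaplfold_command` is hypothetical, the flag spelling is copied verbatim from the call quoted above (it may differ between ViennaRNA versions), and the check that L does not exceed W is an assumption about sensible parameter choices.

```python
def rnaplfold_command(window, span, unpaired=1):
    """Return the argument list for the RNAplfold call quoted in the text.

    window   -- W, the size of the sliding folding window
    span     -- L, the maximum base-pair span
    unpaired -- value passed to -u (1 in the text)
    """
    # Assumption: a base pair cannot span more than the window it is folded in.
    if span > window:
        raise ValueError("span L must not exceed window size W")
    # Flags are copied as written in the text, not checked against a manual.
    return ["RNAplfold", "-noLP",
            "-W", str(window),
            "-L", str(span),
            "-u", str(unpaired)]


# Example with W = 150 and L = 100.
cmd = rnaplfold_command(window=150, span=100)
print(" ".join(cmd))
```

The list form can be handed to a process launcher without shell quoting; the input sequence would still need to be supplied on standard input.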


Figure B.1: The median bp-accuracy is shown separately for each of the 95 Rfam families within the CisReg data using sequence contexts of 500 nucleotides. The families are sorted by the maximum base-pair span of their elements, ranging from 15 to 551. This information is more relevant than the actual element length, because it corresponds to the parameter L used. RNAfold only performs better than the other methods when the base-pair span of the structure greatly exceeds the maximum base-pair span parameter L = 150. In general, we see similar trends across most families and no bias due to data redundancy is evident.


Accessibility bias at window ends

Figure B.2: Average accessibilities per window position for the 400 mRNAs used for Figure 2.3, split by GC-content of the windows. While average accessibilities decrease with increasing GC-content, border nucleotides are distinctly more accessible for all instances.

Figure B.3: Average accessibilities per window position for ten random sequences of 15,000 nucleotides ranging in GC-content from 10–100%. Sequences were folded with L = 100 and W = 150 (lower) and W = 100 (upper). Folding of each sequence resulted in 15,000 − W + 1 independent folding windows.
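The window count stated in the caption of Figure B.3 follows from sliding a window of size W one nucleotide at a time over a sequence of length n, which gives n − W + 1 start positions. A minimal sketch of this arithmetic is given below; the helper names `window_count` and `window_bounds` are hypothetical and are not taken from LocalFold or RNAplfold.

```python
def window_count(seq_len, window):
    """Number of full windows of size `window` in a sequence of length `seq_len`."""
    if window < 1 or window > seq_len:
        raise ValueError("window size must be between 1 and the sequence length")
    return seq_len - window + 1


def window_bounds(seq_len, window):
    """Yield 1-based inclusive (start, end) coordinates of every full window."""
    for start in range(1, window_count(seq_len, window) + 1):
        yield start, start + window - 1


# The two window sizes used for Figure B.3 on sequences of 15,000 nucleotides.
for w in (100, 150):
    print(w, window_count(15000, w))
```

For a sequence of 15,000 nucleotides this yields 14,901 windows for W = 100 and 14,851 windows for W = 150.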


Performance of recommended folding parameters

Figure B.4: AUCs for separating high-scored and low-scored nucleotides from the YeastUnpaired data for several window sizes W and span L = 100, using RNAplfold. The comparison on YeastUnpaired in Figure 2.8 was done for several L (fixing W at L + 50). The best result for RNAplfold was reached using parameters L = 100 and W = 150. This W is optimal for this span with RNAplfold.


B.2 Chapter 4

Source publications for CLIP-seq sets

sets downloaded from doRiNA: The doRiNA [Anders et al., 2012] database is available at http://dorina.mdc-berlin.de. We used binding sites for hg19 located at http://dorina.mdc-berlin.de/rbp_browser/download_hg19.html. Additional information on the preparation of the individual tracks is available via the doRiNA web frontend at http://dorina.mdc-berlin.de/rbp_browser/hg19.html.

• Ago2 HITS-CLIP [Kishore et al., 2011]

• ELAVL1 PAR-CLIP (A) & HITS-CLIP [Kishore et al., 2011]

• ELAVL1 PAR-CLIP (B) [Lebedeva et al., 2011]

• ELAVL1 PAR-CLIP (C) [Mukherjee et al., 2011]

• HNRNPC iCLIP [König et al., 2010]

• MOV10 PAR-CLIP [Sievers et al., 2012]

• SFRS1 CLIP-seq [Sanford et al., 2009]

• TDP-43 iCLIP [Tollervey et al., 2011]

• TIA1 & TIAL1 iCLIP [Wang et al., 2010]

• EWSR1, FUS & TAF15 PAR-CLIP [Hoell et al., 2011]

• Ago1-4, IGF2BP1-3, PUM2 & QKI PAR-CLIP [Hafner et al., 2010]

• ALKBH5, C17ORF85, C22ORF28, CAPRIN1, ZC3H7B PAR-CLIP [Baltz et al., 2012]

other sources:

• PTB HITS-CLIP [Xue et al., 2009], GEO accession number [GSE19323]


CLIP cross-validation and RNAcompete validation results

Table B.1 shows the results of the CLIP-seq 10-fold cross-validations (AUC and APR); Table B.2 shows the results of the RNAcompete evaluations (AUC and APR).

Table B.1: CLIP-seq cross-validation results

Dataset                GraphProt     RNAcontext    MatrixReduce
                       APR    AUC    APR    AUC    APR    AUC
ALKBH5 PAR-CLIP        0.669  0.680  0.585  0.600  0.527  0.537
C17ORF85 PAR-CLIP      0.775  0.800  0.670  0.695  0.377  0.303
C22ORF28 PAR-CLIP      0.746  0.751  0.676  0.671  0.518  0.545
CAPRIN1 PAR-CLIP       0.851  0.855  0.635  0.650  0.415  0.352
Ago2 HITS-CLIP         0.756  0.765  0.715  0.732  0.381  0.264
ELAVL1 HITS-CLIP       0.940  0.955  0.943  0.958  0.938  0.954
SFRS1 HITS-CLIP        0.898  0.898  0.842  0.833  0.837  0.828
HNRNPC iCLIP           0.947  0.952  0.947  0.951  0.926  0.933
TDP43 iCLIP            0.895  0.874  0.864  0.828  0.813  0.780
TIA1 iCLIP             0.842  0.861  0.837  0.855  0.807  0.817
TIAL1 iCLIP            0.819  0.833  0.819  0.833  0.804  0.815
Ago1-4 PAR-CLIP        0.906  0.895  0.730  0.721  0.406  0.275
ELAVL1 PAR-CLIP (B)    0.935  0.935  0.918  0.923  0.895  0.903
ELAVL1 PAR-CLIP (A)    0.951  0.959  0.953  0.962  0.942  0.952
EWSR1 PAR-CLIP         0.942  0.935  0.936  0.935  0.901  0.912
FUS PAR-CLIP           0.970  0.968  0.953  0.954  0.931  0.938
ELAVL1 PAR-CLIP (C)    0.992  0.991  0.972  0.974  0.950  0.957
IGF2BP1-3 PAR-CLIP     0.901  0.889  0.792  0.778  0.539  0.519
MOV10 PAR-CLIP         0.853  0.863  0.715  0.750  0.682  0.717
PUM2 PAR-CLIP          0.958  0.954  0.917  0.906  0.887  0.879
QKI PAR-CLIP           0.971  0.957  0.964  0.945  0.959  0.942
TAF15 PAR-CLIP         0.973  0.970  0.969  0.967  0.942  0.950
PTB HITS-CLIP          0.925  0.937  0.863  0.875  0.828  0.839
ZC3H7B PAR-CLIP        0.813  0.820  0.613  0.636  0.435  0.405


Table B.2: RNAcompete validation results

Dataset   GraphProt     RNAcontext
          APR    AUC    APR    AUC
Fusip     0.841  0.983  0.515  0.895
ELAVL1    0.978  0.999  0.832  0.994
PTB       0.379  0.892  0.390  0.906
RBM4      0.928  0.996  0.773  0.984
SFRS1     0.909  0.993  0.671  0.927
SLM2      0.820  0.990  0.550  0.976
U1A       0.657  0.935  0.478  0.871
VTS1      0.647  0.949  0.577  0.956
YB1       0.374  0.897  0.057  0.661
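For readers who want to relate the two measures reported in Tables B.1 and B.2, the following Python sketch computes both from a list of binary labels and prediction scores. It is a plain illustration, not the evaluation code used for the tables: the AUC is computed as the probability that a positive outranks a negative, and the APR is approximated by the average precision at each positive, which is an assumption about the interpolation. Function names are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve: P(positive scores above negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(round(roc_auc(labels, scores), 3))            # 0.889
print(round(average_precision(labels, scores), 3))  # 0.917
```

The pairwise AUC computation is quadratic in the number of instances and is meant for small examples only. With tied scores the average precision depends on the order of the tied instances.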


ROC and PR curves of GraphProt models

[Figure panels: ROC curves (true positive rate against false positive rate) comparing GraphProt and RNAcontext, one panel for each of the 24 CLIP-seq sets: ALKBH5 PAR-CLIP, C17ORF85 PAR-CLIP, C22ORF28 PAR-CLIP, CAPRIN1 PAR-CLIP, Ago2 HITS-CLIP, ELAVL1 HITS-CLIP, SFRS1 HITS-CLIP, HNRNPC iCLIP, TDP43 iCLIP, TIA1 iCLIP, TIAL1 iCLIP, Ago1-4 PAR-CLIP, ELAVL1 PAR-CLIP (A), ELAVL1 PAR-CLIP (B), EWSR1 PAR-CLIP, FUS PAR-CLIP, ELAVL1 PAR-CLIP (C), IGF2BP1-3 PAR-CLIP, MOV10 PAR-CLIP, PUM2 PAR-CLIP, QKI PAR-CLIP, TAF15 PAR-CLIP, PTB HITS-CLIP and ZC3H7B PAR-CLIP.]

[Figure panels: precision-recall curves (precision against recall) comparing GraphProt and RNAcontext, one panel for each of the nine RNAcompete sets: Fusip, ELAVL1, PTB, RBM4, SFRS1, SLM2, U1A, VTS1 and YB1.]


Ago2 knockdown analysis

Figure B.5 shows the full distributions of Ago2 binding-site hits corresponding to Figure 4.10 B. Figures B.6 and B.7 show additional analyses on microRNA target prediction corresponding to Figure 4.10 A and B.


Figure B.5: Full distribution of the number of binding-site hits per 3’-UTR as depicted in Figure 4.10, comparing high-scoring GraphProt predictions and HITS-CLIP sites.


Figure B.6: Number of 3’-UTRs with at least one Ago2 binding-site hit. microRNA seed hits (“top miRNA seeds”) were calculated for seeds AAAGUGC, GUAAACA and AUAAAGU as described by Schmitter and colleagues [Schmitter et al., 2006]. PicTar 2.0 microRNA predictions were downloaded from doRiNA [Anders et al., 2012]. In all cases, any overlapping or bookended sites were merged prior to counting. Asterisks indicate a statistically significant increase (t-test, *: p < 0.05, **: p < 0.001).


Figure B.7: Number of binding-site hits per 3’-UTR. microRNA seed hits (“top miRNA seeds”) were calculated for seeds AAAGUGC, GUAAACA and AUAAAGU as described by Schmitter and colleagues [Schmitter et al., 2006]. PicTar 2.0 microRNA predictions were downloaded from doRiNA [Anders et al., 2012]. In all cases, any overlapping or bookended sites were merged prior to counting. Asterisks indicate a statistically significant increase (Wilcoxon rank sum test, *: p < 0.05, **: p < 0.001).
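The merging of overlapping or bookended sites mentioned in the captions of Figures B.6 and B.7 can be sketched as follows. This Python example assumes that all sites lie on the same 3’-UTR and strand and are given as half-open intervals (start, end), so that two sites are bookended when the end of one equals the start of the next. The function name is hypothetical and the code is an illustration, not the procedure actually used.

```python
def merge_sites(sites):
    """Merge overlapping or bookended half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(sites):
        if merged and start <= merged[-1][1]:
            # overlaps or directly touches the previous site: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(site) for site in merged]


sites = [(10, 20), (20, 30), (25, 40), (50, 60)]
print(merge_sites(sites))       # [(10, 40), (50, 60)]
print(len(merge_sites(sites)))  # 2 hits counted for this UTR
```

Merging before counting prevents a single binding region covered by several predictions from being counted more than once.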


Parameters used for GraphProt, RNAcontext and MatrixREDUCE

GraphProt parameters selected for the CLIP-seq models are shown in Tables B.3 (sequence models) and B.4 (structure models). GraphProt parameters selected for the RNAcompete models are shown in Table B.5 (structure models). Motif lengths chosen for RNAcontext models are shown in Table B.6; motif lengths chosen for MatrixREDUCE models are shown in Table B.7.

Table B.3: Parameters fitted for GraphProt CLIP-seq sequence models.

protein                R  D  b   EPOCHS  LAMBDA
ALKBH5 PAR-CLIP        1  2  16  20      0.0001
C17ORF85 PAR-CLIP      2  6  14  30      0.001
C22ORF28 PAR-CLIP      1  3  16  30      0.001
CAPRIN1 PAR-CLIP       1  4  16  40      0.001
Ago2 HITS-CLIP         1  3  16  40      0.001
ELAVL1 HITS-CLIP       1  6  18  40      0.001
SFRS1 HITS-CLIP        3  6  18  40      0.001
HNRNPC iCLIP           3  0  14  10      0.001
TDP43 iCLIP            1  3  14  30      0.001
TIA1 iCLIP             1  6  16  30      0.001
TIAL1 iCLIP            3  4  18  40      0.001
Ago1-4 PAR-CLIP        3  4  18  30      0.001
ELAVL1 PAR-CLIP (B)    0  6  14  10      0.001
ELAVL1 PAR-CLIP (A)    2  6  16  10      0.001
EWSR1 PAR-CLIP         1  2  16  50      0.001
FUS PAR-CLIP           1  1  16  40      0.0001
ELAVL1 PAR-CLIP (C)    0  1  14  50      1e-05
IGF2BP1-3 PAR-CLIP     1  3  14  40      0.001
MOV10 PAR-CLIP         4  2  16  20      0.001
PUM2 PAR-CLIP          4  4  16  40      0.001
QKI PAR-CLIP           4  6  16  50      0.001
TAF15 PAR-CLIP         3  2  14  50      0.001
PTB HITS-CLIP          1  6  14  50      0.001
ZC3H7B PAR-CLIP        2  2  14  30      1e-05


Table B.4: Parameters fitted for GraphProt CLIP-seq structure models.

protein                ABSTRACTION  R  D  b   EPOCHS  LAMBDA
ALKBH5 PAR-CLIP        1            4  2  14  40      0.001
C17ORF85 PAR-CLIP      5            4  1  18  50      0.001
C22ORF28 PAR-CLIP      5            3  0  18  50      0.001
CAPRIN1 PAR-CLIP       5            4  0  16  40      0.0001
Ago2 HITS-CLIP         1            3  0  18  30      0.001
ELAVL1 HITS-CLIP       3            1  5  14  20      0.0001
SFRS1 HITS-CLIP        3            1  2  14  20      0.0001
HNRNPC iCLIP           5            3  1  18  30      0.001
TDP43 iCLIP            3            3  1  18  10      0.001
TIA1 iCLIP             5            2  0  16  30      0.001
TIAL1 iCLIP            1            3  1  18  20      0.001
Ago1-4 PAR-CLIP        1            4  1  18  20      1e-06
ELAVL1 PAR-CLIP (B)    3            1  2  14  30      0.0001
ELAVL1 PAR-CLIP (A)    1            3  3  18  50      0.0001
EWSR1 PAR-CLIP         3            3  0  16  30      0.001
FUS PAR-CLIP           5            4  0  16  50      0.0001
ELAVL1 PAR-CLIP (C)    5            3  0  18  50      1e-07
IGF2BP1-3 PAR-CLIP     5            4  0  16  50      0.0001
MOV10 PAR-CLIP         5            4  0  18  40      0.0001
PUM2 PAR-CLIP          3            3  2  14  20      1e-05
QKI PAR-CLIP           3            4  0  18  20      0.001
TAF15 PAR-CLIP         5            1  3  18  50      0.0001
PTB HITS-CLIP          1            3  0  18  20      0.001
ZC3H7B PAR-CLIP        5            4  0  18  30      0.0001


Table B.5: Parameters fitted for GraphProt RNAcompete models.

protein               ABSTRACTION  R  D  b   c    e
Fusip data full A     1            2  4  14  1    0.1
Fusip data full B     1            2  3  14  1    0.1
ELAVL1 data full A    5            1  6  14  1    0.1
ELAVL1 data full B    3            2  5  14  1    0.1
PTB data full A       5            2  6  14  1    1
PTB data full B       5            2  2  14  0.1  0.1
RBM4 data full A      5            2  2  14  1    0.1
RBM4 data full B      3            3  3  14  1    0.1
SFRS1 data full A     5            2  4  14  1    0.1
SFRS1 data full B     5            3  3  14  1    0.1
SLM2 data full A      5            2  3  14  1    0.1
SLM2 data full B      5            3  4  14  1    0.1
U1A data full A       3            4  4  14  1    0.1
U1A data full B       3            4  4  14  1    0.1
VTS1 data full A      3            3  3  14  1    0.1
VTS1 data full B      5            4  4  14  1    0.1
YB1 data full A       5            3  5  14  0.1  0.1
YB1 data full B       1            4  4  14  0.1  0.01


Table B.6: Motif lengths chosen for RNAcontext models.

protein                motif length
Fusip A                10
Fusip B                11
HuR A                  11
HuR B                  10
PTB A                  7
PTB B                  8
RBM4 A                 7
RBM4 B                 7
SF2 A                  5
SF2 B                  5
SLM2 A                 10
SLM2 B                 12
U1A A                  12
U1A B                  12
VTS1 A                 7
VTS1 B                 7
YB1 A                  4
YB1 B                  9
ALKBH5 PAR-CLIP        7
C17ORF85 PAR-CLIP      6
C22ORF28 PAR-CLIP      4
CAPRIN1 PAR-CLIP       6
Ago2 HITS-CLIP         4
ELAVL1 HITS-CLIP       7
SFRS1 HITS-CLIP        10
HNRNPC iCLIP           4
TDP43 iCLIP            7
TIA1 iCLIP             12
TIAL1 iCLIP            12
Ago1-4 PAR-CLIP        7
ELAVL1 PAR-CLIP (B)    8
ELAVL1 PAR-CLIP (A)    6
EWSR1 PAR-CLIP         7
FUS PAR-CLIP           8
ELAVL1 PAR-CLIP (C)    8
IGF2BP1-3 PAR-CLIP     7
MOV10 PAR-CLIP         4
PUM2 PAR-CLIP          7
QKI PAR-CLIP           9
TAF15 PAR-CLIP         9
PTB HITS-CLIP          8
ZC3H7B PAR-CLIP        4

Table B.7: Motif lengths chosen for MatrixREDUCE models.

protein                motif length per cross-validation split
                       1  2  3  4  5  6  7  8  9  10
ALKBH5 PAR-CLIP        4  4  4  4  4  5  4  4  4  4
C17ORF85 PAR-CLIP      4  4  4  4  4  4  4  4  4  4
C22ORF28 PAR-CLIP      4  4  4  4  4  4  4  4  4  4
CAPRIN1 PAR-CLIP       4  4  4  4  4  4  4  4  4  4
Ago2 HITS-CLIP         4  4  4  4  4  4  4  4  4  4
ELAVL1 HITS-CLIP       4  4  4  4  4  4  4  4  4  4
SFRS1 HITS-CLIP        4  4  4  4  4  4  4  4  4  4
HNRNPC iCLIP           4  4  4  4  4  4  4  4  4  4
TDP43 iCLIP            4  4  4  4  4  4  4  4  4  4
TIA1 iCLIP             4  4  4  4  4  4  4  4  4  4
TIAL1 iCLIP            4  4  4  4  4  4  4  4  4  4
Ago1-4 PAR-CLIP        4  4  4  4  4  4  4  4  4  4
ELAVL1 PAR-CLIP (A)    4  4  4  4  4  4  4  4  4  4
ELAVL1 PAR-CLIP (B)    4  4  4  4  4  4  4  4  4  4
EWSR1 PAR-CLIP         4  4  4  4  4  4  4  4  4  4
FUS PAR-CLIP           4  4  4  4  4  4  4  4  4  4
ELAVL1 PAR-CLIP (C)    4  4  4  4  4  4  4  4  4  4
IGF2BP1-3 PAR-CLIP     4  4  4  4  4  4  4  4  4  4
MOV10 PAR-CLIP         4  4  4  4  4  4  4  4  4  4
PUM2 PAR-CLIP          4  4  4  4  4  4  4  4  4  4
QKI PAR-CLIP           4  4  4  4  4  4  4  4  4  4
TAF15 PAR-CLIP         4  4  4  4  4  4  4  4  4  4
PTB HITS-CLIP          4  4  4  4  4  4  4  4  4  4
ZC3H7B PAR-CLIP        4  4  4  4  4  4  4  4  4  4


GraphProt motifs

In this section we show the GraphProt sequence and structure motifs for all CLIP-seq sets. Structure motifs are annotated with the full set of structure elements: stems (S), external regions (E), hairpins (H), internal loops (I), multiloops (M) and bulges (B). Accessibility motifs are simplified representations of the full structure motifs and only distinguish paired (P) and unpaired/accessible (U) nucleotides.
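The reduction from the six structure elements to the two accessibility states can be written as a simple lookup. The following Python sketch assumes that stems are the only paired element and that all loop types and external regions count as unpaired; this mapping is inferred from the description above, and the names are hypothetical.

```python
# Assumed mapping: stems are paired, every other element is unpaired.
STRUCTURE_TO_ACCESSIBILITY = {
    "S": "P",  # stem
    "E": "U",  # external region
    "H": "U",  # hairpin loop
    "I": "U",  # internal loop
    "M": "U",  # multiloop
    "B": "U",  # bulge
}


def to_accessibility(annotation):
    """Collapse a per-nucleotide structure annotation to paired/unpaired."""
    return "".join(STRUCTURE_TO_ACCESSIBILITY[element] for element in annotation)


print(to_accessibility("EESSSHHHHSSSEE"))  # UUPPPUUUUPPPUU
```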

[Motif figures: for each of the 24 CLIP-seq sets, a sequence motif, a structure motif and an accessibility motif are shown. Sets in order: ALKBH5 PAR-CLIP, C17ORF85 PAR-CLIP, C22ORF28 PAR-CLIP, CAPRIN1 PAR-CLIP, Ago2 HITS-CLIP, ELAVL1 HITS-CLIP, SFRS1 HITS-CLIP, HNRNPC iCLIP, TDP43 iCLIP, TIA1 iCLIP, TIAL1 iCLIP, Ago1-4 PAR-CLIP, ELAVL1 PAR-CLIP (A), ELAVL1 PAR-CLIP (B), EWSR1 PAR-CLIP, FUS PAR-CLIP, ELAVL1 PAR-CLIP (C), IGF2BP1-3 PAR-CLIP, MOV10 PAR-CLIP, PUM2 PAR-CLIP, QKI PAR-CLIP, TAF15 PAR-CLIP, PTB HITS-CLIP and ZC3H7B PAR-CLIP.]


B.3 Chapter 5

Mutant: ANXA7 M1
Mutated sequence: tctttttcttcctatcctTttttcGctccGgtttAtttg
Primers:
Fwd 5'-aatacagattctttttcttcctatccttttttcgctccggtttatttggattatagcagtgaagtgagtaa-3'
Rev 5'-ttactcacttcactgctataatccaaataaaccggagcgaaaaaaggataggaagaaaaagaatctgtatt-3'

Mutant: ANXA7 M2
Mutated sequence: tgtttctgattctcatgtacttgtctctcactacGtGAt
Primers:
Fwd 5'-ctcatgtacttgtctctcactacgtgataagggctcacctgtatctattt-3'
Rev 5'-aaatagatacaggtgagcccttatcacgtagtgagagacaagtacatgag-3'

Mutant: ANXA7 M3
Mutated sequence: tctagtctatatctGctgcctActagaccagcggtctctctct
Primers:
Fwd 5'-agaagattctagtctatatctgctgcctactagaccagcggtct-3'
Rev 5'-agaccgctggtctagtaggcagcagatatagactagaatcttct-3'

Mutant: ANXA7 M4
Mutated sequence: aaagtGtttacttctgatagactccatactttctttggt
Primers:
Fwd 5'-ttatagtaatggaagactaaagtgtttacttctgatagactccatac-3'
Rev 5'-gtatggagtctatcagaagtaaacactttagtcttccattactataa-3'

Mutant: ANXA7 M5
Mutated sequence: acctggggtcttggttttagttAtttttcctcGtccAGg
Primers:
Fwd 5'-ttaacctggggtcttggttttagttatttttcctcgtccaggcccttaccagca-3'
Rev 5'-tgctggtaagggcctggacgaggaaaaataactaaaaccaagaccccaggttaa-3'

Mutant: ANXA7 M6 (2 steps required)
Mutated sequence: gatggtagtttTtttAgccgtgcagaagctAtttggttt
Primers:
Fwd 5'-cctgttcactctgatggtagtttttttagccgtgcagaag-3'
Rev 5'-cttctgcacggctaaaaaaactaccatcagagtgaacagg-3'
Fwd' 5'-cttttgccgtgcagaagctatttggtttaattagatccca-3'
Rev' 5'-tgggatctaattaaaccaaatagcttctgcacggcaaaag-3'

Mutant: ANXA7 M7 (2 steps required)
Mutated sequence: tccttgccgatgcttatgtcctgaatggtattgcctaggttttTttAtagggtttttatggGtttagatctaacattta
Primers:
Fwd 5'-cttatgtcctgaatggtattgcctaggtttttttatagggtttttatggttt-3'
Rev 5'-aaaccataaaaaccctataaaaaaacctaggcaataccattcaggacataag-3'
Fwd' 5'-ctaggttttcttctagggtttttatgggtttagatctaacatttaagtctttaat-3'
Rev' 5'-attaaagacttaaatgttagatctaaacccataaaaaccctagaagaaaacctag-3'

Mutant: ANXA7 M8
Mutated sequence: cagttgtctcaatgatgCccAtttGcAggAGcaggatca
Primers:
Fwd 5'-tcagtttcaccagttgtctcaatgatgcccatttgcaggagcaggatcaaatccagaatgcaatgttg-3'
Rev 5'-caacattgcattctggatttgatcctgctcctgcaaatgggcatcattgagacaactggtgaaactga-3'

Mutant: ANXA7 M9
Mutated sequence: ttcttcctatccatgagAatggGatgtAcAtgttAAtcc
Primers:
Fwd 5'-attttcacgatattgattcttcctatccatgagaatgggatgtacatgttaatccatttgtttgtgtccttttttatttcgttga-3'
Rev 5'-tcaacgaaataaaaaaggacacaaacaaatggattaacatgtacatcccattctcatggataggaagaatcaatatcgtgaaaat-3'

Mutant: ANXA7 M10
Mutated sequence: ttAtGcctatccatgagcatggaatgttcttgttcttcc
Primers:
Fwd 5'-gcagtatggccattttcacgatattgattatgcctatccatgagcat-3'
Rev 5'-atgctcatggataggcataatcaatatcgtgaaaatggccatactgc-3'

Mutant: ANXA7 M11
Mutated sequence: gcttcctatgtattttattAtAtttgaagcgattgtgag
Primers:
Fwd 5'-ccttgtaagttggcttcctatgtattttattatatttgaagcgattgtgag-3'
Rev 5'-ctcacaatcgcttcaaatataataaaatacataggaagccaacttacaagg-3'

Figure B.8: Mutated sequences and primers used for site-directed mutagenesis. Mutated nucleotides are set in capital letters.
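Two properties of the listing can be checked mechanically: mutated nucleotides are marked by capital letters, and each reverse primer is the reverse complement of its forward primer. The following Python sketch demonstrates both for mutant ANXA7 M2, using the sequences exactly as listed; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence, preserving letter case."""
    return seq.translate(COMPLEMENT)[::-1]


def mutated_positions(seq):
    """0-based positions of mutated (upper-case) nucleotides."""
    return [i for i, base in enumerate(seq) if base.isupper()]


m2_mutated = "tgtttctgattctcatgtacttgtctctcactacGtGAt"
m2_fwd = "ctcatgtacttgtctctcactacgtgataagggctcacctgtatctattt"
m2_rev = "aaatagatacaggtgagcccttatcacgtagtgagagacaagtacatgag"

print(mutated_positions(m2_mutated))         # [34, 36, 37]
print(reverse_complement(m2_fwd) == m2_rev)  # True
```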


Figure B.9: Visualization of candidate mutations for the validation of predicted PTB binding site S1. The sequence of the predicted binding site is shown on top. The corresponding bar chart shows the nucleotide-wise scores according to the PTB GraphProt model for the wild-type sequence; the bar chart at the bottom shows the scores of the candidate sequence incorporating seven mutations. The heatmap visualizes the effect of binding-site mutations according to the GraphProt model. Scores of individual nucleotides are color-coded. High scores, indicating likely PTB binding, are colored dark blue. Middle and low scores, indicating a reduced likelihood of PTB binding, are colored green (midrange) and white (low). Selected mutations are superimposed on the heatmap. Mutations M1 selected for incorporation into the wild-type minigene are printed in bold red. Visualization: Integrative Genomics Viewer (IGV) [Robinson et al., 2011].
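The kind of mutation scan that underlies such a heatmap can be sketched generically: every single-nucleotide substitution of a site is scored and compared with the wild type. In the Python sketch below, `toy_score` is a deliberately simple stand-in (the fraction of pyrimidines) and is not the GraphProt model; the function names and the example sequence are assumptions for illustration only.

```python
def single_mutation_scan(sequence, score):
    """Score change of every single-nucleotide substitution.

    Returns {(position, new_base): score(mutant) - score(wild type)}.
    """
    wild_type = score(sequence)
    effects = {}
    for pos, old in enumerate(sequence):
        for new in "ACGT":
            if new != old:
                mutant = sequence[:pos] + new + sequence[pos + 1:]
                effects[(pos, new)] = score(mutant) - wild_type
    return effects


def toy_score(sequence):
    """Toy stand-in for a model score: fraction of pyrimidines. NOT GraphProt."""
    return sum(base in "CT" for base in sequence) / len(sequence)


effects = single_mutation_scan("TCTTCA", toy_score)
strongest = min(effects.values())  # largest score reduction of any substitution
print(len(effects), round(strongest, 3))
```

With a real model in place of `toy_score`, the substitutions with the most negative effect would be the natural candidates for disrupting a predicted binding site.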


Figure B.10: Visualization of candidate mutations for the validation of predicted PTB binding site S2. Layout as described in Figure B.9. Mutations M2 selected for incorporation into the wild-type minigene are printed bold red.


Figure B.11: Visualization of candidate mutations for the validation of predicted PTB binding site S3. Layout as described in Figure B.9. Mutations M3 selected for incorporation into the wild-type minigene are printed bold red.


Figure B.12: Visualization of candidate mutations for the validation of predicted PTB binding site S4. Layout as described in Figure B.9. Mutations M4 selected for incorporation into the wild-type minigene are printed bold red.


Figure B.13: Visualization of candidate mutations for the validation of predicted PTB binding site S5. Layout as described in Figure B.9. Mutations M5 selected for incorporation into the wild-type minigene are printed bold red.


Figure B.14: Visualization of candidate mutations for the validation of predicted PTB binding site S6. Layout as described in Figure B.9. Mutations M6 selected for incorporation into the wild-type minigene are printed bold red.


Figure B.15: Visualization of candidate mutations for the validation of predicted PTB binding site S7. Layout as described in Figure B.9. Mutations M7 selected for incorporation into the wild-type minigene are printed bold red.


Figure B.16: Visualization of candidate mutations for the validation of predicted PTB binding site S8. Layout as described in Figure B.9. Mutations M8 selected for incorporation into the wild-type minigene are printed bold red.


Figure B.17: Visualization of candidate mutations for the validation of predicted PTB binding sites S9 and S10. Layout as described in Figure B.9. Mutations M9 and M10, corresponding to sites S9 and S10 respectively, selected for incorporation into the wild-type minigene are printed bold red.


Figure B.18: Visualization of candidate mutations for the validation of predicted PTB binding site S11. Layout as described in Figure B.9. Mutations M11 selected for incorporation into the wild-type minigene are printed bold red.
