Developing computational approach to predict coincidence of structural and primary signatures in mRNAs

Yang Zhang

Computer Science

McGill University

Montreal, Quebec

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of

Master of Science

15-08-2014

Copyright © 2014 by Yang Zhang

I would like to dedicate this thesis to my loving parents and my supportive fiancé ...

Acknowledgements

I would like to express my deep gratitude to my master's supervisors, Dr. Jérôme Waldispühl and Dr. Éric Lecuyer. They have been tremendous mentors to me. I would like to thank them for encouraging my research and for allowing me to grow as a research scientist. I also thank them for many insightful conversations during the development of the projects. Special thanks go to Yann Ponty and Mathieu Blanchette, who also supervised me throughout the project. Over these two years, many friends helped to color my life. I also acknowledge all my colleagues in Jérôme's and Éric's labs for their assistance in more ways than I can list in the space available. For financial support, I thank the Institut de recherches cliniques de Montréal (IRCM) for the internal scholarship throughout the two years.

Abstract

RNA binding proteins (RBPs) are known to interact with cis-regulatory motifs present in mRNAs to perform biological functions. However, little is known about RBP binding sites. Predicting RBP binding sites from sequence alone is therefore crucial to understanding RNA function. Such a prediction draws on three components: structural property prediction, primary motif search and evolutionary information. In this thesis, we introduce an approach that considers structural properties, primary motifs and evolutionary conservation at the same time.

Abrégé

Les protéines de liaison aux ARNs (RNA Binding Proteins ; RBPs) sont connues pour interagir avec des motifs cis-régulateurs présents dans les ARNm pour accomplir des fonctions biologiques. Cependant, peu est connu sur les sites de liaison des RBPs. Par conséquent, la prédiction des sites des RBPs à partir de leur séquence est indispensable pour la fonction des ARNs. La prédiction se base sur trois facteurs : la prédiction de propriétés structurales, la recherche de motifs primaires et l’information évolutive. Dans cette thèse, nous présentons notre approche, qui considère les propriétés structurales, les motifs primaires ainsi que l’information sur la conservation évolutive, le tout en même temps.

Contribution of authors

Yang Zhang performed the research described in this thesis. She implemented the full program and conducted the experiments for SPARCS. She wrote the paper [1] in collaboration with Yann Ponty (co-author), Mathieu Blanchette, Éric Lecuyer and Jérôme Waldispühl. Jérôme Waldispühl and Éric Lecuyer provided guidance. Yang Zhang also designed and implemented the full DroDARScan website. She is preparing the manuscript for DroDARScan with Quaid Morris, Mathieu Blanchette, Éric Lecuyer and Jérôme Waldispühl. Jérôme Waldispühl and Éric Lecuyer provided guidance.

Contents

List of Figures

1 Introduction

2 Background
  2.1 RNA structure
  2.2 RNA Binding Protein (RBP)
    2.2.1 Position Weight Matrix (PWM)

3 Methods
  3.1 Structural profile prediction
    3.1.1 Method overview
    3.1.2 Multivariate Boltzmann sampling
      3.1.2.1 Weighted distribution
      3.1.2.2 Self-adaptive calibration of weights
      3.1.2.3 Random generation
    3.1.3 Secondary structure prediction
    3.1.4 Characterization of the structural profile
  3.2 Primary motif prediction
    3.2.1 Cis-bp RNA database
    3.2.2 Motif searching
    3.2.3 Binding sites clustering
  3.3 Evolutionary conservation
  3.4 DroDARScan database

4 Discussion
  4.1 DroDARScan Dataset
  4.2 Implementation
    4.2.1 SPARCS
    4.2.2 DroDARScan
  4.3 Time and space requirement
  4.4 Limitation of the work
  4.5 Future work

5 Results
  5.1 SPARCS
    5.1.1 Analysis of Ash1 gene in yeast
  5.2 DroDARScan
    5.2.1 Analysis of FMR1 in Drosophila

6 Conclusions

References

List of Figures

2.1 A typical secondary structure (stemloop) of RNA.
2.2 An illustration of RBPs’ roles.
2.3 An example PWM of RBP ELAVL1.

3.1 Entropy comparison between sequences generated by DiCodonShuffle [2] and our probabilistic shuffling method. For both methods, 1000 sequences are generated and, for our approach, the relative tolerance was set to $\varepsilon = 10^{-1}$. Sequences produced using DiCodonShuffle show much less diversity than those generated using our approach, either indicating a substantially limited accessibility of compatible sequences, or a substantial bias (non-stationarity) due to the bounded nature of their random walk.
3.2 Impact of weighted distribution on the number of occurrences of dinucleotides AU and GU. Either in the uniform distribution (π(XY) = 1, blue), or setting larger weights to AU (π(AU) = 10, red) or GU (π(GU) = 10, green), 100 000 sequences compatible with an mRNA sequence encoding 179 amino acids (the first two exons of the oskar gene in D. melanogaster) were randomly generated. The concentration of the distribution, and the shift in expected DF observed for different weights, are the key ingredients of our method, allowing for an efficient approach based on adaptive sampling.

4.1 An example of the Jqplot graphical representation of a gene.

5.1 SPARCS interface.
5.2 Typical runtime of SPARCS for sequence lengths varying from 200 to 1000 nts.
5.3 Analysis of the protein-coding region of the ASH1 gene in yeast. The Z-scores of the base pair probability are represented in magenta and those of the base pair entropy in red. Structured, unstructured and disordered regions are displayed in green, blue and orange, and the functional elements E1, E2A and E2B are indicated at the bottom of the figure with yellow boxes. Dashed lines show the thresholds for determining high or low Z-score values.
5.4 Genes annotated with Alzheimer’s disease.
5.5 Web interface of DroDARScan.
5.6 A sample output page from DroDARScan.
5.7 Resubmit option from DroDARScan.
5.8 Summary section of a gene in DroDARScan.
5.9 Target sites mapping of RBP bru-3 in gene FMR1 from UCSC genome browser.

Chapter 1

Introduction

RNA binding proteins (RBPs) play important roles in the regulation of many post-transcriptional events in gene expression and are key components of RNA metabolism. RBPs can influence the structure of RNAs through interactions with cis-regulatory elements and therefore have crucial roles in the stability, function, transport and localization of RNAs. Such cis-regulatory elements, which are targeted by trans-regulatory molecules such as microRNAs and RBPs, can reside anywhere in the mRNA molecule. They are typically defined by their secondary structure and primary sequence characteristics.

Eukaryotic cells encode a large number of RBPs. Each RBP has its own binding preference for its targets, in terms of both structure and primary sequence. However, only a small fraction of RBPs have been characterized so far. A recent study reported a systematic way of detecting the RNA motifs recognized by a broad range of RBPs [3]. The authors found that the vast majority of RBPs tend to bind single-stranded RNA and thus do not require a specific RNA secondary structure. Building on this information, we developed analysis tools that take an mRNA of interest and decipher its regulatory logic, considering both structural and sequence features.

To detect the structural properties of an mRNA molecule, we first need to consider the difference between coding sequences and UTR sequences. Many published tools aim to discover the structural landscape of UTR regions. However, none of them considers the special characteristics of the coding region of the mRNA. Recent studies suggest that coding regions of messenger RNAs often include secondary structure elements involved in post-transcriptional regulatory processes [4; 5; 6]. While many programs have been developed to analyze the folding properties of large non-coding RNAs [7] or untranslated regions of mRNAs [8], these tools cannot be directly applied to study structural properties in coding regions. Indeed, the sequence of codons that specifies the amino acid chain might bias the thermodynamic folding properties of the polynucleotide, thus preventing an accurate estimate of the statistical significance of local structural motifs. Similar issues are encountered in large-scale studies and techniques aiming at defining RNA structure characteristics on a genome-wide scale [9; 10].

Assessing the statistical significance of observed phenomena or patterns requires the definition of a reliable and expressive background model (a.k.a. the null hypothesis). In particular, any sequence property that is a natural consequence of a well-understood mechanism should be captured by the background model, so that it will generically appear in random sequences. Including these features in the background model will lead to an increased statistical significance for novel phenomena. A classic exploratory approach starts with a random generation of sequences that share similar properties with a reference set of sequences.
Various metrics can then be evaluated, possibly leading to diverging distributions of values within the random and reference sets. The significance of such an observation can be empirically assessed using classic statistical tools (Z-score, P-value, ...).

To implement such an approach in the context of mRNAs, one must restrict random sequences to synonymous sequences (i.e. the set of sequences that encode the same amino acid sequence). Such sequences can trivially be generated, uniformly at random, by simply choosing, for each amino acid, one of its alternate codons. Another constraint, essential when analyzing structural properties of RNA molecules, is the preservation of the overall dinucleotide frequencies (DF). Such a constraint has been popular in the field of RNA following the study of Workman and Krogh [11], and builds on the rationale that preserving the DF maintains the feasibility of stacking base pairs, arguably the main contributor to RNA stability. Efficient methods have been proposed for such a model, drawing an analogy with the random generation of an Eulerian path in a De Bruijn-like graph, whose edges represent the dinucleotides [12; 13].

When attempting to infer an evolutionary pressure from the observation of structural features within mRNA sequences, both constraints should ideally be satisfied. Unfortunately, the algorithms used to capture these two constraints rely on radically different principles, and cannot be easily combined into an algorithm that would, at the same time, preserve the dinucleotide frequency and the amino acid sequence. For this reason, Katz and Burge proposed DiCodonShuffle [2], a heuristic algorithm based on a swapping procedure, which repeatedly exchanges codons while preserving the DF. As shown by Shabalin et al. [14], such a model preserves the periodic pattern of base-pair frequencies observed within coding regions of mRNAs.
However, this method is only asymptotically uniform, and a bias towards certain sequences may be anticipated in samples produced in finite time (depending on the initial sequence and the number of swaps). Furthermore, as noted by the authors, the codon/DF-preserving swaps may disconnect the underlying Markov chain, causing some legitimate sequences to be completely inaccessible to the sampling procedure. The impact of such limitations turned out to be more than purely theoretical: we observed (see Figure 3.1) that the diversity (indicated by the sequence entropy) of generated sequences was much lower for DiCodonShuffle than for our truly uniform procedure, indicating a substantial bias in the method.

We developed SPARCS (Structural Profile Assignment of RNA Coding Sequences), a web server that predicts structured, unstructured and disordered regions in coding RNA sequences. Building on recent algorithmic advances [15; 16], its novel sampling algorithm enables us to sample uniformly random sequences preserving the encoded protein sequence as well as the dinucleotide frequencies. Combined with multiple classical metrics (e.g. base pair probabilities and base pairing entropy), this sampling algorithm enables the calculation of accurate Z-scores and the prediction of strongly and weakly structured regions, along with disordered regions in exons, an insight that could not be fully achieved using previously existing sampling techniques.

However, knowing the structural properties of a sequence is not sufficient for predicting RBP binding sites. We need more information from the primary sequence. For this, we integrated information from the above-mentioned RBP motif study by Ray et al. [3]. We scan the input mRNA transcript with the RBP binding preferences, then eliminate false-positive RBP hits by generating random shuffles of the mRNA and comparing the significance of their hits with that of the biological sequence.

Although structural predictions and primary motif search can give us a good idea of which RBPs might bind and where, a number of transcriptome-wide studies have found that functional sites are highly conserved across species [17].
However, we can only conclude that a highly conserved motif is more likely to be a cis-regulatory element; we cannot assume that poorly conserved motifs are non-functional. In our pipeline, we mainly look at conservation in the coding region of the mRNA, using a tool called CRUNCS, for Coding Region Under Non-Coding Selection [18]. CRUNCS detects evolutionarily conserved elements in the coding region using a comparative genomics approach.

To integrate all this information, we selected a set of genes from Drosophila melanogaster that are homologous to human disease genes and assembled a database called DroDARScan, for Drosophila Disease-Associated RNA Scan. For many years, Drosophila melanogaster has proven to be instrumental for gaining insights into gene function and regulatory features. It is a well-studied model organism that has been frequently employed for understanding the molecular mechanisms of human diseases. Many basic biological properties are conserved between fly and human. It has been found that at least 75% of human disease-causing genes have a functional orthologue in Drosophila melanogaster [19].

In this database, we systematically compiled RNA sequence and structural features for a broad collection of Drosophila mRNAs with human disease-associated orthologues. For each mRNA, the DroDARScan graphical interface displays i) the positions of consensus binding sites for a broad collection of over one hundred RBPs with deeply conserved RNA binding specificities between fly and human, as well as ii) the presence of binding site clusters for these RBPs, either within the coding segment or the 5’/3’ UTRs of the mRNA. In addition, the database highlights iii) regions of the mRNA that are predicted to be structured, disordered or unstructured based on analyses conducted with our previously developed SPARCS tool. Finally, we further identify iv) coding regions under non-coding selection (CRUNCS), which pinpoint segments of the mRNA coding region that are likely to carry out coding-independent regulatory functions.
In addition to the graphical display, we offer users the option to resubmit queries with a different set of thresholds, allowing more flexibility in data parsing.

Elucidating the mechanisms governing these various RNA regulatory events is a key area of biomedical research with important implications for understanding disease etiology [20]. Indeed, for a growing number of diseases, including neuro- and musculo-degenerative disorders and cancers, aberrant alterations in mRNA metabolism are thought to have a causal role in disease etiology. This project can therefore help scientists understand basic mechanisms that are often disrupted in disease and better understand the roles RBPs play. The culmination of this project would thus help to establish potential targets for human disease treatment [21].

The remaining chapters are organized as follows. In Chapter 2, we briefly go over the background of RNA structures and RNA binding proteins. In Chapter 3, we explain our pipeline from three aspects: structural profile prediction, primary motif prediction and conservation information. In Chapter 4, we discuss the input, benchmark and usage of our pipeline. In conclusion, we discuss the limitations and future perspectives of this work.

Chapter 2

Background

2.1 RNA structure

RNA is a family of biological molecules that perform multiple roles in the regulation and expression of genes. It is made of a chain of nucleotides. Each nucleotide is composed of a nitrogenous base, a ribose sugar and a phosphate group; the sugar and phosphate groups link the nucleotides together to form the backbone. RNA contains four bases: Adenine (A), Uracil (U), Guanine (G) and Cytosine (C). Adenine and Uracil form a base pair with two hydrogen bonds whereas Cytosine and Guanine form a base pair with three hydrogen bonds [22], as shown in Figure 2.1. A large fraction of the bases are engaged in non-Watson-Crick base pairings, which can also form binding motifs for proteins and small molecules [23].

Virtually all RNA strands can fold into secondary structures. Some RNA molecules have evolved to form specific secondary structures. However, it is generally not practical to observe RNA shapes in vivo. Many tools have been developed to predict the folding of RNAs, for example the ViennaRNA Package [24]. Prediction of RNA secondary structure is crucial in biology because it can help to infer or understand the function of an RNA molecule. It can also help to elucidate translation processes involving ribosomal RNAs.

Figure 2.1: A typical secondary structure (stemloop) of RNA.

2.2 RNA Binding Protein (RBP)

RNA-binding proteins (RBPs) are proteins that bind to single- or double-stranded RNAs. They contain structural motifs and are key components of RNA metabolism. They regulate the temporal, spatial and functional dynamics of RNAs [21]. In cells, an RBP recognizes its binding motif in RNAs to target them for functional roles. For example, some RBPs specialize in targeting RNAs to specific locations in the cell, as shown in Figure 2.2. Recent genetic and proteomic data suggest that RBPs are involved in many human diseases, for example neurological disorders and cancer. Therefore, revealing the structural motifs in both RNAs and RBPs might inform treatments for complex diseases.

Figure 2.2: An illustration of RBPs’ roles.

2.2.1 Position Weight Matrix (PWM)

A Position Weight Matrix (PWM) is a common representation of motifs in biological sequences (Figure 2.3). The PWMs used in this project are derived from the RNAcompete study [3]. In a PWM, each row represents a position in the motif and each column represents one of the four nucleotides A, U, G and C. The entry at the intersection of a row and a column is the probability that the corresponding nucleotide occurs at that position of the motif. Another way of representing these probabilities is a sequence logo. In each column of the sequence logo, the size of each nucleotide letter reflects its probability in the PWM. In the Cis-bp RNA database, both the PWM and the sequence logo can be downloaded for any RBP of interest.

Figure 2.3: An example PWM of RBP ELAVL1.
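To make the PWM layout concrete, the following Python sketch stores a small matrix with one row per motif position and one column per nucleotide (A, U, G, C), as described above. The matrix values and the helper names `site_probability` and `consensus` are hypothetical and chosen only for illustration; they are not taken from the Cis-bp RNA database.

```python
# Hypothetical 3-position PWM (illustrative values only).
# Rows are motif positions; columns follow the order A, U, G, C.
NUCLEOTIDES = "AUGC"
PWM = [
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.85, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.25],
]


def site_probability(pwm, site):
    """Probability of `site` under the PWM: product of the per-position entries."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif length")
    prob = 1.0
    for row, nucleotide in zip(pwm, site):
        prob *= row[NUCLEOTIDES.index(nucleotide)]
    return prob


def consensus(pwm):
    """Most probable nucleotide at each position (ties broken in A, U, G, C order)."""
    return "".join(NUCLEOTIDES[row.index(max(row))] for row in pwm)
```

For instance, `site_probability(PWM, "UUA")` multiplies 0.70, 0.85 and 0.25, one entry per motif position.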

Chapter 3

Methods

3.1 Structural profile prediction

In this section, we explain the methods used for the SPARCS server. SPARCS stands for Structural Profile Assignment of RNA Coding Sequences, an efficient method to analyze the (secondary) structure profile of protein coding regions in mRNAs. A web server implementing this discovery pipeline is available at http://csb.cs.mcgill.ca/sparcs, together with the source code of the sampling algorithm.

3.1.1 Method overview

Our approach combines the following procedures, starting from a given RNA sequence:

1. Use a novel statistical sampling algorithm to generate a set of random sequences that preserves both the encoded amino acid sequence and the DF of the input sequence.

2. Use RNAplfold [7] to compute thermodynamic properties of the input sequence and all random sequences generated.

3. Predict regions that are significantly structured, unstructured and disordered, based on a comparison of thermodynamic properties between input and random sequences.

3.1.2 Multivariate Boltzmann sampling

Our approach builds on a multivariate Boltzmann sampling scheme, initially introduced in the context of enumerative combinatorics [15], and previously applied to control the GC-content of sampled RNA sequences within the RNAmutants software [16]. This approach initially relaxes the goal of preserving the DF, and draws sequences that strictly preserve the amino acid sequence while only achieving, on average, the prescribed DF. A further rejection step filters out unsuitable sequences, whose DF differ too much from the targeted DF, reestablishing uniformity within the selected subset. The produced sequences therefore feature both the correct DF and coding capacity while being generated with uniform probability. Namely, let S be an amino acid sequence, and $dc^{*} = (dc^{*}_{AA}, dc^{*}_{AC}, dc^{*}_{AG}, \ldots)$ be the vector of targeted dinucleotide frequencies. The algorithm repeats the following steps until the desired number of samples is reached:

1. Draw a set of sequences encoding S, with respect to a weighted distribution;

2. Estimate expected DF from sample;

3. Collect suitable sequences;


4. Update weights to match expected DF with target.

3.1.2.1 Weighted distribution.

We associate a weight $\pi_{XY}$ to each dinucleotide XY. This weight is inherited multiplicatively by any RNA sequence in rna(S), the set of sequences compatible with a targeted amino acid sequence S. This implicitly defines a probability distribution over rna(S) where any RNA sequence $w \in \mathrm{rna}(S)$ has probability

$$P(w) = \frac{\pi(w)}{\sum_{w' \in \mathrm{rna}(S)} \pi(w')} \quad \text{with} \quad \pi(w) = \prod_{i=1}^{|w|-1} \pi_{w_i w_{i+1}}.$$

Quite importantly, it should be noted that any pair of sequences having an equal DF will also have equal relative probability. Therefore, generating a set of sequences, and retaining only the sequences that do feature the targeted DF, gives an unbiased set of sequences. This property also holds for sequences generated using different weights, therefore they can be gathered across different iterations of the adaptive sampling without introducing a bias.
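The property stated above can be checked on a toy case. The following Python sketch computes the multiplicative weight π(w) of a sequence from a dictionary of dinucleotide weights, and the dinucleotide counts used to compare DF. The function names, the default weight of 1 for unlisted dinucleotides and the two example strings are assumptions made for illustration; the strings are not synonymous coding sequences, and this is not the SPARCS source code.

```python
from collections import Counter


def dinucleotide_counts(seq):
    """Count every overlapping dinucleotide of `seq`."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def sequence_weight(seq, pi):
    """pi(w): product of the weights of consecutive dinucleotides.

    Dinucleotides absent from `pi` get weight 1 (an assumption of this sketch).
    """
    weight = 1.0
    for i in range(len(seq) - 1):
        weight *= pi.get(seq[i:i + 2], 1.0)
    return weight


# Two toy strings with identical dinucleotide counts (AU, UG, GU, UA once each).
first, second = "AUGUA", "GUAUG"
weights = {"AU": 10.0, "GU": 2.0}
same_df = dinucleotide_counts(first) == dinucleotide_counts(second)
same_weight = sequence_weight(first, weights) == sequence_weight(second, weights)
```

Since the weight depends on a sequence only through its dinucleotide counts, both strings receive the same weight under any choice of `weights`, which is why filtering on the DF leaves an unbiased subset.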

3.1.2.2 Self-adaptive calibration of weights.

The weighting scheme may be used to shift the expected number of occurrences of each dinucleotide, as illustrated by Figure 3.2. Let us denote by $V_{XY}$ the number of copies of XY in a random sequence generated in the weighted model. For instance, setting $\pi_{GU}$ to 0 will cancel the probability of any sequence featuring any occurrence of GU, and the expected number of GU will therefore drop to 0. Conversely, setting $\pi_{GU} \to +\infty$ will only grant positive probability to sequences that maximize the number of copies of GU.

[Plot: entropy (y-axis) versus sequence length, ×10^4 (x-axis), for DiCodonShuffle and Probabilistic Shuffling.]

Figure 3.1: Entropy comparison between sequences generated by DiCodonShuffle [2] and our probabilistic shuffling method. For both methods, 1000 sequences are generated and, for our approach, the relative tolerance was set to $\varepsilon = 10^{-1}$. Sequences produced using DiCodonShuffle show much less diversity than those generated using our approach, either indicating a substantially limited accessibility of compatible sequences, or a substantial bias (non-stationarity) due to the bounded nature of their random walk.


To match the expected DF with the targeted one, we use a heuristic strategy to find weights that achieve, on average, the targeted DF. To that purpose, we initially set $\pi(XY) := dc^{*}_{XY}$ and, after each iteration of the adaptive sampling, we update each weight to $\pi(XY) \cdot dc^{*}_{XY} / \mu_{XY}$, where $\mu_{XY}$ is the expected value of $V_{XY}$, estimated from the sample. The process typically converges after a few iterations, leading to a good approximation of the best weight set.
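One calibration step of this heuristic can be sketched in Python as follows. The function name `update_weights`, the dictionary-based data layout, and the handling of a dinucleotide that never occurs in the sample (its weight is left unchanged, because the ratio is undefined there) are assumptions of this sketch rather than details taken from the SPARCS implementation.

```python
def update_weights(pi, target, sample_counts):
    """Apply pi[XY] <- pi[XY] * target[XY] / mu[XY] for every dinucleotide XY.

    pi            -- current weights, keyed by dinucleotide
    target        -- targeted number of occurrences of each dinucleotide
    sample_counts -- one dict of dinucleotide counts per sampled sequence
    """
    n_samples = len(sample_counts)
    updated = {}
    for dinucleotide, weight in pi.items():
        # mu estimates the expected number of occurrences from the sample.
        mu = sum(counts.get(dinucleotide, 0) for counts in sample_counts) / n_samples
        if mu > 0:
            updated[dinucleotide] = weight * target[dinucleotide] / mu
        else:
            # Assumption: keep the weight when the dinucleotide was never observed.
            updated[dinucleotide] = weight
    return updated
```

A weight grows when its dinucleotide is under-represented in the sample relative to the target, and shrinks when it is over-represented.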

3.1.2.3 Random generation.

To draw a sequence of rna(S) within the weighted distribution, one needs to choose a compatible codon for each of the n amino acids in S. Such choices cannot be made independently, since the junction between consecutive codons contributes an additional dinucleotide, ultimately impacting the weight of a generated sequence. Following the general principle of the recursive approach for random generation [25; 26], we precompute the total weight $Z_{b,i+1}$ of every sequence accessible upon choosing some codon ending with a base $b \in \{A, U, C, G\}$ at the i-th position. Such weights can be efficiently computed using dynamic programming based on

$$Z_{b,n+1} = 1 \quad \text{and} \quad Z_{b,i} = \sum_{c \in \mathrm{cod}(S_i)} \pi(b.c) \cdot Z_{c_{-1},i+1}$$

where $\mathrm{cod}(S_i)$ is the set of codons compatible with the i-th amino acid in S, and $c_{-1}$ is the last nucleotide of c. Since the first amino acid (i = 1) is not preceded by any nucleotide, it must be treated slightly differently by setting $b := \emptyset$ and $\pi(\emptyset.c) := 1$. During the random generation, these precomputations are used to assign probabilities to each of the possible codons, such that each sequence is generated with respect to the weighted distribution. Namely, one picks a codon $c \in \mathrm{cod}(S_i)$ for the i-th amino acid, in the context of a previous nucleotide b, with probability

$$p_{b,c,i} = \frac{\pi(b.c) \cdot Z_{c_{-1},i+1}}{Z_{b,i}}.$$

The sampling algorithm starts on the first codon (i := 1 and b := ∅), and iterates over the amino acid sequence S in increasing order, picking a codon with the above probabilities, and updating b to the last nucleotide of the selected codon. After picking the last codon, it can be shown that the generated sequence is indeed in rna(S), and has a probability proportional to its weight (cf. supp. mat.). The complexity of the algorithm is in Θ(k · n) time and space for sampling k sequences, each consisting of n codons.
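The recursion and the codon-by-codon sampling described in this section can be sketched in Python as below. This is an illustrative reimplementation under stated assumptions, not the SPARCS source code: the names `chunk_weight`, `partition_table` and `sample_sequence` are hypothetical, the standard genetic code is assumed, unlisted dinucleotides get weight 1, and for the first codon only the missing junction is given weight 1 while the two dinucleotides inside the codon are still counted.

```python
import random
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order ('*' marks stop codons).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(amino_acid, []).append("".join(triplet))


def chunk_weight(prev, codon, pi):
    """pi(b.c): product of weights over the dinucleotides of prev + codon.

    `prev` is the previous base, or "" for the first codon (no junction).
    """
    text = prev + codon
    weight = 1.0
    for i in range(len(text) - 1):
        weight *= pi.get(text[i:i + 2], 1.0)
    return weight


def partition_table(protein, pi):
    """Z[i][b]: total weight of all codon choices from position i on, given previous base b."""
    n = len(protein)
    Z = [dict() for _ in range(n + 1)]
    Z[n] = {b: 1.0 for b in BASES}
    for i in range(n - 1, -1, -1):
        previous_bases = [""] if i == 0 else BASES
        for b in previous_bases:
            Z[i][b] = sum(chunk_weight(b, c, pi) * Z[i + 1][c[-1]]
                          for c in CODONS[protein[i]])
    return Z


def sample_sequence(protein, pi, Z, rng):
    """Draw one synonymous sequence with probability proportional to its weight."""
    prev, chosen = "", []
    for i, amino_acid in enumerate(protein):
        codons = CODONS[amino_acid]
        # Unnormalized p_{b,c,i}; rng.choices normalizes by their sum, Z[i][prev].
        weights = [chunk_weight(prev, c, pi) * Z[i + 1][c[-1]] for c in codons]
        codon = rng.choices(codons, weights=weights)[0]
        chosen.append(codon)
        prev = codon[-1]
    return "".join(chosen)
```

A typical call would first build `Z = partition_table("MKVLS", pi)` once and then call `sample_sequence("MKVLS", pi, Z, random.Random(0))` as many times as needed; with all weights equal to 1, `Z[0][""]` is simply the number of synonymous sequences.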

3.1.3 Secondary structure prediction

The secondary structures of both the input mRNA sequence and random sequences are predicted using the RNAplfold software, distributed within the Vienna RNA package [24]. RNAplfold considers all possible locally stable secondary structures for an input RNA sequence, and calculates base pair probabilities, assuming a Boltzmann equilibrium. As recommended by Lange et al. [27], we use a window size of W + 50 nucleotides (W = 150 by default), retain only those base pairs separated by at most W positions, and set the base pair probability cutoff to 0.1.
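As a rough illustration of these settings, the Python sketch below assembles an RNAplfold command line using the program's `-W` (window size), `-L` (maximum base pair span) and `-c` (probability cutoff) options. The helper name `rnaplfold_command` and its parameter names are hypothetical, and the exact invocation used by the SPARCS server is not documented here, so this should be read as one plausible mapping of the parameters above rather than the server's actual call.

```python
def rnaplfold_command(span=150, margin=50, cutoff=0.1):
    """Build an RNAplfold argument list for a span W and window W + margin.

    span   -- maximum separation of a base pair (W in the text)
    margin -- extra nucleotides added to the span to obtain the window size
    cutoff -- minimum base pair probability to report
    """
    return [
        "RNAplfold",
        "-W", str(span + margin),  # window size: W + 50 by default
        "-L", str(span),           # keep base pairs at most W positions apart
        "-c", str(cutoff),         # base pair probability cutoff
    ]


# Example (not executed here): feed a FASTA-formatted sequence on standard input.
# import subprocess
# subprocess.run(rnaplfold_command(), input=">seq\nAUGGCU...\n", text=True, check=True)
```

Building the argument list separately from the call keeps the parameters easy to inspect and to change when the user selects a different W.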

3.1.4 Characterization of the structural profile

We screen the input sequence with a sliding window of W nucleotides and evaluate the standardized score (Z-score) for each window w on two classical metrics: B(w), the sum of base pair probabilities, and H(w), the base pair entropy. Let C be the set of all valid base pairs in the sliding window and $p_{i,j}$ the probability of a base pair (i, j). We define the sum of base pair probabilities as the sum of all base pair probabilities assessed by RNAplfold within the window, such that

$$B(w) = \sum_{(i,j) \in C} p_{i,j}.$$

The sum of base pair probabilities estimates the stability of the secondary structures in the conformational landscape and thus quantifies the structural potential of the sequence. Similarly, we define the base pair entropy as the Shannon entropy of the base pair probabilities, such that

$$H(w) = - \sum_{(i,j) \in C} p_{i,j} \cdot \log(p_{i,j}).$$

The base pair entropy aims to evaluate whether many alternate sub-optimal structures exist in the conformational landscape. For each nucleotide position, the Z-scores of all windows containing that position are averaged to give the structural profile at single-nucleotide resolution.

Figure 3.2: Impact of weighted distribution on the number of occurrences of dinucleotides AU and GU. Either in the uniform distribution (π(XY) = 1, blue), or setting larger weights to AU (π(AU) = 10, red) or GU (π(GU) = 10, green), 100 000 sequences compatible with an mRNA sequence encoding 179 amino acids (the first two exons of the oskar gene in D. melanogaster) were randomly generated. The concentration of the distribution, and the shift in expected DF observed for different weights, are the key ingredients of our method, allowing for an efficient approach based on adaptive sampling.

We use these metrics to characterize a structural profile consisting of three mutually exclusive types of regions, based on two user-defined thresholds $t_B$ and $t_H$:

• Structured regions: A region is said to be structured when the Z-score of the base pair probability exceeds $t_B$ and the Z-score of the base pair entropy is lower than $-t_H$. This configuration indicates stable structures with few competitors.

• Unstructured regions: A region is unstructured when the Z-score of the base pair probability and the Z-score of the base pair entropy are respectively lower than $-t_B$ and $-t_H$. In that case, the energy landscape is flat with no dominant structure.

• Disordered regions: A region is disordered when the Z-score of the base pair probability and the Z-score of the base pair entropy respectively exceed $t_B$ and $t_H$. This configuration suggests the presence of multiple stable and competing structures in the conformational landscape.

By default, our method uses thresholds on the Z-score of 0.2 to discriminate high or low values. As illustrated in the next section, these settings aim to classify structural domains in the input sequences. Nonetheless, more stringent values can be specified, for instance if the user wishes to detect strongly (un-)structured regions.
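The two metrics and the three-way classification of this section can be sketched in Python as follows. Several details are assumptions of this sketch and are not specified in the text: the natural logarithm is used for the entropy, the Z-score is computed against the population standard deviation of the random-sequence values, a zero standard deviation yields a Z-score of 0, and regions matching none of the three rules are labelled "unclassified". The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def bp_sum(probabilities):
    """B(w): sum of the base pair probabilities found in the window."""
    return sum(probabilities)


def bp_entropy(probabilities):
    """H(w): Shannon entropy of the base pair probabilities (natural log assumed)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def z_score(value, background):
    """Standardize `value` against the values measured on the random sequences."""
    spread = pstdev(background)
    if spread == 0:
        return 0.0  # assumption: no spread in the background means no signal
    return (value - mean(background)) / spread


def classify(z_b, z_h, t_b=0.2, t_h=0.2):
    """Assign a region type from the Z-scores of B and H and the two thresholds."""
    if z_b > t_b and z_h < -t_h:
        return "structured"
    if z_b < -t_b and z_h < -t_h:
        return "unstructured"
    if z_b > t_b and z_h > t_h:
        return "disordered"
    return "unclassified"
```

The three rules cannot hold simultaneously for positive thresholds, which is consistent with the region types being mutually exclusive.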

3.2 Primary motif prediction

3.2.1 Cis-bp RNA database

To find RBP binding sites, we need to consider not only the structural properties of the sequence but also the primary sequence motifs it contains. RNAcompete is an in vitro assay that introduced a systematic analysis of the RNA sequence preferences of RBPs [3]. Each RBP motif characterized via the RNAcompete procedure is associated with a Position Weight Matrix (PWM), which gives the probability of each nucleotide at each position within the motif; motifs are typically 6-10 nucleotides in length [28]. Consequently, each of these PWMs represents a degenerate set of possible binding sequences that could be recognized by an individual RBP.


3.2.2 Motif searching

For each RNA motif, we computed a likelihood score by multiplying the PWM probabilities over all positions of the motif. We scanned each mRNA sequence with a sliding window of the length of the motif, so that each position along the transcript was assigned a likelihood score. To evaluate the significance of the match at each position more simply, we converted the likelihood score into a Z-score. The larger the Z-score, the more closely the sequence matches the optimal RBP target sequence. In this way, systematic scans were performed for all conserved RBPs for which PWMs were available.

To eliminate false-positive binding sites, we generate 500 randomly shuffled sequences for each biological sequence and run the same RBP scanning process on each of them. The average number of binding sites and the average Z-score over the shuffled sequences are then compared with the values obtained for the true sequence. For a given RBP, let $Z_{\mathrm{random}}$ be the average Z-score in the random sequences, $N_{\mathrm{random}}$ the average number of binding sites in the random sequences, $Z_{\mathrm{true}}$ the average Z-score in the true sequence, and $N_{\mathrm{true}}$ the average number of binding sites in the true sequence. If

$$Z_{\mathrm{true}} \times N_{\mathrm{true}} \geq 1.2 \, Z_{\mathrm{random}} \times N_{\mathrm{random}},$$

then we classify the binding sites of this RBP as specific to the true sequence.
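The scan and the shuffling control described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the pipeline: the 4-position PWM is a toy example, the Z-score is taken relative to all windows of the same transcript, a window is called a binding site when its Z-score reaches a hypothetical cutoff `z_cut`, and the random sequences come from a plain mononucleotide shuffle. None of these details is specified in the text above.

```python
import random
from statistics import mean, pstdev

# Toy PWM (assumption): one dict of nucleotide probabilities per motif position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "U": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "U": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "U": 0.1},
]


def likelihood(window, pwm):
    """Product of the PWM probabilities over all positions of the window."""
    score = 1.0
    for column, nt in zip(pwm, window):
        score *= column[nt]
    return score


def scan(seq, pwm):
    """Z-score of the likelihood at every window start along seq.

    Assumption: the Z-score is computed against all windows of this sequence.
    """
    k = len(pwm)
    scores = [likelihood(seq[i:i + k], pwm) for i in range(len(seq) - k + 1)]
    if not scores:
        return []
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / sd if sd > 0 else 0.0 for s in scores]


def site_summary(seq, pwm, z_cut):
    """(number of sites, mean Z-score of the sites) for windows with Z >= z_cut."""
    hits = [z for z in scan(seq, pwm) if z >= z_cut]
    return (len(hits), mean(hits)) if hits else (0, 0.0)


def is_specific(seq, pwm, z_cut=3.0, n_shuffles=500, factor=1.2, seed=0):
    """Test Z_true * N_true >= factor * Z_random * N_random.

    z_cut and the mononucleotide shuffle are hypothetical choices.
    """
    n_true, z_true = site_summary(seq, pwm, z_cut)
    if n_true == 0:
        return False  # nothing to classify
    rng = random.Random(seed)
    n_total = z_total = 0.0
    for _ in range(n_shuffles):
        letters = list(seq)
        rng.shuffle(letters)
        n, z = site_summary("".join(letters), pwm, z_cut)
        n_total += n
        z_total += z
    n_random = n_total / n_shuffles
    z_random = z_total / n_shuffles
    return z_true * n_true >= factor * z_random * n_random
```

A call such as `is_specific(transcript, PWM)` returns `True` when the product of site count and mean Z-score in the true sequence exceeds the shuffled background by the factor 1.2 used above.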

3.2.3 Binding sites clustering

As was recently observed for the RBFox2 protein, an RBP implicated in stem cell differentiation [29], the presence of RBP binding site clusters within an RNA transcript can be a strong predictor of regulatory links. Therefore, we searched for clusters of binding sites for each tested RBP by scanning every consecutive 100 bp window and calculating the average Z-score of the sites in each window. By default, we define an RBP binding site cluster as a 100 bp stretch of sequence containing three or more binding sites with an average Z-score greater than or equal to 3.
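The default cluster definition can be illustrated with a short sketch. It assumes that the binding sites of one RBP on one transcript are available as `(start_position, z_score)` pairs and that windows are enumerated at every start position (`step=1`); the function name and the input format are hypothetical and only serve the example.

```python
from statistics import mean


def find_clusters(sites, seq_len, window=100, min_sites=3, min_mean_z=3.0, step=1):
    """Return (window_start, n_sites, mean_z) for every window that qualifies.

    sites: list of (start_position, z_score) pairs for a single RBP.
    A window qualifies when it holds at least min_sites sites whose
    average Z-score is at least min_mean_z.
    """
    clusters = []
    last_start = max(seq_len - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        zs = [z for pos, z in sites if start <= pos < end]
        if len(zs) >= min_sites and mean(zs) >= min_mean_z:
            clusters.append((start, len(zs), mean(zs)))
    return clusters


# Three nearby sites form a cluster; the isolated site at 400 does not.
example_sites = [(10, 4.0), (40, 3.5), (90, 3.0), (400, 5.0)]
example_clusters = find_clusters(example_sites, seq_len=500)
```

Overlapping windows that contain the same sites are all reported here; merging them into a single cluster record is a design choice left out of this sketch.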

3.3 Evolutionary conservation

Beyond the primary and secondary structure information of a binding site, conservation is usually considered a crucial feature [30]. Chen and Blanchette described a method to detect non-coding functional elements located in coding regions, which they call coding regions under non-coding selection (CRUNCS). These elements were detected using a comparative genomics approach across Drosophila species, following a previously described strategy [18]. In this approach, orthologous DNA coding regions are compared, and inter-species conserved regions showing higher than expected third-base conservation within codons are identified. Putative non-coding functional elements can thus be identified within coding regions. Such regions may harbour various types of regulatory elements, such as exonic splicing enhancers, transcription factor binding sites, and RNA structure elements affecting transcript stability, localization, or translation. A “significance” score is provided to evaluate the significance of conservation. We categorize CRUNCS into three types: lowly conserved, moderately conserved and highly conserved elements.


3.4 DroDARScan database

Finally, we assembled all three approaches (structural prediction, primary motif searching and evolutionary information) together. We present the results graphically in a database called DroDARScan, short for Drosophila Disease-Associated RNA Scan (http://csb.cs.mcgill.ca/DroDARScan, still under development). At this moment, we are specifically interested in Drosophila genes that are homologous to human disease genes. More information on DroDARScan is given in the Discussion chapter.

Chapter 4

Discussion

4.1 DroDARScan Dataset

Since February 2014, FlyBase (http://flybase.org) has displayed information on genetic models of human diseases in Drosophila. We have conducted RBP motif and structural profiling of 650 mRNA transcripts orthologous to human genes implicated in a broad range of diseases. Each gene is associated with one or more types of human disease. As FlyBase continues to expand its collection of disease-associated transcripts, regular updates will be incorporated into DroDARScan accordingly. Similarly, the database currently contains motif predictions for 120 highly conserved RBPs whose RNA sequence binding specificities have previously been empirically determined by Ray and colleagues, using a systematic in vitro selection procedure designated RNAcompete. These RNAcompete data have been catalogued in the CISBP database [3]. As more RBPs are characterized through this pipeline, additional RBP motif scanning data will be incorporated into DroDARScan.


4.2 Implementation

4.2.1 SPARCS

SPARCS runs on a server hosted at McGill University, which has 8 cores and a total of 63 GB of memory. Each core is an Intel(R) Xeon(R) CPU X5570 at 2.93 GHz with 8192 KB of cache. Figure 5.2 shows the overall runtime on the server as a function of the mRNA length, for mRNA sequences ranging from 200 to 1000 nts, and reveals a linear trend. SPARCS was implemented in Python 2.7, and its web interface uses Python CGI.

4.2.2 DroDARScan

DroDARScan is a database freely accessible from its website. It was built using the Twitter Bootstrap web framework (http://getbootstrap.com/2.3.2/) coupled with a PHP backend. To visualize the binding sites and all the other properties, we use the jQuery library Jqplot (http://www.jqplot.com/) to draw our candidates in a plot where each binding site is represented as one square, as shown in Figure 4.1, a sample output graph from DroDARScan. The website is served by an Apache server hosted by the McGill Bioinformatics group.

4.3 Time and space requirement

Figure 4.1: An example of the Jqplot graphical representation of a gene.

In SPARCS, we empirically observed, and could formally prove using Drmota theorem [31] for non-degenerate cases, that $V_{XY}$ asymptotically follows a Normal law of mean in $\Theta(n)$ and standard deviation $\sigma_{XY}$ in $\Theta(\sqrt{n})$. Furthermore, the covariations between numbers for different dinucleotides remains provably limited, and the joint distribution of the $V_{XY}$ for every dinucleotide $XY$ asymptotically follows a 16-variate Normal law. Consequently, the probability of generating a sequence having expected DF scales like $\Theta(n^{-16/2})$ and it takes, on the average, $\Theta(n^{8})$ attempts to obtain such a sequence. The average-case complexity of a rejection procedure for the uniform sampling is in $\Theta(k \cdot n^{9})$ time, after a linear time and space preprocessing. Such a large time complexity may be impractical for real-life applications. However, if a small relative tolerance $\varepsilon \in \Theta(1/\sqrt{n})$ is allowed on every targeted dinucleotide count, leading any sequence $w$ to be accepted if its dinucleotide counts are such that

$$(1 - \varepsilon) \cdot dc^{*}_{XY} \leq dc_{XY}(w) \leq (1 + \varepsilon) \cdot dc^{*}_{XY}, \quad \text{for every dinucleotide } XY.$$

Under this setting, the probability of acceptance only decreases like $o(C^{-16})$, where $C$ is a constant which only depends on the covariance matrix. In particular, if

$$\varepsilon = 3 \cdot \max_{XY}(\sigma_{XY}/n) \in \Theta(1/\sqrt{n}) \quad \text{(the 3 std.-dev. rule)},$$

then the probability of acceptance becomes greater than $0.99^{16} \approx 85\%$, and the average-case complexity of the method becomes asymptotically equivalent to $\Theta(k \cdot n)$, at the cost of loss of uniformity which is typically negligible, and can be efficiently corrected through a post-processing step [15].
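The tolerance-based acceptance test of this section can be sketched as below. The sketch only covers the accept/reject decision on a candidate sequence and the numerical value of the acceptance bound; it does not implement the random generator itself. The function names are hypothetical, and the RNA alphabet ACGU is an assumption of the example.

```python
from collections import Counter
from itertools import product

# The 16 dinucleotides over the RNA alphabet (assumption: input uses A, C, G, U).
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGU", repeat=2)]


def dinucleotide_counts(seq):
    """Count the overlapping dinucleotides of seq."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def accepted(candidate, target_counts, eps):
    """Check (1 - eps) * dc*_XY <= dc_XY(w) <= (1 + eps) * dc*_XY for every XY."""
    counts = dinucleotide_counts(candidate)
    for xy in DINUCLEOTIDES:
        target = target_counts.get(xy, 0)
        if not (1 - eps) * target <= counts.get(xy, 0) <= (1 + eps) * target:
            return False
    return True


# Lower bound on the acceptance probability quoted in the text: 0.99^16.
ACCEPTANCE_LOWER_BOUND = 0.99 ** 16  # about 0.85
```

With `eps = 0` the test reduces to exact preservation of the dinucleotide counts, which is the costly $\Theta(k \cdot n^{9})$ regime discussed above; a larger `eps` trades exactness for a higher acceptance rate.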

4.4 Limitation of the work

Though the pipeline is useful to a certain degree, its approaches still have limitations. First, the secondary structure properties are analyzed based on predictions from RNAplfold, so the accuracy of the structural landscape is bounded by the accuracy of RNAplfold. In reality, RNA not only has secondary structure but also folds into tertiary structures. Reinharz et al. [32] proposed a novel way of predicting the 3D structure of RNA using sequence information only. Coupling the secondary structure prediction with tertiary structure prediction could improve the structural prediction accuracy for an RNA molecule.

Moreover, in the primary motif search, we rely on PWMs from the RNAcompete experiment. RNAcompete is an in vitro experiment that detects the binding preferences of an RBP using a microarray, with the RBP exposed to a pool of RNA segments. This experiment can quantify the binding preferences of the RBP in the form of a PWM; however, it may not represent what really happens in vivo. Thus, our search method inherits a bias from the RNAcompete results.

Lastly, we use CRUNCS to analyze the conservation level of the gene. However, the CRUNCS information we have is limited to the coding region of the gene. We will need to develop a similar approach that can analyze conservation in UTR regions as well.


4.5 Future work

The analysis we have performed so far applies to coding sequences only, but as many studies have shown, non-coding regions sometimes contain RBP target sites too. Some RBPs are even specific to intronic regions [33]. In future work, we would like to develop databases that scan for RBP binding sites in genomic sequences rather than in transcripts only.

At present the pipeline uses arbitrary thresholds. Even though users can change the thresholds themselves, a valid and efficient threshold has not yet been determined. CLIP, a method used in molecular biology to cross-link proteins and RNAs in order to study protein-RNA interactions, can detect the binding locations of RBPs [34]. In future work, we would like to gather CLIP data and use machine learning algorithms, coupled with the current approach, to predict the binding sites of RBPs. Such data could also help us set a reasonable threshold.

Other types of RNAs, such as non-coding RNAs, are known to interact with RBPs [35]. These RNAs are not translated into proteins, but they have important functional roles. Many databases have been built to study non-coding RNAs, but none of them has looked at RBP binding site predictions. Other scientists study RNAs that localize to a specific compartment of the cell. These RNAs share similar RBP binding sites that allow the RBP to target the transcript to its destination in the cell. In future work, we will also study RNAs that belong to these groups. Establishing different databases for different sets of genes will help biologists working in different research areas to better understand the function of the genes they are interested in.

Chapter 5

Results

5.1 SPARCS

The SPARCS web server takes an RNA/DNA sequence or a FASTA file as input, as shown in Figure 5.1. Upon validation, a first set of 1 000 random sequences, preserving both the DF and the encoded amino acid sequence of the input sequence, is generated. A second set of 1 000 random sequences, called the uniform model, is generated to preserve only the amino acid sequence. The input sequence and the 2 000 random sequences are then fed to RNAplfold to predict their base pairing properties. The Z-score is computed for a sliding window of user-specified width (defaulting to 150 nts), and all Z-scores are averaged for every position to evaluate the statistical significance of the secondary structure profile.

SPARCS finally outputs a single Z-score plot based on our metrics: sum of base pair probabilities, base pair entropy, structural potential region, unstructured potential region and disordered potential region. The dashed lines indicate the Z-score thresholds for the sum of base pair probabilities and the base pair entropy, respectively. Users may specify custom thresholds for both the base pair probability and base pair entropy metrics.
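The window-based Z-score and its per-position averaging can be sketched as follows. The sketch assumes that a per-nucleotide value of one metric (for instance the base pair probability reported by RNAplfold) is already available for the input sequence and for each random sequence, and that the metric of a window is the sum of these values; both the input format and the function names are hypothetical.

```python
from statistics import mean, pstdev


def window_sums(values, width):
    """Sum of values over every window of `width` consecutive positions."""
    total = sum(values[:width])
    sums = [total]
    for i in range(width, len(values)):
        total += values[i] - values[i - width]
        sums.append(total)
    return sums


def positional_zscores(native, randoms, width=150):
    """Average, at each position, the Z-scores of all windows covering it.

    native:  per-position metric of the input sequence (non-empty list).
    randoms: list of per-position metric lists, one per random sequence,
             each of the same length as native.
    """
    n = len(native)
    width = min(width, n)
    native_windows = window_sums(native, width)
    random_windows = [window_sums(r, width) for r in randoms]

    # Z-score of each window against the random background.
    z_windows = []
    for j, value in enumerate(native_windows):
        background = [r[j] for r in random_windows]
        sd = pstdev(background)
        z_windows.append((value - mean(background)) / sd if sd > 0 else 0.0)

    # Window j covers positions j .. j + width - 1.
    z_positions = []
    for i in range(n):
        first = max(0, i - width + 1)
        last = min(i, len(z_windows) - 1)
        z_positions.append(mean(z_windows[first:last + 1]))
    return z_positions
```

Positions near the ends of the sequence are covered by fewer windows, so their averages rest on fewer Z-scores than those in the middle of the transcript.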

5.1.1 Analysis of Ash1 gene in yeast

We illustrate the insights brought by SPARCS on the well-studied ASH1 gene in yeast. Using mutagenesis studies and comparative sequence analysis, four functional elements have been identified in the ASH1 mRNA. Each of them has been shown to be sufficient to localize a reporter mRNA to the bud of dividing yeast cells [36]. Of the four elements, three (E1, E2A and E2B) are located within the coding region of the ASH1 mRNA.

Figure 5.3 shows the output of SPARCS for the ASH1 mRNA coding region. The Z-scores of the sum of base pair probabilities are represented in magenta and those of the base pair entropy in red. Structured, unstructured and disordered regions are displayed in green, blue and orange, and the functional elements E1, E2A and E2B are indicated at the bottom of the figure with yellow boxes. As mentioned above, here we aim to detect structural domains and tendencies in the structural profile rather than to predict single elements. Therefore, we used a threshold of 0.

Our results show that the E1 element (positions 625 to 775) matches predicted disordered and structured regions. The presence of a disordered region at the beginning of the element could be explained by the presence of internal loops and alternate base pairings in the predicted secondary structure (see Gonzalez et al. [37] and Chartrand et al. [36]). Interestingly, the elements E2A (positions 1081 to 1199) and E2B (positions 1200 to 1447) are both surrounded by unstructured regions, possibly to avoid interactions between these elements. Notably, unlike the E2A element, the E2B element is particularly stable and structured.

Outside these functional segments, we identify a large unstructured region (positions 200 to 600) before the E1 element, which could help to stabilize the E1 element or, hypothetically, to facilitate translation. By contrast, we identify a strongly structured region between the E1 and E2 elements. This prediction could reveal a buffer that prevents these elements from interacting. Finally, our analysis also suggests a structured region at the beginning of the sequence (positions 50 to 200). To the best of our knowledge, this region has not been experimentally studied, motivating further comparative studies.

Figure 5.1: SPARCS interface.

[Plot: runtime in seconds (y-axis, 0 to 250) against sequence length in nts (x-axis, 200 to 1,000).]

Figure 5.2: Typical runtime of SPARCS for sequence lengths varying from 200 to 1000 nts.

5.2 DroDARScan

DroDARScan (Figure 5.5) provides two ways to search for a gene of interest. You can either go to the “Search by gene” tab and search using one of three annotations, namely FlyBase ID (e.g. FBgn0000117), gene symbol (e.g. arm) or secondary annotation (e.g. CG11579), or go to the “Search by disease” tab, select a disease and click on the genes under this category (Figure 5.4). Currently the database holds 390 Drosophila melanogaster genes associated with human diseases. More will be added in the future.

DroDARScan offers a graphical plot for each gene, with four tracks. The first track displays the positions of RBP binding sites, where each color represents one RBP with its PWM. The second track displays RBP clusters: each cluster is drawn as a 100 bp segment of the transcript, indicating a window that contains multiple binding sites of the same RBP. It uses the same color convention as the first track. The third track displays the structural landscape information: orange stands for disordered regions, blue for unstructured regions and green for structured regions. The last track is the CRUNCS track. The saturation of the color indicates the significance level of the elements. Users can find more detailed information on the help page.

Once the result is shown, there are four sections below the plot. The annotation summary section details all the color conventions used in the plot (Figure 5.6). The change options section allows users to redefine the maximum Z-score and cluster numbers themselves (Figure 5.7). The statistical summary section summarizes the candidate binding sites in a table. For each RBP, it shows the sequence logo, which is a graphical view of the PWM. In the sequence column, the 15 bp of sequence around the binding site are shown in black, whereas the site itself is highlighted in its RBP color. For each cluster, you can BLAT the binding sites of the current cluster in the UCSC genome browser (Figure 5.8). You can also download the FASTA files of these sequences in the file download section.

[Figure 5.3, workflow panel: Messenger RNA → Amino-acid sequence + dinucleotide composition → Random generation → Random sequences → RNAplfold + Z-score computation + window averaging.]

Figure 5.3: Analysis of the protein-coding region of the ASH1 gene in yeast. The Z-scores of the base pair probability are represented in magenta and those of the base pair entropy in red. Structured, unstructured and disordered regions are displayed in green, blue and orange, and the functional elements E1, E2A and E2B are indicated at the bottom of the figure with yellow boxes. Dashed lines show the thresholds for determining high or low Z-score values.

Figure 5.4: Genes annotated with Alzheimer’s disease.

Figure 5.5: Web interface of DroDARScan.

5.2.1 Analysis of FMR1 in Drosophila

We illustrate the usage of DroDARScan using the FMR1 gene in Drosophila as an example. If you know one of the annotated names of FMR1 (FlyBase ID, gene symbol or secondary annotation), you can go to the “Search by gene” tab, enter, for example, “FMR1” (case-insensitive), select gene symbol and click submit. This retrieves the results for the FMR1 gene. If you are not sure about any of the annotated names, go to the “Statistics” tab; in the right corner there is a search box where you can enter the full or partial name of the gene. For instance, if you only remember “mr” from the symbol, you can enter that and FMR1 will show up among the first few hits. From this table, click on the “Results” link at the end of the row to open the result page.

Another search method is through the disease category. Suppose we are interested in genes associated with Down syndrome. We navigate to the “Search by disease” tab and enter the name of the disease or, if the full name is forgotten, any word of it that we remember; all matching hits are listed. The entry for Down syndrome indicates that there are 5 genes in this disease. Click on the number and you will see that the FMR1 gene is listed. Click on the gene and it will bring you to the result page as well.

FMR1 is a gene that is rich in RBP target sites. With the default parameters, the RBP candidates are: ELAVL1, shep, CPEB3, bru-3, CG17838, yps, orb2, pUf68, sm, elav, Cnot4, papi, Rbp1, Spx, Rnp4F and aret. Among these, bru-3 has a strongly conserved site. Surprisingly, it also has the highest matching score (Z-score = 27.39) and 3 binding sites in the cluster. We then BLAT the 3 binding sites in the UCSC genome browser using the "Blat sequences" button in the bru-3 row. The sites indeed overlap with each other, as shown in Figure 5.9.

In the majority of cases, an RBP site tends to reside in an unstructured region, as tested in the RNAcompete experiment [3]. However, the bru-3 site falls into a disordered region even though it has the best hit and also yields the best conservation score. We encourage users to test RBPs in both unstructured and structured regions and see how they behave, since this is still an open question.

To raise the threshold, enter the new threshold on the result page under “Change options” and click update. The page reloads with the new thresholds, updating the RBP candidates and also the files available in the file download section.

Figure 5.6: A sample output page from DroDARScan.

Figure 5.7: Resubmit option from DroDARScan.

Figure 5.8: Summary section of a gene in DroDARScan.

Figure 5.9: Target sites mapping of RBP bru-3 in gene FMR1 from UCSC genome browser.

Chapter 6

Conclusions

RBPs are proteins that bind to RNAs and are generally found in the cytoplasm and nucleus. They play key roles in the post-transcriptional control of RNAs. Cis-regulatory elements, which serve as specific binding sites for RBPs, are frequently located in the RNA transcript. Aberrant interactions between RBPs and RNAs can lead to many types of disease, such as neurodegenerative diseases [38]. Therefore, there is tremendous interest in locating the binding sites of RBPs. In recent years, CLIP-based techniques have become popular for detecting in vivo protein-RNA interactions. However, these experiments are costly and time-consuming, and usually require a moderate or large amount of material to prepare the library.

Predicting RBP binding sites computationally is known to be challenging for several reasons. First, RBPs have structural preferences when binding to an RNA molecule; therefore, we need to predict the structural properties of the molecule first. Second, RBPs usually recognize short and degenerate sequences, so a primary motif search yields many matches in the RNA, the majority of which are false positives. Lastly, many functional sites are highly conserved across species, and we need to be able to detect these conserved elements.

In this thesis, we developed a pipeline to predict RBP binding sites that considers structural, primary and evolutionary information at the same time. We developed a web server called SPARCS that takes an RNA coding region and reveals its structural landscape. We show that our probabilistic algorithm, which generates random sequences that preserve both dinucleotide frequency and amino acid sequence, improves on previous approaches by offering a larger sampling space and more sequence diversity.

This pipeline also enabled us to build a database, DroDARScan, which scans all the fly genes that are homologous to human disease genes. Moreover, the database provides a graphical representation of the candidate RBP binding sites. Users have two ways to search for a gene of interest: by gene annotation or by disease category. They can also change the parameters according to their needs and update the results.

Taken together, the approach described here allows searching a comprehensive dataset of RBP binding motifs tested in vitro and, in addition, allows users to visualize the candidates in a graphical output and in the UCSC genome browser. The strength of the approach lies in combining structural prediction, primary sequence prediction and evolutionary information for accurate mapping of RBP motifs. Notably, even though we believe that our pipeline can help users detect the most probable RBP candidates that could potentially bind to the RNA of interest, experimental follow-up is clearly required to confirm the predictions and to refine the approach in the future.

References

[1] Yang Zhang, Yann Ponty, Mathieu Blanchette, Eric Lécuyer, and Jérôme Waldispühl. SPARCS: a web server to analyze (un)structured regions in coding RNA sequences. Nucleic Acids Research, 41(W1):W480–W485, 2013.

[2] Luba Katz and Christopher B Burge. Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res, 13(9):2042–51, 2003.

[3] Debashish Ray, Hilal Kazan, Kate B Cook, Matthew T Weirauch, Hamed S Najafabadi, Xiao Li, Serge Gueroussov, Mihai Albu, Hong Zheng, Ally Yang, et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature, 499(7457):172–177, 2013.

[4] Hui Chen and Mathieu Blanchette. Detecting non-coding selective pressure in coding regions. BMC Evol Biol, 7 Suppl 1:S9, 2007.

[5] Michael Kertesz, Yue Wan, Elad Mazor, John L Rinn, Robert C Nutter, Howard Y Chang, and Eran Segal. Genome-wide measurement of RNA secondary structure in yeast. Nature, 467(7311):103–7, 2010.

[6] Michael F Lin, Pouya Kheradpour, Stefan Washietl, Brian J Parker, Jakob S Pedersen, and Manolis Kellis. Locating protein-coding sequences under selection for additional, overlapping functions in 29 mammalian genomes. Genome Res, 21(11):1916–28, 2011.

[7] Stephan H Bernhart, Ivo L Hofacker, and Peter F Stadler. Local RNA base pairing probabilities in large sequences. Bioinformatics, 22(5):614–5, Mar 2006.

[8] Michal Rabani, Michael Kertesz, and Eran Segal. Computational prediction of RNA structural motifs involved in posttranscriptional regulatory processes. Proc Natl Acad Sci U S A, 105(39):14885–90, 2008.

[9] Joseph M Watts, Kristen K Dang, Robert J Gorelick, Christopher W Leonard, Julian W Bess, Ronald Swanstrom, Christina L Burch, and Kevin M Weeks. Architecture and secondary structure of an entire HIV-1 RNA genome. Nature, 460(7256):711–716, Aug 2009.

[10] Robert C Spitale, Pete Crisalli, Ryan A Flynn, Eduardo A Torre, Eric T Kool, and Howard Y Chang. RNA SHAPE analysis in living cells. Nat Chem Biol, 9(1):18–20, Jan 2013.

[11] Christopher Workman and Anders Krogh. No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res, 27(24):4816–4822, 1999.

[12] Walter M. Fitch. Random sequences. Journal of Molecular Biology, 163(2):171–176, 1983.

[13] Stephen F Altschul and Bruce W Erickson. Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage. Molecular Biology and Evolution, 2(6):526–538, 1985.

[14] Svetlana A. Shabalina, Aleksey Y. Ogurtsov, and Nikolay A. Spiridonov. A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Research, 34(8):2428–2437, 2006.

[15] Olivier Bodini and Yann Ponty. Multi-dimensional Boltzmann sampling of languages. In DMTCS Proceedings - AOFA’10, volume 0, pages 49–64, 2010.

[16] Jérôme Waldispühl and Yann Ponty. An unbiased adaptive sampling algorithm for the exploration of RNA mutational landscapes under evolutionary pressure. Journal of Computational Biology, 18(11):1465–79, 2011.

[17] Anna-Carina Jungkamp, Marlon Stoeckius, Desirea Mecenas, Dominic Grün, Guido Mastrobuoni, Stefan Kempa, and Nikolaus Rajewsky. In vivo and transcriptome-wide identification of RNA binding protein target sites. Molecular Cell, 44(5):828–840, 2011.

[18] Hui Chen and Mathieu Blanchette. Detecting non-coding selective pressure in coding regions. BMC Evolutionary Biology, 7(Suppl 1):S9, 2007.

[19] Udai Bhan Pandey. Human disease models in Drosophila melanogaster and the role of the fly in therapeutic drug discovery. Pharmacol Rev, 2011.

[20] Thomas A Cooper, Lili Wan, and Gideon Dreyfuss. RNA and disease. Cell, 136(4):777–793, 2009.

[21] Kiven E Lukong, Kai-wei Chang, Edouard W Khandjian, and Stéphane Richard. RNA-binding proteins in human genetic disease. Trends in Genetics, 24(8):416–425, 2008.

[22] Jeremy M Berg, John L Tymoczko, and Lubert Stryer. Biochemistry. W. H. Freeman, 2002.

[23] Neocles B Leontis, Jesse Stombaugh, and Eric Westhof. The non-Watson–Crick base pairs and their associated isostericity matrices. Nucleic Acids Research, 30(16):3497–3531, 2002.

[24] Ronny Lorenz, Stephan Bernhart, Christian Honer zu Siederdissen, Hakim Tafer, Christoph Flamm, Peter Stadler, and Ivo Hofacker. ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(1):26, 2011.

[25] Herbert S. Wilf. A unified setting for sequencing, ranking, and selection algorithms for combinatorial objects. Advances in Mathematics, 24:281–291, 1977.

[26] Alain Denise, Yann Ponty, and Michel Termier. Controlled non uniform random generation of decomposable structures. Theoretical Computer Science, 411(40-42):3527–3552, 2010.

[27] Sita J Lange, Daniel Maticzka, Mathias Möhl, Joshua N Gagnon, Chris M Brown, and Rolf Backofen. Global or local? Predicting secondary structure and accessibility in mRNAs. Nucleic Acids Res, 40(12):5215–26, 2012.

[28] Gary D Stormo, Thomas D Schneider, Larry Gold, and Andrzej Ehrenfeucht. Use of the perceptron algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Research, 10(9):2997–3011, 1982.

[29] Michael T Lovci, Dana Ghanem, Henry Marr, Justin Arnold, Sherry Gee, Marilyn Parra, Tiffany Y Liang, Thomas J Stark, Lauren T Gehman, Shawn Hoon, et al. Rbfox proteins regulate alternative mRNA splicing through evolutionarily conserved RNA bridges. Nature Structural & Molecular Biology, 20(12):1434–1442, 2013.

[30] David Sankoff and Joseph H Nadeau. Comparative genomics. Springer, 2000.

[31] Michael Drmota. Systems of functional equations. Random Structures and Algorithms, 10(1-2):103–124, 1997.

[32] Vladimir Reinharz, François Major, and Jérôme Waldispühl. Towards 3D structure prediction of large RNA molecules: an integer programming framework to insert local 3D motifs in RNA secondary structure. Bioinformatics, 28(12):i207–i214, 2012.

[33] Astrid A Bunse, Jörg Nickelsen, and Ulrich Kück. Intron-specific RNA binding proteins in the chloroplast of the green alga Chlamydomonas reinhardtii. Biochimica et Biophysica Acta (BBA)-Gene Structure and Expression, 1519(1):46–54, 2001.

[34] Jernej Ule, Kirk B Jensen, Matteo Ruggiu, Aldo Mele, Aljaž Ule, and Robert B Darnell. CLIP identifies Nova-regulated RNA networks in the brain. Science, 302(5648):1212–1215, 2003.

[35] Riki Kurokawa. Promoter-associated long noncoding RNAs repress transcription through a RNA binding protein TLS. In RNA Infrastructure and Networks, pages 196–208. Springer, 2011.

[36] Pascal Chartrand, Xiu Hua Meng, Stefan Huttelmaier, Damiane Donato, and Robert H Singer. Asymmetric sorting of Ash1p in yeast results from inhibition of translation by localization elements in the mRNA. Mol Cell, 10(6):1319–30, 2002.

[37] Isabel Gonzalez, Sara B Buonomo, Kim Nasmyth, and Uwe von Ahsen. ASH1 mRNA localization in yeast involves multiple secondary structural elements and Ash1 protein translation. Curr Biol, 9(6):337–340, Mar 1999.

[38] Miha Modic, Jernej Ule, and Christopher R Sibley. CLIPing the brain: Studies of protein–RNA interactions important for neurodegenerative disorders. Molecular and Cellular Neuroscience, 56:429–435, 2013.