Developing Computational Approach to Predict Coincidence of Structural and Primary Signatures in mRNAs Yang Zhang
Computer Science, McGill University, Montreal, Quebec

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Master of Science

15-08-2014

Copyright © 2014 by Yang Zhang

I would like to dedicate this thesis to my loving parents and my supportive fiancé ...

Acknowledgements

I would like to express my deep gratitude to my master's supervisors, Dr. Jérôme Waldispühl and Dr. Éric Lecuyer. They have been tremendous mentors for me. I would like to thank them for encouraging my research and for allowing me to grow as a research scientist. I also thank them for many insightful conversations during the development of the projects. Special thanks go to Yann Ponty and Mathieu Blanchette, who also supervised me throughout the project. Over these two years, many friends have brightened my life. I also acknowledge all my colleagues in Jérôme and Éric's lab for their assistance in more ways than I can list in the limited space here. For financial support, I thank the Institut de recherches cliniques de Montréal (IRCM) for the internal scholarship throughout the two years.

Abstract

RNA binding proteins (RBPs) are known to interact with cis-regulatory motifs present in mRNAs to perform biological functions. However, little is known about RBP binding sites. Predicting RBP binding sites from sequence alone is therefore crucial to understanding RNA function. Such prediction relies on three factors: structural property prediction, primary motif search and evolutionary information. In this thesis, we introduce our approach, which considers structural properties, primary motifs and evolutionary conservation information at the same time.
Abrégé

RNA binding proteins (RBPs) are known to interact with cis-regulatory motifs present in mRNAs to carry out biological functions. However, little is known about RBP binding sites. Consequently, predicting RBP binding sites from their sequence is essential for understanding RNA function. The prediction is based on three factors: prediction of structural properties, search for primary motifs and evolutionary information. In this thesis, we present our approach, which considers structural properties, primary motifs and evolutionary conservation information, all at the same time.

Contribution of authors

Yang Zhang performed the research described in this thesis. She implemented the full program and conducted the experiments for SPARCS. She wrote the paper [1] in collaboration with Yann Ponty (co-author), Mathieu Blanchette, Éric Lecuyer and Jérôme Waldispühl. Jérôme Waldispühl and Éric Lecuyer provided guidance. Yang Zhang also designed and implemented the full DroDARScan website. She is preparing the manuscript for DroDARScan with Quaid Morris, Mathieu Blanchette, Éric Lecuyer and Jérôme Waldispühl. Jérôme Waldispühl and Éric Lecuyer provided guidance.

Contents

List of Figures
1 Introduction
2 Background
  2.1 RNA structure
  2.2 RNA Binding Protein (RBP)
    2.2.1 Position Weight Matrix (PWM)
3 Methods
  3.1 Structural profile prediction
    3.1.1 Method overview
    3.1.2 Multivariate Boltzmann sampling
      3.1.2.1 Weighted distribution
      3.1.2.2 Self-adaptive calibration of weights
      3.1.2.3 Random generation
    3.1.3 Secondary structure prediction
    3.1.4 Characterization of the structural profile
  3.2 Primary motif prediction
    3.2.1 Cis-bp RNA database
    3.2.2 Motif searching
    3.2.3 Binding sites clustering
  3.3 Evolutionary conservation
  3.4 DroDARScan database
4 Discussion
  4.1 DroDARScan Dataset
  4.2 Implementation
    4.2.1 SPARCS
    4.2.2 DroDARScan
  4.3 Time and space requirement
  4.4 Limitation of the work
  4.5 Future work
5 Results
  5.1 SPARCS
    5.1.1 Analysis of Ash1 gene in yeast
  5.2 DroDARScan
    5.2.1 Analysis of FMR1 in Drosophila
6 Conclusions
References

List of Figures

2.1 A typical secondary structure (stemloop) of RNA.

2.2 An illustration of RBPs' roles.

2.3 An example PWM of RBP ELAVL1.

3.1 Entropy comparison between sequences generated by DiCodonShuffle [2] and our probabilistic shuffling method. For both methods, 1000 sequences are generated and, for our approach, the relative tolerance was set to ε = 10⁻¹. Sequences produced using DiCodonShuffle show much less diversity than those generated using our approach, indicating either a substantially limited accessibility of compatible sequences, or a substantial bias (non-stationarity) due to the bounded nature of their random walk.

3.2 Impact of weighted distribution on the number of occurrences of dinucleotides AU and GU. Under the uniform distribution (π(XY) = 1, blue), or with larger weights assigned to AU (π(AU) = 10, red) or GU (π(GU) = 10, green), 100 000 sequences compatible with an mRNA sequence encoding 179 amino acids (the first two exons of the oskar gene in D. melanogaster) were randomly generated. The concentration of the distribution, and the shift in expected DF observed for different weights, are the key ingredients of our method, allowing for an efficient approach based on adaptive sampling.

4.1 An example of the jqPlot graphical representation of a gene.

5.1 SPARCS interface.

5.2 Typical runtime of SPARCS for sequence lengths varying from 200 to 1000 nts.
5.3 Analysis of the protein-coding region of the ASH1 gene in yeast. The Z-scores of the base pair probability are represented in magenta and those of the base pair entropy in red. Structured, unstructured and disordered regions are displayed in green, blue and orange, and the functional elements E1, E2A and E2B are indicated at the bottom of the figure with yellow boxes. Dashed lines show the thresholds for determining high or low Z-score values.

5.4 Genes annotated with Alzheimer's disease.

5.5 Web interface of DroDARScan.

5.6 A sample output page from DroDARScan.

5.7 Resubmit option from DroDARScan.

5.8 Summary section of a gene in DroDARScan.

5.9 Target sites mapping of RBP bru-3 in gene FMR1 from UCSC genome browser.

Chapter 1
Introduction

RNA binding proteins (RBPs) play important roles in the regulation of many post-transcriptional events in gene expression and are key components of RNA metabolism. RBPs can influence the structure of RNAs through interactions with cis-regulatory elements and therefore have crucial roles in the stability, function, transport and localization of RNAs. Such cis-regulatory elements, which are targeted by trans-regulatory molecules such as microRNAs and RBPs, can reside anywhere in the mRNA molecule. They are typically defined by their secondary structure and primary sequence characteristics.

Eukaryotic cells encode a large number of RBPs. Each RBP has its own binding preferences for its targets, in terms of both structure and primary sequence. However, only a small fraction of RBPs have been characterized so far. A recent study reported a systematic way of detecting the RNA motifs recognized by a broad range of RBPs [3]. Its authors found that the vast majority of RBPs tend to bind single-stranded RNA and thus do not require a specific RNA secondary structure.
Building on this information, we developed analysis tools that take an mRNA of interest and decipher its regulatory logic, considering both structural and sequence features.

To detect the structural properties of an mRNA molecule, we first need to consider the difference between coding sequences and UTR sequences. Many published tools aim at discovering the structural landscape of UTR regions. However, none of them considers the special characteristics of the coding region of the mRNA. Recent studies suggest that coding regions of messenger RNAs often include secondary structure elements involved in post-transcriptional regulatory processes [4; 5; 6]. While many programs have been developed to analyze the folding properties of large non-coding RNAs [7] or untranslated regions of mRNAs [8], these tools cannot be directly applied to study structural properties in coding regions. Indeed, the sequence of codons that specifies the amino acid chain might bias the thermodynamic folding properties of the polynucleotide, thus preventing accurate estimation of the statistical significance of local structural motifs. Similar issues are encountered in large-scale studies and techniques aiming at defining RNA structure characteristics on a genome-wide scale [9; 10].

In fact, assessing the statistical significance of observed phenomena or patterns requires the definition of a reliable and expressive background model (a.k.a. the null hypothesis). In particular, any sequence property that is a natural consequence of a well-understood mechanism should be captured by the background model, so that it will generically appear in random sequences. Including these features in the background model will lead to an increased statistical significance for novel phenomena.

A classic exploratory approach starts with the random generation of sequences that share similar properties with a reference set of sequences.
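As a rough illustration of this kind of background-model comparison, the Python sketch below generates random sequences that share one property with a reference (its nucleotide composition), evaluates a metric on both, and reports an empirical Z-score and P-value. The shuffle-based background, the dinucleotide-count metric, the function names and the toy sequence are all assumptions made for illustration; they are not the method developed in this thesis, and a plain shuffle does not preserve the encoded protein.

```python
import random
import statistics


def dinucleotide_count(seq, dinuc):
    """Count occurrences of a dinucleotide at every position of seq."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc)


def shuffled_background(seq, n_samples, rng):
    """Generate n_samples random sequences with the same nucleotide composition.

    Hypothetical background model: a plain mononucleotide shuffle.
    """
    background = []
    for _ in range(n_samples):
        letters = list(seq)
        rng.shuffle(letters)
        background.append("".join(letters))
    return background


def empirical_significance(reference, background, metric):
    """Return (z_score, p_value) of metric(reference) against the background set."""
    observed = metric(reference)
    values = [metric(s) for s in background]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    # Z-score is reported as 0.0 when the background shows no variation.
    z_score = (observed - mean) / stdev if stdev > 0 else 0.0
    # One-sided empirical P-value with a +1 correction so it is never exactly 0.
    at_least = sum(1 for v in values if v >= observed)
    p_value = (at_least + 1) / (len(values) + 1)
    return z_score, p_value


rng = random.Random(0)  # fixed seed so the run is repeatable
reference = "AUAUAUAUAUGCGCGCGCGC"  # toy sequence, not real data
background = shuffled_background(reference, 1000, rng)
z, p = empirical_significance(
    reference, background, lambda s: dinucleotide_count(s, "AU")
)
print(f"Z-score = {z:.2f}, empirical P-value = {p:.4f}")
```

Swapping in a different `metric` function or a different generator for the background set leaves the significance computation unchanged, which is why the choice of background model carries most of the weight in such an analysis.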
Various metrics can then be evaluated, possibly leading to diverging distributions of values within the random and reference sets. The significance of such an observation can be empirically assessed using classic statistical tools (Z-score, P-value, etc.). To implement such an approach in the context of mRNAs, one must restrict random sequences to synonymous sequences (i.e.