Detecting and Comparing Non-Coding Rnas

“Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page i — #1 Detecting and comparing non-coding RNAs Giovanni Bussotti TESI DOCTORAL UPF / ANY 2012 THESIS DIRECTOR Dr. Cedric Notredame THESIS DEPARTMENT BIOINFORMATICS AND GENOMICS PROGRAMME AT CRG (CENTER FOR GENOMIC REGULATION) “Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page ii — #2 “Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page iii — #3 Vorrei dedicare non solo questa tesi ma più in generale questo periodo di quattro anni a tutti i familiari che hanno continuato ad appoggiarmi di la dal mare. Malgrado non sia possible recuperare il tempo passato, di certo bisogna rallegrarsi di quello presente e guardare con ottimismo al futuro. In particolare la mia dedica va ai miei due bei nipoti Giacomo e Paolo, perche’ possano avere sempre tanta curiosita’. Il mio ricordo e la mia dedica vanno anche a nonno Torinto e ai pomeriggi passati insieme a raccogliere frutta. « Sempre caro mi fu quest'ermo colle, e questa siepe, che da tanta parte dell'ultimo orizzonte il guardo esclude. Ma sedendo e mirando, interminati spazi di là da quella, e sovrumani silenzi, e profondissima quiete io nel pensier mi fingo, ove per poco il cor non si spaura. E come il vento odo stormir tra queste piante, io quello infinito silenzio a questa voce vo comparando: e mi sovvien l'eterno, e le morte stagioni, e la presente e viva, e il suon di lei. Così tra questa immensità s'annega il pensier mio: e il naufragar m'è dolce in questo mare. » Giacomo Leopardi, L’infinito iii “Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page iv — #4 “Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page v — #5 Acknowledgments During these unforgettable years in Barcelona I was very fortunate to be part of an extraordinary research group in an outstanding institute, the CRG. It would be impossible for me to say how thankful I am for all the help and support I have got, both from the scientific and the non-scientific personnel. I have taken advantage so many times of the interactions with people at CRG that it would just be impossible to acknowledge each and every single person. Still, I would like to take the opportunity to express gratitude to some people in particular. Above all I wish to give credit to my supervisor Dr. Cedric Notredame. Working under his guidance has been not just a unique opportunity to learn science from one of the most brilliant researchers I ever met, but also a pleasant human experience. I acknowledge him for all the efforts in managing our group wisely, and creating the most favourable working conditions ever. In these years I really enjoyed going to work every single day, and this is mostly merit of Cedric. I wish to acknowledge the members of my thesis committee, Dr. Roderic Guigó, Dr. Juan Valcárcel and Dr. Eduardo Eyras for their availability and help all along these years. I would like to say special thanks to Roderic Guigó. During these years I had the chance to collaborate with him and his group in several projects. This has been a unique opportunity for me to get in touch with cutting-edge researchers all across the world and, even if only a little, to participate in the prestigious ENCODE project. I would like to say thanks to all the people in Cedric’s and Roderic’s groups. Special thanks go to Ionas Erb and Carsten Kemena, colleagues and friends in these years. Especially Ionas followed my PhD progresses, and lastly participated in the revision of this thesis. v “Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page vi — #6 Finally, I would love to say thanks to Cedric Notredame, Roderic Guigó, Gian Gaetano Tartaglia and Anna Tramontano for providing references in the search of a post-doc position. vi “Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page vii — #7 Abstract In recent years there has been a growing interest in the field of non-coding RNA. This surge is a direct consequence of the discovery of a huge number of new non-coding genes, and of the finding that many of these transcripts are involved in key cellular functions. In this context, accurately detecting and comparing RNA sequences becomes extremely important. Aligning nucleotide sequences is one of the main requisite when searching for homologous genes. Accurate alignments reveal evolutionary relationships, conserved regions and more generally, any biologically relevant pattern. Comparing RNA molecules is, however, a challenging task. The nucleotide alphabet is simpler and therefore less informative than that of proteins. Moreover for many non-coding RNAs, evolution is likely to be mostly constrained at the structure level and not on the sequence level. This results in a very poor sequence conservation impeding the comparison of these molecules. These difficulties define a context where new methods are urgently needed in order to exploit experimental results at their full potential. These are the issues I have tried to address in my PhD. I have started by developing a novel algorithm able to reveal the homology relationship of distantly related ncRNA genes, and then I have applied the approach thus defined in combination with other sophisticated data mining tools to discover novel non-coding genes and generate genome-wide ncRNA predictions. vii “Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page viii — #8 Resumen En los últimos años el interés en el campo de los ARN no codificantes ha crecido mucho a causa del enorme aumento de la cantidad de secuencias no codificantes disponibles y a que muchos de estos transcriptos han dado muestra de ser importantes en varias funciones celulares. En este contexto, es fundamental el desarrollo de métodos para la correcta detección y comparativa de secuencias de ARN. Alinear nucleótidos es uno de los enfoques principales para buscar genes homólogos, identificar relaciones evolutivas, regiones conservadas y en general, patrones biológicos importantes. Sin embargo, comparar moléculas de ARN es una tarea difícil. Esto es debido a que el alfabeto de nucleótidos es más simple y por ello menos informativo que el de las proteínas. Además es probable que para muchos ARN la evolución haya mantenido la estructura en mayor grado que la secuencia, y esto hace que las secuencias sean poco conservadas y difícilmente comparables. Por lo tanto, hacen falta nuevos métodos capaces de utilizar otras fuentes de información para generar mejores alineamientos de ARN. En esta tesis doctoral se ha intentado dar respuesta exactamente a estas temáticas. Por un lado desarrollado un nuevo algoritmo para detectar relaciones de homología entre genes de ARN no codificantes evolutivamente lejanos. Por otro lado se ha hecho minería de datos mediante el uso de datos ya disponibles para descubrir nuevos genes y generar perfiles de ARN no codificantes en todo el genoma. viii “Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page ix — #9 Preface This work focuses on the comparative genomics of non-coding RNAs in the context of new sequencing technologies. Within this vast subject, this work aims at dealing with two extremely important research aspects nowadays: the development of new methods to align RNAs and the analysis of high-throughput data. Regarding the methodological aspect, this work introduces BlastR, a new in-silico tool able to reveal the homologous relationships between distantly related non-coding genes in a fast and reliable way. This tool is able to deal with poor sequence conservation by taking into account additional information sources and is less computationally demanding than state of the art methods. The data analysis part of this work is centred mainly on investigating the conservation of long non-coding RNAs using a combination of techniques. The unprecedented amount of expression data returned by next generation sequencing technologies allowed the detection of thousands of new and uncharacterized non-coding genes. Despite the fact that just a few dozens were functionally characterized, many of these genes are likely to be key regulators of diverse cellular processes and probably involved in important biological functions. Keywords non-coding RNA, long non-coding RNA, alignment, comparative biology, Blast, homology search. ix “Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page x — #10 “Giovanni˙Bussotti” — 2012/11/22 — 17:34 — page xi — #11 Index ABSTRACT --------------------------------------------------------------------------------------------------------- VII PREFACE ------------------------------------------------------------------------------------------------------------ IX KEYWORDS--------------------------------------------------------------------------------------------------------- IX ABBREVIATIONS ----------------------------------------------------------------------------------------------- XIII CHAPTER 1: INTRODUCTION ---------------------------------------------------------------------------------- 1 COMPARING NON CODING RNAS ---------------------------------------------------------------------------------- 6 NCRNA HOMOLOGUES DETECTION ------------------------------------------------------------------------------ 11 HIGH-THROUGHPUT TECHNOLOGIES AND GENOME-WIDE ANNOTATION OF NCRNAS -------------------- 15 CHAPTER 2: BLASTR ALGORITHM FOR NCRNA SEARCH ---------------------------------------- 27 CHAPTER 3: ANALYZING THE PIG TRANSCRIPTOME --------------------------------------------- 38 CHAPTER 4: ANALYZING THE HUMAN LNCRNA DATASET -------------------------------------- 53 DISCUSSION -------------------------------------------------------------------------------------------------------- 69 CONCLUSION ------------------------------------------------------------------------------------------------------

Detecting and Comparing Non-Coding Rnas

Eukaryotic Non-Coding Rnas: New Targets for Diagnostics and Therapeutics?

A Pirna-Like Small RNA Interacts with and Modulates P-ERM Proteins in Human Somatic Cells

Hammerhead Ribozymes Against Virus and Viroid Rnas

Questions with Answers- Nucleotides & Nucleic Acids A. the Components

Mirna Detection Using a Rolling Circle Amplification and RNA

Neo-Cloo2 B2 O 1 B 1 (3) REVERSE TRANSCRIPTION RNA PROCESSING

De Novo Nucleic Acids: a Review of Synthetic Alternatives to DNA and RNA That Could Act As † Bio-Information Storage Molecules

Design and Switch of Catalytic Activity with the Dnazyme–Rnazyme Combination

LESSON 4 Using Bioinformatics to Analyze Protein Sequences

Conditional Deoxyribozyme−Nanoparticle Conjugates For

Cytoplasmic Synthesis of Endogenous Alu Complementary DNA Via Reverse Transcription and Implications in Age-Related Macular Degeneration

Transitioning to DNA Genomes in an RNA World