Brno University of Technology Vysoké Učení Technické V Brně
Total Page:16
File Type:pdf, Size:1020Kb
BRNO UNIVERSITY OF TECHNOLOGY VYSOKÉ UČENÍ TECHNICKÉ V BRNĚ FACULTY OF ELECTRICAL ENGINEERING AND COMMUNICATION FAKULTA ELEKTROTECHNIKY A KOMUNIKAČNÍCH TECHNOLOGIÍ DEPARTMENT OF BIOMEDICAL ENGINEERING ÚSTAV BIOMEDICÍNSKÉHO INŽENÝRSTVÍ TRANSCRIPTOMIC CHARACTERIZATION USING RNA-SEQ DATA ANALYSIS TRANSKRIPTOMICKÁ CHARAKTERIZACE POMOCÍ ANALÝZY RNA-SEQ DAT DOCTORAL THESIS DIZERTAČNÍ PRÁCE AUTHOR Layal Abo Khayal AUTOR PRÁCE SUPERVISOR prof. Ing. Ivo Provazník, Ph.D. ŠKOLITEL BRNO 2017 ABSTRACT The high-throughputs sequence technologies produce a massive amount of data, that can reveal new genes, identify splice variants, and quantify gene expression genome-wide. However, the volume and the complexity of data from RNA-seq experiments necessitate a scalable, and mathematical analysis based on a robust statistical model. Therefore, it is challenging to design integrated workflow, that incorporates the various analysis procedures. Particularly, the comparative transcriptome analysis is complicated due to several sources of measurement variability and poses numerous statistical challenges. In this research, we performed an integrated transcriptional profiling pipeline, which generates novel reproducible codes to obtain biologically interpretable results. Starting with the annotation of RNA-seq data and quality assessment, we provided a set of codes to serve the quality assessment visualization needed for establishing the RNA-Seq data analysis experiment. Additionally, we performed comprehensive differential gene expression analysis, presenting descriptive methods to interpret the RNA-Seq data. For implementing alternative splicing and differential exons usage analysis, we improved the performance of the Bioconductor package DEXSeq by defining the open reading frame of the exonic regions, which are differentially used between biological conditions due to the alternative splicing of the transcripts. Furthermore, we present a new methodology to analyze the differentially expressed long non-coding RNA, by finding the functional correlation of the long non-coding RNA with neighboring differential expressed protein coding genes. Thus, we obtain a clearer view of the regulation mechanism, and give a hypothesis about the role of long non-coding RNA in gene expression regulation. KEYWORDS: RNA-Seq, Differential Gene Expression (DGE), Alternative splicing, Differential Exon Usage (DEU), long non-coding RNA (lncRNA). I ABSTRAKT Vysoce výkonné sekvenční technologie produkují obrovské množství dat, která mohou odhalit nové geny, identifikovat splice varianty a kvantifikovat genovou expresi v celém genomu. Objem a složitost dat z RNA-seq experimentů vyžadují škálovatelné metody matematické analýzy založené na robustníchstatistických modelech. Je náročné navrhnout integrované pracovní postupy, které zahrnují různé postupy analýzy. Konkrétně jsou to srovnávací testy transkriptů, které jsou komplikovány několika zdroji variability měření a představují řadu statistických problémů. V tomto výzkumu byla sestavena integrovaná transkripční profilová pipeline k produkci nových reprodukovatelných kódů pro získání biologicky interpretovovatelných výsledků. Počínaje anotací údajů RNA-seq a hodnocení kvality je navržen soubor kódů, který slouží pro vizualizaci hodnocení kvality, potřebné pro zajištění RNA-Seq experimentu s analýzou dat. Dále je provedena komplexní diferenciální analýza genových expresí, která poskytuje popisné metody pro testované RNA-Seq data. Pro implementaci analýzy alternativního sestřihu a diferenciálních exonů jsme zlepšili výkon DEXSeq definováním otevřeného čtecího rámce exonového regionu, který se používá alternativně. Dále je popsána nová metodologie pro analýzu diferenciálně exprimované dlouhé nekódující RNA nalezením funkční korelace této RNA se sousedícími diferenciálně exprimovanými geny kódujícími proteiny. Takto je získán jasnější pohled na regulační mechanismus a poskytnuta hypotéza o úloze dlouhé nekódující RNA v regulaci genové exprese. KLÍČOVÁ SLOVA RNA-Seq, diferenciální genová exprese (DGE), alternativní splicing, diferenciální použití exonů (DEU), dlouhá nekódující RNA (lncRNA). II BIBLIOGRAPHIC CITATION Abo Khayal, Layal. Transcriptomic characterization using RNA-Seq data analysis Brno: Brno University of Technology, Faculty of Electrical Engineering and Communication, 2018. 125 p. Academic advisor: Prof. Ing. Ivo Provazník, Ph.D. III DECLARATION I declare that I have written my doctoral thesis on the theme of “Transcriptomic characterization using RNA-Seq data analysis” independently, under the guidance of the doctoral thesis supervisor and using the technical literature and other sources of information which are all quoted in the thesis and detailed in the list of literature at the end of the thesis. As the author of the doctoral thesis I furthermore declare that, as regards the creation of this doctoral thesis, I have not infringed any copyright. In particular, I have not unlawfully encroached on anyone’s personal and/or ownership rights and I am fully aware of the consequences in the case of breaking Regulation §11 and the following of the Copyright Act No 121/2000 Sb., and of the rights related to intellectual property right and changes in some Acts (Intellectual Property Act) and formulated in later regulations, inclusive of the possible consequences resulting from the provisions of Criminal Act No 40/2009 Sb., Section 2, Head VI, Part 4. Brno 18.10.2017 .................................. Ing. Layal Abo Khayal IV ACKNOWLEDGEMENT I would like to thank all the people who supported me during my doctoral study in the Department of Biomedical Engineering at Brno University of Technology, and during my research stay in Charité medical university, Germany. Especially, I thank my supervisor Prof. Ing. Ivo Provaznik, Ph.D. head of the Department of Biomedical Engineering at Brno University of Technology for his endless support. I would like also to express my gratitude and appreciation to my external supervisor Prof. Dr. Med. Peter N Robinson from Charité Berlin, for his professional guidance. Special thanks to my family in Syria and friends for their great patience, persistence, and inspiration. I dedicate my PhD thesis to my mother, and to my great mother and homeland Syria, wishing peace to return to its whole lands soon. Brno, 18.10.2017 …………………………….. Ing. Layal Abo Khayal V TABLE OF CONTENTS Introduction to RNA Sequencing ........................................................................................................................ 1 1. THE THEORETICAL REVIEW ............................................................................................................... 3 1.1. Isolation of RNAs ................................................................................................................................. 3 1.2. Quality Control of RNA ........................................................................................................................ 3 1.3. Library Preparation ............................................................................................................................... 4 1.4. RNA-Seq Platforms .............................................................................................................................. 6 1.4.1 Illumina ........................................................................................................................................ 6 1.4.2 SOLID .......................................................................................................................................... 7 1.4.3 Roche 454 ..................................................................................................................................... 7 1.4.4 Ion Torrent .................................................................................................................................... 9 1.4.5 Pacific Biosciences ..................................................................................................................... 10 1.4.6 Nanopore Technologies .............................................................................................................. 13 1.5. RNA-Seq Applications ........................................................................................................................ 14 1.5.1 Protein Coding Gene Structure ................................................................................................... 15 1.5.2 Novel Protein-Coding Genes ...................................................................................................... 16 1.5.3 Quantifying and Comparing Gene Expression ........................................................................... 16 1.5.4 Expression Quantitative Trait Loci (eQTL) ............................................................................... 18 1.5.5 Single-Cell RNA-Seq ................................................................................................................. 18 1.5.6 Fusion Genes .............................................................................................................................. 18 1.5.7 Gene Variations .......................................................................................................................... 19 1.5.8 Long Noncoding RNAs .............................................................................................................. 19 1.5.9 Small Noncoding RNAs (miRNA-seq) .....................................................................................