Characterizing Alternative Splicing and Long Non
Total Page:16
File Type:pdf, Size:1020Kb
CHARACTERIZING ALTERNATIVE SPLICING AND LONG NON-CODING RNA WITH HIGH-THROUGHPUT SEQUENCING TECHNOLOGY Ao Zhou Submitted to the faculty of the University Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy in the School of Informatics and Computing, Indiana University October 2018 Accepted by the Graduate Faculty of Indiana University, in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Doctoral Committee ______________________________________ Huanmei Wu, Ph.D., Chair ______________________________________ Yunlong Liu, Ph.D. October 2018 ______________________________________ Sarath C. Janga, Ph.D. ______________________________________ Xiaowen Liu, Ph.D. ii © 2018 Ao Zhou iii DEDICATION To my parents, who grew the love of science in me and constantly supported me from the beginning of my life. iv ACKNOWLEDGEMENTS The 10 years’ odyssey of Ph.D. study was full of ups and downs. During the past decade, I worked with many people, I learned from many people, I received generous help from many people, and I shared happiness with many people. First of all, I would like to thank Prof. Yunlong Liu, who led me in the gate of next generation sequencing technology, and continuously helped me to improve my scientific knowledge and skills during the past 8 years. Besides his selfless support in academics, his optimism and “can do” spirit inspired me during the down times, and will continue inspiring me in my future career. I would also like to thank Prof. Jake Chen, who taught me a lot on database construction and management skills. He tremendously influenced me with his scientific way of thinking and logical presentations skills. During the preparation and writing of this dissertation, I have received invaluable advice from Prof. Huanmei Wu, Prof. Sarath Janga and Prof. Xiaowen Liu. My appreciation also extend to Prof. Howard Edenberg, Prof. Todd Skaar, Prof. Yue Wang, Prof. Lang Li and Prof. Keith Dunker, who shared their precious data and algorithm which made this dissertation possible. I am also enormously grateful to all those who collaborated with me and helped me during my Ph.D. study. They are Dr. Meng Li, Dr. Hai Lin, Dr. Yangyang Hao, Mr. Ed Simpson, Dr. Weilun Hsu, Dr. Fei Huang, Ms. Bo He, Dr. Fan Zhang and Dr. Xiaogang Wu. I especially appreciate the help from Ms. Elizabeth Cassell on student affairs and the organization of my dissertation defense. My special thanks goes to Dr. Christian Haudenschild, my supervisor and mentor at Personalis Inc., who taught me a lot about the genomic industry and generously supported me on finishing this dissertation. v The work in Chapter 2 was supported in part by the Medical and Molecular Genetics, Indiana University School of Medicine Startup Funds, Showalter Trust Award and by the Indiana Clinical and Translational Sciences Institute, funded in part by grant # TR 000006 from the National Institutes of Health, National Center for Advancing Translational Sciences, Clinical and Translational Sciences Award. The work in Chapter 4 was supported by the grants from the National Institutes of Health AA017941, CA113001, GM085121, and GM088076. The work in Chapter 5 was supported by a grant from the National Cancer Institute (U24CA126480-01), which is a part of NCI’s Clinical Proteomic Technologies Initiative (http://proteomics.cancer.gov). vi Ao Zhou CHARACTERIZING ALTERNATIVE SPLICING AND LONG NON-CODING RNA WITH HIGH-THROUGHPUT SEQUENCING TECHNOLOGY Several experimental methods has been developed for the study of the central dogma since late 20th century. Protein mass spectrometry and next generation sequencing (including DNA-Seq and RNA-Seq) forms a triangle of experimental methods, corresponding to the three vertices of the central dogma, i.e., DNA, RNA and protein. Numerous RNA sequencing and protein mass spectrometry experiments has been carried out in attempt to understand how the expression change of known genes affect biological functions in various of organisms, however, it has been once overlooked that the result data of these experiments are in fact holograms which also reveals other delicate biological mechanisms, such as RNA splicing and the expression of long non-coding RNAs. In this dissertation, we carried out five studies based on high-throughput sequencing data, in an attempt to understand how RNA splicing and differential expression of long non-coding RNAs is associated biological functions. In the first two studies, we identified and characterized 197 stimulant induced and 477 developmentally regulated alternative splicing events from RNA sequencing data. In the third study, we introduced a method for identifying novel alternative splicing events that were never documented. In the fourth study, we introduced a method for identifying known and novel RNA splicing junctions from protein mass spectrometry data. In the fifth study, we introduced a method for identifying long non-coding RNAs from poly-A selected RNA sequencing data. Taking advantage of these methods, we turned RNA sequencing and protein mass spectrometry data into an information gold mine of splicing and long non-coding RNA activities. Huanmei Wu, Ph.D., Chair vii TABLE OF CONTENTS List of Tables ............................................................................................................... xiii List of Figures ............................................................................................................. xiv List of Abbreviations................................................................................................... xvi Chapter 1. Introduction to Sequencing and Data Analysis Methods ................... 1 1.1 Next Generation Nucleic Acid Sequencing Technology .......................... 1 1.1.1 Brief History of Nucleic Acid Discovery ................................................... 1 1.1.1 Basic DNA Sequencing Methods ............................................................... 2 1.1.2 High-throughput DNA Sequencing Methods ............................................. 3 1.1.3 Sequencing Reads Alignment..................................................................... 7 1.1.4 Gene Expression Analysis Methods ........................................................... 7 1.2 Protein Mass Spectrometry Technology ................................................... 8 1.2.1 Brief History of Protein Science................................................................. 8 1.2.2 Protein Mass Spectrometry ......................................................................... 9 1.3 Alternative Splicing ................................................................................ 10 1.3.1 Discovery of Alternative Splicing ............................................................ 10 1.3.2 Alternative Splicing Identification Methods ............................................ 10 1.3.3 Algorithms of MISO................................................................................. 12 1.3.4 Algorithms of rMATS .............................................................................. 15 1.4 Long Non-coding RNA........................................................................... 18 1.5 Organization of the Dissertation ............................................................. 20 Chapter 2. Stimulant Induced Alternative Splicing ........................................... 21 2.1 Background ............................................................................................. 21 2.2 Results ..................................................................................................... 23 2.2.1 LPS-Induced Alternative Splicing ............................................................ 23 2.2.2 Protein Domains are Differentially Spliced ............................................. 32 viii 2.2.3 AS in Protein Domains may Affect Protein Interactions ......................... 33 2.2.4 Intrinsic Disorder and MoRF in AS Regions ........................................... 35 2.2.5 PTM within Differentially Spliced Regions ............................................. 37 2.2.6 Characterization of Potential Splicing Regulators ................................... 39 2.3 Discussion ............................................................................................... 39 2.4 Methods................................................................................................... 45 2.4.1 Preparation of Mouse BMSCs .................................................................. 45 2.4.2 RNA Sample Preparation and RNA-seq Assay........................................ 45 2.4.3 Bioinformatics Analysis for RNA-seq Data ............................................. 46 2.4.4 Data Processing and Quality Assessment................................................. 46 2.4.5 Sequence Alignment ................................................................................. 46 2.4.6 Alternative Splicing Analysis ................................................................... 47 2.4.7 Ontological Annotations ........................................................................... 47 2.4.8 Protein Domains Overlapping AS regions ............................................... 48 2.4.9 Identification of Protein Interactions ........................................................ 48 2.4.10 Other Characterizations ............................................................................ 48 Chapter 3. Developmentally Regulated Alternative Splicing ...........................