APPLICATIONS OF HIGH-THROUGHPUT SEQUENCING DATA ANALYSIS IN TRANSCRIPTIONAL STUDIES

A Dissertation
by
ZHENGYU GUO

Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

Chair of Committee, Aniruddha Datta
Committee Members, Ulisses Braga-Neto
                   Xiaoning Qian
                   Alan Dabney
Head of Department, Miroslav M. Begovic

December 2017

Major Subject: Electrical Engineering

Copyright 2017 Zhengyu Guo

ABSTRACT

High-throughput sequencing has become one of the most powerful tools for studies in genomics, transcriptomics, epigenomics, and metagenomics. In recent years, HTS protocols for enhancing the understanding of the diverse cellular roles of RNA have been designed, such as RNA-Seq, CLIP-Seq, and RIP-Seq. In this work, we explore the applications of HTS data analysis in transcriptional studies. First, the differential expression analysis of RNA-Seq data is discussed and applied to a sheep RNA-Seq dataset to examine the biological mechanisms of the sheep resistance to worm infection. We develop an automatic pipeline to analyze the RNA-Seq dataset, and use a negative binomial model for gene expression analysis. Functional analysis is conducted over the differentially expressed genes, and a broad range of mechanisms providing protection against the parasite are identified in the resistant sheep breed. This study provides insights into the underlying biology of sheep host resistance. Then, a deep learning method is proposed to predict the RNA binding protein binding preferences using CLIP-Seq data. The proposed method uses a deep convolutional autoencoder to effectively learn the robust sequence features, and a softmax classifier to predict the RBP binding sites. To demonstrate the efficacy of the proposed method, we evaluate its performance over a dataset containing 31 CLIP-Seq experiments.
This benchmarking shows that the proposed method improves the prediction performance in terms of AUC, compared with the existing methods. The analysis also shows that the proposed method can provide insights for identifying new RBP binding motifs. Therefore, the proposed method will be of great help in understanding the dynamic regulation of RBPs in various biological processes and diseases. Finally, a database is created to facilitate the reuse of publicly available mouse RNA-Seq datasets. The metadata of the publicly available mouse RNA-Seq datasets is manually curated and is served by a well-designed website. The database can be scaled up in the future to serve more types of HTS data.

DEDICATION

To my family.

ACKNOWLEDGMENTS

I would like to express my gratitude to Dr. Aniruddha Datta for his kind help with this work and during my stay at Texas A&M University. I thank Dr. Ulisses Braga-Neto, Dr. Xiaoning Qian, and Dr. Alan Dabney for serving on my committee and for their support. I thank my fellow students and friends, Yue Gan, Yukun Tan, and Xingde Jiang, for their friendship, memories, and support. Most importantly, I would like to thank my family for their encouragement, love, and help.

CONTRIBUTORS AND FUNDING SOURCES

Contributors

This work was supervised by a dissertation committee consisting of Dr. Aniruddha Datta, Dr. Ulisses Braga-Neto, and Dr. Xiaoning Qian of the Department of Electrical and Computer Engineering and Dr. Alan Dabney of the Department of Statistics.

The content of Chapter 2 was completed by the student in collaboration with Dr. Robert Li. Jorge Francisco González and Dr. Robert Li conceived and designed the experiment. Biological experiments were performed by Julia Hernandez, Tom McNeilly, Yolanda Corripio-Miyar, David Frew, Jorge Francisco González, and Dr. Robert Li. The manuscript that partly forms Chapter 2 was written by Dr. Li.

In Chapter 4, the curation was conducted in part by Boriana Tzvetkova, Jennifer M.
Bassik, Tara Bodziak and Brianna Wojnar from Dr. Bo Hua Hu’s team. The website was built in part by Wei Qiao and Md Obaida. Dr. Peng Yu supervised the work of Chapter 2 and Chapter 4.

All other work conducted for the dissertation was completed by the student independently.

Funding Sources

The student was supported by TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering (CBGSE) at Texas A&M University.

This work was made possible in part by startup funding from the ECE department and Texas A&M Engineering Experiment Station/Dwight Look College of Engineering at Texas A&M University, NIDCD R01DC010154, AGL2009/09985 and Fondo Social Europeo (FSE).

NOMENCLATURE

AUC        Area Under the receiver operating characteristic Curve
CHB        Canaria Hair Breed
CLIP-Seq   High-throughput sequencing of RNA isolated by crosslinking immunoprecipitation
CAE        Convolutional Autoencoder
CNN        Convolutional Neural Network
CS         Canaria Sheep
DEG        Differentially Expressed Gene
dpi        Days post infection
FDR        False Discovery Rate
GO         Gene Ontology
HTS        High Throughput Sequencing
NB         Negative Binomial
PWM        Positional Weight Matrix
RBP        RNA Binding Protein
ReLU       Rectified Linear Unit
RNA-Seq    RNA Sequencing

TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGMENTS
CONTRIBUTORS AND FUNDING SOURCES
NOMENCLATURE
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES

1. INTRODUCTION
   1.1 DNA, RNA and Protein
   1.2 The Central Dogma of Molecular Biology
       1.2.1 DNA Replication
       1.2.2 Transcription
       1.2.3 Translation
   1.3 High-Throughput Sequencing
   1.4 High-Throughput Sequencing for Transcriptomics
   1.5 Dissertation Outline

2. COMPUTATIONAL DIFFERENTIAL EXPRESSION ANALYSIS FOR UNDERSTANDING THE MOLECULAR BASIS OF SHEEP HOST RESISTANCE
   2.1 Introduction
       2.1.1 Sheep Host Resistance
       2.1.2 Comparative Analysis of Indigenous Sheep from Canary Islands
   2.2 Materials and Methods
       2.2.1 RNA Extraction and Sequencing
       2.2.2 Computational Analysis of Sheep Host Resistance RNA-Seq Data
       2.2.3 Data Preprocessing and Alignment
       2.2.4 Differential Expression Analysis
       2.2.5 Gene Ontology Analysis
   2.3 Results
       2.3.1 Haemonchus Infection Induced Distinctly Different Transcriptome Patterns in the Abomasal Mucosa of CHB and CS Breeds
       2.3.2 Gene Ontology Implicated in Host Resistance
   2.4 Discussion

3. A CONVOLUTIONAL AUTOENCODER BASED METHOD FOR PREDICTING RNA PROTEIN INTERACTIONS
   3.1 Introduction
   3.2 Materials and Methods
       3.2.1 Dataset
       3.2.2 RNA Sequence Encoding
       3.2.3 Convolutional Layers
       3.2.4 Model Architecture
       3.2.5 Model Implementation
   3.3 Results
       3.3.1 Model Evaluation
       3.3.2 Insight in Motif Discovery
   3.4 Conclusion

4. CREATION OF A DATABASE AND WEB-SERVER FOR METADATA OF PUBLICLY AVAILABLE MOUSE RNA-SEQ DATA SETS
   4.1 Introduction
   4.2 Methods
   4.3 Results
   4.4 Discussion

5. CONCLUSIONS

REFERENCES