Development of Novel Analysis and Data Integration Systems to Understand Human Gene Regulation

Dissertation for the award of the doctoral degree Dr. rer. nat. of the Faculty of Mathematics and Computer Science of the Georg-August-Universität Göttingen, within the PhD Programme in Computer Science (PCS) of the Georg-August University School of Science (GAUSS)

submitted by Raza-Ur Rahman from Pakistan

Göttingen, April 2018

Thesis committee:
Prof. Dr. Stefan Bonn, Zentrum für Molekulare Neurobiologie (ZMNH), Institut für Medizinische Systembiologie, Hamburg
Prof. Dr. Tim Beißbarth, Institut für Medizinische Statistik, Universitätsmedizin, Georg-August Universität, Göttingen
Prof. Dr. Burkhard Morgenstern, Institut für Mikrobiologie und Genetik, Abtl. Bioinformatik, Georg-August Universität, Göttingen

Examination board:
Referee: Prof. Dr. Stefan Bonn, Zentrum für Molekulare Neurobiologie (ZMNH), Institut für Medizinische Systembiologie, Hamburg
Co-referee: Prof. Dr. Tim Beißbarth, Institut für Medizinische Statistik, Universitätsmedizin, Georg-August Universität, Göttingen

Further members of the examination board:
Prof. Dr. Burkhard Morgenstern, Institut für Mikrobiologie und Genetik, Abtl. Bioinformatik, Georg-August Universität, Göttingen
Prof. Dr. Carsten Damm, Institut für Informatik, Georg-August Universität, Göttingen
Prof. Dr. Florentin Wörgötter, Physikalisches Institut Biophysik, Georg-August-Universität, Göttingen
Prof. Dr. Stephan Waack, Institut für Informatik, Georg-August Universität, Göttingen

Date of the oral examination: 30 March 2018

Contents

List of Figures
Acknowledgements
Abstract
List of publications and softwares
Thesis structure

1 Biological Background Knowledge
  1.1 Deoxyribonucleic acid
  1.2 Gene expression
    1.2.1 Transcription start site
    1.2.2 RNA polymerase II
    1.2.3 Promoter
    1.2.4 Enhancers
    1.2.5 Transcription factors
  1.3 Alternative Splicing
  1.4 Small RNA (sRNA)
    1.4.1 MicroRNAs
    1.4.2 PIWI-interacting RNAs
    1.4.3 Small nucleolar RNAs
    1.4.4 Small interfering RNA
    1.4.5 Small nuclear RNAs
  1.5 Next generation sequencing
    1.5.1 RNA sequencing
      1.5.1.1 Method

2 Bioinformatics Background Knowledge
  2.1 Database management systems
    2.1.1 DBMS Architecture
  2.2 Types of databases
    2.2.1 Relational database systems
      2.2.1.1 Constraints
      2.2.1.2 Entity relationship model (ER model)
    2.2.2 Non-relational database systems
      2.2.2.1 Types of NoSQL databases
  2.3 Standard workflows for NGS data analysis
    2.3.1 Raw data (FASTQ)
    2.3.2 Quality control (QC)
      2.3.2.1 FastQC
    2.3.3 Adapter trimming
    2.3.4 Alignment and counting
    2.3.5 Differential expression (DE) analysis
  2.4 Biological ontologies
  2.5 Principles of supervised machine learning methods
    2.5.1 Classification
      2.5.1.1 Biological example
      2.5.1.2 Random forest
  2.6 Thesis related existing resources and research
    2.6.1 sRNA-seq analysis tools
      2.6.1.1 sRNA workbench
      2.6.1.2 CAP-miRSeq
      2.6.1.3 omiRas
      2.6.1.4 mirTools 2.0
      2.6.1.5 MAGI
      2.6.1.6 Chimira
      2.6.1.7 sRNAtoolbox
    2.6.2 sRNA expression databases
      2.6.2.1 miRmine
      2.6.2.2 DASHR
      2.6.2.3 Miratlas
      2.6.2.4 YM500v3
    2.6.3 Mutually exclusive splicing of exons
  2.7 Goals of the Thesis
    2.7.1 Online analysis of small RNA deep sequencing data (Oasis)
    2.7.2 sRNA expression atlas (SEA)
    2.7.3 Mutually exclusive splicing of exons

3 Results, Discussion and Outlook
  3.1 Online analysis of small RNA-seq data (Oasis 2)
    3.1.1 Oasis 2's module
    3.1.2 OasisCompressor
    3.1.3 Quality Control (QC)
    3.1.4 Functional enrichment analysis
  3.2 Small RNA expression atlas (SEA)
    3.2.1 System design
    3.2.2 Annotation tool
      3.2.2.1 Annotation criteria
    3.2.3 SEA web application
  3.3 Mutually exclusive splicing of exons
    3.3.1 Data sources
    3.3.2 Prediction of MXE candidates
    3.3.3 Validation of MXE candidates
    3.3.4 Spatio-temporal expression of MXEs
    3.3.5 Disease pathology prediction
  3.4 Conclusion and outlook

References

Appendices
  A Article 1
  B Article 2
  C Article 3

List of Figures

1.1 DNA structure
1.2 Gene expression
1.3 Promoter, enhancers and TFs
1.4 Forms of alternative splicing
1.5 miRNA biogenesis
1.6 piRNA biogenesis
1.7 snoRNA biogenesis
1.8 siRNA biogenesis
1.9 RNA-seq library preparation workflow
2.1 Three-level DBMS architecture
2.2 DBMS architecture along with different ways of querying the DBMS
2.3 ERD representation
2.4 Standard workflow for NGS data analysis (RNA-seq, sRNA-seq)
2.5 FastQ format
2.6 FastQC per-base quality
2.7 FastQC sequence quality
2.8 Disease ontology
2.9 Supervised machine learning
2.10 Illustration of random forest algorithm
3.1 Oasis 2 modules and workflow
3.2 OasisCompressor
3.3 Browser view of the primary output of sRNA detection module
3.4 Assessment of Oasis 2' (QC) outlier detection
3.5 SEA system architecture
3.6 SEA data integration workflow
3.7 Annotation tool
3.8 SEA home page
3.9 MXE illustration
3.10 Spatio-temporal expression of MXEs
3.11 MXE-ratio expression predicts disease pathology

Acknowledgements

First, I would like to thank Prof. Dr. Stefan Bonn for his guidance and helpful suggestions; he helped me expand my bioinformatics skills and guided me in learning to manage teams. I would also like to thank my Thesis Committee, Prof. Dr. Tim Beißbarth and Prof. Dr. Burkhard Morgenstern, who advised me on my various projects from time to time. I would like to thank the entire Bonn lab, who were very helpful and encouraging. I would especially like to thank Abhivyakti Gautam and Abdul Sattar, who helped me in the development of these projects. Finally, I would like to dedicate my PhD to my mother, Shams-un Nahar, for her ongoing love and support, and to my father, Atta Ur Rahman, who could not see this thesis completed.

Abstract

This thesis covers a broad range of bioinformatics methods, from the development of an analysis pipeline to data integration and the development of an expression atlas (database and web application development). In addition, an in silico method was developed to annotate the genome with novel features and to predict diseases based on expression profiles.

Development of online analysis of small RNA sequencing data

Small RNAs (sRNAs) are biomolecules that play important roles in organismal health and disease; as such, sRNA dysregulation can cause severe diseases. The modern method of choice for sRNA expression profiling is sRNA sequencing (sRNA-seq). Several sRNA-seq analysis platforms are available that differ in their analysis portfolio, performance, and user-friendliness.
However, these analysis platforms lack one or more important features, such as disease biomarker identification, detection of viral and bacterial infections in sRNA-seq samples, storage of novel predicted miRNAs, multivariate differential expression (DE) analysis and automated submission
