Development of novel analysis and data integration systems to understand human gene regulation

Dissertation for the award of the doctoral degree Dr. rer. nat. of the Faculty of Mathematics and Computer Science of the Georg-August-Universität Göttingen, within the PhD Programme in Computer Science (PCS) of the Georg-August University School of Science (GAUSS)

submitted by Raza-Ur Rahman from Pakistan

Göttingen, April 2018

Thesis committee:
  Prof. Dr. Stefan Bonn, Zentrum für Molekulare Neurobiologie (ZMNH), Institut für Medizinische Systembiologie, Hamburg
  Prof. Dr. Tim Beißbarth, Institut für Medizinische Statistik, Universitätsmedizin, Georg-August Universität, Göttingen
  Prof. Dr. Burkhard Morgenstern, Institut für Mikrobiologie und Genetik, Abtl. Bioinformatik, Georg-August Universität, Göttingen

Examination board:
  Reviewer: Prof. Dr. Stefan Bonn, Zentrum für Molekulare Neurobiologie (ZMNH), Institut für Medizinische Systembiologie, Hamburg
  Second reviewer: Prof. Dr. Tim Beißbarth, Institut für Medizinische Statistik, Universitätsmedizin, Georg-August Universität, Göttingen

Further members of the examination board:
  Prof. Dr. Burkhard Morgenstern, Institut für Mikrobiologie und Genetik, Abtl. Bioinformatik, Georg-August Universität, Göttingen
  Prof. Dr. Carsten Damm, Institut für Informatik, Georg-August Universität, Göttingen
  Prof. Dr. Florentin Wörgötter, Physikalisches Institut Biophysik, Georg-August-Universität, Göttingen
  Prof. Dr. Stephan Waack, Institut für Informatik, Georg-August Universität, Göttingen

Date of the oral examination: 30 March 2018

Contents

List of Figures
Acknowledgements
Abstract
List of publications and softwares
Thesis structure

1 Biological Background Knowledge
  1.1 Deoxyribonucleic acid
  1.2 Gene expression
    1.2.1 Transcription start site
    1.2.2 RNA polymerase II
    1.2.3 Promoter
    1.2.4 Enhancers
    1.2.5 Transcription factors
  1.3 Alternative Splicing
  1.4 Small RNA (sRNA)
    1.4.1 MicroRNAs
    1.4.2 PIWI-interacting RNAs
    1.4.3 Small nucleolar RNAs
    1.4.4 Small interfering RNA
    1.4.5 Small nuclear RNAs
  1.5 Next generation sequencing
    1.5.1 RNA sequencing
      1.5.1.1 Method

2 Bioinformatics Background Knowledge
  2.1 Database management systems
    2.1.1 DBMS Architecture
  2.2 Types of databases
    2.2.1 Relational database systems
      2.2.1.1 Constraints
      2.2.1.2 Entity relationship model (ER model)
    2.2.2 Non-relational database systems
      2.2.2.1 Types of NoSQL databases
  2.3 Standard workflows for NGS data analysis
    2.3.1 Raw data (FASTQ)
    2.3.2 Quality control (QC)
      2.3.2.1 FastQC
    2.3.3 Adapter trimming
    2.3.4 Alignment and counting
    2.3.5 Differential expression (DE) analysis
  2.4 Biological ontologies
  2.5 Principles of supervised machine learning methods
    2.5.1 Classification
      2.5.1.1 Biological example
      2.5.1.2 Random forest
  2.6 Thesis related existing resources and research
    2.6.1 sRNA-seq analysis tools
      2.6.1.1 sRNA workbench
      2.6.1.2 CAP-miRSeq
      2.6.1.3 omiRas
      2.6.1.4 mirTools 2.0
      2.6.1.5 MAGI
      2.6.1.6 Chimira
      2.6.1.7 sRNAtoolbox
    2.6.2 sRNA expression databases
      2.6.2.1 miRmine
      2.6.2.2 DASHR
      2.6.2.3 Miratlas
      2.6.2.4 YM500v3
    2.6.3 Mutually exclusive splicing of exons
  2.7 Goals of the Thesis
    2.7.1 Online analysis of small RNA deep sequencing data (Oasis)
    2.7.2 sRNA expression atlas (SEA)
    2.7.3 Mutually exclusive splicing of exons

3 Results, Discussion and Outlook
  3.1 Online analysis of small RNA-seq data (Oasis 2)
    3.1.1 Oasis 2's module
    3.1.2 OasisCompressor
    3.1.3 Quality Control (QC)
    3.1.4 Functional enrichment analysis
  3.2 Small RNA expression atlas (SEA)
    3.2.1 System design
    3.2.2 Annotation tool
      3.2.2.1 Annotation criteria
    3.2.3 SEA web application
  3.3 Mutually exclusive splicing of exons
    3.3.1 Data sources
    3.3.2 Prediction of MXE candidates
    3.3.3 Validation of MXE candidates
    3.3.4 Spatio-temporal expression of MXEs
    3.3.5 Disease pathology prediction
  3.4 Conclusion and outlook

References

Appendices
  A Article 1
  B Article 2
  C Article 3

List of Figures

  1.1 DNA structure
  1.2 Gene expression
  1.3 Promoter, enhancers and TFs
  1.4 Forms of alternative splicing
  1.5 miRNA biogenesis
  1.6 piRNA biogenesis
  1.7 snoRNA biogenesis
  1.8 siRNA biogenesis
  1.9 RNA-seq library preparation workflow
  2.1 Three-level DBMS architecture
  2.2 DBMS architecture along with different ways of querying the DBMS
  2.3 ERD representation
  2.4 Standard workflow for NGS data analysis (RNA-seq, sRNA-seq)
  2.5 FastQ format
  2.6 FastQC per-base quality
  2.7 FastQC sequence quality
  2.8 Disease ontology
  2.9 Supervised machine learning
  2.10 Illustration of random forest algorithm
  3.1 Oasis 2 modules and workflow
  3.2 OasisCompressor
  3.3 Browser view of the primary output of sRNA detection module
  3.4 Assessment of Oasis 2's (QC) outlier detection
  3.5 SEA system architecture
  3.6 SEA data integration workflow
  3.7 Annotation tool
  3.8 SEA home page
  3.9 MXE illustration
  3.10 Spatio-temporal expression of MXEs
  3.11 MXE-ratio expression predicts disease pathology

Acknowledgements

First, I would like to thank Prof. Dr. Stefan Bonn for his guidance and helpful suggestions. He helped me expand my bioinformatics skills and taught me how to manage teams. I would also like to thank my Thesis Committee, Prof. Dr. Tim Beißbarth and Prof. Dr. Burkhard Morgenstern, who advised me on my various projects from time to time. I would like to thank the entire Bonn lab, who were very helpful and encouraging. I would especially like to thank Abhivyakti Gautam and Abdul Sattar, who helped me in the development of these projects. Finally, I would like to dedicate my PhD to my mother, Shams-un Nahar, for her ongoing love and support, and to my father, Atta Ur Rahman, who could not see this thesis completed.

Abstract

This thesis covers a broad range of bioinformatics methods, from the development of an analysis pipeline to data integration and the development of an expression atlas (database and web application development). In addition, an in silico method was developed to annotate the genome with novel features and to predict diseases based on expression profiles.

Development of online analysis of small RNA sequencing data

Small RNAs (sRNAs) are biomolecules that play important roles in organismal health and disease; as such, sRNA dysregulation can cause severe diseases. The modern method of choice for sRNA expression profiling is sRNA sequencing (sRNA-seq). Several sRNA-seq analysis platforms are available that differ in their analysis portfolio, performance, and user-friendliness.
However, these analysis platforms lack one or more important features, such as disease biomarker identification, detection of viral and bacterial infections in sRNA-seq samples, storage of novel predicted miRNAs, multivariate differential expression (DE) analysis and automated submission