Genomic Data Integration with Hidden Markov Models to Understand Transcription Regulation

Dissertation for the attainment of the doctoral degree of the Faculty of Chemistry and Pharmacy of the Ludwig-Maximilians-Universität München

Benedikt Zacher, from Munich, Germany, 2016

Declaration
This dissertation was supervised by Prof. Dr. Patrick Cramer in accordance with § 7 of the doctoral degree regulations (Promotionsordnung) of 28 November 2011.

Affidavit
This dissertation was prepared independently and without unauthorized assistance.
Munich, 11.03.2016
Benedikt Zacher

Dissertation submitted on 11.03.2016
First reviewer: Prof. Dr. Patrick Cramer
Second reviewer: Prof. Dr. Julien Gagneur
Oral examination on 02.05.2016

Acknowledgments

I am very grateful to Prof. Dr. Julien Gagneur for his supervision and guidance during this project. He created a working atmosphere in his group where I could thrive. He gave suggestions and help whenever needed, and enough freedom to explore and follow my own ideas. He trusted me with planning my own projects, which gave me the unique opportunity to visit Stanford University during my work on this thesis.

Prof. Dr. Achim Tresch has been an important person since the very beginning of my career as a researcher, and I have enjoyed working with him ever since. I am thankful for his support and encouragement, not only during this thesis. It is also thanks to him that I chose to follow a scientific career. His casual manner made it very easy and fun to work with him and to not take things too seriously once in a while.

I also want to thank Prof. Dr. Patrick Cramer for supervising this thesis and for his support during collaborations ever since I started working at the Gene Center. It is admirable to see him manage his full schedule while still keeping his interest and positive attitude, which you can feel in his whole group. I would like to thank the rest of my thesis committee: Dr. Dietmar Martin, Prof. Dr. Klaus Förstemann and Prof. Dr. Karl-Peter Hopfner.
I owe my gratitude to all present and former members of the Gagneur, Tresch, Cramer and Söding groups for an extraordinary working atmosphere. Special thanks to Dr. Björn ’Poony’ Schwalb, Daniel Bader and Chris Mertes for withstanding my stupid jokes in the office (and all the rest who were not sitting as close but had to deal with it once in a while). Many thanks to Michael Lidschreiber for sharing data way before publication, for introducing me to ’Extrem Bosseln’ and for just having a nice time. Many, many thanks again to Björn Schwalb and Margaux Michel for working together and much beyond that! Many thanks to Carina Demel, Daniel Schulz, Matthias Siebert, Phillipp Torkler, Saskia Gressel, Jürgen Niesser, Katja Frühauf and Wolfgang Mühlbacher for fun times within and beyond science.

I would be nothing without my friends. Thank you for being in my life: Jannis, Zissy, Magda, Doris, Joni, Jan, Biddy, Paul, Sara, Veri, Micha, Jimmy, Domi, Flo, Clemens, Anna, Simon, Tisy, X, Laia, Julia N., Martin, Jesse, Joubi, Simone, Amrei, Peti, Julia I., Alex and Miri. And everybody I forgot, you know who you are.

Further, I would like to thank Siegfried, Christine and Marvin Maschke. I am very lucky to have you in my family! I owe my deepest gratitude to my parents, Silvia and Werner, who made this (and me) possible with their support, love and trust. I also want to thank my sister Vroni for being my sister. Above all and everything, I want to thank my wife Jasmin. Writing down everything that I owe you would take another 3.5 years, so I’ll make it short: You are the love of my life!

Summary

Transcription is a tightly controlled process that involves the recruitment and post-translational modification of DNA-associated protein complexes, which can be mapped to the genome using high-throughput experimental assays.
An accurate annotation of genomic elements, such as transcription units or cis-regulatory elements like promoters and enhancers, is crucial for the use and interpretation of data generated by these assays. Integrative analysis of high-throughput genomic data with hidden Markov models (HMMs) has therefore become a popular tool for genome annotation. However, current algorithms are limited by unrealistic data distribution assumptions and variance models. Moreover, they are not able to assign forward or reverse direction to states or to properly integrate strand-specific (e.g., RNA expression) with non-strand-specific (e.g., ChIP) data, which is essential to characterize directed processes such as transcription.

In this thesis, new HMM-based methods are proposed to overcome these limitations. These include (i) bidirectional HMMs (bdHMMs), which integrate strand-specific with non-strand-specific data to infer directed genomic states de novo, and (ii) GenoSTAN (Genomic STate ANnotation), an HMM using discrete probability distributions to model count data, for genome annotation from next-generation sequencing data. Both approaches are made available in the R/Bioconductor package STAN (STate ANnotation), which provides an efficient implementation that can be run on large genomes such as human. STAN is used to derive new and improved annotations of transcription in yeast and human and to generate a map of promoters and enhancers in 127 human cell types and tissues.

Integration of transcription factor binding and RNA expression data in yeast recovers the majority of transcribed loci and reveals gene-specific variations in the yeast transcription cycle. It identifies 32 new transcribed loci, a regulated initiation-elongation transition, the absence of elongation factors Ctk1 and Paf1 from a class of genes, a distinct transcription mechanism for highly expressed genes, and novel DNA sequence motifs associated with transcription termination.
Moreover, promoters and enhancers are mapped in 127 human cell types and tissues by integrating sequencing data from the ENCODE and Roadmap Epigenomics projects, today’s largest compendium of chromatin assays. Promoters and enhancers are identified with consistently higher accuracy and show significantly higher enrichment of complex trait-associated genetic variants than current annotations. Investigation of the binding of 101 transcription factors (TFs) in human K562 cells reveals common and distinctive TF binding properties of enhancers and promoters.

Application of STAN to transient transcriptome sequencing (TT-Seq) data in human K562 cells recovers stable mRNAs and long intergenic non-coding RNAs, and additionally maps over 10,000 transient RNAs, including enhancer RNAs, antisense RNAs, and promoter-associated RNAs. Further analyses reveal that transient RNAs such as enhancer RNAs are short and lack U1 motifs and secondary structure.

Taken together, the annotations inferred in this thesis provide new insights into transcription and its regulation and will be an important resource for future research in genomics. STAN is a valuable tool for creating such annotations in other organisms as well and, as more data become available, for improving the existing ones.

Publications

Parts of this work have been published or are in the process of publication:

2016
Accurate promoter and enhancer identification in 127 ENCODE and Roadmap Epigenomics cell types and tissues by GenoSTAN
B. Zacher*, M. Michel, B. Schwalb, P. Cramer, A. Tresch*, J. Gagneur* (* corresponding author)
Submitted; preprint available on bioRxiv (doi: http://dx.doi.org/10.1101/041020)
Author contributions: BZ, JG and AT developed the statistical methods and computational workflow of the study. BZ developed and implemented all software and scripts and carried out all computational analyses. BS helped with preprocessing of the K562 data set. MM and PC helped with interpretation of the biological results.
BZ, JG and AT wrote the manuscript with input from all authors. All authors read and approved the final version of the manuscript.

2016
TT-Seq captures the human transient transcriptome
B. Schwalb*, M. Michel*, B. Zacher*, K. Frühauf, C. Demel, A. Tresch, J. Gagneur, P. Cramer (* joint first authorship)
Accepted for publication in Science
Author contributions: MM carried out all experiments. BS carried out all bioinformatics analysis except transcript calling, RNA classification, analysis of U1 sequence motifs, and prediction of RNA secondary structure, which were carried out by BZ. BZ, JG and AT developed the chromatin state annotation. KF designed RNA spike-in probes. CD established the spike-in normalization method. BS, JG and PC designed research. JG and PC supervised research. BS, MM, JG, and PC prepared the manuscript, with input from all authors.

2014
Annotation of genomics data using bidirectional hidden Markov models unveils variations in Pol II transcription cycle
B. Zacher, M. Lidschreiber, P. Cramer, J. Gagneur, A. Tresch
Molecular Systems Biology, 10(12):768
http://msb.embopress.org/content/10/12/768.long
Author contributions: BZ, AT and JG developed the statistical methods and computational workflow of the study. BZ developed and implemented all software and scripts and carried out all computational analyses. AT initiated the study. AT and JG supervised research. ML and PC helped with preprocessing of the data and interpretation of the biological results. BZ, AT, JG and PC wrote the manuscript. All authors read and approved the final version of the manuscript.

Contents

Acknowledgments
Summary
Publications
Contents

I Introduction
1 Transcription regulation by cis-regulatory elements
  1.1 Transcription regulation by UASs and enhancers
  1.2 The pre-initiation complex assembles at the promoter
  1.3 The mediator complex
2 The transcription cycle
  2.1 Initiation and promoter escape
  2.2 Elongation
  2.3 Termination
3 The histone code of transcription
4 High-throughput experimental assays
  4.1 Microarrays
    4.1.1 Transcriptome profiling using microarrays
    4.1.2 ChIP-chip
  4.2 Next-Generation Sequencing
    4.2.1 ChIP-Seq
    4.2.2 RNA-Seq and Dynamic Transcriptome Analysis
    4.2.3 Detecting DNase hypersensitivity sites with DNase-Seq
