Stochastic Modeling of RNA Polymerase Predicts Transcription Factor Activity
Stochastic modeling of RNA polymerase predicts transcription factor activity

by Joseph Gaspare Azofeifa
B.A., Vassar College

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Department of Computer Science
2017

This thesis entitled: Stochastic modeling of RNA polymerase predicts transcription factor activity, written by Joseph Gaspare Azofeifa, has been approved for the Department of Computer Science.

Prof. Robin Dowell
Prof. Aaron Clauset
Prof. Elizabeth Bradley

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Azofeifa, Joseph Gaspare (Ph.D., Computer Science)
Stochastic modeling of RNA polymerase predicts transcription factor activity
Thesis directed by Prof. Robin Dowell

Seventy-six percent of disease-associated variants occur in non-genic sites of open chromatin, suggesting that the regulation of gene expression plays a crucial role in human health. Nucleosome-free with flanking chromatin modifications, these regulatory loci are optimal platforms for transcription binding and, in fact, recruit RNA Polymerase. The subsequent transcription of these sites is an unintuitive discovery, as these regulatory loci do not harbor an open reading frame. The role these enhancer RNAs (eRNA) play in downstream gene regulation remains an open and exciting question. However, fast RNA degradation rates challenge eRNA identification, requiring non-traditional sequencing technologies. Global Run-on followed by sequencing (GRO-seq) detects non-genic transcription and thus, in theory, eRNA presence. Yet GRO-seq is not without noise and bias; predictive modeling of both the sequencing error and the stochastic nature of RNA polymerase itself is required to discover enhancer RNA transcripts.
In short, this thesis asks: what regulates eRNA transcription? To answer this question, I first develop two novel probabilistic models to determine eRNA location without bias. A regression method was constructed to quickly identify all transcribed regions in GRO-seq. Based on the known enzymatic stages of RNA polymerase, a subsequent latent variable model was built to infer the precise location of eRNA initiation. With the relevant technology developed, I undertake a massive data integration project and show strong contextual relationships between TF-binding events, epigenetics and eRNA transcription. I conclude by showing that enhancer RNAs can quantify transcription factor activity without bias and predict cell type.

Dedication

To my family.

Acknowledgements

First, I would like to thank my PhD committee, Dr. Aaron Clauset, Dr. Elizabeth Bradley, Dr. Michael Mozer, Dr. Katerina Kechris and my thesis advisor, Dr. Robin Dowell. Apart from shaping many of the ideas we ultimately published, Robin also trusted my work ethic and independence, which gave me the freedom I needed to thrive in graduate school. I would like to thank Dr. Manuel Lladser for some of the preliminary work on the mixture model and for giving me the confidence to pursue more rigorous mathematics. Dr. Tim Read's endless phone calls helped me greatly in developing the MD-score and correctly framing the story of eRNAs. I am deeply indebted to Josephina Hendrix, who performed all the short read alignments (> 700 datasets) for the MD-score publication. Finally, I would like to formally thank Dr. Mary Allen. Mary and I sat next to each other for 5 years and I can say, with little doubt, that I learned all my molecular biology through her.
Apart from acknowledgments related directly to my thesis, I would like to thank the IQ Biology program and the BioFrontiers Institute, not only for funding support but also for providing a PhD training program that truly fostered an atmosphere of creativity, collaboration and interdisciplinary research. Specifically, I would like to thank Dr. Thomas Cech, Dr. Jana Watson-Capps, Amber McDonell, Kim Little, Kim Kelly and, most of all, Dr. Andrea Stith. I would like to thank Sam Way and Ryan Langendorf, whom I feel honored to call friends; I hope to continue our lively scientific debates in the future. Finally, I would like to thank my family, Andie and Katherine Azofeifa, without whose support none of this would have been possible. I am not sure my mother realized she'd raise two scientists, but I am extremely grateful for such a strong and loving family.

Contents

Chapter

1 Introduction
  1.1 Biological Setup
  1.2 Regulatory Element Identification
    1.2.1 ChIP-seq
    1.2.2 Chromatin State and Epigenetics
    1.2.3 Regulatory Element Identification
    1.2.4 GRO-seq and eRNAs
  1.3 Thesis Outline
    1.3.1 Overview
    1.3.2 Chapter 2: Transcribed Region Annotation
    1.3.3 Chapter 3: Stochastic models of RNA Polymerase
    1.3.4 Chapter 4: eRNA Profiles Predict Transcription Factor Activity

2 Transcribed Region Annotation
  2.1 Introduction
  2.2 Nascent Transcript Model
    2.2.1 Description
    2.2.2 Parameter Estimation
    2.2.3 Software Design
  2.3 Model Accuracy
    2.3.1 Datasets
    2.3.2 Sensitivity to depth of data
    2.3.3 Benchmarking FStitch & Vespucci
  2.4 Biological Analysis
    2.4.1 Annotation Comparisons
    2.4.2 Characterizing bidirectional RNA Activity
    2.4.3 Differential transcription at annotated genes: a comparison of FStitch to Allen et al.
    2.4.4 Differential transcription using all FStitch active calls
  2.5 Conclusions

3 Stochastic models of RNA Polymerase
  3.1 Introduction
  3.2 Modeling RNA Polymerase
    3.2.1 Double Geometric Distribution
    3.2.2 Exponentially Modified Gaussian
    3.2.3 Poisson Point Process
    3.2.4 Mixture Models
    3.2.5 Bayesian Extensions
  3.3 Applications to GRO-seq
    3.3.1 Numerical confirmation of model inference by simulation
    3.3.2 Predicting enzymatic changes of RNAP following Experimental Perturbation
    3.3.3 RNAP model accurately predicts marks of regulatory elements
    3.3.4 Three dimensionally paired loci display centrality and associativity based on bidirectional transcription

4 eRNA Profiles Predict Transcription Factor Activity
  4.1 Introduction
  4.2 Enhancer RNAs originate from transcription factor binding sites
  4.3 Enhancer RNA origins mark sites of regulatory TF binding
  4.4 eRNA origins co-localize with TF-binding motifs
  4.5 Motif displacement scores quantify TF activity
  4.6 MD-scores predict TF activity across cell types
  4.7 Conclusion

5 Looking Forward
  5.1 Mixture Models
    5.1.1 Model Selection
    5.1.2 Integration of other Data Types
  5.2 TF Activity Inference Models
  5.3 Predicting Enhancer to Gene Interactions
    5.3.1 Network Structure Prediction
    5.3.2 Correlation Networks
    5.3.3 Bayesian Networks
    5.3.4 A 3D Genome
  5.4 Thesis Conclusions

Bibliography

Appendix

A Supplementary Material to Chapter 2

B Supplementary Material to Chapter 3
  B.1 Seeding the EM
  B.2 Datasets
  B.3 CTCF ChIA-PET network construction
  B.4 Software Package: Tfit
  B.5 Numerical confirmation of model inference by simulation

C Supplementary Material to Chapter 4
  C.1 eRNA origins
  C.2 Genomic Feature Data Integration
  C.3 Nascent transcription data processing
  C.4 Tfit parameters and bidirectional prediction
    C.4.1 Template Matching
    C.4.2 EM Algorithm and Bidirectional Origin estimation
    C.4.3 Footprint Estimation
  C.5 Computation of Bimodality, ∆BIC
  C.6 Motif Curation and Motif Scanning
  C.7 MD-score Hypothesis Testing
    C.7.1 The Motif Displacement score
    C.7.2 MD-score significance under stationary model
    C.7.3 MD-score significance under a non-stationary background model
    C.7.4 MD-score significance between experiments
  C.8 Cell type and TF enrichment analysis
  C.9 Associated File Types
  C.10 IPython Notebook

Chapter 1

Introduction

1.1 Biological Setup

A central goal in genetics is to understand how genotype (the unique ordering of DNA) translates to phenotype (observable qualities like height or eye color). Although some phenotypic traits are innocuous, one's genotype may influence cancer susceptibility, a predisposition to alcoholism or cognitive disabilities [47, 21, 62]. For advancements in human medicine, as well as a fundamental understanding of biology, genetics remains an exciting and active area of research.

A long way from Mendel's pea plants, whole genome sequencing makes possible the complete identification of an organism's genotype. Having resolved the human genome's nearly 3.2 billion nucleotides, we now know that alterations in the gene sequence of p53, kvlqt1 and adam19 correlate with incidences of cancer, Type II diabetes and heart disease, respectively [14, 173, 183]. Although genome-wide association studies (GWAS) [25] successfully link genotypic variants to phenotype, they require hundreds or even thousands of genomic samples to achieve significant correlations [88]. Furthermore, GWAS cannot predict the phenotypic consequences of a novel genetic variant that has not yet been associated with a specific phenotype.

In contrast, the study of gene expression (the biochemical or molecular process by which a genotype renders a phenotype) promises to uncover why certain genotypes result in specific phenotypes. To summarize briefly, gene expression begins with the enzyme.