bioRxiv preprint doi: https://doi.org/10.1101/2020.11.08.373555; this version posted November 9, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Mechanistic analysis of enhancer sequences in the Estrogen Receptor transcriptional program

Shayan Tabe-Bordbar
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, United States of America. Email: [email protected]

You Jin Song
Department of Cell and Developmental Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States of America. Email: [email protected]

Bryan J. Lunt
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, United States of America. Email: [email protected]

Kannanganattu V. Prasanth
Department of Cell and Developmental Biology, Cancer Center of Illinois, University of Illinois at Urbana-Champaign, Urbana, IL, United States of America. Email: [email protected]

Saurabh Sinha
Department of Computer Science, Carl R. Woese Institute for Genomic Biology, Cancer Center of Illinois, University of Illinois at Urbana-Champaign, Urbana, IL, United States of America. Email: [email protected]

Corresponding Author:
Saurabh Sinha
2122 Siebel Center, 201 N. Goodwin Ave, Urbana, IL 61801. USA.
Phone: 217-333-3233
Email: [email protected]

Abstract
Background: Estrogen Receptor α (ERα) is a major lineage-determining transcription factor (TF) in mammary gland development, orchestrating the expression of thousands of genes.
Dysregulation of the ERα-mediated transcriptional program results in abnormal cell proliferation and cancer. Transcriptomic and epigenomic profiling of breast cancer cell lines has revealed large numbers of enhancers involved in this regulatory program, but how these enhancers encode function in their sequence remains poorly understood.

Results: A subset of ERα-bound enhancers are transcribed into short bidirectional RNA (enhancer RNA or eRNA), and this property is believed to be a reliable marker of active enhancers. We therefore analyze thousands of ERα-bound enhancers and build quantitative, mechanism-aware models to discriminate eRNA-producing enhancers from non-transcribing enhancers based on their sequence. Our thermodynamics-based models provide insights into the roles of specific TFs in the ERα-mediated transcriptional program, many of which are supported by the literature. We use in silico perturbations to predict TF-enhancer regulatory relationships and integrate these findings with experimentally determined enhancer-promoter interactions to construct a gene regulatory network. We also demonstrate that the model can prioritize breast cancer-related sequence variants while providing mechanistic explanations for their function. Finally, we experimentally validate the model-proposed mechanisms underlying three such variants.

Conclusions: We modeled the sequence-to-expression relationship in ERα-driven enhancers and gained mechanistic insights into the workings of a major transcriptional program. Our model is consistent with the current body of knowledge and its predictions are confirmed by experimental observations. We believe this to be a promising approach to the analysis of regulatory sequences and variants.

Keywords
enhancers, gene regulatory network, eRNA, sequence-to-expression modeling, variant impact prediction, thermodynamics-based modeling, breast cancer

Background
Breast cancer is the second most common cancer in women in the United States, with about 12.4% of women diagnosed with breast cancer during their lifetime (1). The most common breast cancer subtype is characterized by increased activity of Estrogen Receptor α (ERα), a protein that is activated by estrogen and in turn changes the expression of hundreds of genes. Acting as a transcription factor (TF), it binds to thousands of DNA locations and influences the expression levels of nearby genes (2). This cascade of events is called the “ERα transcriptional program”, and aberrations in this program lead to increased cell proliferation and cancer. Several drugs have been developed to treat the ER+ subtype of breast cancer by reversing aberrations in the ERα program or interfering with its cancer-causing effects. However, about 50% of treated patients either do not respond to these drugs or develop acquired resistance against them (3). As such, there is great interest in characterizing the major principles and crucial details of the ERα transcriptional program.

The regulatory information controlling gene expression programs is known to be stored in part in relatively short stretches of DNA (between 1 and 2 kbp) called enhancers (4). Enhancers typically harbor several binding sites for one or more transcription factors (TFs). Each binding site is an approximately 10 base long DNA sequence with relatively high binding affinity for its corresponding TF. Active enhancers, marked by specific chromatin marks, regulate expression of genes in their spatial proximity through binding to a specific set of TFs.
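To make the notion of a short, high-affinity binding site concrete, the following is a minimal sketch of scanning a sequence with a position frequency matrix and a log-odds score. It is a generic illustration and is not the model used in this study. The matrix values, the 4-bp motif width (real motifs are closer to 10 bp), the uniform background frequency, the score threshold, and the function names `log_odds_score` and `scan` are all assumptions chosen for the example.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif with consensus AGCT.
# Values are illustrative only and do not describe any real TF.
PFM = {
    "A": [0.7, 0.1, 0.1, 0.1],
    "C": [0.1, 0.1, 0.7, 0.1],
    "G": [0.1, 0.7, 0.1, 0.1],
    "T": [0.1, 0.1, 0.1, 0.7],
}
BACKGROUND = 0.25  # assumed uniform background base frequency


def log_odds_score(site):
    """Sum over positions of log2(motif frequency / background frequency)."""
    return sum(math.log2(PFM[base][i] / BACKGROUND) for i, base in enumerate(site))


def scan(sequence, threshold=5.0):
    """Return (offset, site, score) for every window scoring above threshold."""
    width = len(PFM["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = log_odds_score(window)
        if score > threshold:
            hits.append((start, window, score))
    return hits


# Scan a toy sequence (uppercase A/C/G/T only) for candidate binding sites.
hits = scan("TTAGCTAAGCTT")
for offset, site, score in hits:
    print(offset, site, round(score, 2))
```

Windows that match the motif at every position receive the highest score, which serves here as a crude proxy for binding affinity. Lowering `threshold` would also admit weaker, partially matching sites.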
Notably, the spatial proximity of a gene-enhancer pair depends on the chromatin conformation and is not necessarily captured by their genomic distance. In other words, genes can be regulated by enhancers located virtually anywhere on the genome. ERα is known to mainly regulate expression of genes through binding to distant enhancers, as only about 5% of its binding sites are located within a distance of 5 kb from the Transcription Start Site (TSS) of any gene (5).

Over the past decade, biotechnological advancements such as RNA-sequencing (RNA-Seq) and chromatin immunoprecipitation coupled with sequencing (ChIP-Seq) have dramatically expanded our knowledge about the ERα transcriptional program. RNA-Seq assays provide data on expression levels of all genes in different experimental conditions, while ChIP-Seq experiments reveal all the locations in the genome where ERα is bound in those conditions. Due to its biological importance and potential impact, the ERα transcriptional program has been studied not only through mainstream experimental approaches such as RNA-Seq, ChIP-Seq, and DNase-Seq (6,7), but also through more advanced and expensive experimental techniques such as Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET) (8,9), Global run-on sequencing (GRO-Seq) (10–13), and high-resolution chromatin conformation capture (Hi-C) (14), resulting in an unprecedented wealth of diverse data describing this system.

Here, we aimed to comprehensively characterize gene regulatory relationships in this system through modeling of gene expression data, taking advantage of several of the above-mentioned multi-omics data types.
A common approach to reconstruction of gene regulatory networks (GRN) is to train a quantitative model that correctly predicts gene expression levels, and then interpret the parameters of the model to identify regulatory relationships. There exist two major variations of the gene expression prediction problem. The “sequence-to-expression” formulation relies on the premise that DNA sequences (enhancers) associated with a gene bear the “footprints” of the TFs that ostensibly regulate the gene, since a TF needs to physically bind to enhancers in order to regulate the gene (Additional File 1: Figure S1). In the alternative, less constrained “expression-to-expression” formulation, one attempts to relate a gene’s expression to the expression of its potential regulators, without taking DNA sequence information into account. In the current study, we considered the former formulation of the expression prediction problem in order to study the ERα transcriptional program, reconstruct the underlying GRN, and decipher the sequence-level encoding of the network.

Sequence-to-expression modeling in mammalian systems poses two key challenges: enhancer-gene assignment and enhancer-readout prediction. In enhancer-gene assignment, one must identify the genomic location of enhancers regulating a gene’s expression and their relative contributions in a particular condition. On the other hand, enhancer-readout prediction amounts to quantifying the expression driven by a particular