Human Cardiac Cis-Regulatory Elements, Their Cognate Transcription Factors
Downloaded from genome.cshlp.org on October 7, 2021 - Published by Cold Spring Harbor Laboratory Press

Human cardiac cis-regulatory elements, their cognate transcription factors and regulatory DNA sequence variants

Dongwon Lee1,†,*, Ashish Kapoor1,‡, Alexias Safi2, Lingyun Song2, Marc K. Halushka3, Gregory E. Crawford2, Aravinda Chakravarti1,†,*

1McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, United States; 2Department of Pediatrics and Center for Genomic & Computational Biology, Duke University Medical Center, Durham, NC 27708, United States; 3Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, United States.

†Current address: Center for Human Genetics & Genomics, New York University School of Medicine, New York, NY, USA
‡Current address: University of Texas Health Science Center at Houston, Houston, TX, USA

Running title: A human cardiac cis-regulatory map

Keywords: cis-regulatory elements, transcription factors, regulatory variants, QT interval, heritability

*Send all correspondence to:

Aravinda Chakravarti, Ph.D. & Dongwon Lee, Ph.D.
Center for Human Genetics and Genomics
New York University School of Medicine
435 East 30th Street, SM Room 803/4
New York, NY 10016, USA
(T) 212-263-8023
(E) [email protected], [email protected]

Abstract
cis-regulatory elements (CRE), short DNA sequences through which transcription factors (TF) exert regulatory control on gene expression, are postulated to be the major sites of causal sequence variation underlying the genetics of complex traits and diseases.
We present integrative analyses, combining high-throughput genomic and epigenomic data with sequence-based computations, to identify the causal transcriptional components in a given tissue. We use data on adult human hearts to demonstrate that a) sequence-based predictions detect numerous, active, tissue-specific CREs missed by experimental observations, b) learned sequence features identify the cognate TFs, c) CRE variants are specifically associated with cardiac gene expression, and d) a significant fraction of the heritability of exemplar cardiac traits (QT interval, blood pressure, pulse rate) is attributable to these variants. This general systems approach can thus identify candidate causal variants and the components of gene regulatory networks (GRN) to enable understanding of the mechanisms of complex disorders on a tissue or cell-type basis.

Introduction
The majority of sequence variation affecting inter-individual complex disease risk is common and noncoding, in contrast to the rare coding variation underlying Mendelian disorders (Chakravarti and Turner 2016). These findings are consistent with the intense natural selection on coding sequence variation and the much weaker selection on noncoding variation (Asthana et al. 2007). Most complex traits arise from multi-genic effects with identified risk alleles having small effects so that no individual effect is either necessary or sufficient (Chakravarti and Turner 2016). Thus, there must be considerable redundancy in non-coding functions. The best candidates for these non-coding functions are cis-regulatory elements (CREs) of gene expression because they are the primary agents of regulatory control, and affect disease risk through altered gene expression (Davidson 2010; Maurano et al. 2012; Chatterjee et al.
2016; Phillips-Cremins and Corces 2013).

Understanding the regulatory genomic architecture is, therefore, important for elucidating the etiologies of complex disorders, particularly at the tissue- and cell-type levels. This involves identifying the numbers of TFs and enhancers engaged, their genomic distribution and the feedback mechanisms required for expression control of each gene, namely, the components of its gene regulatory network (GRN) (Davidson 2010). Additionally, we need to quantify the effect of sequence variation within the GRN on gene expression. These questions can now be answered using high-throughput ChIP-seq (Barski et al. 2007; Johnson et al. 2007; Robertson et al. 2007), DNase-seq (Boyle et al. 2008; John et al. 2011), and ATAC-seq (Buenrostro et al. 2013) experimental data from the ENCODE (The ENCODE Project Consortium 2012) and Roadmap epigenomics projects (Roadmap Epigenomics Consortium et al. 2015), on many cell types, tissues, developmental times and states, in conjunction with human sequence variation (1KGP) (The 1000 Genomes Project Consortium 2015) and its effects on gene expression across human tissues (GTEx) (The GTEx Consortium 2015). The fulcrum of a GRN is its enhancers (CREs); however, their complete experimental detection is not possible because CRE activity is affected by many factors (Misteli 2001; Degner et al. 2012; He et al. 2014): (1) functional strength is variable across enhancers, (2) activity is stochastic across cells, (3) detected activity varies by experimental protocol and sample, and (4) resolution of a CRE’s genomic location from current data is approximate. In principle, ChIP-seq data are superior but the general lack of TF-specific antibodies is a major roadblock to this approach.
We instead address these major impediments by using a generalizable genome sequence-based machine learning approach (Lee et al. 2011; Ghandi et al. 2014; Lee et al. 2015). This method learns the short genome sequence features that maximally discriminate sequences underlying CREs from random genome sequences. The learned models are then used to identify additional CREs with similar sequence features, including combinations of transcription factor binding sites, that escaped detection through experiments.

We focus here on identifying the GRN ‘parts list’ of the adult human heart, and the effects of CRE genetic variation on an exemplary cardiac trait, the electrocardiographic QT interval (QTi), causal to sudden cardiac death and other conduction disorders (Tomaselli et al. 1994). We show the generalizability of our results by using these cardiac enhancers to explain phenotypic variation in systolic/diastolic blood pressure and pulse rate but not BMI. We focus on the heart because of the extensive genome-wide association study (GWAS) literature implicating CRE variation in many cardiovascular diseases and phenotypes (Arking et al. 2014; Kapoor et al. 2014; the CARDIoGRAMplusC4D Consortium 2015; Eppinga et al. 2016), but also because the QTi is likely affected by the cardiac system alone. A number of human cardiac CRE maps exist (May et al. 2012; Narlikar et al. 2010) with a recent study identifying distal heart enhancers based on H3K27ac and EP300/CREBBP profiles (Dickel et al. 2016). At least one study has also used DNase-seq and histone modification ChIP-seq data for heart tissues to identify functional regulatory variants associated with electrocardiogram traits (Wang et al. 2016). Our study improves on these maps by comprehensive identification of CREs.
In addition, we identify their corresponding TFs and analyze the effects of genetic variants within these heart CREs for heart-related traits.

Results
Experimental detection of cardiac CREs is incomplete
A schematic overview of our approach is shown in Fig. S1. We first performed DNase-seq to identify open chromatin regions through DNase I hypersensitive sites (DHSs) as potential CREs in two adult human hearts (left ventricles), and analyzed these together with three publicly available heart DNase-seq data sets (Fig. 1A). Peak calling with MACS2 (Zhang et al. 2008) identified varying numbers of DHSs (50,000 to 110,000) across the five samples (Methods) (Supplemental Fig. S2A). We defined DHSs as 600bp regions centered at the identified summits. DHSs frequently cluster in small regions, and 22-32% of extended DHSs overlap their neighbors forming larger DHSs with multiple summits (Supplemental Fig. S2B). We next compared heart DHSs with other tissue DHSs using the top 50,000 DHSs from each sample and the Jaccard index (number of bases in the intersection over the union for each pair). These comparisons show higher similarity (~50%) between the adult cardiac samples than with other tissues (~30%), including fetal heart (Methods) (Supplemental Fig. S2C). Thus, DHS maps do uncover tissue-relevant CREs although many regions are open across many tissues. Nevertheless, technical and biological variation affect the adult cardiac DHS maps. To maximally detect CREs, we aggregated all DHSs across the five adult heart replicates and identified ~160,000 distinct regions (“observed DHSs”) covering ~4% of the genome. Using multiple samples is important because ~40% of DHSs were detected only once across the replicates while only ~22% were detected across all five (Fig. 1B).
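
The interval arithmetic described above (600bp windows centered on summits, merging of overlapping windows, and a base-level Jaccard index) can be sketched roughly as follows. This is a minimal illustration and not the pipeline used in the study: the function names are hypothetical, coordinates are assumed to be half-open and on a single chromosome, and a real analysis would handle multiple chromosomes and chromosome ends.

```python
from typing import List, Tuple

# Half-open interval [start, end) on one chromosome (an assumption of this sketch).
Interval = Tuple[int, int]


def summit_windows(summits: List[int], width: int = 600) -> List[Interval]:
    """Return a fixed-width window centered on each summit (clipped at 0)."""
    half = width // 2
    return [(max(0, s - half), s + half) for s in summits]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals into larger regions with multiple summits."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals: List[Interval]) -> int:
    """Count distinct bases covered by a set of intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def jaccard(a: List[Interval], b: List[Interval]) -> float:
    """Bases in the intersection divided by bases in the union."""
    union = covered_bases(a + b)
    if union == 0:
        return 0.0
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
    intersection = covered_bases(a) + covered_bases(b) - union
    return intersection / union
```

For instance, `jaccard(summit_windows(sample_a), summit_windows(sample_b))` would give a pairwise similarity for two hypothetical summit lists `sample_a` and `sample_b`; dedicated interval tools are the more practical choice at genome scale.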

Sequence-based models can predict additional CREs
To elucidate the sequence code for these cardiac CREs, we used the gapped k-mer support vector machines (SVM) method, gkm-SVM (Ghandi et al. 2014), widely accepted as one of the state-of-the-art sequence-based methods (Kreimer et al. 2017). Also, extracting predictive sequence features from gkm-SVM is easier than from other methods.
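
To give a feel for what a gapped k-mer feature is, the sketch below enumerates them explicitly for a short sequence and compares two sequences by the cosine similarity of their count vectors. It is only an illustration under stated assumptions: the function names are hypothetical, the word length `l=4` and number of informative positions `k=3` are toy values chosen for readability, and the gkm-SVM software itself uses its own kernel computation and training procedure rather than this brute-force enumeration.

```python
import math
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq: str, l: int = 4, k: int = 3) -> Counter:
    """Count gapped k-mers: words of length l with k informative positions.

    The remaining l - k positions are treated as gaps and written as '-'.
    """
    seq = seq.upper()
    masks = [set(m) for m in combinations(range(l), k)]
    counts: Counter = Counter()
    for i in range(len(seq) - l + 1):
        word = seq[i:i + l]
        if any(base not in "ACGT" for base in word):
            continue  # skip words containing ambiguous bases such as N
        for keep in masks:
            feature = "".join(word[j] if j in keep else "-" for j in range(l))
            counts[feature] += 1
    return counts


def gkm_similarity(a: str, b: str, l: int = 4, k: int = 3) -> float:
    """Cosine similarity between the gapped k-mer count vectors of a and b."""
    ca = gapped_kmer_counts(a, l, k)
    cb = gapped_kmer_counts(b, l, k)
    dot = sum(value * cb[feature] for feature, value in ca.items())
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0
```

Because each feature tolerates any base at its gap positions, two sequences that differ at a few positions still share many features, which is the intuition behind using gapped rather than contiguous k-mers.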