Integration of 198 ChIP-seq datasets reveals human cis-regulatory regions

Hamid Bolouri & Larry Ruzzo

Thanks to Stephen Tapscott, Steven Henikoff & Zizhen Yao

Slides will be available from: http://labs.fhcrc.org/bolouri/

Email [email protected] for manuscript (Bolouri & Ruzzo, submitted)

Kleinjan & van Heyningen, Am. J. Hum. Genet., 2005, (76)8–32

Epstein, Briefings in Func. Genom. & Proteomics, 2009, 8(4)310-16

Regulation of SPi1 (Sfpi1, PU.1) expression – part 1

miR155*, miR569#

~750nt promoter

~250nt promoter

The antisense RNA
• causes translational stalling
• has its own promoter
• requires distal SPI1 enhancer
• is transcribed with/without SPI1.

# Hikami et al, Arthritis & Rheumatism, 2011, 63(3):755–763
* Vigorito et al, 2007, Immunity 27, 847–859
Ebralidze et al, Genes & Development, 2008, 22: 2085-2092.

Regulation of SPi1 expression – part 2 (mouse coordinates)

Bidirectional ncRNA proportional to PU.1 expression

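[Figure: map of SPi1 regulatory elements at -18Kb, -14Kb, -12Kb, -10Kb and -9Kb; element sizes labeled 500b (six elements) and 750b (one element). Binding-site labels, in slide order: PU.1/ELF1/FLI1/GLI1, GATA1, GATA1, Sox4/TCF/LEF, PU.1, RUNX1, SP1, RUNX1, RUNX1, SP1, ELF1, NF-kB, SATB1, IKAROS, PU.1, cJun/CEBP, OCT1, cJun/CEBP]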

Chou et al, Blood, 2009, 114: 983-994
Hoogenkamp et al, Molecular & Cellular Biology, 2007, 27(21):7425-7438
Zarnegar & Rothenberg, 2010, Mol. & cell Biol. 4922-4939

An NF-kB binding-site variant in the SPI1 URE reduces PU.1 expression & is correlated with AML

Sequences shown: GGGCCTCCCC, GGGTCTTCCC

Bonadies et al, , 2009, 29(7):1062-72.

SATB1 binding site

A distal single nucleotide polymorphism alters long-range regulation of the PU.1 gene in acute myeloid leukemia

Steidl et al, J Clin Invest. 2007, 117(9):2611-20.

ChIP-seq of 13 sequence-specific TFs

Nanog, Oct4, STAT3, Smad1, Sox2, Zfx, c-Myc, n-Myc, Klf4, Esrrb, Tcfcp2l1, E2f1, and CTCF

[Figure: number of loci by number of TFs bound within 100bp of nearest neighbor]

Chen et al, Cell, 2008;133(6):1106-17

Ernst et al, Nature. 2011;473(7345):43-9

Cells: H1hesc, K562, HepG2, HUVEC, HSMM , HMEC, NHLF, NHEK + 1 HapMap B-lymphoblastoid

Histone marks classified as: 1_Active_Promoter , 2_Weak_Promoter , 3_Poised_Promoter , 4_Strong_Enhancer , 5_Strong_Enhancer , 6_Weak_Enhancer , 7_Weak_Enhancer (total footprint of all histone marked regions = 627,972,582 bps , ~ 20.9% of the genome)
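The footprint figure above is the number of distinct bases covered once overlapping regions are merged, expressed as a share of the genome. The following is a minimal Python sketch of that arithmetic, not the authors' code. It assumes half-open (start, end) regions on a single chromosome, hypothetical toy coordinates, and a genome length of roughly 3.0e9 bp (an assumption; the slide does not state the denominator).

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the current merged block if this interval reaches further.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def footprint(intervals):
    """Total number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Toy regions with hypothetical coordinates.
toy = [(100, 200), (150, 300), (500, 600)]
print(footprint(toy))  # 300

# Slide arithmetic, under the ASSUMED genome length of about 3.0e9 bp.
ASSUMED_GENOME_BP = 3.0e9
print(round(100 * 627_972_582 / ASSUMED_GENOME_BP, 1))  # 20.9
```

Merging first matters because regions marked in several cell types overlap; summing raw widths would count shared bases more than once.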

(HeLaS3, HUVEC, K562, NHEK, H1hesc + 7 HapMap B-lymphoblastoid cell lines)

958,250 / 1,067,220 = 89.8% of DNase1 selected regions overlap histone marked regions

(total footprint of DNase1-selected-regions = 22,388,756 bps, ~0.75% of the genome)

198 ChIPseq experiments analyzed in this study
colored text indicates datasets expected to overlap
red border: immortalized B-lymphocyte from 2nd HapMap individual

File identifiers, grouped by cell-line prefix:

A549: A549GrPcr1xDexb, A549GrPcr1xDexc, A549GrPcr1xDexd, A549GrPcr2xDexa, A549Usf1Pcr1xDex100nm, A549Usf1Pcr1xEtoh02

Ecc1: Ecc1EralphaaV0416102Estradia1h, Ecc1EralphaaV0416102Gen1h, Ecc1Foxa1sc6553V0416102Dmso2, Ecc1GrV0416102Dexa

Gm12878: Gm12878Atf3Pcr1x, Gm12878BatfPcr1x, Gm12878Bcl11aPcr1x, Gm12878Cmyc, Gm12878Ebfsc137065Pcr1x, Gm12878Egr1V0416101, Gm12878Elf1sc631V0416101, Gm12878Ets1Pcr1x, Gm12878GabpPcr2x, Gm12878Irf3Std, Gm12878Irf4sc6059Pcr1x, Gm12878Mef2aPcr1x, Gm12878Mef2csc13268V0416101, Gm12878Nrf1Iggmus, Gm12878NrsfPcr2x, Gm12878P300Pcr1x, Gm12878Pax5c20Pcr1x, Gm12878Pax5n19Pcr1x, Gm12878Pbx3Pcr1x, Gm12878Pou2f2Pcr1x, Gm12878Pu1Pcr1x, Gm12878RxraPcr1x, Gm12878Six5Pcr1x, Gm12878Sp1Pcr1x, Gm12878SrfPcr2x, Gm12878SrfV0416101, Gm12878Stat1Std, Gm12878Stat3Iggmus, Gm12878Tcf12Pcr1x, Gm12878Usf1Pcr2x, Gm12878Usf2Iggmus, Gm12878Zbtb33Pcr1x, Gm12878Zeb1sc25388V0416102, Gm12878Znf143166181apStd, Gm12878Znf274Ucd

Gm12891: Gm12891Pax5c20V0416101, Gm12891Pou2f2Pcr1x, Gm12891Pu1Pcr1x, Gm12891Yy1sc281V0416101

Gm12892: Gm12892Pax5c20V0416101, Gm12892Yy1V0416101

H1hesc: H1hescAtf3V0416102, H1hescBcl11aPcr1x, H1hescCjunIggrab, H1hescCmyc, H1hescEgr1V0416102, H1hescGabpPcr1x, H1hescJundV0416102, H1hescMaxUcd, H1hescNanogsc33759V0416102, H1hescNrf1Iggrab, H1hescNrsfV0416102, H1hescP300V0416102, H1hescPou5f1sc9081V0416102, H1hescRxraV0416102, H1hescSix5Pcr1x, H1hescSp1Pcr1x, H1hescSrfPcr1x, H1hescTcf12Pcr1x, H1hescUsf1Pcr1x, H1hescUsf2Iggrab, H1hescYy1sc281V0416102

Hepg2: Hepg2Atf3V0416101, Hepg2Bhlhe40V0416101, Hepg2bZnf274Ucd, Hepg2CebpbIggrab, Hepg2CjunIggrab, Hepg2Cmyc, Hepg2Elf1sc631V0416101, Hepg2Fosl2V0416101, Hepg2Foxa1sc101058V0416101, Hepg2Foxa1sc6553V0416101, Hepg2Foxa2sc6554V0416101, Hepg2GabpPcr2x, Hepg2Hey1V0416101, Hepg2Hnf4asc8987V0416101, Hepg2Hnf4gsc6558V0416101, Hepg2Irf3Iggrab, Hepg2JundIggrab, Hepg2JundPcr1x, Hepg2Maffm8194Iggrab, Hepg2Mafkab50322Iggrab, Hepg2Mafksc477Iggrab, Hepg2Nrf1Iggrab, Hepg2NrsfPcr2x, Hepg2P300V0416101, Hepg2RxraPcr1x, Hepg2Sp1Pcr1x, Hepg2SrfV0416101, Hepg2Tcf12Pcr1x, Hepg2Tcf4Ucd, Hepg2Usf1Pcr1x, Hepg2Usf2Iggrab, Hepg2Zbtb33Pcr1x, Hepg2Zbtb33V0416101

K562: K562Bcl3Pcr1x, K562CebpbIggrab, K562CmycIfng30Std, K562Cmyc, K562E2f6sc22823V0416102, K562Efos, K562Egata2, K562Egr1V0416101, K562Ejunb, K562Ejund, K562Elf1sc631V0416102, K562Enr4a1, K562Ets1V0416101, K562Fosl1sc183V0416101, K562GabpV0416101, K562Gata2sc267Pcr1x, K562Hey1Pcr1x, K562Irf1Ifna30Std, K562Irf1Ifng6hStd, K562Mafkab50322Iggrab, K562MaxV0416102, K562Nrf1Iggrab, K562NrsfV0416102, K562Pu1Pcr1x, K562Six5Pcr1x, K562Sp1Pcr1x, K562Sp2sc643V0416102, K562SrfV0416101, K562Tal1sc12984Iggmus, K562Usf1V0416101, K562Usf2Std, K562Yy1V0416101, K562Yy1V0416102, K562Zbtb33Pcr1x, K562Zbtb7asc34508V0416101

Helas3: Helas3Cebpb, Helas3Cmyc, Helas3Elk4, Helas3GabpPcr1x, Helas3Irf3Iggrab, Helas3NrsfPcr1x, Helas3Stat3Iggrab, Helas3Usf2Iggmus, Helas3Zzz3Std

Other cell lines: Nb4CmycStd, Nb4MaxStd, Htb11NrsfPcr2x, HuvecCfosUcd, HuvecCmyc, HuvecGata2Ucd, SknshraP300V0416102, SknshraUsf1sc8983V0416102, SknshraYy1sc281V0416102, Mcf10aesStat3Etoh01bStd, Mcf10aesStat3Etoh01cStd, Mcf10aesStat3Etoh01Std, Mcf10aesStat3TamStd, T47dEralphaaPcr2xGen1h, T47dEralphaaV0416102Estradia1h, T47dFoxa1sc6553V0416102Dmso2, T47dGata3sc268V0416102Dmso2, T47dP300V0416102Dmso2, Mcf7CmycVeh, Nt2d1Znf274Ucd, Panc1NrsfPcr2x, Trexhek293Znf263Ucd, PbdeGata1Ucd, Hek293bElk4Ucd, U87NrsfPcr2x, Pfsk1NrsfPcr2x, Shsy5yGata2Ucd

not-ENCODE: AR_Biaoyang, ArRD, betaCatenin, CBPJurkat, ELF1RD, ERGRD, Ets1Jurkat, EwsErgRD, Fli1RD, KLF4, Nanog, nfkbRD, Oct4Pou5F1, P63goodRD, PU1RD, Runx1Jurkat, selectedStat1ChIPseqPeaks, Sox2, Sox2cell2, SpdefRD, VdrRD, ZNF263

Pre-processing:
• remove peaks wider than 4Kbp
• remove peaks with outlier score

Numbers of experiments per cell-type. [Row 11 combines related cell-lines (SK-N-SH & SK-N-MC).]

Cell type    No. Expts.  No. TFs.  Source
GMxxxxx      43          37        HapMap lymphoblastoid
K562         36          33        Chronic Myelogenous Leukemia
HEPG2        33          28        liver carcinoma
H1HESC       25          25        embryonic stem cells
HELAS3       10          10        cervical carcinoma
A549         6           2         alveolar basal epithelium
T47D         5           5         mammary ductal carcinoma
MCF10        4           1         mammary epithelium
ECC1         4           3         endometrial adenocarcinoma
SK-N-SH/MC   4           3         neuroblastoma
JURKAT       4           4         T-cell Leukemia
HUVEC        3           3         umbilical vein endothelium
VCAP         3           3         prostate
HEK293       2           2         embryonic kidney
NB4          2           2         Acute Promyelocytic Leukemia
HTB11        1           1         neuroblastoma
MCF7         1           1         invasive breast ductal carcinoma
NT2D1        1           1         embryonal testicular carcinoma
U87          1           1         glioblastoma
SHSY5Y       1           1         neuroblastoma
PANC1        1           1         pancreatic carcinoma
PBDE         1           1         peripheral blood derived erythroblasts
PFSK1        1           1         cerebral brain tumor
HKC          1           1         primary keratinocytes
PC3          1           1         prostate carcinoma
HCT116       1           1         colorectal carcinoma
LN229        1           1         glioblastoma
HL60         1           1         promyelocytic leukemia
CADO-ES1     1           1         Ewing's Sarcoma

Annotation on slide: 51 experiments in 24 cell types

Observed ChIP-seq peak overlaps are unlikely by chance.

13 experiments used to test if the observed predictability is due to datasets for TFs that bind similar motifs.

TF/antibody           cell type    lab
Cebpb                 Hepg2        Sydh
Cfos*                 Huvec        Sydh
Cmyc                  Nb4          Sydh
Foxa1sc6553V0416101   Hepg2        Haib
Gata2                 K562E        Uchicago
Junb*                 K562E        Uchicago
Maffm8194             Hepg2        Sydh
Mafkab50322           Hepg2        Sydh
P300V0416102          Sknshra      Haib
Stat3Etoh01           Mcf10aes     Sydh
Tal1sc12984           K562         Sydh
Znf143166181ap        Gm12878      Sydh
Znf263                Trexhek293   Sydh

* JunB has 10-fold lower binding affinity for cJun DNA binding sites.

“JunB differs from c-Jun in its DNA-binding and dimerization domains, and represses c-Jun by formation of inactive heterodimers.”

T Deng and M Karin, Genes & Dev. 1993. 7: 479-490.

Observed ChIP-seq peak overlaps are unlikely by chance.

30 files have >30K peaks, 97 files have <10K peaks.

[Figure: histogram of Frequency against Number of peaks in experiment, with the <10K-peak and >30K-peak groups marked and a zoomed view]

Observed ChIP-seq peak overlaps are unlikely by chance.

[Figure panels: <10K-Peaks from >30K-Peaks; >30K-Peaks from <10K-Peaks]

Datasets used to test cell-type specificity of ChIP-seq peak overlaps.

(A) For within-cell-type predictability tests (B) For across-cell-type predictability tests

Column headings: unique experiments in HEPG2; unique experiments in K562; experiments unique to HEPG2; experiments unique to K562

Atf3 -- V0416101, Bcl3 -- Pcr1x, Atf3 -- V0416101, Bcl3 -- Pcr1x, Bhlhe40 -- V0416101, E2f6 -- sc22823V0416102, Bhlhe40 -- V0416101, Cebpb -- Iggrab, Cebpb -- Iggrab, Egr1 -- V0416101, Cebpb -- Iggrab, Cmyc, Cjun -- Iggrab, Elf1 -- sc631V0416102, Cjun -- Iggrab, E2f6 -- sc22823V0416102, Cmyc, Fos, Cmyc, NR4a1, Elf1 -- sc631V0416101, Egr1 -- V0416101, Elf1 -- sc631V0416101, Ets1 -- V0416101, Fosl2 -- V0416101, Junb, Fosl2 -- V0416101, Gata2 -- sc267Pcr1x, Foxa1 -- sc101058V0416101, Elf1 -- sc631V0416102, Foxa1 -- sc101058V0416101, Irf1 -- Ifng6hStd, Gabp -- Pcr2x, NR4a1, Gabp -- Pcr2x, Mafk -- ab50322Iggrab, Hey1 -- V0416101, Ets1 -- V0416101, Hey1 -- V0416101, Max -- V0416102, Hnf4 -- asc8987V0416101, Fosl1 -- sc183V0416101, Hnf4 -- asc8987V0416101, Nrf1 -- Iggrab, Irf3 -- Iggrab, Gabp -- V0416101, Irf3 -- Iggrab, Nrsf -- V0416102, Jund -- Iggrab, Gata2 -- sc267Pcr1x, Jund -- Iggrab, Pu1 -- Pcr1x, Maff -- m8194Iggrab, Hey1 -- Pcr1x, Maff -- m8194Iggrab, Six5 -- Pcr1x, Mafk -- ab50322Iggrab, Irf1 -- Ifng6hStd, Nrf1 -- Iggrab, Mafk -- ab50322Iggrab, P300 -- V0416101, Sp2 -- sc643V0416102, Nrsf -- Pcr2x, Max -- V0416102, Rxra -- Pcr1x, Srf -- V0416101, P300 -- V0416101, Nrf1 -- Iggrab, Sp1 -- Pcr1x, Tal1 -- sc12984Iggmus, Rxra -- Pcr1x, Nrsf -- V0416102, Tcf12 -- Pcr1x, Usf1 -- V0416101, Sp1 -- Pcr1x, Pu1 -- Pcr1x, Tcf4 -- Ucd, Yy1 -- V0416101, Srf -- V0416101, Six5 -- Pcr1x, Usf2 -- Iggrab, -- asc34508V0416101, Tcf12 -- Pcr1x, Sp1 -- Pcr1x, Zbtb33 -- Pcr1x, ZNF263, Tcf4 -- Ucd, Sp2 -- sc643V0416102, Znf274 -- Ucd, Usf1 -- Pcr1x, Srf -- V0416101, Usf2 -- Iggrab, Tal1 -- sc12984Iggmus, Zbtb33 -- Pcr1x, Usf1 -- V0416101, Znf274 -- Ucd, Usf2 -- Std, Yy1 -- V0416101, Zbtb33 -- Pcr1x, Zbtb7 -- asc34508V0416101, ZNF263

Observed ChIP-seq peak overlaps for experiments in the same cell-type.

Observed ChIP-seq peak overlaps for experiments using different cell-types.

Ernst et al, Nature. 2011;473(7345):43-9

Cells: H1hesc, K562, HepG2, HUVEC, HSMM , HMEC, NHLF, NHEK + 1 HapMap B-lymphoblastoid

Histone marks classified as: 1_Active_Promoter , 2_Weak_Promoter , 3_Poised_Promoter , 4_Strong_Enhancer , 5_Strong_Enhancer , 6_Weak_Enhancer , 7_Weak_Enhancer (total footprint of all histone marked regions = 627,972,582 bps , ~ 20.9% of the genome)

23,368 / 353,481 = 6.6% of histone marked regions overlap CRR198

30,705 / 32,467 = 94.6% of CRR198 overlap histone marked regions

22,883 / 32,467 = 70.5% of CRR198 overlap ‘strong enhancers’

Overlaps between 1000 runs of 32,467 uniformly distributed regions with the same size distribution as CRR198 and Ernst et al’s “Strong Enhancer” regions

For the observed overlap of 70.5%: p-value = 0, Z-score = 328
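A randomization test of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the genome is treated as one linear coordinate range, regions are re-placed uniformly at random while keeping their widths, and the function names, toy coordinates, and genome length are all hypothetical. The Z-score is the observed percent overlap minus the mean of the randomized runs, divided by their standard deviation; the empirical p-value is the fraction of runs that reach the observed value.

```python
import bisect
import random
import statistics


def percent_overlapping(regions, targets):
    """Percent of `regions` overlapping at least one interval in `targets`.

    `targets` must be sorted, non-overlapping, half-open (start, end) pairs.
    """
    starts = [s for s, _ in targets]
    hits = 0
    for start, end in regions:
        # Index of the last target that starts before this region ends.
        i = bisect.bisect_left(starts, end) - 1
        if i >= 0 and targets[i][1] > start:
            hits += 1
    return 100.0 * hits / len(regions)


def randomization_test(regions, targets, genome_len, runs=1000, seed=0):
    """Return (observed percent overlap, Z-score, empirical p-value)."""
    rng = random.Random(seed)
    observed = percent_overlapping(regions, targets)
    widths = [end - start for start, end in regions]
    null = []
    for _ in range(runs):
        # Re-place every region uniformly at random, preserving its width.
        placed = []
        for w in widths:
            s = rng.randrange(0, genome_len - w + 1)
            placed.append((s, s + w))
        null.append(percent_overlapping(placed, targets))
    z = (observed - statistics.mean(null)) / statistics.stdev(null)
    p = sum(x >= observed for x in null) / runs
    return observed, z, p


# Toy data (hypothetical): 50 targets of 200 bp, 40 regions placed inside them.
targets = [(i * 2000, i * 2000 + 200) for i in range(50)]
regions = [(i * 2000 + 50, i * 2000 + 150) for i in range(40)]
observed, z, p = randomization_test(regions, targets, genome_len=100_000, runs=200)
print(observed, round(z, 1), p)
```

With a finite number of runs, an empirical p-value of 0 only means that no randomized run reached the observed overlap; the smallest nonzero value this procedure can report is 1/runs.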

Frequency distribution of percent overlap. Blue curve is a Normal distribution with the same mean and SD.

Quantile-quantile plot confirms overlaps are approximately Normally distributed.

(HeLaS3, HUVEC, K562, NHEK, H1hesc + 7 HapMap B-lymphoblastoid cell lines)

958,250 / 1,067,220 = 89.8% of DNase1 selected regions overlap histone marked regions

(total footprint of DNase1-selected-regions = 22,388,756 bps , ~ 0.75% of the genome)

442,295 / 1,067,404 = 41.4% of DNase1Regions overlap CRR198

27,784 / 32,467 = 85.6% of CRR198 overlap DNase1Regions

84.2% of ChIPseq predicted CRMs are supported by both histone and DNase1-based predictions

Overlaps between 1000 runs of 32,467 uniformly distributed regions with the same size distribution as CRR198 and Boyle et al’s DNase1-marked conserved TFBS clusters

For the observed overlap of 85.6%: p-value = 0, Z-score = 715

Frequency distribution of percent overlap. Blue curve is a Normal distribution with the same mean and SD.

Quantile-quantile plot confirms overlaps are approximately Normally distributed.

Effect of selection threshold on overlap with DNase1-marked binding regions.

[Figure: Fraction shared/total (y-axis, 0.0 to 1.0) against Minimum number of overlapping experiments (x-axis, 0 to 100), with series labeled "overlaps" and "totals". Two runs of point labels appear on the plot: 38,753,334; 11,839,603; 5,388,843; 2,953,431; 1,390,931; 490,953; 141,986; 31,701; 7,667; 1,000 and 62,306; 19,588; 9,457; 5,436; 2,618; 935; 270; 61; 15; 2]

Effect of selection threshold on overlap with histone-marked regulatory regions

[Figure: Fraction shared/total (y-axis, 0.0 to 1.0) against Minimum number of overlapping experiments (x-axis, 0 to 100), with series labeled "overlaps" and "totals". Two runs of point labels appear on the plot: 38,753,334; 11,839,603; 5,388,843; 2,953,431; 1,390,931; 490,953; 141,986; 31,701; 7,667; 1,000 and 62,306; 19,588; 9,457; 5,436; 2,618; 935; 270; 61; 15; 2]

Distances of predicted CRRs from the nearest annotated transcription start sites (TSS)
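Distances of this kind could be computed with a sorted list of TSS positions and a binary search. The sketch below is a minimal Python illustration, not the authors' code. It assumes distance is measured from the region midpoint (the slides do not say which reference point was used), works on one chromosome at a time, and uses hypothetical coordinates.

```python
import bisect


def distance_to_nearest_tss(region, tss_sorted):
    """Distance in bp from the midpoint of a half-open (start, end) region
    to the nearest position in `tss_sorted` (a non-empty sorted list)."""
    start, end = region
    mid = (start + end) // 2
    i = bisect.bisect_left(tss_sorted, mid)
    # Only the TSS just before and just after the midpoint can be nearest.
    candidates = tss_sorted[max(i - 1, 0): i + 1]
    return min(abs(mid - t) for t in candidates)


# Hypothetical TSS positions and regions on one chromosome.
tss = [1000, 5000, 20000]
print(distance_to_nearest_tss((5200, 5400), tss))    # 300
print(distance_to_nearest_tss((0, 200), tss))        # 900
print(distance_to_nearest_tss((30000, 30100), tss))  # 10050
```

Binning these distances and taking log10 of the bin counts would give a histogram with the same axes as the one on this slide.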

[Figure: CRM198T15. Histogram of log10(frequency) (y-axis, 0 to 4) against Distance from nearest annotated TSS (bp) (x-axis, ticks from 250 to 18250)]

230 COSMIC mutations fall in predicted CRMs

Example implicated genes:

Ensembl Gene ID Description Gene Name EntrezGene ID

ENSG00000039068 cadherin 1, type 1, E-cadherin (epithelial) [Source:HGNC Symbol;Acc:1748] CDH1 999

ENSG00000099810 methylthioadenosine phosphorylase [Source:HGNC Symbol;Acc:7413] MTAP 4507

ENSG00000114315 hairy and enhancer of split 1, (Drosophila) [Source:HGNC Symbol;Acc:5192] HES1 3280

ENSG00000118046 serine/threonine kinase 11 [Source:HGNC Symbol;Acc:11389] STK11 6794

ENSG00000130164 low density lipoprotein receptor [Source:HGNC Symbol;Acc:6547] LDLR 3949

ENSG00000145675 phosphoinositide-3-kinase, regulatory subunit 1 (alpha) [Source:HGNC Symbol;Acc:8979] PIK3R1 5295

ENSG00000145740 solute carrier family 30 (zinc transporter), member 5 [Source:HGNC Symbol;Acc:19089] SLC30A5 64924

ENSG00000158887 myelin protein zero [Source:HGNC Symbol;Acc:7225] MPZ 4359

ENSG00000185104 Fas (TNFRSF6) associated factor 1 [Source:HGNC Symbol;Acc:3578] FAF1 11124

ENSG00000185338 suppressor of cytokine signaling 1 [Source:HGNC Symbol;Acc:19383] SOCS1 8651

ENSG00000196092 paired box 5 [Source:HGNC Symbol;Acc:8619] PAX5 5079

Six GWAS SNPs fall inside the intersection of CRM198T15 and DNase1HS and HistoneMarked Strong Enhancer regions

TF motif occurrences  GWAS SNP    Association                   location  Gene      Comment
AP-4                  rs10794720  chronic kidney disease        intronic  WDR37     Multiple associations
AP-2                  rs11009175  Depression                    upstream  ITGB1     Integrin beta 1, metastatic diffusion
Eip74EF               rs4387287   leukocyte telomere            5’ UTR    OBFC1     DNA replication machinery
Titf1+Nkx2-4+USF      rs2268983   Smoking behavior              intronic  ACTN1     Actinin1a, in focal adhesions
USF+Mycn (E-boxes)    rs5759167   prostate cancer               upstream  BIK       promotes apoptosis
Lmx1a+Lmx1b+Lhx3      rs3130320   Systemic lupus erythematosus  upstream  AK123889  Putative antisense transcript ENSG00000225914
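Finding variants that fall inside the intersection of several region sets amounts to a point-in-interval lookup against each set. The following Python sketch shows one way to do that. It is an assumption-laden illustration, not the pipeline used for the table above: the SNP names and all coordinates are hypothetical, positions are 0-based, regions are half-open, and regions within one set are assumed not to overlap each other.

```python
import bisect


def build_index(regions):
    """Group half-open (chrom, start, end) regions by chromosome, sorted by start."""
    index = {}
    for chrom, start, end in regions:
        index.setdefault(chrom, []).append((start, end))
    for chrom in index:
        index[chrom].sort()
    return index


def contains(index, chrom, pos):
    """True if position `pos` lies inside some region on `chrom`."""
    ivs = index.get(chrom, [])
    # Last interval whose start is <= pos.
    i = bisect.bisect_right(ivs, (pos, float("inf"))) - 1
    return i >= 0 and ivs[i][0] <= pos < ivs[i][1]


# Hypothetical region sets standing in for CRMs, DNase1 HS sites and enhancers.
crm = build_index([("chr1", 1000, 2000), ("chr2", 500, 900)])
dnase = build_index([("chr1", 1500, 1800)])
enhancer = build_index([("chr1", 1200, 2500)])

# Hypothetical SNPs: name -> (chromosome, position).
snps = {"snp_a": ("chr1", 1600), "snp_b": ("chr1", 1100), "snp_c": ("chr2", 600)}

in_all_three = sorted(
    name
    for name, (chrom, pos) in snps.items()
    if all(contains(idx, chrom, pos) for idx in (crm, dnase, enhancer))
)
print(in_all_three)  # ['snp_a']
```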

Also: 1,252 of predicted CRMs flank lincRNA regions (total 7,561)

Sequenced tag density varies with (1) Cell-type/library (2) Genomic location (3) Expression level

Vega et al (2009), PLoS ONE, 4(4): e5241

Red: high expression Green: low expression

[Figure panels: ESCs; MEFs; Neural Progenitor Cells]

Histograms of all peak widths before filtering. A small fraction of reported peak regions extend well beyond 10Kbp.
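The two pre-processing filters listed earlier (remove peaks wider than 4Kbp, remove peaks with outlier score) could look like the Python sketch below. The 4Kbp width limit comes from the slides. The outlier rule is an assumption made here for illustration only, since the slides do not define it: a score is dropped if it lies more than a chosen number of median absolute deviations (MADs) above the median. The function name, the `mad_cutoff` parameter, and the toy peaks are hypothetical.

```python
import statistics

MAX_WIDTH_BP = 4000  # width limit stated on the pre-processing slide


def filter_peaks(peaks, max_width=MAX_WIDTH_BP, mad_cutoff=10.0):
    """Filter a list of peaks given as dicts with 'start', 'end', 'score'.

    Step 1 drops peaks wider than `max_width`.
    Step 2 drops peaks whose score is more than `mad_cutoff` MADs above
    the median score (ASSUMED rule; the slides do not specify one).
    """
    kept = [p for p in peaks if p["end"] - p["start"] <= max_width]
    if not kept:
        return kept
    scores = [p["score"] for p in kept]
    med = statistics.median(scores)
    mad = statistics.median([abs(s - med) for s in scores])
    if mad == 0:
        # No spread in scores, so no peak can be called an outlier.
        return kept
    return [p for p in kept if (p["score"] - med) / mad <= mad_cutoff]


# Toy peaks (hypothetical): one is too wide, one has an extreme score.
peaks = [
    {"start": 100, "end": 400, "score": 50},
    {"start": 1000, "end": 6000, "score": 60},
    {"start": 7000, "end": 7400, "score": 55},
    {"start": 8000, "end": 8350, "score": 45},
    {"start": 9000, "end": 9500, "score": 5000},
]
print(len(filter_peaks(peaks)))  # 3
```

A median-based rule is used in the sketch because a few extreme scores would inflate a mean and standard deviation; any other documented criterion could be substituted.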

[Figure: two histograms, (A) and (B), of Log10(Frequency) against Peak widths (bp); one panel is a zoom-in]

Example peak scores for peaks called by MACS in a subset of our data.

The distribution is bimodal and includes some unusually high scoring peaks.

[Figure: Histogram of total$score. x-axis: total$score (0 to 1000); y-axis: Frequency (0e+00 to 1e+06)]

Peak width distribution for peaks <1000bp called by MACS in a subset of our data.

Note the presence of some peaks well below 100bp wide.

[Figure: histogram of peak widths. x-axis: width (short regions, bp), 200 to 1000; y-axis: Frequency, 0 to 250000]

Width distribution of regions with >15 peak overlaps in 198 datasets.

[Figure: Histogram of width(CRMs198fromIslands). x-axis: Overlap region width (bp), 0 to 3000; y-axis: Frequency, 0 to 8000]
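Regions where many peaks pile up can be found with a coverage sweep over peak start and end positions. The Python sketch below is one possible way to do this and is not taken from the authors' code. It assumes half-open (start, end) peaks on a single chromosome and at most one peak per experiment at any position, so that coverage depth equals the number of overlapping experiments. The function name and toy coordinates are hypothetical.

```python
from collections import Counter


def regions_with_min_depth(peak_sets, min_depth):
    """Return maximal half-open intervals covered by at least `min_depth` peaks.

    `peak_sets` is an iterable of peak lists, one list per experiment.
    """
    # Net change in coverage depth at each coordinate.
    delta_at = Counter()
    for peaks in peak_sets:
        for start, end in peaks:
            delta_at[start] += 1
            delta_at[end] -= 1

    regions = []
    depth = 0
    region_start = None
    for pos in sorted(delta_at):
        depth += delta_at[pos]
        if depth >= min_depth and region_start is None:
            region_start = pos
        elif depth < min_depth and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    return regions


# Toy data (hypothetical): three experiments on one chromosome.
experiments = [
    [(100, 300)],
    [(150, 350)],
    [(200, 400), (900, 1000)],
]
print(regions_with_min_depth(experiments, 3))  # [(200, 300)]
print(regions_with_min_depth(experiments, 2))  # [(150, 350)]
```

Under these assumptions, "more than 15 peak overlaps" would correspond to `min_depth=16`, and the widths of the returned intervals are what a histogram like the one above would summarize.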