Article Global reference mapping of human factor footprints

https://doi.org/10.1038/s41586-020-2528-x Jeff Vierstra1 ✉, John Lazar1,2, Richard Sandstrom1, Jessica Halow1, Kristen Lee1, Daniel Bates1, Morgan Diegel1, Douglas Dunn1, Fidencio Neri1, Eric Haugen1, Eric Rynes1, Alex Reynolds1, Received: 30 January 2020 Jemma Nelson1, Audra Johnson1, Mark Frerker1, Michael Buckley1, Rajinder Kaul1, Accepted: 25 June 2020 Wouter Meuleman1 & John A. Stamatoyannopoulos1,2,3 ✉

Published online: 29 July 2020 Open access Combinatorial binding of transcription factors to regulatory DNA underpins Check for updates regulation in all organisms. Genetic variation in regulatory regions has been connected with diseases and diverse phenotypic traits1, but it remains challenging to distinguish variants that afect regulatory function2. Genomic DNase I footprinting enables the quantitative, nucleotide-resolution delineation of sites of occupancy within native chromatin3–6. However, only a small fraction of such sites have been precisely resolved on the human sequence6. Here, to enable comprehensive mapping of transcription factor footprints, we produced high- density DNase I cleavage maps from 243 human cell and tissue types and states and integrated these data to delineate about 4.5 million compact genomic elements that transcription factor occupancy at nucleotide resolution. We map the fne- scale structure within about 1.6 million DNase I-hypersensitive sites and show that the overwhelming majority are populated by well-spaced sites of single transcription factor–DNA interaction. Cell-context-dependent cis-regulation is chiefy executed by wholesale modulation of accessibility at regulatory DNA rather than by diferential transcription factor occupancy within accessible elements. We also show that the enrichment of genetic variants associated with diseases or phenotypic traits in regulatory regions1,7 is almost entirely attributable to variants within footprints, and that functional variants that afect transcription factor occupancy are nearly evenly partitioned between loss- and gain-of-function alleles. Unexpectedly, we fnd increased density of human genetic variation within transcription factor footprints, revealing an unappreciated driver of cis-regulatory evolution. Our results provide a framework for both global and nucleotide-precision analyses of gene regulatory mechanisms and functional genetic variation.

Genome-encoded recognition sites for sequence-specific DNA bind- DNase I footprints pinpoint regulatory factor occupancy on DNA and can ing are the atomic units of eukaryotic gene regulation. Cur- be used to discriminate sites of direct versus indirect occupancy when rently we lack a comprehensive, nucleotide-resolution annotation of integrated with and sequencing (ChIP– such elements and their selective occupancy in different cell types and seq) experiments4. Cognate transcription factors (TFs) can be assigned states. Such a reference is essential both for analysis of cell-selective to footprints on the basis of matching consensus sequences, enabling regulation and for systematic integration of regulation with genetic the TF-focused analysis of gene regulation and regulatory networks10 variation associated with diseases and phenotypic traits. and of the evolution of regulatory factor binding patterns11. DNase I is In vivo binding of regulatory factors shields bound DNA elements from roughly the size of a typical TF and recognizes the minor groove of DNA, nuclease attack, giving rise to protected single-nucleotide-resolution where it hydrolyses single-stranded cleavages. These, in turn, reflect DNA ‘footprints’. The advent of DNA footprinting using the non-specific both the topology and the kinetics of coincidently bound proteins. Pre- nuclease DNase I8 marked a turning point in analyses of gene regu- vious efforts to analyse these featues4 were complicated by the slight lation, and facilitated the identification of the first mammalian sequence-driven cleavage preferences of DNase I, which have since been sequence-specific DNA binding proteins9. Genomic DNase I footprint- exhaustively determined12, setting the stage for fully resolved tracing of ing3–6 enables the genome-wide delineation of DNA footprints (approxi- DNA– interactions within regulatory DNA. mately 7–35 base pairs (bp)) over any genomic region in which DNase I Here we combine sampling of more than 67 billion uniquely map- cleavage is sufficiently dense—chiefly DNase I hypersensitive sites (DHSs). ping DNase I cleavages from over 240 human cell types and states to

1Altius Institute for Biomedical Sciences, Seattle, WA, USA. 2Department of Genome Sciences, University of Washington, Seattle, WA, USA. 3Division of Oncology, Department of Medicine, University of Washington, Seattle, WA, USA. ✉e-mail: [email protected]; [email protected]

Nature | Vol 583 | 30 July 2020 | 729 Article index human genomic footprints with unprecedented accuracy and DHSs within a representative (RELB) across all 243 biosamples. resolution, and thereby to identify the sequence elements that encode A notable feature of these data is the positional stability and discrete TF recognition sites within the . We leverage this index appearance of footprints seen within each DHS across tens to hundreds to (i) systematically assign footprints to TF archetypes; (ii) define pat- of biosamples. Plotting individual nucleotides scaled by their footprint terns of cell-selective occupancy; and (iii) analyse the distribution and prevalence across all samples precisely resolves the core recognition effect of human genetic variation on regulatory factor occupancy and sequences of diverse TFs (Fig. 1b, bottom). the genetics of common diseases and traits. To establish a reference set of TF-occupied DNA elements genome-wide, we applied the Bayesian approach to all DHSs detected within one or more of the 243 biosamples, and applied the same Global mapping of TF footprints consensus approach used to establish a consensus DHS index13 to To create comprehensive maps of TF occupancy, we deeply sequenced collate overlapping footprinted regions across individual biosa- high-quality, high-complexity DNase I libraries from 243 biosamples mples into distinct high-resolution consensus footprints (Supple- derived from diverse primary cells and tissues (n = 151), primary cells mentary Methods). Collectively, we delineated approximately 4.46 in culture (n = 22), immortalized cell lines (n = 10) and cancer cell lines million consensus footprints within about 1.6 million (46%) of the and primary samples (n = 60) (Supplementary Table 1). Collectively, 3.39 million DHSs indexed within these biosamples13 (Fig. 1c). More we uniquely mapped 67.6 billion DNase I cleavage events (mean, 278.2 than 90% of the DHSs with moderate sequencing coverage (over 250 million uniquely mapped cleavages per biosample), which represents tags per 250 million sequenced) contained at least one footprint a great increase over earlier studies4. On average, 49.7% of DNase cleav- (and typically many more; Fig. 1d). As expected, consensus (that is, ages from each biosample mapped to DHSs, which covered 1–3% of empirical Bayes) footprints were markedly more reproducible than the genome. footprints detected using individual datasets (average Jaccard simi- To identify DNase I footprints genome-wide, we developed a compu- larity between replicate biosamples 0.43 versus 0.29, respectively) tational approach that incorporates both chromatin architecture and (Extended Data Fig. 3a). exhaustively enumerated empirical DNase I sequence preferences to Consensus footprints were on average 16 bp wide (middle 95%: determine expected per-nucleotide cleavage rates across the genome, 7–44 bp; 90%: 7–36 bp; 50%: 9–21 bp) and were distributed across all and to derive, for each biosample, a statistical model for testing whether classes of DHS, albeit with enrichment in -proximal elements its observed cleavage rates at individual nucleotides deviated signifi- owing to their generally elevated cleavage density (Extended Data Fig. 3b, cantly from expectation (Extended Data Fig. 1a–g, Supplementary c). Most consensus footprints (82.6%) localized directly within the core Methods). We note that the derivation of cleavage variability models of a DHS peak (average width 203 bp), with virtually all of the remainder for each biosample individually accounts for additional sources of localized within 250 bp of a DHS peak summit (Fig. 1e). Collectively, technical variability beyond DNase I cleavage preference. consensus footprints annotated 2.1% (72 Mb) of the human genome refer- Using this model, we performed de novo footprint discovery inde- ence sequence, compared with about 1.5% for protein coding elements. pendently on each of 243 biosamples, detecting on average 657,029 Given the strong dependency of footprint detection on sequenc- high-confidence footprints per biosample (range 220,580–1,664,065, ing depth (Extended Data Fig. 1b) and sample diversity, we sought empirical false discovery rate <1% (Supplementary Methods)), and col- to estimate how comprehensively this index covered the possible lectively 159.6 million footprint events across all biosamples. Nucleo- detectable footprint space and to what degree additional sequenc- tide protection tracked closely with both the presence of known TF ing and/or biosamples would augment footprint discovery. De novo recognition sequences and the level of per-nucleotide evolutionary footprint detection after iteratively subsampling the most deeply conservation (Extended Data Fig. 2a, b). At the level of individual sequenced DNase I libraries (more than 750 million sequenced tags) nucleotides, de novo footprints genome-wide were highly concordant showed that footprints detected increased linearly with sequencing between biological replicates of the same cultured cell type or between depth (Extended Data Fig. 3d, e), indicating that these DNase I librar- the same primary cell and tissue types sampled from different individu- ies have yet to be sampled to saturation. By contrast, the addition of als (median Pearson’s r = 0.83 and 0.74, respectively) (Extended Data new biosamples and/or replicates produced a sublinear increase in Fig. 2c–e). Within each biosample, footprints encompassed an average the number of footprints detected (Extended Data Fig. 3f, g). Because of around 7.6 Mb (0.2%) of the genome, with a mean of 4.3 footprints the consensus approach favours footprints with support from many per DHS with sufficient read depth for robust detection (normalized biosamples, the consensus footprint space reported here is likely to cleavage density within DHS of at least 1). represent a substantial proportion of TF binding sites that are shared across many cell and tissue types.

Unified index of human genomic footprints Comparative footprinting across cell types has the potential to Assigning TFs to footprints illuminate both the structure and function of regulatory DNA, but Recognition sequences now exist for all major families and subfami- a systematic approach for joint analysis of genomic footprinting lies of TFs, and for a large number of individual TF isoforms14. We data has been lacking. Given the scale and diversity of the cell types thus sought to create a reference mapping between annotated TFs and tissues surveyed, we sought to develop a framework that could and consensus footprints by (i) compiling and clustering all publicly integrate hundreds of available footprinting datasets to increase the available motif models15–17; (ii) creating non-redundant TF archetypes precision and resolution of footprint detection and, furthermore, by placing closely related TF family members on a common sequence to provide a scaffold for a common reference index of TF-contacted axis (Extended Data Fig. 4, Supplementary Table 2, Supplementary DNA genome-wide. Methods); (iii) aligning TF archetypes to the human reference sequence To accomplish this, we implemented an empirical Bayes frame- at high stringency (P < 10−4); and (iv) enumerating all potential TF arche- work that estimates the posterior probability that a given nucleotide types that are compatible with each consensus footprint on the basis is footprinted by incorporating a prior on the presence of a footprint of overlap and match stringency. In total, 80.7% of the approximately (determined by footprints independently identified within individual 4.46 million consensus footprints could be assigned to at least one TF datasets) and a likelihood model of cleavage rates for both occupied and with at least 90% sequence overlap, of which 860,780 (19.3%) could be unoccupied sites (Fig. 1a, Supplementary Methods). Figure 1b depicts unambiguously assigned to a single factor, and 2,038,220 (45.7%) to a per-nucleotide footprint posterior probabilities computed for two single TF with two lower-ranked alternatives.

730 | Nature | Vol 583 | 30 July 2020 a Chr. 19 c 243 cell & tissue types 44,995,000 45,010,000 45,025,000

RELB 4.5 million distinct consensus footprints DNase I 20 density 10 + 1.6 million footprinted CD8 T cells 0 regulatory elements 397 bp 337 bp 30 Observed 80 Biosamples (n = 243) cleavage 0 0 Expected 80 30 SP1 cleavage 0 0 AP1 1 1 Footprint CTCF ETS probability 00 ... 50 bp 50 bp Footprints (4.5M) b Footprint Unocccupied DHS DHS Occupied TF motifs probability in sample in sample Constituitive DHS Cell-selective DHS 0 1 Haematopoietic d No. footprints 1+ 5+ detected: 3+ Fetal life support 1.0 Genitourinary 90% Renal Endocrine 0.5 243 ) Nervous n = Hepatic Prop. DHSs 0.0 with footprint(s) Connective 5 50 250 2,000 Digestive DHS density (tags per 250 million sequenced) Musculoskeletal

Biosamples ( Epithelial Integumentary e Respiratory 82.6% footprints in core DHS Embryonic 4 Cardiovascular 203 bp (avg. DHS TF-protected 2 peak width) nucleotides (×100,000)

Consensus No. footprints 0 footprints SP/KLF SP/KLF AP1 SP/KLF SP/KLF SP/KLF ETS SP/KLF No match SP/KLF SP/KLF ETS SP/KLF ETS SP/KLF SP/KLF No match MY B E-box RUNX REL POU REL STAT NFAT –250 0 +250 Distance to DHS peak summit

Fig. 1 | A nucleotide-resolution atlas of TF occupancy on the human consensus footprints in one or more cell or tissue types (footprint posterior genome. a, DNase I cleavage patterns (RELB locus in CD8+ T cells). Top, >0.99). c, Consensus map of TF occupancy derived from 243 biosamples windowed DNase I cleavage density. Below, per-nucleotide cleavage and covering 1.6 million DHSs providing expansive nucleotide-resolution footprint posterior probabilities within two DHSs. b, Heat map of footprint annotation of regulatory DNA. d, Proportion of DHSs with footprints at given posterior probabilities integrating 243 biosamples. Rows are individual sequencing depth. Dashed red lines and dot show read depth (tags per 250 biosamples grouped by tissue or organ systems; columns are individual million uniquely mapped reads) at which footprint is detected in 90% of DHSs. nucleotides. Black fills to right of heat maps indicate overlapping DHSs in e, Histogram of footprint location relative to DHS peak summit. Dashed red biosample. Below, DHS sequence scaled by footprint prevalence. Grey boxes, lines represent average size of a DHS peak (203 bp).

To gauge the sensitivity and accuracy of the motif-to-consensus foot- print mappings, we evaluated the posterior footprint probability as Primary architecture of regulatory regions metric to classify motif occupancy by using the genomic master regula- Despite intensive efforts over several decades, the primary architecture tor CCCTC-binding factor (CTCF). CTCF combines a well-documented, of regulatory regions has remained unclear, with the singular exception unambiguous motif with the availability of ENCODE ChIP–seq data18 for of the interferon ‘enhanceosome’19. Elucidating the primary architec- a broad range of cell and tissue types that match those represented in ture of active regulatory DNA requires accurate tracing of the TF–DNA the consensus footprint index (Supplementary Table 3). Comparing the interface over an extended interval. Because TF engagement creates occupancy of all CTCF motifs within all DHSs (Supplementary Methods) subtle alterations in DNA shape and protects underlying phosphate with CTCF ChIP–seq data showed strong classification performance, bonds from nuclease attack via steric hindrance6, we investigated to with a mean area under precision-recall curve of 0.80 (Extended Data what extent fluctuations in corrected DNase I cleavage rates within Fig. 5a, b). At the posterior footprint probability threshold used to gener- individual consensus footprints accurately reflected the topology ate consensus footprints (P > 0.99), we correctly identified an average of of the TF–DNA interface. Notably, previous efforts to resolve such 19,904 CTCF-bound recognition elements per cell type, corresponding features4 were obscured by subtle intrinsic cleavage preferences and to a mean precision of 82.5% and sensitivity of 60% (Supplementary lacked resolving power at individual TF footprints on the genome. Table 3), despite posterior footprint probability not encoding any Poly-zinc fingers are the most prevalent class of human TFs and have information about the quality of motif matches. Lower CTCF motif recognition interfaces that potentially cover tens of nucleotides14. The match scores were strongly associated with false-positive footprint or DNA recognition domain of CTCF comprises 11 zinc fingers, potentially motif classifications, so the incorporation of motif match strength in encoding 33 bp of sequence (or DNA shape20) recognition. We identi- addition to footprint probability is expected to increase classification fied 25,852 footprints that coincided precisely with CTCF motifs within precision (Extended Data Fig. 5c). Overall, footprinted motifs showed an regulatory T cells. Transposing the average corrected per-nucleotide approximately 2.5-fold increase in CTCF ChIP–signal when compared to cleavage propensity with an extended co-crystal structure of CTCF21 non-footprinted motifs (Extended Data Fig. 5d, e). Examination of other accurately traced all features of the protein–DNA interaction interface, TFs yielded similar results, albeit with variable classification accuracy including focal hypersensitivity within the hinge region between zinc that was probably driven by the ambiguity in footprint assignment for fingers 7 and 95,22,23 (Fig. 2a, Supplementary Methods). A similar result motifs recognized by many distinct TFs and the predominance of weak was obtained for widely divergent classes of DNA binding domain, such and/or indirect occupancy (Extended Data Fig. 5f–m). as the paired-box domain-containing TF PAX624 (Extended Data Fig. 6a)

Nature | Vol 583 | 30 July 2020 | 731 Article a Core Upstream a Motif enrichment (Z-score) No. motifs CTCF Single TF (ZFs 3–7) (ZFs 9–11) –0.5 2 012+ (poly-zinc nger) 15–20nt 20–25nt Independent 25–30nt 30–35nt Multiple TFs

Footprint width 35–40nt

DNase I cleavage –40 –20 0 20 40 0 100 Cooperative Low High Hinge Distance from Footprints (ZF 8) footprint centre (bp) (%) b Cleavage c b 50 Footprint density 7–30 bp 7 Single TF footprint –1.5 1.5 5.5 Obs/exp 6 5 footprints (log ) 92% 2 4 8% 3 Promoter Expected 0 2 (per 200 bp)

chr4:113,547,724 No. footprints 1 Distal 0 >30 bp Multi-TF footprint Footprint spacing DNase I cleavage Observed 60 e 20 Typical regulatory element 40 21 bp ~200 bp: 5–6 TFs 0 between 20 directly bound chr3:114,291,020 footprints (T regulatory cells) distance (bp) c 0 0 1 2 3 4 5 6 7 8 9 Footprint edge-to-edge 25,852 footprinted CTCF motifs CTCF (core) DHS peak acessibility ~21 bp 20 spacing PAX6 (footprint detection sensitivity) HNF4A TP63 16 MEF2D Fig. 3 | Modes of TF occupancy within regulatory DNA. a, Overlap and spatial 1 MYF6 IRF4 enrichment of TF recognition sequences within footprints binned by width. 0 12 OTX1 EBF1 Left, density heat map of motif occurrences around footprints binned by SP1 (obs/exp)

2 width. Right, proportion of footprints uniquely overlapped by 0, 1 or 2 or more –1

log U = 0.90

Mean cleavag e 8 SOX3 recognition sequences. b, Percentage of footprints representing occupancy of –25 1–35 +25 Consensus footprint widt h single TF (≤30 bp) or multiple TFs (>30 bp). c, d, Footprint density and spacing Distance to motif (bp) 8 12 16 20 (edge-to-edge) plotted against mean normalized cleavage density (tags per Recognition sequence width 150 bp per million reads) within promoter-proximal and distal DHSs. Solid lines Fig. 2 | Footprints encapsulate topological structures of individual and shaded regions indicate median and middle 50th percentile, respectively. TF–DNA interactions. a, Structure of CTCF zinc fingers 3–11 bound to cognate e, A typical DHS contains about five or six directly bound TFs spaced roughly DNA recognition sequence ( (PDB) codes: 5YEF and 5YEL)21. 20 bp apart. DNA coloration shows mean observed versus expected cleavage at footprinted CTCF motifs (in T regulatory cells). b, Heat map of relative cleavage at each of 25,852 footprinted CTCF motifs (posterior probability >0.99). Below, and morphology of TF binding events within individual regulatory aggregate (summed) nuclease cleavage relative to footprinted motifs. Right, elements could be used to gain insight into the mechanistic basis of nuclease cleavage (observed and expected) at three footprints randomly TF cooperativity. selected across genome. c, Footprint width is tightly correlated with the width As the width of genomic footprints tightly tracks the physical struc- of the TF recognition sequence (Spearman’s ρ = 0.9, P = 0.001). ture of individual TFs bound to DNA (Fig. 2a, b, Extended Data Fig. 6), and direct TF–TF interactions are dependent on close proximity, such interactions should result in larger footprints that contain multiple TF recognition sites. Conversely, independent TF–DNA interaction events and other TFs with extant co-crystal structures (not shown). Critically, should yield compact and widely spaced footprints that contain single these topological features were evident at the level of individual TF TF recognition sites. As such, the prevalence of cooperativity mediated footprints on the genome (Fig. 2a, Extended Data Fig. 6). Overall, the by direct TF–TF interactions rather than by synergy of independent average footprint width for diverse TFs tightly tracked the width of binding events should be reflected in the relative proportion of wide, their respective recognition sequences (Spearman’s ρ = 0.90, P = 0.001) multi-motif footprints compared to that of well-spaced single foot- (Fig. 2b). As such, the extended profile of corrected per-nucleotide prints. Larger footprints are overwhelmingly associated with two (or DNase I cleavage across entire regulatory regions should, in principle, more) recognition sequences (Fig. 3a), but such footprints represent provide a snapshot of the primary structure of active regulatory DNA. only 8% of the global footprint landscape. By contrast, 92% of footprints contain a single TF recognition site (Fig. 3b). Because TFs can distort DNA upon engagement, TF spacing could Distinguishing TF occupancy modes be critical for establishing regulatory structures. To quantify global TFs compete cooperatively with nucleosomes for access to regu- footprint spacing patterns, we first binned each DHS by its average latory DNA25,26. Many TFs have the potential to catalyse changes in accessibility across all biosamples (as footprint discovery depends nucleosome occupancy over a strongly matching recognition motif, on total DNase I cleavage; Extended Data Fig. 1b), and for each bin we a process referred to as ‘pioneering’27. However, it is unclear how computed the mean number of footprints present per element and steady-state chromatin accessibility is maintained by TFs in place of their relative edge-to-edge spacing. The density of footprints within the a canonical nucleosome, and whether this results primarily from local most deeply sampled DHSs genome-wide plateaued at an average of 5.5 protein–protein interactions or the synergistic effects of independent per 200 bp (Fig. 3c, top), which is in agreement with theoretical predic- TF–DNA binding26. We reasoned that the number, relative spacing, tions of the number of human TFs required to destabilize a canonical

732 | Nature | Vol 583 | 30 July 2020 abd Variants in NFIX footprints 147 individuals 15 Alt. stronger Ref. stronger (243 samples) 946 943 All variants 10 (n = 7,110) 2 Imbalanced

1.65 million SNVs 5 Densit y (n = 1,889) Percentage of 0

variants imbalance d 0 117,626 0 0.25 0.50 0.75 1.00 DHS + + imbalanced Prop. cleavages on reference allele Footprint + – e Variant effect on NFIX occupancy

c Chr. 2 232,606,200 2 U = 0.90 Loss of EFHD1 binding &* rs10171498 0 Gain of 50 Allelic 4,504 rs10171498 binding (38.8%) DNase I 0 –2 cleavage 120 7,117 sample s (homozygotes; RR vs. AA ) 0 0.25 0.50 0.75 1.00 (summed) (61.2%) Relative DNase I protection

0 Heterozygou s Prop. cleavages on reference allele (heterozygotes) Retina 16 (fetal) 0 HT29 10 f Variant effect on NFIX cells sample s recognition sequence

0 Homozygou s

Relative 2 && n = 74 5 Alternate allele DNase I 4 P weaker motif 0 10 protection 0 4 0 (log2) n = 12 –lo g –2 ** Alternate allele −5 stronger motif

777&*$&*7777&&$**&*$ Log-odds motif score

* (reference vs. alternative ) 0 0.25 0.50 0.75 1.00 NFIX Prop. cleavages on reference allele

Fig. 4 | Functional genetic variation localizes in TF footprints. a, Allelic of allelic ratios for variants that overlap the footprinted NFIX recognition imbalance assessed at 1.65 million variants discovered from 147 unique sequence. Grey, all variants tested (n = 7,110). Blue, significantly imbalanced individuals represented in 243 biosamples. b, Percentage of variants variants (n = 1,889). Prop., proportion. e, Scatter plot of allelic imbalance in imbalanced in footprinted and non-footprinted segments of DHS peaks. heterozygous individuals (x-axis) against relative difference in footprint depth c, Variant rs10171498 (C→G; C allele ancestral) creates a de novo NFIX footprint. between homozygous individuals at variants overlapping an NFIX footprint. Top, allelically resolved per-nucleotide DNase cleavage aggregated from 56 Each point shows an individual SNV within the footprinted NFIX binding heterozygotes. Middle, DNase cleavage in two samples homozygous for site that is both imbalanced (q < 0.2) in heterozygotes and differentially reference or alternative alleles. Bottom, mean differential per-nucleotide footprinted (nominal P < 0.05) in homozygotes. Grey line, fitted linear model. cleavage (log2) between homozygous reference (n = 74) and alternative allele f, Allelic imbalance versus predicted energetic effects of variants within NFIX samples (n = 12). Colour indicates statistical significance (–log10 P) of footprints. Shown is median log-odds score (reference versus alternate allele) per-nucleotide differential test (Methods). Variant and differentially of all tested variants within footprinted motifs binned by allelic ratio. Error bars footprinted nucleotides precisely colocalize at the NFIX element. d, Histogram show 5th and 95th percentiles of log-odds motif scores in each bin.

nucleosome26 and to encode specificity28. Within DHSs, footprints Given that most DHSs are shared across at least two cell types or exhibited average edge-to-edge spacing of about 21 bp (middle 50%, states13,34, we queried how the pattern of footprints within a DHS (and 12–35 bp) (Fig. 3c, bottom). Together, these results are compatible hence its topology) differed with cellular context. Although differential with the observed lack of evolutionary constraint on the spacing and TF occupancy can be discerned upon manual inspection4, systematic orientation29–33 of TF motifs and strongly suggest that steady-state analysis has not been possible owing to the dominance of intrinsic regulatory DNA accessibility is maintained chiefly by independent but DNase I cleavage propensities. To enable unbiased detection of dif- synergistic TF binding modes (Fig. 3d). ferential footprint occupancy, we developed a statistical framework to test for differences in relative cleavage rates at individual nucleotides across many samples, analogous to methods developed for the identi- Cell-selective TF occupancy landscapes fication of differentially expressed (Supplementary Methods). Footprint occupancy across all biosamples showed marked enrichment To estimate the proportion of differentially regulated footprints within for the recognition sequences of key regulatory TFs in their cognate DHSs of a given cell or tissue, we focused on the neural lineage, for which lineages (Extended Data Fig. 7a). In total, we identified 609 motif mod- many biosamples were available. We compared footprint occupancy els that matched footprinted sequences (Supplementary Methods); within DHSs that were broadly accessible in nervous-system-derived these models encompassed 64 distinct archetypal TF recognition codes samples (n = 31) with that in non-nervous-system-derived samples (Supplementary Table 2), representing virtually all major DNA-binding (n = 212). We selected 67,368 DHSs that were highly accessible in at least domain families. For degenerate motifs where the same sequence is 10 nervous- and non-nervous-derived samples, and for each DHS, per- recognized by many distinct TFs, we observed highly cell-selective formed a per-nucleotide differential test (Extended Data Figs. 8a, b, 9a). occupancy patterns that could be decomposed into coherent groups This analysis identified only a small proportion of DHSs (1,720 DHSs; that corresponded to cell type and function (Extended Data Fig. 7b–d). 2.5%) as containing one or more differentially footprinted elements However, the cell-selective occupancy patterns of most individual TF (Extended Data Fig. 9a). Most of these DHSs contained a single differ- footprints within DHSs mirrored the cell-selective actuation of their entially regulated footprint, whereas a small fraction contained 2–4 encompassing DHS (Extended Data Fig. 7b–d). differentially occupied elements (Extended Data Fig. 9a). Nonetheless,

Nature | Vol 583 | 30 July 2020 | 733 Article

a 4.5 million consensus footprints c evenly partitioned between loss-of-function (disruption of binding)

) Nucleotide diversity 4 1.0 10.0 and gain-of-function (increased or de novo binding) alleles (Fig. 4c, Evolutionarily constrained 9.0 d, Extended Data Fig. 10c). Homozygosity for either the reference or 0.5 (phyloP > 1) 8.0 alternative allele paralleled results from heterozygotes and further

Density 11.6% Mean S (× 10 −100 −50050 100 revealed that structural changes due to TF occupancy were precisely 0 –2 0 2 Distance to footprint centre confined to the DNA sequence recognition interface (Fig. 4c, bot- Mean phyloP score at footprint tom). In many cases, SNVs that were detected in both heterozygous Rare variation and homozygous configurations showed strong agreement between b Nucletoide diversity d 1.2 Observed allelic ratios and relative footprint strength (Fig. 4e; Spearman’s ρ = 0.9, 12.5 < 0.0001 −5

) Consensus

4 P < 10 ). Variants within footprinted motifs were markedly enriched for 10.0 footprints (95%-ile CI) 1.0 imbalance when compared to non-footprinted motifs; were localized 7.5 Fourfold to high-information-content positions within the recognition interface 1.25 Expected 5.0 degenerate SNV densit y 7-mer mutation coding sites (Fig. 4c, bottom, Extended Data Fig. 11); and paralleled the predicted

Mean S (× 10 rate model 2.5 energetic effect of the variant on the TF binding site (Fig. 4f, Extended l 1.00 Al ) 1 1) –100 –50 0 50 100 Data Fig. 12), thus providing a direct quantitative readout of the effects ≤ Neutral Distance to footprint centre loP > of functional variation on TF occupancy. Constrained (phyloP (phy

Fig. 5 | Footprints are highly polymorphic in human populations. TFs occupy hypermutable DNA a, Histogram of mean phyloP scores within consensus footprints. Dashed red We next sought to characterize the patterns of human genetic variation line, phyloP = 1. b, Nucleotide diversity (π) within footprints (mean ± 95% confidence interval of mean). Yellow bar, 95% confidence interval of mean π within regulatory DNA with high precision. Only a small fraction (11.6%) computed at fourfold degenerate coding sites. c, Mean nucleotide diversity of individual footprints showed evidence of evolutionary constraint versus distance from consensus footprints. Shaded bar represents average (phyloP score >1), consistent with purifying selection, whereas the vast footprint width (14 bp). d, Density of observed and expected rare variation majority appeared to be evolving neutrally (Fig. 5a). To quantify the within footprints. Top, variants with minor allele frequencies (maf) <0.0001. relationship between evolutionary constraint and genetic variation Bottom, expected rare variation computed using 7-mer sequence context in human populations, we calculated mean nucleotide diversity (π) mutation rate model42. within consensus genomic footprints by using more than 400 mil- lion single-nucleotide variants detected by whole-genome sequenc- ing of over 65,000 individuals under the TOPMED project37 (Fig. 5b, differentially occupied footprints were significantly enriched in recog- Supplementary Methods). Canonically, reduced levels of π reflect the nition sites for known nervous system regulators such as REST, NFIB, elimination of deleterious alleles from the population by natural selec- ZIC1, and EBF1 (Extended Data Fig. 9a–c) and tissue-selective occupancy tion, and hence are indicative of recent functional constraint. Consist- events paralleled the expression of nearby genes (in the case of REST ent with prior observations36, we found that mean π within footprints occupancy) (Extended Data Fig. 9d). approximated that of fourfold degenerate sites within protein-coding Collectively, the above results indicate that the vast majority of regions, which are assumed to be evolving neutrally or under relaxed regulatory DNA regions marked by DHSs encode a single structural selection. Stratification of footprints by the level of evolutionary con- topology that reflects a fixed pattern of footprint occupancy. None- straint (phyloP score >1) revealed marked differences in genetic diver- theless, at a small minority of elements, DHSs provide a scaffold for sity, with significantly reduced levels of π within highly evolutionarily cell-context-specific TF occupancy that is typically confined to one constrained footprints and increased π in non-constrained footprints or a small number of footprinted elements. (P < 0.0001; two-sample bootstrap t-test). The density of sampled variation enabled nucleotide-resolution anal- ysis of nucleotide diversity at footprinted and non-footprinted bases Functional DNA variants in TF footprints within DHSs. Unexpectedly, we found a marked increase in nucleotide Identifying genetic variants that are likely to affect regulatory func- diversity centred precisely within the core of footprints (Fig. 5c), reveal- tion has remained challenging. Deep sequence coverage at DHSs ing that these elements as a class—but not intervening non-footprinted enables de novo genotyping of regulatory variants and simultaneous segments of DHSs—are highly polymorphic in human populations. characterization of their functional effect on local chromatin archi- This result eclipses prior global analyses indicating that TF occupancy tecture by quantifying and comparing cleavage for each allele2,4. The sites are generally not under substantial purifying selection4,36 both 243 biosamples we analysed were derived from 147 individuals, and in the magnitude of the observed effect and in its nucleotide-precise de novo genotyping (Supplementary Methods) revealed 3.76 million localization within the footprint core. single-nucleotide variants (SNVs) within DHSs, of which 1,656,597 were Focally increased genetic diversity within footprints suggested that heterozygous and had sufficient read depth (at least 35 overlapping the nucleotides that encode these elements may have an increased reads) to accurately quantify allelic imbalance. mutational load when compared with immediately adjacent sequences. Across individuals, we conservatively identified 117,626 To explore this possibility, we focused on variants with extremely low chromatin-altering variants (CAVs) that altered DNA accessibility on allele frequencies in human populations (minor allele frequency less individual alleles (median 2.4-fold imbalance) (Fig. 4a, Extended Data than 10−4); such variants are assumed to result from de novo germline Fig. 10a–c, Supplementary Methods). Within DHSs, CAVs were mark- (that is, non-segregating) mutation and are often used as a surro- edly enriched in core consensus footprints, even after controlling for gate for mutation rate in humans. We found that the distribution of the increased detection power (that is, sequencing depth) within this extremely rare variants within and around footprints mirrored that of compartment (Fig. 4b, Extended Data Fig. 10d). nucleotide diversity, compatible with increased mutation rate within In protein-coding regions, most functional genetic variation footprints (Fig. 5d, top). TFs have been hypothesized to potentiate is expected to be deleterious, with rare gain-of-function alleles35. de novo mutation by focally inhibiting access by the DNA repair machin- Protein–DNA recognition interfaces are likewise presumed to be ery38,39. Nucleotide context is also known to have a substantial role in susceptible to disruption at critical nucleotides, predisposing to genome mutation40, and this can be accurately modelled across a wide loss-of-function alleles36. Notably, we found that CAVs were nearly range of nucleotide combinations41,42. To differentiate these possible

734 | Nature | Vol 583 | 30 July 2020 abGWAS catalogue UKBB c UKBB have a key evolutionary role by favouring variability in TF occupancy (LD-expanded) lymphocyte RBC counts counts * and hence natural variation in gene regulation. 1.3 35 45 *P < 10–5 * 30 * 1.2 * *P < 0.01 40 *P < 0.01 * 25 GWAS variants localize within TF footprints )/Pr(SNPs)

1.1 )/Pr(SNPs ) 2

10 2 * 5 * 1.0 Given the above, genetic variation within footprints should, in princi- Pr ( h 5 Pr ( h * * ple, be a key contributor to phenotypic variation. We therefore next

sampled 1KGP SNVs 0 Fold-enrichment over 0.9 0

s resolved the large set of variants that are strongly associated (nominal −8 0.05 0.01

0.001 All DHS All DHS P < 5 × 10 ) with diverse diseases and phenotypic traits from the NHGRI/ 45 All footprints All footprints EBI genome-wide association study (GWAS) catalogue to consensus Consensus Lymphoid DHS Erythroid DHS Within DHS, genomic footprints. To account for the baseline increase in genetic footprints Lymphoid footprints Erythroid footprints (1 – posterior)

outside footprint variation present within the genomic footprints described above, we performed exhaustive (1,000×) sampling of matched variants (by minor Fig. 6 | Trait-associated variation is concentrated in consensus footprints. allele frequency, linkage-disequilibrium (LD) structure, and distance a, Enrichment of GWAS variants within or outside consensus footprints versus 46 randomly sampled 1,000 Project (1KGP) variants, after expanding to the nearest gene) from the 1,000 Genome Project (Supplemen- both with variants in perfect LD (r2 = 1.0, central European population). Centre tary Methods). In addition, we expanded both GWAS catalogue and lines, median; boxes, interquartile range (IQR); whiskers, 5th and 95th matched sampled variants to include variants that were in perfect LD 2 percentile (of enrichments from 1,000 sampling iterations). Statistical (r = 1). Within DHSs, aggregated GWAS catalogue SNPs were enriched significance determined by a normal distribution fitted to sampled data within footprints but not non-footprinted subregions, and the former (Supplementary Methods). b, c, Enrichment of SNP-based trait heritability increased monotonically with footprint strength (Fig. 6a). using LD-score regression for UK BioBank (UKBB) GWAS on lymphocyte count To gain a more accurate view of the enrichment of trait-associated (b) and red blood cell count (c). Pr(h2), proportion of trait narrow-sense variants in footprints, we compared the SNP-based trait heritability heritability explained by SNPs overlapping DHSs or footprints. Pr(SNPs), of individual traits47,48. Using summary statistic data from individual proportion of total SNPs within each annotation. Asterisk, enrichment P < 0.01. GWAS studies from the UK BioBank, we applied partitioned LD-score regression to compute the relative heritability contribution of vari- ants within all DHSs and footprints collectively versus that of DHSs causes of increased mutation, we used a 7-mer context mutation rate and footprints from the expected cognate cell type for a given trait model42 (Supplementary Methods) to predict mutation density within (Fig. 6b, c). We found striking enrichment of variants that account for footprints. This model nearly completely recapitulated the observed trait heritability in footprints generally (more than fivefold) and most density of human SNVs within footprints (Fig. 5d, bottom), indicating prominently in footprints from the cognate cell type (up to approxi- that footprint mutational load derives chiefly from local sequence mately 45-fold) (Fig. 6b, c). We thus conclude that the genetic signals composition and not from a repair-mediated process. from disease- and trait-associated variants within DHSs emanate from Mutational mechanisms have been linked to the observed wide- TF footprints, and that variants within footprints are major contribu- spread turnover of TF recognition sites43,44. Of note, many TFs favour tors to trait heritability. the recognition of dinucleotide combinations such as CpGs that are intrinsically hypermutable or of dinucleotides that result from CpG deamination43,44. We examined the nucleotide-resolved patterns of Discussion both evolutionary conservation and genetic variation at footprinted We have described the highest-resolution view to date of regulatory motifs for structurally distinct TFs with CpG dinucleotides in their factor occupancy patterns on the human genome, measured across an core recognition sequence (ETS1, JDP2 and CTCF). For each motif, con- expansive range of cell and tissue contexts sampled from more than 140 servation and nucleotide diversity were reciprocal, and mutations at genotype backgrounds. The scale and breadth of the data have enabled CpG dinucleotides appeared to be the key drivers of generic diversity delineation of a reference set of about 4.5 million genomic sequence (Extended Data Fig. 13a–c). elements that form the building blocks of regulatory DNA and collec- Because increased polymorphism within TF footprints is attribut- tively define nucleotides that are crucial for genome regulation and able to variability in mutation rates resulting from sequence context, function. While expansive, this catalogue is nonetheless not compre- it remains unclear to what extent purifying selection is acting on TF hensive owing to incomplete sampling of human cell types and states, occupancy. To quantify this, we compared footprinted motifs to and non-exhaustive sequencing of individual DNase-seq libraries. We non-footprinted elements (both within and outside DHS), reasoning note further that the algorithms we have applied, while incorporating that the latter should represent neutrally evolving, non-functional considerable advances over prior efforts, nonetheless incompletely sites, but should be subjected to similar mutational forces owing to exploit the richness and subtleties of the measured cleavage landscape. proximity. Consistent with this, footprinted motifs were markedly Assigning individual TFs to individual footprints presents many more evolutionarily constrained (approximately threefold to five- challenges. Here, we applied a de novo approach in which TFs were fold) than non-footprinted motifs (Extended Data Fig. 13a–c, top). assigned to footprints post hoc via overlap with their cognate rec- For each TF, we found that footprinted motifs had lower aggregate ognition sequences. A complicating factor is that many functionally nucleotide diversity than non-footprinted elements, yet these differ- distinct TFs use similar recognition sequences, leading to potential ences were largely overshadowed by differences between evolutionarily ambiguous assignment of TFs to individual footprints. In addition, constrained and unconstrained motifs (Extended Data Fig. 13d–f, red co-expressed TFs with similar recognition sequences may alternatively and black boxes, respectively). These results indicate that while a core occupy the same element49. Because DNase I cleavage patterns encode set of binding sites appears to be under substantial constraint (on a par rich information about the topology and binding modes of individual with protein-coding regions), the vast majority of footprints appear to factors (Fig. 2, Extended Data Fig. 6), incorporating this information be under very weak selective constraint. Notably, for each of the three into future approaches should greatly increase the fidelity of TF–foot- aforementioned TFs, mutations that occurred within their footprinted print assignments. motifs preferentially modulated allelic imbalance in chromatin acces- Collectively, the consensus footprint index now provides a ready and sibility, linking natural variation to functional variation (Extended Data extensible nucleotide-precise reference for diverse analyses, particu- Fig. 11). Thus, hypermutation within genomic footprints appears to larly those involving genetic variation. The preferential localization of

Nature | Vol 583 | 30 July 2020 | 735 Article disease- and trait-associated variation within regulatory DNA has here- 24. Xu, H. E. et al. Crystal structure of the human Pax6 paired domain-DNA complex reveals specific roles for the linker region and carboxy-terminal subdomain in DNA binding. tofore been described in terms of entire regulatory regions demarcated Genes Dev. 13, 1263–1275 (1999). by DHSs or clusters thereof. Our results now show that genetic asso- 25. Svaren, J., Klebanow, E., Sealy, L. & Chalkley, R. Analysis of the competition between ciation and heritability signals from regulatory DNA overwhelmingly nucleosome formation and transcription factor binding. J. Biol. Chem. 269, 9335–9344 (1994). emanate from consensus TF footprints, which should greatly facilitate 26. Mirny, L. A. Nucleosome-mediated cooperativity between transcription factors. Proc. Natl the connection of disease- and trait-associated genetic variation with Acad. Sci. USA 107, 22534–22539 (2010). genome function. 27. Zaret, K. S. & Mango, S. E. Pioneer transcription factors, chromatin dynamics, and cell fate control. Curr. Opin. Genet. Dev. 37, 76–81 (2016). Perhaps most notably, we report that human genetic variation is 28. Wunderlich, Z. & Mirny, L. A. Different gene regulation strategies revealed by analysis of itself concentrated within TF footprints, owing apparently to a com- binding motifs. Trends Genet. 25, 434–440 (2009). bination of mutation propensity and the evolved sequence recogni- 29. Rastegar, S. et al. The words of the regulatory code are arranged in a variable manner in highly conserved enhancers. Dev. Biol. 318, 366–377 (2008). tion repertoire of human TFs, which favours hypermutable nucleotide 30. Lusk, R. W. & Eisen, M. B. Evolutionary mirages: selection on binding site composition combinations (for example, CpG dinucleotides). Given that human and creates the illusion of conserved grammars in Drosophila enhancers. PLoS Genet. 6, mouse TFs share the large majority of their recognition landscapes, the e1000829 (2010). 31. Dermitzakis, E. T. & Clark, A. G. Evolution of transcription factor binding sites in concentration of variation within TF occupancy sites is likely to have had mammalian gene regulatory regions: conservation and turnover. Mol. Biol. Evol. 19, 50 a considerable role in shaping mammalian regulation ; furthermore, 1114–1121 (2002). this finding suggests that genomes are heavily primed for regulatory 32. Vierstra, J. et al. Mouse regulatory DNA landscapes reveal global principles of cis-regulatory evolution. Science 346, 1007–1012 (2014). evolution, providing a possible underlying mechanism for facilitated 33. Weirauch, M. T. & Hughes, T. R. Conserved expression without conserved regulatory 51 phenotypic evolution . sequence: the more things change, the more they stay the same. Trends Genet. 26, 66–74 (2010). Online content 34. Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75–82 (2012). Any methods, additional references, Nature Research reporting sum- 35. Fu, W. et al. Analysis of 6,515 exomes reveals the recent origin of most human maries, source data, extended data, supplementary information, protein-coding variants. Nature 493, 216–220 (2013). 36. Vernot, B. et al. Personal and population genomics of human regulatory variation. acknowledgements, peer review information; details of author con- Genome Res. 22, 1689–1697 (2012). tributions and competing interests; and statements of data and code 37. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. availability are available at https://doi.org/10.1038/s41586-020-2528-x. Preprint at https://www.biorxiv.org/content/10.1101/563866v1 (2019). 38. Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A. & López-Bigas, N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267 (2016). 1. Maurano, M. T. et al. Systematic localization of common disease-associated variation in 39. Perera, D. et al. Differential DNA repair underlies mutation hotspots at active promoters in regulatory DNA. Science 337, 1190–1195 (2012). cancer genomes. Nature 532, 259–263 (2016). 2. Maurano, M. T. et al. Large-scale identification of sequence variants influencing human 40. Francioli, L. C. et al. Genome-wide patterns and properties of de novo mutations in transcription factor occupancy in vivo. Nat. Genet. 47, 1393–1401 (2015). humans. Nat. Genet. 47, 822–826 (2015). 3. Hesselberth, J. R. et al. Global mapping of protein-DNA interactions in vivo by digital 41. Aggarwala, V. & Voight, B. F. An expanded sequence context model broadly explains genomic footprinting. Nat. Methods 6, 283–289 (2009). variability in polymorphism levels across the human genome. Nat. Genet. 48, 349–355 4. Neph, S. et al. An expansive human regulatory lexicon encoded in transcription factor (2016). footprints. Nature 489, 83–90 (2012). 42. Carlson, J. et al. Extremely rare variants reveal patterns of germline mutation rate 5. Boyle, A. P. et al. High-resolution genome-wide in vivo footprinting of diverse heterogeneity in humans. Nat. Commun. 9, 3753 (2018). transcription factors in human cells. Genome Res. 21, 456–464 (2011). 43. He, X. et al. Methylated cytosines mutate to transcription factor binding sites that drive 6. Vierstra, J. & Stamatoyannopoulos, J. A. Genomic footprinting. Nat. Methods 13, 213–221 tetrapod evolution. Genome Biol. Evol. 7, 3155–3169 (2015). (2016). 44. Zemojtel, T., Kielbasa, S. M., Arndt, P. F., Chung, H.-R. & Vingron, M. Methylation and 7. Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, deamination of CpGs generate -binding sites on a genomic scale. Trends Gen. 25, 317–330 (2015). 63–66 (2009). 8. Galas, D. J. & Schmitz, A. DNAse footprinting: a simple method for the detection of 45. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association protein-DNA binding specificity. Nucleic Acids Res. 5, 3157–3170 (1978). studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 9. Dynan, W. S. & Tjian, R. The promoter-specific transcription factor Sp1 binds to upstream (2019). sequences in the SV40 early promoter. Cell 35, 79–87 (1983). 46. The 1000 Genomes Project Consortium. A global reference for human genetic variation. 10. Neph, S. et al. Circuitry and dynamics of human transcription factor regulatory networks. Nature 526, 68–74 (2015). Cell 150, 1274–1286 (2012). 47. Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from 11. Stergachis, A. B. et al. Conservation of trans-acting circuitry during mammalian polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015). regulatory evolution. Nature 515, 365–370 (2014). 48. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide 12. Lazarovici, A. et al. Probing DNA shape and methylation state on a genomic scale with association summary statistics. Nat. Genet. 47, 1228–1235 (2015). DNase I. Proc. Natl Acad. Sci. USA 110, 6376–6381 (2013). 49. Wang, J. et al. Sequence features and chromatin structure around the genomic regions 13. Meuleman, W. et al. Index and biological spectrum of accessible DNA elements in the bound by 119 human transcription factors. Genome Res. 22, 1798–1812 (2012). human genome. Nature https://doi.org/10.1038/s41586-020-2559-3 (2020). 50. Weirauch, M. T. et al. Determination and inference of eukaryotic transcription factor 14. Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018). sequence specificity. Cell 158, 1431–1443 (2014). 15. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 51. Gerhart, J. & Kirschner, M. The theory of facilitated variation. Proc. Natl Acad. Sci. USA 104 327–339 (2013). (Suppl. 1), 8582–8589 (2007). 16. Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46 (D1), D1284 (2018). Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in 17. Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription published maps and institutional affiliations. factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 46 (D1), D252–D259 (2018). 18. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human Open Access This article is licensed under a Creative Commons Attribution genome. Nature 489, 57–74 (2012). 4.0 International License, which permits use, sharing, adaptation, distribution 19. Panne, D., Maniatis, T. & Harrison, S. C. An atomic model of the interferon-β and reproduction in any medium or format, as long as you give appropriate enhanceosome. Cell 129, 1111–1123 (2007). credit to the original author(s) and the source, provide a link to the Creative Commons license, 20. Rohs, R. et al. The role of DNA shape in protein–DNA recognition. Nature 461, 1248–1253 and indicate if changes were made. The images or other third party material in this article are (2009). included in the article’s Creative Commons license, unless indicated otherwise in a credit line 21. Yin, M. et al. Molecular mechanism of directional CTCF recognition of a diverse range of to the material. If material is not included in the article’s Creative Commons license and your genomic sites. Cell Res. 27, 1365–1377 (2017). intended use is not permitted by statutory regulation or exceeds the permitted use, you will 22. Arnold, R., Burcin, M., Kaiser, B., Muller, M. & Renkawitz, R. DNA bending by the need to obtain permission directly from the copyright holder. To view a copy of this license, protein NeP1 is modulated by TR and RXR. Nucleic Acids Res. 24, 2640–2647 (1996). visit http://creativecommons.org/licenses/by/4.0/. 23. MacPherson, M. J. & Sadowski, P. D. The CTCF insulator protein forms an unusual DNA structure. BMC Mol. Biol. 11, 101 (2010). © The Author(s) 2020

736 | Nature | Vol 583 | 30 July 2020 Acknowledgements This work was supported by NIH grants U54HG007010 and Reporting summary 5UM1HG009444, and a charitable contribution to the Altius Institute for Biomedical Sciences by GlaxoSmithKline. We thank S. Sunyaev and V. Seplyarskiy for helpful Further information on research design is available in the Nature discussions and feedback with regards to the interpretation of human genetic Research Reporting Summary linked to this paper. variation data.

Author contributions J.V. and J.A.S. conceived the project and supervised experiments. J.V. designed and performed analysis. J.L. aided in the conceptual design of statistical methods. Data availability W.M. collaborated on consensus footprint indexing. J.H., K.L., D.B., M.D., D.D., F.N., E.H., E.R., A.R., J.N., A.J., M.F., M.B. and R.S. performed primary data generation and processing. R.K. All raw and processed DNase-seq data are available through the ENCODE coordinated data production. J.A.S. and J.V. wrote the paper. portal (http://www.encodeproject.org/) under accessions in Supplemen- tary Table 1. Footprints and their metadata are available at http://vierstra. Competing interests The authors declare no competing interests. org/resources/dgf or https://doi.org/10.5281/zenodo.3603548 (Zenodo). Additional information A track hub to visualize data in the UCSC Genome Browser is hosted Supplementary information is available for this paper at https://doi.org/10.1038/s41586-020- at https://resources.altius.org/~jvierstra/projects/footprinting.2020/ 2528-x. hub.txt. Protein structures for CTCF (Fig. 2a) and PAX6 (Extended Data Correspondence and requests for materials should be addressed to J.V. or J.A.S. Peer review information Nature thanks Hendrik Stunnenberg and the other, anonymous, Fig. 6a) were downloaded from Protein Data Bank (https://www.rcsb. reviewer(s) for their contribution to the peer review of this work. org) (PDB IDs: 5YEF, 5YEL, and 6PAX). Reprints and permissions information is available at http://www.nature.com/reprints.

Code availability Footprint detection software is available at http://www.github.com/ jvierstra/footprint-tools. All code for analyses herein is available upon request. Article

a c 15 expected cleavages 25 expected cleavages Chr. 19 0.075 0.09 Observed y y 0.06 Negative 0.050 48364000 48364300 48364600 binomial fit Densit

Densit 0.03 Poisson (λ=15) 0.025 TMEM143 100 nt 0.00 0.000 GENCODE annotations SYNGR4 0 40 80 120 0 40 80 120 Observed cleavage Observed cleavage DNase I 80 60 expected cleavages cleavage 40 0 0.045 90 Negative CD19+ cells binomial fit 1. Compute expected cleavage counts & build variance model 0.030 60 y=x

Densit y 0.015 30 80

cleavages Expected CD19+ cells 40 0.000 Mean observed 0 0 0 40 80 120 0 30 60 90 2. Per-nucleotide test for cleavage rate deviation Observed cleavage Expected cleavage (observed vs. expected)

b Per nucleotide 4 Cleavage depletion (obs/exp) 2 –log10 p 0.50 0.40 0.30 0.20 0.10 0 3.Combine adjacent per nucleotide p-values 4.5 -value)

p 8 Windowed 10 4 –log10 p 3.0 0 4. Determine FDR cutoff (per DHS) by resampling 1.5 expected cleavage rate distrubition 5% Footprints 1% Significance (–log 0.0 (FDR) 0.1% 0.01% 0 50 100 150 200 250 Expected cleavage d ef g 0.4 1.00 ) Mean correlation (r) 0.2 95% CI 6 6 y y 0.75 R 0.0 4 4 0.50 FP −0.2 Densit 2 Densit 2 0.25 Correlation (r −0.4 0 0 0.00 1 5 10 15 20 25 0.0 0.5 1.0 0.0 0.5 1.0 0 250 500 750 p-value offset (nt) Windowed p-value (observed) Windowed p-value (resampled null) Nucleotides sorted by p-value

Extended Data Fig. 1 | Statistical modelling of DNase I cleavage variation relative to the expected rate corresponding to a hexamer sequence model. and footprint detection within a single dataset. a, A negative binomial c, Example of footprint detection within promoters for TMEM143 and SYNGR4 model was fit from the distribution of observed cleavage counts for each in CD19+ B cells. Expected cleavages were generated by reassigning observed predicted cleavage rate. Shown are histograms of observed cleavage counts in cleavages according to a hexamer cleavage model (Supplementary Methods). CD19+ B cells at all genomic sites with 5, 25, or 60 expected cleavages. Red, the The significance of difference between the observed and expected cleavages negative binomial distribution fit to the observed data using maximum was evaluated per nucleotide using the negative binomial dispersion model. likelihood estimation (Supplementary Methods). Blue, Poisson distribution Individual P values are combined in 7-bp windows using Stouffer’s Z-score with λ set to the corresponding expected cleavage rate. Lower right panel, method. Per-nucleotide false discovery rates were computed by sampling from means of fitted negative binomial distributions vs means of observed cleavage the expected null distributions. d, Autocorrelation of P values sampled from rates. Dashed grey line indicates y = x for reference. b, Estimated power of the expected negative binomial distribution. e, f, Histogram of windowed P empirical cleavage dispersion model. Computed P values for different cleavage values for observed (e) and sampled (f) data. g, Observed and sampled P values rate effect sizes with respect to expected cleavage rates in CD19+ B cells. compared to empirically determine and calibrate false-positive rates (FPR). Coloured lines represent the modelled effect size (depletion of cleavages) a b

0.01 CD19+ 0.8 0.01 CD19+ 0.9 B cells B cells 0.05 0.05 0.6 0.8 0.4 0.7 Mean motif overlap Mean phyloP score 0.2 0 2 4 0 2 4 Nucleotides ranked Nucleotides ranked by p-value (x106) by p-value (x106)

d c Same cell type, Same cell type same invididual Different individual 80 120 r=0.92 r=0.93

) 100 60 80 40 60 40

(397.5M tags 20 (555.5M tags )

h.CD8+-DS17203 20

h.NAMALWA-DS22816 0 0 0 25 50 75 0 50 100 h.NAMALWA-DS22823 h.CD8+-DS17885 (421.6M tags) (469.1M tags) e

1.0

0.8

0.6

0.4 n=43 n=111 0.2

Correlation of p-values (r) 0.0 , , l

Same cell type Same cell type same invididua different individual

Extended Data Fig. 2 | Genomic footprints are reproducible, overlap evolutionarily constrained nucleotides, and are enriched for TF recognition sequences. a, Motif density is associated with footprint strength. Plotted is the overlap of motif recognition sequence matches (P < 0.0001) with nucleotides ranked by footprint P value from CD19+ B cells. b, As in a, but for per-nucleotide evolutionary conservation (phyloP). c, Scatter plot of per-nucleotide footprint P values for replicate experiments from the same cell line (NAMALWA Burkitt’s lymphoma cells). All individual nucleotides within FDR 1% footprints in either replicate were considered for correlation analysis. d, As in c, but for replicates of the same primary cell (CD8+ T cells) between two distinct individuals. e, Pearson’s correlation between replicates pairs grouped by whether they were derived from the same cell and individual (n = 43) or were the same primary cell or tissue from different individuals (n = 111). Boxes indicate median and inner quartile-range (IQR). Whiskers, 5th and 95th percentile. Article

a 0.29 0.43 6 Footprint detection method: 4 Independent (q<0.01)

Coun t Emperical Bayes (posterior>0.99) 2

0 0.0 0.2 0.4 0.6 Jaccard similarity between replicates (same cell type, same individual)

Promoter Coding Intron b 5’ UTR 3’ UTR Intergenic c 1.0 Promoter 0.8 5’UTR 0.6 Intron 0.4 3’UTR

0.2 Intergenic in each annotation Coding Proportion of footprint s 0.0 ] ] ] ] −1.0 −0.5 0.0 0.5 1.0 Enrichment of footprints in annotation (0, 100] (100,(201, 201(302, 302](402, 402(503, 503](604, 604(705, 705(805, 805] 906] versus DHS (log2) DHS density (tags per 250 million sequenced)

Biosample contribution to d f consensus footprints

CD8+ T cells ) 6 d (h.CD8-DS17203) 0 600k 50% 4 footprints Observed 400k Projected 2 (logarithmic fit) 200k 52 samples

# footprints detecte 0 0 Mean # of consensus

0 200 400 600 800 footprints detected (x1 0 200 400 Tags sequenced (millions) # of samples

e g CD4+ T cells

d 800k (h.CD4-DS17881) 105 Observed Projected 600k (logarithmic fit)

400k 104 200k Mean # of novel footprints detecte d

# footprints detecte 0 0 200 400 600 800 0 200 400 Tags sequenced (millions) Number of samples

Extended Data Fig. 3 | Characteristics of consensus derived genomic subsampled sequencing depth. Red dashed line, linear model fit. e, Same as d footprints. a, Histogram of footprint discovery concordance (Jaccard for CD4+ T cells. f, g, Contribution of individual samples to the consensus similarity) between replicates (same cell type, same individual) using footprint index. Datasets were ordered randomly, and the collective number of footprints called either independently (FDR 1%) (light orange) or with the footprints was computed after the footprints present within each additional empirical Bayes approach (posterior probability >0.99). b, Genomic dataset was considered. f, Mean total number of consensus footprints distribution of consensus footprints stratified by DNase I signal of the detected vs. number of datasets included after 100 iterations of random encompassing DHS. c, Enrichment of footprint overlap in gene annotations vs. dataset orderings. Red dashed line shows a logarithmic curve fit (excluding the distribution of DHSs. d, Effect of sequencing depth on footprint detection first 50 samples). Grey dashed line indicates number of datasets that in CD8+ T cells. Sequencing tags were randomly subsampled (90% down to 10%) recapitulate 50% of consensus footprints. g, Mean number of novel footprints from the complete library. Footprints were called independently on each detected after the sequential addition of each sample. subsampled set of tags. Plotted is the number of FDR 1% footprints detected vs acC112:KLF/SP d C183: (n=20) (n=14) KLF13_C2H2_1 MEF2A_HUMAN.H11MO.0.A 2,179 motif models KLF13_MA0657.1 MEF2C_HUMAN.H11MO.0.A KLF14_C2H2_1 MEF2A_MOUSE.H11MO.0.A KLF14_MA0740.1 MEF2C_MOUSE.H11MO.0.A Compute SP4_C2H2_1 MEF2C_MA0497.1 pair-wise motif similarity SP4_MA0685.1 MEF2D_MOUSE.H11MO.0.A Klf12.mouse_C2H2_1 MEF2D_HUMAN.H11MO.0.A Klf12_MA0742.1 MEF2B_HUMAN.H11MO.0.A KLF9_HUMAN.H11MO.0.C MEF2A_MA0052.3 MEF2A_MADS_1 Hierarchical clustering KLF16_C2H2_1 KLF16_MA0741.1 MEF2D_MA0773.1 SP8_C2H2_1 MEF2D_MADS_1 SP8_MA0747.1 MEF2B_MA0660.1 SP3_C2H2_1 MEF2B_MADS_1 Align clustered motifs SP3_MA0746.1 KLF9_MA1107.1 SP1_C2H2_1 f C62:E-box/CATATG Klf1_MA0493.1 (n=22) 286 non-redundant KLF4_HUMAN.H11MO.0.A Atoh1.mouse_bHLH_1 archetype motifs KLF4_MA0039.3 Atoh1_MA0461.2 NEUROG2_bHLH_1 OLIG2_bHLH_1 Archetypal motif OLIG3_MA0827.1 b boundaries OLIG3_bHLH_1 OLIG2_MA0678.1 e C111:EGR OLIG2_bHLH_2 (n=13) BHLHE22_MA0818.1 BHLHE23_bHLH_1 EGR1_C2H2_1 OLIG1_MA0826.1 EGR2_C2H2_2 OLIG1_bHLH_1 C62:E-box/CATATG EGR3_C2H2_1 BHLHA15_bHLH_1 EGR3_MA0732.1 NEUROD2_MA0668.1 Egr3.mouse_C2H2_1 BHLHE22_bHLH_1 C111:EGR EGR1_C2H2_2

2,179 motif models BHLHE23_MA0817.1 EGR1_MA0162.3 NEUROD2_bHLH_1 C112:KLF/SP EGR4_C2H2_1 NEUROG2_MA0669.1 C183:MEF2 EGR4_MA0733.1 NEUROG2_bHLH_2 EGR2_MA0472.2 Bhlha15_MA0607.1 EGR4_C2H2_2 2,179 motif models Neurog1_MA0623.1 EGR2_C2H2_1 Twist2_MA0633.1 Egr1.mouse_C2H2_1

Extended Data Fig. 4 | Clustering motifs by similarity. a, Outline of motif dendrogram was cut at height 0.7 to create non-redundant archetypal clusters clustering approach. Motif models (n = 2,174) from ref. 11, JASPAR12 (2018), and of motifs. c–f, Exemplar clusters of similar TF recognition sequences HOCOMOCO13 were clustered by motif similarity. b, Hierarchically clustered corresponding to KLF/SP (C2H2 family), EGR (C2H2 family), MEF2 (MADS) and heat map of the pairwise similarity scores between motifs. The cluster E-box/CATATG (bHLH). Article

a b c d e CTCF CTCF CTCF CTCF CTCF (CD20+ B cells) (n=33 datasets) (CD20+ B cells) (CD20+ B cells) (n=33 datasets) 1.0 20 1.0 1.00 2.36-fold p>0.99 300 0.8 0.5 0.75 83% 15 0.6 0.0 200 0.50

Footprint AUPR 0.4 10 Precision −0.5

(AUC=0.83) Motif score 0.25 100

Random ChIP-seq signal 0.2 −1.0

(AUC=0.55) to dataset median (log2)

5 Mean ChIP signal relative 0.00 60% 0.0 −1.5 0.00 0.25 0.50 0.75 1.00 d d d ChIP + + +––– Sensitivity Footprint +/– ++– +/– – Footprinte Footprinte Motif classification Not footprinted Not footprinte

fgh i ATF3 (K562) ATF3 (K562) GATA1 (K562) GATA1 (K562) 1.00 1.00 2500 7.12-fold 4.64-fold Footprint (AUC=0.41) 3000 Footprint (AUC=0.45) Random (AUC=0.12) Random (AUC=0.28) 2000 0.75 0.75

2000 1500 0.50 0.50 1000 Precision Precision 0.25 1000 0.25 ChIP-seq signal ChIP-seq signal 500

0.00 0 0.00 0 0.00 0.25 0.50 0.75 1.00 d 0.00 0.25 0.50 0.75 1.00 Sensitivity Sensitivity

Footprinte Footprinted Not footprinted Not footprinted

jkGABPA (K562) GABPA (K562) l NFE2 (K562) m NFE2 (K562) 1.00 1.00 1250 Footprint (AUC=0.49) 7.08-fold Footprint (AUC=0.36) 4.24-fold Random (AUC=0.2) 3000 Random (AUC=0.15) 0.75 0.75 1000

2000 750 0.50 0.50 Precision Precision 500 0.25 1000 0.25 ChIP-seq signal ChIP-seq signal 250

0.00 0.00 0 0 0.00 0.25 0.50 0.75 1.00 d 0.00 0.25 0.50 0.75 1.00 d d Sensitivity Sensitivity

Footprinted Footprinte Not footprinte Not footprinte

Extended Data Fig. 5 | Classification of ChIP–seq data by genomic intensity at peaks overlapping a footptrinted CTCF motif vs. non-footprinted footprinting. a, Precision-recall (PR) curve for predictions of CTCF motif motifs in CD20+ B cells. e, Relative ChIP-seq signal at footprinted and occupancy (that is, overlap CTCF ChIP–seq peak) based on footprint posterior non-footprinted CTCF peaks containing a motif across 21 cell types (n = 32 total probabilities in CD20+ B cells. Black dot indicates precision and recall at datasets). f–m, PR curves and relative ChIP–seq intensities for ATF3 (f, g), posterior footprint probability threshold of >0.99. Blue, PR curve computed GATA1 (h, i), GABPA (j, k) and NFE2 (l, m) in K562 cells. For all TFs analysed, only after shuffling ChIP–seq peak labels. b, Area under precision-recall curve motifs overlapping DHSs were considered. DHS, ChIP-seq and motif models are (AUPR) computed for 21 ENCODE cell types and/or replicates (n = 33 total described in Supplementary Table 3. Boxes indicate median and IQR. Whiskers, datasets). c, Distribution of MOODS scores stratified by motif overlap with a 5th and 95th percentiles. ChIP–seq peak and/or a genomic footprint in CD20+ B cells. d, ChIP–seq signal a Paired-box PAX6 domain + Homeodomain (Paired-box)

b DNase I cleavage c EBF1 (EBF) Cleavage Low High

–1.51.5 obs/exp Observed (log2) Expected

s 10 s 5

0 0 chr17:61,119,056 chr12:1,924,741 (fetal eye) 10 5 (bipolar neuron) 1,881 footprinted site 25,677 footprinted site 0 0 chr2:1,721,418 chr12:57,726,208 DNase I cleavage DNase I cleavage

1 20 1 0 0 5 (obs/exp ) (obs/exp ) 2 −1 2 −1 log log 0 0 Mean cleavage Mean cleavage –25 1 19 25 chr2:218,396,556 –25 1 14 25 chr2:70,685,738 Distance to motif (bp) Distance to motif (bp)

d SOX3 (HMG) e MYF6 (bHLH)

20 s s 20

0 0 chr11:68,213,847 chr13:33,229,980 (fetal tongue) 20 10 5,538 footprinted site 3,134 footprinted site

0 0 (differentiated nueronal stem cell) chr15:40,106,616 chr17:50,952,199 DNase I cleavage DNase I cleavage ) ) 1 1 0 25 0 10 (obs/exp (obs/exp 2 −1 2 −1 log log 0 0 Mean cleavage Mean cleavage –25 1 11 25 chr19:54,420,432 –25 1 10 25 chr2:206,379,540 Distance to motif (bp) Distance to motif (bp)

Extended Data Fig. 6 | Aggregate DNase I cleavage profiles for diverse TFs. randomly ordered heat map of the per-nucleotide relative DNase I cleavage a, Physical structure of the paired-box TF PAX6 bound to its cognate protection (observed/expected) for each footprinted motif instance recognition element (PDB: 6PAX)24. b–e, Per-nucleotide DNase I cleavage (posterior footprint probability >0.99). Below, aggregate DNase I protection patterns surrounding instances of motifs within genomic footprints for PAX6 averaged over all footprinted sites. Right, DNase I cleavage at individual motif (fetal eye) (b); EBF1 (differentiated bipolar neuron) (c); SOX3 (differentiated instances (blue, observed cleavage; yellow, expected cleavage). neuronal cell) (d); and MYF6 (fetal tongue) (e). For each, top left shows a Article

a b NEUROG1 (bHLH) (C62:E-box/CATATG) Motif enrichment Samples

0.03.0 Footprints –log10 q-value DHS

Pr(footprint)

0 1 HNF4A GATA Erythroblast fetal intestine =13,195) DNase I n ( density

NEUROG1 0 3 Brain (fetal)

MEIS1 MYF6 Eye (fetal) fetal lung Motifs Neurog1_MA0623.1 footprints

PAX6 Eye (fetal) fBrain-DS11872 fBrain-DS11872 h.brain-DS23545 h.brain-DS23661 h.brain-DS23545 h.brain-DS23661 fSpinal_cord-DS20351 fSpinal_cord-DS20351 h.Bipolar_neuron-DS26980 h.Bipolar_neuron-DS26986 h.Bipolar_neuron-DS26980 h.Bipolar_neuron-DS26986

c MYF6 (bHLH) d MEIS1 (MEIS) (C65:E-box/CAGCTG) (C69:MEIS)

Footprints DHS Footprints DHS n =6,604) ( n =11,492) ( 6 9 6 9 8 8 8 8 MEIS1_MEIS_1 footprints MYF6_bHLH_1 footprints fBrain-DS11872 fBrain-DS11872 fHeart-DS12531 fHeart-DS24731 fHeart-DS12531 fHeart-DS24731 h.f.Eye-DS24796 h.f.Eye-DS24728 h.f.Eye-DS23833 h.f.Eye-DS24796 h.f.Eye-DS24728 h.f.Eye-DS23833 h.f.Retina-DS2328 h.f.Retina-DS2333 h.f.Retina-DS2328 h.f.Retina-DS2333 h.f.Tongue-DS2488 h.f.Tongue-DS2480 h.f.Tongue-DS2488 h.f.Tongue-DS2480 fSpinal_cord-DS20351 fSpinal_cord-DS20351 4 6 3 4 6 3 6 6 8 8 8 8 h.f.heart.left.ventricle-DS4470 8 h.f.heart.left.ventricle-DS4470 8 fLung-DS1472 fLung-DS2472 fLung-DS2398 fLung-DS1472 fLung-DS2472 fLung-DS2398 HSMM-DS1442 HSMM-DS1442 LHCN-M2-DS20534 LHCN-M2-DS20548 LHCN-M2-DS20534 LHCN-M2-DS20548 h.f.Tongue-DS2488 h.f.Tongue-DS2480 h.f.Tongue-DS2488 h.f.Tongue-DS2480 h.SJCRH30-DS23123 h.SJCRH30-DS23123 fMuscle_leg-DS20239 fMuscle_leg-DS20239 fMuscle_trunk-DS20242 fMuscle_trunk-DS20242

Extended Data Fig. 7 | Cell-selective occupancy of TF recognition corresponding DNase I density (right) in each sample. Rows and columns sequences. a, Hierarchically clustered heat map of TF recognition sequence are ordered using K-means (k = 6) and hierarchical clustering, respectively. enrichment (–log10 q values; Supplementary Methods) overlapping consensus c, d, Same as b, for footprints overlapping an E-box/CATATG (c, Neurog1_ footprints. Rows correspond to motifs and columns correspond to individual MA0623.1 motif model) or MEIS (d, MEIS1_MEIS_1 motif model) recognition samples. b, Clustered heat maps of posterior probabilities for footprints (left) sequence. overlapping an E-box/CAGCTG (MYF6_bHLH_1 motif model) and their ab Chr. 15 Chr. 11

74,995,400 74,995,500 73,983,300 73,983,400 73,983,500

SCAMP5 UCP2

Consensus Consensus footprints footprints 40 CD4+ 50 CD3+ T cells T cells 0 0 30 A673 cells 5 B lymphoblasts (GM06990) Non-nervous 0 Non-nervous 0 16 10 Bipolar Bipolar neuron neuron 0 0 20 Fetal brain 10 Nervous

SK-N-DZ Nervous 0 0

26 nervous vs. 151 non-nervous samples with DHS 28 nervous samples vs. 189 non-nervous with DHS

Relative 1 Relative 1 Nervous 4 P Nervous 4 P DNase I DNase I 0 0 0 protection protection 0 Non-nervous 4 4

(log ) –log10 –log10 2 −1 (log2) −1 Non-nervous

ZFX (C2H2) REST (C2H2) TFAP2C (TFAP) NFYB ZIC1

Extended Data Fig. 8 | Comparative footprinting identifies cell-selective TF per nucleotide cleavage (log2 observed/expected) between nervous occupancy at nucleotide resolution. a, b, Comparative footprinting within system-derived (SCAMP5: n = 26; UCP2: n = 28 out of 31) and non-nervous the SCAMP5 (a) and UCP2 (b) promoters identifies footprints that are samples (SCAMP5: n = 151; UCP2: n = 189 out of 212) in which region is DNase I differentially occupied in nervous cell and tissue types. Top, DNase I cleavage hypersensitive (Supplementary Methods). The colour of each bar indicates the in two exemplar nervous and non-nervous cell types. Bottom, mean differential statistical significance (–log10 p) of the per-nucleotide differential test. Article

a d DHSs accessible in both Odds ratio nervous and non-nervous tissues (n=67,368) 0.00.5 1.01.5 2.02.5

brain (bulk) Differential footprint test motor neuron (each DHS, per-nucleotide) superior frontal gyrus prefrontal cortex spinal cord s 67,368 DHSs spinal cord (bulk) cingulate gyrus 2.5% 97.5% not alpha cell differentially differentially beta cell footprinted footprinted cerebral cortex (nominal p<0.01) dorsal striatum cerebellum

# differentially ARCHS4 cell/tissue tyoe sensory neuron occupied TFs (enrichment) dentate granule cell footprints per DHS EBF1 (4.5) retina NFI (3.9) fetal brain cortex pancreatic islet 2 ZIC (1.7) 1 3 REST (1.5) 4+

b REST (n=119) All tested motifs Differentially occupied (p<0.01) Occupancy in nervous cells and tissues NFIB (n=104) ZIC1 (n=255) EBF1 (n=144) Decreased Increased

y 2 2 2 2 Densit 0 0 0 0 −1 0 1 −1 0 1 −1 0 1 −1 0 1

Relative DNase I protection in nervous cells and tissues (mean log2)

c REST NFIBZIC1 EBF1

Nervous Cell-selectiv ) Occupancy

2 0.5 0.5 0.5 n=119 0.5 n=104n=255 n=144 0.0 0.0 0.0 0.0 e protection −0.5 (mean lo g −0.5 −0.5 −0.5 Non- nervous Relative DNase I

) 5 5 5 2.5 Mean p-value(–log

differential test 0 0.0 0 0 −5051121 0 −505115 0−50 14 50 −505114 0 Distance to recogntion sequence (bp)

Extended Data Fig. 9 | Differentially occupied nucleotides reflect b, Density histograms of relative footprint occupancy between aggregate DNase I cleavage profiles. a, Differential footprint testing within nervous-system derived and non-nervous-system derived samples for the TF thousands of accessible DHSs between nervous and non-nervous related recognition sequences of REST, NFIB, ZIC1 and EBF1. Grey indicates biosamples. The vast majority of tested DHSs encode a single TF binding distribution of all motif instances tested. Black indicates differentially topology. Top, percentage of the DHSs tested that containing one or more footprinted. c, Per-nucleotide aggregate plots of the mean relative DNase I differentially occupied element. Bottom left, distribution of differentially protection (top) and differential test P value (–log10, bottom) around footprinted elements per DHS. Bottom right, selected TF recognition differential occupied motifs. d, Cell- and tissue-specific expression of genes sequences significantly enriched in differentially occupied footprints nearby differentially occupied REST footprints. Enrichment was performed (binomial test P < 0.01). Indicated in parenthesis is the fold-enrichment vs using Enrichr27. Shown are cell and tissues with an adjusted Fisher exact test P expected (based on prevalence of footprinted motif in tested regions). value <0.01. a b 100 randomly 407,511 high read e sampled SNVs depth SNVs 1.0 6 Binomial Beta- σ =0.09 y 0.8 μ=0.5 binomial α=66.8 0.8 β=66.7 0.6 4 0.6 0.4 0.4 2

0.2 (-scaled) Sampled density

0.2 Observed densit 0 0.0

in single samples (mean±σ) 0.2 0.4 0.6 0.8 0.0 0.5 1.0 Prop. reads w/ reference allel Prop. reads w/ reference allele Prop. reads w/ reference allele in combined data

c d 5 4 0.20 3 1.73-fold 0.10

Density 2

1 imbalanced Inside footprint Prop. variants Outside footprint 0 0.00 0.00 0.25 0.50 0.75 1.00 ) Prop. reads w/ reference allele [25, 50[50,) 75) [75, 100 [100, [125,125) [150,150) [175,175) 200) Tested SNVs (n=1.65 million) Read depth over SNV Imbalanced SNVs (n=117,626)

Extended Data Fig. 10 | Detection of chromatin-altering variants. high-confidence SNVs assuming a binomial distribution (P = 0.5) or a a, Scatter plot of allelic ratios at 100 randomly selected high-confidence SNVs beta-binomial distribution. Grey indicates the observed allelic ratios at the (Supplementary Methods) computed after aggregating reads from different same variants. c, Density histogram of allelic ratios for all tested SNVs (grey samples (x-axis) against the distribution of allelic ratios at the same SNVs in line) and significantly imbalanced SNVs (blue line). d, Proportion of SNVs each sample (y-axis; mean ± s.d.). The average s.d. indicated in the top left imbalanced with respect to read depth for variants within (blue) or outside corner was used to tune the parameters of a beta-binomial distribution. (orange) consensus footprints (posterior probability >0.99). b, Simulation of allelic ratios from the observed total read depth at Article

a CREB1_MA0018.3 (C49:CREB/ATF)

Nucleotide 500 A SNVs C 0 G e T Representative Imbalanced 25 Footprinted Motif: cluster SNVs Not footprinted motif model 0 Footprinted CCAAT/CEBP % imbalanced 10 Not footprinted AP1/1 CREB/ATF/1 0 CREB/ATF/2 -20 1 12 20 NFI/1 Distance relative to recognition sequence CREB/ATF/3 Ebox/CAGATGG b NFIX_NFI_1 CTCF (C189:NFI) AP1/2 IRF/3 TEAD 1000 RFX/1 SNVs ETS/1 0 Ebox/CAGCTG SPI NFKB/1 Imbalanced 50 RUNX/1 SNVs MAF 0 Ebox/CACCTG 25 THAP1 % imbalanced ZNF423 FOX/7 0 NFI/2 -20 1 15 20 NR/16 Distance relative to recognition sequence EBF1 KLF/SP/2 TFAP2/1 c TEAD1_TEA_1 KLF/SP/1 (C166:TEAD) PAX/2 Ebox/CACGTG/1 MFZ1 500 CREB3/XBP1 SNVs TFAP2/2 SMAD 0 ZIC PAX/1 Imbalanced 25 NR/13 SNVs HIC/1 0 NR/12 NR/11 % imbalanced 10 HD/12 REL-halfsite 0 NR/17 -20 1 10 20 /2 Distance relative to recognition sequence GLIS IRF/2 NR/1 RUNX/2 d TFAP2C_TFAP_2 NRF1 (C264:TFAP2) EGR GLI INSM1 ZIC/2 SNVs1000 NR/3 REST/NRSF 0 PLAG1 Ebox/CACGTG/2 Imbalanced 25 HINFP1/1 SNVs ZFX 0 5 −1 0 1 2 % imbalanced 2 Enrichment of imbalanced 0 SNVs within motif (log ) -20 1 11 20 2 Distance relative to recognition sequence

Extended Data Fig. 11 | Enrichment of imbalanced variants within within footprinted (blue) and non-footprinted motifs (orange). Motifs are footprinted TF recognition sequences. a–d, Distribution of SNVs around the grouped into clusters, where each point represents an individual motif model recognition sequences for CREB/ATF, NFI, TEAD and TFAP2 TFs. For each TF, (Extended Data Fig. 4, Supplementary Table 2 and Supplementary Methods). shown are the total SNVs tested for imbalance (top), imbalanced variants Black bars indicate mean enrichment across all motifs in each cluster and (middle), and the proportion of variants imbalanced stratified overlap with a footprint overlap. Only motifs with significant (q < 0.05) enrichment of consensus footprint. e, log2 enrichment of imbalanced variants residing within imbalanced SNVs with a footprinted recognition sequence are shown. TF recognition sequences relative to non-imbalanced SNVs for both variants a Variants in CREB1 footprints Variants in TEAD1 footprints Variants in TFAP2C footprints Alt. stronger Ref. stronger Alt. stronger Ref. stronger Alt. stronger Ref. stronger 165 188 4 193 143 340 396 All variants All variants 4 All variants y (n=1,320) (n=1,434) (n=5,215) 2 2 Imbalanced Imbalanced 2 Imbalanced (n=336) (n=736)

Desnit (n=353) 0 0 0 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 Prop. cleavages on reference allele

b Variant effect on CREB1 Variant effect on TEAD1 Variant effect on TFAP2C recognition sequence recognition sequence recognition sequence 5.0 5 5 Alternate allele 2.5 weaker motif 0.0 0 0

(ref. vs alt.) −2.5 Alternate allele −5 stronger motif

Log-odds motif score −5 −5.0 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00

Prop. cleavages on reference allele

Extended Data Fig. 12 | Allelic imbalance parallels the predicted energetic line, all variants significantly imbalanced. b, Median log-odds score (reference effect of genetic variation. a, Histogram of allelic ratios for variants versus alternate allele) of all tested variants within footprinted motifs binned overlapping footprinted CREB1 (CREB/ATF), TEAD1 (TEAD) and TFAP2C by allelic ratio. Error bars show 5th and 95th percentiles of log-odds motif (TFAP2) recognition sequence. Grey line, all variants tested for imbalance. Blue scores in each bin. Article

abETS1 CREB/ATF c CTCF

0.75 0.75 0.75 Mean phyloP 0.25 0.25 0.25 ) π *** 4 20 20 12.5 10.0 (x1 0

Mean 10 10 7.5

-15 1 10 15 -15 112 15 -15 1 17 15 Distance to Distance to Distance to footprinted motif footprinted motif footprinted motif

def ETS1 motifs CREB/ATF motifs CTCF motifs

d 22.1 20 16.8 24.8 20 10.8 8.7 20 6.8 5.1 5.1 7.6 0 0 0 constrainte % evolutionarily DHS DHS DHS Winthin Winthin Winthin Outside Outside Outside Outside Outside Outside footprint footprint footprint footprint Footprint Footprint

) 15 4 0 10 10 10

Mean π (x1 5 5 5

Constrained (phyloP>1) Neutral (phyloP≤1) 4-fold degenerate coding sites

Extended Data Fig. 13 | Nucleotide-resolved patterns of genetic variation ±5 kb of DHS peaks but not inside (outside DHS), inside DHSs but not a footprint within TF binding sites. a–c, Per-nucleotide profiles of phyloP scores (top) (outside footprint) and those overlapping a consensus footprint. Top, and human nucleotide diversity (π) (bottom) within footprinted motifs for percentage of TF recognition sequence under elevated evolutionary constraint ETS1 (a), JDP2 (b), and CTCF (c) motifs. Black box in the motif consensus logo (mean phyloP score >1) in each group. Bottom, mean nucleotide diversity annotates CpG dinucleotides. Asterisk indicates the position of the CpG within the footprinted motifs additionally stratified by evolutionary dinucleotide in the profiles below. d–f, Ancient and recent constraints at the TF constraint. Boxes indicate median and IQR of enrichments from 1,000 recognition sequences with respect to proximity to DHSs and consensus bootstrap samples. Whiskers, 5th and 95th percentile. footprints. TF recognition sequences are grouped by those residing within        $%&&'()%012034567%&8(9@ j'RRk2'&(6&4401j%70eTE64F46%C400%)%5Q%(    

A4(65)146'1BC4567%&8(9@a4ClmWlnln 

   

D')%&6203E5FF4&C     

G465&'D'('4&H7I2(7'(6%2F)&%P'67'&')&%15H2B2Q26C%R67'I%&S6746I')5BQ2(7TU72(R%&F)&%P21'((6&5H65&'R%&H%0(2(6'0HC4016&40()4&'0HC  

20&')%&6203TV%&R5&67'&20R%&F462%0%0G465&'D'('4&H7)%Q2H2'(W(''%5&X126%&24QY%Q2H2'(40167'X126%&24QY%Q2HC$7'HSQ2(6T     E6462(62H( V%&4QQ(6462(62H4Q404QC('(WH%0R2&F674667'R%QQ%I20326'F(4&')&'('062067'R235&'Q'3'01W64BQ'Q'3'01WF4206'`6W%&a'67%1(('H62%0T

0b4 $%0R2&F'1

o U7''`4H6(4F)Q'(2c'8d9R%&'4H7'`)'&2F'064Q3&%5)bH%01262%0W32P'04(412(H&'6'05FB'&4015026%RF'4(5&'F'06

o e(646'F'06%0I7'67'&F'4(5&'F'06(I'&'64S'0R&%F12(620H6(4F)Q'(%&I7'67'&67'(4F'(4F)Q'I4(F'4(5&'1&')'46'1QC U7'(6462(62H4Q6'(68(95('1eGfI7'67'&67'C4&'%0'g%&6I%g(21'1 o hdipqrsttsdquvwuwqwxsyi€qvq€vwr‚ƒv€qwsivipqpqd„tv q€vwr‚ƒvqts‚vqrst†iv‡quvrxdƒˆyvwqƒdquxvq‰vuxs€wqwvruƒsd

o e1'(H&2)62%0%R4QQH%P4&246'(6'(6'1

o e1'(H&2)62%0%R40C4((5F)62%0(%&H%&&'H62%0(W(5H74(6'(6(%R0%&F4Q26C40141‘5(6F'06R%&F5Q62)Q'H%F)4&2(%0( eR5QQ1'(H&2)62%0%R67'(6462(62H4Q)4&4F'6'&(20HQ51203H'06&4Q6'01'0HC8'T3TF'40(9%&%67'&B4(2H'(62F46'(8'T3T&'3&'((2%0H%'RR2H2'069 o eGfP4&2462%08'T3T(64014&11'P2462%09%&4((%H246'1'(62F46'(%R50H'&64206C8'T3TH%0R21'0H'206'&P4Q(9

V%&05QQ7C)%67'(2(6'(6203W67'6'(6(6462(62H8'T3T’WuW‚9I267H%0R21'0H'206'&P4Q(W'RR'H6(2c'(W1'3&''(%RR&''1%F401“P4Q5'0%6'1 o ”ƒ•vq“q•„iyvwq„wqv‡„ruq•„iyvwq–xvdv•v‚qwyƒu„iv

o V%&—4C'(240404QC(2(W20R%&F462%0%067'H7%2H'%R)&2%&(401a4&S%PH7420a%06'$4&Q%('66203(

o V%&72'&4&H72H4Q401H%F)Q'`1'(230(W21'062R2H462%0%R67'4))&%)&246'Q'P'QR%&6'(6(401R5QQ&')%&6203%R%56H%F'(

o X(62F46'(%R'RR'H6(2c'(8'T3T$%7'0˜(€WY'4&(%0˜(‚9W2012H462037%I67'CI'&'H4QH5Q46'1

hy‚q–vqrsiivruƒsdqsdqwu„uƒwuƒrwq™s‚qƒsisdƒwuwqrsdu„ƒdwq„‚uƒrivwqsdqt„dpqs™quxvq†sƒduwq„s•v E%R6I4&'401H%1' Y%Q2HC20R%&F462%04B%564P42Q4B2Q26C%RH%F)56'&H%1' f464H%QQ'H62%0 E'i5'0H2031464I4(H%QQ'H6'15(20340pQQ5F204g2E'iqnnn401)&%H'(('15(203g$E(%R6I4&'PrTrTst8pQQ5F204Wp0HT9T

f464404QC(2( f464I4(404QCc'15(203R%QQ%I203(%R6I4&'6%%Q(@BI48PnTsTul9W—XfvYE8PlTqTrw9W)C67%08PlTs401rTt9WfGx4IS8PqTnTl9W)CavA8PlTrTl9W eeEY8PnTrTq9WBHR6%%Q(8PuTw9WPHR6%%Q(8PnTuTuq9WavvfE8PuTwTr9WEgAfE$8PuTnTn9T$5(6%F(%R6I4&'1'P'Q%)'1R%&672(F405(H&2)62(R&''QC 4P42Q4BQ'501'&40%)'0g(%5&H'apUQ2H'0('46766)@bbIIIT32675BTH%Fb‘P2'&(6&4bR%%6)&206g6%%Q(T V%&F405(H&2)6(562Q2c203H5(6%F4Q3%&267F(%&(%R6I4&'67464&'H'06&4Q6%67'&'('4&H7B560%6C'61'(H&2B'120)5BQ2(7'1Q26'&465&'W(%R6I4&'F5(6B'F41'4P42Q4BQ'6%'126%&(401 &'P2'I'&(Te'(6&%03QC'0H%5&43'H%1'1')%(262%0204H%FF5026C&')%(26%&C8'T3Tf26g5B9TE''67'G465&'D'('4&H73521'Q20'(R%&(5BF266203H%1'h(%R6I4&'R%&R5&67'&20R%&F462%0T f464 Y%Q2HC20R%&F462%04B%564P42Q4B2Q26C%R1464 eQQF405(H&2)6(F5(620HQ51'414644P42Q4B2Q26C(646'F'06U72((646'F'06(7%5Q1)&%P21'67'R%QQ%I20320R%&F462%0WI7'&'4))Q2H4BQ'@

geHH'((2%0H%1'(W502i5'21'062R2'&(W%&I'BQ20S(R%&)5BQ2HQC4P42Q4BQ'1464('6(   

geQ2(6%RR235&'(674674P'4((%H246'1&4I1464 ! "

ge1'(H&2)62%0%R40C&'(6&2H62%0(%014644P42Q4B2Q26C # " # eQQ&4I('i5'0H2031464401)&2F4&C)&%H'((203%R67'fG4('p14642(4P42Q4BQ'67&%53767'XG$vfX1464)%&64Q8766)@bbIIIT'0H%1')&%‘'H6T%&3b9I2674HH'((2%0( Q2(6'120X`6'01'1f464U4BQ'uTV%%6)&206(4014((%H246'1404QC(2(4&'4P42Q4BQ'46766)@bbP2'&(6&4T%&3b&'(%5&H'(b13R%&766)(@bb1%2T%&3bunTylmubc'0%1%Trtnryqm 8zXGvfv9Te6&4HS75B6%P2(54Q2c'14642067'x$E$f'0%F'—&%I('&2(7%(6'146766)(@bb&'(%5&H'(T4Q625(T%&3b{‘P2'&(6&4b)&%‘'H6(bR%%6)&206203Tlnlnb75BT6`6TY&%6'20 (6&5H65&'(R%&$U$V8V23Tl49401Ye|t8X`6'01'1f464V23Tt49I'&'1%I0Q%41'1R&%FY&%6'20f464—40S8766)(@bbIIIT&H(BT%&3b98Yf—pf@y}XVWy}XA401tYe|9T

   

V2'Q1g()'H2R2H&')%&6203    YQ'4('('Q'H667'%0'B'Q%I67462(67'B'(6R26R%&C%5&&'('4&H7TpRC%54&'0%6(5&'W&'4167'4))&%)&246'('H62%0(B'R%&'F4S203C%5&('Q'H62%0T    

o A2R'(H2'0H'( —'74P2%5&4Qh(%H24Q(H2'0H'( XH%Q%32H4QW'P%Q562%04&Ch'0P2&%0F'064Q(H2'0H'( 

V%&4&'R'&'0H'H%)C%R67'1%H5F'06I2674QQ('H62%0(W(''0465&'TH%Fb1%H5F'06(b0&g&')%&6203g(5FF4&CgRQ46T)1R        

A2R'(H2'0H'((651C1'(230    eQQ(6512'(F5(612(HQ%('%067'(')%206('P'0I7'067'12(HQ%(5&'2(0'3462P'T   E4F)Q'(2c' G%(4F)Q'(2c'(H4QH5Q462%0(I'&')'&R%&F'1gg4QQ1464I4(5('1I7'&'4))Q2H4BQ'  f464'`HQ5(2%0( G%1464I4('`HQ51'1R&%F404QC(2( 

D')Q2H462%0 G%64))Q2H4BQ'gg0%3&%5)gI2(''`)'&2F'064Q6'(6203I4()'&R%&F'1

D401%F2c462%0 G%64))Q2H4BQ'gg0%3&%5)gI2(''`)'&2F'064Q6'(6203I4()'&R%&F'1

—Q201203 G%64))Q2H4BQ'gg0%3&%5)gI2(''`)'&2F'064Q6'(6203I4()'&R%&F'1

D')%&6203R%&()'H2R2HF46'&24Q(W(C(6'F(401F'67%1( e'&'i52&'20R%&F462%0R&%F4567%&(4B%56(%F'6C)'(%RF46'&24Q(W'`)'&2F'064Q(C(6'F(401F'67%1(5('120F40C(6512'(Tg'&'W2012H46'I7'67'&'4H7F46'&24QW (C(6'F%&F'67%1Q2(6'12(&'Q'P4066%C%5&(651CTpRC%54&'0%6(5&'2R4Q2(626'F4))Q2'(6%C%5&&'('4&H7W&'4167'4))&%)&246'('H62%0B'R%&'('Q'H62034&'()%0('T a46'&24Q(h'`)'&2F'064Q(C(6'F( a'67%1( 0b4 p0P%QP'12067'(651C 0b4 p0P%QP'12067'(651C o e062B%12'( o $7pYg('i

o X5S4&C%62HH'QQQ20'( o VQ%IHC6%F'6&C

o Y4Q4'%06%Q%3C4014&H74'%Q%3C o aDpgB4('10'5&%2F43203

o e02F4Q(401%67'&%&3402(F(

o g5F40&'('4&H7)4&62H2)406(

o $Q202H4Q1464

o f54Q5('&'('4&H7%RH%0H'&0

X5S4&C%62HH'QQQ20'( Y%Q2HC20R%&F462%04B%56H'QQQ20'( $'QQQ20'(%5&H'8(9 $'QQQ20'(I'&')&%H5&'1R&%F4))&%)&246'H%FF'&H24Q(%5&H'(T7TXE$Q20'(5('1I'&'R&%FGpg4))&%P'1Q2(6401)&%P21'1BC Q4B%&46%&2'(I267'`)'&62('203&%I203WH74&4H6'&2c20340112RR'&'0624620367'('H'QQ6C)'(T8(''XG$vfXI'B(26'R%&1'642Q( 401)&%6%H%Q(9T

e567'062H462%0 $'QQ(Q20'(I'&'4567'062H46'1204HH%&140H'I267XG$vfX)%Q2H2'(T

aCH%)Q4(F4H%064F20462%0 aCH%)Q4(F46'(6203I4(0%6)'&R%&F'1T

$%FF%0QCF2(21'062R2'1Q20'( G%0' 8E''p$Ae$&'32(6'&9   

! " # " #

~