<<

Comprehensive, high-resolution binding energy landscapes reveal context dependencies of factor binding

Daniel D. Lea,1, Tyler C. Shimkoa,1, Arjun K. Adithamb,c, Allison M. Keysc,d, Scott A. Longwellb, Yaron Orensteine, and Polly M. Fordycea,b,c,f,2

aDepartment of , Stanford University, Stanford, CA 94305; bDepartment of Bioengineering, Stanford University, Stanford, CA 94305; cStanford ChEM-H (Chemistry, Engineering, and Medicine for Human Health), Stanford University, Stanford, CA 94305; dDepartment of Chemistry, Stanford University, Stanford, CA 94305; eDepartment of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel POB 653; and fChan Zuckerberg Biohub, San Francisco, CA 94158

Edited by David Baker, University of Washington, Seattle, WA, and approved February 23, 2018 (received for review September 22, 2017)

Transcription factors (TFs) are primary regulators of expres- However, even subtle changes in binding energies can have sion in cells, where they bind specific genomic target sites to dramatic effects on both occupancy and transcription (15–18). control transcription. Quantitative measurements of TF–DNA Sequences surprisingly distal from a known consensus motif can binding energies can improve the accuracy of predictions of TF affect affinities and levels of transcription (19–22), and genomic occupancy and downstream in vivo and shed variants in regulatory regions outside of known transcription light on how transcriptional networks are rewired throughout factor binding sites (TFBSs) may be subject to nonneutral evo- . Here, we present a -based TF binding assay lutionary pressures (23). Therefore, understanding the funda- and analysis pipeline (BET-seq, for Binding Energy Topography mental mechanisms that regulate transcription requires mea- by sequencing) capable of providing quantitative estimates of surement of binding energies at sufficient resolution to resolve binding energies for more than one million DNA sequences in even small effects. parallel at high energetic resolution. Using this platform, we mea- Despite the utility of comprehensive binding energy measure- sured the binding energies associated with all possible combina- ments, existing methods often lack the energetic resolution and tions of 10 flanking the known consensus DNA tar- scale required to yield such datasets. Currently, three dominant get interacting with two model yeast TFs, and Cbf1. A large technologies are used to query DNA specificities: methods based fraction of these flanking change overall binding ener- on systematic evolution of by exponential enrichment gies by an amount equal to or greater than consensus site muta- (SELEX) (24–29), binding microarrays (PBMs) (30, 31), tions, suggesting that current definitions of TF binding sites may and mechanically induced trapping of molecular interactions be too restrictive. By systematically comparing estimates of bind- ing energies output by deep neural networks (NNs) and biophys- Significance ical models trained on these data, we establish that dinucleotide (DN) specificities are sufficient to explain essentially all variance in observed binding behavior, with Cbf1 binding exhibiting signifi- Transcription factors (TFs) are key that bind DNA tar- cantly more nonadditivity than Pho4. NN-derived binding ener- gets to coordinate gene expression in cells. Understanding gies agree with orthogonal biochemical measurements and reveal how TFs recognize their DNA targets is essential for predicting that dynamically occupied sites in vivo are both energetically and how variations in disrupt transcription to mutationally distant from the highest affinity sites. cause disease. Here, we develop a high-throughput assay and analysis pipeline capable of measuring binding energies for over one million sequences with high resolution and apply it protein–DNA binding | binding | transcription factor toward understanding how nucleotides flanking DNA targets specificity | microfluidics | transcriptional regulation affect binding energies for two model yeast TFs. Through sys- tematic comparisons between models trained on these data, ene expression is extensively regulated by transcription fac- we establish that considering dinucleotide (DN) interactions is Gtors (TFs) that bind genomic sequences to activate or sufficient to accurately predict binding and further show that repress transcription of target (1). The strength of binding sites used by TFs in vivo are both energetically and - between a TF and a given DNA sequence at equilibrium depends ally distant from the highest affinity sequence. on the change in Gibbs free energy (∆G) of the interaction (2– 5). Thermodynamic models that explicitly incorporate quantita- Author contributions: D.D.L., T.C.S., A.K.A., and P.M.F. designed research; D.D.L., T.C.S., tive binding energies more accurately predict occupancies, rates A.K.A., and A.M.K. performed research; D.D.L., T.C.S., A.K.A., S.A.L., and Y.O. contributed new reagents/analytic tools; D.D.L. and T.C.S. analyzed data; and D.D.L., T.C.S., and P.M.F. of transcription, and levels of gene expression in vivo (4, 6–11). wrote the paper. In addition, binding energy measurements for TF–DNA interac- The authors declare no conflict of interest. tions can provide insights into the evolution of regulatory net- This article is a PNAS Direct Submission. works. Unlike coding sequence variants that manifest at the pro- This open access article is distributed under Creative Commons Attribution- tein level to influence fitness, noncoding TF target site variants NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND). affect by modulating the binding energies of these Data deposition: The data reported in this paper have been deposited in the Gene Ex- interactions to affect gene expression (12–14). Understanding pression Omnibus (GEO) database, https://www.ncbi.nlm.nih.gov/geo (accession no. how TFs identify their cognate DNA target sites in vivo and how GSE111936). All processed data have been uploaded to Figshare (https://figshare.com/ these interactions change during evolution therefore requires the articles/ /5728467) and Github (https://github.com/FordyceLab/BET-seq). ability to accurately estimate binding energies for a wide range of 1 D.D.L. and T.C.S. contributed equally to this work. sequences. 2 To whom correspondence should be addressed. Email: [email protected]. Most high-throughput efforts to develop accurate models This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. of TF binding specificity have focused on mutations within 1073/pnas.1715888115/-/DCSupplemental. known TF target sites that dramatically change binding energies. Published online March 27, 2018.

E3702–E3711 | PNAS | vol. 115 | no. 16 www.pnas.org/cgi/doi/10.1073/pnas.1715888115 Downloaded by guest on September 25, 2021 Downloaded by guest on September 25, 2021 rbtoswr hntando hs ag aaest yield to datasets large mod- these NN on Deep con- trained nonadditive Cbf1). order, then and higher were (Pho4 possible tributions all TFs incorporate motif yeast that consensus els model known two the surrounding for for landscapes mutations and energy million sizes binding >1 measure different to quantitative assay and this of deployed comprehensive then libraries We ranges. for energy expected energies resolve to binding we required resolution, depth accurate sequencing energetic ener- the the on for mimic small guidelines noise establish to sampling relatively simulations stochastic for of Carlo effects Monte even Using parallel, differences. getic in sequences million ee al. et Le and relative and yields (∆∆G that energies pipeline binding analysis absolute integrated and an sequencing), assay by sequencing Topography Energy BET- (Binding developed seq we measurements, thermodynamic throughput landscapes. energy binding of yield per-sequence topography to local a potential revealing at the level, binding has of therefore estimates data high-resolution binding accurate, models of NN sets Training large (56). on sequences genomic func- noncoding the of predicting tion large including net- from applications, patterns many neural across complex Deep relation- datasets learn can degenerate. the models somewhat (NN)-based rendering two work sequence, the between primary ship by variables biophysical determined these However, are heli- (51–55). struc- twist, roll) propeller and using width, twist, shape cal groove DNA minor local (e.g., developed parameters on tural Recently based dinu- (41–50). binding as including k-mers predict such longer models by features, or refined sequence (DNs) be order cleotides can higher low-affinity approximation of for This contributions particularly 40). predictions, (32, inaccurate sites which PWM-based to nucleotides, However, for lead between 37–39). nonadditivity can specificity capture (9, to binding TFs fail studied models of of approximations majority the useful provide mononucleotide interpreted and These and visualized, implemented, (36). easily are energies models indepen- binding (MN) each and overall which additively to in contributes dently (PWM), position matrix each weight at position a as specificities resolution simul- and to landscapes. scale complete challenging the derive to at remains necessary energies it sam- binding but can measure spaces, assays taneously binding sequence the TF–DNA vast hardware these by ple sequencing together, limited customized Taken been (35). extensively has technology for this requirement of adoption High- measurements; binding however, 34). concentration-dependent perform (29, to abil- the energies ity with sequencing binding parallel profiling massively resolving couples interaction (HiTS-FLIP) of mas- sequencing–fluorescent coverage cost using throughput space the sequence by at increase limitations but to sequencing throughput MITOMI-based parallel addressed sively of sev- to iterations have limited Recent is assays sequences. throughput but hundred affinities, eral enables absolute yield valves, to microfluidic ing bind- by concentration-dependent of trapping measurements high-resolution mechanical of on number the based to limiting probed sequence, con- sequences each usually dis- distinct for arrays that replicates PBM steps many addition, wash tain In require equilibrium. PBMs binding because rupt unclear - binding fluo- precise is and the intensities a energies fluorescence range, using measured dynamic microarrays between broad DNA tionship a PBMs to with Although TFs readout weakly affinities. of rometric detect binding their to the measure quantify fail or but sequences populations from bound substrates random depth affinity highest large low the extremely identify to by optimized are meth- followed these ods Thus, material. cycles TF-bound enriched the repeated of amplification sequencing require methods and SELEX-based enrichment 33). (32, (MITOMI) oadestene o ehooiscpbeo high- of capable technologies for need the address To TF represent models used widely and popular most The 000 h IOIplatform, MITOMI The ∼50,000. ∆ ,rsetvl)for respectively) G, >1 qiiru attoigo eune nobudadunbound the and by bound determined into be such states: sequences can system, of interaction given partitioning two-state a equilibrium a of affinity considered the be that TF– can resolution. interactions high at DNA to energies ability binding the maintaining measure while TF– quantitatively probed which be at can scale interactions the extends DNA significantly that assay Landscapes. an develop Affinity Binding to (HTS) Comprehensive Sequencing Derive High-Throughput Using Approach Microfluidic A Results predictive improving TFs, species. of across affinities variety TF–DNA wide of pipeline a models to analysis extended and be assay may this model Furthermore, bio- understand and behaviors. to logical energies required specificity binding of measure on determinants high- to our effects of approach utility near-neutral the throughput the demonstrate data for from These energies. distant of selection binding evidence mutationally molecular providing are sequence, evolutionary Cbf1 target flanking or occupied optimal Pho4 energetically dynamically both most nonaddi- for more Strikingly, loci significantly Pho4. exhibiting than Cbf1 tivity bind- with observed reveal all behavior, nearly models ing explain pre- motivated preferences biophysically and specificity of DN occu- predictions that series TF NN a of between from predictions Comparisons dictions accurate vivo. limit in may pancy muta- are and TFBSs than of restrictive definitions greater current or too that sequences suggesting great of core, flanking as the in range energies of tions binding number a on large effects over surprisingly have energies accu- A quantitative, binding kcal/mol. 3 highly measured are predicting predictions rately NN measurements that affinity sequence. established biochemical each orthogonal for to energy Comparisons binding of estimates high-resolution eoi oiadrglt itnttre ee 1,6) To Pho4 63). affect (19, of nucleotides genes sets flanking target nonoverlapping how distinct largely probe six- regulate bind comprehensively the and they of loci 60–62), vivo variant in genomic 28, and CACGTG vitro 18, in same both (16, the motif (E-box) bind -box nucleotide Cbf1 and the Pho4 from eluted from TFs be then 1C). can (Fig. out HTS species using wash 59) quantified DNA and 33, to device (32, Bound interactions possible 1B). weak of it (Fig. loss without making sequester material then sequences, unbound valves DNA Mechanical are reached. TF-bound sequences is DNA equilibrium TFs of surface-immobilized until libraries with interact capture, to purifying allowed TF and washing, introduced After situ. before in transcription/ TFs vitro produced in TFs via (meGFP)-tagged cap- GFP protein enhanced device -based monomeric the ture for within (Fig. and surfaces need substrate -patterned 32–34) the DNA production. eliminating (29, of amounts DNA protein, equilibrium small expressed “trap” requires at device mechanically proteins This to 1A). TF times we with actuation with this, associated ms) valves address pneumatic underesti- (∼100 To incorporating systematic fast (58). device microfluidic a interactions a affinity to used weak leading of fraction, mation bound underrepresented the be may within complexes rates and step, fast particularly electrophoresis isolate with the chem- to during at equilibrium not (EMSAs) are ical complexes assays TF–DNA assays However, These shift material. 57). bound electromobility 40, (10, use species and generally each bound measure for reliably concentrations can input HTS indi- via of molecules counting DNA molecular vidual that established have groups Several . safis plcto fBTsq efcsdo w model two on focused we BET-seq, of application first a As ∆ h4adCf.Although Cbf1. and Pho4 cerevisiae , Saccharomyces G = −RT ln  [TF unbound [TF PNAS · DNA ][DNA | o.115 vol. bound unbound ] | esuh to sought We o 16 no. ]  | E3703

BIOPHYSICS AND COMPUTATIONAL A

B

C

Fig. 1. DNA library design and assay overview. (A) Schematic of flanking sequence library design indicating Illumina sequencing adapters (blue), unique molecular identifiers (UMIs, red), variable flanking regions (orange), and E-box consensus. (B) Photograph of MITOMI device and schematic showing device operation. (C) Schematic showing downstream sample analysis. Counting of individual molecules within bound and input fractions allows calculation of relative binding energies for each sequence (Left and Middle) to yield a comprehensive thermodynamic binding affinity landscape (Pho4 example). Each color-coded point (blue = –∆∆G; red = +∆∆G) represents a sequence, grouped by Hamming distance from the highest affinity sequence and alphabetically ordered in clockwise polar coordinates.

and Cbf1 binding affinities, we designed a library of 1,048,576 sequence) were sufficient to recover highly accurate ∆∆G values sequences in which the core E-box sequence was flanked by for every sequence. all possible random combinations of five nucleotides upstream Measuring accurate ∆∆Gs for a 1,048,576 species library rep- and downstream, embedded within a constant sequence shown resents a 1,000-fold increase in scale from these prior experi- to exhibit negligible binding (33) (Fig. 1A). Constant sites at ments. To understand more generally the determinants of ∆∆G the 50 and 30 ends allowed simultaneous PCR amplification and measurement accuracy, we generated a simulated test set of true incorporation of Illumina adapters. UMIs included within each relative binding energies and implemented Monte Carlo sim- sequence allowed accurate counting of library species even in the ulations to mimic stochastic sampling during HTS. For given presence of PCR bias (64). Each UMI barcode was segmented library sizes, sequencing depths, and binding energy ranges, we and interspersed along the library sequence to prevent formation again calculated the correlation coefficient between calculated of an additional CACGTG consensus site. After sequencing, - ∆∆G values and true values. As expected, accuracy improves ative binding affinities (∆∆Gs) were calculated for all sequences and library coverage expands with increasing sequencing depth by considering relative enrichment of individual DNA species, (Fig. 2B and SI Appendix, Figs. S1 and S2). As the range of thereby generating a comprehensive binding affinity landscape expected binding energies increases, accuracy improves but the (example shown in Fig. 1C). fraction of sequences observed from the input library decreases. Nearly all existing motif discovery libraries used in SELEX-seq, Assay Simulations Determine Sequencing Depth Requirements for MITOMI-seq, and SMiLE-seq experiments probe on the order Binding Affinity Measurements. Accurately estimating concentra- of 1018–1024 species with read depths of several thousand total tions of DNA in TF-bound and input samples via sequencing reads. These simulations establish that such sparse sequencing requires that measured read counts reflect true abundances. will sample only the highest affinity sequences, representing an However, read counts can be distorted by stochastic sampling infinitesimal fraction of the input library. error, particularly for low read count numbers (65, 66). To under- To delineate conditions under which unbound concentra- stand how stochastic sampling error depends on read depth, tions can be approximated by sequencing the input library, library size, and the expected range of binding energies across reducing assay cost, we modeled the distribution of “appar- library sequences, we considered a previously published exper- ent” per-sequence ∆∆Gs for a given population of sequences iment that quantified interactions between the Escherichia coli under competitive binding conditions in which we explicitly con- LacI and a library of 1,024 variants via sider effects of ligand depletion. These simulations consider deep sequencing (40). Each sequence was sampled with roughly the total number of library sequences, respective concentra- 3 10 reads per species, yielding ∆∆G measurements with neg- tions of the TF and the DNA library, and expected range and ligible sampling noise. To understand how read depth affects distribution of ∆∆G values and return predicted equilibrium recovery of accurate ∆∆G measurements, we down-sampled concentrations of each species within the bound and unbound these data to simulate lower sequencing depths of 102-106 reads, fractions (SI Appendix, Fig. S3). To make these simulations com- split evenly between bound and input fractions (ca. 0.05–5,000 putationally feasible, we modeled 100 species uniformly dis- reads per sequence). We then assessed the accuracy of recovered tributed across the ∆∆G range with a single high concentra- ∆∆G values at these lower sequencing depths by calculating tion “dummy” substrate to represent the majority of species. the squared Pearson’s correlation coefficient (r 2) between ∆∆G As the ∆∆G range and number of species increases, species values for each species calculated from down-sampled data and become depleted from the unbound fraction (SI Appendix, Fig. published values calculated from the full dataset. To minimize S3), causing ∆∆G values estimated from sequencing input to the effect of a few high accuracy values dominating the correla- systematically underestimate true ∆∆G values for high affinity tion statistic, each r 2 was normalized by the fraction of observed interactions (SI Appendix, Fig. S4). However, under the condi- species (Fig. 2A). For this 1,024-species library with binding ener- tions used here (30 nM and 1 µM concentrations for TF and gies that span ∼3 kcal/mol, ∼2 × 105 total reads (∼100 reads per DNA libraries, respectively, with 1,048,576 species and a range

E3704 | www.pnas.org/cgi/doi/10.1073/pnas.1715888115 Le et al. Downloaded by guest on September 25, 2021 Downloaded by guest on September 25, 2021 ee al. et Le (Pearson’s accuracy median Normalized (A) range. energy and size, 2. Fig. iuae T fteedsrbtosrva w eie with regimes two reveal distributions these of HTS (r simulated accuracies assay lated repli- 10 of for sequence per reads pre- conditions mean assay library of flank here. denotes sented function rectangle cyan a simulations; cate as sequenced (red) (rows) sizes species library various (C (r for depths. range different MN affinity to for binding of depth tion val- read true of and (r recovered function ues between coefficients a correlation Pearson’s individual as squared and values (40) (red) “true” coefficients model with data sampled where C B A 2

ewe eoee n revle bu)adfato fobserved of fraction and (blue) values true and recovered between ) Median ∆∆Gs normalized accuracy 2 Median Median , f 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 rbn h eainhpbtenasyacrc,ra et,library depth, read accuracy, assay between relationship the Probing Left stefato fosre pce)cmaigrslsfrdown- for results comparing species) observed of fraction the is n einfato fosre pce (Right species observed of fraction median and ) < 10 10 Pearson′s -1 Sequencing depth,readcounts 12345 ka/o) hsapoiaini utfid Calcu- justified. is approximation this 4kcal/mol), 2 Mean readspersequence 10 10 ΔΔG, kcal/mol einsurdPasnscreaincoefficient correlation Pearson’s squared Median ) 0 3 r 2 10 10 2 1 4 × 12345 fract.obs. rcino irr observed library of fraction 10 10 5 2 Median (B) (blue). values ∆∆G 10 10

6 3 lib. size = 10 = size lib. 10 = size lib. lib. size = 10 = size lib. 10 = size lib. lib. size= 10 size= lib.

6 4 3 2 5 niiulΔΔG individual additive model mononucleotide Sequencing depth Pearson's fract. obs. 10 10 10 10 10 10 safunc- a as ) 8 7 6 5 4 3 r 2 r for ) 2 × f , S2 4mlinlmtn ons,ti corre- this counts), limiting to million increases (∼24 lation depths read higher at (ca. predicted, depth As read of S6). Fig. ratio Appendix, measured per-sequence (SI measured the counts read from input sequence experiment, to bound each each for For calculated ). S1 were Table ranging depths Appendix , sequencing (SI at samples Cbf1 of three from and measure- replicate Pho4 four of acquire ments to above Binding Cbf1. described library and Absolute DNA Pho4 of for Estimates Affinities Comprehensive High-Throughput, specificity. Cbf1 and Pho4 bio- the for unprecedented of responsible dissection of mechanisms allowed physical and landscape resolution energy models energetic and binding linear scale integrated a This motivated models). yielded DN biophysically all scheme and and DN, neighbor model nearest made (MN, NN predictions the between sys- by correlations by biophysical binding comparing the TF–DNA tematically quantified observed among for and interactions responsible nonadditive parsed mechanisms order we high- higher Finally, obtain all capture nucleotides. that to of substrate data effects each sequencing the for these predictions on energetic model resolution millions NN collected a we trained per-sequence First, of 3A). estimates sequencing-based (Fig. of approach modeling measurement integrated and an applied the to we ability improve insights, the mechanistic To preserving gain while interpretability. predic- estimates of energetic increased our cost this of the accuracy However, at energies. comes of power tive range binding wide predicting a a accurately yielding over of complexity, capable order model higher high-resolution measurable all capture can 40). target (32, affinity affinity diversity entire lower sequence high among high the with predict variance sites explain PWMs from to fail while generated they However, binding, those 2A). to (Fig. dataset power with models predictive PWM MN similar additive of generation allowed repressor sequence this LacI accurate of published for power the Although the considered (40). illustrate dataset again To be we minimiz- noise. can approach, while sampling modeling specificity binding stochastic scenarios, consider- of ing those determinants when infer In or to libraries. used TFs DNA many large involving ing studies for prohibitive Improves Resolution. Measurements Individual Assay Noisy from Specificity Modeling of of depths errors read and high species, individual per yield reads 4–5 recov- as be ∼10 few can as of sequences of from species 95% ered Although per 2C). energy depths (Fig. reads) this read total with mean library at on sim- 1,048,576-member range impact detail a in largest examined sampling most we the ulations assays, the sequencing has that guide likely To given binding. estimate nucleotide initial first flanking reasonable proximal here, the a queried the library provide reflect larger in the not they for may binding mutations of observations spanning range energetic these affinities with full While in (32). motifs kcal/mol differences ∼1 E-box revealed nucleotide to flanking binding Cbf1 depletion. ligand of effects large and error with sampling libraries while with error, libraries ing Large S5): Fig. small Appendix, (SI accuracy decreased Ntando iloso os e-eunemeasurements per-sequence noisy of millions on trained NN A and Pho4 concentration-dependent of observations Previous .T ute mrv eouin etandaN regression NN a trained we resolution, improve further To ). 2 o5 ilo ed loae oete on rinput or bound either to allocated reads million 50 to ∼5 onsprseis( 10 species per counts agsaesbett eunigascae count- sequencing-associated to subject are ranges ∆∆G estimates, ∆∆G . kcal/mol. ∼0.2 – ilo iiigcut)so ocorrelation; no show counts) limiting million 6–8 esrmnswt cuaisof accuracies with measurements ∆∆G eyhg-et eunigmyb cost- be may sequencing high-depth Very r 2 IAppendix (SI 0.67 = ∼10 sbtentoeprmnsa low at experiments two between ∆∆Gs 2 8 eue h E-e sa and assay BET-seq the used We oa ed e F r eurdto required are TF) per reads total ed e eunewr required were sequence per reads 10 PNAS 1 to agsaesbetto subject are ranges ∆∆G 10 | 2 o.115 vol. fl ee ed per reads fewer -fold i.S7 Fig. , 10 −1 s et we Next, ∆∆Gs. | –10 o 16 no. 3 and (10 ∆∆Gs | ∼80% 3 Table E3705 –10 8

BIOPHYSICS AND COMPUTATIONAL BIOLOGY of ∆Gs from sparse datasets can underestimate the true affin- ABraw 5 data ity range due to systematic undersampling of bound reads for 4 neural low-affinity sequences. In addition, the NN is trained only on network 3 core relative binding affinities (∆∆Gs) and therefore cannot return 2 mono- all flank estimates of absolute energies (∆Gs). NN-derived ∆∆G esti-

nucleotides dinucleotides kcal/mol ΔΔG, 1 mates can be projected onto an absolute scale by calibrating 0 to a set of high-resolution biochemical measurements of ∆Gs nearest neighbor dinucleotides Pho4 Cbf1 with a linear scaling factor and offset. To generate a set of high- confidence ∆Gs, we measured concentration-dependent binding C Pho4 Cbf1 for surface-immobilized Pho4 and Cbf1 TFs interacting with all T A single-nucleotide variants of AGACA TCGAG, a medium affin- T ity reference flanking sequence (where the underscore indicates -0.25 GC C G C G G C the CACGTG core motif), via traditional fluorometric MITOMI G C C A C C T C G T C C (SI Appendix, Figs. S10 and S11). For each sequence, observed CT T GC A T T CAT CT GC C T T G 0.00 GA GA AT T G G GA G CT G G T binding was globally fit to a single-site binding model, yield- A A A A A G A A T C A ing both Kds and ∆Gs (SI Appendix, Table S3). All NN values A C T G 0.25 T A G were then scaled by fit parameters returned from a linear regres- Mean kcal/mol ΔΔG, A sion between NN predictions and experimental measurements T A for these sequences. Median Kd values for all flanking library -5 -4 -3 -2 -1 1 2 3 4 5 -5 -4 -3 -2 -1 1 2 3 4 5 sequences were 100 and 63 nM for Pho4 and Cbf1, respectively, Flank position in agreement with prior work (32). Strikingly, flanking sequence variation can modulate Kd values by over two orders of magni- 1.5 1.5 tude, ranging between 11–1,036 nM and 1–866 nM, respectively. In some cases, the magnitude of these effects exceeds that of 1.0 1.0 Bits Bits mutations within the CACGTG core consensus (Fig. 3B and SI ScerTF 0.5 0.5 Appendix, Fig. S12), demonstrating the importance of flanking 0.0 0.0 sequences to specificity. -1 1 -2 -1 Pho4 and Cbf1 Flanking Preferences Extend Far Beyond the Known D Pho4 Cbf1 . To understand the biophysical features that contribute to the predictive performance of the NN model, we 1 G, kcal/mol Count generated PWMs (71), which estimate the mean energetic con- 10000 tribution of each nucleotide at each position, from the full set of 0 1000 scaled NN-predicted ∆∆G values (Fig. 3C). While the assump- 100 -1 tion of additivity fails to explain all specificity, these models offer 10 a close approximation (11, 44) and PWMs are easily visualized 1 -2 and interpreted. These MN model results confirm that positions proximal to the E-box core motif exhibit the largest mean effect -2 -1 0 1 -2 -1 0 1 on binding, in agreement with PWMs generated by orthogonal Neural network model ΔΔ Mononucleotide model ΔΔG, kcal/mol techniques (67–69) (Fig. 3C). However, nucleotides up to four and five positions from the consensus contribute to specificity for Fig. 3. Modeling and interpretation of binding specificity based on MN Pho4 and Cbf1, respectively, significantly farther than previously features. (A) Data analysis diagram shows NN modeling from raw data fol- reported. lowed by model interpretation. (B) Magnitude of energetic changes (∆∆Gs) To quantitatively assess the degree to which MN features dic- for core (red) (32) and flanking (blue) mutations for Pho4 and Cbf1. (C) Pho4 and Cbf1 mean MN ∆∆G values as a function of flanking sequence position tate binding behavior, we determined the proportion of NN- (Top), compared with the ScerTF database (67–70) sequence logos (Bottom). derived per-sequence ∆∆G variance explained by a simple Gray boxes show position of core consensus CACGTG. (D) Density scatter PWM (Fig. 3D). If MN models capture all determinants of plots comparing NN model estimates and MN additive model predictions observed specificity, PWM predictions should explain all of the based on those estimates. variance in NN-derived ∆∆G values; conversely, discrepancies may indicate the presence of higher order interactions. PWMs explained a majority of the variance in NN predictions (r 2 = 0.92 model that predicted the measured ∆∆G for each flanking and r 2 = 0.70 for Pho4 and Cbf1, respectively) (SI Appendix, sequence. Accuracy of the NN against observed training data Table S4), consistent with the prevailing sentiment that PWMs and an unobserved validation dataset was recorded throughout provide good approximations of specificity (9, 11). Intriguingly, training; training was stopped once accuracy against the valida- PWMs explain a significantly smaller proportion of observed tion dataset failed to improve, protecting against overfitting to Cbf1 measurement variance, suggesting that Cbf1 recognition the training data (SI Appendix, Fig. S8). Predictions from NN may rely on higher order determinants of specificity. models trained on the two high-depth Pho4 replicates showed To evaluate BET-seq assay reproducibility, we generated excellent correlation (r 2 = 0.94), validating the ability to apply PWMs from each of the Pho4 and Cbf1 technical replicates such models to derive accurate, reproducible estimates of bind- (SI Appendix, Fig. S13). Linear model coefficients for MNs at ing energies. For all analyses moving forward, we therefore use each position were strongly correlated between replicates of per-sequence ∆∆G estimates output from the NN trained on a a TF (Pho4 r 2 = 0.95–0.97 and Cbf1 r 2 = 0.79–0.87) and composite dataset of all replicates (SI Appendix, Fig. S9). uncorrelated between TFs (SI Appendix, Table S5 and Fig. Absolute binding energies and dissociation constants (∆G S14); a meGFP negative control protein exhibited no sequence and Kd, respectively) allow direct comparison between differ- specificity (SI Appendix, Fig. S15). Using the fraction of unex- ent TFs and across experimental platforms and further enable plained variance(1 – r 2) as a precision metric, the expected quantitative predictions of TF occupancy in vivo under known error range in NN-derived mean MN ∆∆G values for Pho4 cellular conditions. However, sequencing-based measurements is 0.02–0.04 kcal/mol and that of Cbf1 is 0.09–0.16 kcal/mol,

E3706 | www.pnas.org/cgi/doi/10.1073/pnas.1715888115 Le et al. Downloaded by guest on September 25, 2021 Downloaded by guest on September 25, 2021 ar cos8pstos,adtealDsmdlicue 720 ( includes pairs model positional (4 DNs identities and nucleotide all 16) (16 of the combinations parameters [all and free parameters positions), near- 128 free the 8 another positions), across adds 10 model pairs across parame- DN position neighbor free per est all 40 nucleotides describe only (4 to using ters attempt measurements models observed increase MN 1,048,576 always While Epistatic. should power. that parameters explanatory More Confirms free Significantly additional Models incorporate Are DN Interactions into Cbf1 Constraints obser- Weight this Incorporating for this basis and reveals structural residues potential structure His5 (62). a and vation crystal providing synergis- Arg2 DN, Pho4 strong the GG between shows the Pho4, contacts CACGTG Interestingly, direct for of effects. smaller downstream tic significantly DN the overall GG is (and the a nonadditivity motif Although of the respectively. of magnitude nonadditivity, positive upstream 72). large negative DNs (62, exhibit and Cbf1 TG palindromes) downstream and and site corresponding Pho4 binding TT like of Cbf1, TFs expectation For the homodimeric near with for Inter- arrangements consistent symmetry them. motif, palindromic between core exhibited than the DNs rather intraflank combina- flanks and within among primarily differences N6/N7 ring or energetic (N4/N5 spanning absolute E-box tions immedi- the with DNs of downstream pairs), for or observed upstream is ately nonadditivity magnitude largest measured increase coop- erativity negative exhibit exhibit MNs that individual interactions conversely, of cooperativity; combination linear the considering on Nucleotide measured 4C). (Fig. with sequences interactions flanking across and within DNs residual PWM-predicted mean against the regression calculated we alone, widely of accuracy nonadditivity and PWMs. power find- ulti- which used predictive which These the TFs, to determines related respectively). structurally mately degree among Cbf1, even differential and binding remain- defines the Pho4 the highlight for of of ings 5% (improvements all energies and nearly binding binding ∼1% NN-derived all for in affect Considering accounts variance poten- ing to distortions. features the DN structural nucleotides with possible local interacting consistent through are physically energies models, for MN tial over power tory to corresponding with S4 ments, correla- Cbf1, Table increased and Appendix, Pho4 (SI showed both energies for tion binding NN-derived and combina- DN all (46). pairs from nonadjacent contributions including com- tions, considers a that and DNs model adjacent plex from contributions to only models considers DN that two fit scaled S4 we NN-derived Table interactions, the order Appendix , higher (SI for (41) probe To noise experimental represent sim- could ply or specificity governing interactions nonadditive (∼8% order values PWM-predicted and and values NN-derived between Non- Significant Exhibit Cbf1. for Nucleotides additivity Flanking that Reveal Models DN here. presented data and derived assay models the specificity from binding of the highlighting ee al. et Le least used we that fashion, features unbiased little of an contribute set in specificity minimal they sequence the meaning define identify 4C), To (Fig. power. zero explanatory near are ficients ovsaieaditrrtbnigeeg otiuin fDNs of contributions energy binding interpret and visualize To model-predicted DN neighbor nearest between Comparisons 0 o h4adCf,rsetvl)cudidct higher indicate could respectively) Cbf1, and Pho4 for ∼30% . clmladeittcitrcin occur- interactions epistatic and kcal/mol ∼0.5 h eann nxlie aineobserved variance unexplained remaining The aus ers egbrmodel neighbor nearest a values: ∆∆G n i.3 Fig. and %ad2%icessi explana- in increases 24% and ∼5% 10 2 slwrta xetdbased expected than lower ∆∆Gs  smr hnepce.The expected. than more ∆∆Gs 5] nms ae,D coef- DN cases, most In 45)]. = ausfralpossible all for values ∆∆G A r 2 and auso .8ad0.94 and 0.98 of values rmtelinear the from ∆∆G .Teeimprove- These B). oesthat Models 2 = ). NNNNN b1 ers egbrpisehbtdtelretcoefficient largest the S17). exhibited Fig. and Appendix , Pho4 pairs (SI both magnitudes neighbor for features nearest DN Cbf1, selected Among Figs. Appendix, specificity. (SI ing larger feature 3-fold MN to significant most S17 up the are of coefficients that than model DN these Strikingly, (NNNAT per- motif: of core pairs and [two the regime magnitudes spanning penalization DNs coefficient the palindromic of initial most throughout large sisted exhibited DNs coefficient differential on based distinguish rates. minimization to features stringencies regres- sequence penalization The of S16 ). important range Fig. to a Appendix, explores respect (SI sion with error variables squared explanatory in most reduction to the leading only penalized, are constraints of model inclusion weight the in with coefficients models Nonzero (73). linear regression parsimonious (LASSO) develop operator to selection and shrinkage absolute Cbf1. for Cbf1 removed are and effects Pho4 MN (C in when models. predic- contributions variance DN energetic and model of residual MN Fraction by additive explained (B) neighbor) estimates model estimates. (nearest NN those DN on vs. based estimates tions model NN of 4. Fig. C B A rmteslce b1faue,fu ers neighbor nearest four features, Cbf1 selected the From and est cte plots scatter Density (A) features. DN using interpretation model NN S18 TN)ad(NNNGT and ATNNN) ,hglgtn h motneo N oCf bind- Cbf1 to DNs of importance the highlighting ), PNAS NN,NNNNN NNNNN, | o.115 vol. eta fmean of Heatmap ) | o 16 no. ACNNN)]. NNNNN, | E3707

BIOPHYSICS AND COMPUTATIONAL BIOLOGY Orthogonal in Vitro Biochemical Measurements Confirm Results A Obtained Via HTS. To confirm that NN model predictions pro- vide accurate per-sequence estimates of true binding energies, we quantitatively compared titration-based ∆∆G values with unprocessed measurements and NN predictions. Using tradi- tional fluorometric MITOMI, we determined ∆∆Gs for Pho4 and Cbf1 binding to single-site variants of the ACAGA TCGAG flanking sequence (SI Appendix, Table S3 and Figs. S10 and S11). In addition, we compared NN predictions with previ- ously reported ∆∆G measurements of CACGTG flanking site mutations (18, 32). Consistent with Monte Carlo simulations, ∆∆G values calculated directly from raw sequencing data showed essentially no correlation to direct measurements, with r 2 values ranging between 0.07–0.16 and 0–0.24 for Pho4 and Cbf1, respectively (SI Appendix, Fig. S19). NN-predicted values B showed remarkable agreement, with r 2 values ranging between 0.76–0.94 and 0.61–0.69 for Pho4 and Cbf1. In all cases, predic- tions agreed with observations within ∼1 kcal/mol. These results establish that the Pho4 and Cbf1 NN models presented here yield accurate measurements of binding energies for >1 million TF– DNA interactions with similar resolution to “gold-standard” bio- chemical measurements.

High-Resolution in Vitro Affinity Measurements Can Be Used to Iden- tify Biophysical Mechanisms Underlying in Vivo Behavior. The role of transcriptional activators in vivo is not simply to bind DNA but to bind specific genomic loci and regulate transcription of down- stream target genes. The high resolution of these comprehensive binding energy measurements makes it possible to quantitatively estimate the degree to which measured binding affinities explain differences in measured TF occupancies, rates of downstream transcription, and ultimate levels of induction. Fig. 5. Pho4 and Cbf1 binding energies and in vivo activity. (A) Pho4 and First, we compared NN-modeled ∆G values with measured Cbf1 affinities (Kd, in nM) compared with in vivo ChIP-seq enrichment (63). rates of transcription and fold change induction for engineered (B) Functional-energetic landscapes of ChIP enrichment at dynamically reg- promoters containing CACGTG Pho4 consensus sites with dif- ulated loci, relative to the measured highest affinity sequence in Hamming distance space. ferent flanking sequences driving the expression of fluores- cent reporter genes (17, 18). As reported previously, rates of transcription and induction scaled with measured ∆G values (SI Appendix, Fig. S20). Next, we compared NN-derived bind- experiments (63) onto these binding affinity landscapes to yield ing energies with reported levels of TF occupancy in vivo at a composite functional-energetic landscape (Fig. 5B). For both CACGTG consensus sites in the S. cerevisiae for both Pho4 and Cbf1, the majority of enriched genomic loci are greater Pho4 and Cbf1 (63). Large magnitude TF enrichment induced than four mutational steps away from the global minimum, cor- by phosphate starvation was observed at loci with measured Kd responding to mean increases in binding energy of approximately values of around 100 nM or lower (Fig. 5A). While TF enrich- 0.8 and 1.5 kcal/mol, respectively (SI Appendix, Fig. S23). These ment roughly correlated with binding energy, very high affin- quantitative comparisons between measured affinities and in vivo ity sequences showed strikingly low enrichment. The observed occupancies establish that even relatively small differences in nonlinearities may indicate the degree to which other regulatory ∆∆G are associated with differential TF enrichment. mechanisms, such as cooperation and competition among TFs or changes in DNA accessibility due to positioning, Discussion contribute to reported TF enrichment (63). Alternatively, these TFs play a central role in regulating gene expression through- nonlinearities may reveal the need for higher resolution in vivo out development and allowing to adapt to changing measurements to test the degree to which binding energies alone environmental conditions. The ability to quantitatively predict dictate occupancies. levels of TF occupancy in vivo from DNA sequence would there- Previous analyses of TF binding energies used landscape visu- fore be transformative for our understanding of cellular func- alization strategies to identify energy-dependent patterns in the tion. TF binding at a given locus depends on multiple factors, data (74). To better understand the relationship between bind- including accessibility of a particular site (75, 76), effects of coop- ing energies and in vivo occupancies, we similarly visualized the eration and competition with other TFs and (77, binding energy landscapes for both Pho4 and Cbf1 as a function 78), the presence of nucleotide modifications (79, 80), and the of sequence space (Fig. 5B and SI Appendix, Fig. S21). The high- nuclear concentration of a TF at a given time (81, 82). For acces- est affinity sequence for each TF was placed at the center of a sible, unmodified target sites, the probability of TF occupancy at series of concentric rings, each of which includes all sequences a given locus includes an exponential dependence on the cor- at a given Hamming distance from this sequence. Within each responding TF–DNA binding energy (83); therefore, accurate ring, points representing each sequence are arranged alphabeti- occupancy predictions require the ability to resolve even small cally, with the color of each point reporting the measured ∆∆G (∼1–2 kcal/mol) changes in binding energies. Toward this goal, for that sequence. As expected, the landscape forms a some- we developed an assay to provide comprehensive and quanti- what rugged funnel, with binding energies increasing with muta- tative measurements of near-neutral changes in binding ener- tional distance from the highest affinity site (SI Appendix, Fig. gies caused by mutations in the flanking sequences surround- S22). Next, we projected flanking site occupancies from ChIP-seq ing TF consensus sites. By training a NN on noisy estimates of

E3708 | www.pnas.org/cgi/doi/10.1073/pnas.1715888115 Le et al. Downloaded by guest on September 25, 2021 Downloaded by guest on September 25, 2021 ee al. et Le o,cnitn ihbohsclosrain htlclDNA local that observations biophysical nearest with behav- on binding the consistent based observed in explain ior, models fully lost that preferences is DN speci- demonstrate information neighbor of we order Here, higher determinants data, process. mechanistic sparse While extract from (51–55)]. ficity models can PWMs shape-based for models for T) these roll) val- and and twist, G, four propeller twist, width, C, helical of groove (minor (A, features set structural [nucleotides and a position (36) by each models at preferences Both ues binding specificity. DNA DNA TF and parameterize (PWMs) representing rela- models the models sequence-based surrounding shape-based DNA These debates of energies. recent utility to binding tive relevance observed models have to linear results of contributions quan- series and their a specificity drive tified and that features NN mechanistic a the revealed by output energies evolutionarily binding more sites can binding sites Pho4 plastic. binding determin- rendering Cbf1 Pho4, in than that ultimately pathways role evolutionary speculate nondeleterious larger fewer we traverse a specificity, that play Cbf1 Given (88). ing interactions with interactions DN compared binding landscapes nonadditive additive by energetic created rugged those more thought are produce nonadditivity binding of to levels suboptimal elevated a in addition, to In proximity sites extreme. sequence flanking occupied avoid to affinity evolutionary highly used an highest buffer of most existence the the the poten- indicating from Consis- potentially that distant sequences, the (87). mutationally find to are control we vivo due transcriptional this, favorable with dynamic tent evolutionarily greater bind- be for TF tial may submaximal, to sites transcriptional but opportunity of ing High-affinity, evolution unique energy networks. drive binding a that regulatory The mechanisms provide the sequences. here explore affinity presented highest dis- landscapes toward the focused of development assay covery most with rare, remains significant a false-negatives. and return false-positives could both Failing effects of TFBSs. number sequence TFs known flanking bound to putative consider similarities ana- for to sequence first regions acces- for these TFs of searching scan regions by bound then identify and of to DNA data presence sible ATAC-seq many the or addition, DNase-seq infer con- lyze In to regions unoccupied. accessible efforts remain other current while sites site appar- consensus consensus an ChIP taining despite a occupied of regarding are lack loci mysteries ent genomic explain some which TFBSs may muta- in core of data discrepancy the only representations for This energy current binding tion. in However, mea- change a previously 32. predict were would ref. effects in whose sequence sured consensus change a the highlight in letters underlined downstream the and and upstream sequences, from flanking specific within indicate sequences motif italics all the for library, identical core the is that the ’CACGTG’ motif TF mutating sus to example, equivalent CACGTG For kcal/mol, core. a ∼2.6 the between energy within binding CCCCA mutations equal in than amount difference an the greater by energies or binding to change can motifs sensus comprehensive, vivo. yield in binding to TF of nucleosomes) models and and predictive 86)] TFs (85, competition other and data between cooperation accessibility quantifying (e.g., chromatin data methyl- and mechanistic [e.g., (84) genomic In binding with state sequence. vitro combined ation in be each can high-resolution for landscapes that energy anticipate estimates we energy work, future binding required interactions accurate complex model for order a higher obtained all we incorporates sequences, that of millions for energies binding ytmtccmaiosbtenprsqec siae of estimates per-sequence between comparisons Systematic landscapes energy binding complete measuring practice, In con- “core” of outside nucleotides flanking in mutations Many euneada and sequence to CGT h odsqec niae h consen- the indicates sequence bold The GTG. AATTT CACGTGAAAAG TCCCCCACGTG- euneis sequence n nryetmtsotu yteN.TeM oe nldssequence includes model MN The NN. the by output estimates energy ing Appendix, Models. (SI Binding epochs Linear two further a for improve to (ini- S8). Fig. failed rate error point, until this ued At 10 epochs. at consecutive failed tialized error three -mean-squared for set decrease validation to the until stochas- descent using ReLU trained gradient were and tic networks datasets; val- (97) (30%) (60%), test training normalization and into (10%), divided batch idation randomly size was used dataset of entire layers The layers all Xavier activation. with hidden and initialized three were (96), weights of All initialization respectively. predicted consisted units, 250 the network and 500, The was 500, interest. output of NN species matrix; encoded hot NBnigModels. Binding NN energy Methods binding and Materials these of topography the quan- high-resolution, landscapes. of providing mapping by specificity titative target to efforts TF PBM probe and SELEX-seq work, initial future complement can In BET-seq interactions. both protein–protein including to systems, and assays additional protein–RNA sequencing-based for of energies development binding presented measure the simulations guide The can (93–95). here competition modifications this site-specific influence how nucleosomes and and quantitative occupancies into TFs and dictates between assembled competition direct how libraries facilitating of vitro, investigation DNA in with arrays compatible nucleosomal be BET- should addition, In seq specificities. TF affect modifications these 5-methylcytosine, how of investigation (e.g., systematic regulation allow vivo. could epigenetic 5-hydroxymethylcytosine) in modified in containing binding involved libraries TF bases DNA influence synthesized of that addi- Introduction probe mechanisms to work control future tional in BET- opportunities Finally, unique valv- assays, provides required. seq MITOMI microfluidic infrastructure traditional pneumatics the for the than Moreover, reducing simpler resolution. energies significantly is and measure ing scale to this services at deep-sequencing these educational commercial eliminates to access or readout with laboratory any sequencing allowing requirements, attached; A device platform. GAIIx microfluidic Illumina customized sequencing a a a to with access require either microscope slide assays HiTS-FLIP a automated and imaging fully printer of or microarray capable scanner DNA fluores- fluorescence a MITOMI high-cost require Traditional assays infrastructure. cence assay HiTS- and The or less significantly MITOMI efforts. equipment requiring traditional while discovery assays of 11, fluorescence-based site resolution FLIP (9, target the platforms offers including and further 92), applications binding of 91, absolute variety resolve 29, a to near- depths across of can sequencing energies simulations (measurement of these case range), choice energy use guide small was a specific BET-seq over effects a mea- While neutral for high-resolution landscapes. here and energy deployed comprehensive binding make of to surements and labs diver- design broader of a assay allow sity should simulation-guided here The presented assay vivo. experimental in to pre- behavior degree can energies dict mea- the binding on of these based models quantification of thermodynamic which direct resolution could allows high models further simpler The surements which behavior. to observed degree binding explain were the Cbf1 NN quantify the to and proved by essential output Pho4 ultimately predictions modeling high-resolution complexity specificities, accurately order required capac- for higher of NN’s the unnecessary number incorporate While the to models. in MN ity to increase relative modest parameters mod- a free DN only neighbor nearest require interactions Such between 90). els groove 89, stacking major (55, the pairs base base in adjacent bonds by hydrogen pair determined interbase and largely is structure −3 a erae n1-odiceet,adtann contin- training and increments, 10-fold in decreased was ) Niptwsdfie safltee 4 flattened a as defined was input NN ierbnigmdl eetando cldbind- scaled on trained were models binding Linear PNAS | o.115 vol. au o the for value ∆∆G | o 16 no. × 0one- 10 | E3709

BIOPHYSICS AND COMPUTATIONAL BIOLOGY features consisting of nucleotide identities at each flanking position; near- (DOI 10.6084/m9.figshare.5728467), and code used for analysis and figure est neighbor DN included all MN features plus all possible combinations of generation is available on GitHub (https://github.com/FordyceLab/BET-seq). adjacent nucleotide pairs. The full DN model adds all nonadjacent (gapped) DN combinations. All linear binding models were trained using the same ACKNOWLEDGMENTS. We thank Justin Kinney, Hua Tang, Anshul Kundaje, 60% of the data as the NN; reported accuracies are calculated with respect Daniel Herschlag, and Rhiju Das for helpful discussions. T.C.S. and A.K.A. acknowledge NSF Graduate Research Fellowships; A.K.A. acknowledges the to the held-out 40% of the data. Stanford ChEM-H (Chemistry, Engineering, and Medicine for Human Health) Predoctoral Training Program. This work was supported by NIH/National Material Availability. Detailed methods are available in SI Appendix, raw Institute of General Medical Sciences Grant R00GM09984804. P.M.F. is a Chan data are available from the Gene Expression Omnibus (accession number Zuckerberg Biohub Investigator and acknowledges Sloan Research Founda- GSE111936), processed and intermediate files are available from Figshare tion and McCormick and Gabilan faculty fellowships.

1. Latchman DS (1990) factors. Biochem J 270:281–289. 31. Berger MF, et al. (2006) Compact, universal DNA microarrays to comprehensively 2. Kim HD, O’Shea EK (2008) A quantitative model of transcription factor-activated gene determine transcription-factor binding site specificities. Nat Biotechnol 24:1429– expression. Nat Struct Mol Biol 15:1192–1198. 1435. 3. Kim HD, Shay T, O’Shea EK, Regev A (2009) Transcriptional regulatory circuits: Pre- 32. Maerkl SJ, Quake SR (2007) A systems approach to measuring the binding energy dicting numbers from alphabets. Science 325:429–432. landscapes of transcription factors. Science 315:233–237. 4. Segal E, Raveh-Sadka T, Schroeder M, Unnerstall U, Gaul U (2008) Predicting expres- 33. Fordyce PM, et al. (2010) De novo identification and biophysical characterization of sion patterns from regulatory sequence in Drosophila . Nature 451:535– transcription-factor binding sites with microfluidic affinity analysis. Nat Biotechnol 540. 28:970–975. 5. Raveh-Sadka T, Levo M, Segal E (2009) Incorporating nucleosomes into thermody- 34. Isakova A, et al. (2017) SMiLE-seq identifies binding motifs of single and dimeric tran- namic models of transcription regulation. Genome Res 19:1480–1496. scription factors. Nat Methods 14:316–322. 6. Gertz J, Siggia ED, Cohen BA (2009) Analysis of combinatorial cis-regulation in syn- 35. Nutiu R, et al. (2011) Direct measurement of DNA affinity landscapes on a high- thetic and genomic promoters. Nature 457:215–218. throughput sequencing instrument. Nat Biotechnol 29:659–664. 7. Foat BC, Morozov AV, Bussemaker HJ (2006) Statistical mechanical modeling of 36. Stormo GD, Schneider TD, Gold L, Ehrenfeucht A (1982) Use of the ‘Perceptron’ algo- genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics rithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res 10:2997– 22:e141–e149. 3011. 8. Riley TR, Lazarovici A, Mann RS, Bussemaker HJ (2015) Building accurate sequence- 37. Stormo GD, Fields DS (1998) Specificity, free energy and information content in to-affinity models from high-throughput in vitro protein-DNA binding data using protein-DNA interactions. Trends Biochem Sci 23:109–113. FeatureREDUCE. eLife 4:e06397. 38. Stormo GD, Hartzell GW (1989) Identifying protein-binding sites from unaligned DNA 9. Zhao Y, Stormo GD (2011) Quantitative analysis demonstrates most transcrip- fragments. Proc Natl Acad Sci USA 86:1183–1187. tion factors require only simple models of specificity. Nat Biotechnol 29:480– 39. Hertz GZ, Stormo GD (1999) Identifying DNA and protein patterns with statistically 483. significant alignments of multiple sequences. Bioinformatics 15:563–577. 10. Zhao Y, Granas D, Stormo GD (2009) Inferring binding energies from selected binding 40. Zuo Z, Stormo GD (2014) High-resolution specificity from DNA sequencing highlights sites. PLoS Comput Biol 5:e1000590. alternative modes of binding. Genetics 198:1329–1343. 11. Weirauch MT, et al. (2013) Evaluation of methods for modeling transcription factor 41. Bulyk ML, Johnson PLF, Church GM (2002) Nucleotides of transcription factor binding sequence specificity. Nat Biotechnol 31:126–134. sites exert interdependent effects on the binding affinities of transcription factors. 12. Mustonen V, Kinney J, Callan CG, Lassig¨ M (2008) Energy-dependent fitness: A quan- Nucleic Acids Res 30:1255–1261. titative model for the evolution of yeast transcription factor binding sites. Proc Natl 42. Mordelet F, Horton J, Hartemink AJ, Engelhardt BE, Gordanˆ R (2013) Stability selec- Acad Sci USA 105:12376–12381. tion for regression-based models of transcription factor-DNA binding specificity. 13. Haldane A, Manhart M, Morozov AV (2014) Biophysical fitness landscapes for tran- Bioinformatics 29:i117–i125. scription factor binding sites. PLoS Comput Biol 10:e1003683. 43. Mathelier A, Wasserman WW (2013) The next generation of transcription factor bind- 14. Crocker J, et al. (2015) Low affinity binding site clusters confer hox specificity and ing site prediction. PLoS Comput Biol 9:e1003214. regulatory robustness. Cell 160:191–203. 44. Zhao Y, Ruan S, Pandey M, Stormo GD (2012) Improved models for transcription fac- 15. Bintu L, Buchler NE, Garcia HG, Gerland U (2005) Transcriptional regulation by the tor binding site identification using nonindependent interactions. Genetics 191:781– numbers: Models. Curr Opin Genet Dev 15:116–124. 790. 16. Lam FH, Steger DJ, O’Shea EK (2008) Chromatin decouples threshold from 45. Tomovic A, Oakeley EJ (2007) Position dependencies in transcription factor binding dynamic range. Nature 453:246–250. sites. Bioinformatics 23:933–941. 17. Aow JSZ, et al. (2013) Differential binding of the related transcription factors Pho4 46. Siddharthan R (2010) Dinucleotide weight matrices for predicting transcription factor and Cbf1 can tune the sensitivity of promoters to different levels of an induction binding sites: Generalizing the position weight matrix. PLoS One 5:e9722. signal. Nucleic Acids Res 41:4877–4887. 47. Badis G, et al. (2009) Diversity and complexity in DNA recognition by transcription 18. Rajkumar AS, Denervaud´ N, Maerkl SJ (2013) Mapping the fine structure of a eukary- factors. Science 324:1720–1723. otic promoter input-output function. Nat Genet 45:1207–1215. 48. Annala M, Laurila K, Lahdesm¨ aki¨ H, Nykter M (2011) A linear model for transcrip- 19. Gordanˆ R, et al. (2013) Genomic regions flanking E-box binding sites influence DNA tion factor binding affinity prediction in protein binding microarrays. PLoS One 6: binding specificity of bHLH transcription factors through DNA shape. Cell Rep 3:1093– e20059. 1104. 49. Zhao X, Huang H, Speed TP (2005) Finding short DNA motifs using permuted Markov 20. Levo M, et al. (2015) Unraveling determinants of transcription factor binding outside models. J Comput Biol 12:894–906. the core binding site. Genome Res 25:1018–1029. 50. Sharon E, Lubliner S, Segal E (2008) A feature-based approach to modeling protein- 21. Afek A, Schipper JL, Horton J, Gordanˆ R, Lukatsky DB (2014) Protein-DNA binding DNA interactions. PLoS Comput Biol 4:e1000154. in the absence of specific base-pair recognition. Proc Natl Acad Sci USA 111:17140– 51. Rohs R, et al. (2009) The role of DNA shape in protein-DNA recognition. Nature 17145. 461:1248–1253. 22. Farley EK, Olson KM, Zhang W, Rokhsar DS, Levine MS (2016) Syntax compensates for 52. Abe N, et al. (2015) Deconvolving the recognition of DNA shape from sequence. Cell poor binding sites to tissue specificity of developmental enhancers. Proc Natl 161:307–318. Acad Sci USA 113:6508–6513. 53. Chiu TP, et al. (2016) DNAshapeR: An R/bioconductor package for DNA shape predic- 23. Afek A, Cohen H, Barber-Zucker S, Gordanˆ R, Lukatsky DB (2015) Nonconsensus pro- tion and feature encoding. Bioinformatics 32:1211–1213. tein binding to repetitive DNA sequence elements significantly affects eukaryotic 54. Yang L, et al. (2017) Transcription factor family-specific DNA shape readout revealed . PLoS Comput Biol 11:e1004429. by quantitative specificity models. Mol Syst Biol 13:910. 24. Jolma A, et al. (2010) Multiplexed massively parallel SELEX for characteriza- 55. Zhou T, et al. (2013) DNAshape: A method for the high-throughput prediction of DNA tion of human transcription factor binding specificities. Genome Res 20:861– structural features on a genomic scale. Nucleic Acids Res 41:W56–W62. 873. 56. Quang D, Xie X (2016) DanQ: A hybrid convolutional and recurrent deep neural net- 25. Slattery M, et al. (2011) binding evokes latent differences in DNA binding work for quantifying the function of DNA sequences. Nucleic Acids Res 44:e107. specificity between Hox proteins. Cell 147:1270–1282. 57. Djordjevic M, Sengupta AM, Shraiman BI (2003) A biophysical approach to transcrip- 26. Tuerk C, Gold L (1990) Systematic evolution of ligands by exponential enrichment: tion factor binding site discovery. Genome Res 13:2381–2390. RNA ligands to bacteriophage T4 DNA polymerase. Science 249:505–510. 58. Hellman LM, Fried MG (2007) Electrophoretic mobility shift assay (EMSA) for detect- 27. Ellington AD, Szostak JW (1990) In vitro selection of RNA molecules that bind specific ing protein-nucleic acid interactions. Nat Protoc 2:1849–1861. ligands. Nature 346:818–822. 59. Fordyce PM, et al. (2012) Basic zipper transcription factor Hac1 binds DNA 28. Zykovich A, Korf I, Segal DJ (2009) Bind-n-Seq: High-throughput analysis of in in two distinct modes as revealed by microfluidic analyses. Proc Natl Acad Sci USA vitro protein-DNA interactions using massively parallel sequencing. Nucleic Acids Res 109:E3084–E3093. 37:e151. 60. Jones S (2004) An overview of the basic helix-loop-helix proteins. Genome Biol 5:226. 29. Chen D, et al. (2016) SELMAP–SELEX affinity landscape MAPping of transcription fac- 61. Fisher F, Goding CR (1992) Single amino acid substitutions alter helix-loop-helix pro- tor binding sites using integrated microfluidics. Sci Rep 6:33351. tein specificity for bases flanking the core CANNTG motif. EMBO J 11:4103–4109. 30. Mukherjee S, et al. (2004) Rapid analysis of the DNA-binding specificities of transcrip- 62. Shimizu T, et al. (1997) Crystal structure of PHO4 bHLH domain-DNA complex: Flank- tion factors with DNA microarrays. Nat Genet 36:1331–1339. ing base recognition. EMBO J 16:4689–4697.

E3710 | www.pnas.org/cgi/doi/10.1073/pnas.1715888115 Le et al. Downloaded by guest on September 25, 2021 Downloaded by guest on September 25, 2021 5 uG,e l 21)Mlclridxn nbe uniaietree N sequenc- RNA targeted quantitative enables indexing Molecular (2014) al. molec- et unique GK, using molecules Fu of 65. numbers absolute Counting (2012) al. et T, genome- Kivioja of 64. determinants reveal approaches Integrated (2011) EK O’Shea X, Zhou 63. 8 ooo V igaE 20)Cnetn rti tutr ihpeitoso regu- of predictions with structure protein Connecting (2007) ED Siggia AV, Morozov posi- benchmarked 68. of the database by comprehensive A molecules ScerTF: DNA (2012) GD individual Stormo AT, Counting Spivak (2011) SPA 67. Fodor PH, Wang J, Hu GK, Fu 66. 0 ai 21)gsqoo estl akg o rwn eunelogos. sequence drawing for package R versatile A ggseqlogo: Saccha- (2017) for O sites Wagih regulatory conserved of 70. map improved An (2006) al. et KD, MacIsaac 69. 1 troG,ShedrT,Gl 18)Qatttv nlsso h relationship the of analysis Quantitative (1986) L Gold TD, Schneider GD, Stormo 71. 3 isiaiR(96 ersinsrnaeadslcinvatelasso. the via pro- selection and and centromeres shrinkage in Regression (1996) functions R which Tibshirani protein yeast 73. a CPF1, (1990) al. et J, Mellor 72. 8 ehrtJM ta.(03 igemlcl mgn ftasrpinfco binding factor transcription of imaging Single-molecule (2013) al. et JCM, Gebhardt 78. elucidate molecules binding DNA of landscapes Specificity (2010) al. et CD, Carlson 74. 0 i ,e l 21)Ipc fctsn ehlto nDAbnigseicte of specificities binding DNA on methylation of Impact (2017) al. et Y, Yin ACAT|GTG motifs 80. E-box in 5-Hydroxymethylcytosine (2016) al. et S, Khund-Sayeed 79. positioning. nucleosome for code human genomic of A determinant (2006) major al. a et are E, QTLs Segal sensitivity I 77. DNase (2012) al. et genome. JF, human the Degner of landscape 76. chromatin accessible The (2012) al. et RE, Thurman 75. 1 a ,OSe K(01 inldpnetdnmc ftasrpinfco translo- factor transcription of dynamics Signal-dependent (2011) EK O’Shea N, Hao 81. ee al. et Le n n eel orefiinisi tnadlbaypreparations. library standard in USA efficiencies poor reveals and ing Pho4. factor identifiers. ular transcription the of function 836. and binding wide aoysites. latory species. Saccharomyces for matrices weight tion labels. diverse of attachment stochastic formatics cerevisiae. romyces ewe uloiesqec n ucinlactivity. functional and sequence nucleotide between ttMethodol Stat B moters. oDAi iemmaincells. mammalian live in DNA to 778. function. biological ua rncito factors. transcription human TCF4. factor transcription B-HLH the Biol of DNA-binding increases ACAC|GTG and variation. expression Nature aincnrl eeexpression. gene controls cation 8:936–945. 111:1891–1896. 489:75–82. MOJ EMBO 3:3645–3647. rcNt cdSiUSA Sci Acad Natl Proc a Methods Nat 9:4017–4026. 58:267–288. M Bioinformatics BMC rcNt cdSiUSA Sci Acad Natl Proc Nature 482:390–394. Science 9:72–74. a Methods Nat a tutMlBiol Mol Struct Nat 104:7068–7073. 356:eaaj2239. rcNt cdSiUSA Sci Acad Natl Proc 7:113. 107:4544–4549. 10:421–426. uli cd Res Acids Nucleic 19:31–39. uli cd Res Acids Nucleic 108:9026–9031. 40:D162–D168. rcNt cdSci Acad Natl Proc o Cell Mol ttScSeries Soc Stat R J Nature 14:6661–6679. 442:772– 42:826– Integr Bioin- 7 of ,SeeyC(05 ac omlzto:Aclrtn epntoktrain- network deep Accelerating normalization: Batch (2015) C Szegedy S, Ioffe 97. feedforward deep training of difficulty the Understanding (2010) ubiquity- Y Bengio Chemically X, (2008) Glorot TW 96. Muir RG, Roeder C, Chatterjee J, mod- Kim protein RK, authentic site-specific McGinty to route 95. into biology analogs chemical A methyl-lysine (2016) al. of et installation A, Yang site-specific The 94. (2007) al. et MD, Simon 93. factors. transcription human Tu of tran- specificities of 92. DNA-binding features (2013) shape al. DNA et for A, database Jolma motif A 91. TFBSshape: (2014) al. et L, sequence-dependent Yang DNA (1998) 90. VB Zhurkin LM, Hock XJ, Lu AA, Gorin WK, Olson 89. factor transcription Low-affinity Aguilar-Rodr touch: soft 88. The (2016) DL Stern EPB, Noon J, of Crocker Transposition (2013) 87. WJ Greenleaf HY, Chang LC, Zaba PG, Giresi JD, Buenrostro 86. 5 ol P ta.(08 ihrslto apn n hrceiaino pnchro- open of characterization and mapping High-resolution (2008) al. dis- et positive AP, a Boyle yields 85. that protocol sequencing genomic A (1992) al. et M, Frommer 84. 3 it ,e l 20)Tasrpinlrglto ytenmes Models. numbers: the by regulation Transcriptional (2005) al. analogue et and L, activation Bintu digital reveal 83. dynamics NF-κB Single-cell (2010) al. et S, Tay 82. ee Dev Genet nomto processing. information ns(rceig fte3n nentoa ofrneo ahn erig Lille, Learning, Machine on 448–456. pp Conference Proceed- 37, International Vol Conference 32nd France), & the Workshop of shift. Research: (Proceedings Learning ings covariate Machine of internal Journal Learning, reducing by ing 249–256. pp 9, Vol Italy], Sardinia, (AISTATS), Statistics and ceedings networks. neural methylation. 453:812–816. intranucleosomal hDot1L-mediated stimulates H2B histone lated ifications. . recombinant evolution. site ing 152:327–339. sites. binding factor scription complexes. crystal 95:11163–11168. protein-DNA from deduced deformability navigability. their and scapes evolution. and development in sites binding DNA- position. chromatin, nucleosome open and of proteins profiling epigenomic binding sensitive and fast for chromatin native ai costegenome. the across matin strands. DNA individual in 89:1827–1831. residues 5-methylcytosine of play rlM Paix M, grul ˘ Poednso h 3hItrainlCneec nAtfiilIntelligence Artificial on Conference International 13th the of [Proceedings Science 15:116–124. ge ,PyeJ,Wge 21)Atosn miia dpieland- adaptive empirical thousand A (2017) A Wagner JL, Payne J, ´ ıguez oT atnN,Tka NH, Barton T, ao ˜ ora fMcieLann eerh okhp&Cneec Pro- Conference & Workshop Research: Learning Machine of Journal 354:623–626. LSGenet PLoS Cell Nature Cell 128:1003–1012. a clEvol Ecol Nat uli cd Res Acids Nucleic 11:e1005639. 132:311–322. 466:267–271. i 21)Dnmc ftasrpinfco bind- factor transcription of Dynamics (2015) G cik ˇ PNAS 1:45. a Methods Nat nentoa ofrneo Machine on Conference International urTpDvBiol Dev Top Curr 42:D148–D155. | o.115 vol. 10:1213–1218. rcNt cdSiUSA Sci Acad Natl Proc rcNt cdSiUSA Sci Acad Natl Proc 117:455–469. | o 16 no. urOpin Curr | Nature E3711 Cell

BIOPHYSICS AND COMPUTATIONAL BIOLOGY