Specificity landscapes unmask submaximal binding site preferences of transcription factors

Devesh Bhimsaria (a,b,1), José A. Rodríguez-Martínez (a,2), Junkun Pan (c), Daniel Roston (c), Elif Nihal Korkmaz (c), Qiang Cui (c,3), Parameswaran Ramanathan (b), and Aseem Z. Ansari (a,d,4)

(a) Department of Biochemistry, University of Wisconsin–Madison, Madison, WI 53706; (b) Department of Electrical and Computer Engineering, University of Wisconsin–Madison, Madison, WI 53706; (c) Department of Chemistry, University of Wisconsin–Madison, Madison, WI 53706; and (d) The Genome Center of Wisconsin, University of Wisconsin–Madison, Madison, WI 53706

Edited by Michael Levine, Princeton University, Princeton, NJ, and approved September 24, 2018 (received for review July 13, 2018)

BIOPHYSICS AND COMPUTATIONAL BIOLOGY

We have developed Differential Specificity and Energy Landscape (DiSEL) analysis to comprehensively compare DNA–protein interactomes (DPIs) obtained by high-throughput experimental platforms and cutting-edge computational methods. While high-affinity DNA binding sites are identified by most methods, DiSEL uncovered nuanced sequence preferences displayed by homologous transcription factors. Pairwise analysis of 726 DPIs uncovered homolog-specific differences at moderate- to low-affinity binding sites (submaximal sites). DiSEL analysis of variants of 41 transcription factors revealed that many disease-causing mutations result in allele-specific changes in binding site preferences. We focused on a set of highly homologous factors that have different biological roles but “read” DNA using identical amino acid side chains. Rather than direct readout, our results indicate that DNA noncontacting side chains allosterically contribute to sculpt distinct sequence preferences among closely related members of transcription factor families.

Differential Specificity and Energy Landscapes | cognate site identification | DNA–protein interactome | DNA sequence recognition | allostery

Significance

Several experimental platforms and computational methods have been developed to identify DNA binding sites of over 1,000 transcription factors. Often, high-affinity (maximal) binding sites are reported as consensus motifs. Differences between experimental platforms contribute to uncertainty in ascribing binding to submaximal sites. However, biological studies emphasize the importance of submaximal binding sites in shaping regulatory functions of transcription factors. To bridge this gap, we developed Differential Specificity and Energy Landscapes to unmask differences between experimental and computational methods as well as capture distinct submaximal binding site preferences of transcription factors. Our results suggest that subtle variation in protein structure can allosterically confer homolog-specific differences in binding to submaximal affinity sites.

Author contributions: D.B., J.A.R.-M., Q.C., P.R., and A.Z.A. designed research; D.B., J.A.R.-M., J.P., D.R., E.N.K., Q.C., P.R., and A.Z.A. performed research; D.B., P.R., and A.Z.A. contributed new reagents/analytic tools; D.B., J.A.R.-M., J.P., D.R., and E.N.K. analyzed data; and D.B., J.A.R.-M., J.P., D.R., E.N.K., Q.C., P.R., and A.Z.A. wrote the paper.
Conflict of interest statement: A.Z.A. is the sole member of VistaMotif, LLC and founder of the educational nonprofit WINStep Forward.
This article is a PNAS Direct Submission.
Published under the PNAS license.
(1) Present address: Bio Informaticals, Jaipur, Rajasthan 302016, India.
(2) Present address: Department of Biology, University of Puerto Rico–Rio Piedras, San Juan, Puerto Rico 00925.
(3) Present address: Departments of Chemistry and Physics & Biomedical Engineering, Boston University, Boston, MA 02215.
(4) To whom correspondence should be addressed. Email: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1811431115/-/DCSupplemental.

Genome-wide binding profiles of hundreds of transcription factors (TFs) have made it abundantly clear that these factors bind to a large spectrum of sequences to manifest their biological functions (1–3). The affinity for different biologically relevant binding sites can vary dramatically. Surprisingly, only a fraction of the genomic sites occupied in living cells can be annotated using high-affinity motifs assigned to a given TF (1). To further confound annotation, high-affinity sites can be bound interchangeably by TFs that bear a common DNA binding fold (4–6). This is especially true for highly homologous TFs that often bind indistinguishably to consensus high-affinity sites (6, 7). Increasingly, moderate- to low-affinity (submaximal or suboptimal affinity) binding sites have been shown to guide selective binding of individual TFs to distinct genomic loci (8–11). In other words, energetically subtle preferences for different moderate- to low-affinity sites govern selective binding and distinct biological roles of closely related homologous TFs (8–11).

The quest to identify consensus binding sites of all DNA (and RNA) binding proteins encoded within the genome is being driven by high-throughput experimental platforms and new computational approaches (12, 13). Each experimental and computational approach has inbuilt advantages and limitations (14–16). While high-affinity sites are readily identified, identifying binding to submaximal affinity sites is nontrivial and is often overlooked. However, an unexpected result from recent analyses is that high-affinity “consensus” binding sites often do not predict in vivo genome-wide binding profiles (chromatin immunoprecipitation followed by sequencing or ChIP-seq) as effectively as models that include sequences of submaximal affinities (17). Pairwise comparisons of DNA–protein interactomes (DPIs) suggest that most experimental platforms capture high-affinity sites with remarkable fidelity (18–23). However, the extent to which platform-dependent idiosyncrasies thwart the identification of submaximal binding sites is underscrutinized and poorly understood.

Here, we report the development of Differential Specificity and Energy Landscapes (DiSEL) to compare experimental platforms, computational methods, and interactomes of TFs, especially those factors that bind identical consensus motifs. Our results reveal that (i) most high-throughput experimental platforms reliably identify high-affinity motifs but yield less reliable information on submaximal sites; (ii) with few exceptions, computational methods model DPIs with a focus on high-affinity sites; (iii) submaximal sites improve the annotation of biologically relevant binding sites across genomes; (iv) among members of TF families, homolog-specific preferences are most evident at submaximal affinity sites rather than high-affinity motifs; (v) among closely related homologs that use identical side chains to interact with DNA, the residues that face away from the DNA can allosterically confer homolog-specific preferences for submaximal sites; and (vi) among naturally occurring alleles of specific factors, several disease-causing alleles impact binding to submaximal affinity sites (24). Taken together, DiSEL analysis readily unmasks the differences between experimental platforms and computational models and identifies submaximal sites that

are preferred by homologous proteins with indistinguishable high-affinity target sites. Our results highlight the importance of nonobvious allosteric contributors in conferring differential sequence specificity. While widely ignored, such allosteric effects likely contribute to sequence specificity beyond current models of direct and indirect readout of DNA sequence and shape. Increased evaluation of differential binding to submaximal affinity sites will undoubtedly improve the ability to decipher how genomic information is utilized by TFs to manifest their regulatory functions in vivo.

www.pnas.org/cgi/doi/10.1073/pnas.1811431115 | PNAS Latest Articles

Results

Specificity and Energy Landscapes Display Binding Affinities for an Entire Sequence Space. DPIs from high-throughput experimental methods are typically distilled down to a position weight matrix (PWM)-based “consensus motif” or a limited set of motifs (12, 25, 26) (Fig. 1A). While PWM-based motifs efficiently summarize sequence preferences of a DNA binding protein, they compress related sequences into a consensus, overlook the impact of flanking sequences, and underestimate the full spectrum of cognate sites contained within a given interactome. We utilize sequence specificity landscapes (SSLs) to visualize individual interactomes (19, 27) (Fig. 1B). When binding affinities are measured and correlated with cognate sites within an interactome, the resulting plots display binding energy landscapes [Specificity and Energy Landscapes (SELs)] of individual TFs (27, 28). In SSL/SEL plots, the binding affinities for a k-mer sequence space are represented in a series of concentric rings organized by a “seed motif.” All sequences in a DPI are then placed at different positions along the concentric circles based on sequence similarity to the seed motif (Fig. 1B and SI Appendix, Fig. S1). SELs of different classes of proteins reveal the range of binding modes displayed by a given TF and the impact of flanking sequences and mismatches on binding (6, 19).

To elucidate similarities and differences in DPIs obtained by various experimental methods and sequence preferences of highly homologous TFs, we now report the development of DiSELs (Fig. 1B) (29). To perform DiSEL, we normalize paired DPI datasets and scale the dynamic range of one DPI against the other (see Methods for details). An automated peak finding algorithm to systematically identify and rank order the binding site preferences of TFs was also developed (Fig. 1C and Methods). We then utilized DiSEL to compare 568 pairs of DPIs of different DNA binding domains, 129 pairs of DPIs of different alleles of a given protein, and 29 DPIs obtained by different experimental platforms.

Fig. 1. Comprehensive display of DPI data using SELs and DiSELs. (A) High-throughput methods to study DNA–protein interactions can be classified as array based (CSI and PBM), microfluidics based (MITOMI), or based on next generation DNA sequencing (high-throughput sequencing) (12). Computational methods can summarize DPIs into PWMs. Sequence affinities determined by high-throughput methods correlate with equilibrium binding constants measured by standard methods. (B) The entire DPI is displayed as SEL. Shown is an SEL representation (19, 27) of the DPI of Gata4 by CSI array (19) using 5′WGATAA3′ as seed motif, where W = A or G. Sequences placed within the zero-mismatch ring have an exact match to the seed motif. The one-mismatch ring contains all sequences that differ from the seed motif at any one position (a Hamming distance of one). The sequences are placed in a clockwise manner starting with mismatches at the first position of the motif and ending with mismatches at the last position of the motif. Within each sector, the mismatches at a given position “x” are organized in alphabetical order (A-C-G-T). The two-mismatch ring contains all permutations with two positional differences from the seed. The height of each color-coded peak corresponds to the DNA binding intensity; on the color scale of the plots, red to blue to gray corresponds to highest to median to lowest binding, respectively. Differences between any two DPIs can be readily visualized through a DiSEL. A DiSEL comparison between DPIs of Lhx4 and Lhx2 proteins with seed 5′TAATTA3′ is shown; Lhx4-preferred DNA binding compared with Lhx2 is pointed out as peaks. Here, x means a mismatch at that position of the seed. (C) Domainwise distribution of DPIs of 568 homologous TFs and 129 alleles of 41 TFs that are compared via DiSEL in the text.

DPIs Captured by Different Experimental Platforms. Over the past decade, several high-throughput platforms have been developed to chart the sequence specificity of DNA binding proteins. Among those, the most widely used methods can be grouped as DNA microarray-based platforms [Cognate Site Identifier (CSI) (18), protein binding microarray (PBM) (20), high-throughput sequencing–fluorescent ligand interaction profiling (HiTS-FLIP) (23)], massively parallel sequencing-enabled methods [for RNA: in vitro selection, high-throughput sequencing of RNA, and sequence specificity landscapes (SEQRS) (30); for DNA: high-throughput systematic evolution of ligands by exponential enrichment (HT-SELEX) (22, 31), systematic evolution of ligands by exponential enrichment with massively parallel sequencing (SELEX-seq) (7), Bind-n-Seq (32)], microfluidics-based protein arrays [mechanically induced trapping of molecular interactions (MITOMI), selective microfluidics-based ligand enrichment followed by sequencing (SMiLE-seq)] (13, 33), and cell-based bacterial one-hybrid methods (34). Each method has its advantages and limitations (12); however, submaximal affinity binding sites are typically not examined due to uncertainty about whether such binding is indicative of biologically relevant affinities or simply arises as an artifact of the experimental platform.

We use PBM data as a benchmark to compare DPIs obtained via different platforms. A common protein that was examined by both CSI array and PBM is Gzf3, a C4-class zinc finger (35). The consensus motifs for Gzf3 from both sets of DPI data are nearly identical, and the scatterplot of binding intensities shows remarkable

correlation (r = 0.88) between the two platforms. SELs visually display the extent of similarity between the two DPIs, whereas DiSELs quantitatively highlight the differences (Fig. 2A). Whether these differences arise due to platform-specific differences or if these are functional sites that are identified by one platform but not the other can now be clarified using a focused set of sequences.

Despite significantly deeper representation of sequence space in HiTS-FLIP, the Gcn4 motif and interactome obtained by HiTS-FLIP are remarkably similar to those obtained by PBM (Fig. 2B) (correlation r = 0.65). However, the impact of flanking sequences and other subtle contributions are better resolved in HiTS-FLIP due to the depth afforded by the sheer number of DNA sequences available on the Illumina platform (23).

MITOMI utilizes microfluidic approaches to examine 1,440 different DNA sequences with a given protein by capturing DNA–protein complexes using surface-tethered antibodies. The levels of trapped complex, as detected by fluorescence, reflect equilibrium binding affinities of the examined protein for a given DNA sequence (33). Focused study of bHLH half sites yielded a set of sequences that correlated well with PBM data. However, encoding the entire 8-mer space within 1,440 oligonucleotides yielded scatterplots and DiSELs that highlight the limitation of using the De Bruijn approach to represent the entire sequence space of a given binding site. Despite the poor correlation (r = 0.30) across the DPI, the motif and several high-affinity sites for Cbf1, a bHLH protein, are congruent between the two platforms as shown by coinciding peaks in both SELs. However, DiSELs highlight the differences and provide clusters of sites detected in one platform vs. the other (Fig. 2C).

Comparison of the HT-SELEX interactome for FOXJ3 protein (human) with the interactome of Foxj3 (mouse) obtained via PBM shows that these two different experimental platforms yield nearly identical motifs and highly comparable SELs. However, DiSEL analysis reveals the underrepresentation of a cognate site (5′GGTAAACA3′) that was previously identified as a part of the primary Foxj3 binding motif (Fig. 2D and SI Appendix, Fig. S3) (14, 21). In other words, in HT-SELEX, if a sequence is underrepresented in early rounds of enrichment or not amplified or sequenced efficiently, it might be lost from the repertoire of bona fide cognate sites of a given protein. To avoid potential loss of relevant sites, several groups have sequenced the initial members of the library bound by the protein without additional rounds of enrichment (HT-SELEX) (31) or enriched complexes


Fig. 2. SEL/DiSEL to compare different high-throughput experimental platforms. SELs (Left), scatterplots of quantile-normalized DNA binding intensities for all 8-mers (Center), and DiSELs (Right) comparing DPIs obtained through high-throughput platforms with PBM. (A) CSI vs. PBM data for Saccharomyces cerevisiae Gzf3 (seed motif: 5′GATAAG3′). (B) HiTS-FLIP vs. PBM data for S. cerevisiae Gcn4 (seed motif: 5′TGACTCA3′). (C) MITOMI vs. PBM data for S. cerevisiae Cbf1 (seed motif: 5′CACGTG3′). (D) HT-SELEX for Homo sapiens FOXJ3 vs. PBM for Mus musculus Foxj3 (seed motif: 5′AAACA3′). DNA logos derived from PBM were downloaded from UniPROBE. SEL peaks represent DNA binding preferences of the protein as measured by the experimental platform, whereas DiSEL peaks correspond to binding preferences identified by one platform but not the other. A few differences are pointed out on DiSELs (arrows). All data are displayed as z scores.
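The quantile normalization, z scoring, and correlation named in this caption can be illustrated with a minimal sketch. It assumes each platform reports one intensity per shared 8-mer; the function names and toy numbers are hypothetical, and taking the rank-wise mean as the shared distribution is an assumption of this sketch, not necessarily the procedure used by the authors (whose Methods are not reproduced here).

```python
from math import sqrt
from statistics import mean, pstdev


def quantile_normalize_pair(a, b):
    """Force two equal-length intensity lists onto one shared distribution.

    The shared distribution is the rank-wise mean of the two sorted lists
    (an illustrative assumption, not the published procedure).
    """
    if len(a) != len(b):
        raise ValueError("both platforms must score the same k-mers")
    reference = [(x + y) / 2 for x, y in zip(sorted(a), sorted(b))]
    out = []
    for values in (a, b):
        # Indices of the values ordered from lowest to highest intensity.
        order = sorted(range(len(values)), key=values.__getitem__)
        normalized = [0.0] * len(values)
        for rank, index in enumerate(order):
            normalized[index] = reference[rank]
        out.append(normalized)
    return out[0], out[1]


def z_scores(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    return cov / sqrt(sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y))


# Toy intensities for four k-mers on two hypothetical platforms.
platform_a = [1.0, 2.0, 3.0, 4.0]
platform_b = [40.0, 10.0, 30.0, 20.0]
norm_a, norm_b = quantile_normalize_pair(platform_a, platform_b)
za, zb = z_scores(norm_a), z_scores(norm_b)
r = pearson_r(za, zb)
difference = [p - q for p, q in zip(za, zb)]  # per-k-mer disagreement
```

After normalization both lists contain the same set of values, so any remaining per-k-mer difference reflects a change in rank between platforms rather than a difference in dynamic range.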

by EMSA and sequenced bound and unbound DNA from every round of selection [SELEX-seq (7) or Spec-seq (36)].

A major constraint in comparing a comprehensive set of TF DPIs across all experimental platforms is the paucity of DNA interactome data beyond a handful of TFs that have been systematically tested on different experimental platforms. Even when the same TF was examined across different high-throughput platforms, on closer inspection, a number of confounding differences emerged [for example, differences in species (mouse vs. human TFs), protein size (DNA binding domains vs. full-length TFs), sample purity (highly purified preparations vs. overexpressed TFs in crude whole-cell lysates or “in vitro transcription–translation” extracts), library design, depth of sequence coverage, and binding reaction conditions under which the experiments were conducted]. Despite these constraints, we were able to find 29 TF interactomes that permitted meaningful comparisons by DiSEL (Fig. 2 and Dataset S1).

SELEX Enrichment Depletes Submaximal Sites. The ability to probe larger binding sites, ease of use, widespread access to high-throughput sequencing, and cost-effectiveness through multiplexing are some of the reasons that have made systematic evolution of ligands by exponential enrichment (SELEX)-based methods widely used in campaigns to obtain DPIs for hundreds of DNA binding proteins. Different variations on the theme have been developed recently; some rely on limited rounds of enrichment, whereas others do a single round of enrichment and sequence both bound and unbound DNA (7, 22, 31, 32, 36). At present, the decision on how many rounds of enrichment are required to define the sequence specificity of proteins is arbitrary. Moreover, which cognate sites are gained or lost per round is rarely examined. We, therefore, used SEL and DiSEL to evaluate the DPIs obtained from each round of FOXJ3 enrichment and sequencing.

The consensus motifs derived from the top 500 sequences of successive rounds of enrichment and amplification in HT-SELEX


Fig. 3. Evaluation of sequences enriched in different rounds of HT-SELEX. (A) Submaximal affinity sequences are lost during consecutive rounds of selection in SELEX experiments for FOXJ3. (Top) PWMs derived by the MEME (multiple EM for motif elicitation) algorithm from the top 500 8-mer binding sequences for rounds 1–5 (counts). (Middle) SELs for FOXJ3 with 5′AAACA3′ as seed motif. (Bottom) DiSELs comparing different selection rounds. Sequences with high intensity are pointed to in SELs; sequences lost or gained in different rounds are pointed out on DiSELs. (B) ROC curves for sensitivity–specificity analysis of mapping of the interactome data from different rounds of HT-SELEX enrichment to FOXJ3 ChIP-seq peaks from the U2OS cell line. Round 1 best fits the ChIP peaks (AUC = 0.674). The AUC is reduced significantly when peaks with 5′AAAAA3′ (AUC = 0.547) or 5′AAATA3′ (AUC = 0.596) sites are excluded. However, excluding peaks with a random 5-mer such as 5′ACGAC3′, which shows no binding by FOXJ3, does not alter the ROC values (AUC = 0.672).
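The peak-exclusion test in panel B can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: each region is scored by its best-scoring k-mer on either strand, the AUC is computed as the probability that a peak outscores a background region, and every sequence, score, and function name below is invented for the example.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def contains_site(seq, site):
    """True if the site occurs on either strand of seq."""
    return site in seq or reverse_complement(site) in seq


def region_score(seq, kmer_scores, k):
    """Best k-mer score over both strands (0.0 when no window is scored)."""
    best = 0.0
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            best = max(best, kmer_scores.get(strand[i:i + k], 0.0))
    return best


def roc_auc(positive_scores, negative_scores):
    """Probability that a positive outscores a negative; ties count one half."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positive_scores) * len(negative_scores))


def auc_for_peaks(peaks, background, kmer_scores, k, exclude_site=None):
    """AUC for separating peaks from background, optionally dropping peaks
    that contain exclude_site on either strand."""
    if exclude_site is not None:
        peaks = [s for s in peaks if not contains_site(s, exclude_site)]
    pos = [region_score(s, kmer_scores, k) for s in peaks]
    neg = [region_score(s, kmer_scores, k) for s in background]
    return roc_auc(pos, neg)


# Invented 5-mer scores and sequences, for illustration only.
kmer_scores = {"AAACA": 3.0, "AAATA": 1.5}
peaks = ["GGAAACAGG", "CCAAATACC", "GGTATTTGG", "GCGCGCGCG"]
background = ["GGGCCCGGG", "CCGGCCGGC", "CCAAATAGG"]
auc_all = auc_for_peaks(peaks, background, kmer_scores, 5)
auc_no_aaata = auc_for_peaks(peaks, background, kmer_scores, 5, exclude_site="AAATA")
```

In this toy set, dropping the peaks that carry the lower-scoring site lowers the AUC, which mirrors the direction of the effect reported in the caption; the magnitudes carry no meaning.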

are nominally different (Fig. 3A) (37). In contrast, from SEL plots it is evident that, with each round of enrichment, submaximal sites are weeded out until only high-affinity sites dominate the landscape (Fig. 3A and SI Appendix, Fig. S4). DiSEL analysis reveals the nature of sites that are successively eliminated in each round of enrichment. For example, 5′AAACATT3′ occurs prominently in the first round, but it is lost by round 3 and certainly missing in round 5; however, a related sequence, 5′AAACATAA3′, is successively enriched, and by round 5, two very sharp high-affinity peaks with this motif are evident in the final DPI. DiSEL thus captures the high-affinity motifs and also provides a view of the evolving specificity landscape and the progressive loss of submaximal affinity sites that may well be biologically relevant in vivo. In doing so, SEL and DiSEL assist in defining the optimal rounds of enrichment that might best capture the range of DNA cognate sites bound by a given protein or small molecule.

To determine whether the submaximal sites identified by DiSEL analysis are bound by TFs in cells, we examined genome-wide binding profiles of FOXJ3 in U2OS, a human osteosarcoma cell line (38). The receiver operating characteristic (ROC) performance in retrieving FOXJ3 ChIP-seq peaks using all sites identified in the first round of HT-SELEX enrichment was far greater than that of subsequent rounds of enrichment (Fig. 3B). Removing peaks bearing the two submaximal sites 5′AAAAA3′ and 5′AAATA3′ eliminates the advantage offered by round 1-enriched sequences in annotating ChIP-seq peaks. In contrast, removing regions bearing unrelated sites 5′ACGAC3′ (Fig. 3B), 5′TAACA3′, or 5′GTATG3′ (SI Appendix, Fig. S4C) does not impact area under the curve (AUC) values of ROC. These results highlight the biological relevance of the submaximal sites identified by the DiSEL approach. Thus, with each round of SELEX, the best binding sites are enriched at the expense of biologically relevant submaximal binding sites.

Comparison of Computational Models. Two different algorithms reported related but not identical motifs for the same Foxj3 DNA interactome data (14–16). These motifs display differing levels of success in capturing cognate sites within the interactome. Protein binding microarray–binding energy and expectation maximization likelihood (PBM-BEEML) was designed for PBM data and considers thermodynamic binding in generating a PWM for a motif (14). Seed-n-Wobble also works on PBM data, but it identifies a seed and builds a motif by considering substitutions to the seed sequence and by adding nucleotides on the 5′ and 3′ ends to determine if any given nucleotide leads to an increased overall binding intensity (20). The consequence is that Seed-n-Wobble builds high information content motifs, and related motifs are designated secondary motifs: for example, 5′GTAAACA3′ vs. 5′CAAAACA3′. In direct comparison, PBM-BEEML successfully identifies a larger fraction of Foxj3 binding sequences, because it defines the core 5′AAACA3′ as the motif. Displaying the Foxj3 interactome in an SEL and using either 5′AAACA3′ or 5′RTAAACA3′ (R = G/A) as seed motifs shows that the PBM-BEEML model is too permissive, whereas Seed-n-Wobble may be too restrictive (Fig. 4A).

In recent reports, several models are simultaneously applied to discover consensus motifs (5). The assumptions inherent to each computational method may impact the inclusion or exclusion of submaximal cognate sites in unanticipated ways, as is evident from the SEL in Fig. 4A. SELs and DiSELs could serve as an unbiased tool to evaluate how accurately motifs derived from the different computational models capture the full affinity and specificity profiles of DNA binding proteins. From the crenellations in the zero-mismatch ring, one can immediately identify the impact of different flanking sequences on the ability of a protein to bind a perfectly matched core cognate site. Recent studies have shown that such context effects are important in modulating binding of TFs to different genomic loci in cells (19, 39–41).

Another consistent pattern that emerges from SELs is that many TFs bind more than one motif: for example, PAX6 binds 5′TAATTA3′ and 5′TGCACA3′. Specificity profiles of such TFs can be analyzed in two different ways: first by plotting different SELs using each motif as a seed and second by combining both motifs as seeds in a single SEL (Fig. 4B). The advantages of the first strategy are that one can focus on one motif at a time and that binding sites encapsulating the other motif would manifest themselves as peaks in appropriate mismatch rings. The second strategy offers the advantage of comparative analysis of more than one motif on a single SEL plot. In either case, the binding affinity associated with any given sequence is not altered in any way; only the position with respect to the seed changes.

Fig. 4. SELs to compare specificity models derived from PBM-BEEML and Seed-n-Wobble computational methods. (A) SELs for the DNA interactome of Foxj3 with PWMs representing specificity models of two computational methods derived from PBM data (z scores). Seed motifs derived by PBM-BEEML (5′AAACA3′; Left) and Seed-n-Wobble (5′RTAAACA3′; Right), where R = A/G (14). Positions of the two sequences from Seed-n-Wobble seeds are pointed out in both SELs: 5′GTAAACAA3′ (1) and 5′CAAAACAA3′ (2). (B) Top views of SELs of PAX6 protein using seed 5′TAATTA3′ in Left, seed 5′TGCACA3′ in Center, and both 5′TAATTA3′ and 5′TGCACA3′ as seed in Right. The dashed black lines demarcate the landscape using one motif or the other as a seed.

Differential Preferences of Highly Homologous TFs. To determine if DiSELs can tease out specificity differences between homologous TFs, we examined the entire interactomes of 568 different pairs of DNA binding proteins representing different classes of DNA binding domains (Figs. 1C and 5A, Datasets S2–S5, and SI Appendix, Figs. S5 and S6). In particular, we carefully scrutinized three pairs of homologous TFs: Lhx2 and Lhx4 of the Homeodomain family, Hnf4a and Rxra of the Nuclear Receptor family, and Irf4 and Irf5 of the tryptophan pentad repeat members of the winged helix-turn-helix family of DNA binding proteins. The added benefit of examining these three pairs of TFs is that sequences preferred by one member over the other have been mapped previously (21, 42). Thus, this prior set of homolog-specific sites serves as a benchmark for DiSEL-based identification of sequences preferred by closely related homologs.

As expected, the derived PWM from each interactome shows that related TFs yield nearly identical consensus motifs (Fig. 5A and SI Appendix, Figs. S5A and S6A). SEL representation of the entire DPI for all three pairs further emphasizes the extent of the overlap in sequence preferences of matched pairs (Fig. 5B and SI Appendix, Figs. S5B and S6B). This is not surprising, especially in the case of Lhx2 and Lhx4, because both proteins utilize identical amino acid side chains to read the high-affinity 5′TAATTA3′ cognate site.

In DiSELs, by virtue of the sequence placement within the landscapes, related sequences preferred by one homolog over the other cluster together and are automatically identified using a purpose-built software package (provided here). Such sites can also be visually identified, and underlying sequences can be queried

using an interactive graphical user interface (Fig. 5C, Dataset S6, and SI Appendix, Figs. S5C and S6C).

Previous efforts to identify homolog-preferred sites relied on either a subjective manual curation or a Bayesian ANOVA model (21, 42, 43). Our analysis of DiSEL plots readily captured the site preferences identified by previous methods [such as manually curated (5′GGTCCA3′ preferred by Hnf4a compared with Rxra, 5′TGAAAG3′ preferred by Irf4 compared with 5′CGAGAC3′ preferred by Irf5) and ANOVA based (5′TAACGA3′, 5′TAATGG3′, and 5′TAATGA3′ preferred by Lhx2 vs. 5′TAATCA3′, 5′TGATTG3′, and 5′TAATCT3′ by Lhx4)]. More important, automated DiSEL analysis revealed homolog-preferred submaximal sites that were missed by previous studies (Datasets S2–S5). Displaying the identified sites on a scatterplot makes plain the challenges of identifying these homolog preferences through current approaches. DiSEL plots, however, cluster and highlight submaximal binding sites (Fig. 5D and SI Appendix, Figs. S5D and S6D). Another striking observation that emerges from this analysis is that homolog-specific sequences primarily appear in the mismatch rings and comprise submaximal sites.

In addition to capturing previously mapped differences in vitro, an additional feature supports the conclusion that the submaximal sequences identified by DiSEL analysis are reflective of preferential binding in vivo. As in the example of Foxj3 above, genome-wide binding profiles in biologically relevant cells, as identified by ChIP-seq, show that submaximal sites improve the annotation of ChIP-seq peaks significantly. We examined the ChIP-seq peaks of LHX2 in hair follicle cells (44) to determine whether Lhx2- or Lhx4-preferred binding sites better annotate these ChIP peaks. Focusing on the submaximal sites 5′TAATG3′ preferred by Lhx2 and 5′TAATC3′ preferred by Lhx4, we report that ROC plots show that retaining 5′TAATG3′-containing ChIP peaks led to better annotation with Lhx2 (AUC 0.811) vs. the Lhx4 (AUC 0.754) interactome data (Fig. 5E, Upper Left). Conversely, removing peaks bearing 5′TAATG3′ sites led to indistinguishable annotation using either Lhx2 (AUC 0.607) or Lhx4 (AUC 0.619) interactome data (Fig. 5E, Lower Left); this observation was also true for other replicates (SI Appendix, Fig. S7). Furthermore, removal of peaks bearing Lhx4-preferred 5′TAATC3′ sites improves the annotation of ChIP-seq data with the Lhx2 interactome data (Fig. 5E, Right), supporting the fact that, consistent with in vitro sequence preferences, 5′TAATC3′ sites are less favorably bound by Lhx2 in comparison with Lhx4.
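The idea that homolog-preferred sequences sit in the mismatch rings can be made concrete with a small sketch. It assumes both interactomes are already normalized to a common scale, takes the differential as a plain per-k-mer subtraction, and assigns a ring by the fewest mismatches between the seed and any forward-strand window of the k-mer. These are simplifying assumptions for illustration; the scores and the names `tf_a` and `tf_b` are invented, and the authors' software may differ in each of these choices.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_ring(kmer, seed):
    """Fewest mismatches between the seed and any seed-length window of kmer
    (forward strand only, a simplification made for this sketch)."""
    n = len(seed)
    return min(hamming(kmer[i:i + n], seed) for i in range(len(kmer) - n + 1))


def differential(scores_a, scores_b):
    """Per-k-mer score of A minus B over the k-mers measured for both."""
    shared = scores_a.keys() & scores_b.keys()
    return {kmer: scores_a[kmer] - scores_b[kmer] for kmer in shared}


def preferred_sites(scores_a, scores_b, seed, top=3):
    """Top k-mers preferred by A over B, each with its ring and score gap."""
    diff = differential(scores_a, scores_b)
    ranked = sorted(diff, key=diff.get, reverse=True)[:top]
    return [(kmer, mismatch_ring(kmer, seed), diff[kmer]) for kmer in ranked]


# Invented, already-normalized scores for two hypothetical homologs.
tf_a = {"TAATTAGC": 3.0, "TAATGAGC": 2.0, "TAATCAGC": 0.5, "GGGCCCGG": -1.0}
tf_b = {"TAATTAGC": 3.0, "TAATGAGC": 0.5, "TAATCAGC": 2.0, "GGGCCCGG": -1.0}
a_over_b = preferred_sites(tf_a, tf_b, "TAATTA", top=1)
b_over_a = preferred_sites(tf_b, tf_a, "TAATTA", top=1)
```

In this toy case the exact seed match scores identically for both factors, so the only differential peaks fall in the one-mismatch ring.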

Fig. 5. DiSEL to compare interactomes of highly homologous DNA binding proteins. (A and B) Lhx2 (Left) and Lhx4 (Right); 8-mer PBM data are used for comparison. All units in B–D are E scores. (A) PWM motifs determined by Seed-n-Wobble and (B) SELs for Lhx2 and Lhx4 using 5′TAATTA3′ as seed motif. (C) DiSELs [(Left) Lhx2 over Lhx4; (Right) Lhx4 over Lhx2] with seed 5′TAATTA3′ used to highlight subtle differences in DNA binding preferences in the form of small blue peaks. (D) Selected differences identified via DiSELs are highlighted in a scatterplot [(Left) Lhx2 preferred sequences; (Right) Lhx4 preferred sequences]. (E) ROC curves plotted to obtain sensitivity–specificity analysis of LHX2 ChIP-seq peaks in hair follicle stem cells. Left compares ROCs with ChIP peaks containing the Lhx2 submaximal site 5′TAATG3′ (Upper Left) vs. ChIP data where 5′TAATG3′-containing peaks were computationally removed (Lower Left). Right compares ROCs with peaks containing the Lhx4 submaximal site 5′TAATC3′ (Upper Right) and without it (Lower Right). ChIP peaks containing 5′TAATG3′ are better annotated by the Lhx2 rather than the Lhx4 interactome. In contrast, peaks bearing 5′TAATC3′ are not preferentially annotated, validating the robustness of our approach in identifying biologically relevant submaximal sites preferred by each homolog.

Amino Acids That Differ Between Closely Related Homologs Do Not “Read” DNA. Aligning the amino acid sequence of Lhx2 and Lhx4 with the prototypical homeodomain protein engrailed shows that, in each homolog, side chains that make direct contacts are identical and so are over one-half of the residues that interact with the DNA backbone (Fig. 6A). However, homologs vary in the residues that face away from the DNA and are packed against the hydrophobic core of the homeodomain (Fig. 6 B and C). In particular, the substitution of residues L45 and R52 of Lhx2 (blue in Fig. 6) with V45 and A52 of Lhx4 (orange in Fig. 6) could potentially tilt the DNA recognition helix 3 with respect to the major groove. Such a change in the docking angle of the recognition helix could change the energetic dependence on specific positions within an otherwise identical DNA consensus motif. Remarkably, in Lhx2, L45 packs against M16, whereas in Lhx4, V45 shows poorer packing against L16. Such variation in residues that pack and stabilize the protein core is not limited to Lhx2 and Lhx4; similar changes at positions 45 and 52 are observed in 30 of 255 homeodomains (SI Appendix, Fig. S8C). We also note that other residues that pack against the recognition helix differ between closely related homologs and may similarly contribute to homolog-specific preference for distinct submaximal binding sites (45). Such differences are typically overlooked, because the side chains that do not contact DNA are mostly ignored in defining the sequence specificity (42). A few

studies have indicated that non–DNA-contacting positions might affect DNA binding (24, 34, 45, 46).

We further examined the specificity landscapes of the 30 homeodomains with identical DNA contacting residues R5, V47, Q50, N51, A54, and K55, similar to Lhx2 and Lhx4 (SI Appendix, Fig. S9A). Of the 30 homologs, DPIs of 20 were obtained under the same experimental conditions (Dataset S5). We examined specificity landscapes of all 20 homologs; of these, Lhx3 has the same V45–R52 pair as Lhx4, whereas Lhx9 has the L45–A52 pair found in Lhx2. In agreement with our predictions, we find that Lhx3–Lhx4 and Lhx9–Lhx2 show striking overlap in the SEL and DiSEL profiles (SI Appendix, Fig. S9 B–E). The identification of submaximal sites from Lhx3 and Lhx9 further reveals nearly identical homolog-specific preferences as Lhx4 and Lhx2, respectively (SI Appendix, Fig. S9). These results strongly validate our approach, where four distinct proteins purified, tested, and analyzed independently converge on the same specificity profiles and cross-validate each homology pairing.

Molecular Dynamics Simulations Reveal Differences in Helix Orientations. A series of molecular dynamics (MD) simulations tested the proposal that mutations that do not directly interface with DNA affect specificity by altering the orientation of helix 3 with respect to the DNA. Both Lhx2 and Lhx4 sequences were modeled onto homeodomain template structures (Fig. 6B and SI Appendix, Fig. S10A). The distances between the backbone alpha carbons for both homologs do not exceed 0.3 Å rmsd in the modeled structures. We conducted equilibrium MD simulations of both structures using three different force fields: the AMBER ff14SB force field (47) with generalized Born solvation model, the AMBER force field with explicit solvent, and the CHARMM force field (48, 49) with explicit solvent (Methods).

The simulations reveal important differences in the overall structures and flexibilities of the two proteins (Fig. 6 and SI Appendix, Fig. S10). Notably, all simulation schemes show persistent differences between the proteins in the relative orientations of the three helical axes. Thus, while the residues of helix 3 that read DNA are identical between the two proteins, the differences in orientation relative to the other helices will alter how those residues interact with the major groove of DNA, affecting the sequence specificity. Fig. 6E shows how structures from the CHARMM simulations might interact with DNA and that differences in relative orientations of the helices alter the interactions of helix 3 with the DNA.

Fig. 6. Structural models and differential dynamics of Lhx2 and Lhx4. (A) Homeodomain amino acid sequences of engrailed, Lhx2 (blue), and Lhx4 (orange) proteins. Residues that make DNA base contacts (▴) or DNA backbone contacts (▪) or are identified as important for DNA binding by alanine shotgun scanning experiments (▾) are marked. (B and C) Structural models for Lhx2 (blue) and Lhx4 (orange) were aligned to engrailed DNA structure (PDB ID code 1HDD) to highlight amino acid differences. Helices are numbered for reference. (D) Angles between helices during 200-ns MD simulations with the CHARMM force field in explicit solvent. Analogous analyses for other simulation models are in SI Appendix, Fig. S10. (E) Representative structures from the CHARMM simulations overlaid on the engrailed structure with DNA using the backbone atoms of the three helices for alignment.

Altered Sequence Preferences of Disease-Causing Variants. The homolog-specific differences that we observed even among TFs that read DNA using identical amino acid residues motivated us to explore the impact of TF variants observed in populations. Recently, DPIs of 41 TFs and their alleles, including many disease-causing variants, became publicly available (24). We used DiSEL to compare these TFs against all of their tested alleles (129 alleles) (Fig. 1C and Dataset S7). While variants that showed drastic alterations in DNA binding preferences were readily identified, many alleles exhibited binding profiles that were similar but nonidentical to the reference protein. These nuanced shifts in specificity, as highlighted in the three examples below, were identified by DiSEL. In the first example, the DNA interactomes of HOXD13 protein and its allele Q325R, a variant that is causally linked to syndactyly type V and a brachydactyly-syndactyly syndrome (50), were examined by DiSEL (Fig. 7); 5′GTAAA3′ and 5′GTACA3′, which are two 5-mers that display greater preference for HOXD13 and Q325R allele, respectively, were identified by DiSEL analysis. ROCs were plotted for ChIP-seq peaks with and without 5′GTAAA3′ or 5′GTACA3′ sequences (51). The DNA interactome for Q325R allele better annotates ChIP-seq peaks [AUC for HOXD13 REF (reference) is 0.616 vs. 0.721 for Q325R]. Excluding peaks bearing 5′GTAAA3′ has a minor impact, but excluding peaks bearing 5′GTACA3′ dramatically reduced the gains (AUC for HOXD13 REF is 0.656 vs. 0.692 for Q325R) (Fig. 7D and SI Appendix, Fig. S11). This analysis supports the pathological importance of the 5′GTACA3′ sequence identified by DiSEL analysis.

In the case of the R90W allele of CRX (linked to the disease Leber Congenital Amaurosis 7), DiSEL revealed minimal impact on binding to sites with the 5′GGATTA3′ core motif (the innermost zero-mismatch ring of DiSEL is shown in SI Appendix, Fig. S12). In contrast, the R90W allele showed a precipitous loss of binding to submaximal sites with a single mismatch to the 5′GGATTA3′ core. A focus on consensus motifs and high-affinity binding sites would fail to identify the disease-causing loss of

binding to submaximal sites by this allele of CRX. The final example of such subtle but relevant differences between two variants of VSX1 is displayed in SI Appendix, Fig. S13 (VSX1.REF vs. VSX1.Q175H). We highlight this particular pairwise comparison, because the two protein variants display a remarkably high correlation across the affinity spectrum (r² = 0.985), and yet, submaximal sites preferred by one allele over the other are readily identified by an algorithmic examination of the DiSEL plots. We map these differentially preferred sites onto scatterplots that are commonly used to visualize allele-preferred binding sites. The robustness of the algorithmic examination is such that even subtle preferences become readily apparent and can be rank ordered.

Fig. 7. DiSEL to compare interactomes of HOXD13 protein (Left) and its allele Q325R (Right); 8-mer E-score PBM data are used for comparison. (A) SELs for HOXD13 REF (reference protein) and Q325R allele using seed motif 5′ATAAA3′. (B) DiSELs [(Left) HOXD13 REF over HOXD13 Q325R; (Right) HOXD13 Q325R over HOXD13 REF] with seed motif 5′ATAAA3′. (C) Differences are identified through automated analysis, and DiSELs are highlighted in the scatterplot. (D) ROC curves mapping DPIs of HOXD13 protein and its allele Q325R to ChIP-seq peaks of HOXD13 in the chicken mesenchymal stem cell line (without 5′GTAAA3′ and 5′GTACA3′). ROCs reflect binding preference of 5′GTACA3′ sequence for HOXD13 Q325R allele over HOXD13 REF in vivo.

Discussion

The past decade has witnessed rapid expansion in the development of experimental and computational methods to comprehensively map DNA and RNA recognition properties of proteins and small molecules (12, 25, 52). The information that emerges from various experimental platforms or computational analyses, while remarkably similar at the most robust cognate sites, does not always conform at submaximal affinity ranges. It is increasingly apparent that submaximal sites have been retained during evolution to optimize differential regulation and combinatorial control over different (8–10). We developed DiSEL to identify such submaximal sites and to compare different experimental and computational methods. DiSEL permits an unbiased and unsupervised comparison of the entire DPI obtained by different experimental methods. Moreover, using motifs identified by different computational methods as an organizing “seed” permits a comprehensive view of how well a given method captures the binding profile of the entire interactome. Applying DiSEL to five prominent experimental platforms shows that array-based methods, such as HiTS-FLIP, CSI, and PBM, provide the most comprehensive view of the specificity and affinity landscape of a given DNA binding protein or small molecule. Among sequencing-based approaches, such as HT-SELEX, SELEX-seq, Spec-seq, and Bind-n-Seq, the specificity landscapes immediately make it apparent that early rounds of enrichment capture a wider range of cognate sites and that increasing rounds of enrichment yield high-affinity motifs at the expense of submaximal sites. Circumventing the loss of submaximal affinity sites by sequencing the first round, as is done in HT-SELEX (31), is also fraught with the challenge of sifting through and identifying bona fide low-affinity cognate sites from the much larger pool of noncognate “encounter complexes” that are also captured under those conditions. This is not surprising given that electrostatic affinity for any DNA fragment of sufficient length is typically 10⁻⁶ M, and binding to a library bearing 10¹⁵ different sequence permutations is typically performed at these concentrations. In this context, DiSEL provides a facile and valuable approach to rapidly identify clusters of submaximal cognate sites from a larger pool of sequences captured in early rounds of sequencing-based approaches. Beyond these platforms, DiSEL can be applied to compare a range of experimental methods that provide data-rich protein–nucleic acid interactomes, and the approach can identify differences that emerge from platform-specific biases or from genuine differences in recognition properties of DNA binders.

In addition, our approach enables the evaluation of motifs returned by burgeoning varieties of computational methods. More inclusive or more constrained motifs capture different features of the entire interactome. Rather than relying on multiple different algorithms and manual “intuition” to identify the appropriate motif, SEL/DiSEL can provide a more systematic path to comparing and winnowing k-mers to select a set of sequences that capture the full spectrum of binding sites preferred by a given protein or small molecule. While motifs aggregate related sequences, obfuscate the role of flanking sequences or subtle variations within binding sites, and overlook relevant low-affinity sites, using all of the k-mers within an interactome may lead to the inclusion of platform-dependent biases, overfitting of data, and other sources of experimental error. Thus, SEL/DiSEL provides a balance between using consensus motifs and using the entire set of k-mers of the DPI.

In this context, it is important to note that, at the low-affinity range, distinguishing a bona fide submaximal site from a nonspecific site is nearly irresolvable. The challenge of this task is especially apparent in scatterplots, where arbitrary cutoffs are used to anoint sites as specific or discard them as nonspecific sites. The key advantage offered by our approach is that DiSEL clusters related sequences that might individually be indistinguishable from experimental noise. Clustering reveals those sequences with near-cognate motifs that recur in the moderate- to low-affinity range. This greatly enables the identification of low-affinity binding sites, and SEL/DiSEL plots extend well beyond PWM-based motifs in providing a comprehensive but

intuitive view of complex binding landscapes of DNA/RNA binding molecules.

In essence, DiSEL analysis reliably identified differential preferences to submaximal sites between different alleles of proteins, several of which are disease-causing point mutations that show no alteration in their affinity for consensus motifs. Additional application of DiSEL to homologous TFs highlighted the differential dependence on positions within high- and medium-affinity binding sites. Importantly, the results highlight the importance of nonobvious allosteric contributors in conferring differential sequence specificity at submaximal affinity sites. While widely overlooked, such allosteric effects likely contribute to sequence specificity beyond current models of direct and indirect readout of DNA sequence and shape. Systematic mutational analysis of a set of noncontacting amino acids identified additional positions that would perturb DNA binding (45). While MD simulations were used to qualitatively rationalize the importance of protein allostery to DNA binding affinity, it is worthwhile to pursue free energy simulations in the future to further dissect the energetic contributions to such allosteric effects; a deeper mechanistic understanding will inspire strategies for modulating binding specificities of TFs using, for example, small molecules.

As high-resolution DNA interactomes of thousands of TFs are being reported, we provide the DiSEL-generating software to rapidly evaluate and identify differences as well as commonalities in specificity preferences of alleles of the same protein and closely related TFs. We anticipate that such interactome-based evaluations will provide unprecedented insights into how submaximal affinity sites are differentially recognized by related or unrelated TFs. This, in turn, will provide a far more effective means to annotate genomes and unmask subtle variations that play a quintessential role in selective binding by specific TFs in normal regulation as well as in diseased states.

While DiSEL analysis provides an important means to understand complex specificity landscapes, the approach faces challenges in encapsulating different modes of sequence recognition by a given protein: for example, the ability of proteins to form homo- or heterodimers or higher-order oligomers with different spacing configurations and half-site orientations. These challenges guide our future efforts in the development of a comprehensive view of molecular forces that sculpt sequence specificities of proteins and small molecules (6, 19, 53, 54).

Methods

DiSELs. A DiSEL displays the difference between two DPIs. By keeping the same arrangement as an SEL, differences between two DPIs are readily identified. Before generating a DiSEL, the two DPIs are rescaled by first subtracting their respective mean intensity and then dividing all binding intensities by the maximum binding intensity. Differences in binding intensities between any two DPIs are calculated by subtracting the corresponding intensities.

Data Processing. To enable comparisons using SELs/DiSELs, DPI data for different TFs were rescaled to have mean = 0 and maximum binding intensity = 1. To get box plots, SDs for DPI data were used as one. Scatterplots for Fig. 2 were obtained by quantile normalizing the DPI data obtained from the compared experimental platform (CSI, MITOMI, HiTS-FLIP, and HT-SELEX) against the DPI data from PBM for the same protein. All SELs, DiSELs, and scatterplots were made using MATLAB.

Automated Peak Finding Algorithm. There are two kinds of peaks picked by the program. (i) Mismatch peak. These are the peaks that are present in mismatch rings (>0) generally, where median intensity of all sequences with the same mismatch to the seed sequence is X units higher (or lower) than Y, where X = user-defined mismatch peak cutoff percentage × maximum absolute intensity for the whole SEL/DiSEL and Y = 0 for DiSEL and median intensity of the whole mismatch ring for SEL. (ii) Flanking peak. These are peaks present among all of the sequences having the same mismatch to the seed. They are calculated as the sequence with Z-unit higher intensity than median intensity of sequences having the same mismatch to the seed. Z = user-defined flanking peak cutoff percentage × maximum absolute intensity for the whole SEL/DiSEL.

ROC Curve. ChIP-seq peaks from previously published datasets were used as a true positive set, whereas two scrambled versions of DNA of each positive peak were used to make a true negative set. The fractions of regions in the positive vs. negative sets with scores above a varying DPI intensity cutoff were plotted to generate ROC curves (true positive rate vs. false positive rate). ROC curves and heat maps were generated in MATLAB. To plot ROCs with a specific sequence, a peak set is created with ChIP-seq peaks containing the sequence of interest or its reverse complement, and then, ROCs were plotted as described above. Similarly, the set of peaks left behind is used to plot ROCs without that sequence.

Lhx2 and Lhx4 Homology Models. Homology models for Lhx2 and Lhx4 were built by threading onto the Drosophila melanogaster engrailed protein bound to DNA [Protein Data Bank (PDB) ID code 3HDD] using Phyre2 (55). Both models were predicted with >99.6% confidence using default settings. Lhx2 and Lhx4 models were superimposed to the structure of engrailed bound to DNA (3HDD) using PyMOL (56).

MD Simulations. Due to the potential sensitivity of protein structural dynamics to computational model, it is generally important to repeat the simulation with different force fields and simulation protocols. Therefore, we have compared the structural dynamics of Lhx2 and Lhx4 using three distinct models: CHARMM force field with explicit solvent and AMBER force field with both explicit solvent and implicit solvent (Generalized Born) models; the explicit solvent models are generally more reliable, while the implicit solvent model can be simulated to substantially longer timescales (1 μs vs. 200 ns). The results from the three distinct simulations indeed differ in fine details; the robust trend, however, is that the orientations of the three helices are different between Lhx2 and Lhx4. Therefore, by integrating the results from three different simulations, the conclusion that subtle sequence variations away from the DNA binding site lead to considerable differences in the orientation and flexibility of the helices in Lhx2 and Lhx4 is further supported.

Initial structures were created in Modeler Homology Modeling Package (57) using structures of homologous homeodomains. After sequence alignment, Lhx2 and Lhx4 structures are built based on the template homeodomain structures obtained from the PDB (with PDB ID codes 3A02, 3A03, 1ENH, and 3K2A) (58). Next, we sought to perform MD simulations long enough to ensure diverse sampling and with a variety of simulation models to ensure that any conclusions are not model dependent. Simulations used implicit or explicit solvation and either the AMBER ff14SB force field (47) or the CHARMM36 force field (48, 49). We used the Amber v14 Molecular Dynamics package (59, 60) for simulations with the AMBER force field and OpenMM (61) for those with the CHARMM force field. Both programs offer graphics processing unit (GPU)-accelerated simulations, allowing us to reach timescales of hundreds of nanoseconds to microseconds. Details of the three simulation schemes are as follows.

The implicit solvent simulations used the AMBER ff14SB force field (47) and generalized Born model (62). Production simulations were carried out for a minimum of 1,000 ns using 1-fs time steps. Langevin dynamics were applied using a collision frequency of 20 ps⁻¹. The temperature was maintained at 300 K. The SHAKE algorithm was applied to bonds involving hydrogens with a tolerance of 10⁻⁵ Å. The nonbonded cutoff was set to 9,999 Å. The maximum distance between atom pairs for Born radii calculations was chosen as 12 Å.

The same AMBER force field was also used in simulations with explicit TIP3P (transferable intermolecular potential with 3 points) water (63) in the NPT (constant particle number, constant pressure, and constant temperature) ensemble. The simulations used periodic boundary conditions (PBCs) in a cubic system with the edge of the box at least 10 Å from the protein. The system was neutralized with chloride ions resulting in ca. 16,400 atoms, depending on protein. Simulations were conducted at 300 K with 1-fs time steps using SHAKE to constrain bonds to hydrogen.

Finally, analogous simulations were conducted using the CHARMM36 force field (48, 49) with explicit TIP3P water. These used a similar PBC setup as those with AMBER, and production simulations were conducted in both the NVT (constant particle number, constant volume, and constant temperature) ensemble and the NPT ensemble at 300 K with 1-fs time steps using SHAKE to constrain bonds to hydrogen. Because the two ensembles were nearly indistinguishable, we show only the NVT results.

Software Description. MATLAB code for SEL/DiSEL software is provided here. Details of implementation are available in the text. The software uses information from US patents US 20100159457 A1 and US 20100160178 A1; thus,

additional use of this software should be in compliance with these patents. Code can be modified and used as long as change is stated clearly and referenced to this publication.

Code Availability. The computer code and the data used for the paper can be downloaded from the website https://ansarilab.biochem.wisc.edu/computation.html.

ACKNOWLEDGMENTS. We thank current and former members of the laboratory of A.Z.A. for helpful discussions and Laura Vanderploeg for help with the artwork. This study was supported by NIH Grant GM120625 and the W. M. Keck Medical Research Award (to P.R. and A.Z.A.). J.A.R.-M. was supported by NIH Grant National Human Genome Research Institute Training Grant of the Genome Sciences Training Program T32 HG002760. The MD simulations work was supported in part by National Science Foundation (NSF) Grants CHE-1300209 (to Q.C.) and CHE-1829555 (to Q.C.). Computational resources from the Extreme Science and Engineering Discovery Environment, which is supported by NSF Grant OCI-1053575, are also greatly appreciated; computations are also supported in part by NSF Instrumentation Grant CHE-0840494 (to the Department of Chemistry), and the GPU computing facility was supported by Army Research Office Grant W911NF-11-1-0327.
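As an illustrative aside to the Methods (DiSELs and Automated Peak Finding Algorithm), the rescaling, subtraction, and mismatch-peak test might be sketched in Python as below. This is a sketch under stated assumptions, not the distributed MATLAB software: the function names and toy intensities are hypothetical, only the forward strand is scanned, and "the same mismatch to the seed" is approximated by grouping k-mers on their best-matching seed-length window.

```python
from statistics import mean, median

def rescale(dpi):
    """Subtract the mean intensity, then divide by the maximum intensity,
    giving mean 0 and maximum 1 (assumes intensities are not all equal)."""
    mu = mean(dpi.values())
    centered = {kmer: value - mu for kmer, value in dpi.items()}
    top = max(centered.values())
    return {kmer: value / top for kmer, value in centered.items()}

def disel(dpi_a, dpi_b):
    """Difference landscape: rescaled A minus rescaled B on shared k-mers."""
    a, b = rescale(dpi_a), rescale(dpi_b)
    return {kmer: a[kmer] - b[kmer] for kmer in a.keys() & b.keys()}

def hamming(x, y):
    """Number of mismatched positions between equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))

def best_window(kmer, seed):
    """Seed-length window of `kmer` closest to the seed (assumed grouping key)."""
    n = len(seed)
    windows = [kmer[i:i + n] for i in range(len(kmer) - n + 1)]
    return min(windows, key=lambda w: hamming(w, seed))

def mismatch_peaks(landscape, seed, cutoff_fraction, baseline=0.0):
    """Report mismatch groups whose median differs from the baseline Y by at
    least X = cutoff_fraction * max |intensity|; Y = 0 suits a DiSEL."""
    x = cutoff_fraction * max(abs(v) for v in landscape.values())
    groups = {}
    for kmer, value in landscape.items():
        variant = best_window(kmer, seed)
        if hamming(variant, seed) > 0:  # skip the zero-mismatch ring
            groups.setdefault(variant, []).append(value)
    medians = {variant: median(vals) for variant, vals in groups.items()}
    return {v: m for v, m in medians.items() if abs(m - baseline) >= x}

# Toy usage with made-up intensities for two hypothetical proteins.
dpi_a = {"TAATG": 4.0, "TAATC": 1.0, "TAATT": 1.0}
dpi_b = {"TAATG": 1.0, "TAATC": 4.0, "TAATT": 1.0}
difference = disel(dpi_a, dpi_b)
print(mismatch_peaks(difference, seed="TAATT", cutoff_fraction=0.5))
```

For an SEL rather than a DiSEL, the Methods set Y to the median intensity of the whole mismatch ring, which would be passed here through the `baseline` argument once per ring.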

1. Dunham I, et al.; ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57–74.
2. Kittler R, et al. (2013) A comprehensive nuclear receptor network for breast cancer cells. Cell Rep 3:538–551.
3. Xie D, et al. (2013) Dynamic trans-acting factor colocalization in human cells. Cell 155:713–724.
4. Jolma A, et al. (2013) DNA-binding specificities of human transcription factors. Cell 152:327–339.
5. Weirauch MT, et al. (2014) Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158:1431–1443.
6. Rodríguez-Martínez JA, Reinke AW, Bhimsaria D, Keating AE, Ansari AZ (2017) Combinatorial bZIP dimers display complex DNA-binding specificity landscapes. eLife 6:e19272.
7. Slattery M, et al. (2011) Cofactor binding evokes latent differences in DNA binding specificity between Hox proteins. Cell 147:1270–1282.
8. Ptashne M (2004) A Genetic Switch Phage Lambda Revisited (Cold Spring Harbor Lab Press, Cold Spring Harbor, NY).
9. Crocker J, et al. (2015) Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell 160:191–203.
10. Tanay A (2006) Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res 16:962–972.
11. Farley EK, et al. (2015) Suboptimization of developmental enhancers. Science 350:325–328.
12. Stormo GD, Zhao Y (2010) Determining the specificity of protein-DNA interactions. Nat Rev Genet 11:751–760.
13. Isakova A, et al. (2017) SMiLE-seq identifies binding motifs of single and dimeric transcription factors. Nat Methods 14:316–322.
14. Zhao Y, Stormo GD (2011) Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat Biotechnol 29:480–483.
15. Morris Q, Bulyk ML, Hughes TR (2011) Jury remains out on simple models of transcription factor specificity. Nat Biotechnol 29:483–484.
16. Weirauch MT, et al.; DREAM5 Consortium (2013) Evaluation of methods for modeling transcription factor sequence specificity. Nat Biotechnol 31:126–134.
17. Ruan S, Swamidass SJ, Stormo GD (2017) BEESEM: Estimation of binding energy models using HT-SELEX data. Bioinformatics 33:2288–2295.
18. Warren CL, et al. (2006) Defining the sequence-recognition profile of DNA-binding molecules. Proc Natl Acad Sci USA 103:867–872.
19. Carlson CD, et al. (2010) Specificity landscapes of DNA binding molecules elucidate biological function. Proc Natl Acad Sci USA 107:4544–4549.
20. Berger MF, et al. (2006) Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat Biotechnol 24:1429–1435.
21. Badis G, et al. (2009) Diversity and complexity in DNA recognition by transcription factors. Science 324:1720–1723.
22. Jolma A, et al. (2010) Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res 20:861–873.
23. Nutiu R, et al. (2011) Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nat Biotechnol 29:659–664.
24. Barrera LA, et al. (2016) Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science 351:1450–1454.
25. Alipanahi B, Delong A, Weirauch MT, Frey BJ (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33:831–838.
26. Setty M, Leslie CS (2015) SeqGL identifies context-dependent binding signals in genome-wide regulatory element maps. PLoS Comput Biol 11:e1004271.
27. Tietjen JR, Donato LJ, Bhimsaria D, Ansari AZ (2011) Sequence-specificity and energy landscapes of DNA-binding molecules. Methods Enzymol 497:3–30.
28. Puckett JW, et al. (2007) Quantitative microarray profiling of DNA-binding molecules. J Am Chem Soc 129:12310–12319.
29. Erwin GS, et al. (2016) Synthetic genome readers target clustered binding sites across diverse chromatin states. Proc Natl Acad Sci USA 113:E7418–E7427.
30. Campbell ZT, et al. (2012) Cooperativity in RNA-protein interactions: Global analysis of RNA binding specificity. Cell Rep 1:570–581.
31. Zhao Y, Granas D, Stormo GD (2009) Inferring binding energies from selected binding sites. PLoS Comput Biol 5:e1000590.
32. Zykovich A, Korf I, Segal DJ (2009) Bind-n-Seq: High-throughput analysis of in vitro protein-DNA interactions using massively parallel sequencing. Nucleic Acids Res 37:e151.
33. Fordyce PM, et al. (2010) De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis. Nat Biotechnol 28:970–975.
34. Noyes MB, et al. (2008) Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites. Cell 133:1277–1289.
35. Badis G, et al. (2008) A library of yeast transcription factor motifs reveals a widespread function for Rsc3 in targeting nucleosome exclusion at promoters. Mol Cell 32:878–887.
36. Stormo GD, Zuo Z, Chang YK (2015) Spec-seq: Determining protein-DNA-binding specificity by sequencing. Brief Funct Genomics 14:30–38.
37. Bailey TL, Johnson J, Grant CE, Noble WS (2015) The MEME suite. Nucleic Acids Res 43:W39–W49.
38. Sokolova M, et al. (2017) Genome-wide screen of cell-cycle regulators in normal and tumor cells identifies a differential response to nucleosome depletion. Cell Cycle 16:189–199.
39. Gordân R, et al. (2013) Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Rep 3:1093–1104.
40. Keleş S, Warren CL, Carlson CD, Ansari AZ (2008) CSI-Tree: A regression tree approach for modeling binding properties of DNA-binding molecules based on cognate site identification (CSI) data. Nucleic Acids Res 36:3171–3184.
41. Abe N, et al. (2015) Deconvolving the recognition of DNA shape from sequence. Cell 161:307–318.
42. Berger MF, et al. (2008) Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell 133:1266–1276.
43. Jiang B, Liu JS, Bulyk ML (2013) Bayesian hierarchical model of protein-binding microarray k-mer data reduces noise and identifies transcription factor subclasses and preferred k-mers. Bioinformatics 29:1390–1398.
44. Folgueras AR, et al. (2013) Architectural niche organization by LHX2 is linked to hair follicle stem cell function. Cell Stem Cell 13:314–327.
45. Sato K, Simon MD, Levin AM, Shokat KM, Weiss GA (2004) Dissecting the engrailed homeodomain-DNA interaction by phage-displayed shotgun scanning. Chem Biol 11:1017–1023.
46. Nakagawa S, Gisselbrecht SS, Rogers JM, Hartl DL, Bulyk ML (2013) DNA-binding specificity changes in the evolution of forkhead transcription factors. Proc Natl Acad Sci USA 110:12349–12354.
47. Maier JA, et al. (2015) ff14SB: Improving the accuracy of protein side chain and backbone parameters from ff99SB. J Chem Theory Comput 11:3696–3713.
48. MacKerell AD, et al. (1998) All-atom empirical potential for molecular modeling and dynamics studies of proteins. J Phys Chem B 102:3586–3616.
49. Best RB, et al. (2012) Optimization of the additive CHARMM all-atom protein force field targeting improved sampling of the backbone φ, ψ and side-chain χ(1) and χ(2) dihedral angles. J Chem Theory Comput 8:3257–3273.
50. Zhao X, et al. (2007) Mutations in HOXD13 underlie syndactyly type V and a novel brachydactyly-syndactyly syndrome. Am J Hum Genet 80:361–371.
51. Ibrahim DM, et al. (2013) Distinct global shifts in genomic binding profiles of limb malformation-associated HOXD13 mutations. Genome Res 23:2091–2102.
52. Pelossof R, et al. (2015) Affinity regression predicts the recognition code of nucleic acid-binding proteins. Nat Biotechnol 33:1242–1249.
53. Siggers T, Gordân R (2014) Protein-DNA binding: Complexities and multi-protein codes. Nucleic Acids Res 42:2099–2111.
54. Jolma A, et al. (2015) DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature 527:384–388.
55. Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJE (2015) The Phyre2 web portal for protein modeling, prediction and analysis. Nat Protoc 10:845–858.
56. Schrödinger L (2015) The PyMOL Molecular Graphics System (Schrödinger, New York), Version 2.0.
57. Sali A, Potterton L, Yuan F, van Vlijmen H, Karplus M (1995) Evaluation of comparative protein modeling by MODELLER. Proteins 23:318–326.
58. Bernstein FC, et al. (1978) The protein data bank: A computer-based archival file for macromolecular structures. Arch Biochem Biophys 185:584–591.
59. Case DA, et al. (2014) Amber 14 (University of California, San Francisco).
60. Salomon-Ferrer R, Case DA, Walker RC (2013) An overview of the Amber biomolecular simulation package. Wiley Interdiscip Rev Comput Mol Sci 3:198–210.
61. Eastman P, et al. (2013) OpenMM 4: A reusable, extensible, hardware independent library for high performance molecular simulation. J Chem Theory Comput 9:461–469.
62. Nguyen H, Roe DR, Simmerling C (2013) Improved generalized born solvent model parameters for protein simulations. J Chem Theory Comput 9:2020–2034.
63. Jorgensen WL, Chandrasekhar J, Madura JD, Impey RW, Klein ML (1983) Comparison of simple potential functions for simulating liquid water. J Chem Phys 79:926–935.

10 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1811431115 Bhimsaria et al. Downloaded by guest on September 24, 2021