A General Pairwise Interaction Model Provides an Accurate Description of in Vivo Transcription Factor Binding Sites

A General Pairwise Interaction Model Provides an Accurate Description of In Vivo Transcription Factor Binding Sites Marc Santolini, Thierry Mora, Vincent Hakim* Laboratoire de Physique Statistique, CNRS, Universite´ P. et M. Curie, Universite´ D. Diderot, E´cole Normale Supe´rieure, Paris, France Abstract The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair contributes independently to the transcription factor (TF) binding. However, this description ignores correlations between nucleotides at different positions, and is generally inaccurate: analysing fly and mouse in vivo ChIPseq data, we show that in most cases the PWM model fails to reproduce the observed statistics of TFBSs. To overcome this issue, we introduce the pairwise interaction model (PIM), a generalization of the PWM model. The model is based on the principle of maximum entropy and explicitly describes pairwise correlations between nucleotides at different positions, while being otherwise as unconstrained as possible. It is mathematically equivalent to considering a TF-DNA binding energy that depends additively on each nucleotide identity at all positions in the TFBS, like the PWM model, but also additively on pairs of nucleotides. We find that the PIM significantly improves over the PWM model, and even provides an optimal description of TFBS statistics within statistical noise. The PIM generalizes previous approaches to interdependent positions: it accounts for co-variation of two or more base pairs, and predicts secondary motifs, while outperforming multiple-motif models consisting of mixtures of PWMs. We analyse the structure of pairwise interactions between nucleotides, and find that they are sparse and dominantly located between consecutive base pairs in the flanking region of TFBS. Nonetheless, interactions between pairs of non-consecutive nucleotides are found to play a significant role in the obtained accurate description of TFBS statistics. The PIM is computationally tractable, and provides a general framework that should be useful for describing and predicting TFBSs beyond PWMs. Citation: Santolini M, Mora T, Hakim V (2014) A General Pairwise Interaction Model Provides an Accurate Description of In Vivo Transcription Factor Binding Sites. PLoS ONE 9(6): e99015. doi:10.1371/journal.pone.0099015 Editor: Shoba Ranganathan, Macquarie University, Australia Received December 10, 2013; Accepted May 9, 2014; Published June 13, 2014 Copyright: ß 2014 Santolini et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This study was funded by CNRS institutional funds. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing Interests: The authors have declared that no competing interests exist. * E-mail: [email protected] Introduction binding free energy at thermodynamic equilibrium, provided that the TF concentration is far from saturation [4–6]. Gene regulatory networks are at the basis of our understanding Despite the widespread use and success of PWMs, there is of cell states and of the dynamics of their response to mounting evidence that its central hypothesis of independence environmental cues. Central effectors of this regulation are between positions is not always justified. Several works have Transcription Factors (TFs), which bind on short DNA regulatory reported cases of correlations between nucleotides at different sequences and interact with the transcription apparatus or with positions in TFBSs [7–11]. A systematic in vitro study of 104 TFs histone-modifying proteins to alter target gene expressions [1]. using DNA microarrays has revealed a rich picture of binding The determination of Transcription Factor Binding Sites (TFBSs) patterns [10], including the existence of multiple motifs, strong on a genome-wide scale is thus of central importance, and is the nucleotide position interdependence, and variable spacers between focus of many current experiments [2]. In eukaryotes the TF determining subregions, none of which can be described by binding specificity is only moderate, meaning that a given TF may PWMs. Recently, the in vitro specificity of several hundred human bind to a variety of different sequences in vivo [3]. The collection of and mouse DNA-binding domains was investigated using high- such binding sequences is typically described by a Position Weight throughput SELEX. Correlations between nucleotides were found Matrix (PWM) which simply gives the probability that a particular to be widespread among TFBSs and predominantly located base pair stands at a given position in the TFBS. The PWM between adjacent flanking bases in the TFBS [11]. However, provides a full statistical description of the TFBS collection when TFBSs in vivo are certainly determined in multiple ways besides in there are no correlations between the occurrences of nucleotides at vitro measured TF binding affinities and the extent of correlations different positions. In addition, the PWM description has a simple between their nucleotide positions remains to be fully assessed. biophysical interpretation: it is exact in the special case where On the computational side, a number of probabilistic models base-pairs of the TFBS contribute additively to the TF-DNA have been proposed to describe nucleotide correlations in TFBSs, PLOS ONE | www.plosone.org 1 June 2014 | Volume 9 | Issue 6 | e99015 A General Pairwise Interaction Model generally based on specific simplifying assumptions, such as myogenic cell line (C2C12). For a given TF, the ChIPseq data mutually exclusive groups of co-varying nucleotide positions consist of an ensemble of DNA sequences, the ChIPseq fragments, [8,12–14] or the computationally-friendly probabilistic structures each a few hundred nucleotides long. The exact positions of the of Bayesian networks or Markov chains [11,15–17]. A systematic TFBSs are unknown on each of these sequences. They are and general framework is yet to be applied to account for and to determined as L-mers (we take here L~12) that have a score for analyze the rich landscape of observed TF binding behaviours. the PWM above a given threshold score (here we chose to adjust The recent breakthrough in the experimental acquisition of the threshold score so that 50% of the ChIPseq fragments have at precise, genome-wide TF-bound DNA regions with the ChIPseq least one L-mer above the threshold score -typically, there are one technology offers the opportunity to address these two important or two L-mers above the threshold score on these ChIP fragments). issues. This set of high-scoring L-mers provides a collection of putative BS Here, using a variety of ChIPseq experiments coming both from s for the TF considered. From these, a PWM can be buil t in the fly and mouse, we first show that the PWM model generally does usual way by counting the frequencies of the 4 possible nucleotide not reproduce the observed in vivo TFBS statistics for a majority of s at each position 1,…,L. However, this PWM does not coincide in TFs. This calls for a refinement of the PWM description that general with the one that served to determine the set of putative includes interdependence between nucleotide positions. TFBS s. To determine self-consistently the collection of binding To this purpose, we propose the general Pairwise Interaction sites for a given TF from a collection of ChIPseq fragments, we model (PIM), which generalizes the PWM model by accurately thus iteratively refined the PWM together with the set of putative reproducing pairwise correlations between nucleotides in addition TFBSs in the ChIPseq data (see Figure 1 and Methods for a detailed to position-dependent nucleotide usage. The model derives from description). This process ensured that the frequency of different the principle of maximum entropy, which has been recently nucleotides at a given position in the considered set of binding sites applied with success to a variety of biological problems where (the high-scoring L-mers) was exactly accounted for by the PWM. correlations play an important role, from the correlated activity of We then enquired whether the frequencies of the different neurons [18–21] to the statistics of protein families [22–26] to the binding sites in the set agreed with that predicted by the PWM, as alignment of large animal flocks [27–29]. In a thermodynamical would be the case if the probabilities of observing nucleotides at model of TF-DNA interaction, the PIM amounts to including different positions were independent. Figure 2 shows the results for general effective pairwise interactions between nucleotides [30]. three different TFs, one from each of the three considered The PIM offers a systematic and general computational frame- categories: Twist (Drosophila), Esrrb (mammals, ESC), and MyoD work allowing one to determine and analyze the landscape of TF (mammals, C2C12). The independent PWM model strongly binding in vivo. underestimates the probabilities of the most frequent sequences. We find that the in vivo TFBS statistics and predictability are Moreover, the PWM model does not correctly predict the significantly improved in this

A General Pairwise Interaction Model Provides an Accurate Description of in Vivo Transcription Factor Binding Sites

Regulatory Motifs of DNA

Motif Discovery in Sequential Data Kyle L. Jensen

A New Sequence Logo Plot to Highlight Enrichment and Depletion

Predicting Expression Levels of De NovoProtein Designs in Yeast

Low-N Protein Engineering with Data-Efficient Deep Learning

Lecture 12. Gene Finding & Rnaseq

Interpretable Convolution Methods for Learning Genomic Sequence Motifs

Specificity and Nonspecificity in RNA–Protein Interactions

Ancient Mechanisms for the Evolution of the Bicoid Homeodomain's

Probability and Statistics for Bioinformatics and Genetics Course

Predicting Variation of DNA Shape Preferences in Protein-DNA Interaction in Cancer Cells with a New Biophysical Model

Regulatory Motif Discovery Using Pwms and the Architecture of Eukaryotic Core Promoters