Predicting Transcription Factor Specificity with All-Atom Models Sahand Jamal Rahi1, Peter Virnau2,*, Leonid A

Published online 1 October 2008 Nucleic Acids Research, 2008, Vol. 36, No. 19 6209–6217 doi:10.1093/nar/gkn589 Predicting transcription factor specificity with all-atom models Sahand Jamal Rahi1, Peter Virnau2,*, Leonid A. Mirny1,3 and Mehran Kardar1 1Department of Physics, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA, 2Staudinger Weg 7, Institut fu¨ r Physik, 55099 Mainz, Germany and 3Harvard-MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA Downloaded from https://academic.oup.com/nar/article/36/19/6209/2410008 by guest on 02 October 2021 Received May 4, 2008; Revised August 29, 2008; Accepted September 2, 2008 ABSTRACT needs to have a set of known sites bound by the protein. Given these known sites, the parameters are inferred using The binding of a transcription factor (TF) to a DNA either a widely used Berg–von Hippel approximation (1), operator site can initiate or repress the expression or by other recently proposed methods (2–4). This consti- of a gene. Computational prediction of sites recog- tutes a physical basis for many widely used bioinformatics nized by a TF has traditionally relied upon knowl- techniques that rely on a particular form of the energy edge of several cognate sites, rather than an ab function known as a position-specific weight matrix initio approach. Here, we examine the possibility (PWM). of using structure-based energy calculations that All these methods require a priori knowledge of the require no knowledge of bound sites but rather sites (or at least longer sequences containing these sites) start with the structure of a protein–DNA complex. bound by the protein. These data are available for only a We study the PurR Escherichia coli TF, and explore small number of DNA-binding proteins. For many DNA- to which extent atomistic models of protein–DNA binding proteins, however, their sequence of amino acids complexes can be used to distinguish between cog- is well known. Sufficiently high-evolutionary conservation of DNA-binding domains, and the availability of crystal nate and noncognate DNA sites. Particular empha- structures for many of them, makes it possible to construct sis is placed on systematic evaluation of this 3D models for a broad range of DNA-binding proteins. approach by comparing its performance with bio- Can such protein structures be used to predict sites recog- informatic methods, by testing it against random nized by a DNA-binding protein? The basic procedure for decoys and sites of homologous TFs. We also structure-based methods is to compute the binding energy examine a set of experimental mutations in both of the protein–DNA complex. The structure of the com- DNA and the protein. Using our explicit estimates plex for an arbitrary DNA sequence can be modeled by of energy, we show that the specificity for PurR is replacing (‘mutating’) the DNA sequence in the protein dominated by direct protein–DNA interactions, and structure containing its cognate site, followed by energy weakly influenced by bending of DNA. minimization and/or molecular dynamics (MD) to allow the protein–DNA complex to adjust to the new DNA sequence. After several minimization steps, the interaction energy can be calculated using either standard molecular INTRODUCTION mechanics force fields like AMBER (5) or CHARMM (6) Binding of cognate sites of DNA is central to many essen- with an implicit solvent, or a knowledge-based force field tial biological processes. Most of DNA-binding proteins optimized for the particular complex (7). have the ability to recognize and tightly bind cognate Several recent studies have significantly elaborated DNA sequences (sites). To find sites bound by a particular upon the above procedure. Lafontaine and Lavery (8), DNA-binding protein, one needs to calculate the free for example, pioneered a very efficient process termed energy of binding for the protein and possible DNA ADAPT in which they replace the DNA in the structure sites and then select sites that provide sufficiently low- by a ‘multicopy’ or ‘average’ piece of DNA. The structure binding energy. A widely used approach to find sites for is only minimized once after which the energy of the com- a DNA-binding protein is to assume a form of the energy plex is measured for all possible DNA sequences in place function, infer its parameters and to then calculate the of the average piece. From this, only the energy of the energy for all sites in a genome. To infer parameters one unbound DNA must be subtracted. The unbound protein *To whom correspondence should be addressed. Tel: +49 6131 392 3646; Fax: +49 6131 392 5441; Email: [email protected] ß 2008 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. 6210 Nucleic Acids Research, 2008, Vol. 36, No. 19 energy is the same for all DNA sequences and hence irrel- above test compare with that of a motif obtained from the evant for comparisons. This approach is so efficient that set of cognate sites by bioinformatic methods? By calcu- all possible sequences (4N for N bases) for short DNA lating binding energies we can also answer the following operator sites can be evaluated. Their results successfully questions which are not addressable by bioinformatic identify the experimental consensus sequence for a variety means: (i) what is the relative importance to recognition of DNA-binding proteins (9,10), and the ordering of bind- of direct binding energies to indirect factors such as DNA ing free energies for DNA point mutations in several com- bending? (ii) can the computed results for ÁGbinding of plexes (9). In this context, it was also noted that the actual mutations in DNA, and more importantly in the protein, binding energy computed via minimizations is incorrect be compared to experiment? [Bioinformatics data can also and cannot be compared to experiments quantitatively. be converted to compute ÁGbinding for DNA, but not pro- Endres et al. (11) allowed protein side chains to tein mutations as in ref. (1–4)]. explore rotamer conformations in their study of Zif268. To test the ability of the force field to discriminate Interestingly, the agreement with experiments becomes between cognate sites and random decoys, we developed Downloaded from https://academic.oup.com/nar/article/36/19/6209/2410008 by guest on 02 October 2021 worse when rotamers are considered, which points to a a procedure to speed up calculations and the screening of potential bias of the approach towards sequences similar many sites. We find that a single cognate site can be dis- to the one on which the underlying experimental structure criminated from about 7000 random decoy sites. While is based. Morozov et al. (12) predict binding affinities such performance is impressive, it is insufficient to detect using energy measurements as well, they keep their struc- sites from the whole bacterial genome. In the comparisons tures rigid or allow them to relax and compare the two of our results with experimental binding free energies for approaches. However, instead of considering their binding DNA and amino acid point mutations, we obtain the cor- energies to be approximately equal to free energies as we rect order of binding free energies of the mutants. do, they fit their energies to a few experimentally known free energies. They assign different weights to the energies involved, e.g. the Lennard-Jones or the electrostatic MATERIALS AND METHODS energy, and optimize the weights so that the sum matches The change in free energy due to protein–DNA interac- the free energy. They proceed to study several transcrip- tions can be decomposed as tion factors (TFs) and even find consensus sequence logos for two TFs whose structures they construct by homology Gbinding ¼ Gprotein-- DNA complex modeling. In recent work, Donald et al. (7) focus on direct À Gfree DNA : 1 protein–DNA interactions. They study and compare a number of potentials and propose some that outperform À Gfree protein the standard Amber potential. All these efforts represent Clearly, Gbinding depends on both the particular DNA pioneering work in the emerging field of structure-based sequence and the protein. In order to simplify the problem predictions of TF specificity. from a computational point of view, it is often assumed Here, we explore whether widely available MD force that the differences in Gbinding for two different DNA fields can be used to calculate the binding free energy sequences are dominated by differences in enthalpy. from all-atom models of the protein–DNA complex. In Entropic contributions are usually ignored since the contrast to some of the previous studies, we: (i) assess entropy losses upon binding for both the fragment of the power and limitations of the method in dealing with 6 DNA and the protein are likely not to depend significantly the roughly 10 decoy sites of bacterial genomes (by com- on the DNA sequence; hence puting binding energies for representative mutations and assembling an energy-based weight matrix (EBWM), Gbinding Ebinding which is then used for the task); (ii) explore whether ¼ EproteinÀDNA complex energy-minimization methods utilizing MD force fields : 2 can predict protein–DNA binding when DNA sites, or À Efree ðstraightÞ DNA the protein, are mutated. À Efree ðunboundÞ protein For our study, we focus on the purine repressor, PurR, from Escherichia coli, a well-characterized TF with more Furthermore, if DNA sequences bound by the same than 20 known sites in the genome.

Predicting Transcription Factor Specificity with All-Atom Models Sahand Jamal Rahi1, Peter Virnau2,*, Leonid A

Genes of the Escherichia Coli Pur Regulon Are Negatively Controlled

Adenylosuccinate Lyase of Bacillus Subtilis Regulates the Activity of The

PURB Antibody (C-Term) Affinity Purified Rabbit Polyclonal Antibody (Pab) Catalog # AP12051B

Dissertation

Accumulated Degeneration of Transcriptional Regulation Contributes to Disease Development and Detrimental Clinical Outcomes of Alzheimer’S Disease

The Cytogenetics of Hematologic Neoplasms 1 5

Dynamic Changes in RNA–Protein Interactions and RNA Secondary Structure in Mammalian Erythropoiesis

A Catalogue of Stress Granules' Components

Identification of High-Copy-Number Inhibitors

(PURB) (NM 033224) Human Untagged Clone Product Data

Product Data Sheet

(Rgla) and Mcrb (Rglb) Loci of Escherichia Coli K-12