Using Protein Design Algorithms to Understand the Molecular Basis of Disease Caused by Protein–DNA Interactions: the Pax6 Example

Using protein design algorithms to understand the molecular basis of disease caused by protein–DNA interactions: the Pax6 example The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Alibes, Andreu et al. "Using Protein Design Algorithms to Understand the Molecular Basis of Disease Caused by protein-DNA Interactions: The Pax6 Example.” Nucleic Acids Research 38.21 (2010): 7422–7431. Web. As Published http://dx.doi.org/10.1093/nar/gkq683 Publisher Oxford University Press Version Final published version Citable link http://hdl.handle.net/1721.1/72332 Terms of Use Creative Commons Attribution Non-Commercial Detailed Terms http://creativecommons.org/licenses/by-nc/2.5 7422–7431 Nucleic Acids Research, 2010, Vol. 38, No. 21 Published online 4 August 2010 doi:10.1093/nar/gkq683 Using protein design algorithms to understand the molecular basis of disease caused by protein–DNA interactions: the Pax6 example Andreu Alibe´ s1,*, Alejandro D. Nadra1, Federico De Masi2, Martha L. Bulyk2,3,4, Luis Serrano1,5 and Franç ois Stricher1 1EMBL/CRG Systems Biology Research Unit, Center for Genomic Regulation, UPF, Barcelona, Spain, 2Division of Genetics, Department of Medicine, 3Department of Pathology, Brigham and Women’s Hospital, 4Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, MA 02115, USA and 5ICREA professor, Center for Genomic Regulation, UPF, Barcelona, Spain Received May 30, 2010; Revised July 14, 2010; Accepted July 15, 2010 ABSTRACT of accuracy as protein stability, or protein–protein Quite often a single or a combination of protein interactions. mutations is linked to specific diseases. However, distinguishing from sequence information which INTRODUCTION mutations have real effects in the protein’s function is not trivial. Protein design tools are The paired box gene 6 (PAX6) is a member of the Pax commonly used to explain mutations that affect gene family of transcription factors (TFs) and it is mainly protein stability, or protein–protein interaction, but involved in tissue specification during early development (1). Pax6 is required for the multipotent state of retinal not for mutations that could affect protein–DNA progenitor cells (2) and is usually related to the develop- binding. Here, we used the protein design algorithm ment of the eyes and sensory organs (3,4). Mutations in FoldX to model all known missense mutations in the this TF are linked to eye diseases such as aniridia, foveal paired box domain of Pax6, a highly conserved tran- hypoplasia, cataracts and nystagmus (5). Because of its scription factor involved in eye development and in importance in human ocular disease and the vast several diseases such as aniridia. The validity of amount of biological information regarding this protein, FoldX to deal with protein–DNA interactions was a database of disease-related mutations of PAX6 is avail- demonstrated by showing that high levels of able (6). accuracy can be achieved for mutations affecting Most of the time, a specific disease can be described as these interactions. Also we showed that protein- the consequence of protein mutations, being a single one design algorithms can accurately reproduce experi- or a combination of several. However, establishing the exact effect on the function of protein based on its mental DNA-binding logos. We conclude that 88% of sequence alone is not trivial. The effects of mutations on the Pax6 mutations can be linked to changes in protein stability and protein–protein interaction can be intrinsic stability (77%) and/or to its capabilities to reasonably well predicted using protein design tools, as bind DNA (30%). Our study emphasizes the import- we previously demonstrated in the analysis of the relation- ance of structure-based analysis to understand the ship between the stability changes of the human phenyl- molecular basis of diseases and shows that protein– alanine hydroxylase and phenylketonuria (7). Similarly, DNA interactions can be analyzed to the same level mutations favoring protein aggregation or amyloid *To whom correspondence should be addressed. Tel: +34 93 316 0258; Fax: +34 93 316 0099; Email: [email protected] Correspondence may also be addressed to Franç ois Stricher. Tel: +34 93 316 0258; Fax: +34 93 316 0099; Email: [email protected] Present addresses: Alejandro D. Nadra, Departamentos de Fisiologı´ a, Biologı´ a Molecular y Celular, y Quı´ mica Biolo´ gica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Ciudad Universitaria (1428), Buenos Aires, Argentina. Federico De Masi, Department of Systems Biology, Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby 2800 Denmark. The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. ß The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2010, Vol. 38, No. 21 7423 disease in unstructured protein regions can be accurately and mutations in the interface of DNA that might have an predicted (8,9). However, similar studies have not been effect on the Pax6 regulation of genes. We find that 88% performed on mutations affecting protein–nucleic acid of the mutations have a negative impact on the stability interactions, although studies predicting the effect of mu- and/or binding energy of Pax6, thus impairing its function tations on DNA recognition of specific sequences have and causing disease. been published (10–12). Protein–DNA interactions are a key process in tran- scriptional regulation and replication. To carry out their MATERIALS AND METHODS function, DNA-binding proteins must find and bind to Quantitative benchmark sets infrequent and small specific binding sites and discrimin- ate them from a huge excess of non-specific DNA. Two different data sets were used to quantitatively Protein–DNA complexes involve direct and indirect inter- evaluate the performance of FoldX: one containing actions and there is not a general recognition code to protein mutations in the interface with DNA and the predict base–residue interactions. For some well other made of wild-type proteins interacting with different characterized families [zinc fingers (13,14), homeodomains DNA sequences. (11,15) and bHLHs (16)] some general rules can be The set of protein mutants in protein–DNA complexes applied. However, for the majority of DNA-binding was initially taken from the ProNIT database (36). proteins the main way to identify the DNA recognition However, due to internal inconsistencies of the database, sequence is through experimental methods. the actual data (physical data of the experiments and the DNA-binding sites are traditionally characterized using change in interaction energy due to the mutation, ÁÁGint) a limited number of sequences by biochemical assays. were taken from the papers describing the original experi- However, in the last few years, several experimental tech- ments. The final set contains 97 mutations, among them niques and an increasing number of sequenced genomes 59 conservative mutations (Supplementary Table S1). allowed a more detailed analysis. Several computational The set of proteins binding to different DNA sequences methods for discovering TF binding sites have been was compiled in ref. (24). We analyzed those for which a described (17,18). Experimental methods that challenge crystal structure is available, and we compared the correl- the protein to a library of DNA sequences and successive- ation factors obtained between experiments and predic- ly enrich those with high affinity have been developed, tions for each TF with our method and their software. such as in vitro selection (19,20) or yeast or bacterial one-hybrid assays (21). Additionally, universal protein- FoldX DNA force field binding microarrays (PBMs) (10,12,22,23) expose the The classical four bases -A, C, G and T-, as well as the protein to all possible DNA-binding site sequence methylated A and C bases, were incorporated using variants making universal PBMs the only exhaustive tech- standard FoldX parameters for defining atoms (37) nique available. (charges, Van der Waals radii, volumes, electrostatic inter- In the past years, different in silico approaches have actions, solvation energies and hydrogen bond param- been developed to predict DNA-binding site motifs for eters) without parameter fitting. In order to better take DNA-binding proteins using structures. There have been into account the stacking of bases and their preferred con- successful attempts either by using existing crystal struc- formation, we put an entropy-like term inside the Van der tures (24–32), homology modeling (33) or by a docking Waals clash term of FoldX for adjacent bases. This term approach (13). In particular, structure-based predictions was derived from a statistical analysis of all DNA struc- were evaluated in zinc fingers (28,34) where a sensitivity to tures in the Protein Data Bank (PDB) looking at each docking geometry was reported (35), and in possible pairs of consecutive bases and looking at the meganucleases (30–32), highlighting the importance of angles made by the planes of each bases. Those angles having multiple

Using Protein Design Algorithms to Understand the Molecular Basis of Disease Caused by Protein–DNA Interactions: the Pax6 Example

Human Stem Cells from Single Blastomeres Reveal Pathways of Embryonic Or Trophoblast Fate Specification Tamara Zdravkovic1,2,3,4,5,‡, Kristopher L

Identification of a Novel Mutation Disrupting the DNA Binding Activity

View Software Version 1.6.6 [51]

Aberrant Gcm1 Expression Mediates Wnt/Β-Catenin Pathway Activation in Folate Deficiency Involved in Neural Tube Defects

GCM1 Antibody - N-Terminal Region Rabbit Polyclonal Antibody Catalog # AI16248

Detailed Characterization of Human Induced Pluripotent Stem Cells Manufactured for Therapeutic Applications

Primate-Specific Mir-515 Family Members Inhibit Key Genes in Human Trophoblast Differentiation and Are Upregulated in Preeclamps

Early Onset Preeclampsia in a Model for Human Placental Trophoblast

ADHD) Gene Networks in Children of Both African American and European American Ancestry

Primepcr™Assay Validation Report

P45nf-E2 Represses Gcm1 in Trophoblast Cells to Regulate

Transcription Factors Underlying the Development and Endocrine Functions of the Placenta † † † JAMES C