Prediction of Distal Residue Participation in Enzyme Catalysis
Article, Protein Science, DOI 10.1002/pro.2648
Heather R. Brodkin (a,b,1), Nicholas A. DeLateur (a,2), Srinivas Somarowthu (a,3), Caitlyn L. Mills (a), Walter R. Novak (b,4), Penny J. Beuning (a), Dagmar Ringe (b), Mary Jo Ondrechen (a,*)
(a) Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, USA
(b) Departments of Biochemistry and Chemistry, Rosenstiel Basic Medical Sciences Research Center, Brandeis University, Waltham, MA 02454-9110, USA
(1) Present address: Alkermes, Inc., 852 Winter Street, Waltham, MA 02451, USA
(2) Present address: Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139
(3) Present address: Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520
(4) Present address: Department of Chemistry, Wabash College, Crawfordsville, IN 47933
(*) To whom correspondence should be sent. Corresponding author information: Professor Mary Jo Ondrechen, Department of Chemistry and Chemical Biology, Northeastern University, 360 Huntington Avenue, Boston, MA 02115, USA. Telephone: +1-617-373-2856. Fax: +1-617-373-8795. E-mail: [email protected]
This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1002/pro.2648
© 2014 The Protein Society. Received: Nov 22, 2014; Revised: Jan 10, 2015; Accepted: Jan 26, 2015
This article is protected by copyright. All rights reserved.
Abstract
A scoring method for the prediction of catalytically important residues in enzyme structures is presented and used to examine the participation of distal residues in enzyme catalysis.
Scores are based on the Partial Order Optimum Likelihood (POOL) machine learning method, using computed electrostatic properties, surface geometric features, and information obtained from the phylogenetic tree as input features. Predictions of distal residue participation in catalysis are compared with experimental kinetics data from the literature on variants of the featured enzymes; some additional kinetics measurements are reported for variants of Pseudomonas putida nitrile hydratase (ppNH) and for E. coli alkaline phosphatase (AP). The multi-layer active sites of Pseudomonas putida nitrile hydratase and of human phosphoglucose isomerase are predicted by the POOL logZP scores, as is the single-layer active site of Pseudomonas putida ketosteroid isomerase. The logZP score cutoff utilized here results in overprediction of distal residue involvement in E. coli alkaline phosphatase. While fewer experimental data points are available for Pseudomonas putida mandelate racemase and for human carbonic anhydrase II, the POOL logZP scores properly predict the previously reported participation of distal residues.
Keywords
Remote residues; catalytic efficiency; Partial Order Optimum Likelihood; nitrile hydratase; phosphoglucose isomerase; ketosteroid isomerase; alkaline phosphatase
Abbreviations
CSA – Catalytic Site Atlas
INTREPID – Information Theoretic Tree Traversal for Protein Functional Site Identification
KSI – Ketosteroid isomerase
PGI – Phosphoglucose isomerase
POOL – Partial Order Optimum Likelihood
THEMATICS – Theoretical Microscopic Anomalous Titration Curve Shapes
John Wiley & Sons
Introduction
Many chemical reactions that require high temperature and/or extreme pH in the laboratory are catalyzed by enzymes at physiological temperature and neutral pH.
One of the central challenges in biochemistry is to understand how enzymes achieve this catalytic power under mild conditions. Structural and biochemical studies have now characterized the active sites of hundreds of enzymes; nearly all of these studies have focused on the amino acid residues in direct contact with the reacting substrate molecule(s). A typical enzyme might consist of hundreds of residues, but only a handful of them have been identified in the literature as catalytically important. In the manually curated portion of the Catalytic Site Atlas (CSA) [1], a database of enzyme active site composition compiled from the experimental literature, an average of only 1.2% of the residues in the biological units of these 170 enzymes is listed as catalytically important. Virtually all of these reported catalytic residues are, in the active conformation of the enzyme, in direct contact with the reacting substrate molecule(s).
A recent review surveys examples from the literature of proteins where distal residues have been shown to participate directly in the biochemical function [2]. This paper presents a model, based on computed chemical properties, that successfully predicts both first-shell and more distant residues that are important for function. Model predictions are compared with available experimental data and, for a few cases, new experimental results are presented. The ability to predict, with a simple calculation, the involvement of distal residues in catalytic and other processes is an important capability for protein engineering and for understanding how nature builds catalytic sites.
Here new computational and experimental results, together with experimental data from the literature, are presented to show that residues that are not in direct contact
with the reacting substrate molecule often play significant roles in catalysis and, more importantly, that their participation in catalysis is predictable with a simple calculation. Previously we have shown [3,4] that all of the residues in a protein structure may be assigned scores based on computed chemical properties and that these scores may be used to rank-order all of the residues according to their probability of functional importance. In the present work, these scores are transformed so that they are more comparable across proteins of different sizes. We now show that the pattern of these scores predicts whether an enzyme has a highly localized, compact active site, or a spatially extended, multi-layer active site with extensive involvement of distal residues in catalysis and ligand binding.
For purposes of the present study, residues interacting directly with the substrate molecule are referred to as first-shell residues; other residues interacting directly with one or more first-shell residues are called second-shell residues; and finally, other residues interacting with one or more second-shell residues are called third-shell residues. All residues defined as first-, second- or third-shell may thus be thought of as members of their respective spheres surrounding the reacting substrate molecule. The present work focuses on enzymes for which experimental data on the participation of distal residues in catalysis are available.
THEMATICS (THEoretical Microscopic Anomalous TItration Curve Shapes) is a computational method for the prediction of catalytic and binding residues in proteins solely from the 3D structure of the query protein [5-7]. THEMATICS takes advantage of the unique chemical and electrostatic properties of catalytic and binding residues. In particular, the ionizable residues in active sites tend to have anomalously shaped theoretical titration curves.
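As a point of reference for what "normal" titration behavior means, an isolated, non-interacting ionizable group follows the Henderson-Hasselbalch equation. The sketch below compares that reference curve with a toy system of two ionizable sites that repel each other when both are protonated. It is an illustration only, not the THEMATICS calculation: the function names, the two-site model, and the parameter values (pKa of 7.0 for both sites, a coupling of 3 kT) are assumptions chosen for this example.

```python
import math

LN10 = math.log(10.0)


def hh_protonated_fraction(pH, pKa):
    """Henderson-Hasselbalch protonated fraction of an isolated group."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


def coupled_protonated_fraction(pH, pKa_a, pKa_b, w_kT):
    """Protonation probability of site A when it interacts with site B.

    Hypothetical two-site model: each site is protonated (1) or not (0).
    w_kT is the interaction free energy, in units of kT, paid when both
    sites are protonated (positive means repulsion).
    """
    weights = {}
    for xa in (0, 1):
        for xb in (0, 1):
            # Free energy of the microstate relative to the fully
            # deprotonated state, in units of kT.
            g = LN10 * (xa * (pH - pKa_a) + xb * (pH - pKa_b)) + w_kT * xa * xb
            weights[(xa, xb)] = math.exp(-g)
    z = sum(weights.values())
    return (weights[(1, 0)] + weights[(1, 1)]) / z


# Tabulate both curves; the coupled site titrates over a broader pH range.
for pH in range(3, 11):
    normal = hh_protonated_fraction(pH, 7.0)
    coupled = coupled_protonated_fraction(pH, 7.0, 7.0, 3.0)
    print(f"pH {pH:2d}  isolated {normal:.3f}  coupled {coupled:.3f}")
```

With the coupling set to zero the two-site expression reduces exactly to the Henderson-Hasselbalch curve, so any departure from that shape in this toy model comes from the interaction between the sites.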
We have established metrics [6] to quantify the degree of deviation from normal titration behavior and have shown how these metrics may be used to predict, with high sensitivity and specificity [7], the active site residues in a protein 3D structure from a simple set of calculations.
POOL (Partial Order Optimum Likelihood) [3,4,8] is a machine learning method for the prediction of functionally important residues in protein structures. POOL, a multidimensional, monotonicity-constrained maximum likelihood technique, utilizes input features for which the output variable (in this case the probability of functional importance of a residue) is a monotonic function of each feature. POOL starts with the hypothesis that the larger the THEMATICS metrics for a given residue, the higher the probability that that residue is important for the biochemical function. Electrostatics features from THEMATICS are combined with multidimensional isotonic regression to form maximum likelihood estimates of probabilities that specific residues are active in biochemical function. All residues are then rank-ordered according to their probability of functional importance. Through the introduction of environment variables in POOL, THEMATICS metrics are used to characterize the chemical environments of all residues, not just the ionizable ones. Furthermore, the input to POOL is flexible, in that any property upon which the probability of the functional importance of a residue depends monotonically can be a POOL input feature.
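The monotonicity constraint can be illustrated in one dimension with the pool-adjacent-violators algorithm, a standard way to compute an isotonic (non-decreasing) fit. The sketch below is a deliberate simplification and not the POOL implementation, which is multidimensional and works on a partial order; the function names, the single made-up feature, and the toy labels are all assumptions for illustration. Ties in the feature value are not given special treatment here.

```python
def pava(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to values."""
    blocks = []  # each block is [mean, weight, number_of_points]
    for v in values:
        blocks.append([float(v), 1.0, 1])
        # Merge backwards while the newest block violates monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def monotone_probabilities(feature, label):
    """Estimate P(label = 1) as a non-decreasing function of one feature."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    fit_sorted = pava([label[i] for i in order])
    probs = [0.0] * len(feature)
    for position, i in enumerate(order):
        probs[i] = fit_sorted[position]
    return probs


# Hypothetical training data: one feature value per residue and a 0/1 label
# marking residues annotated as functionally important.
feature = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
label = [0, 1, 0, 0, 1, 1]

probs = monotone_probabilities(feature, label)
# Rank-order residue indices from most to least probable.
ranking = sorted(range(len(probs)), key=lambda i: -probs[i])
print(probs)
print(ranking)
```

For 0/1 labels, the pooled block means are the observed fractions of positive examples within each block, which is why an isotonic fit of this kind can be read as a probability estimate that never decreases as the feature grows.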
Such features include those based solely on the structure of the query protein, such as THEMATICS metrics [6,7,9] and surface geometric properties [10], but may also include other features if they are available, such as phylogenetic tree-based scores [11]. It has been demonstrated [3] that POOL with purely structure-based features performs very nearly as well as the best methods that utilize both sequence alignments and structure. POOL with input features that include