Iowa State University Capstones, Theses and Graduate Theses and Dissertations Dissertations

2016 Knowledge-based approaches for understanding structure-dynamics-function relationship in proteins Kannan Sankar Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd Part of the Biochemistry Commons, Bioinformatics Commons, and the Biophysics Commons

Recommended Citation Sankar, Kannan, "Knowledge-based approaches for understanding structure-dynamics-function relationship in proteins" (2016). Graduate Theses and Dissertations. 16007. https://lib.dr.iastate.edu/etd/16007

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Knowledge-based approaches for understanding structure-dynamics-function relationship in proteins

by

Kannan Sankar

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Bioinformatics and Computational Biology

Program of Study Committee: Robert L. Jernigan, Co-Major Professor Drena Dobbs, Co-Major Professor Amy Andreotti Guang Song Sanjeevi Sivasankar

Iowa State University

Ames, Iowa

2016

Copyright © Kannan Sankar, 2016. All rights reserved. ii

DEDICATION

To my parents and my sister for their unconditional support throughout this journey

iii

TABLE OF CONTENTS

Page

NOMENCLATURE ...... vii

ACKNOWLEDGMENTS ...... xii

ABSTRACT………………………………...... xiv

CHAPTER 1 OVERVIEW ...... 1

1.1. Background ...... 1 1.1.1. Protein Dynamics: the Link between Structure and Function ...... 1 1.1.2. Experimental Determination of Protein Structure and Dynamics ...... 3 1.1.3. Computational Approaches to Protein Structure and Dynamics ...... 4 1.2. Motivation and Specific Aims ...... 13 1.3. Dissertation Organization ...... 15

CHAPTER 2 DISTRIBUTIONS OF EXPERIMENTAL PROTEIN STRUCTURES ON COARSE-GRAINED FREE ENERGY LANDSCAPES .. 20

2.1. Introduction ...... 21 2.2. Theory and Methods ...... 27 2.2.1. Datasets ...... 27 2.2.2. PCA ...... 28 2.2.1. Knowledge based potential functions ...... 29 2.2.2. Structural entropy evaluation ...... 30 2.2.1. Construction of free energy landscapes ...... 31 2.2.2. Generation of a transition path between two structures on the free energy landscape ...... 32 2.3. Results ...... 34 2.2.1. Distribution of structures in low energy regions of the landscape...... 34 2.2.2. Case study I: T4 Lysozyme ...... 37 2.2.3. Case study II: Human Serum Albumin (HSA) ...... 40 2.2.2. Case Study III: SERCA ...... 42 2.2.2. Predicting the transition pathway between the open and closed forms of HIV-1 protease ...... 45 2.4. Conclusions ...... 48 2.5. Acknowledgements ...... 50 2.6. Supplementary Material ...... 50

iv

CHAPTER 3 KNOWLEDGE-BASED ENTROPIES IMPROVE THE IDENTIFICATION OF NATIVE PROTEIN STRUCTURES ...... 51

3.1. Introduction ...... 52 3.2. Results ...... 56 3.2.1. Patterns of amino acid contact changes during protein conformational changes ...... 56 3.2.2. Correlation between contact-change patterns and nature of amino acids 60 3.2.1. Knowledge based entropy functions ...... 62 3.2.2. Knowledge-based free energy function (KBF) for protein native structure recognition ...... 63 3.3. Discussion ...... 67 3.4. Materials and Methods ...... 69 3.4.1. Datasets ...... 69 3.4.2. Evaluation of normalized amino acid contact changes ...... 70 3.4.3. Evaluation of local entropies based on contact changes ...... 71 3.4.4. Optimization of weights for potential and entropy terms ...... 72 3.4.5. Comparison of performance measures...... 73 3.5. Author Contributions ...... 73 3.6. Acknowledgements ...... 74 3.7. Supplementary Material ...... 74

CHAPTER 4 MOLECULAR DETERMINANTS OF CADHERIN IDEAL BOND FORMATION: CONFORMATION DEPENDING UNBINDING ON A MULTIDIMENSIONAL LANDSCAPE...... 75 4.1. Introduction ...... 76 4.2. Results ...... 79 4.2.1. Cadherin energy landscape and transition pathway for the interconversion between X-dimers and S-dimers ...... 79 4.2.2. Wild type and conformational shuttling mutants form a metastable intermediate, dimer state ...... 83 4.2.3. Conformational interconversion depends on the K14-D138 salt- bridge interactions ...... 88 4.2.4. Intermediate dimer states exhibit force-induced conformational motion perpendicular to the pulling direction ...... 90 4.2.5. Cadherins trapped in an intermediate dimer state form ideal bonds .... 94 4.3. Discussion ...... 98 4.4. Methods...... 101 4.4.1. Principal component analysis ...... 101 4.4.2. Projecting energy landscapes onto PC coordinates ...... 102 4.4.3. Determination of transition path between X-dimers and S-dimers ..... 103 4.4.4. MD and SMD simulations and structural analysis ...... 103 4.4.5. E-cadherin constructs and single molecule force-clamp measurements 105 4.5. Author Contributions ...... 107 4.6. Acknowledgements ...... 108 4.7. Supplementary Material ...... 108

v

CHAPTER 5 AN ANALYSIS OF CONFORMATIONAL CHANGES UPON RNA-PROTEIN BINDING ...... 109

5.1. Introduction ...... 110 5.2. Materials and Methods ...... 114 5.2.1. Dataset...... 114 5.2.2. Identification of ‘flexible’ regions ...... 114 5.2.3. Identification of interface residues...... 116 5.2.4. Identification of surface vs. buried residues ...... 116 5.2.5. Classification based on SCOR category of bound RNA and SCOP class of protein ...... 116 5.2.6. Performance evaluation of sequence-based vs. structure-based methods for RNA-binding site prediction...... 117 5.3. Results ...... 118 5.3.1. Sequence-based methods perform better than structure-based Methods for predicting RNA-binding sites in proteins...... 118 5.3.2. Most conformationally flexible residues in RNA-binding proteins map to non-interface surface residues...... 119 5.3.3. A large fraction of residues in rRNA-binding proteins undergo conformational changes upon RNA binding...... 121 5.4. Discussion ...... 122 5.5. Acknowledgements ...... 123

CHAPTER 6 MECHANISM OF DRUG EFFLUX BY ABGT FAMILY OF MEMBRANE TRANSPORTERS ...... 125

6.1. Introduction ...... 125 6.2. Methods...... 129 6.2.1. Structure preparation ...... 129 6.2.2. Analysis of pathways accessible through ENM modes ...... 129 6.2.3. Computational docking of ligands onto transporters ...... 130 6.2.4. Steered molecular dynamics ...... 131 6.2.5. Analysis of mean square fluctuations (MSFs) of the monomer in monomeric vs. oligomeric state ...... 132 6.3. Results and Discussion ...... 133 6.3.1. Identification of ligand migration channels ...... 133 5.3.2. Steered molecular dynamics of PABA and sulfamethazine in YdaH.. 134 5.3.3. Analysis of the effect of oligomerization on dynamics ...... 139 6.4. Conclusions ...... 140 6.5. Acknowledgements ...... 141

CHAPTER 7 CONCLUSIONS ...... 142

7.1. Specific Findings and Contributions ...... 142 7.2. Future Directions ...... 144

vi

APPENDIX A THE USE OF EXPERIMENTAL STRUCTURES TO MODEL PROTEIN DYNAMICS ...... 147

APPENDIX B INVESTIGATIONS ON ELASTIC NETWORK MODELS OF COARSE-GRAINED MEMBRANE PROTEINS ...... 169

APPENDIX C INVESTIGATIONS ON THE MECHANISM OF CARBON CONCENTRATION BY Lci1 ...... 172

APPENDIX D SUPPLEMENTARY INFORMATION FOR CHAPTER 3 ..... 181

APPENDIX E SUPPLEMENTARY INFORMATION FOR CHAPTER 4 ..... 191

BIBLIOGRAPHY ...... 199

vii

NOMENCLATURE

ADP Adenosine Diphosphate

AFM Atomic Force Microscopy

AMBER Assisted Model Building with Energy Refinement

ANM Anisotropic Network Model

ATH Alumina Trihydrate

ATP Adenosine Triphosphate

AUC Area Under the Curve

BPTI Bovine Pancreatic Trypsin Inhibitor

CAPRI Critical Assessment of Prediction of Interactions

CASP Critical Assessment of Protein Structure Prediction

CATH Class Architecture Topology Homologous superfamily

CCM Carbon Concentrating Mechanism

CD-HIT Cluster Database at High Identity with Tolerance

CHARMM Chemistry at Harvard Molecular Mechanics

CO Cumulative Overlap

DFIRE Distance-scaled Finite Ideal-Gas Reference State

DIMB Diagonalization in a Mixed Basis

DMEM Dulbecco’s Modified Eagle’s Medium

DNA Deoxyribonucleic acid

DOPE Discrete Optimized Protein Energy

CG Coarse-grained

viii

EA Explicit-Atom

EC Enzyme Commission

EDD Error-scaled Difference Distance

EM Electron Microscopy

ENM Elastic Network Model

ESCET Error-inclusive Structure Comparison and Evaluation Tool

ESR Electron Spin Resonance

FBS Fetal Bovine Serum

FEL Free Energy Landscape

FN False Negative

FP False Positive

GA Genetic Algorithm

GCE Genetic Control Element

GNM Gaussian Network Model

GO Gene Ontology

GPCR G-Protein Coupled Receptor

GROMACS Groningen Machine for Chemical Simulations

GROMOS Groningen Molecular Simulation

GTP Guanosine Triphosphate

HEK Human Embryonic Kidney

HIV Human Immunodeficiency Virus

HAS Human Serum Albumin

ISBRA International Conference on Bioinformatics Research and Applications

ix

ISMB Intelligent Systems for Molecular Biology

KBP Knowledge-based Potential

KBF Knowledge-based Free Energy

LGA Lamarckian Genetic Algorithm

MATLAB Matrix Laboratory

MAVEN Motion Analysis and Visualization of Elastic Networks

MC Monte Carlo

MCMC Markov Chain Monte Carlo

MCC Matthew’s Correlation Coefficient

MM Molecular Mechanics

MMC Metropolis Monte-Carlo

MD Molecular Dynamics

MDR Multi-drug Resistance

MSA Multiple Sequence Alignment

MSF Mean Square Fluctuation

MUSTANG Multiple Structural Alignment Algorithm

NAG N-acetyl-D-glucosamine

NAM N-acetyl muramic acid

NAMD Nanoscale Molecular Dynamics

NGS Next-Generation Sequencing

NIH National Institutes of Health

NMA Normal Mode Analysis

NMR Nuclear Magnetic Resonance

x

NSF National Science Foundation

OPLS Optimized Potentials for Liquid Simulations

OPRA Optimal Protein-RNA Area

PABA p-amino benzoic acid

PABG p-amino benzoyl glutamate

PCA Principal Component Analysis

PDB Protein Data Bank

PEG Polyethylene glycol

PMF Proton Motive Force

POPC 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine

POPE 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphoethanolamine

PRIDB Protein-RNA Interaction Database

PRIP Protein-RNA Interface Prediction

PSO Particle Swarm Optimization

QM Quantum Mechanics

RAPDF Residue-specific All-atom Conditional Probability Discriminatory

Function

RMSD Root Mean Square Deviation

RMSIP Root Mean Square Inner Product

RMSF Root Mean Square Fluctuation

RNA Ribonucleic acid

RNP Ribonucleoprotein

ROC Receiver Operating Characteristic

xi

RTB Rotations-Translations of Blocks

ROC Receiver Operating Characteristic

SASA Solvent Accessible Surface Area

SCOP Structural Classification of Proteins

SCOR Structural Classification of RNAs

SELEX Systematic Evolution of Ligands by Exponential Enrichment

SERCA Serco-endoplasmic Reticular ATPase

SGC Structural Genomics Consortium

SMD Steered Molecular Dynamics

SRP Signal Recognition Particle

TM Transmembrane

TN True Negative

TP True Positive

UA United-Atom

VMD Visual Molecular Dynamics

WHO World Health Organization

WT Wild-type

xii

ACKNOWLEDGMENTS

First of all, I would like to extend my sincere gratitude towards my doctoral advisor

Dr. Robert L. Jernigan for all his guidance throughout my graduate study at Iowa State. His unique style of advising has given me sufficient freedom to explore and learn diverse research topics, to ask good scientific questions and to take failures positively. I would also like to thank my co-major advisor, Dr. Drena Dobbs for being a constant source of encouragement. Having worked on a rotation project in her lab, I greatly cherish the opportunity of being a de-facto member of the Dobbs Lab.

I would like to extend my special thanks to my other committee members Dr. Amy

Andreotti, Dr. Guang Song and Dr. Sanjeevi Sivasankar for providing valuable suggestions which have played a crucial role in shaping up various projects. I am also particularly thankful to Dr. Dianne Cook, who has provided me with valuable inputs for my projects.

Two statistics courses on multivariate analysis and data mining taught by her using her unique method of teaching have proved to be of great help in each of my projects. I would also like to extend my gratitude to several other faculty members at Iowa State for their important roles in my graduate study at Iowa State including Dr. Edward Yu and Dr. Taner

Sen.

I owe a ton of thanks to several staff at Iowa State for making my life as a graduate student, a smooth journey. Trish Stauble of BCB has played a pivotal role in every stage of my advancement towards the degree; and has always been willing to help in times of need. I would also like to thank Mary Jane McCunn, of the Jernigan Lab for all the help with official procedures and Xuefeng Zhao for software support. I would also like to express my gratitude

xiii to all the staff in BBMB department, specifically Connie Garnett for taking care of all the paperwork and ensuring that everything went smoothly.

My special thanks go to my student collaborators Rasna Walia, Kristine Manibog and

Reed Harrison; who have all been a pleasure to work with. I would like to thank my lab- mates Jie, Kejue, Sambit, Yuan, Nikita, Sayane and Ataur for the opportunity to work with them on several exciting projects. I would also like to thank Mike Zimmermann, from whom

I have been learnt a lot. My thanks also go to members of the Dobbs Lab including Pete

Zaback, Ben Lewis and Usha Muppirala for valuable suggestions during my rotation. I would also like to thank several other students at ISU: Scott, Jennifer, Gokul, Ruolin, Katie, Sweta,

Bhagyashree and Bipin as well as my dear friends Adarsh, Divya, Omana and Joseph who have played a huge role in making my time in Ames enjoyable and fun.

However, I wouldn’t have been able to make it to this stage in my life without the unconditional love and support of my parents, Sankar and Mala and my sister, Karthika.

They have always been available for me to talk to in times of need and have been a constant source of encouragement in my academic pursuit. I dedicate this dissertation to them, without whom, it wouldn’t have been possible for me to reach this stage.

xiv

ABSTRACT

Proteins accomplish their functions through conformational changes, often brought about by changes in environmental conditions or ligand binding. Predicting the functional mechanisms of proteins is impossible without a deeper understanding of conformational transitions. Dynamics is the key link between the structure and function of proteins. The protein data bank (PDB) contains multiple structures of the same protein, which have been solved under different conditions, using different experimental methods or in complexes with different ligands. These alternate conformations of the same protein (or similar proteins) can provide important information about what conformational changes take place and how they are brought about. Though there have been multiple computational approaches developed to predict dynamics from structure information, little work has been done to exploit this apparent, but potentially informative, redundancy in the PDB. In this work I bridge this gap by exploring various knowledge-based approaches to understand the structure-dynamics relationship and how it translates into protein function.

First, a novel method for constructing free energy landscapes for conformational changes in proteins is proposed by combining principal motions with knowledge-based potential energies and entropies from coarse-grained models of protein dynamics. Second, an innovative method for computing knowledge-based entropies for proteins using an inverse

Boltzmann approach is introduced, similar to the manner in which statistical potentials were previously extracted. We hypothesize that amino acid contact changes observed in the course of conformational changes within a large set of proteins can provide information about local pairwise flexibilities or entropies. By combining this new entropy measure with knowledge-

xv based potential functions, we formulate a knowledge-based free energy (KBF) function that we demonstrate outperforms other statistical potentials in its ability to identify native protein structures embedded with sets of decoys. Third, I apply the methods developed above in collaboration with experimentalists to understand the molecular mechanisms of conformational changes in several protein systems including cadherins and membrane transporters.

This work introduces several ways that the huge data in the PDB can be utilized to understand the underlying principles behind the structure-dynamics-function relationships of proteins. Results from this work have several important applications in structural bioinformatics such as structure prediction, molecular docking, protein engineering and design. In particular, the new KBFs developed in this dissertation have immediate applications in emerging topics such as prediction of 3D structure from coevolving residues in sequence alignments as well as in identifying the phenotypic effects of mutants.

1

CHAPTER 1. OVERVIEW

Proteins are aptly described as the workforce of the cell, performing most functions necessary for sustaining life, such as catalysis of biochemical reactions (enzymes), maintenance of structural integrity (structural proteins such as actin, myosin, etc.), cell-cell adhesion and formation of tissues (cadherins and integrins), molecular recognition and regulation (signaling proteins) and transport of molecules (membrane transporters). All proteins are biopolymers consisting of a sequence of amino acids connected together by peptide bonds. The sequence of a protein dictates a specific 3D structure, and the structure in turn specifies the dynamics. The function of a protein is the outcome of interplay between the sequence and the structure and is primarily manifested through its dynamics.

The term ‘dynamics’ can refer to any type of intramolecular change in proteins, which can range from the motion of individual atoms or group of atoms to those that involve whole domains or subunits of the molecule. These are usually classified into three types [Ringe and

Petsko, (1985)]: (a) fluctuations, which usually include vibrations of atoms (usually < 1 Å and in the timescale of picoseconds); (b) collective motions, which include motions of groups of linked atoms such as domains (usually in the timescale of picoseconds to nanoseconds); and (c) triggered conformational changes, which involve side chains or group of atoms such as loops or domains in response to an external stimulus such as binding of a ligand or change in environment.

1.1. Background

1.1.1. Protein dynamics: the link between structure and function

Although proteins are often considered to adopt unique structures, they are much more than just static objects. They often adopt more than one conformation and exist as

2 ensembles of multiple conformations or states. Transitions can occur between these states over a wide range of lengths (local, intra-domain or inter-domain) and time-scales and are often closely related to the function of a protein such as enzyme catalysis [Fraser et al.,

(2009)] or allosteric regulation [Bu and Callaway, (2011)]. This also forms the basis of the conformational selection hypothesis, according to which all the forms of a protein exist in equilibrium; and an external event (such as ligand binding or pH change) alters the equilibrium, changing the population of each state. Understanding how structure dictates function requires knowledge of the dynamics of the protein.

The study of protein dynamics primarily involves understanding the kinetics and thermodynamics of the transitions among these available states (or conformational changes), and experience tells us that there are relatively few of these for proteins. This perspective is often conceptualized in terms of the ‘energy landscape’ paradigm [Onuchic et al., (1997)]

(which is often used in ), mapping all the conformations to their specific free energies; with the highly populated states in the deepest wells and their kinetics expressed in terms of the heights of the energy barriers [Brooks et al., (1998); Bryngelson et al., (1995)].

In order to get an accurate picture of the free energy landscape and conformational transitions, one needs two things: (a) knowledge of all the conformational states, and (b) free energies of each state. It is virtually impossible to experimentally determine all conformational states of a protein or to measure the energies of each state. Hence, understanding protein dynamics and its link to protein function is highly dependent on progress with computational approaches, both for sampling conformations and for evaluating free energies.

3

1.1.2. Experimental Determination of Protein Structure and Dynamics

With recent advances in Next-Generation Sequencing (NGS) experiments, there has been remarkable progress in the determination of the sequences of numerous genomes

[Daniels et al., (2013)]. However, the determination of protein structures has lagged far behind and is unlikely to catch up with rapid sequencing efforts in the foreseeable future

[Levitt, (2007)]. Most of the current knowledge about protein structures comes from X-ray crystallography [Kendrew and Perutz, (1957)] and NMR spectroscopy [Wüthrich, (2001)], although recent developments in high resolution cryo-electron microscopy [Bartesaghi et al.,

(2015); Fischer et al., (2015)] are rapidly leading to structures of large macromolecular complexes.

The 3D structures of proteins solved by X-ray crystallography or NMR spectroscopy provide snapshots of their dynamics in their functional mechanisms. Experimental information on protein dynamics can be obtained to a certain extent from NMR relaxation measurements [Ishima and Torchia, (2000); Kay, (2005); Kleckner and Foster, (2011); Zhu et al., (2000)], ultrafast 2D-infrared spectroscopy [Adamczyk et al., (2012)], time-resolved

X-ray solution scattering [Kim et al., (2015)] and diffuse scattering [Nienhaus et al., (1989)],

X-ray crystallography [Ringe and Petsko, (1985)] or recently by hybrid methods that combine several of these methods [Fenwick et al., (2014)]. However one of the major drawbacks to these methods is that they provide measures of dynamics which are either 1D

(e.g. residue flexibilities) or 2D (internal distance changes between residues); whereas understanding protein dynamics requires 3D information about how various parts (or domains) move in coordination with each other and how these relate to its function. As a result, computational approaches to model dynamics have received a lot of attention from

4 theoretical physicists and chemists. A recent review article by Bedem and Fraser summarizes various experimental methods for determining dynamics information and their potential for synergies with computational simulations to resolve the structural basis of protein conformational changes in atomic detail [van den Bedem and Fraser, (2015)]. I briefly summarize the various computational approaches to studying protein dynamics, since they form the background for my work.

Figure 1.1. Essential components of a molecular mechanics protocol. Any MM approach consists of three essential components: (i) a method for representation of particles (ii) a which captures the rules governing interactions between particles of the system, and (iii) a simulation or sampling method which can capture the progress of the system over space/time.

1.1.3. Computational Approaches to Protein Structure and Dynamics

The majority of computational methods for studying the structure and dynamics of proteins fall under the umbrella of ‘molecular mechanics’ (MM), which is the use of classical mechanics to model molecular systems. MM approaches vary usually in three essential aspects (Fig. 1.1): (a) the representation of particles (b) the force-field, i.e. the function to

5 evaluate the energy of interactions between particles in the system; and (c) the sampling or simulation method, i.e. the technique used to simulate the behavior of the system across time/space.

(a) Molecular representation: As in classical mechanics, the particles (atoms, molecules or ions) of a biomolecular system are modeled as spheres having a particular mass, charge and radius. Depending on the level of detail used to describe the system, the models are categorized as (a) all-atom or explicit-atom (EA) models in which all atoms including hydrogens are explicitly accounted for; (b) united-atom (UA) models in which some small group of atoms is considered in condensed form, e.g., a carbon atom and its covalently bonded hydrogens are considered as one group or (c) coarse-grained (CG) models in which two or more united atoms, or even an entire amino acid residue is considered as one bead).

The primary objective of grouping atoms is to reduce computation time, in addition to eliminating many details of the structure. Depending on the length and time-scale of the process being investigated, one of the approaches may be preferable to the others.

(b) Force-fields: The potential energy of a biomolecular system is usually modeled using force-fields. A force-field is a combination of the mathematical formulae and the set of parameters that captures the potential energy of the molecular system in terms of its atomic coordinates. Force-fields are of two types: (a) physics-based force-fields derived from basic physical principles; and (b) knowledge-based or statistical force-fields, which aim to derive the forces and potentials governing protein structures by learning from a database of known native structures.

6

Physics-based force fields, as their name suggests, are derived using principles of physics. They model bonded interactions in the system using harmonic springs (to describe changes in bond lengths, bond angles and torsion angles) and non-bonded interactions such as van der Waals forces using a Lennard-Jones (LJ) like potential [Lennard-Jones, (1924)],

H-bonds, electrostatic interactions from Coulomb’s law. The parameters in the functional form of these force-fields are often derived from quantum mechanical (QM) calculations (in which case they are referred to as ab-initio force fields) or by fitting to experimental data (in which case they are referred to as empirical or semi-empirical force-fields). The most commonly used physics-based force-fields for proteins include AMBER [Cornell et al.,

(1995)], CHARMM [B. R. Brooks et al., (1983)], OPLS [Jorgensen and Tirado-Rives,

(1988)] and GROMOS [Scott et al., (1999)]. An extensive summary of physics-based force fields for biomolecular simulations can be found in two reviews [Mackerell, (2004); Ponder and Case, (2003)]. Physics-based force-fields are most often employed for molecular dynamics [Price and Brooks, (2002)], energy minimization and structure refinement [Levitt and Lifson, (1969)].

On the other hand, knowledge-based force-fields are derived from large datasets of known structures of proteins. The underlying idea behind statistical potentials is that the forces that govern protein structure are complicated, and hence the best sources of information about these are the known native structures themselves. By assuming that the known native structure of a protein is at the global minimum of its free energy landscape, the implication is that favorable interactions will occur more frequently than less favorable ones in the known structures; and these can then be used to estimate the energy of interactions.

This is often referred to as the inverse Boltzmann approach [Sippl, (1993)]. This approach

7 was pioneered by Tanaka and Scheraga [Tanaka and Scheraga, (1976)] and developed extensively by Miyazawa and Jernigan [Miyazawa and Jernigan, (1999), (1996), (1985)], and

Sippl [Sippl, (1990)] for computing contact potentials of amino acid pairs. A detailed review of various two-body contact potentials is presented in the reference [Pokarowski et al.,

(2005)]. Since two-body interactions are not sufficient to account for the dense packing and cooperative nature of protein contacts, three [Munson and Singh, (1997)] and four-body contact potentials [Carter et al., (2001); Feng et al., (2007); Krishnamoorthy and Tropsha,

(2003)] have been developed. Although originally developed for coarse-grained structures, knowledge-based approaches have since been extended to develop several all-atom statistical potentials, and additionally account for either the distance dependence or orientation dependence of interactions. The most popular all-atom statistical potentials are DFIRE [Zhou and Zhou, (2002)], RAPDF [Samudrala and Moult, (1998)], atomic KBP [Lu and Skolnick,

(2001)], DOPE [Sali and Sali, (2006)], OPUS-PSP [Lu et al., (2008)] and RW/RWplus

[Zhang and Zhang, (2010)]. The most popular applications of statistical potentials include structure prediction [Hamelryck et al., (2010)], fold recognition or threading [Domingues et al., (1999)], prediction of protein-protein or protein-nucleic acid interactions [Feliu et al.,

(2011)] and protein engineering and design [Russ and Ranganathan, (2002)]. Interestingly, the most successful methods in 3D structure prediction [Zhang, (2008)] or near-native structure refinement [Summa and Levitt, (2007)] have been achieved by the use of knowledge-based potentials themselves, or in combination with physics-based force fields.

(c) Molecular simulations: Given a molecular representation of the system and a force-field, a simulation method is necessary to sample the conformational space of the protein, from

8 which a rough idea of its free energy of the landscape can be obtained. Here I briefly outline the most commonly used approaches, which I have used in this dissertation: molecular dynamics, Monte-Carlo methods, normal mode analysis and elastic network models. Special emphasis is given to elastic network models, since they have been used in multiple chapters.

Initially developed in the field of theoretical physics [Alder and Wainwright, (1959)], molecular dynamics (MD) found its applications for understanding the dynamics and function of proteins in the 1970s [Levitt and Warshel, (1975); Warshel, (1976)]. In the classical model, an MD simulation involves solving Newton’s equations of motion for a system consisting of interacting atoms and molecules over a period of time. Each particle in the system is assigned a velocity according to its temperature and is allowed to move according to forces derived from the slope of the potential energy function. The procedure is repeated for a certain period of time in short steps (usually a few femtoseconds), with the velocities updated at each step. The first application of MD on a biomolecular system was to study bovine pancreatic trypsin inhibitor (BPTI) [McCammon et al., (1977)], but since then, it has been applied to address a broad variety of questions including, among others, the functional mechanisms of conformational changes, the role of solvent in protein dynamics, and protein folding [Scheraga et al., (2007)]. An overview of MD and its applications can be found in several references [Durrant and McCammon, (2011); Karplus and McCammon,

(2002)].

One other popular technique for simulations of proteins is Monte-Carlo (MC), which is the general term that encompasses a variety of stochastic methods. An MC methodology starts with an initial configuration of the particles and an MC move is made through random sampling in an attempt to change the configuration. The move may be accepted or rejected

9 based on an acceptance criterion, e.g. based on the energy change. It is critical to be certain that all sampled configurations are drawn in an unbiased way from the distribution (usually a

Boltzmann distribution). Following a move, the property of interest of the system is calculated and the procedure is repeated many times. By averaging the property of interest over many steps, the equilibrium thermodynamic properties of the system can be obtained

[Earl and Deem, (2008)]. MC methods have been frequently been applied to simulate protein folding [Hansmann and Okamoto, (1999); Kolinski and Skolnick, (1994a), (1994b)] and for conformational sampling [Jorgensen and TiradoRives, (1996)].

Some alternative approaches to understanding the dynamics of proteins are based on vibrational analysis of molecules. Normal mode analysis (NMA) is a method used to characterize the vibrational motions of oscillating systems near their equilibrium state

[Goldstein et al., (2002)]. A system is considered to be in equilibrium when it is at the bottom of the energy well and the net force acting on it is zero. The double derivatives of the potential with respect to position are captured in a 3푁 × 3푁 Hessian matrix, where 푁 is the number of particles (e.g. atoms) in the system. The eigenvectors of the Hessian matrix

(normal modes) describe the magnitude and direction of motion of each particle in the system whereas the eigenvalues represent the square of frequencies of the corresponding mode. All patterns of motion of the system can be fully described by a combination of one or more of the normal modes; with the lower frequency (soft) modes corresponding to larger scale cooperative motions and higher modes representing more localized motions (such as vibrations of individual side chains or atoms). Originally applied to small molecules and shown to reproduce their vibrational spectra, NMA found its first applications for proteins in the 1980s [B. Brooks et al., (1983); Go et al., (1983); Levitt et al., (1985)]. The main

10 limitation of NMA is the computation time, since it involves diagonalization of the Hessian matrix, and this can become prohibitive for larger proteins. Several solutions have been proposed to address this problem, which include the Block Lanczos algorithm [Marques and

Sanejouand, (1995)], the Diagonalization in a Mixed Basis (DIMB) method [Mouawad and

Perahia, (1993); Perahia and Mouawad, (1995)] and the rotations-translations of blocks

(RTB) method [Durand et al., (1994); Tama et al., (2000)]. Reference [Skjaerven et al.,

(2009)] provides an extensive summary of applications of NMA to proteins.

A simplified approach for vibrational analysis is to describe the protein as an elastic network model (ENM) as an alternative to using empirical force-fields to model the interactions between particles. All interactions (both bonded and non-bonded) between particles (beads) are modeled as harmonic springs (hence the term bead-spring models).

Tirion first showed that the slow dynamics of proteins can be reproduced by using a simple one-parameter spring-like interaction between all atoms [Tirion, (1996)]. The potential energy of the system is proportional to the sum of squares of displacement of each particle from its equilibrium position. Hinsen further simplified Tirion’s model by coarse-graining the protein to retain only the C atoms of each residue [Hinsen, (1998)] and showed how it could be used to identify dynamical domains in proteins. Further a modified model was developed in which an all-atom normal mode analysis was used to obtain effective interaction matrices for Cα atoms; from which the spring constants were fit empirically, resulting in a distance-dependent stiffness function [Hinsen et al., (2000)].

A further simplification was provided by Bahar et al. [I Bahar et al., (1997)] in what is commonly referred to as the Gaussian Network Model (GNM). In the GNM, the 3푁 × 3푁

Hessian matrix is replaced with the 푁 × 푁 Kirchoff matrix that captures the contact topology

11 of the protein. The fluctuations of the particles are assumed to be isotropic and normally distributed; and the mean-square fluctuations (MSFs) of each particle can be obtained directly from the inverse of the Kirchoff matrix as its diagonal elements. Cross-correlations between residues are obtained from the off-diagonal elements. The GNM was then extended to the Anisotropic Network Model (ANM) which considered the fluctuations to be anisotropic and hence the directions of fluctuations of the particles can also be obtained and visualized [Atilgan et al., (2001)]. There are usually two parameters in these models: a cutoff distance 푅푐 (particles within this distance are connected by springs) and a spring constant 훾

(describing the stiffness of each spring). In the simplest case, all the spring constants are assumed to be identical (usually, unity). The parameter 푅푐 is usually fit to maximize correlation with some experimentally determined quantity, usually the temperature factors from X-ray crystal structures. The typical 푅푐 values used are 7 Å for GNMs and 13 Å for

ANM. A detailed overview of the theory and applications of ENMs can be found in several references [Bahar and Rader, (2005); Chennubhotla et al., (2005); Sanejouand, (2011)].

There have been various works on modifying ENMs to improve their predictive power.

Some examples include using different spring constants for covalently-bound residues

[Kondrashov et al., (2006)]; a distance-based network model (DNM) with different spring constants for atomic contacts [Kondrashov et al., (2007)]; and the parameter-free ANM

(pfANM) model in which the spring constant between a pair of nodes is taken to be proportional to 푟−푘 where 푘 is an integer. Though the simplest ENMs do not take into account the sequences of proteins, ENMs can also be modified to take into account the specificity of interactions between residues [Hamacher and McCammon, (2006)] as well as the differences in masses and chemical nature of residues [Kim et al., (2013)].

12

ENMs are not only simple and fast; they can more readily and more accurately capture the collective motions of proteins than molecular dynamics. One of the important lessons from ENMs is that the motions of a protein essential to its function are often the manifestation of their overall shape, rather than minute details in the structure. Moreover small scale motions often reflect large scale motions at the domain level [Tama and

Sanejouand, (2001); Yang et al., (2007)]. ENMs have found applications in a variety of problems including prediction of residue flexibilities [L. Yang et al., (2009)], investigating the dynamics of a variety of proteins [Chennubhotla et al., (2005)] and supramolecular complexes including viral capsids [Kim et al., (2003)] and ribosomes [Burton et al., (2012);

Kurkcuoglu et al., (2008); Wang et al., (2004)]; prediction of hinge regions in proteins

[Emekli et al., (2008)]; and elucidation of transition pathways between alternative conformations of proteins [Kim et al., (2002); Maragakis and Karplus, (2005); Schuyler et al., (2009); Tekpinar and Zheng, (2010); Zheng et al., (2007)]. However, one of the most important drawbacks of ENMs is that they require a densely packed globular structure.

However, not all biomolecular phenomena can be modeled using MM approaches. For example, modeling enzyme reactions that involve the cleavage and formation of covalent bonds requires a more accurate description of the atoms in the reaction with quantum mechanics (QM) approaches. More often than not, different parts of a system can be modeled at different levels of detail, in approaches referred to as ‘multiscale modeling’ [Ayton et al.,

(2007); Tozzini, (2010)]. One such approach is the hybrid QM/MM approach [Warshel and

Levitt, (1976)] which combined the speed of MM approaches and the accuracy of QM approaches. The 2013 Nobel Prize in Chemistry was awarded to Arieh Warshel, Michael

Levitt and Martin Karplus for “the development of multiscale models for complex chemical

13 systems” which is testimony to the power of computational approaches in understanding the structure-dynamics relationship in proteins.

1.2. Motivation and Specific Aims

The protein data bank (PDB) [Berman et al., (2000)] has grown significantly over the past few years and now has more than 120,000 structures of diverse proteins. The Structural

Genomics Consortium (SGC) played an important role in expanding the diversity of the database [Kouranov et al., (2006)]. However, an interesting aspect of the PDB which has not yet been fully exploited is its remarkable redundancy. A clustering of the PDB chains having high sequence identity shows that there are at least 1200 proteins with more than 30 structures in the PDB. Ensembles of these multiple structures of the same protein can provide insight into the free energies, conformational changes, and protein dynamics. The enormous success of knowledge-based potentials over physics-based force fields in structure prediction is testimony to the fact that known protein structures are the most reliable source of information regarding the principles underlying protein structure. In this dissertation, I exploit the availability of multiple structures of proteins to develop knowledge-based approaches that further our understanding of the structure-dynamics-function relationship in proteins:

(a) Construction of coarse-grained free energy landscapes of protein conformational changes: Recently, Nussinov and Wolynes highlighted the importance of the free energy landscape concept for decoding a range of biochemical behaviors including protein folding, biomolecular function, protein disorder, misfolding, binding mechanisms, allostery and

14 signaling; and suggested that this can even lead to a second molecular biology revolution

[Nussinov and Wolynes, (2014)]. The free-energy landscape of proteins is usually constructed by sampling conformations using molecular dynamics or MC methods, performing a principal component analysis (PCA) on the ensemble and using the probability distribution of structures along the PCs to estimate free energies. However, one of the drawbacks of this approach is that molecular simulations usually do not fully sample the entire conformational space. We introduce a new method which combines PCs from sets of structures (experimental or computationally generated) with knowledge-based potentials and

ENM-based entropies to construct coarse-grained free energy landscapes.

(b) Development of a novel knowledge-based local entropy function for identification of native protein structures: Although there have been remarkable advances in evaluating potential energies of protein structures, estimation of entropies has lagged far behind. By using a method similar to the way in which knowledge-based potentials were extracted from observed frequencies of amino acid contacts in known protein structures (inverse Boltzmann approach), we extract knowledge-based entropies for proteins. These rely directly upon observed frequencies of contact changes between alternate conformations of proteins. One of the major objectives of this study was to use the patterns of contact changes between pairs of amino acid types to improve the identification of native structures from decoys.

(c) Investigating the molecular mechanism of conformational changes in proteins using hybrid computational and experimental approaches: In order to test out the methods developed in this thesis on specific proteins, we have established several experimental

15 collaborations. One such application is aimed at understanding the mechanism of conformational changes in cell-cell adhesion proteins (cadherins) in collaboration with the

Sivasankar Lab. Another collaborative project aimed to elucidate the process of molecular transport by two membrane transporters (the AbgT family of multi-drug efflux transporters and inorganic carbon transport protein Lci1) in collaboration with the Yu Lab. Results from this group of studies have applications to cancer therapy (cadherins) and drug discovery such as tackling drug resistance in bacteria (multi-drug efflux proteins)

1.3. Dissertation Organization

The first chapter set the background for the work presented in this dissertation. In this section, I provide a brief overview of the organization of the rest of this dissertation, which is also shown as a flowchart in Fig. 1.2.

Figure 1.2. Organization of this dissertation.

16

Chapter 2 is a published manuscript titled “Distributions of experimental protein structures on coarse-grained free energy landscapes” by Sankar, K., Liu, J., Wang Y. and

Jernigan, R. L. in J. Chem. Phys. 143(24):243153 [Sankar et al., (2015)]. It is an exhaustive study of the application of principal component analysis to 50 proteins to extract their dynamics. It offers a novel method for constructing free energy landscapes on PC coordinates by combining knowledge-based potential functions and entropy measures (from elastic network models and tests these.

Chapter 3 is a manuscript titled “Knowledge-based entropies improve the identification of native protein structures” which has been submitted to Proc. Natl. Acad. Sci. USA for review. It introduces an innovative method to estimate the local conformational entropies of protein structures by using statistical data from observed amino acid contact changes for conformational transitions in a set of diverse proteins. We also combine these new entropies with cooperative potential energy functions to generate knowledge-based free energy functions (KBFs) with significant gains in the ability to distinguish between native structures of proteins and non-native decoys.

Chapter 4 is a manuscript titled “Molecular determinants of cadherin ideal bond formation: conformation dependent unbinding on a multidimensional landscape” which has been accepted for publication in Proc. Natl. Acad. Sci. USA in 2016. It is an application of the new method described in Chapter 2 to a class of cell-cell adhesion proteins - the classical cadherins. By using free energy landscapes and all-atom MD simulations, we show that conformational changes in cadherins can be explained by using a multidimensional landscape. This work provides an explanation for the diverse behaviors of various mutants to externally applied forces. This work was performed in collaboration with the Sivasankar Lab

17 at ISU. My contributions included construction of coarse-grained free energy landscapes and

PCA based analysis of the MD trajectories whereas the single molecule experiments and all- atoms simulations were performed by Kristine Manibog of the Sivasankar Lab.

Chapter 5 is an expanded version of a conference paper titled “An analysis of conformational changes upon RNA-protein binding“ published in the Proceedings of the 5th

ACM Conference on Bioinformatics, Computational Biology and Health Informatics (2014) at Newport Beach, CA, USA: pp 592-593 [Sankar et al., (2014)]. We compile a dataset of

RNA-bound and unbound forms of 90 RNA-binding proteins and analyze whether the conformationally variant parts of the proteins map to surface/buried as well as interface/non- interface regions. Our study offers compelling evidence that the conformationally variant regions map predominantly to non-interface regions. This is a collaborative project with

Rasna Walia from the Dobbs Lab at ISU. I curated the dataset, mapped the interface regions and conformationally variant regions whereas the analysis based on solvent-accessible surface area based analysis was performed by Rasna. This work was also presented as a poster at Intelligent Systems for Molecular Biology (ISMB) 2014 in Boston and as an oral talk

+ poster at 3DSig 2014 (a satellite meeting of ISMB in the area of structural bioinformatics).

Chapter 6 is a manuscript in preparation that summarizes the study of transport mechanisms for two membrane transporter proteins of the AbgT family of multi-drug efflux transporters, YdaH and MtrF. By using steered MD simulations, I map out the transition pathway of ligands in these transporters and identify the residues which are crucial to the transport pathway (some simulations were performed by Sayane). Results from the study may have significant applications in combating drug resistance. This is part of an ongoing collaboration with the Yu Lab at ISU.

18

Chapter 7 concludes this dissertation by highlighting the important contributions and also makes suggestions for future research.

In addition, the Appendices discuss some of the theory and methodology relevant to this dissertation, and also provide supplementary information for some chapters.

Appendix A is adapted from a chapter titled “The Use of Experimental Structures to

Model Protein Dynamics” in the book “Molecular Modeling of Proteins” which was published as part of the series Meth. Mol. Biol. 1215: pp 213-236 [Katebi et al., (2015)]. This details the methodology of extracting dynamics information from multiples structures of the same protein, which forms the basis for Chapter 2. In addition, a new method for parameterizing ENMs based on internal distance changes between residues observed in crystal structures of proteins is also presented.

Appendix B is adapted from a conference talk titled “Investigations on Elastic Network

Models of Coarse-Grained Membrane Proteins” presented at the International Symposium on Bioinformatics Research and Applications (ISBRA) 2012 held at University of Texas at

Dallas. This study highlights the importance of including the membrane in ENM based analysis of membrane proteins.

Appendix C is a manuscript in preparation in collaboration with the Yu and Spalding

Labs at ISU. It summarizes results of computational simulations on the mechanism of carbon concentration by low-CO2 inducible protein (Lci1) in the algae Chlamydomonas. This is a protein critical for the concentration of inorganic carbon for photosynthesis in algae, and this study has important implications for further advancements in introducing carbon concentrating mechanisms in higher plants to improve crop yield. The crystal structure was

19 solved by members of the Yu Lab whereas SMD simulations were performed by me with help from Sayane Shome.

Appendix D provides supplementary information for Chapter 3.

Appendix E provides supplementary information for Chapter 4.

20

CHAPTER 2. DISTRIBUTIONS OF EXPERIMENTAL PROTEIN STRUCTURES

ON COARSE-GRAINED FREE ENERGY LANDSCAPES

Paper originally published in (2015) J. Chem. Phys. 143(24):243153.

Kannan Sankar, Jie Liu, Yuan Wang and Robert L. Jernigan

Abstract

Predicting conformational changes of proteins is needed in order to fully comprehend functional mechanisms. With the large number of available structures in sets of related proteins, it is now possible to directly visualize the clusters of conformations and their conformational transitions through the use of principal component analysis. The most striking observation about the distributions of the structures along the principal components is their highly non-uniform distributions. In this work, we use principal component analysis of experimental structures of 50 diverse proteins to extract the most important directions of their motions, sample structures along these directions and estimate their free energy landscapes by combining knowledge-based potentials and entropy computed from elastic network models. When these resulting motions are visualized upon their coarse-grained free energy landscapes, the basis for conformational pathways becomes readily apparent. Using three well-studied proteins, T4 lysozyme, serum albumin and sarco-endoplasmic reticular

Ca2+ ATPase, as examples, we show that such free energy landscapes of conformational changes provide meaningful insights into the functional dynamics and suggest transition pathways between different conformational states. As a further example, we also show that

Monte Carlo simulations on the coarse-grained landscape of HIV-1 protease can directly yield pathways for force-driven conformational changes.

21

2.1. Introduction

Proteins are often regarded as the work force of cells, and understanding their actions requires an understanding of their dynamics. Experimental protein structures, whether determined by X-ray crystallography, NMR spectroscopy or by high resolution cryo-electron microscopy [Bartesaghi et al., (2015); Fischer et al., (2015)], shed light about the structure and function of diverse proteins. However the structures individually only provide a static snapshot of the protein. But collectively, multiple structure determinations of the same or closely related proteins can inform us directly about its dynamics. Even mutants, it is now being realized, have structures and motions falling primarily along the same limited dynamics pathways [Haliloglu and Bahar, (2015); Marsh and Teichmann, (2014)]. Wolynes,

Onuchic and Dill [Brooks et al., (2001); Bryngelson et al., (1995); Chan and Dill, (1998);

Cheung et al., (2004); Dill et al., (1997); Frauenfelder et al., (1991); Onuchic et al., (1997);

Schug and Onuchic, (2010); Wolynes, (2005), (1996)] have all pointed out the importance of understanding the energy landscapes. Understanding the dynamic distributions of the different structures and their energetics upon the landscape is a crucial step in understanding structure-function relationship in proteins. Recently, Nussinov and Wolynes [Nussinov and

Wolynes, (2014)] have pointed out how useful it is to interpret biomolecular function within the framework of energy landscapes and can help to explain diverse phenomena ranging from the effects of ligand binding [Miller and Dill, (1997)] to the effects of mutations [Dixit and Verkhivker, (2011); Sapra et al., (2008); Sutto and Gervasio, (2013)] on protein stability.

Predicting dynamics information, given the 3D structure of a protein, has been a topic of a huge body of research. Molecular dynamics [Levitt and Warshel, (1975); Warshel, (1976)] and Monte-Carlo methods [Hansmann and Okamoto, (1999); Metropolis et al., (1953)] are

22 the most commonly employed techniques for extracting such dynamics information. Despite their proven success, these methods remain computationally intensive and limited in the time-scales that can be thoroughly investigated. On the other hand, coarse-grained (CG) methods such as those used in the elastic network models (ENMs) offer a convenient and quick alternative to all-atom models. Coarse-grained ENMs successfully model the dynamics of most proteins, even though the interactions between amino acid residues are represented by extremely simple Hooke’s-law springs. The most popular ENMs are the Gaussian network model (GNM) [I Bahar et al., (1997)] and the anisotropic network model (ANM)

[Atilgan et al., (2001)] In addition to being able to accurately predict residue position fluctuations, the low-frequency modes predicted by ENMs often capture the functionally relevant conformational changes evident in multiple crystal structures, for a wide variety of proteins [Tama and Sanejouand, (2001); Yang et al., (2007)] including even the largest molecular structures such as viral capsids [Kim et al., (2003)] and ribosome [Burton et al.,

(2012); Kurkcuoglu et al., (2009), (2008); Wang et al., (2004)].

The number of available structures in the protein databank (PDB) [Berman et al., (2000)] has been growing exponentially. While there is remarkable diversity in the variety of type of structures in the PDB, many of them are indeed structures of the same protein or its close homologs and many more belong to the same protein fold. These multiple structures of the same or closely similar proteins in many cases provide an excellent sampling of the possible conformational states, analogous to what one would obtain from simulations such as molecular dynamics (MD) or Monte-Carlo. Previous works have shown the close correspondences between motions inherent in sets of structures in the PDB and motions extracted from analysis of MD trajectories [Ichiye and Karplus, (1991)] or predicted motions

23 from theoretical models [Meireles et al., (2011); Yang et al., (2008); L.-W. Yang et al.,

(2009)]. Surprisingly, little effort is being made to systematically explore the conformational space by using the different structures of the same protein already available in the PDB.

Given a set of structures (either experimental or those generated from MD simulations), perhaps the most common method of extracting useful dynamics information is principal component analysis (PCA) [Hotelling, (1933); Pearson, (1901)], and when applied to protein samples generated from MD termed essential dynamics [Amadei et al., (1993)]. PCA is a statistical method based on covariance analysis, which can transform high dimensional data from the original space of correlated variables into a highly reduced space of independent variables (i.e., principal components or PCs). By performing PCA to reduce the dimensionality, most of a system’s variance will usually be captured by a small subset of the

PCs. This is one of the primary advantages of performing PCA - that it greatly reduces the dimensionality of the dynamics space (originally of the order of number of residues) to a few dominant motions of the protein. PCA has been applied extensively to analyze trajectory data from MD simulations to find a protein’s essential motions [Amadei et al., (1999); Hayward and de Groot, (2008)].

Earlier, Howe [Howe, (2001)] used PCA to classify structures in NMR ensembles automatically, according to the correlated structural variations, and the results have shown that two different representations of the protein structure, the Cα coordinate matrix and the

Cα-Cα distance matrix, gave equivalent results and permitted the identification of structural differences between conformations. Teodoro et al. [Teodoro et al., (2003)] applied PCA to a dataset composed of many conformations of HIV-1 protease and found that PCA transformed

24 the original high-dimensional representation of protein motions into a low-dimensional one that provides the dominant protein motions. PCA has also been employed to characterize diverse biomolecular phenomena such as protein folding pathways from MD simulations [G.

Maisuradze et al., (2009); Maisuradze and Leitner, (2006); G. G. Maisuradze et al., (2009)], the mechanism of prion action [Gendoo and Harrison, (2012)], and others.

Recent studies have also shown that the most important motions (PCs) extracted from sets of experimental structures correspond well to the modes predicted by using coarse- grained models such as elastic network models [Meireles et al., (2011); Yang et al., (2008);

L.-W. Yang et al., (2009)]. Software to perform PCA on sets of protein structures is currently supported by software packages such as Maven [Zimmermann et al., (2011a)] from our lab,

ProDy [Bakan et al., (2011)] from the Bahar group as well as Bio3d [Grant et al., (2006)] from the Grant group.

PCs involved in the largest scale motions are often associated with the functional mechanism of a protein [Lange et al., (2008)] and thus also provide a convenient reduced coordinate system upon which to construct energy landscapes as a basis for describing conformational changes, and even to treat protein folding [G. G. Maisuradze et al., (2009)].

Even though the energy landscape of a protein can be rugged and high dimensional

[Frauenfelder et al., (1991)], using the PCs as coordinates for the landscapes can usually reveal the dominant low energy regions and pathways for conformational changes [G.

Maisuradze et al., (2009)]. There have also been recent attempts to use PCA for internal coordinates [Altis et al., (2007)] rather than Cartesian coordinates to construct free-energy landscapes [Mu et al., (2005); Riccardi et al., (2009); Sicard and Senet, (2013)]. Free

25 energies along the PCs are traditionally calculated from the negative logarithm of the probability distribution function of structures along each PC [Maisuradze and Leitner,

(2006)] as ∆퐺 = −푘푇푙표푔푃푖푗 where 푘 is the , T the temperature and 푃푖푗 the joint probability density function of structures along a pair of PCs, 푃퐶푖 and 푃퐶푗. But this assumes that the simulation samples the entire conformational space accessible to the protein, which is not necessarily true. A more accurate picture of the energy landscape can be obtained if the conformational space (at least along the most significant directions of motion) is explicitly sampled and the relative energies of structures in different regions of the landscape can be computed. Here, we propose a new method of combining the PCs from sets of experimental structures with our previously successful free energy estimates

[Zimmermann et al., (2012)] to construct the free energy landscapes of a group of 50 well studied proteins.

The free energy 훥퐺 of a system is defined as 훥푉 – 푇훥푆, where 훥푉 and 훥푆 are measures of the energy and entropy of the system, respectively. Given the difficulties in computing interaction energies for proteins by using first principles, the empirical statistical or knowledge-based potentials have emerged as a convenient method to estimate potential energies of proteins. They have been tested out extensively at the CASP (Critical Assessment of Structure Prediction) competitions [Moult et al., (2014)], and have proven themselves to be superior to other types of potentials. Knowledge-based potentials are calculated based on the preference of amino acid contacts between different residues in a database of known structures under the assumption that the global free energy minimum is the native structure of the protein. Pairwise (two-body) statistical contact potentials were pioneered by Tanaka and

Scheraga [Tanaka and Scheraga, (1976)] and subsequently developed and extended by

26

Miyazawa and Jernigan [Miyazawa and Jernigan, (1996), (1985)] and Sippl [Sippl, (1990)].

Since then, with increased availability of structures in the PDB, many different two-body potentials have been developed and have found applicability for a variety of protein problems ranging from protein tertiary structure prediction [Kihara et al., (2009); Kryshtafovych and

Fidelis, (2009)] and protein-protein interaction prediction [Ritchie, (2008); Vajda and

Kozakov, (2009); Vakser and Kundrotas, (2008)] to [Gerlt and Babbitt,

(2009); Mandell and Kortemme, (2009)].

The dense packing of residues in globular proteins means that two-body potentials are likely not sufficient to capture the 3-dimensional cooperative nature of multiple interactions

[Betancourt and Thirumalai, (1999); Czaplewski et al., (2003), (2000)], and it has been suggested that higher-body potentials are necessary for tasks like protein structure prediction.

To address this, three-body [Munson and Singh, (1997)] and four-body potentials

[Krishnamoorthy and Tropsha, (2003)] have been developed. Our own four body potentials

[Feng et al., (2010), (2007)] capture the cooperative nature of interactions among amino acid residues in addition to incorporating differences between buried and exposed residues and the interactions between backbone and side chains. In addition, we have also developed an optimized potential function [Gniewek et al., (2011)] combining the long-range four body potentials with short-range potentials [I. Bahar et al., (1997b)]. This optimized potential when combined with entropy measures obtained from coarse-grained computational methods such as the elastic network models (ENMs) [Zimmermann et al., (2012), (2011b)] can provide estimates of free energy that have already proven to be extremely powerful in identifying native protein-protein complexes from sets of docked poses [Zimmermann et al.,

(2012)]. We therefore combine information about preferred directions of motions from PCs

27 with free energy information to present coarse-grained free energy landscapes for proteins.

These show the pathways for the limited conformational changes described by the set of dominant motions.

The paper is organized as follows: First we discuss how to collect a dataset of proteins for this type of analysis and how to construct free energy landscape for these proteins by combining principal components and free energy estimates. Then we analyze and discuss in detail the energy landscapes of three well known proteins and discuss how the energy landscapes can be interpreted in the context of the motions extracted from each dataset. As a further step, we also show how Monte-Carlo simulations on these coarse-grained free energy landscapes can provide transition pathways for force-driven conformational changes in proteins.

2.2. Theory and Methods

2.2.1. Datasets

The PDB [Berman et al., (2000)] provides a clustering of all the chains by using CD-HIT

[Fu et al., (2012); Li and Godzik, (2006)] at different levels of specified sequence similarity.

In order to identify all the structures which are highly similar to one another in the PDB, we have utilized clusters obtained at 95% sequence similarity cutoff, from the PDB (as of Nov

2014). In other words, all protein chains in each cluster are at least 95% identical in sequence to each other. After obtaining these clusters, only monomeric proteins were retained for the analysis. However with more careful alignment of oligomers, this methodology can handle multimeric proteins as well. Each of the members of these sets are aligned using the multiple structural alignment tool MUSTANG [Konagurthu et al., (2006)] and the alignment is

28 manually edited to remove any obvious mismatches or indels. Proteins within each set often have stretches of residues lacking position coordinate information (resulting in gaps in the alignment), and these structures have been removed from the sets. Guided by the multiple structural alignment (MSA), the PDB files of the structures are processed using our own Perl scripts to retain only residues present in all the structures within each set (i.e. not including positions having gaps in the MSA). Care is taken so as not to include any structures having gaps in the middle of the protein. This processed dataset of the position coordinates for each residue in the set of proteins constitutes the data used to perform PCA. Following this selection process, we obtain 50 proteins from which at least 45 structures are retained. The complete list of PDB IDs for all the 50 sets of proteins used in this study are provided in

Supplementary Table S1 and the distribution of RMSDs within the dataset of structures is provided in Supplementary Fig. S1.

2.2.2. Principal component analysis (PCA)

α The dataset for PCA, 횵푛 × 푝 is the matrix of position coordinates (x, y, and z) of the C atoms in an aligned set of proteins for n structures each having the total number of variables,

푝 = 3푁 where 푁 is the number of residues in each structure. Then the 푝 × 푝 dimensional variance-covariance matrix 푪 has elements

푛 ̅ ̅ 푐푖푗 = ∑푘=1(휉푘푖 − 휉푖)(휉푘푗 − 휉푗)/(푛 − 1) ∀ 1 ≤ 푖, 푗 ≤ 3푁 (2.1)

Each diagonal term is the variance of each position coordinate and the cross diagonal terms

푡ℎ 푡ℎ are the covariances. Here, 휉푘푖 refers to the value of the 푖 variable (x, y, or z) for the 푘

̅ 푡ℎ structure in the dataset 횵푛 × 푝 and 휉푖 refers to the mean of the 푖 variable. The covariance matrix 푪 can be decomposed as 푪 = 푬∆푬푇 where the columns of 푬 are the eigenvectors

29

풆푘 ∀ 1 ≤ 푘 ≤ 3푁, which are the linearly independent, orthogonal vectors along directions of the variations in the data and the eigenvalues are the elements of the diagonal matrix ∆. The eigenvalues are sorted in order, and each eigenvalue is directly proportional to the amount of the variance it captures. The projections of the points on each eigenvector are called the principal components (PCs) and are obtained as columns of the matrix 푷푛×3푁 = 횵푛×3푁 ×

푬3푁×3푁. The PC scores are calculated as projections of the mean centered data onto the PCs,

푇 푇 obtained as columns of the matrix 푷푛×3푁 = (흃 − (⃗1 푝×1 × 흃 )) × 푬3푁×3푁 where 흃 is the transpose of the mean vector of position coordinates. The 푖푡ℎ row of the matrix 푷 correspondingly gives the PC scores of structure 푖 in the dataset.

2.2.3. Knowledge based potential functions

The potential energies for the structures are estimated as an optimized linear combination of three different in-house statistical potential functions: four-body sequential potential [Feng et al., (2007)], four-body non-sequential potential [Feng et al., (2010)] and short-range potentials [I. Bahar et al., (1997b)]; as in our previous work [Zimmermann et al., (2012)].

Four-body refers to close groups of four amino acids that can interact.

푉표푝푡 = 푉4−푏표푑푦 푠푒푞 + 0.28 ∗ 푉4−푏표푑푦 푛표푛−푠푒푞 + 0.22 ∗ 푉푠ℎ표푟푡 푟푎푛푔푒 (2.2)

The weights for the four-body sequential and four-body non-sequential potential terms were obtained previously [Gniewek et al., (2011)] by minimizing the RMSD of best decoys from targets of CASP8 [Cozzetto et al., (2009)] to their corresponding native structures using particle swarm optimization (PSO) [Kennedy and Eberhart, (1995)]. Please refer to our previous work [Gniewek et al., (2011)] for more details about how the weights for each potential terms were optimized.

30

2.2.4. Structural entropy evaluation

In order to obtain a reliable measure of the entropy of a system, we resort to coarse- grained models of protein dynamics referred to as elastic network models (ENMs) [Atilgan et al., (2001); Bahar et al., (2010); I Bahar et al., (1997); Tirion, (1996)]. In ENMs, the molecules are represented using bead-spring models in a simplified manner (for the coarse- grained cases usually the beads are the Cα atoms of proteins, i.e., one bead per residue, which is what has been used here) and are assumed to interact with only the physically close beads

(within a specified distance cutoff, taken here as 7 Å). Here we specifically use the Gaussian network model (GNM) [I Bahar et al., (1997)] in which the equilibrium fluctuations of the beads are assumed to be isotropic and normally distributed. The spring stiffness (훾) between all the beads is assumed to be the same (훾 = 1). The potential energy of the system is then simply proportional to the sum of squares of displacements of all the beads from their equilibrium positions. Mean square fluctuations of the Cα atoms computed from the GNM

(obtained as diagonal elements of the pseudoinverse of the connectivity or Kirchoff matrix) have been shown to agree well with the experimental temperature factors for many different crystal structures, and also to agree with the variabilities observed in sets of structures [Yang et al., (2008); L. W. Yang et al., (2009)]. The entropies for the structures are directly computed as the sum of mean square fluctuations of all the Cα atoms [Zimmermann et al.,

(2012)] as computed with the GNM:

1 ∆푆 ∝ Γ−1 = ∑푁 (푀 푀푇) (2.3) 푖=2 휆 푖 푖

푡ℎ where 푁 is the number of residues in the structure, 푀푖 is the 푖 mode vector from the GNM,

휆 the corresponding square frequency, Γ the system’s Kirchoff or connectivity matrix, and Γ-1 its pseudo-inverse.

31

2.2.5. Construction of free energy landscapes

The first few eigenvectors from PCA capture the most important directions of motions from the set of structures, and these provide convenient coordinates for constructing free energy landscapes. By using the PC vectors, representative structures can be sampled along the first few eigenvectors under the assumption of linearity provided the conformational changes are not overly large. The distribution of structures along the PC axes (the mean- centered projections of the structures onto the eigenvectors) indicates the similarities and dissimilarities between the various structures in the dataset. Usually there are clusters within the dataset, by viewing their distribution.

In order to obtain a free energy landscape, we choose to focus on the most important motions, along the PC1-PC2 coordinates, considered as grid points. Consider a dataset of (x, y, z) coordinates, Ξ푛 × 3푁 of 푛 structures with 푁 residues each. Performing PCA on this dataset as described above yields 3푁 eigenvectors 풆푘 ∀ 1 ≤ 푘 ≤ 3푁. For this study, we consider only the first two eigenvectors (풆1 and 풆2) which capture the largest fraction of the variance in the data of any pair of such coordinates. Representative structures ware sampled uniformly at equally spaced points along the PC1 and PC2 directions to give yield a rectangular grid where the extrema of the grid are dictated by the extrema in the PC scores of all of the crystal structures. For this, the coordinates of each representative structure on the grid are obtained relative to the coordinates of a central structure (closest to the origin) on the grid, 푅0. The 3D coordinates 푹1×3푁 of a structure 푅 on the PCi-PCj grid at position (푅푖, 푅푗) are obtained using the coordinates of the central structure on the grid 푅0 as

0 0 0 푹1×3푁 = 푹1×3푁 + (푅푖 − 푅푖 ) × 풆푖 + (푅푗 − 푅푗 ) × 풆푗 (2.4)

32

0 0 where (푅푖 , 푅푗 ) are the PCi-PCj scores of the central structure on the PC grid and 풆푖 and 풆푗 are the eigen vectors corresponding to PCi and PCj.

The free energy of a representative structure is measured as

훥퐺 = 훥푉 – 푎훥푆 (2.5)

where the energy contribution 훥푉 is obtained from 푉표푝푡 (as in Eq. 2.2) and the entropy contribution 훥푆 is obtained from the GNM fluctuations (Eq. 2.3). The value of a cannot be determined universally for all proteins because the entropy term depends on various factors such as the size of the protein. The value of a is taken to be a variable and is optimized for each protein as the value that places the largest number of structures in lowest energy regions of the landscape, as discussed in the Results section.

Once the free energies for each of the representative structures is computed, the values are visualized as a contour along the PC1-PC2 coordinate space and the contour plot is colored spectrally according to the order VIBGYOR (with violet corresponding to regions of lowest energy and red corresponding to regions of highest energy). The experimental structures are plotted in this space on top of the contours. Usually the experimental structures fall into lower free energy regions of such a contour plot, subject to some uncertainties arising from additional conformational variabilities from additional PCs beyond the first two that are being ignored.

2.2.6. Generation of a transition path between two structures on the free energy landscape

In order to show an example of how to obtain the transition pathway between two different forms of a protein we have chosen to perform force applications using Monte-Carlo

33 simulations on HIV-1 protease. This approach builds on the Hessian matrix computed from coarse-grained ANM and generates a displacement vector in response to an external force perturbation vector based on linear response theory [Atilgan and Atilgan, (2009); Ikeguchi et al., (2005)], to relate the response behavior to the equilibrium fluctuations in the unperturbed state. This displacement vector can be represented as:

−1 Γ푖 ∙ 퐹푖 = ∆푅푖 (2.6) where the matrix Γ−1 is equivalent to the inverse Hessian and 퐹 is the external force vector applied on residue 푖 with component directions (퐹푥, 퐹푦, 퐹푧), and ∆푅푖 is the displacement vector in Cartesian coordinates for residue i.

We have developed a pipeline (unpublished) to perform randomly directed force perturbations at sites where exothermic events occur. To understand the conformational changes in HIV-1 protease, where the binding process itself is exothermic [Kožíšek et al.,

(2007); Ohtaka et al., (2003)] we have added forces on the residues close to the flaps, where the major conformational changes take place. Any extremely large forces that could rupture bonds would clearly fall outside the range of linear responses, so we apply small iterative forces. In this way, we will avoid large disruptions, but permit new contacts between two nodes to form during a transition. We use a Metropolis Monte Carlo approach [Metropolis and Ulam, (1949)], which follows a series of steps (deformations) that are mostly downhill on the energy landscape, but with occasional uphill steps . Instead of accepting all steps during a simulation, we accept some and reject others using the Metropolis decision criterion.

We have integrated this Monte Carlo (MC) scheme with our elastic network based force perturbation method.

34

The Metropolis decision criterion uses only the four-body potential energy of the newly generated state 푚 in comparison with the four-body energy of the previous state:

1, 푉푚 ≤ 푉푚−1 푝 = { 푉 −푉 (2.7) 푒푥푝 (− 푚 푚−1) , 푉 > 푉 푘푇 푚 푚−1 where 푝 is the probability for accepting the newly generated structure in the MC simulation.

푉푚 the four-body energy of the newly deformed structure, 푉푚−1 the four-body energy of the previous structure, 푘 the Boltzmann constant and 푇 the temperature. In other words, any newly generated conformation lower in energy than the previous conformation will always be accepted while the probability of accepting a newly generated conformation is lower if the newly generated conformation has a four-body energy higher than at the previous step.

2.3. Results

2.3.1. Distribution of crystal structures in low energy regions of the landscape

One of the principal aims of this study is to learn whether the crystal structures are located in low free energy regions of the landscape. If the experimentally determined structures do reside in the low free energy regions on the landscape, this supports the conformational selection point of view for the protein under study. In other words, we can assume that the protein is in a state of dynamic inter-conversion between the conformations corresponding to the low free energy regions and different triggering events such as binding of a ligand, introduction of a mutation or a chemical reaction may shift the equilibrium in favor of some slightly different conformations.

35

In order to test this hypothesis, we choose 50 proteins of interest (selected on the basis of having at least 45 experimental structures each). Next, we construct the free energy landscape for the proteins by computing the free energies of the structures obtained by deforming the structures along the first two pairs of PCs on an equally spaced rectangular grid. Let us assume that the entire grid produces a range of free energies from 퐺푚푖푛 (lowest free energy) to 퐺푚푎푥 (highest free energy). The free energy of each crystal structure is assumed to be that of the closest grid point. We then consider a set of percentiles 퐺푖 ∀ 푖 ∈

{0,5,10, … .100} of the crystal structures on the free energy range of the whole grid. If the free energy of the crystal structures were predominantly in low energy regions of the entire landscape, we would expect the higher percentile values to be closer to the lowest free energy on the grid, 퐺푚푖푛. For this, we compute the normalized energy difference 훿푖 from 퐺푚푖푛 for each percentile value 퐺푖 relative to 퐺푚푎푥 (the highest free energy in the grid):

퐺푖−퐺푚푖푛 훿푖 = , 푖 ∈ {0,5,10, … .100} (2.8) 퐺푚푎푥−퐺푚푖푛

We then plot the scaled percentile rank 푖/100 against the normalized energy difference of each percentile value, 훿푖 ∀ 푖 ∈ {0,5,10, … .100} (Eq. 2.8). This plot can be considered analogous to the receiver operating characteristic (ROC) curve used in : for higher percentile rank 푖 corresponding to lower 훿푖, the curve is shown in Fig. 2.1a. As in the

ROC, we can use the area under the curve (AUC) as a measure of the tendency for experimental structures to lie in low energy regions. Higher AUC values mean that the energies of the experimental structures with respect to the entire landscape grid are lower.

For each of the 50 sets of proteins, AUC values were calculated for different values of the entropy weight ‘a’ from Eq. 2.5 to find an optimal value for a. Fig. 2.1a shows the plot of

2+ percentile rank 푖 vs 훿푖 curve for sarco-endoplasmic reticular Ca ATPase. The optimum

36 value of a obtained is 1.35 with an AUC (red curve) of 0.84 vs. 0.81 (blue curve) when the entropy term was not included.

Figure 2.1. Measures of the distribution of the experimental structures in the low free energy regions of the landscapes. (a) Plot of percentile rank 푖/100 against the normalized free energy difference 훿푖 from the lowest free energy in grid for sarco-endoplasmic reticular Ca2+ ATPase (SERCA). Without including the entropy term, the area under the curve (AUC) is 0.81 (thin line) while for the entropy weight 푎 = 1.35, the AUC increases to 0.84 (thick line). (b) Plot of AUC (sorted) for optimal weight of the entropy term for all 50 proteins investigated in this study. The AUC for 43 out of 50 cases is above 0.5 suggesting that the crystal structures are located in lower energy regions of the free-energy landscape.

Supplementary Table S2 shows the maximum AUC values and the corresponding optimal values of a for all 50 proteins under study. If the crystal structures were not found to be preferentially located in low energy regions of the landscape, then the curve would be close to the diagonal from the origin which would result in an AUC of 0.5. In our dataset we find that 43/50 proteins (86%) show an AUC above 0.5 (Fig. 2.1b). Interestingly, for a number of proteins, including the entropy term does not improve the AUC; whereas for some, it does improve the behavior significantly. We hypothesize that for at least those cases that improve the AUC when the entropy term is included; there is a significant entropic contribution to the conformational change. In the following sections, we discuss in detail the

37 energy landscapes derived from the sets of experimental structures for three well studied proteins: lysozyme, serum albumin and sarco-endoplasmic reticular Ca2+ ATPase (SERCA).

2.3.2. Case study I: T4 lysozyme

Lysozyme is an enzyme found in various plants and animals and is primarily used as a first line of defense against bacteria. In humans, it is found in many bodily secretions including saliva, tears, mucus and milk as well as the secondary (granulocyte specific) granules of neutrophils and serves as a part of the innate immune system. It causes bacterial lysis by hydrolyzing the 1,4-β-glycosidic linkages between the N-acetyl muramic acid

(NAM) and N-acetyl D-glucosamine (NAG) residues in peptidoglycan cell walls of bacteria

[Anheim et al., (1973)]. Several types of lysozymes have been identified in diverse organisms, but the most important classes of lysozymes are the chicken-type (C-type), virus type (V-type) and goose type (G-type).

Discovered by Fleming in 1922 [Fleming and Allison, (1922); Fleming, (1922)], lysozyme was not only one of the first proteins whose 3D structure was solved using X-ray crystallography [Blake et al., (1965); Johnson and Phillips, (1965)] but also a first protein for which a detailed catalytic mechanism was proposed. Since then, more than 1500 structures of different members of the lysozyme superfamily have been determined using X- crystallography and NMR spectroscopy. After filtering structures with missing residues and outliers, we obtain 218 structures for human lysozyme (C-type), 183 structures for T4 lysozyme (V-type) and 586 structures for hen egg-white lysozyme (C-type). Here we discuss results for the set of T4 lysozyme structures. The crystal structure [Matthews and Remington,

(1974)] of the T4L protein (162 residues) shows that it is comprised of two domains, the N- terminal domain (residues 15-65), and the C-terminal domain (residues 80-162) connected by

38 an inter-domain helix (residues 66-80) with a deep cleft between them where the peptidoglycan backbone of the bacterial cell wall binds. PCA on the set of 183 T4 lysozyme structures results in the first three PCs capturing an unusually high fraction of the variance in the first three PCs, with 78%, 5% and 2% of the total variance, respectively (Fig. 2.2a). Both

PC1 (Fig. 2.2c) and PC2 (Fig. 2.2d) correspond to combinations of hinge bending motion of the two domains with respect to each other and a twisting of the domains (refer to

Supplementary Movies S1 and S2 for animations of the PCs).

Figure 2.2. Bacteriophage T4 lysozyme. (a) Percentage of variance captured by the first 10 PCs from a set of 183 T4 lysozyme structures. (b, c) Visualization of PC1 and PC2 on the protein structure (thick black arrows) as a combination of hinge-bending and twisting motions of the N-term domain (blue) with respect to the C-term domain (red). (d) Energy Landscape of human lysozyme along the PC1-PC2 coordinates (entropy weight 푎 = 0). Crystal structures are denoted by white hexagons. The large cluster at lower values of PC1 corresponds to closed structures whereas the open structures are more broadly scattered along PC1 and PC2.

39

The difference between the two PCs is that the motions are at an angle of approximately

900 relative to one another. The hinge-bending motion between the two domains in T4L has previously been well documented as an intrinsic property of T4L based on experimental structures of various mutants [Dixon et al., (1992); Faber and Matthews, (1990); Zhang et al.,

(1995)]. This motion was also reported from MD simulations [Arnold and Ornstein, (1997);

Arnold et al., (1994); de Groot et al., (1998)] and shown to be highly similar to the principal motions extracted from a set of crystal structures [de Groot et al., (1998)]. In addition, this motion was also characterized extensively in both hen-egg white [McCammon et al., (1976)] and human lysozymes [Gibrat and Go, (1990)] using normal mode analysis. The hinge- bending motion of the domains has been considered to be the functional motion for the entry of substrate and the release of products.

Upon projecting the structures onto the PCs (mean centered projections, also referred to as PC scores), it can be seen that most of the structures fall into a low energy cluster located at low values of PC1. The free energy landscape (as discussed in Methods) along PC1-PC2 is shown in Fig. 2.2b. These are the structures where the two domains are ‘closed’ with respect to one another and correspond to a conformation with bound ligand where the protein can be considered ‘closed’. On the other hand, the ‘open’ forms of T4L are scattered along PC2 for a range of higher values in PC1. This is quite different from what we have observed for many other proteins where there are tighter clusters of open and closed forms. This broader unusual distribution possibly suggests that the two hinge motions may be coupled to each other and that at higher values of PC1, the structures can be sampled uniformly along each of the two

PCs. The AUC was 0.69 suggesting that the crystal structures fall into low energy regions of the energy landscape.

40

2.3.3. Case study II: human serum albumin (HSA)

Serum albumin (HSA) is the most abundant blood protein in mammals and is essential for maintaining the proper osmotic balance between body fluids inside blood vessels and tissues [He and Carter, (1992)]. It is also the primary carrier of many hydrophobic molecules

[Sugio et al., (1999)] in the blood such as steroids, fatty acids, thyroid hormones and hemin and also transports certain metal ions like Cu2+ and Ca2+. Structurally, HSA is a globular protein (585 amino acids) comprised of several helices organized into three domains [Carter et al., (1989)]: domain I (residues1-195), domain II (residues 196-383) and domain III (384-

585), which are homologous in both sequence and structure but arranged in an asymmetric fashion. Each of these domains can be divided into subdomains A and B where the subdomains IA, IB and IIA can be thought of as forming a head for the molecule with

IIB,IIIA and IIIB forming a tail [Carter et al., (1989)] giving the protein overall a heart shape

[Sugio et al., (1999)].

The versatility of serum albumin to bind diverse water insoluble ligands ranging from fatty acids to metal ions is attributed to the diverse binding sites present on its domains.

There are at least six major sites where ligand association occurs. Of the various ligand binding sites, the one on subdomain IIIA is the most active and preferentially accommodates several ligands [Dockal et al., (1999)]. The primary binding sites for fatty acids and bilirubin are IIA and IIIA with their pockets located in similar regions containing hydrophobic side chains and gated by two helices A-h5 and A-h6. It is believed that the binding ability of these pockets is due to the strategic positioning of W214, K199 and Y411 which limit accessibility to solvent [He and Carter, (1992); Sugio et al., (1999)]. In addition, since IIA and IIIA share

41 a common interface, the binding of ligands to one of the domains can affect the conformation and binding ability of the other.

We perform extensive analysis on a set of 99 structures of HSA for the stretch of residues

5-558 with no gaps. PCA on this set results in PC1, PC2 and PC3 capturing 85%, 7% and 2% of the total variance respectively (Fig. 2.3a). In PC1, domain I rotates as a single unit relative to domain III providing access to the ligand binding pocket within subdomain IIIA (Fig.

2.3c). PC2 involves a motion of subdomain IIIB relative to subdomain IB, providing access to the ligand binding site on IB. In addition, PC2 also involves a breathing motion of the helices A-h5 and A-h6 of subdomain IIIA, which is most likely responsible for the gating of this versatile pocket (Fig. 2.3d). It is worth noting that both PC1 and PC2 are motions involved in restricting access to the crucial IIIA binding pocket (see animations of the PCs in

Supplementary Movies S3 and S4.

PC1 and PC2 separate the set of 99 structures into three primary clusters (Fig. 2.3b), with one cluster at high values of PC1 corresponding to structures with the domain I rotated and open to provide access to the domain IIIA binding pocket; a second cluster at low values of

PC1 and high values of PC2 (structures with domain III closed and blocking access to the IB binding site) and a third cluster at low values of PC1 and low values of PC2 (representing structures with domain III open). We construct the free energy landscape for this set of proteins and obtain an AUC of 0.77 suggesting that a majority of the crystal structures fall into the minima of the free energy landscape. In addition, the landscape also clearly shows possible low energy transition paths between the different clusters.

42

Figure 2.3. Human serum albumin (HSA). (a) Percentage of variance captured by the first 10 PCs from the set of 99 HSA structures. (b) Visualization of PC1 on the protein structure - Domain I (red + magenta) rotates and moves away from domain III (blue + cyan) providing access to the ligand binding site on subdomain IIIA (cyan). (c) Visualization of PC2 – subdomain IIIB (blue) moves away from subdomain IB (red) providing access to its ligand binding site. In addition, the two helices governing access to the binding site on subdomain IIIA (cyan) open and close in a breathing motion. (d) Energy Landscape of HSA along PC1- PC2 (entropy weight 푎 = 0). Crystal structures are denoted by white hexagons. The two largest clusters are clearly located in lowest energy regions (see free energy scale on the right hand side, from blue favorable to red unfavorable).

2.3.4. Case study III: sarco-endoplasmic reticular Ca2+ ATPase (SERCA)

SERCA is a Ca2+ ATPase found on membranes of the sarcoplasmic reticulum (SR) in muscle cells. The primary function of SERCA is the reuptake of Ca2+ ions (an active transport process) from the cytosol of muscle cells into the lumen of the SR (for internal storage of Ca2+) during muscle relaxation using energy derived from ATP hydrolysis. In

43 other words, it is essential for maintaining a proper concentration of Ca2+ in the cytosol of muscle cells. There are several isoforms of SERCA encoded by three different genes which were reviewed in detail by Misquitta et al [Misquitta et al., (1999)].

Early on, site–directed mutagenesis [Clarke et al., (1990a), (1990b), (1989); Vilsen et al.,

(1991)] and cryo-electron microscopy [Toyoshima et al., (1993)] have elucidated extensive information about the structure and function of the various domains of the protein. The 994 residue protein is an integral membrane protein consisting of a large head on the cytoplasmic side, a small flexible stalk and a transmembrane (TM) domain comprised of 10 TM helices and associated loops in the lumen of the SR. A crystal structure [Toyoshima et al., (2000)] of the SERCA1a isoform (most abundant form) from rabbit fast-twitch skeletal muscle revealed that the cytoplasmic head consists of three domains: domain A (actuator) involved in the gating mechanism regulating the binding and release of Ca2+, domain N (nucleotide-binding) that binds ATP and ADP; and domain P (phosphorylation) containing residue D351, which is phosphorylated as part of the transport cycle reaction. A transport mechanism has been described [MacLennan et al., (1997)] in the form of a cycle to consist of two main conformations E1 and E2 where the E1 (open) conformation has high affinity for Ca2+ and binds it from the cytoplasm whereas the E2 (closed) conformation has low affinity for Ca2+ and releases it into the SR lumen. The transition from E1 to E2 proceeds through the phosphorylated states E1P and E2P and involves large conformational rearrangements and rotation of the N and A domains.

Several structures of SERCA are available from the PDB that sample multiple conformational states of the transport cycle which makes its analysis by PCA worthwhile.

We compiled a dataset of 63 structures of rabbit SERCA1a and performed PCA on this set,

44 which results in the PCs 1-3 capturing ~57%, 27% and 11% of the total variance respectively

(Fig. 2.4a). PC1 when visualized appears as a twisting motion of the actuator and nucleotide- binding domains whereas PC2 corresponds to a hinge-bending motion of the actuator and nucleotide-binding domains toward each other (Fig. 2.4c, d). Since the A-domain is linked to three helices of the TM domain through highly flexible linkers, it has been suggested previously that the rotation of the A domain could play a key role in the rearrangement of helices that open the gate to release Ca2+ into the lumen [Nagarajan et al., (2012)] (see

Supplementary Movies S5 and S6 for animations of the PCs.

When the structures are projected onto PC1 and PC2, they distinctly separate into two major clusters: one cluster at low values of PC1 and PC2 corresponding to E2 (closed) structures and another at high values of PC1 and PC2 corresponding to E1 (open) structures.

Two minor clusters are also observed at high values of PC1 and low values of PC2, and these correspond to structures where the A and N domains have rotated, but a hinge-bending motion between the two domains has not occurred. The free energy landscape obtained from our analysis is shown in Fig. 2.4b. The optimum weight for the entropy term obtained is 1.35 corresponding to an AUC of 0.84, again suggesting that most of the crystal structures fall into low energy regions of the energy landscape. One interesting feature of this landscape that differs from those of other proteins investigated is that the low energy basins corresponding to clusters are not connected to others by low free energy paths. This can be understood by interpreting the landscape in the context of the SERCA transport cycle which requires external energy in the form of ATP. This further shows that these coarse-grained free energy landscapes are powerful enough to identify high energy barriers that cannot be

45 crossed without significant additional energy (e.g. ATP or GTP driven mechanisms in proteins).

Figure 2.4. Sarco-endoplasmic reticular Ca2+ ATPase (SERCA). (a) Percentage of variance captured by the first 10 PCs from the set of 63 SERCA structures. (b) Visualization of PC1 – twisting motion of the N (green) and A (red) domains against each other whereas the TM domain (gray) remains relatively rigid (c) Visualization of PC2 as an opening-closing motion of the N and A-domains towards each other. (d) Free energy landscape of the molecule along PC1-PC2 (entropy weight 푎 = 1.35). Crystal structures are denoted by white hexagons.

2.3.5. Predicting the transition pathway between the open and closed forms of HIV-1 protease

When there are two or more distinct conformations for a protein, it becomes important to understand how the protein passes between these conformations. For example, many proteins have a ‘closed’ conformation after they bind their ligands and an ‘open’ conformation when

46 they have released the ligands. Using the intensely studied protein HIV-1 protease as an example, we show that transition paths between the open and closed conformations can be predicted by using the free energy landscapes.

HIV-1 protease is a retroviral aspartyl protease responsible for cleaving newly synthesized polyproteins to produce mature proteins in the infectious HIV virion. The protein is composed of two symmetrical identical subunits (each 99 residues long) [Navia et al.,

(1989)]. Each monomer consists of three domains: a flap domain (residues 33-62), a core domain (10-32 and 63-85) and a terminal domain (1-4 and 96-99). The active site is composed of the D25-T26-G27 amino acid triad from both the monomeric units and the protein functions only in the dimeric form.

Given its importance as a primary target for HIV therapy, more than 300 structures of this protein have been solved using X-ray crystallography in complex with diverse ligands. In addition, this protein has been a subject of extensive study by computational simulations, especially molecular dynamics [Chang et al., (2007), (2006); Tozzini and McCammon,

(2005); Tozzini et al., (2007); Trylska et al., (2007)]. Previous work [Yang et al., (2008)] from our lab has shown that the principal motions extracted from sets of X-ray and NMR structures or snapshots from MD simulations of the protein agree well with the motions predicted by ANM. Crystal structures of mutants as well as MD simulations have identified distinct closed and open conformations of the protein. The flaps are assumed to open up, allowing for the binding of substrate and the release of products. Here, we discuss the transition between the open and closed forms within the context of free-energy landscapes generated using a set of 304 experimental structures of the protein.

47

Figure 5. Predicted conformational transition pathway for HIV-1 protease. (a) Percentage of variance captured by the first 10 PCs from the set of 304 HIV-1 protease structures. (b) Visualization of PC1 – Opening and closing of the flap domains (red) against the core domain (blue). The terminal domain is shown in green. (c) Visualization of PC2 – Twisting motion of the flaps (red) (d) Free Energy Landscape of the molecule along PC1-PC2 (entropy weight 푎 = 1.3). Crystal structures are denoted by black hexagons while intermediate structures along the predicted transition pathway are shown as magenta diamonds. The predicted transition pathway follows a relatively low-energy path on the landscape along a diagonal path and passes close to several experimental intermediate forms.

The PCs obtained from a set of 304 structures are shown in Fig. 2.5. The first three PCs capture 30%, 21% and 7% of the total variance respectively (Fig 2.5a). PC1 is an opening and closing motion of the flaps resulting in significant changes for the ligand binding space

(Fig. 2.5c). PC2 (Fig. 2.5d) is a twisting motion of the flaps (see animations in

Supplementary Movies S7 and S8. When the intermediate structures along the transition pathway (discussed in methods section F) are projected onto the free energy landscape (Fig

48

2.5b) from the set of structures, it can be seen that they fall on a relatively low free energy path between the two conformations. There are a few energy barriers which the protein crosses to reach the final state, but most interestingly the transition path passes through the regions of the landscape where experimental structures are located. Recall that in this Monte

Carlo simulation only energies, and not entropies have been considered in making the decisions for the steps taken, so the path when plotted on the free energy surface does not follow the lowest free energy path. This suggests that the free energy landscapes obtained by the use of this method can guide the probable transition pathways between structures.

2.4. Conclusions

In this work, we have exploited the availability of multiple structures for groups of closely related proteins in the PDB to understand conformational changes in the context of their free energy landscapes constructed by combining knowledge based potential functions with entropy terms from elastic network models. By using principal components as a suitable coordinate system for landscape construction, we have been able to map out the free energetics of conformational changes along the most important directions of motion for several proteins. It has been found that most of the crystal structures tend to lie in regions of relatively low free energies. However, we also find cases where there were lower free energy regions on the landscape where a structure has not yet been observed. In principle, for cases such as these it may be possible to pursue these analyses to suggest mutants that would occupy these lower free energies regions.

Further investigations are required to establish with certainty whether the conformational changes from higher order less important principal components affect in any significant way

49 the free energy landscapes. The cases where the first few principal components are dominant should be the most reliable cases, but approximations to account for the effects of some higher order, less important motions can be developed in future studies.

Our analysis also sheds light on the two contrasting views about conformational changes in proteins: the conformational selection hypothesis or induced fit. According to the conformational selection hypothesis, proteins exist in equilibrium among their different conformations and a trigger (such as a binding event) causes a shift in the equilibrium towards one of the states. This can be contrasted with the induced-fit hypothesis where the protein is assumed to exist in one conformation only and where a triggering event such as binding induces a change in conformation of the protein. We find from our analysis of a set of 50 proteins that most of the crystal structures do occur in regions of relatively low free energy on coarse-grained landscapes. With the exception of a few cases (e.g. T4 lysozyme), the structures are clustered along the PC coordinate and each of these clusters can be considered to represent a conformation of the protein. Further, the clusters seem to occupy a low free energy basin within the conformational space and are often connected to each other through narrow low free energy paths (which suggest possible transition paths between the conformations), as can be seen from the landscapes of T4 lysozyme, serum albumin or HIV-1 protease. However in a few cases (e.g. SERCA), the clusters are separated from each other by high energy barriers. These can be considered to represent cases that require extra energy

(from ATP or GTP interactions) which is not considered in our calculations. In summary, our analysis suggests that such coarse-grained free energy landscapes of proteins can be used to shed light on the extent to which conformational selection or induced fit is operative in a system. From the present point of view, interpretation of the difference between

50 conformational selection and induced fit can be made directly from the free energy landscapes. Whenever the conformations are accessible without requiring passage over high energy barriers, this would be conformational selection, but when there are high free energy barriers this would require induced fit arising from favorable interactions with the ligand.

2.5. Acknowledgments

This research was supported by NIH grant R01-GM72014 and NSF grant MCB-1021785.

K.S. was also supported by fellowship funds from the Office of Biotechnology, Iowa State

University.

2.6. Supplementary Material

Supplemental information which includes Supplementary Tables S1-S2, Supplementary

Figure S1 and Supplementary Movies S1-S8 are available online from J. Chem. Phys. at http://dx.doi.org/10.1063/1.4937940.

51

CHAPTER 3. KNOWLEDGE-BASED ENTROPIES IMPROVE THE

IDENTIFICATION OF NATIVE PROTEIN STRUCTURES

(Manuscript submitted to Proc. Nat. Acad. Sci.)

Kannan Sankar, Kejue Jia and Robert L. Jernigan

Abstract

Evaluating protein structures requires reliable free energies with good estimates of both potential energies and entropies. Although there are many demonstrated successes from using knowledge-based potential energies, computing entropies of proteins has lagged far behind. Here we take an entirely new approach and evaluate knowledge-based conformational entropies of proteins based on the observed frequencies of contact changes between amino acids in a set of 167 diverse proteins having two alternative structures. The results show that charged and polar interactions break more than hydrophobic pairs. This pattern correlates strongly with the average solvent exposure of amino acids in globular proteins, as well as with polarity indices and the sizes of the amino acids. Knowledge-based entropies are derived by using the inverse Boltzmann relationship. Including these new knowledge-based entropies almost doubles the performance of knowledge-based potentials in selecting the native protein structures from decoy sets. Beyond this overall energy-entropy compensation, a similar compensation is seen for individual pairs of interacting amino acids.

The new entropies have immediate applications for 3D structure prediction, protein model assessment and protein engineering and design.

52

3.1. Introduction

Knowledge of a protein’s structure is required to understand its dynamics and function; so, improvements in protein structure prediction, especially template-free methods are essential if the whole protein universe it to be fully comprehended. Computational methods of structure prediction typically yield large numbers of possible structure models (decoys) then requiring challenging discrimination in determining which of these models is most likely to be the native structure. This well-known bottleneck in protein structure prediction suffers from present-day limitations in structure evaluation. Since the folding of a protein into its native structure is dictated by its free energy landscape, the development of accurate free energy functions for native structure evaluation is an area of active research. The free energy

ΔG of a protein structure can be represented as ΔG = ΔV – TΔS, where ΔV and ΔS represent the energetic (enthalpic) and entropic components, respectively and T the temperature. In the conventional folding funnel hypothesis of protein folding, the energies and entropies are captured by the depth and by the width of the well [Brooks et al., (1998)]. Both the energetic and entropic components are combinations of large numbers of contributions and hence the accurate prediction of free energies is limited by the reliable ability to assess all of these contributions.

The energetic contribution to free energy of proteins is usually captured by potential functions, either physics-based or knowledge-based. Physics-based force fields such as

CHARMM, AMBER, GROMOS and OPLS (see details in reference [Mackerell, (2004)]) are derived from various assessments of atomic interactions, often based on data from small molecules and include bond lengths, bond angles, torsion angles, and energies for all possible values. While these functions have been widely used in protein structure refinement [Friesner

53 et al., (2013)] and sampling using molecular dynamics simulations [Karplus and

McCammon, (2002)], they have been shown not to necessarily provide significant gains in structure prediction [Chen and Brooks, (2007)]. On the other hand, knowledge-based or statistical potentials extracted from observed frequencies of amino acid or atomic contacts

(hence the term contact potentials) in sets of known protein structures are based on the assumption that the native structure is located at the global free energy minimum of the landscape and that more favorable contacts will appear in larger numbers in sets of known protein structures. This approach is sometimes called the inverse Boltzmann approach. An extensive summary and comparison of various two-body contact potentials can be found in reference [Pokarowski et al., (2005)]. However, since two-body potentials cannot capture the dense packing and cooperative nature of various protein interactions [Betancourt and

Thirumalai, (1999)]; three-body [Munson and Singh, (1997)] and four-body potentials

[Carter et al., (2001); Krishnamoorthy and Tropsha, (2003)] were developed and demonstrated to improve structure predictions. These potentials have also been tested in multiple Critical Assessment of protein Structure Prediction (CASP) competitions [Moult et al., (2014)] and shown to be superior to most other types of potentials. They have been widely used not only for structure prediction [Hamelryck et al., (2010)] but also for a variety of applications including protein-protein and protein-nucleic acid docking [Feliu et al.,

(2011)], fold recognition [Buchete et al., (2004)] and protein design [Russ and Ranganathan,

(2002)].

We have previously developed a four-body potential [Feng et al., (2007)] which captures not only the sequence dependence but also the differences between solvent exposed and buried regions of proteins. We have also extended this to a non-sequential version [Feng et

54 al., (2010)] which also captures the long-range interactions and their cooperative nature. We have further developed an optimized potential function [Gniewek et al., (2011)] that combines these two long range potentials with previously developed short-range potentials [I.

Bahar et al., (1997b)]; which was shown to perform better than the individual terms at discriminating native structures from decoys.

Despite the remarkable advances in energies, predicting the entropies of protein structures has remained largely elusive and relatively little investigated [Meirovitch, (2007)].

The ability to compute entropies of proteins is crucial not only for protein folding

[Makhatadze and Privalov, (1996)] but also for assessing the energetics of conformational changes involved in ligand binding, regulation and protein-protein interactions [Brady and

Sharp, (1997); Tzeng and Kalodimos, (2012)]. In principle, predicting entropies of protein structures requires estimations of the effects of conformational changes of backbone and side chains, solvent entropy as well as changes in freedom of binding partners (association entropy) [Brady and Sharp, (1997)]. Motions of backbone and side-chain atoms can be estimated from protein NMR relaxation experiments to provide measures of conformational entropies that correlate with protein stability, ligand binding, and cooperativity or even enzyme catalysis [Stone, (2001); Wand, (2013)]. However, most approaches for estimating conformational entropies of proteins have been computational due to difficulties in measuring entropies directly from experiments.

The earliest approaches for conformational entropy estimation used molecular dynamics or Monte-Carlo simulations to sample possible conformations followed by various approaches such as the local states method [Meirovitch et al., (1994)], the quasi-harmonic method [Karplus and Kushick, (1981); Levy et al., (1984)] or its generalizations [Berendsen

55 and Edholmt, (1984); Edholm and Berendsen, (1984)] and hypothetical scanning

[Cheluvaraja and Meirovitch, (2005)]. Wang and Brüschweiler computed estimated entropies based on principal component analysis of the covariance matrix of dihedral angles obtained from an MD trajectory [Li et al., (2007); Wang and Brüschweiler, (2006)]. Other studies focused on evaluating the different conformations accessible to the side chains [Kon Ho Lee et al., (1994)] and comparing the number of side-chain rotameric states accessible in the folded and unfolded states [Creamer and Rose, (1992); Doig and Sternberg, (1995); DuBay and Geissler, (2009); Pickett, and Sternberg, (1993)]. A more detailed review of various methods of entropy estimation can be found in reference [Meirovitch, (2007)]. Recently, we have shown that by combining our optimized potential function with entropy measures obtained from coarse-grained elastic network models (ENMs), an improvement can be achieved especially in discriminating native protein-protein complexes from docked poses, reflecting the largest scale changes in dynamics [Zimmermann et al., (2012), (2011b)]. This is based on previous findings that large scale changes to the dynamics accompany bound states.

In this work, we develop a novel local conformational entropy function for proteins in a manner highly analogous to the way that knowledge-based potentials have been extracted, by employing the inverse Boltzmann relationship. We consider the frequencies of contact changes between residues to provide information about relative entropies of amino acid interactions. There are two points of support for this: (a) Proteins in general have a highly limited repertoire of important motions, one of the basic lessons from extensive uses of the

ENMs [Atilgan et al., (2001); Bahar et al., (2010); I Bahar et al., (1997); Chennubhotla et al.,

(2005); Hinsen et al., (2000); Tirion, (1996)]. It has been established that the small scale

56 fluctuations indicate the direction of changes on the larger scale [Tama and Sanejouand,

(2001); Yang et al., (2007)]. Here we reverse this, by assuming that the contact changes between amino acids during large scale conformational changes can inform about the smaller scale local flexibilities in protein structures. (b) Protein structure predictors have often found that their best predictions are obtained by selecting from the large cluster of predicted structures [Shortle et al., (1998); Zhang and Skolnick, (2004a)], a proxy for entropy in that region of conformational space.

We first show how information on the frequencies of contact changes between pairs of amino acids is used to develop a 20 × 20 contact-change matrix. Then we use these data to calculate local entropies for proteins and then combine them with potential functions to generate a knowledge-based free energy function (KBF) whose terms are optimized to best discriminate native structures of proteins from decoys. We also compare the performance of the newly developed KBF with several two-body potentials as well as some all-atom statistical potentials. These new novel KBFs perform better than these other coarse-grained statistical potentials.

3.2. Results

3.2.1. Patterns of amino acid contact changes during protein conformational transitions

The determination of the 3D structure of the same protein by multiple methods or different groups (under different conditions, in the presence of diverse ligands, etc.) has resulted in large sets of structures for many proteins. Principal component analysis of sets of structures of homologous proteins from the PDB usually reveals remarkable redundancy and separates the structures into only a few distinct clusters, defining the range of available conformations.

57

Such analyses suggest the existence of the limited conformational states for most proteins. In some cases, the conformational changes are local and small (< 2 Å) whereas in other cases it can be as large as changes at the domain or subunit level (~ tens of angstroms).

Understanding these multiple conformations can shed light on the structure-dynamics- function relationships.

Protein conformational changes are accompanied by changes in internal distances between multiple residues. As different parts of the protein move relative to each other, the amino acid contacts at those regions change. Understanding these patterns can be extremely useful for predicting which contacts are likely to break or form between conformations and to predict alternate conformations of the protein. In addition, such information can also provide an idea about the local flexibility (entropy) within different parts of a protein.

In order to investigate whether there are patterns in the nature of contact changes between amino residues in proteins, we have used 167 pairs of alternate conformations of proteins

(Table D.1) from the database of molecular movements (MolMovDB) [Echols et al., (2003);

Flores et al., (2006)]. More details on how amino acid contact-change information (Table

D.2) is obtained are provided in Materials and Methods and Fig. D.1. This gives the fraction of contact changes for every pair of amino acids, summarized in the form of a 20 × 20 fractional contact-change (cc) matrix in Table D.3 as well as in the heatmap in Fig. D.2. In order to account for the differences in frequencies of the individual amino acids within the dataset, we further normalize this matrix as a ratio of observed to expected probabilities

(details in Methods), which gives the normalized cc matrix in Table D.3 and shown as the heatmap in Fig. D.3.

58

Interestingly, the cc matrix separates the amino acids roughly into a hydrophobic and a polar cluster with a few exceptions. Notable outliers include Gly, Pro, His and Cys. As observed in most two-body potential matrices, Gly and Pro tend to cluster with polar amino acids. Cys-Cys contacts constitute a unique type since many of them occur in covalent disulphide links. His is also commonly located in cores of proteins owing to its common role in active sites of enzymes. It is clearly evident that the amino acid contact changes follow a distinct pattern with charged residues showing the largest fraction of contact changes followed by polar residues, with hydrophobic residues showing the smallest fraction of contact changes. This separation becomes even clearer when the cc matrix is recalculated

(from the raw counts) by grouping the amino acids into four classes (Fig. 3.1a and Table

D.4): acidic (D and E), basic (H, K and R), polar (C, N, Q, S, T and Y) and hydrophobic (A,

M, L, G, W, F, P, I and V).

Figure 3.1. Comparison of the entropy matrix and potential matrix for a reduced 4-letter alphabet. (a) Heatmap of the entropy matrix, expressed as fraction of contact changes between pairs of amino acids using a reduced alphabet, where H: hydrophobic, P: polar, A: acidic, B: basic amino acids. Changes in interactions involving polar/charged residues are higher (red=high) than those involving hydrophobic residues (blue=low) (b) Heatmap of the MJ2h potential energy matrix [Miyazawa and Jernigan, (1996)] using a reduced alphabet. Hydrophobic interactions yield lowest energy (red=low) with charged interactions being relatively higher in energies (blue=high). The entropy and energy matrices are complementary in nature.

59

Another striking feature of Fig. 3.1a is the clear distinction between charged residues: same charge interactions (e.g. D-E or K-R) are broken more frequently than opposite charge interactions (e.g. D-R), which is sensible given the attractive nature of the latter. The observation that the smallest numbers of contact changes are observed between hydrophobic residues supports the hypothesis that most of these residues are found in the hydrophobic core of proteins which plays an important role in stabilizing the 3D structure. Polar residues are likely to be found in least buried parts, mostly solvent exposed, increasing their likelihood of being involved in conformational changes and fluctuations.

In order to compare and contrast the entropy matrices and two-body potential matrices, we show the MJ2h two-body contact potential [Miyazawa and Jernigan, (1996)] using a reduced alphabet (Fig. 3.1b and Table D.4) alongside the reduced fractional cc matrix. The reduced version of the MJ2h matrix is obtained by combining the raw counts of each type of interaction from Table 3.2 in reference [Miyazawa and Jernigan, (1996)] according to amino acid categories and re-computing the matrix. The most striking observation is the complementary nature of the two matrices. As can be from Fig. 3.1a, charged and polar interactions contribute most to entropy, with hydrophobic residues contributing least. On the other hand, hydrophobic interactions have the lowest energies, in agreement with their role in stabilizing the core of proteins. This means that energetic stabilization is primarily brought about maximizing the number of hydrophobic contacts whereas entropic stabilization can be obtained by maximizing the number of charged/polar interactions. Since these two forces act simultaneously and cannot be optimized separately, the protein is forced to strike a balance between both these forces, which is manifested in its 3D structure.

60

3.2.2. Correlation between contact-change patterns and nature of amino acids

In order to understand how the cc matrix can be further interpreted at the individual amino acid level, we perform a principal component analysis (PCA) of the cc matrix. PCA is a multivariate statistical analysis technique [Pearson, (1901)] which is often used to understand the most important variations present in the data. Each row of the cc matrix can be considered as a vector of size 20 indicating the pattern of contact changes with each of the 20 amino acids (variables). PCA of the matrix decomposes the original matrix into mutually independent, orthogonal variables (eigenvectors). The eigenvectors are arranged in decreasing order of the eigenvalues (variance captured). The first PC captures 43.8% of variance in the matrix, and the subsequent PCs capture much less variance as shown in Fig.

D.4 (PC2, PC3 and PC4 capture 7.7%, 6.8% and 5.7% of the variance respectively).

Figure 3.2. Correlation between contact-change patterns and amino acid properties. Heatmap shows Pearson’s correlation values between coefficients of each principal component (columns) of the contact-change matrix and quantitative features of each amino acid (rows). The first PC correlates strongly with the average solvent exposure of residues in proteins as well as size (volume) whereas the second PC captures differences in the contact patterns of positively and negatively charged residues.

61

Each eigenvector of the cc matrix represents a fundamental feature of the contact-change pattern and the coefficients of each PC eigenvector for each of the 20 amino acids (shown in

Fig. D.5) measure the contribution of each individual amino acid to the corresponding feature. We compute the correlation between these eigenvectors (fundamental features) and various properties of each of the 20 amino acids (such as a hydrophobicity/polarity index, average solvent exposure of each amino acid in proteins, volume etc.). The Pearson correlation coefficients are summarized in the form of the heatmap shown in Fig. 3.2.

Interestingly, the first PC shows maximum correlations with the size (mass and volume), hydrophobicity/polarity indices and fraction of total surface area of each amino acid that is exposed to solvent. This reinforces our conclusions from Fig. 3.1a that polar amino acids show relatively larger contact changes than hydrophobic amino acids. The second PC shows a high correlation with the charge of amino acids, implying that it highlights the different behaviors of positively and negatively charged amino acids for contact changes.

There is a body of supporting evidence for the results found here. Liu and Bahar [Liu and

Bahar, (2012)] performed an analysis of the mobilities of different amino acids across a diverse set of 34 enzymes and observed a trend for polar/charged and small amino acids to exhibit greater mobilities than hydrophobic/aromatic residues. Similar results were found by

Ruvinsky et al. [Ruvinsky et al., (2012)] who classified residues into three categories based on their mobilities and showed that the mobility profiles of amino acids correlated strongly with their polarity and propensity to promote protein disorder [Dunker et al., (2001)].

Previous works have also shown that while hydrophobic interactions are dominant in protein folding, electrostatic interactions are often the drivers of intermolecular binding [Xu et al.,

(1997); Zhao et al., (2011)] and large scale conformational changes [Harrison et al., (2013)];

62 and present a low energy barrier to hinge-bending transitions [Sinha et al., (2001)]. In addition, Sinha and Smith-Gill have pointed out that electrostatic interactions are often evolutionarily selected to perform specific functions in proteins [Sinha and Smith-Gill,

(2002)], making their consideration important for protein engineering and design [Vizcarra and Mayo, (2005)]. For example, in the case of viral fusion proteins like hemagglutinin A

[Harrison et al., (2013)], a pH change leads to side chain protonation, causing electrostatic repulsion between His-Lys pairs in the pre-fusion stage and attraction between Glu-Glu pairs in the post fusion stage, thereby governing the large-scale conformational change between the two stages. Other examples include membrane channels in which salt-bridges often govern their gating mechanism for ligand transport [Hong et al., (2006)].

3.2.3. Knowledge-based entropy functions

Knowledge-based potential functions have been shown to be extremely useful in discriminating native structures of proteins from decoys at several CASP experiments.

However, one of the drawbacks of these potential functions is that they do not capture the entropies of protein structures. In this work, we show for the first time how the information on amino acid contact changes within a large set of proteins can be used to obtain knowledge-based entropy functions. Since the cc matrix contains information about the propensity of contact changes between every pair of amino acids, it directly provides information about the tendency for internal distance changes between pairs of residues in proteins. In other words, this gives an estimate of the local two-body conformational entropies in proteins. Given the structure of a protein, summing up the cc matrix values for all contacts in the structure (see details in Methods) gives an estimate of the overall local flexibilities of a protein. We refer to this as the knowledge-based entropy (Scc) of a structure.

63

3.2.4. Knowledge-based free energy function (KBF) for protein native structure recognition

The CASP experiments are one of the most popular venues for testing potential functions for proteins. At CASP, in one of the main categories, the participants are required to model the structures of several target proteins (whose structures are not yet known) and the performance of various evaluation schemes and potentials are assessed by multiple metrics.

An optimal free energy function will always be able to correctly identify the native structure of a protein from a set of decoys (models) of the same protein. However, often, the native structures of proteins may not be known. For such cases, an ideal free energy function should provide a quantitative ranking of decoys with better scores for those closer to the native structure (or lower in free energy) than for those which are farther from the native structure.

Similarity to native structure is often assessed in terms of RMSD from native structure or template modeling score (TM-score) [Xu and Zhang, (2010); Zhang and Skolnick, (2004b)] in comparison with the native structure. While RMSD measures dissimilarity between structures, TM-score is a measure of similarity. It is logical to expect free energy predictions for the decoys to be highly positively correlated with RMSD from native structure or negatively correlated with TM-score with native structure. Another metric is the Z-score of the native structure (or the best decoy) which measures the separation of the native structure

(lowest free energy decoy) from the average free energies of decoys. A good free energy function will have a significantly lower free energy for the native (or the best decoy) than for the other decoys.

From the CASP11 experiment, we used a dataset of 105 target proteins and decoys

(submissions from each participant) to train our novel KBF (list of targets is in Table D.5). A

64

KBF is formulated as a linear combination of three different potentials (the four-body sequential potential [Feng et al., (2007)], four-body non-sequential potential [Feng et al.,

(2010)] and a short range potential [I. Bahar et al., (1997b)]) and the cc-based entropy; with weights for each term optimized separately to satisfy each of six performance metrics as discussed in Methods. These include correlation of KBF with RMSD from native structure

(as measured by Pearson’s, Spearman’s and Kendall’s correlation coefficients), rank of native structure amongst decoys as well as RMSD and Z-score of the best decoy. It is to be noted that we have used two versions of the contact matrix to compute the entropy term (the fractional cc matrix in Table D.3 and the normalized cc matrix in Table D.3) and accordingly there are actually 12 different KBF functions generated. The optimal weights of the coefficients for each of the 12 KBF functions are shown in Table D.6. An interesting feature of almost all of the KBFs is that the weight of the entropy term is much higher than any of the other terms, implying that the contact-change based entropy is the most important term in our free energy formulation. This also underlines the importance of efficiently predicting entropies for the development of free energy functions for proteins.

Table D.7 shows the performance of the new KBFs and provides comparisons with 23 two-body potentials and three all-atom statistical potential functions on an independent previously published test dataset [Rykunov and Fiser, (2010)]. This dataset contains representative decoys extracted by structural clustering from the set of all the submissions for

143 selected protein targets from CASP5-8. The performance of each method is assessed using 12 different criteria (averaged over 143 targets) as explained in Methods. For the sake of simplicity, we show a portion of Table D.7 in Table 3.1.

65

Table 3.1. Performance of the novel optimized KBFs in comparison with other statistical potential functions in terms of three metrics. Each metric is averaged across 143 protein targets from CASP5-8 [Rykunov and Fiser, (2010)] and the functions are arranged in decreasing order of number of natives identified as lowest energy structure. The novel KBFs recognize the largest number of natives from among all the coarse-grained knowledge-based functions and compare favorably with all-atom knowledge-based potentials; and outperform the all-atom potentials in terms of correlation with RMSD from native and Z-score of native. No. of Natives (out of Spearman’s Corr. with Function Z-score of native 143) RMSD from native KBFa 97 0.63 -1.92 4BOPTb 72 0.51 -1.46 BT 58 0.44 -1.49 Qp 57 0.40 -1.32 BFKV 57 0.46 -1.52 TD 57 0.40 -1.30 SKJG 55 0.41 -1.37 fourBodyc 53 0.48 -1.30 genFourd 53 0.52 -1.25 VD 53 0.43 -1.37 MJ3h 51 0.43 -1.34 HLPL 50 0.39 -1.28 SKOa 49 0.38 -1.27 SKOb 48 0.42 -1.45 4BOPT_GNMe 46 0.33 -1.06 Tel 46 0.42 -1.36 TEs 46 0.43 -1.39 MJ3 46 0.41 -1.36 MJPL 45 0.29 -1.02 MJ2h 45 0.31 -1.05 Qm 42 0.36 -1.21 MS 42 0.41 -1.34 shortRangef 39 0.29 -1.05 GKS 38 0.35 -1.18 Qa 37 0.36 -1.16 TS 36 0.28 -1.00 MSBM 17 0.00 -0.11 RO 7 0.18 -0.36 MJ1 0 -0.20 0.76 calRW g 110 0.61 -1.69 calRWplus g 106 0.61 -1.69 dDFIRE h 100 0.58 -1.41 aNovel KBF introduced in this paper (values indicate best out of 12 computed KBFs). bOptimized potential function from [Gniewek et al., (2011)]. cSequential four-body potential function developed in [Feng et al., (2007)]. dNon-sequential four-body potential function introduced in [Feng et al., (2010)]. eFree energy function combining optimized potential with entropies from elastic network models introduced in [Zimmermann et al., (2012)]. fShort range potentials introduced in [I. Bahar et al., (1997b)]. gSide-chain orientation dependent all-atom statistical potentials introduced in [Zhang and Zhang, (2010)]. hAll-atom statistical potential introduced in [Yang and Zhou, (2008)]. All other abbreviated potentials are as described in [Pokarowski et al., (2005)]. Full table with performance on all 12 evaluation criteria can be found in Table D.7.

66

At least one of our novel KBF functions outperforms all other compared potential functions in 9 out of 12 performance metrics. The KBF function (using fractional cc matrix based entropy term) optimized to maximize Pearson’s correlation with RMSD from native structure yields the highest average Pearson correlation (0.59), Spearman’s rank correlation

(0.63) and Kendall’s rank correlation (0.48) coefficients with RMSD from native structure as well as the best Spearman’s rank correlation with TM-score with native structure (-0.65) for the entire test dataset. In addition, the two KBF functions (using normalized cc matrix based entropy term) optimized to maximize Pearson’s and Kendall’s correlation coefficients with

RMSD from native structure yield the best average Pearson correlation (-0.63) and Kendall’s rank correlation (-0.52) with TM-score in comparison to native structure over the entire test dataset (see Table D.7 for full table).

Most importantly, our novel method correctly identifies the highest number of native structures of any of the coarse-grained statistical potential functions. The KBF function optimized to maximize Spearman’s rank correlation coefficient with RMSD from native structure (using the fractional cc matrix based entropy term) assigned the lowest free energy to the native structure for 97 out of 143 (~68%) cases, which is quite similar to the performance achieved by the three all-atom statistical potentials (100 for dDFIRE [Yang and

Zhou, (2008)], 103 for RW [Zhang and Zhang, (2010)] and 110 for RWPlus [Zhang and

Zhang, (2010)]). In addition, our KBF function achieves the highest average Z-score of native (-1.92) from among all the functions, including the three all-atom potentials.

Interestingly the three metrics in which our KBF functions did not perform best are the

RMSD, Z-score and TM-score of best decoy. All three of these are criteria that evaluate the power of the functions in identifying the decoy structure which is closest to the native

67 structure or how well the function separates the lowest energy structure from the mean energy of all decoys (Z-score). Among the coarse-grained functions, the two-body potential

BFKV [Bastolla et al., (2001)] performed best in terms of average RMSD of best decoy

(5.51) while BT [Betancourt and Thirumalai, (1999)] performed best in terms of average

TM-score of best decoy (0.75) and HLPL [Park and Levitt, (1996)] gave the highest average

Z-score of best decoy (2.00). Even though the performance of the three all-atom potentials is better than the two-body potentials in terms of RMSD or TM-score of best decoy, our KBF functions give better Z-score for the best decoy (1.6) than do any of the all-atom potentials

(1.4).

3.3. Discussion

In this work, we have utilized the availability of pairs of structures of the same protein to identify patterns of contact changes between amino acids during the known conformational changes, i.e., from two snapshots as representatives of the fluctuations of the structures. Our results show that contact changes between amino acids are not random: contact changes between hydrophobic amino acids are rarely broken, and this is consistent with the well- known role of hydrophobic interactions in stabilizing protein structures. Contacts between polar amino acids are broken more frequently, with the largest number of such breaks occurring between amino acids with the same charge. The observed pattern correlates strongly with the hydrophobicity index and sizes of amino acids, as well as the average solvent exposure of amino acids in proteins.

Analogous to parameterizing knowledge-based potentials using observed frequencies of residue contact pairs from a database of known protein structures, we develop for the first time knowledge-based two-body entropy functions based on the observed frequencies of

68 changed contact pairs in a set of 167 protein structure pairs. By combining such entropy measures with previously extracted multi-body long-range and short-range potentials, we generate a new set of free energy functions which have been optimized to select native structures of proteins from decoys. We also rigorously test our new free energy functions on an independent test dataset and compare its performance with 23 other knowledge-based coarse-grained two-body potentials as well as three all-atom knowledge-based potentials.

Our results show that the new KBF functions introduced in this paper provide significant gains in terms of multiple performance metrics and are the best available coarse-grained free energy functions.

Another interesting result is that the entropy term receives the largest weight (Table D.6) in the optimized KBF, indicating the high significance of entropic contributions for protein structure evaluation. Improved performance relative to all other potential functions is a strong indicator that entropy plays a significant role in the conformational transitions for many proteins.

Extensive research on two-body potential functions has shown that hydrophobic interactions have the most stabilizing (lower potential energy) effect and play a role in maintaining the core of proteins. Polar interactions contribute less. Here, we have learned that polar interactions contribute most to entropies. Thus we hypothesize that protein stability is dictated by a balance between energetic contribution primarily from hydrophobic interactions and entropic contributions primarily from polar and charged interactions. These results have important implications for molecular modeling of proteins, indicating that polar interactions should be receiving less attention for their energetic stabilizing effects and more attention for their entropic behaviors.

69

While application of these new free energies requires the use of a structure, this might possibly change because of recent advances in predicting contact pairs from sequence correlations [de Juan et al., (2013)]. Improved next generation sequencing technologies are leading to the rapid growth in protein sequence data so that these sequence-based predictions of contacts may eventually overwhelm structure determination studies.

3.4. Materials and Methods

3.4.1. Datasets

To obtain the pattern of amino acid contact changes, we have used 167 protein structure pairs from our previously compiled dataset [Yang et al., (2007)] from MolMovDB [Echols et al.,

(2003)]. Each pair consists of two alternate structures of the same protein from the PDB.

Three structures were removed from our previous dataset of 170 proteins because they had only resolved C-alpha atom positions. A list of the PDB IDs for each of the 167 structure pairs used in this analysis is provided in Table D.1.

For training and optimizing the weights of the potential and entropy terms for our new

KBF, we have used the set of the native and decoy structures (submissions from participants) for 105 CASP11 targets (see Table D.5). All the target native structures and decoy structures were obtained from http://predictioncenter.org/download_area/CASP11/. This set contains both template-based modeling targets and free modeling targets.

For testing our new free energy functions and comparing their performance with other potential functions, we use an independent dataset compiled by Rykunov and Fiser [Rykunov and Fiser, (2010)] consisting of 143 proteins from CASP5-8 modeling targets and their decoys. A cleaned and processed version of the same dataset was used by Zhang et al.

70

[Zhang and Zhang, (2010)] for testing their RW potentials and was obtained from http://zhanglab.ccmb.med.umich.edu/RW/. All the potentials compared in this study have been evaluated with this updated dataset.

3.4.2. Evaluation of normalized amino acid contact changes

The database of 167 pairs of structures was used for extracting the frequency of amino acid contact changes. First, the sequences corresponding to both structures (those for which 3D position coordinate information was present in the PDB files) were aligned using

CLUSTALW algorithm [Bateman, (2007)] with default parameters. Following this, the residues common to both structures were extracted from regions in the sequence alignment without any gaps. All subsequent calculations were performed on the resulting subset of common residues in each pair.

For the sake of convenience, let us refer to one of the structure forms as open (O) and the other as closed (C). For each form, the residue level contact maps (푴퐶 and 푴푂) were obtained as binary adjacency matrices:

1 푖푓 푘 contacts 푙 푀퐹 = { , 퐹 ∈ {퐶, 푂} (3.1) 푘푙 0 otherwise using the criteria that two residues 푘 and 푙 are considered as contacting if any heavy atom of residue 푘 is within 4.5 Å of any heavy atom of the residue 푙. Only contacts between residues that are separated by at least three residues in sequence are considered. Let 푏푖푗 be the number of interactions between amino acid types 푖 and 푗 in form C but not in O, 푓푖푗 the number of interactions between amino acid 푖 and 푗 present in O but not in C and 푢푖푗 be the number of interactions between 푖 and 푗 in both O and C. The probability of contact changes between

71 amino acid 푖 and amino acid 푗 is summarized in the 20 × 20 fractional contact change (풄풄) matrix, whose elements can be obtained as:

푓푟푎푐 ∑푎푙푙 푝푟표푡푒푖푛푠(푏푖푗+푓푖푗) 푐푐푖푗 = , 1 ≤ 푖, 푗 ≤ 20 (3.2) ∑푎푙푙 푝푟표푡푒푖푛푠(푏푖푗+푓푖푗+2푢푖푗)

In order to account for the frequency of occurrence of the individual amino acid types within the dataset, we also calculate the normalized contact change matrices as a ratio of observed to expected probabilities as follows:

푛푐 푖푗⁄ ( 푁푐) 푐 푛푖 푛푗 푝 푝 푝 ( ⁄푛)( ⁄푛) 푐푐푛표푟푚 = 푖푗 푖 푗 = , 1 ≤ 푖, 푗 ≤ 20 (3.4) 푖푗 푝푐푝푐 푝 푛푖푗 푖 푗 푖푗 ( ⁄푁) 푐 푛푐 푛푖⁄ 푗⁄ 푁푐 ( 푁푐)

( )

where 푛 is the total number of residues analyzed, 푛푖 the total number of amino acids of type 푖

푐 in the dataset, 푛푖 the total number of contact changes involving amino acid of type 푖, 푛 the total number of amino acids in the dataset, 푁 the total number of amino acid contacts in the

푐 dataset, 푁 the total number of amino acid contact changes in the dataset, 푛푖푗 = 푏푖푗 + 푓푖푗 +

푐 2푢푖푗 the total number of amino acid contacts between types 푖 and 푗 and 푛푖푗 = 푏푖푗 + 푓푖푗 the

푐 푐 total number of amino acid contact changes between types 푖 and 푗 and 푝푖, 푝푖 , 푝푖푗 and 푝푖푗 indicate the corresponding probabilities.

3.4.3. Evaluation of local entropies based on contact changes

The 풄풄 matrix can be directly used to estimate the tendency of contacting amino acid pairs to break, which can be used as a measure of entropy by summing up the corresponding values for each contact present in the protein. We estimate the knowledge-based two-body entropies

72

푆푐푐(푷) for a protein with sequence 푷 = 푃1푃2 … 푃푖 … 푃푁 with 푁 residues and contact map

푴푃as follows:

( ) ∑푁 ∑푖−4 푃 푆푐푐 푷 = 푖=1 푗=1{푐푐푃푖푃푗 ∗ 푀푖푗} (3.5)

3.4.4. Optimization of weights for potential and entropy terms

We combine our novel 푆푐푐 entropy function with our multi-body long range and short range potential functions to generate optimized KBFs. Using an approach similar to the one adopted to generate our previous optimized potential [Gniewek et al., (2011)], we formulate

KBFs as a linear combination of three potential functions and one entropy function:

퐾퐵퐹 = 푤푠4푏 ∗ 푉푠4푏 + 푤푛푠4푏 ∗ 푉푛푠4푏 + 푤푠푟 ∗ 푉푠푟 − 푤푐푐 ∗ 푆푐푐 (3.6)

where 푤푠4푏, 푤푛푠4푏, 푤푠푟 and 푤푐푐 are the weights for the 4-body sequential potential [Feng et al., (2007)], 4-body non-sequential potential [Feng et al., (2010)], short range potential [I.

Bahar et al., (1997b)] and the contact-change based entropy terms respectively. The weights are optimized in a manner similar to our previous work [Gniewek et al., (2011)] using particle-swarm optimization (PSO) by fixing 푤4푏 as 1 and varying all the other terms between 0 and 20 for six objective functions describing performance in discriminating native protein structures from decoys: (a) maximize Pearson’s correlation with RMSD from native structure, (b) maximize Spearman’s rank correlation with RMSD from native structure, (c) maximize Kendall’s rank correlation coefficient with RMSD from native structure, (d) minimize RMSD of best decoy with native, (e) maximize Z-score of best decoy and (f) minimize rank of native structure. The optimization was performed using two different variations of the entropy term calculated using the fractional and normalized cc matrices, resulting in 12 slightly different KBFs. All the 12 functions are optimized for best

73 performance using a training set of decoys from 105 selected target proteins from the

CASP11 competition. The three measures of correlation (Pearson’s correlation coefficient,

Spearman’s rank correlation coefficient and Kendall’s rank correlation coefficient) are calculated using the formulae provided in our previous work [Gniewek et al., (2011)]. The

PSO procedure was performed using the ‘particleswarm’ function implemented in MATLAB version R2015a.

3.4.5. Comparison of performance measures

The performance of the new KBFs is compared with that of the 23 other two-body statistical potentials as well as three other atomic knowledge-based potentials (dDFIRE [Yang and

Zhou, (2008)], RW and RWPlus [Zhang and Zhang, (2010)]) using 12 conventional metrics: correlation between free energy measure and RMSD from native structure as measured by (a)

Pearson’s rank correlation coefficient (b) Spearman's rank correlation coefficient and (c)

Kendall’s rank correlation coefficient (d) RMSD of best ranked decoy from the native structure (e) TM-score of best decoy with native structure (f) Z-score of best decoy (g) rank of native structure; correlation between free energy measure and TM-score with native structure as measured by (h) Pearson’s rank correlation co-efficient, (i) Spearman's rank correlation co-efficient and (j) Kendall’s rank correlation co-efficient, (k) Z-score of native and (l) number of natives ranked first.

3.5. Author Contributions

RJ and KS designed research; KS performed research with help from KJ; KS, KJ and RJ analyzed and interpreted data; and KS and RJ wrote the paper.

74

3.6. Acknowledgements

This research was supported by NIH grant R01-GM72014 and NSF grant MCB-1021785, as well as funds from the Carver Trust awarded to the Roy J. Carver Department of

Biochemistry, Biophysics and Molecular Biology. KS was also supported by fellowship funds from the Office of Biotechnology, Iowa State University.

3.7. Supplementary Material

Supplementary Information is available in Appendix D.

75

CHAPTER 4. MOLECULAR DETERMINANTS OF CADHERIN IDEAL BOND

FORMATION: CONFORMATION DEPENDENT UNBINDING ON A

MULTIDIMENSIONAL LANDSCAPE

Paper originally published in (2016) Proc. Natl. Acad. Sci. 113(39): E5711-5720.

Kristine Manibog, Kannan Sankar, Sunae Kim, Yunxiang Zhang, Robert L. Jernigan,

Sanjeevi Sivasankar

Abstract

Classical cadherin cell-cell adhesion proteins are essential for the formation and maintenance of tissue structures; their primary function is to physically couple neighboring cells and withstand mechanical force. Cadherins from opposing cells, bind in two distinct trans conformations: strand-swap dimers and X-dimers. As cadherins convert between these conformations, they form ideal bonds – i.e. adhesive interactions that are insensitive to force.

However, the biophysical mechanism for ideal bond formation is unknown. Here, we integrate single molecule force measurements with coarse-grained and atomistic simulations, to resolve the mechanistic basis for cadherin ideal bond formation. Using simulations, we predict the energy landscape for cadherin adhesion, the transition pathways for interconversion between X-dimers and strand-swap dimers and the cadherin structures that form ideal bonds. Based on these predictions, we engineer cadherin mutants that promote or inhibit ideal bond formation and measure their force dependent kinetics using single molecule force clamp measurements with an atomic force microscope. Our data establishes that cadherins adopt an intermediate conformation as they shuttle between X-dimers and

76 strand-swap dimers; pulling on this conformation induces a torsional motion perpendicular to the pulling direction that unbinds the proteins and forms force independent ideal bonds. This torsional motion is blocked when cadherins associate laterally in a cis orientation, suggesting that ideal bonds may play a role in mechanically regulating cadherin clustering on cell surfaces.

4.1. Introduction

The formation and maintenance of multicellular structures rely upon specific and robust intercellular adhesion [Gumbiner, (2005)]. Cell-cell adhesion proteins, such as classical cadherins, are crucial in these processes [Halbleib and Nelson, (2006); Harris and Tepass,

(2010); Niessen et al., (2011)]. Cadherins are Ca2+ dependent transmembrane proteins that mediate the integrity of all soft tissues. One of their principal functions is to bind cells and dampen mechanical perturbations; however, the biophysical mechanisms by which cadherins tune adhesion is not understood. Here, we combine predictive simulations with quantitative single molecule force clamp measurements to show that E-cadherin (Ecad), a prototypical classical-cadherin, dampens the effect of tugging forces, by switching conformations and unbinding along a strongly preferred pathway on a multidimensional landscape.

Cadherin adhesive function resides in their ectodomain that is comprised of five extracellular (EC) domains arranged in tandem [Brasch et al., (2012); Leckband and

Sivasankar, (2012a), (2012b); Pokutta and Weis, (2007); Shapiro and Weis, (2009)].

Structural studies [Boggon et al., (2002); Ciatto et al., (2010); Harrison et al., (2010); Nagar et al., (1996); Pertz et al., (1999); Shapiro et al., (1995)] and single molecule fluorescence measurements [Sivasankar et al., (2009)], show that opposing ectodomains bind in two distinct trans conformations: strand-swap dimers (S-dimers) and X-dimers. S-dimers, which

77 have a higher binding affinity, are formed by the exchange of a conserved tryptophan at position 2 (W2) between binding partners [Boggon et al., (2002); Häussinger et al., (2004);

Hong et al., (2011); Parisini et al., (2007); Shapiro et al., (1995)]. In contrast, low-affinity X- dimers are formed by extensive surface interactions around the linker region that connects the two outermost domains (EC1-EC2) [Ciatto et al., (2010); Harrison et al., (2010);

Häussinger et al., (2004); Nagar et al., (1996); Sivasankar et al., (2009)]. In the absence of force, X-dimers are believed to serve as a kinetic intermediate in S-dimer formation

[Harrison et al., (2010); Li et al., (2013); Sivasankar et al., (2009)] and dissociation [Hong et al., (2011)]. Nuclear magnetic resonance (NMR) studies show that trans dimer formation is a slow process that occurs over a ~1 s timescale [Li et al., (2013)]; this slow dimerization process presumably enables cadherins to sample metastable, intermediate conformations along their reaction coordinate. Furthermore, double electron-electron resonance (DEER) experiments [Vendome et al., (2014)] showed that classical cadherins exist in an equilibrium ensemble of multiple conformational states and that the dynamic exchange between conformations has important effects on cell-cell adhesion. However, the pathway by which cadherins interconvert between X-dimers and S-dimers and its effect on adhesion are not well understood.

Using single molecule force clamp experiments with an atomic force microscope (AFM) coupled with atomistic computer simulations, we recently showed that X-dimers and S- dimers have distinctly different mechanical properties [Manibog et al., (2014); Rakshit et al.,

(2012)]. While cadherin mutants trapped in an X-dimer conformation formed catch-slip bonds, i.e. molecular interactions that become longer-lived when pulled, mutants trapped in an S-dimer conformation formed slip bonds with lifetimes that decrease upon application of

78 increasing tensile force. In contrast, Wild Type (WT) cadherins formed interactions that were insensitive to pulling force, i.e. ideal bonds [Rakshit et al., (2012)]. We hypothesized that

WT-cadherins formed ideal bonds because they adopt an intermediate conformation which unbinds along a coordinate that is perpendicular to the pulling force; this intermediate conformation is formed when WT-cadherins transition between X-dimers and S-dimers.

However, the structure of this intermediate conformation and the mechanistic consequences of its formation and rupture are unknown. Here, we combine atomistic and coarse-grained simulations with quantitative single molecule force clamp measurements to resolve these questions.

We first build the energy landscape for Ecad trans dimers and investigate the transition pathway for the interconversion between X-dimers and S-dimers by evaluating the principal motions [Teodoro et al., (2003), (2002)] of cadherin experimental structures using Principal

Component Analysis (PCA). Construction of the multidimensional energy landscape of cadherins sheds light on the different conformations of Ecad dimers and the steric constraints that govern their interconversion. Next, with all-atom molecular dynamics (MD) simulations and PCA, we examine the conformational evolution of different cadherin structures and identify key elements crucial to their interconversion. By using steered molecular dynamics

(SMD) simulations, we investigate the force-induced unbinding dynamics of different cadherin structures and predict the conditions under which ideal bonds are observed. Finally, we experimentally test these predictions by generating mutants that either inhibit ideal bond formation by trapping Ecad in an S-dimer conformation (K14E) or promote ideal bond formation by enhancing the sampling of the intermediate state (W2F). We use single

79 molecule AFM force clamp measurements to measure the force-dependent lifetimes of WT

Ecad, the K14E mutant and the W2F mutant.

Our results show that as Ecads shuttle between X-dimer and S-dimer conformations, they sample an intermediate state; encouragingly, both coarse-grained and all-atom simulations suggest similar transition pathways between S-dimers and X-dimers. When this intermediate conformation is pulled, it undergoes a torsional motion perpendicular to the pulling direction, which results in unbinding that is insensitive to force. Since this torsional motion is blocked when cadherins associate laterally on the cell surface in a cis orientation, our studies suggest that only trans dimers, that are not incorporated into cis clusters, can form ideal bonds. Given the exceedingly short lifetimes of ideal bonds, it is likely that mechanical force preferentially ruptures stand-alone trans dimers and enhances the concentration of cadherin clusters in intercellular junctions.

4.2. Results

4.2.1. Cadherin energy landscape and transition pathway for the interconversion between X-dimers and S-dimers

In order to determine the classical cadherin energy landscape, we used PCA to evaluate the most important coordinate variations [Teodoro et al., (2003), (2002)] from a set of eleven classical cadherin dimeric crystal structures that include both X-dimers and S-dimers. The entire structural variation present in the set could be summarized in only a few important motions or principal components (PCs), which provided the directions of fluctuation of each residue along the x, y and z coordinates. Several studies have shown that such variations extracted from sets of experimental structures closely correspond to motions sampled from a single structure using MD simulations or coarse-grained elastic network models [Yang et al.,

80

(2008); L.-W. Yang et al., (2009)], and usually correspond to functionally important conformational changes of the protein [Hub and Groot, (2009)]. While the first principal motion (represented by PC1) corresponded to a scissoring motion between the interacting protomers, the second principal motion (represented by PC2) corresponded to a twisting motion of the two monomers with respect to the each other along a plane perpendicular to

PC1 (Fig. 4.1). These two principal motions, PC1 and PC2, captured 94% and 5% of the total variation in the dataset, suggesting that these two motions are highly effective in describing all of the motions present within the set of classical-cadherin crystal structures.

Since these are the most important directions of motion, the two PCs represent a convenient and efficient set of coordinates upon which to construct energy landscapes

[Haliloglu and Bahar, (2015); G. G. Maisuradze et al., (2009)]. Analogous to the x, y, and z directions, the PC directions are orthogonal and can be considered to form a multidimensional landscape, and the structures can be projected onto this space. The energy landscape of cadherin along the PC1 and PC2 coordinates was constructed as described in

Methods (Fig. 4.1a). Structures were sampled along PC1 and PC2 by deforming a representative structure (by linearly extrapolating its 3D coordinates) along each PC direction as described in [Sankar et al., (2015)]. Then, the energy of each structure on this landscape was estimated using coarse-grained knowledge-based potentials. We employed coarse- grained models in this case, because the conformational changes in cadherins involve remarkably large motions between the protomers (changes of order of nm). Previous works have shown that, for large conformational changes, physics based all-atom fields are not as good at discriminating native structures of proteins from decoy models as are knowledge- based potential functions based on frequencies of contacts between different types of amino

81 acids, counted in a large non-redundant set of (> 500) known crystal structures of diverse proteins. [Bradley et al., (2005); Skolnick, (2006); Zhang et al., (2005)]. We specifically used an optimized potential function combining long range four-body contacts which are appropriate for densely packed proteins, together with short-range terms, which have been shown to perform well in discriminating between native and non-native modeled structures

[Gniewek et al., (2011)]. Following this, the energies of structures on the entire landscape were visualized in the form of an energy contour plot and the experimental structures are shown as solid red circles (Fig. 4.1a). It is worth noting that the 11 experimental structures were observed to lie in low energy regions of the energy landscape; structures along low values of PC1 corresponded to X-dimers; whereas the structures along high values of PC1 corresponded to S-dimers. Two clusters of S-dimers were observed, corresponding to low and high values of PC2, indicating structures that differ in their twist angles (Fig. 4.1a).

Without sufficient twist of the monomers relative to one another (along PC2), the scissoring motion (along PC1) caused the EC domains to overlap with each other. Consequently, this resulted in a high energy region in the center of the landscape where no crystal structures were observed (Fig. 4.1a). One point to note is that the energy landscape does not reflect the free energy of dimer formation from monomers, but simply captures the relative energies of various dimeric conformations.

In order to understand how the intrinsic fluctuations of the structures drive the conformational change between X-dimers and S-dimers, we derived the transition paths between W2A mutant and WT forms as well as between the WT and K14E mutant forms by using a coarse-grained elastic network model based interpolation, performed by using the

PATH-ENM server [Zheng et al., (2007)]. The transition intermediates between the two

82 conformations were projected onto the PC1-PC2 space as shown in Fig. 4.1a. Interestingly, the transition path between the W2A mutant (PDB: 1EDH) exhibiting an X-dimer structure and WT S-dimer crystal structure (PDB: 2QVF) conformation passed close to the crystal structures 3Q2L and 3Q2N (V81D and L175D cis dimer mutants forming S-dimers). The conformational change between the WT and the K14E mutant (PDB: 3LNI) primarily involved motion along PC2, reiterating the fact that the K14E mutant adopted a slightly different twist angle.

83

Figure 4.1. Energy landscape and transition state pathway of E-cadherin dimers. (a) Two dimensional energy landscape of Ecad constructed along PC1 and PC2 extracted from a set of 11 crystal structures (solid red circles) are shown as a contour plot in VIBGYOR color format (violet = low energies and red = high energies). X-dimer structures are located at low values of PC1 (left), while S-dimers are at high values of PC1 (right). The intermediate structures along the transition paths between different forms as obtained by using the PATH-ENM server are shown as open black circles. (b) Visualization of the transition between X-dimers and S-dimers show the computed intermediates along the transition path between the two forms. The X-dimer first twists along PC2 to give rise to an intermediate, which moves along PC1 to give rise to the wild-type, which then twists back in the PC2 direction to yield the S-dimer K14E structure.

The transition path between the X-dimer and the S-dimer visualized on these structures is shown in Fig. 4.1b. Starting from the X-dimer (PDB: 1FF5), the EC1-2 domains twisted away from one another along PC2 to form an intermediate shown in the second position in

Fig. 4.1b, before reaching the third position, by moving along PC1 corresponding to the experimental structure (PDB: 3Q2L). A further sliding along PC1 (increasing the angle between the two molecules) formed the wild type conformation (PDB: 2QVF). This structure then twisted back along PC2 to reach the K14E S-dimer (PDB: 3LNE). The X-dimer could not slide directly along the PC1 direction because the EC1-EC2 domains clashed on such a path.

4.2.2. Wild Type and conformational-shuttling mutants form a metastable, intermediate dimer state

Next, we used MD simulations to investigate the molecular determinants of cadherin conformational interconversion. Since only the EC1-EC2 domains are involved in trans binding, we used four EC1-EC2 trans dimer crystal structures in our simulations: (i) wildtype

(WT) cadherin (PDB: 2QVF), (ii) mutant (W2A; PDB: 3LNH) that replaces the swapped Trp with an Ala and traps cadherin in an X-dimer conformation [Harrison et al., (2010)], (iii) mutant (K14E; PDB: 3LNE) that eliminates a key salt-bridge in the X-dimer interface and traps cadherin in an S-dimer structure [Harrison et al., (2010)] and (iv) W2F,

84

‘conformational-shuttling’ mutant (PDB: 4NUQ) that relieves the strain in the swapped N- terminal β-strand and forms weak binding affinity S-dimers [Vendome et al., (2011)]; previous DEER experiments showed that this W2F mutant construct populates an equilibrium ensemble of X-dimers and S-dimers [Rakshit et al., (2012)]. All structures used in these simulations were from Ecad except the W2F mutant which was from N-cadherin

(Ncad), a closely related type I classical cadherin that shares a high sequence identity (~77%) of its N-terminal -strand with Ecad (~53% sequence identity over the whole EC1-EC2 dimer structure), with conserved amino acids forming the swapping interface. Since cadherin dimerization is a rather slow process that is completed in ~ 1s, we performed long-duration

150 ns MD simulations (Fig. 4.2, Supplementary Videos 1-4, SI Text) to capture conformational interconversion in these structures. All MD simulations executed on W2A,

W2F, WT and K14E dimer structures (Figures 4.2a, 4.2d, 4.2g, 4.2j) were performed using

GROMACS with a GROMOS 53a6 all-atom force field [Oostenbrink et al., (2004); Van Der

Spoel et al., (2005)] with 2-fs integration time steps.

Although the design rules for cadherin strand swapping have been extensively studied

[Vendome et al., (2011)], the process by which dimers interconvert between different structural conformations have not been investigated. Since X-dimers are believed to serve as the first intermediate along the pathway to strand-swapping [Harrison et al., (2010); Li et al.,

(2013); Sivasankar et al., (2009)], we assessed how the dimer structures evolved over time by measuring the root mean square displacement (RMSD) of every simulation frame relative to both the X-dimer crystal structure and to the structure at the start of the simulation (Figures

4.2b, 4.2e, 4.2h, 4.2k). We also analyzed the principal motions of the dimer structures by

85 projecting the MD trajectories onto the PC1-PC2 space of cadherin dimer crystal structures

(Figures 4.2c, 4.2f, 4.2i, 4.2l).

Figure 4.2. Wild Type cadherin and conformational-shuttling mutant form an intermediate dimer state. Formation of the intermediate-state was investigated using 150 ns MD simulations on

86 four EC1-EC2 trans dimer crystal structures: W2A mutant Ecad, W2F Ncad mutant, WT Ecad and K14E Ecad mutant. Conformational changes in the dimer structures were monitored by measuring the root mean square displacement (RMSD) of each simulation frame relative to both the X-dimer crystal structure and the structure at the start of the simulation. PCA was used to calculate the principal motions by projecting structures along the MD trajectories onto the PC1-PC2 space obtained from dimer crystal structures (PC1 and PC2 are parallel and perpendicular to the plane of the page). Snapshots at 0 ns and 150 ns serve as visual guides for the change in the dimer structures. (a) Snapshots of W2A structures at 0 ns and 150 ns are similar showing that they remain in an X-dimer conformation (b) with a relatively small RMSD of 0.38 nm relative to its initial structure. (c) The observed W2A PC scores are tightly clustered near the initial structure with a standard deviation of 1.5 nm and 3.9 nm, along PC1 and PC2 directions, respectively. In contrast, (d) snapshots of the W2F mutant at 0 ns and 150 ns show that the W2F S-dimer structure shuttles towards an X-dimer conformation. This is evident from (e) its RMSD relative to its initial structure which increases to a stable average value of 1.84 nm while the RMSD relative to the W2A crystal structure decreases to an average value of 1.06 nm. Furthermore, (f) PC scores approach the X-dimer structure with a significant PC1 standard deviation of 7.7 nm. Similarly, (g) WT-Ecad snapshots at 0 ns and 150 ns reveal that the initial S-dimer structure converts to an X-dimer-like conformation (h) with measured RMSDs of 1.74 nm and 1.06 nm, relative to its initial structure and W2A crystal structure respectively. (i) PCA on the MD trajectories also shows that the WT structure evolves to an X-dimer like conformation. (j) Snapshots of the K14E mutant at 0 ns and 150 ns clearly shows that K14E dimers remain in a strand-swap conformation exhibiting a (k) relatively low and high RMSD values relative to its initial and W2A crystal structures, respectively. (l) In the PC1-PC2 space, its trajectories cluster around the strand-swap conformational structures. Our data showed that W2A mutant X-dimers stabilized rapidly and maintained a relatively low RMSD of 0.38 nm while remaining in their original conformation (Figures

4.2a-b). Moreover, this mutant was able to sample only a small space along PC1-PC2, with standard deviations of 1.8 nm and 4.4 nm, respectively (Fig. 4.2c), which suggested strong resistance to conformational change.

In contrast, both the W2F and WT dimer structures appeared to convert from an S-dimer structure at the start of the simulation to an X-dimer-like conformation, albeit with the swapped amino acids still within their complementary hydrophobic pockets (Figures 4.2d-i).

The RMSD of W2F and WT dimers relative to the W2A crystal structure decreased from initial values of 2.1 nm and 2.4 nm respectively, to 1.1 nm upon equilibration.

Simultaneously, the RMSDs of W2F and WT dimers relative to their initial structures increased to 1.8 nm and 1.7 nm respectively (Figures 4.2e, 4.2h). Furthermore, the PCs of both the W2F mutant and WT-Ecad approached the low PC1 values of the X-dimer, with

87 measured PC1 standard deviations of 7.7 nm and 10.3 nm respectively, and PC2 standard deviations of 6.9 nm and 5.5 nm, respectively (Figures 4.2f, 4.2i; Table E.1). Finally, while the low-affinity, W2F mutant reached an X-dimer-like conformation within ~30 ns, the higher-affinity WT-Ecad took ~75 ns to adopt a similar structure (Figures 4.2e, 4.2h). The more rapid equilibration of W2F mutant dimers and comparable standard deviations of the

W2F mutant both along PC1 and PC2 directions is consistent with Phe2 being smaller than

Trp2, which provides a weaker constraint in the PC1-PC2 space, allowing the W2F mutant to shuttle more readily between S-dimer and X-dimer and adopt an X-dimer-like structure

[Vendome et al., (2011)] (Fig. 4.2d). Nonetheless, although we performed 150 ns MD simulations for both WT and W2F cadherin dimers, these structures (Figures 4.2d, 4.2g) never reached the X-dimer conformation (Fig. 4.2a). This is likely due to the huge disparity in timescales between our MD simulations and experimentally measured [Li et al., (2013)] structural conversion. Consequently, the final MD structure of W2F (Fig. 4.2d) and WT (Fig.

4.2g) dimers correspond to an “intermediate” state in the interconversion between S-dimers and X-dimers. Our RMSD result which shows that the WT dimer adopts a structure different from its starting conformation is consistent with previous MD simulations, albeit with a different force field and significantly shorter production runs [Cailliez and Lavery, (2006)].

Similar to the W2A mutant, the K14E mutation abolished the ability of Ecad to shuttle between multiple conformations. While the K14E mutants attempted to adopt an X-dimer conformation, as indicated by the spike in RMSD observed at ~12 ns, the repulsive interactions introduced by mutating K14 to E14 kept this mutant from forming an X-dimer structure (Fig. 4.2j). Consequently, these dimers remained in a strand-swap conformation with equilibrated RMSDs of 0.70 nm and 2.0 nm relative to its initial structure and the X-

88 dimer crystal structure, respectively (Fig. 4.2k). However, as indicated by a PC1 standard deviation of 5.1 nm (Fig. 4.2l, Table E.1), the K14E mutant sampled a much larger conformational space compared to W2A X-dimers (Fig. 4.2c, Table E.1), although this space was limited only to S-dimer conformations.

This conformational evolution of cadherin structures was also confirmed when we compared the solvent accessible surface area (SASA) in the MD simulations (Fig. E.1).

While the X-dimer SASA remained fairly constant throughout the simulation (Fig. E.1a), the

SASA decreased in the W2F mutant and WT cadherin, which converted from an S-dimer to

X-dimer conformation (Supplementary Figures 4.1b-c). In contrast, SASA measurements in

K14E remained relatively constant with the highest observed value (Fig. E.1d). Aside from the measured SASA, the final K14E conformation resembled its original structure with an angle of 77° between opposing EC1 domains; this angle was much wider compared to angles of 29°, 45° and 21° for the final WT, W2F and W2A, structures respectively (Table E.2).

4.2.3. Conformational interconversion depends on the K14-D138 salt-bridge interaction

Since one of the principal interactions in an X-dimer is the salt bridge formed between

K14 on the EC1 domain of one protomer and D138 at the apex of the EC2 domain of its opposing partner, we monitored the distance between these amino acids in the MD simulations of the W2A mutant, the W2F mutant, the WT-cadherin and the K14E mutant dimer structures. Our data showed that in the W2A mutant, the K14-D138 salt bridge pair

“clips” the cadherins together into an X-shaped conformation, such that it reduced any observable dynamic motion along PC1 and PC2; while maintaining its original conformation throughout the 150 ns MD simulation (Fig. 4.3a). Consequently, that distance between the

89 center of mass of the K14 and D138 amino acids in the X-dimer remained fairly constant at

~0.92 nm (Figures 4.3b-c).

90

Figure 4.3. K14-D138 salt-bridge pair locks X-dimers. Center of mass distances between the complementary K14-D138 (R14-D137 in W2F Ncad) salt bridge pairs were monitored. (a,b,c) Distance measured between the center of masses of the K14 and D138 amino acid residues on the two protomers (Proteins A and B) remain stable in the W2A X-dimer structure. (d,e) The W2F Ncad dimer structure has two complementary R14-D137 salt-bridge pairs, which exhibit (f) decreasing distance relative to each other during a 150 ns MD run. (g,h) WT dimer structure also shows (i) decreasing distance between K14 and D138. (j,k) Mutating K14 to E14 prevents the two opposing proteins from coming together due to electrostatic repulsion. As a result, (l) the distance between E14 and D138 remains fairly constant at ~2.7 nm.

Similarly, as the W2F-Ncad and WT-Ecad converted from an S-dimer to an X-dimer-like structure, the R14-D137 pair (W2F-Ncad, Figures 4.3d-f) and K14-D138 pair (WT-Ecad,

Figures 4.3g-i) began to come together, approaching a final distance of 1.25 nm between the center of masses of the amino acid pairs. However, salt bridge formation was not completed within the 150 ns duration of the MD simulation. In contrast, the K14E mutant where X- dimer formation was abolished due to a repulsive interaction between E14 and D138, showed a large (~2.7 nm) distance between E14 and D138 (Figures 4.3j-l).

4.2.4. Intermediate dimer states exhibit force-induced conformational motion perpendicular to the pulling direction

Since cadherins populate an intermediate state as they interconvert between X-dimers and

S-dimers, we performed SMD simulations to determine the mechanical properties of the intermediate conformation and compare them to the force-induced unbinding of X-dimers and S-dimers (Fig. 4.4, Supplementary Video 5-8). Although SMD simulations have been previously used to uncover the molecular mechanism by which S-dimers and X-dimers form slip and catch-slip bonds, respectively [Bayas et al., (2004); Manibog et al., (2014)], the mechanical properties of intermediate structures have not been investigated. We therefore performed SMD simulations with 0.4 nm/ns constant pulling velocity in explicit solvent on

91

W2A-Ecad, W2F-Ncad, WT-Ecad and K14E-Ecad. In order to establish reproducibility, three SMD simulations were performed on each cadherin, using the final conformation from the MD simulation at 150 ns (henceforth referred to as 150-SMD), the structure at 140 ns

(henceforth referred to as 140-SMD) and the structure at 100 ns (henceforth referred to as

100-SMD) as initial structures. All SMD simulations employed a center-of-mass pulling method using a virtual harmonic spring to dissociate the opposing protomers via a C-terminal residue pulling group. Force was applied in a direction parallel to PC1, pulling the proteins away from one another (Supplementary Videos 5-8). All four dimer structures separated under the influence of force, as demonstrated by an increase in angle between the dimers

(Fig. 4.4a; red curve, Figures 4.4c-f). However, two distinct sets of unbinding motions were measured for the W2A, W2F and WT structures: a motion in the pulling direction (along

PC1) and a torsional motion perpendicular to the applied force (along PC2) (Fig. 4.4b; black curves, Fig. 4.4c-4.4e). In contrast, the K14E mutant primarily moved only in the direction of force application (black curve, Figure 4.4f).

Upon pulling the W2A dimer, we measured a peak-to-peak torsion angle change between opposing EC1 domains, along the PC2 direction, of 39° (150-SMD), 48° (140-SMD) and 46°

(100-SMD) (black line, Fig. 4.4c); the standard deviation of this torsional motion was 9° averaged over the three SMD simulations. Similarly, when the W2F dimer intermediate state was pulled, a significant torsional motion with a peak-to-peak EC1-EC1 angle of 47° (150-

SMD), 25° (140-SMD) and 34° (100-SMD) (black curve, Fig. 4.4d) and a standard deviation of 8° averaged across all three SMD simulations was measured. Pulling the WT-Ecad intermediate state also resulted in a motion perpendicular to the pulling force, comparable to both the W2A and W2F mutants: the peak-to-peak angle change between the EC1 domains

92 along PC2 was 40° (150-SMD), 32° (140-SMD) and 30° (100-SMD), with a standard deviation of 8° averaged over the three SMD simulations (black curve, Fig. 4.4e).

93

Figure 4.4. Cadherin intermediate states exhibit unbinding motion perpendicular to the pulling direction. Three SMD simulations were performed on the W2A, W2F, WT and K14E trans dimer structures using the structure at 150 ns (150-SMD), 140 ns (140-SMD) and 100 ns (100-SMD) of the corresponding MD simulation as the starting conformation. When the proteins were pulled apart, we monitored (a) their angular separation (depicted by the red arrow) and (b) the torsion angle perpendicular to the pulling direction (depicted by the black arrows). The time evolution of the angular separation (red curves) and torsion angle (black curve) show that while the (c) W2A mutant, (d) W2F intermediate state and (e) WT intermediates state unbind along the pulling direction, they also exhibit significant torsional motion in the orthogonal direction. In contrast, the (f) K14E structure unbinds primarily along the pulling direction.

Although motion orthogonal to the pulling force was observed in the W2A mutant, the X- dimers formed force-induced hydrogen bonds which resulted in a catch-slip bond formation

[Manibog et al., (2014)]. The formation force induced hydrogen bonds was facilitated by the

K14-D138 salt bridge pair (Fig. 4.3a-4.3c) which locked the proteins together. Consequently, the EC1-EC1 angular separation in the W2A mutant was maintained at ~14° (smallest among the four trans dimers) for ~9 ns, with minimal torsion unbinding motion, before the opposing

EC1 domains started to separate (Fig. 4.4c).

In contrast to W2A mutants, both the W2F and WT intermediate dimer structures did not have the R14-D137 (W2F-Ncad) or K14-D138 (WT-Ecad) salt bridge pairs. Consequently, they had a larger EC1-EC1 angular separation (~38° in W2F and ~34° in WT) (red line, Fig.

4.4d-4e) which increased earlier upon pulling (~7 ns in W2F and ~5 ns in WT) than for the

W2A mutants. However, the evolution of the torsional motion along PC2, was different between the W2F and WT-Ecad dimers. The change in torsional angle between the EC1 domains in the W2F mutant was more abrupt (black curve, Fig. 4.4d), likely due to both the smaller Phe2 side chain and fewer interactions at the dimer interface [Vendome et al., (2014),

(2011)]. In comparison, the bigger Trp2 indole ring occupied a larger surface area in the hydrophobic pocket of WT-Ecad dimers and was additionally stabilized by a formed between its side chain NH group and the Asp90 carbonyl group in the hydrophobic

94 pocket [Häussinger et al., (2004); Shapiro et al., (1995); Vendome et al., (2011)]. This imposed extra constraints on the N-terminal strand and resulted in the slower twisting motion between the opposing EC1 domains in WT-Ecad (black curve, Fig. 4.4e). Consequently, the angular separation along PC1 showed a more distinct stepping profile in W2F where the twisting of the opposing EC1 domains was accompanied by a smaller displacement along

PC1 (Fig. 4.4d). Nonetheless, both of these intermediate structures were largely stationary as the two opposing EC1 domains twisted relative to each other, indicating that the dissociation pathway was insensitive to the pulling force (red line, Fig. 4.4d-4.4e). Based on the results of these simulations, we predicted that while both W2F mutants and WT-Ecad form force- independent bonds, the W2F dimer interactions are more force-insensitive.

Finally, compared to the other cadherin constructs, the K14E mutant had fewer binding interface constraints due to its wide angle (Table E.2) and large solvent accessible surface area (Fig. E.1d). Consequently, pulling the opposing proteins along PC1 posed fewer restrictions on its unbinding motion and the K14E mutant exhibited negligible torsional motion, with a standard deviation of 4° averaged across all three SMD simulations (black curve, Fig. 4.4f). Thus, the force-induced dissociation of K14E dimers resulted in a steady unbinding in the PC1 direction with minimal motion along PC2 (Supplementary Video 8), suggesting that K14E dimers steadily weaken as force increases, an essential characteristic of a slip bond [Rakshit et al., (2012)].

4.2.5. Cadherins trapped in an intermediate dimer state form ideal bonds

To test the hypothesis that ideal bonds are formed due to the unbinding from an intermediate state, we performed single molecule AFM force clamp spectroscopy measurements using W2F, WT and K14E Ecad constructs. The complete Ecad ectodomains

95 were site-specifically biotinylated at a C-terminus AviTag sequence and immobilized on

Polyethylene Glycol (PEG)-functionalized AFM tips and glass coverslips that had been decorated with streptavidin, a biotin binding protein [Rakshit et al., (2012); Sivasankar et al.,

(2009); Zhang et al., (2009)] (Fig. 4.5a). The cadherin functionalized tip and substrate were brought into contact allowing the proteins to interact for specified contact times (0.3 s or 3.0 s). Subsequently, the tip was pulled away from the surface and clamped at a set force and the survival-time for each binding-event was measured (Fig. 4.5b). Cadherin binding frequency was adjusted to < 5% by controlling the cadherin density on the tip and substrate; Poisson statistics predicted that under these conditions, > 97% of measured events occur due to the rupture of single X-dimers.

Single molecule fluorescence microscopy experiments have previously shown that under similar experimental conditions, a majority of the measured unbinding events occur due to rupture of single trans-dimers [Zhang et al., (2009)]. Approximately 1,000-1,500 single molecule measurements were carried out at five different clamping forces. To directly identify specific-binding events and eliminate non-specific interactions, we simultaneously converted the force-clamp (force vs. time) measurements (Fig. 4.5b) to their corresponding force vs. distance traces; and monitored the stretching of the PEG tethers that anchor the interacting cadherins to the AFM tip and substrate (Supplementary Figures 4.2a-b)

[Manibog et al., (2014)]. As shown previously [Manibog et al., (2014)], this allowed us to unambiguously identify specific single molecule unbinding events since they occurred at a tip-surface separation corresponding to the stretching of two PEG tethers (Fig. E.2a).

96

Figure 4.5. Cadherins trapped in intermediate states form ideal bonds. Single molecule AFM force clamp spectroscopy was performed on WT Ecad, the K14E mutant and the W2F mutant. (a) AFM tips and glass cover slips were functionalized with biotinylated Ecads via a PEG-biotin- streptavidin complex. (b) The survival time of each unbinding event was measured as the persistence of the bond at the set force. Two methods were used to determine the bond lifetimes at each clamping force: (i) exponential decay fits of the bond survival probabilities and (ii) weighted mean lifetimes.

97

(c) K14E S-dimers show slip bond behavior. In contrast, WT Ecads that interact both for (d) 3.0 s and (e) 0.3 s show force-independent bonds. W2F mutants show ideal bonds both at (f) 3.0 s and (g) 0.3 s contact times. All bonds were fitted to the same microscopic slip bond model (green curve). The survival probabilities of specific unbinding events are shown in the insets of c, d, e, f, and g; for clarity, bond survival probability plots are zoomed in.

Two methods were then used to determine the bond lifetime at each clamping force

(Figures 4.5c-g). In the first method, bond lifetimes were determined from single-exponential decay fits to the bond survival probability (insets, Figures 5c-5g, Fig. E.2c). In the second method, we computed the weighted mean lifetime at each clamping force without ignoring any outliers. The measured weighted mean lifetimes were in good agreement with the lifetimes obtained from single exponential decay fits (Figures 4.5c-g). The error bars in the force and lifetime data are the standard error of means (s.e.m) measured using a bootstrap with replacement protocol [Efron, (1982)]. We first measured the force-dependent bond- lifetimes of K14E mutants in 1.5 mM Ca2+. Our experiments showed that the K14E mutant forms slip bonds, their lifetimes decreased with increasing force (Fig. 4.5c). This slip bond data is in good agreement with previous results demonstrating that K14E S-dimers form slip bonds in 2.5 mM Ca2+ [Rakshit et al., (2012)] and supports the predictions of our computer simulations showing that K14E mutants are always trapped in an S-dimer conformation and that pulling force results in unbinding in the PC1 direction with negligible motion along PC2.

Next, we measured the force-dependent bond-lifetimes of WT-Ecad in 2.5 mM Ca2+. Since

WT Ecads can unbind/rebind in ~1 s [Li et al., (2013)] as the proteins shuttle between X- dimer and S-dimer, we allowed the WT Ecads to interact for both long (3.0 s) and short (0.3 s) periods of time (Figures 4.5d-e). In both cases, the WT-Ecads formed bonds that had a weak dependence on pulling force, likely due to dissociation of the intermediate conformation. Our previous AFM force clamp data [Rakshit et al., (2012)] which showed

98 that WT Ecads form weak force-dependent bonds even when the time of contact is reduced to 0.001 s, suggests that the transition from X-dimer to the intermediate state occurs rapidly.

While we had previously interpreted the weak force-dependent bonds at short contact times to be ideal bonds, our current experiments, performed with more stringent data selection criteria, suggest that these bonds may possess some slip-bond properties. It is also possible that the difference compared to reference [Rakshit et al., (2012)] arises from the transient nature of the intermediate state from which cadherins unbind. [Li et al., (2013)]. The intrinsic parameters for K14E and WT cadherin dimers (Table E.4) were obtained by fitting the data to a microscopic slip bond model [Dudko et al., (2008), (2006)].

Finally, we engineered a weaker affinity W2F-Ecad conformational shuttling mutant by replacing Trp2 with Phe2 (Methods). Our single molecule measurements with the W2F-Ecad showed that these constructs formed very short lifetime ideal-bonds that remain insensitive to pulling force, at both short (0.3s) and long (3.0s) contact times (Figures 4.5f-g), suggesting that the W2F-Ecad remain in an intermediate conformation at both interaction times. This result is consistent with a recent report that W2F mutants form an equilibrium ensemble of

X-dimers and S-dimers [Vendome et al., (2014)]. When we globally fit the measured force vs. lifetime data to the slip bond model described above, the distance between the bound state and the transition state (xβ) along the pulling coordinate was zero (Table E.4), indicating that the pathway for the unbinding of the W2F mutant was independent of force.

4.3. Discussion

Our study integrates computer simulations and single molecule force measurements to predict the energy landscape for cadherin adhesion, the transition pathway for

99 interconversion between X-dimers and S-dimers and to identify cadherin structures that form ideal bonds. We show that ideal bonds are formed due to the force-induced dissociation of an intermediate conformation in the X-dimer to S-dimer interconversion pathway. Pulling on this intermediate state causes a torsional motion perpendicular to the applied force. Cadherin unbinding along a pathway orthogonal to the pulling coordinate results in identical molecular extensions in the bound and transition states, resulting in force-independent ideal bonds. Our study demonstrates that a simple one-dimensional energy landscape, which assumes conformational motions along the pulling direction, fails to capture the rich dynamics of cadherin response to mechanical force. Multidimensional energy landscapes have previously been reported for the refolding of RNA hairpins and pulling of DNA towards denatured structures [Gore et al., (2006); Hyeon and Thirumalai, (2007)].

Both WT-Ecad and W2F mutants adopt an intermediate state resembling an X-dimer, albeit with swapped N-terminal -strands. Since the pair of K14-D138 salt bridges are not present in the intermediate state, its unbinding motion along the pulling direction and orthogonal to the pulling coordinate is less constrained. In addition, a Pro5-Pro6 motif that is buried in the adhesive interface, prevents opposing protomers from forming a tight contact

[Vendome et al., (2011)] and enables the EC1 domains to twist past each other as the pulling force unbinds the proteins. Our AFM force clamp measurements confirm that unbinding of the intermediate state results in ideal bond formation. While a similar torsional motion is also observed in the W2A mutants, the K14-D138 salt bridge pair, which is present in this mutant, locks the proteins together and results in catch bonds due to force-induced hydrogen bond formation [Manibog et al., (2014)]. However, torsional motion weakens the pulling force

100 dependence of the W2A catch bonds; consequently these catch bonds have very short peak lifetimes that range between 0.06 s to 0.1 s [Manibog et al., (2014); Rakshit et al., (2012)].

Force-free solution binding measurements report that S-dimers have a ~10-fold higher binding affinity than X-dimers [Harrison et al., (2010); Katsamba et al., (2009)]. Our approximate energy landscapes also support this conclusion (Fig. 4.1a). However, our data also indicates that the energetic differences between X-dimer and S-dimers are insufficient to preclude them from interconverting. In fact, two complementary approaches - the coarse- grained knowledge-based energy landscapes as well as molecular dynamics based on all atomic force-fields yield structural intermediates between the X-dimer and S-dimer forms that are highly similar. The MD trajectories are also clearly seen to carefully avoid the central high energy region of the PC1-PC2 space, thus validating our predicted energy landscape.

The shuttling between X-dimer and S-dimer conformations likely explains why the WT and

W2F Ecads do not form slip bonds that are a signature of the S-dimer conformation.

Since key structural features of trans dimers formed by both the truncated ectodomains

(used in our simulations) and by the complete ectodomain are very similar, the force-induced multidimensional unbinding that we observe will likely be seen with full length cadherin ectodomains. In particular, the angle between opposing EC1 domains in the WT-Ecad trans dimer crystal structure used in our simulations (85°), are similar to the angle (91°) in full- length Ecad trans dimers (PDB: 3Q2V) [Harrison et al., (2011)], suggesting that our results are not an artifact of using just the outer two EC domains in the simulations. Similarly, since the EC1-EC1 angle (78°) in the Ncad WT EC1-EC2 trans dimer structure (PDB: 2QV1) is similar to the EC1-EC2 WT-Ecad trans dimer crystal structure, it is likely that the results obtained with the W2F Ncad crystal structure apply to W2F Ecads as well.

101

In intercellular junctions, cadherins self-organize to form a 2D lattice where proteins from opposing cells form trans dimers while proteins on the same cell surface participate in lateral cis interactions [Harrison et al., (2011)]. Within this 2D lattice, the trans dimers orient their unique cis interface in nearly perpendicular directions [Harrison et al., (2011)]. The orthogonal orientation of opposing trans- and cis- binding interfaces limit force-induced torsional motion orthogonal to the pulling direction (Fig. E.3). Consequently, our data suggests that only stand-alone trans dimers, that are not incorporated in cis clusters, form ideal bonds. Since ideal bonds have low lifetimes and are preferentially ruptured by mechanical force, they may serve to enhance the fraction of cis clusters on the cell surface.

4.4. Methods

4.4.1. Principal Component Analysis

Eleven crystal structures (3LNE, 1EDH, 1FF5, 1Q1P, 2QVF, 3LNG, 3LNH, 3LNI, 3Q2L,

3Q2N, 3Q2V) of classical cadherins were obtained from the Protein Data Bank (PDB). After processing the files to remove all hetero atoms, the structures were aligned to the structure

3LNE by use of the Procrustes analysis algorithm implemented in MATLAB. Based on this structural alignment, a set of 411 residues present in all the structures were retained for further analysis. The aligned 3D position coordinates (x, y, and z) of the Cα atoms in each structure constitutes a multivariate dataset for PCA, 횵푛 × 푝 where the number of structures,

푛 = 11 and the number of variables, 푝 = 3푁 where 푁 = 411, the number of residues considered in each structure. The variance-covariance matrix 푺 of the dataset is obtained as

̅ ̅ 푠푖푗 = 퐸[(휉푖 − 휉푖)(휉푗 − 휉푗)] ∀ 1 ≤ 푖, 푗 ≤ 3푁 (4.1)

102

푡ℎ ̅ where 휉푘 refers to the 푘 variable (x, y, or z coordinate) and 휉푘 refers to the mean of the

푘푡ℎ variable. The covariance matrix 푺 is decomposed as 푺 = 푬∆푬푇 where the columns of 푬 are the eigenvectors arranged in the decreasing order of the eigenvalues (elements of the diagonal matrix ∆). The amount of variance captured by each eigenvector is obtained from its eigenvalue. The projections of the points on each eigenvector 푷푪푛×3푁 = 횵푛×3푁 × 푬3푁×3푁 are called the principal components (PCs). The projections of the mean centered data onto

푇 푇 the PCs are called the PC scores, 푷푛×3푁 = (흃 − (⃗1 푝×1 × 흃 )) × 푬3푁×3푁 where 흃 is the transpose of the mean vector of position coordinates.

4.4.2. Projecting energy landscapes onto PC coordinates

Representative structures were sampled uniformly at equally spaced points along the first and second eigenvectors from the PCA to produce a rectangular grid. The limits of the grid were obtained from the range of PC scores observed among the 11 crystal structures. The coordinates of each representative structure on the grid were calculated using the coordinates of a central structure 푅0 (closest to the origin) on the PC grid and the PC eigenvectors. The

3D coordinates 푹1×3푁 of a structure 푅 on the PC grid at position (푅1, 푅2) could be obtained from the position coordinates of 푅0 as

0 0 0 푹1×3푁 = 푹1×3푁 + (푅1 − 푅1 ) × 풆1 + (푅2 − 푅2) × 풆2 (4.2)

0 0 0 where (푅1 , 푅2) were the scores of 푅 along 푃퐶1 and 푃퐶2 and 풆1 and 풆2 were the eigenvectors corresponding to the first and second PCs. The potential energy of each structure was estimated as an optimized linear combination of three different knowledge- based potential terms that we previously developed: four-body sequential potential, four- body non-sequential potential, and short-range potentials [I. Bahar et al., (1997b); Feng et al.,

(2010), (2007); Zimmermann et al., (2012)]. The potential energy was calculated as

103

푉표푝푡 = 푉4−푏표푑푦 푠푒푞 + 0.28 ∗ 푉4−푏표푑푦 푛표푛−푠푒푞 + 0.22 ∗ 푉푠ℎ표푟푡 푟푎푛푔푒 (4.3)

Here, the term ‘four-body’ refers to spatially close groups of four amino acids that can interact. Details on how the weights for the different terms were obtained [Gniewek et al.,

(2011)] by minimizing the RMSD of best decoys from homology modeling targets of CASP8

[Cozzetto et al., (2009)] to their corresponding native structures using particle swarm optimization (PSO) [Kennedy and Eberhart, (1995)] have been described previously.

4.4.3. Determination of transition path between X-dimers and stand-swap dimers

Conformational changes between large molecules such as cadherins can be efficiently captured using coarse-grained models like elastic network models (ENMs) [Atilgan et al.,

(2001); I Bahar et al., (1997); Tirion, (1996)]. In order to generate a transition path between two forms, we used the PATH-ENM server [Zheng et al., (2007)]. The method combines the

ENM potentials of the two end-point structures into a smoothly interpolated mixed potential function with one saddle point (corresponding to the transition state) and two minima

(corresponding to start and end conformations). The transition path between the start and end conformations were obtained as the lowest energy path that connects these structures via the saddle point. The structural intermediates along this transition path were projected onto the

PC1-PC2 space obtained from the 11 crystal structures. Only the Cα atoms of residues were used in the model and the distance cutoff for inter-residue interactions was chosen as 1.3 nm.

4.4.4. MD and SMD simulations and structural analysis

MD and SMD simulations were performed using the Condo cluster at the High Performance

Computing facility at Iowa State University. Four cadherin crystal structures were used in all the MD and SMD simulations. Of these, the K14E mutant (PDB ID: 3LNE), Wildtype (PDB

104

ID: 2QVF), and W2A mutant (PDB ID: 3LNH) structures were from Ecad while the W2F mutant (PDB ID: 4NUQ) structure was from Ncad. S-dimers were built from the crystallographic structure of their EC1-EC2 chains by applying crystallographic symmetry operations. Before simulations were performed, missing amino acid residues were added using Swiss-PDBViewer v4.1. All simulations were performed using GROMACS 4.6.5 software with GROMOS 53a6 force field and SPC216 water model.

The MD and SMD simulations were carried out similar to the method reported in our previous paper [Manibog et al., (2014)]. Briefly, for MD simulations, the cadherin dimer crystal structures were positioned at the center of a triclinic box, with all protein atoms at or more than 1 nm away from the walls of the solvated box. The solvated box contains 66,877 atoms, 83,446 atoms, 85,144 atoms, and 75,193 atoms, respectively, for W2A, W2F,

Wildtype, and K14E dimer structure MD simulations. The solvated box systems were charge-neutralized with the appropriate number of Na+ ions. Steric clashes within the box were eliminated by minimizing the potential energy of the system. Prior to a production MD run, we first equilibrated the water molecules and ions by establishing and maintaining a 300

K temperature using a modified Berendsen thermostat and stabilizing the pressure at 1 atm under isothermal-isobaric conditions using a Parinello-Rahman barostat. Once equilibration was attained, MD runs were performed for 150 ns with a 2-fs integration step using LINCS constraints and with frames recorded at 2-ps intervals. A 1 nm cutoff was used both for van der Waals interactions and for electrostatic interactions using the particle mesh Ewald technique. Periodic boundary conditions were assumed in all simulations. To monitor the evolution of structures in the MD simulations, we calculated RMSDs by least square fitting each trajectory to either its starting structure or to the X-dimer crystal structure.

105

Three structures – one taken at the end of the MD simulation at 150 ns, another at 140 ns, and a third at 100 ns – from each of the four MD simulations were used as starting conformations for the corresponding SMD simulations. The SMD starting structures of WT,

W2F and W2A dimers were placed at the center of 12 x 40 x 8 nm3 triclinic boxes, while the

K14E mutant was embedded at the center of a 40 x 12 x 8 nm3 box, with the longest dimension along the applied force direction. Each system consisted of ~122,455 atoms

(K14E mutant); ~146,815 atoms (WT); ~136,417 atoms (W2F mutant); and ~136,466 atoms

(W2A mutant). All SMD simulations used an umbrella pulling method utilizing a harmonic spring with stiffness k = 332 pN/nm. Similar to our AFM force measurements where the biotinylated C-terminal end of a complete EC1-EC5 cadherin repeats is pulled, this virtual harmonic spring pulled the opposing protomers apart at a constant velocity of 0.4 nm/ns. The

C-terminal residue of one protein chain was designated as the pulling group while the C- terminal residue from the other chain was the reference group. All SMD simulations are integrated at every 2 fs step with LINCS constraints and frames are recorded every 1 ps

(Supplementary Figures 4.4-4.6). To determine how the proteins dissociate in the SMD simulations, the long helical axis of opposing EC1 domains was determined every 1 ps using

VMD 1.9.1. Two sets of angles were computed using MATLAB: the angles between the EC1 domains at each time point (angular separation) and the EC1-EC1 angles along PC2 (torsion angle). The minimum/maximum torsion angles were determined by taking the average of 100 data points around the minimum and maximum values.

4.4.5. E-Cadherin Constructs and Single Molecule Force-Clamp Measurements: In order to engineer W2F mutants, the full extracellular region of Ecad with a C-terminal Avi-

106 tag was cloned into pcDNA3.1(+) vectors using primers containing a Tev sequence and His- tag. This resulted in an open reading frame of the complete E-cadherin/Avi/Tev/His sequence. The W2F mutation was introduced in the EC1 domain of E-cadherin/ATH by point mutation using QuikChange kit (Agilent). The engineered cadherin sequence was transfected into HEK 293 cells, which was selected using 400μg/ml of Geneticin (G418, Invitrogen

Corp). Cells were grown to confluency in high glucose DMEM containing 10% FBS and 200

µg/ml G418 and then exchanged into serum free DMEM. Conditioned media was collected 4 days after media exchange. Purification, and biotinylation of the W2F mutant, WT-Ecad and the K14E mutant followed protocols described previously [Rakshit et al., (2012)]. Briefly, the cadherin extracellular constructs were purified using a Nickel-NTA resin (Invitrogen) and biotinylated at the C-terminal Avi tag site using BirA enzyme (BirA500 kit; Avidity).

Prior to the single molecule force clamp spectroscopy experiments, glass coverslips and

Si3N4 AFM cantilevers were cleaned and functionalized with biotinylated cadherin monomers. The protocol used for cadherin immobilization has been previously described

[Manibog et al., (2014); Rakshit et al., (2012)]. Briefly, the cantilevers and coverslips were functionalized with amine groups using a 2% v/v solution of 3-Aminopropyltriethoxysilane

(Sigma) in Acetone. The amine groups were subsequently decorated with polyethylene glycol spacers (PEG 5000, Laysan Bio) containing an amine-reactive N-hydroxysuccinimide ester on one end; 7% of the PEGs were functionalized with a biotin group at the other end.

Prior to affixing cadherins on functionalized surfaces, the coverslips and cantilevers were incubated overnight in a 0.1 mg/ml BSA solution to minimize non-specific binding.

Following BSA incubation, the functionalized coverslips and cantilevers were incubated with

0.1 mg/ml streptavidin. Biotinylated cadherin monomers (150 – 200 M for 45 mins) were

107 then bound to the immobilized streptavidin. Free biotin binding sites of streptavidin molecules were blocked by 0.02 mg/ml solution of biotin.

Single molecule force clamp spectroscopy experiments were performed using an Agilent

5500 AFM with a closed-loop piezoelectric scanner. Force measurements were carried out in a 10 mM Tris,100 mM NaCl, 10 mM KCl, pH 7.5 buffer solution containing either 2.5 mM

CaCl2 (WT-Ecad and W2F-Ecad) or 1.5 mM CaCl2 (K14E-Ecad). The cadherin functionalized AFM tip and glass coverslip were brought into contact for 0.3 s or 3.0 s. The tip was then withdrawn from the substrate at a rate of 100-600 pN/s and clamped at a constant force. Lifetimes were determined from the persistence time of the bond at each clamping force. In order to unambiguously identify specific unbinding events from non- specific and multiple unbinding, the force clamp events were converted to force vs. tip- surface distance traces. Unbinding distance obtained in every trace was compared to the theoretical value determined using an extended freely jointed chain model [Oesterhelt et al.,

(1999)] for PEG.

4.5. Author Contributions

SS directed the research. KM performed the atomistic simulations and KS performed the coarse grained simulations. KM and SK performed the single molecule experiments. YZ contributed reagents. KM and SS analyzed the atomistic simulations and single molecule experiments. KS and RLJ analyzed the coarse grained simulations. KM and SS wrote the manuscript with assistance from KS and RLJ.

108

4.6. Acknowledgments

This work was supported in part by a grant from the American Cancer Society (Research

Scholar Grant 124986-RSG-13-185-01-CSM) to SS and a grant from the National Science

Foundation (MCB-1021785) to RLJ.

4.7. Supplementary Material

Supplementary Information is available in Appendix E.

109

CHAPTER 5. AN ANALYSIS OF CONFORMATIONAL CHANGES UPON

RNA-PROTEIN BINDING

Abridged version originally appeared in the Proceedings of the 5th ACM

Conference on Bioinformatics, Computational Biology, and Health

Informatics (2014): Pages 592-593.

Kannan Sankar, Rasna R. Walia, Carla M. Mann, Robert L. Jernigan, Vasant

G. Honavar and Drena Dobbs

Abstract

RNA-binding proteins (RBPs) have myriad functions in transcription, translation, and post- transcriptional gene regulation, with central roles in normal development as well as in both genetic and infectious diseases. When a protein binds RNA, a conformational change often occurs. For RNA-protein complexes that have been characterized, conformational changes have been observed in the protein, the RNA, or both. These conformational changes have not been sufficiently characterized, however, in part due to the small number of structures of bound and unbound complexes of RNA-binding proteins available until recently. Here we systematically analyze a new dataset of 90 pairs of bound and unbound proteins to evaluate the conformational changes that occur upon RNA binding. Most of the conformational changes were observed in non-interfacial regions of the RNA-binding proteins. Detailed analyses of the modes of RNA binding and any associated conformational changes in proteins are critical for fully understanding the mechanisms of RNA-protein recognition, for developing better RNA-protein docking methods and methods for predicting interfacial residues, and for RNA-based drug design.

110

5.1. Introduction

Proteins rarely function in isolation: they usually interact with other biomolecules, including other proteins, nucleic acids, lipids and small molecules. RNA-protein interactions play important roles in diverse cellular processes, ranging from transcription [Hogan et al.,

(2008); Kechavarzi and Janga, (2014)], RNA processing and translation [Galicia-Vázquez et al., (2009)], to viral replication [Denison, (2008); Li and Nagy, (2011)] and host defense against pathogens [Pumplin and Voinnet, (2013)]. In some cases, the interactions between the RNA and protein may be relatively transient [Blanco and Montoya, (2011)] (e.g., during transcription by RNA polymerase), whereas in others the interactions are more permanent and become stable structural components of a complex [Kim et al., (2014)] (e.g., in the spliceosome and ribosome). The binding of an RNA molecule to a protein can be associated with a conformational change in the protein and/or the RNA [Mu and Stock, (2006)], and this can significantly change the protein’s overall structure and dynamics, influencing its cellular function [Guo et al., (2012)].

Understanding the extent of conformational changes in proteins upon binding to RNA can be crucial for the accuracy of computational prediction of interfaces and RNA-protein docking. For example, machine learning approaches for RNA-protein interface predictions can perform better using features extracted from sequences instead of from structures [Puton et al., (2012); Walia et al., (2012)]. This somewhat surprising result could be explained if many RNA-binding proteins assume significantly different structures in their unbound versus

RNA-bound forms. In this case, the use of structural features extracted from the unbound structures could confound the identification of interfacial residues in the RNA-bound

111 structures. This possibility motivated us to look more carefully into the conformational changes that RNA-binding proteins undergo upon RNA binding.

In several well-studied cases, RNA-protein recognition is believed to be accompanied by an induced-fit, instead of a lock-and-key mechanism [Wang et al., (2013); Williamson,

(2000)]. However, a systematic analysis of conformational changes in a large dataset of proteins upon RNA binding, especially at the coarse-grained residue level, has not been performed. To our knowledge the only such work was by Ellis & Jones [Ellis and Jones,

(2008)], in which they examined 12 pairs of bound and unbound structures of RBPs. The number of RNA-protein complexes in the Protein Data Bank (PDB) has risen significantly since that study, and several protein-RNA docking benchmarks have been developed [Barik et al., (2012); Huang and Zou, (2012); Perez-Cano et al., (2012)].

Availability of at least two different structures of a protein, one in complex with RNA and another in its unbound state, can provide significant insights into conformational changes in the protein. Rather than using a global measure of conformational change such as RMSD, we followed the approach of Ellis & Jones [Ellis and Jones, (2008)]. Proteins rarely function in isolation: they usually interact with other biomolecules, including other proteins, nucleic acids, lipids and small molecules. RNA-protein interactions play important roles in diverse cellular processes, ranging from transcription [Hogan et al., (2008); Kechavarzi and Janga,

(2014)], RNA processing and translation [Galicia-Vázquez et al., (2009)], to viral replication

[Denison, (2008); Li and Nagy, (2011)] and host defense against pathogens [Pumplin and

Voinnet, (2013)]. In some cases, the interactions between the RNA and protein may be relatively transient [Blanco and Montoya, (2011)] (e.g., during transcription by RNA

112 polymerase), whereas in others the interactions are more permanent and become stable structural components of a complex [Kim et al., (2014)] (e.g., in the spliceosome and ribosome). The binding of an RNA molecule to a protein can be associated with a conformational change in the protein and/or the RNA [Mu and Stock, (2006)], and this can significantly change the protein’s overall structure and dynamics, influencing its cellular function [Guo et al., (2012)].

Understanding the extent of conformational changes in proteins upon binding to RNA can be crucial for the accuracy of computational prediction of interfaces and RNA-protein docking. For example, machine learning approaches for RNA-protein interface predictions can perform better using features extracted from sequences instead of from structures [Puton et al., (2012); Walia et al., (2012)]. This somewhat surprising result could be explained if many RNA-binding proteins assume significantly different structures in their unbound versus

RNA-bound forms. In this case, the use of structural features extracted from the unbound structures could confound the identification of interfacial residues in the RNA-bound structures. This possibility motivated us to look more carefully into the conformational changes that RNA-binding proteins undergo upon RNA binding.

In several well-studied cases, RNA-protein recognition is believed to be accompanied by an induced-fit, instead of a lock-and-key mechanism [Wang et al., (2013); Williamson,

(2000)]. However, a systematic analysis of conformational changes in a large dataset of proteins upon RNA binding, especially at the coarse-grained residue level, has not been performed. To our knowledge the only such work was by Ellis & Jones, 2008 [Ellis and

Jones, (2008)], in which they examined 12 pairs of bound and unbound structures of RBPs.

The number of RNA-protein complexes in the Protein Data Bank (PDB) has risen

113 significantly since that study, and several protein-RNA docking benchmarks have been developed [Barik et al., (2012); Huang and Zou, (2012); Perez-Cano et al., (2012)].

Availability of at least two different structures of a protein, one in complex with RNA and another in its unbound state, can provide significant insights into conformational changes in the protein. Rather than using a global measure of conformational change such as RMSD, we followed the approach of Ellis & Jones [Ellis and Jones, (2008)] and characterized individual residues as either changing conformations (‘flexible’) or being conformationally invariant. Our analysis was performed on a new set of 90 bound-unbound pairs of RNA- binding proteins (RB90Benchmark). These 90 pairs include a variety of different types of bound RNA, according to SCOR (Structural Classification of RNAs), [Klosterman et al.,

(2002); Tamura et al., (2004)] and different structural classes of proteins according to SCOP

(Structural Classification of Proteins) [Conte et al., (2000); Hubbard et al., (1999); Murzin et al., (1995)].

We systematically analyzed the conformational changes in all 90 pairs with regard to several criteria, including how many flexible and conformationally invariant residues mapped to buried versus exposed regions of the protein, and how many of each residue type were located in the RNA-protein interface versus non-interfacial regions. Surprisingly, we observed that most of the conformational changes occur in the non-interfacial surface regions.

These results have important implications for the improvement of knowledge-based methods for RNA-protein interface prediction and docking. For completeness, this analysis could be extended to evaluate conformational changes in the RNA, as well as the protein, and potential correlations between changes in the protein and the RNA.

114

5.2. Materials and Methods

5.2.1. Datasets

We compiled a dataset of 90 pairs of proteins (RB90Benchmark) for which both the RNA- bound and unbound forms are available, from three non-redundant protein-RNA docking benchmark datasets [Barik et al., (2012); Huang and Zou, (2012); Perez-Cano et al., (2012)], datasets from Ellis and Jones, 2008 [Ellis and Jones, (2008)] and OPRA [Perez-Cano and

Fernandez-Recio, (2010)], and from an in-house dataset extracted directly from the PDB.

Only X-ray structures were included, and only 4 of the 180 structures had resolution lower than 3.5 Å. RB90Benchmark is the largest curated set of bound-unbound pairs of RNA- binding proteins available to date, but is more selective than some others. In some benchmarks, the structure of a protein in complex with RNA may differ from the ‘unbound’ form in other ways, e.g., the protein may in fact be bound to a different RNA molecule. We excluded such cases from the dataset considered here. This dataset is available from the authors upon request and will be made available online upon publication of this manuscript.

5.2.2. Identification of ‘flexible’ regions

From the pair of structures for each protein, we considered only residues for which 3D coordinate information is available for both the bound and unbound structures. The pairs were then analyzed using the Error-inclusive Structure Comparison and Evaluation Tool,

ESCET [Schneider, (2000)], which calculates normalized or Error-scaled Difference

Distance (EDD) matrices between the pairs of crystal structures. The normalization is based on information about the uncertainty of residue positions, derived from quality assessment metrics for the structure. ESCET applies a genetic algorithm to classify residues as either conformationally invariant or ‘flexible.’ The largest subset of residues for which the internal

115 distances between all those residue pairs is the same in the bound & unbound form (within the error limits in the structures) is defined as the conformationally invariant part of the protein. Fig. 1A shows an example of an EDD matrix from ESCET for the DDX48 ATP- dependent RNA helicase [Schneider, (2000)] with the corresponding flexible and invariant parts depicted on a superimposition of the bound and unbound protein structures in Fig. 1B.

An important strength of this approach is the amount of detailed information it provides. A complete description of the algorithm is provided in [Schneider, (2004), (2002)].

Figure 5.1. ESCET analysis of the DDX48 ATP-dependent RNA helicase. (A) An EDD matrix illustrating changes in internal distances between unbound and bound forms of the RNA helicase. Regions that undergo expansion (red) and contraction (blue) are shown, with darker shades representing larger changes. In the diagram below the matrix, black and white boxes show locations of α-helices (white) and β-strands (black); the assignments of conformationally invariant (dark grey) and flexible regions (light grey) are summarized by shading behind the boxes. (B) Superimposition of unbound (purple, 2HXY:A) and bound (green, 2J0S:A) PDB structures for the helicase, with the interfacial residues in the bound form shown in red. For clarity, the bound RNA is not shown.

116

5.2.3. Identification of interface residues

A residue in the bound form is classified as an interface residue if any of its heavy atoms are within a cut-off distance of 5 Å from any heavy atom of the RNA in the crystal structure.

This is the same criterion used to identify and annotate interface residues in our in-house database, PRIDB [Lewis et al., (2010)].

5.2.4. Identification of surface vs. buried residues

The solvent accessible surface area of proteins in bound and unbound forms is calculated using the program NACCESS [Hubbard and Thornton, (1993)]. Residues with relative solvent accessibility (RSA) above a 5% cutoff are classified as exposed or surface, and others as buried [Jones et al., (2001)].

5.2.5. Classification based on SCOR category of bound RNA and SCOP class of protein

The type of RNA bound to the proteins is classified according to the SCOR database

[Klosterman et al., (2002); Tamura et al., (2004)] into one of the following 7 categories: genetic control elements (GCE), messenger RNA (mRNA), transfer RNA (tRNA), ribosomal

RNA (rRNA), viral RNA (vRNA), synthetic RNA or others (including ribozymes, SRP

RNAs, evolved and SELEX RNAs). The SCOP (Structural Classification of Proteins) class of each protein was obtained directly from the SCOP database v1.75 [Conte et al., (2000);

Hubbard et al., (1999); Murzin et al., (1995)], which classifies proteins in one of 5 categories: alpha (α), beta (β), alpha/beta (α/β), alpha+beta (α+β), or multidomain. For comparisons, statistics for all 90 pairs were averaged over each class and the results analyzed. Tables 5.1 and 5.2 show the numbers of RNAs and proteins in each SCOR or SCOP class, respectively.

117

Table 5.1. SCOR classifications for the 90 RNAs in the RB90 Benchmark dataset Type of RNA bound Number of cases GCE1 12 mRNA2 17 tRNA3 28 rRNA4 10 vRNA5 7 synthetic 8 others 8 1 GCE – genetic control element, 2 mRNA – messenger RNA, 3tRNA – transfer RNA, 4rRNA – ribosomal RNA, 5vRNA – viral RNA

Table 5.2. SCOP classification for the 90 proteins in the RB90 Benchmark dataset SCOP Class of Number of cases Protein α 13 β 15 α/β 18 α+β 23 multidomain 8 unclassified 13

5.2.6. Performance evaluation of sequence-based vs structure-based methods for RNA- binding site prediction

We use two sequence-based methods and two structure-based methods to classify residues in the RNA-binding proteins as ‘interface’ or ‘non-interface’ based on features of the unbound structures. These predicted classifications are compared with the actual information from the

RNA-bound structures of the same protein. Performance of each method is evaluated on a set of 30 proteins [Perez-Cano and Fernandez-Recio, (2010)] using four measures: Precision,

Recall, F-measure and Matthews Correlation Coefficient (MCC):-

푇푃 푃푟푒푐푖푠푖표푛 = (5.1) 푇푃+퐹푃

118

푇푃 푅푒푐푎푙푙 = (5.2) 푇푃+퐹푁

푝푟푒푐푖푠푖표푛×푟푒푐푎푙푙 2푇푃 퐹 − 푚푒푎푠푢푟푒 = 2 × = (5.3) 푝푟푒푐푖푠푖표푛+푟푒푐푎푙푙 2푇푃+퐹푃+퐹푁

푇푃×푇푁−퐹푃×퐹푁 푀퐶퐶 = (5.4) √(푇푃+퐹푃)(푇푃+퐹푁)(푇푁+퐹푃)(푇푁+퐹푁) where TP = true positives (predicted to be interface residue and actually an interface residue in the bound structure), TN = true negatives (predicted to be non-interface residue and actually not an interface residue in the bound structure), FP = false positives (predicted to be interface residue but actually not an interface residue in the bound structure) and FN = false negatives (predicted to be a non-interface residue but actually an interface residue in the bound structure).

5.3. Results

5.3.1. Sequence-based methods perform better than structure-based methods for predicting RNA-binding sites in proteins

We directly compared the performance of two sequence-based methods (RNABindRPlus

[Walia et al., (2014)] and PiRaNhA [Murakami et al., (2010)]) and two structure-based methods (KYG [Kim et al., (2006)] and PRIP [Maetschke and Yuan, (2009)]) in predicting

RNA-binding sites on an independent test set of 30 unbound proteins (Table 5.3).

For details regarding these published methods, see [Walia et al., (2014), (2012)]. We expected that structure-based methods, which exploit features extracted from the known structures of the proteins (e.g., in the case of KYG, individual and pair propensities of surface residues is exploited), should have superior performance over sequence-based methods, which use only information derived from the amino acid sequences of proteins

119

(e.g., in the case of RNABindRPlus [Walia et al., (2014)], the amino acid sequence is encoded as a PSSM and thus incorporates evolutionary conservation information). Instead, both of the sequence-based methods out-performed the structure-based methods in discriminating between RNA-binding and non-binding residues, except that PiRaNhA performed slightly worse than the structure-based methods in terms of Recall. These results are consistent with those obtained in previous studies [Puton et al., (2012); Walia et al.,

(2012)]. To explore whether conformational changes that occur upon formation of RNA- protein complexes could explain this result, we compiled a large dataset RNA-binding proteins for which structures of both bound and unbound forms are available,

RB90Benchmark (see Methods), and systematically analyzed conformational changes in all

90 pairs.

Table 5.3. Sequence-based methods perform better than structure- based methods in predicting RNA-binding sites Method Precision Recall F-measure MCC RNABindRPlus# 0.68 0.64 0.66 0.61 PiRaNhA# 0.50 0.38 0.43 0.36 KYG* 0.25 0.42 0.31 0.18 PRIP* 0.22 0.42 0.29 0.15 # Sequence-based methods; * Structure-based methods

5.3.2. Most conformationally flexible residues in RNA-binding proteins map to non- interface surface residues

In the 90 protein pairs analyzed, a total of 30,819 residues were present in structures of both the RNA-bound and unbound forms. As shown in Fig. 5. 2, among these common residues,

26,165 (~ 85%) are surface residues. After mapping the set of common residues to the flexible and interface regions in each protein, the fraction of ‘flexible’ residues located in

120 interface regions and the fraction of interface residues located in ‘flexible regions’ were calculated.

Figure 5.2. Residue counts in the dataset. Number of residues in the dataset according to different types (surface, flexible, interface)

Notably, most of the conformationally flexible residues are non-interface surface residues. Among all residues present in both the unbound and bound structures, 10% are interfacial residues (Fig. 5.2), and among these, only 27% are conformationally ‘flexible’

(not shown). In agreement with this, Fig. 5.3 shows that out of a total of 8,390 residues in

‘flexible’ regions, only 1,137 (13.6 %) are interface residues, whereas 7,253 (86.4%) are non-interface residues.

121

Figure 5.3. Pie chart showing characteristics of 8,390 residues found in “flexible’ regions (100%). Residues categories are: interface (red) vs non-interface (blue) and surface (lighter shades) vs buried (darker shades).

5.3.3. A large fraction of residues in rRNA-binding proteins undergo conformational changes upon RNA binding

Table 5.4. Percentage of interface residues located in flexible regions in proteins bound to RNA of different SCOR types. % of interface residues Type of RNA bound in flexible regions GCE1 17% mRNA2 12% tRNA3 16% rRNA4 32% vRNA5 11% synthetic 11% others 21% 1GCE – genetic control element, 2mRNA – messenger RNA, 3tRNA – transfer RNA, 4rRNA – ribosomal RNA, 5vRNA – viral RNA

One important consideration is whether the results of this analysis depend on the type of

RNA bound or the structural class of the protein, in any significant way. To explore this issue, we compared proteins bound to different SCOR categories of RNA or in different

SCOP classes, in terms of the percentage of interface residues located in conformationally

122 variable regions. Table 5.4 shows that the fraction of interface residues in flexible regions is substantially higher for rRNA-binding proteins (32%) compared to the average value for all

RNA classes (14%). In contrast, on average, the fraction of interface residues in ‘flexible’ regions is less variable when proteins in different SCOP classes are compared (Table 5.5).

Table 5.5. Percentage of interface residues in flexible regions of proteins in different SCOP classes. SCOP Class of % of interface residues in Protein flexible regions α 17% β 14% α/β 12% α+β 23% multidomain 17% unclassified 16%

5.4. Discussion

The current study was motivated, in part, by the hypothesis that pervasive conformational changes in RNA-binding proteins may explain why sequence-based machine learning approaches for predicting interfacial residues in RNA-protein complexes perform better than structure-based methods.

Based on the 90 pairs of characterized RNA-protein complexes analyzed in this study, most RNA-binding proteins do undergo a conformational change upon RNA binding. We found, however, that the most flexible regions in RNA-binding proteins correspond to non-

RNA-binding residues, and, conversely, most RNA-binding residues occur within conformationally invariant regions. We hypothesize that the inherently ‘flexible’ regions of

RNA-binding proteins play an important role in large conformational changes associated

123 with functional motions of RNPs (e.g., in ribosomes, spliceosomes, viral capsids), rather than in RNA-protein recognition alone. Ongoing experiments are designed to directly test this hypothesis.

The RB90 dataset is the largest non-redundant set of bound-unbound structures of RNA- binding proteins analyzed to date, and includes proteins and RNAs representing several different SCOP and SCOR categories (Tables 1 and 2), suggesting that the results reported here will be widely applicable. An important caveat, however, is that RB90 necessarily includes only RNA-protein complexes that have been amenable to X-ray crystallographic analysis and thus may be biased against RNA-binding proteins in which conformational changes do occur in interfacial regions. Many RNA binding proteins contain interfacial regions that are disordered regions in their unbound forms [Dyson, (2011); Peng et al.,

(2014)], and RNA-protein complexes can be difficult to crystallize [Ke and Doudna, (2004);

Wu et al., (2005)]. The RB90 dataset does not include several well-known complexes in which induced fit is known to play a critical role in RNA-protein recognition (e.g., the HIV-1

Rev-RRE complex [Casu et al., (2013)]). Thus, a definitive answer regarding the importance of conformational changes in RNA-protein recognition will require structural characterization of a larger collection of diverse RNA-protein complexes. The current study focused on conformational changes in proteins upon RNA binding. In future work, it will be interesting to examine the corresponding conformational changes in the RNA partner upon

RNA-protein complex formation.

5.5. Acknowledgments

KS and CMM were supported by fellowship funds from the Office of Biotechnology, Iowa

State University. RRW, CMM and DD gratefully acknowledge support from the Presidential

124

Initiative for Interdisciplinary Research and the Center for Integrated Animal Genomics at

Iowa State University. RRW, VGH and DD also acknowledge support from NSF grants DBI-

0501678 and DGE 0504304, and NIH grant GM066387. KS and RLJ gratefully acknowledge support from NSF grant MCB-1021785. The work of VH was supported in part by the Edward Frymoyer Professorship in Information Sciences and Technology at

Pennsylvania State University.

125

CHAPTER 6. MOLECULAR MECHANISM OF DRUG EFFLUX BY ABGT

FAMILY OF MEMBRANE TRANSPORTERS

(Manuscript in Preparation)

Kannan Sankar, Sayane Shome, Edward W. Yu and Robert L. Jernigan

Abstract

Drug extrusion through efflux pumps is an important mechanism for many pathogenic bacterial strains to achieve multi-drug resistance (MDR). Understanding the molecular mechanisms of drug extrusion by multi-drug efflux pumps is important for the development of anti-resistance drugs. The AbgT family of transporters that is involved in the folic acid biosynthesis pathway constitutes one such efflux pump system. In addition to transport of the folic acid precursor p-amino benzoic acid (PABA), members of this family are involved in the efflux of several sulfa drugs, conferring drug resistance upon the bacterium. With structural information for two members of this family (YdaH and MtrF) becoming available recently, we investigate the molecular mechanism of transport of PABA as well as a sulfa drug (sulfamethazine) by these transporters with the use of steered molecular dynamics. Our analyses reveal the likely ligand migration pathways in these transporters and also identify key residues involved in the transport cycle. In addition, simulations using both PABA and sulfamethazine show how the protein is equipped to transport ligands of different shapes and sizes.

6.1. Introduction

According to the World Health Organization (WHO), multiple drug resistance (MDR) is one of the most significant threats to public health. MDR enables several pathogenic bacteria to resist large doses of several structurally diverse drug molecules, making it extremely

126 difficult to treat diseases. Some common mechanisms of drug resistance include target alteration, drug inactivation, decreased permeability to drug and drug efflux. Among these, resistance plasmids and drug efflux systems comprise important mechanisms of MDR

[Nikaido, (2009); Sun et al., (2014)]. In addition to their ability to recognize and export structurally diverse drugs, they can lower the concentrations of antibiotics in bacterial cells, thereby making it conducive for drug resistant mutants to develop and accumulate [Sun et al.,

(2014)]. It has also been suggested that these drug efflux pumps not only help in conferring resistance, but also play physiologically important roles in bacteria [Piddock, (2006)].

One such efflux pump system is formed by the AbgT family of transporters [Prakash et al., (2003)], which includes members that play an important role in the pathway for the synthesis of folic acid (vitamin B9). Most plants and bacteria synthesize folic acid using the folate biosynthetic pathway that is absent in mammals [Hanson and Gregory, (2011)]. As a result, this pathway has attracted considerable attention as a drug target [Bermingham and

Derrick, (2002); Bourne, (2014)], with sulfonamides being the most popular example

[Woods, (1940)]. E. coli AbgT, the representative member of the family was shown to play a role in the uptake of the catabolite p-amino-benzoyl-glutamate (PABG) which is necessary for the synthesis of folic acid [Carter et al., (2007); Hussein et al., (1998)]. Interestingly, the only other member of the family which has been at least partially characterized, MtrF from

Neisseria gonorrhoeae was shown to play a role in the development of a high level of antibiotic resistance [Folster and Shafer, (2005); Veal and Shafer, (2003)].

Structural information on members of this family first became available in 2015 from the crystal structures of N. gononrrhoeae MtrF [Su et al., (2015)] and Alcanivorax borkumensis

YdaH [Bolla et al., (2015)]. Both of these proteins are located on the inner membrane and

127 were found to have a unique architecture with some overall resemblance to the dicarboxylate transporter VcINDY from Vibrio cholerae [Mancusso et al., (2012); Vergara-Jaque et al.,

(2015)]. These proteins are bowl-shaped dimers containing a solvent-filled basin that extends half-way into the inner membrane on the intracellular (IC) side, a possible site for substrate binding. They have nine transmembrane helices numbered TMs 1-9 and two helical hairpins

HPs 1-2. TMs 3 and 8 are each broken into segments a and b; and TMs 2 and 7 into segments a, b and c within the membrane and connected by intramembrane loops that lie close to the

HPs, forming an internal cavity. In addition, a small periplasmic domain is formed by the loops between TM1-TM2 (which contains an antiparallel β-sheet) and between TM5-TM6.

The dimeric structure can be thought of as being composed of two cores, an inner core consisting of TMs 1, 2, 5, 6 and 7; and an outer core consisting of TMs 3, 4, 8 and 9 as well as HPs 1-2. The dimerization interface is composed of TM2 and TM7 as well as the N- terminal ends of TM1 and TM6 (Figure 6.1).

Biochemical assays on YdaH and MtrF by the Yu Lab at ISU showed that both these transporters can export p-amino-benzoic acid (PABA) and decrease folic acid concentrations inside cells. Computational analysis revealed a tunnel through which the ligand can migrate; and this tunnel is lined by conserved residues D180, W400, P418 and R426 (D193, S417,

W420, P438, R446, D449, and P457) in YdaH (MtrF) [Bolla et al., (2015); Su et al., (2015)].

They also showed that mutation of these residues (except R426 in YdaH and S417 and R446 in MtrF) into Ala significantly reduced the ability of the cells to export PABA out of the cell.

In addition, residues N390, D429 and N433, which were involved in coordinating a Na ion in

YdaH, were also identified to be important for the transport mechanism. Similar analysis using wild-type and mutant proteins showed that both YdaH and MtrF could export

128 sulfonamide drugs (structurally similar to PABA) out of the cell, confirming their role as multidrug efflux pumps. Additionally, it was also shown that the MtrF transporter is proton motive force (PMF) dependent; that is, it uses differences in concentrations of protons between the periplasmic and cytoplasmic sides to export drugs out of the cell.

The crystal structures provide snapshots of the transporters in an inward-open state, i.e. the transporter is open to the periplasmic side and in a conformation suitable for binding a substrate in the periplasm. However, several membrane transporters are known to exhibit large conformational changes to provide alternate access to either side of the membrane, for which two different mechanisms have been proposed: the ‘rocking-bundle’ mechanism

[Forrest and Rudnick, (2009)] and the more recently proposed ‘elevator’ mechanism

[Mulligan et al., (2016)]. However, obtaining a more comprehensive picture of the transport process requires simulations to map the channel(s) through which the molecule is transported; by fully accounting for the conformational changes within the protein and the interactions between the small molecule and key residues of the protein.

Computational methods such as molecular dynamics (MD) provide a direct means to study this process. However, since the timescale of these membrane transport processes is beyond the reach of usual MD, methods are needed to accelerate this process. One such approach is steered molecular (SMD) [Izrailev et al., (1997)], in which the substrate is pulled at a constant velocity (or constant force) to drive it in a specific direction (e.g. from the inside to the outside of the cell). SMD has been widely employed to study ligand binding and transport for a variety of transporter proteins such as the water channel aquaporin [Ilan et al.,

(2004); Jensen et al., (2002); Wang et al., (2005)], leucine transporter LeuT [Celik et al.,

129

(2008)], lactose transporter LacY [Jensen et al., (2007)] and the ADP/ATP carrier (AAC) protein [Wang and Tajkhorshid, (2008)].

In this work, we have performed simulations using cv-SMD (constant velocity-steered molecular dynamics) to unravel several aspects of the transport mechanism in YdaH. Our study sheds light on the various migration channels present in the transporter; how the transporter accommodates substrates of different sizes (e.g. smaller PABA and larger sulfamethazine) and the role of specific interactions between the ligand and residues lining the passage during the transport cycle. These results can provide important insights into the design of anti-resistant drugs for treatment of a range of diseases including meningitis, pneumonia and gonorrhea.

6.2. Methods

6.2.1. Structure preparation

The structures of YdaH and MtrF were solved experimentally with X-ray crystallography in the Yu Laboratory and are available in the protein data bank (PDB) with the accession codes

4R0C and 4R1I respectively. The structures were first processed to remove all hetero-atoms and retain only the coordinates of the protein.

6.2.2. Analysis of pathways accessible through elastic network normal modes

The anisotropic network model (ANM) is an elastic network model [Atilgan et al., (2001); I

Bahar et al., (1997); Tirion, (1996)] which assumes that fluctuations of residues around their equilibrium points are anisotropic (represented by ellipsoids) and is used to calculate the normal modes from a given structure. In ANM, the potential energy is modeled as a simple

130

Hookean potential and is proportional to the sum of squares of the displacement vectors.

Parts of the structure that are within a certain distance cutoff Rc (here, chosen as 13 Å) are connected by springs with a certain stiffness γ (can be constant or distance dependent). An eigen-mode decomposition of the Hessian matrix (double derivative of the potentials) of the system is performed to extract the modes of motion of a protein as described in Atilgan et al

[Atilgan et al., (2001)]. A set of snapshots of the protein generated by deforming the structures along the lowest frequency normal modes is used to investigate the presence of channels using the program CAVER [Chovancova et al., (2012)].

6.2.3. Computational docking of ligands onto transporters

For obtaining the starting pose of ligand for the SMD simulations, we use AutoDockTools

[Morris et al., (2009)]. AutoDock uses a semi-empirical free energy force field [Huey et al.,

(2007)] that considers both pair-wise interactions as well as the conformational entropy lost upon binding of the ligand to the protein. The protein is first prepared for docking by adding hydrogen atoms (usually not present in crystal structures), determining partial atomic charges and identifying the degrees of freedom in the ligand (PABA or sulfamethazine) by determining rotatable bonds and torsional degrees of freedom. Then the 3D search space for the protein is specified using a grid box and the energy of interactions for each ligand atom type at each point in the grid is evaluated by using autogrid. Following this, docking is carried out (using autodock) with default parameters using the Lamarckian Genetic

Algorithm (LGA) which is one of the most efficient search methods [Morris et al., (1998)].

The top 10 docked poses are analyzed and the lowest energy conformation is chosen as the starting point for the simulations.

131

6.2.4. Steered molecular dynamics

A 90 X 90 Å2 POPE (1-Palmitoyl-2-oleoyl-sn-glycero-3-phosphoethanolamine) membrane bilayer is constructed using the ‘Membrane Builder’ plugin in VMD [Humphrey et al.,

(1996)] with the membrane normal to the z-axis. The protein along with docked PABA is inserted into the center of the membrane, with the correct orientation and position as predicted using the PPM (Positioning of Proteins in Membrane) server [Lomize et al.,

(2012)]. The system is solvated with the addition of 5 Å on either side of the bilayer and charge balance is ensured by adding Na+ and Cl- ions. After the system is first minimized for

100,000 steps, the system is equilibrated by starting at 10 K and raising the temperature slowly to 310 K in steps of 1 K with 1 ps runs at each step with a time-step of 1 fs. A harmonic restraint of 1 kcal/mol Å2 is applied to the protein during the temperature increase; following which a 100ps equilibration run is performed at 310K at constant temperature and pressure (NPT) with all atoms allowed to move freely. Langevin dynamics (damping coefficient of 1 ps-1) is used to maintain the temperature constant. A cutoff distance of 12 Å is used for van der Waals interactions, with the pair distance list extended to 14 Å. A periodic boundary condition is imposed with long-range full electrostatic interactions evaluated using the Particle Mesh Ewald method with a grid spacing of 1 Å. For all simulations, NAMD

[Phillips et al., (2005)] with the CHARMM27 parameter set [MacKerell et al., (1998);

Schlenkrich et al., (1996)] is used. The topology and parameter files for PABA and sulfamethazine are generated using the SwissParam server [Zoete et al., (2011)].

Then constant velocity steered MD (cv-SMD) is applied, specifically with the C6-atom of PABA/sulfamethazine being pulled in the positive z-direction (directly from cytoplasmic to periplasmic side) by using a harmonic constraint moving at a velocity of 10 Å/ns and with

132 a force constant (푘) of 5 kcal mol-1Å-2. To prevent the protein from drifting during the initial stages, specific residues at the periphery of the intracellular side of the protein (R112, G116,

A151, L232, S250, S314 and S320) were restrained with a force constant of 5 kcal mol-1Å-2 in the z-direction.

6.2.5. Analysis of mean square fluctuations (MSFs) of the protomer in monomeric vs. oligomeric state

For comparing the MSFs of monomer versus oligomer, we have used the Gaussian Network

Model (GNM) [I Bahar et al., (1997)] which assumes that fluctuations of residues around their equilibrium points are isotropic (and normally distributed) and is used to calculate the normal modes from a given structure. In GNM, the residues are modeled using only their Cα atoms, and those that are within a certain distance cutoff Rc (here, chosen to be 7 Å) are connected by springs with a certain stiffness γ (taken as 1). The contact topology of the protein is captured in the Kirchoff matrix Γ (details in reference [I Bahar et al., (1997)]). The

MSFs of residues are obtained directly from the diagonal elements of Γ-1, the inverse of the

Kirchoff matrix. The MSFs from the GNMs of the monomer in isolation and in the oligomeric assembly are computed and converted to Z-scores for comparison. Then, the fold- change in MSFs is calculated as:-

푍 퐹퐶 = 표푙푖푔표푚푒푟 , (6.1) 푍푚표푛표푚푒푟

where 푍표푙푖푔표푚푒푟 and 푍푚표푛표푚푒푟 represent the MSF Z-scores of the monomer in the oligomeric complex and in isolation respectively.

133

6.3. Results and Discussion

6.3.1. Identification of ligand migration channels

Snapshots of the transporter were obtained by deforming the structure along the first ten

ANM modes (see details in the Methods section). The snapshots were then used to identify the possible dynamic tunnels in the protein using the software CAVER [Chovancova et al.,

(2012)]. Two possible tunnels (Fig. 6.1) are identified from the dynamics of the protein provided by the third elastic network normal mode (see details in Methods). These would be some of the most entropically favored pathways through the structure. One of these pathways

(green) requires the salt bridge between residues D180 and R426 to be broken to permit the ligand to permeate, suggesting a possible gate-keeping activity for these two residues.

However, the salt-bridge is not broken in the simulations with either PABA or sulfamethazine, suggesting that the green channel is an unlikely ligand migration path.

Figure 6.1. Ligand Migration in YdaH (a) Two possible ligand migration pathways (green and blue) are indicated from the natural fluctuations of the structure, as computed with CAVER. The two helix hairpins HP1 and HP2, near the ligand-binding cavity are shown in orange. (b) Close view of the important residues lining the (blue) channel 2 which are identified in this study to interact with the ligands.

134

6.3.2. Steered molecular dynamics of PABA and sulfamethazine in YdaH

SMD simulations were performed on the transporters with two different substrates (PABA and sulfamethazine, in multiple replicates) using the protocol described in the Methods section. The ligands were docked onto the transporter using AutoDock [Morris et al., (2009)] and embedded onto POPC membrane bilayers, solvated and ionized. The substrate

(PABA/sulfamethazine) was pulled from the IC to the periplasmic side in a direction perpendicular to the membrane plane at constant velocity using a harmonic constraint moving at a velocity of 10 Å/ns. The applied force necessary to maintain constant velocity was measured simultaneously. Strong stabilizing effects due to interactions from residues lining the channel will be reflected as spikes in applied force. In this manner, residues that interact with the ligand during the transport cycle can be identified.

Figure 6.2. Variation of applied force with ligand position. Shown is the applied force along the channel position in the z-direction (centered at the center of mass of the transporter) for the simulations using PABA (blue curve) and sulfamethazine (red curve). Peaks in force correspond to stabilizing interactions with residues lining the channel which are labeled in the respective colors.

135

SMD simulations reveal the permeation pathway (similar to the channel 1 identified using CAVER as the well as the one identified by Bolla et al [Bolla et al., (2015)]) along which PABA migrates to the periplasmic side (Fig. 6.1a). Conserved residues lining this channel are shown in Fig. 6.1b. This simulated pathway also suggests that the two helix hairpins (shown in orange in Fig. 6.1a) together guide the substrate (PABA) through the channel to the periplasmic side.

a b

c d

Figure 6.3. Stabilizing interactions of PABA with residues lining the YdaH channel (a) H-bond of amino hydrogen of PABA with OG atom of S397 (b) Stacked ring interactions PABA and W400 (c) H-bonds of amino hydrogens H of PABA with OE1 atom of Q422 and OG atom of S187 (d) C-H…π interactions between the CH2 atoms of P418 and aromatic ring of PABA.

Figure 6.2 shows the applied force plotted against channel position (shown as distance from center of mass of transporter along z-direction in which force is applied). In the

136 simulations with PABA as ligand, a spike in force is obtained at 9.5 Å (which includes interaction with S397), 13 Å (W400) and 20.5 Å (Q422, S187, P418) from the center of the protein suggesting strong interactions of PABA with these residues lining the channel. In addition, analysis of structures corresponding to these time points showed that the side- chains of S187, S397, and Q422 were involved in H-bonds with PABA during its transport

(Figure 6.3b-c). Along this pathway, the aromatic ring of residue W400 (part of the hairpin loop) is clearly seen to form a particularly well stacked configuration (Fig. 6.3a) with the benzene ring of PABA. CH2 atoms of the side-chain ring of P418 formed C-H…π interactions with the aromatic ring of PABA with the hydrogen atom at a distance of 3.9 Å from the center of the aromatic ring of PABA (Fig. 6.3d). These interactions have been reported to be common in protein structures, with 3.5 to 4 Å being the most favorable distance [Bhattacharyya and Chakrabarti, (2003); Brandl et al., (2001)]. Interestingly, S187,

S397, W400, P418 and Q422 are highly conserved in the AbgT family (refer to the multiple sequence alignment in Supplementary Figure 1 in reference [Bolla et al., (2015)]. While the backbones of V179 and P178 also formed hydrogen bonds with PABA; these residues are not conserved in the AbgT family.

On the other hand; the salt bridge between D180 and R426 remains relatively unchanged during the course of the simulation. Our simulations suggest that D180 and R426 are likely responsible for maintaining the shape of the channel. R426A mutant was previously shown to accelerate the export of PABA, which is possibly because in the absence of salt-bridge with

D180, the channel may remain permanently open. It is also worth pointing out that the mutation of R426 to Ala was shown not to significantly affect the export of sulfamethazine

[Bolla et al., (2015)].

137

a b

c d

Figure 6.4. Stabilizing interactions of sulfamethazine with residues lining the YdaH channel (a) H-bond of amino hydrogen of sulfamethazine (sulf) with OH atom of Y425 and OD atom of D429 (b) H-bond between OG atom of S430 and amino hydrogen of sulfamethazine; stacked ring interactions PABA and W400 (c) H-bonds of amino hydrogens H of PABA with HG1 atom of S187 (d) C-H…π interactions between the CH2 atoms of P418 and aromatic ring of sulfamethazine.

Simulations with sulfamethazine showed force peaks at 0.3 Å (Y425 and D429), 2.5 Å

(W400 and S430), 12 Å (S187) and 23 Å (P418) suggesting interactions between the drug and these residues lining the channel. Stacked interactions of the benzene ring of sulfamethazine with W400 and C-H…π interactions of CH2 atoms of P418 with the aromatic ring of sulfamethazine are observed (as in the case of PABA). Interestingly, all of these residues are also highly conserved and mutation studies showed them to be important to the

138 functional mechanism of the transporter [Bolla et al., (2015)]. Thus the computations confirm the importance of these conserved residues.

Figure 6.4. Root mean square fluctuations (RMSFs) of residues during the simulations. RMSFs of each residue in YdaH averaged over all frames in the trajectories of SMDs with PABA (blue) and sulfamethazine (red). The secondary structure elements are mapped onto the sequence as green boxes below the plot. The hairpin HP2 (residues 375-414), TM8 and TM9 are more mobile in simulations with sulfamethazine, suggesting that larger movement of these regions in the outer core are necessary for accommodating the larger ligand within the channel.

One interesting result from the analysis of the simulations is that with sulfamethazine

(larger ligand), a noticeable separation of the helix hairpin HP2b from TM3b was observed, with sulfamethazine directly interacting with several residues on HP2b. This is also reflected in the higher RMSF values for residues in the HP2b hairpin as shown in Fig. 6.3. It is worth pointing out that residues on HP2b constitute a substrate-binding motif in the AbgT family of transporters [Vergara-Jaque et al., (2015)] and hence these simulations provide direct evidence for the importance of the HP2b motif in the functional mechanism of YdaH. In

139 addition, the exact nature of interactions between this motif and the drug can be understood from the SMD trajectories.

6.3.3. Analysis of the effect of dimerization on dynamics

Many membrane transporters exist as oligomers; for example, leucine transporter LeuT is a dimer [Celik et al., (2008)], glutamate transporter is a trimer [Yernool et al., (2004)], etc.

When separate channels for ligand migration are present in each protomer, as in the case for the AbgT family, the exact role of dimerization in ligand transport is not clear. In addition, there is also the issue of cooperativity, i.e. whether both protomers can transport simultaneously or whether they act in a synchronized alternating fashion. Recently, we have compared the mean square fluctuations (MSFs) in the oligomeric and isolated monomeric forms of 146 different homo-oligomeric proteins using coarse-grained elastic network models [Atilgan et al., (2001); Bahar et al., (2010); I Bahar et al., (1997); Tirion, (1996)] and found that functionally important residues have a remarkable tendency to be reduced in MSF

(that is, stabilized) in the oligomer when compared to the monomeric form.

We have analyzed the fold change in MSFs of residues in YdaH and MtrF upon oligomerization using the Gaussian Network Model (GNM) [I Bahar et al., (1997)] and found that all four (P418, W400, R426, D180) conserved residues in YdaH and 5 out of 7

(W420, R446, D193 S417, D449) conserved residues in MtrF undergo at least a 1.5-fold reduction in MSF upon oligomerization. All the residues (other than those on the dimerization interface) that undergo reduced fluctuations upon oligomerization are found to line the central channel. This strongly suggests that dimerization plays a role in stabilizing the residues lining the transport channel, thereby facilitating the migration of the ligand.

140

6.4. Conclusions

Multidrug resistance has become one of the biggest threats to public health in recent years.

This makes it necessary to accelerate the pipeline of new drug target development for many diseases. One of the ways in which several bacteria are able to develop multidrug resistance is through the use of efflux pumps which are polysubstrate specific; that is they can recognize and pump out a wide range of structurally different drug molecules. In this work we have studied the molecular mechanism of drug efflux in an important category of drug efflux transporters, the AbgT family. This has been made possible through the experimental determination of the crystal structures of two members of the family, YdaH and MtrF.

In this work, we have analyzed the dynamics of two substrates, PABA and sulfamethazine through the YdaH transporter. Our analysis reveals important interactions of the substrates with residues lining the channel, which could be targeted in a drug discovery process to abrogate the functioning of this channel. By understanding the interactions between the ligands and the residues in the channel, one can generate substrate analogues to block the transporter and prevent it from exporting drugs out of the cell. By using SMD simulations, we are able to provide detailed information about various interactions between each ligand and the transporter. In addition, we are also able to understand how the transporter is able to accommodate substrates of different sizes. We see that in the case of larger ligands like sulfamethazine, there is a conformational change resulting in the separation of two helices, which merges two channels inside the protein, enabling the ligand to pass through. In addition, we also find that the residues lining the tunnels are stabilized by oligomerization, which makes the transport process more restrictively controlled.

141

Similar experiments with other sulfonamide drugs (e.g. sulfadiazine, sulfathiazole and sulfanilamide) can also be performed and analyzed to shed light on the molecular mechanism of drug efflux in bacteria. Interestingly, SMD simulations on ligand migration in MtrF with

PABA or sulfamethazine did not yield only a single channel as in YdaH. There are two possible reasons: (a) MtrF is relatively loosely packed compared with YdaH (the resolution of the MtrF structure is also not as high as YdaH), and (b) MtrF is PMF dependent. Hence, proton concentrations on either side of the membrane may have to be modeled explicitly; and such studies are currently in progress as part of the ongoing collaboration between the

Jernigan and Yu Labs at Iowa State University.

6.5. Acknowledgements

This work was supported by the NIH grant R01AI114629.

142

CHAPTER 7. CONCLUSION

The research presented in this dissertation has utilized the remarkable data available in the PDB to develop knowledge-based approaches for evaluating protein structure and dynamics. We have developed innovative approaches to utilize multiple structures of the same protein in the PDB to develop improved elastic network models, a novel approach for evaluating local entropies of protein structures and also a method for constructing free energy landscapes inclusive of protein conformational changes. In addition, we have also used hybrid approaches combining experimental and computational studies to understand the functional dynamics of proteins like cadherins, multi-drug efflux transporters and carbon concentrating proteins. In this chapter, I shall recapitulate some of the important findings and contributions of this dissertation as well as provide suggestions for future research in the field.

7.1. Specific Findings and Contributions

(a) Construction of coarse-grained free energy landscapes of protein conformational changes

We have developed a new method for constructing coarse-grained free energy landscapes for conformational changes in proteins by combining principal motions from sets of structures with free energies obtained with knowledge-based potentials and entropies from elastic network models. This is different from other approaches that rely on molecular simulations to sample structures and compute free energies based on probability densities of observed structures. This latter approach suffers the drawback of inadequate sampling. We find that consideration of entropy is important to explain the conformational changes of most proteins.

143

Further, the landscapes provide direct evidence of whether conformational selection or induced fit is operative in each protein. For most proteins, a narrow low energy path is visible between distinct conformational clusters, suggesting transition paths. However in some cases such as ATP hydrolysis driven proteins like SERCA, there is a huge barrier which needs to be overcome through fit of a ligand.

(b) Development of a novel knowledge-based local entropy function for identification of native protein structures

By using the inverse Boltzmann approach, we have developed a novel knowledge-based entropy function for proteins based on observed frequencies of contact changes between alternate conformations over a set of proteins. When these entropies are combined with knowledge based potentials, they almost double the performance of the potentials over the use of energy alone, in terms of the ability to identify native structures of proteins from decoys. This clearly shows the importance of entropy considerations for proteins and highlights an important drawback of most computational approaches to structure and dynamics prediction. In addition, our novel free energy functions perform nearly as well as all atom statistical potentials signifying the power of coarse-grained models.

(c) Understanding the molecular mechanism of conformational changes in proteins using a combination of computational and experimental approaches

Through collaborations with experimental research groups at Iowa State University, we have analyzed the molecular mechanism of conformational changes in several proteins (cadherins, multidrug efflux transporters and carbon concentrating protein Lci1). In the case of

144 cadherins, the structures sampled through all-atom molecular dynamics avoided the high energy regions in the energy landscape predicted by coarse-grained knowledge-based potentials, highlighting the use of coarse-grained models for large-scale conformational changes. Steered molecular dynamics of both multidrug efflux proteins and Lci1 revealed the ligand migration pathways which were in remarkable agreement with experimentally identified residues important for the transport mechanism. Molecular simulations and computational evaluations can serve as a useful method to understand the molecular mechanisms of experimental observables. These studies demonstrate the importance of hybrid approaches combining computational and experimental methods to address complex questions in structural biology.

7.2. Future Directions

The work undertaken in this dissertation has provided a novel approach for evaluating entropies of protein structures and a new method for computing free energy landscapes for protein conformational changes. These methods have numerous applications, a few of which

I discuss below:

(a) Prediction of transition pathways between alternate conformations of proteins

One of the main results from our coarse-grained free energy landscapes is that experimental structures often fall into low energy basins, which are connected to each other along low energy paths. These landscapes can guide structural interpolation methods to directly generate transition pathways between the conformations corresponding to clusters of a protein.

145

(b) Use of knowledge-based free energies to improve protein structure prediction from residue coevolution data

With the advent of NGS technology, genomes of many organisms are being sequenced rapidly, resulting in huge bodies of gene and protein sequence data. Tens of thousands of sequences of the same protein are often available, and this data is being used to predict coevolving residues using multiple sequence alignments [de Juan et al., (2013); Hopf et al.,

(2014); Marks et al., (2012), (2011); Sutto et al., (2015)]. It has been observed that strongly coevolving residues often tend to be spatially contacting in the 3D structure and hence this information is being used in ab initio protein structure prediction. One of the biggest bottlenecks for such approaches is the lack of a reliable method for evaluating or scoring contact predictions. Our newly developed knowledge-based free energies (KBF) can be directly applied together with these contact predictions (without requiring a 3D structure), providing an easy way to rank possible predictions and ultimately to identify the native structure.

(c) Identification of deleterious vs. neutral mutations using knowledge-based free energies

The surge in sequencing of multiple individuals has ushered in the hope of the era of precision/personalized medicine. One of the bottlenecks in making use of personal genomics a reality is the ability to definitively predict the phenotypic effect of sequence variations. The primary task of personal genomics is to identify whether the variant of a gene in an individual is a ‘deleterious mutant’ (resulting in a disease phenotype) or a ‘neutral mutant’

(no effect on phenotype). Effects of single mutations on overall protein stability (traditionally measured as the relative change in free energy upon mutation, ΔΔG) has been widely studied

146 and correlated to the rate and dynamics of protein evolution [Tokuriki et al., (2007)]. Our newly developed KBFs capture both energetic and entropic aspects of protein structures and hence are well suited for capturing the effects of mutants on protein stability. Simple classifier models such as regression methods can be trained and tested on data for which the deleterious or neutral effect of mutations is known. This is a highly promising extension of the work presented in this dissertation and is currently in progress.

(d) Development of knowledge-based free energies for identification of native protein- protein interaction poses

Protein-protein docking is one of the most challenging problems in computational structural biology and is still in a stage of development. This is primarily due to (i) the difficulties of accounting for protein flexibility and (ii) the lack of a good scoring function that correctly identifies the native poses of protein-protein interactions. While there have been successes from incorporating protein flexibility by using elastic network models [De Vries et al.,

(2015); Zimmermann et al., (2012)], free energy functions for protein-protein interactions have been less successful. Similar to the way entropies were extracted from contact changes between alternate conformations of proteins; entropies for protein–protein interactions can be computed using the residue interactions of the proteins in the complex vs isolated form. This information can then be combined with potential energies specific for protein-protein interactions [Keskin et al., (1998)] to develop free energy functions for evaluating protein- protein docked poses.

147

APPENDIX A. THE USE OF EXPERIMENTAL STRUCTURES TO

MODEL PROTEIN DYNAMICS

Expanded version appeared originally as a book chapter in Methods Mol. Biol. (2015)

Vol 1215: 213-236.

Ataur R. Katebi, Kannan Sankar, Kejue Jia, and Robert L. Jernigan

Summary

The number of solved protein structures submitted in the Protein Data Bank (PDB) has increased dramatically in recent years. For some specific proteins, this number is very high – for example, there are over 550 solved structures for HIV-1 protease, one protein that is essential for the life cycle of human immunodeficiency virus (HIV), which causes acquired immunodeficiency syndrome (AIDS) in humans. The large number of structures for the same protein and its variants actually present a sample of different conformational states of the protein. A rich set of structures solved experimentally for the same protein can has information buried within the dataset that can explain the functional dynamics and structural mechanism of the protein. This chapter focuses on two methods – Principal Component

Analysis (PCA) and Elastic Network Models (ENM). PCA is a widely used statistical dimensionality reduction technique to classify and visualize high-dimensional data. On the other hand, ENMs are well established simple biophysical method for modeling the functionally important global motions of proteins. This chapter covers the theoretical framework of these two methods. Moreover, an improved ENM version that utilizes the variations in a given set of structures for a protein is described. As a practical example, we have extracted the functional dynamics and mechanism of HIV-1 protease dimeric structure

148 by using a set of 329 PDB structures of this protein. We have described step by step – how to select a set of protein structures, how to extract the needed information from the PDB files for PCA, how to extract the dynamics information using PCA, how to calculate ENM modes, how to measure the congruency between the dynamics computed from the principal components (PCs) and the ENM modes, and how to compute entropy using the PCs. We provide the computer programs or references to software tools to accomplish each step and show how to use these programs and tools. We also include computer programs to generate movies based on PCs and ENM modes and describe ways to visualize them.

1. Introduction

There are large numbers of structures in the protein data bank (PDB) [Berman et al., (2000)] for many categories of enzymes. Shown in Fig. A.1 are the most abundant enzyme structures ordered by enzyme commission (EC) numbers. Some other examples for individual EC categories, with the numbers of their related structures in parentheses are: 3.4.21: serine endopeptidases (2,459), 3.4.23: aspartic endopeptidases (1,146), 3.4.24: metalloendopeptidases (727), 3.4.22: cysteine endopeptidases (720), 3.4.11: aminopeptidases

(292), 3.4.19: omega peptidases (244), 3.4.17: metallocarboxypeptidases (144), 3.4.14: dipeptidyl-peptidases (120), 3.4.25: threonine endopeptidases (109), 2.7.7, nucleotidyltransferases (107) and 3.4.16: Serine-type carboxypeptidases (97). In addition, there are many structures of non-enzyme proteins – structural proteins, immunoglobulin

Fab’s, viral proteins, and many others. The PDB has many additional ways to search for functionally related structures that are invaluable for finding structures with similar dynamics. You can search by biological process such as gene ontology (GO), cellular

149 component, molecular function and transporter classification. In addition there are many receptors with multiple reported structures. Overall, there is abundant data to investigate functional protein dynamics of many classes of proteins directly from experimental structures.

Figure A.1. Numbers of related protein structures available for extracting protein functional dynamics – snapshot of the PDB statistics for the largest categories of enzymes (as of 08/30/2013). In total there are over 17,000 enzyme structures, and a significant number of structures for many diverse enzyme types. The most common structure on the left of this histogram with 1,285 structures is EC 3.2.1.17 that includes lysozymes, and at the right side is 5.2.1.8 acetylcholinesterases with 337 different structures (taken from enzyme classification histogram in PDB) [Berman et al., (2000)].

Important conformational changes can readily be extracted from a set of PDB structures for a protein and these are found to relate directly to function. Experimental structures can be a rich source of information. It is well established that functionally related structures must have similar structures and similar dynamics – building on the broad experience of many researchers. There have been several efforts at extracting dynamics from specific sets of

150 experimental structures. One approach is principal component analysis (PCA) [Hotelling,

(1933); Manly, (1986); Pearson, (1901)], a statistical method based on covariance analysis.

PCA can transform the original space of correlated variables into a greatly reduced space of independent variables (i.e., the principal components or PCs). By performing PCA, most of a system’s variance will usually be captured in a quite small subset of the PCs. PCA has been applied often to analyze trajectory data from MD simulations to find the essential dynamics

[Amadei et al., (1996), (1993)]. Teodoro et al. applied PCA to the dataset composed of many conformations for HIV-1 protease [Teodoro et al., (2003), (2002)]. They found that PCA transformed the original high-dimensional representation of protein motions into a low- dimensional one that provides the dominant protein motions. This is a huge reduction in dimensionality from hundreds of thousands to fewer than fifty degrees of freedom. Howe

[Howe, (2001)] used PCA to classify the structures in NMR ensembles automatically, according to correlated structural variations, and the results have shown that two different representations of the protein structure, the C coordinate matrix and the C-C distance matrix, gave equivalent results and permitted the identification of structural differences between conformations. More recent efforts include our own previous efforts in analyzing the HIV-1 protease set [Yang et al., (2008)], those of the Bahar group [L. W. Yang et al.,

(2009)], and our efforts in developing the MAVEN program [Zimmermann et al., (2011a)], as well as related efforts by the Bahar group with their ProDy [Bakan et al., (2011)], and

Grant with his Bio-3D [Grant et al., (2006)]. Any of these can provide a similar set of starting tools.

On the other hand, elastic network models (ENM) have proven themselves to be highly useful in representing the global motions for a wide variety of diverse protein structures

151

[Atilgan et al., (2001); Bahar and Jernigan, (1997), (1994); I Bahar et al., (1997); I. Bahar et al., (1997a)]. For modeling and simulating the dynamics of proteins, ENMs are applied multiple scales [Bahar and Rader, (2005); Bahar, (2010); Chennubhotla et al., (2005);

Jernigan et al., (2009)]. All atom ENM models give a finer description of the proteins.

However, coarse-graining is possible in different granularity and can do as good as the all atom models when the global functional motions for the proteins are calculated. The most common coarse-graining involves a single-site per residue representation, in which the sites are identified by the Cα atoms and connected by uniform springs. The dynamics of such interconnected model can be described by the Gaussian Network Model (GNM) [I Bahar et al., (1997)] or an Anisotropic Network Model (ANM) [Atilgan et al., (2001)]. GNM has been very successful to provide information on the magnitudes of the fluctuations of the protein structures and no directional information or the 3-D nature of motion of the protein is considered in the model. However, in reality protein fluctuations are generally anisotropic

[Ichiye and Karplus, (1991); Kuriyan et al., (1986)]. ANM considers the anisotropy of the protein structure in modeling its dynamics and thus ANM computed collective motions are more relevant to biological function and mechanism of the protein molecule.

In this chapter, we give an example of how to use computational methods to extract protein dynamics from a large set of experimental structures of HIV-1 protease. Behind this is the implicit assumption that there is a significant amount of information about protein dynamics, mechanisms and allostery buried within the structures in the PDB. We will show how to utilize PCA to extract dynamics from the abundantly available HIV-1 protease structures and how to compute the agreement between PCA-based protein motion and the

152

ANM modeled motion, and describe how these could be used in simulations with a new structure-based elastic network model.

A.2. Theory

A.2.1 Principal component analysis (PCA)

PCA is a multivariate technique to analyze a dataset where the observations are described quantitatively by a set of inter-correlated variables. The goals of PCA are to (i) extract the most important information from the data; (ii) remove noise and compress the data set by keeping only the important information; (iii) simplify the description of the data set; and (iv) analyze the structure of the observations and the variables. This method generates a set of new orthogonal variables called principal components (PCs). Each PC is a linear combination of the original variables. Hence, PCA can be considered as a mapping of the data points from the original variable space to the PC space. PCs are rank ordered in such a way that PC1 represents the maximum variance among all possible choices for the first axis.

Similarly, PC2 represents the second highest variance contribution, and so forth through all the modes. Usually only a few PCs are sufficient to understand the internal structure of the data [Abdi and Williams, (2010)].

For extracting functional dynamics from the PDB experimental structures, PCA is performed on the structure datasets. The input is the set of coordinates of all of the structures in the set [Teodoro et al., (2003), (2002)]. From these data, the average position of each point in the structure is computed as and the covariances for pairs of points i and j are computed according to

푐푖푗 = 〈(푟푖 − 〈푟푖〉)(푟푗 − 〈푟푗〉)〉 (A.1)

153

where brackets <> indicate averages over the entire set of structures. The covariance matrix C can be decomposed as

C  P  P T , (A.2)

where the eigenvectors P represent the principal components (PCs) and the eigenvalues are the elements of the diagonal matrix . The eigenvalues are sorted in order. Each eigenvalue is directly proportional to the amount of the variance it captures.

A.2.2 Elastic network model (ENM)

Anisotropic Network Model (ANM) is an elastic network model used to compute the directions of the normal modes from a single structure [Atilgan et al., (2001)]. In ANM, the potential energy V is a function of the displacement vector D of each point in the structure

 V  DHD T , 2 (A.3)

where  is the spring constant for all closely interacting points in a structure, and H is the

Hessian matrix containing the second derivatives of the energy, with respect to each of the coordinates r = . For a structure with n residues, the Hessian matrix H contains n  n super-elements of size 3  3. The Hessian matrix H can be decomposed [Atilgan et al.,

(2001); Teodoro et al., (2003), (2002)] as

H  M  M T , (A.4)

where  is a diagonal matrix comprising the eigenvalues with the eigenvectors forming the columns of the matrix M. This decomposition generates 3n-6 normal modes (the first 6 modes account for the rigid body translations and rotations of the system and must be factored out, meaning that we actually perform singular value decomposition to extract the normal modes) reflecting the vibrational fluctuations. We like to further mention that for

154

ANM coarse graining, it is shown that a cutoff distance of any value from 10Å to 13 Å is appropriate for placing the springs and such an ANM model represents the realistic protein dynamics. In this chapter, we use a cutoff distance of 13 Å.

A.2.3 Structure-based new ANM

The internal distance changes in a set of structures can provide information that can be used directly to derive new structure-based elastic network models. We have extracted spring constants between all residue pairs in a set of structures by simply relating these to the inverse of the variance of internal distance changes between pairs of residues, as the spring stiffness (normalized between 0 and 1). We have applied a cutoff of 13Å to limit the range of interactions. However the difference between the conventional ANM described in the previous section and this modified ANM is that here the values for the spring constants are obtained directly from the structure set rather than using a uniform value or distance dependent values, as is customary with ENM.

A.2.4 Comparing directions of motions using overlaps

The alignment between the directions of motion, for example between a given PC and a given normal mode, is measured by their overlap, which was defined as the dot product of the two vector directions by Tama and Sanejouand [Tama and Sanejouand, (2001)]:

PMij O ij  , PM ij (A.5)

th th where Pi is the i PC for model P and Mj is the j PC or normal mode for model M. A perfect match yields an overlap value of 1. They also defined the cumulative overlap (CO) between the first k vectors of M and Pi as

1 푘 2 2 퐶푂(푘) = (∑푗=1 푂푖푗) (A.6)

155

which measures how well the first k PCs for model M together can capture the motion of a single PC for model P.

A.2.5 Coarse-grained global entropies calculated from principal component analysis

As covariance matrix can be decomposed as in Eq. A.2 of section A.2.1, an approximation of the entropy from the PCs can be obtained as well [Yang et al., (2008),

(2007)]:

푁 푇 훥푆 = 푎 ∗ ∑푖=1 휆푖(푃퐶푖푃퐶푖 ) (A.7)

th th where PCi is the i PC, and λi is the i eigenvalue, N is the total number of eigenvalues and 푎 is a constant.

Andricioaei, I et al also reported a similar result for entropy calculation from the covariance matrices of the atomic fluctuations [Andricioaei and Karplus, (2001)].

A.3 Materials and Methods

There are a huge number of available HIV-1 protease structures in the PDB (564 X-ray and three NMR structures as of 07/26/2013), which provides a remarkably rich set of different conformational states, which can be viewed as direct structural information on the protein’s dynamics.

The approach described here computes the essential or most important protein motions from multiple structures of the same protein, in contrast to using just the two structures such as the “open” and “closed” conformations, which have often been used to define the endpoints of conformational transitions. To demonstrate this approach, we use HIV-1 protease as an example. Its abundant experimentally determined structures are complemented

156 by the relatively small size of the protein. In the next section, first, we will give a description of the structural components that are important to drive the motion of the HIV-1 protease structure. Then, we will describe the dataset of HIV-1 structures that we have used to perform our computations.

A.3.1 HIV-1 protease architecture

Figure A.2. Description of HIV-1 protease homo-dimer and its critical structural components that facilitate the functional dynamics (A) HIV-1 protease has two symmetric subunits – subunit A (red) and subunit B (blue). (B) Each subunit has several structural components that are important for its coordinated motions. Fulcrum (orange, residues 9 – 21) is a comparatively less mobile region that swings up and down similar to the flap elbow. E-34 (blue) – Hinge residue which is responsible for transmitting the motion from the fulcrum to the flap region. Flap elbow (magenta, residues 37 – 42) – Hinge residue E-34 drives the motion of this region to transfer the dynamics further away from the fulcrum to the upper flap region. This loop can make top-down and bottom-up swings. When the flap elbow swings from top to bottom, the flap domain opens up, and when it swings upward the flap domain closes. The Flap domain (residues 43 – 58) consists of flap tip (yellow, residues 49 – 52) and β-hairpin flaps (dark orange, residues 43 – 48 and 53 – 58). Opening and closing of the flap domains enable the protein to bind ligands and release its products after proteolysis. Cantilever (green, residues 59 – 75) functions as a base for the flap domain. The C-terminal β-hairpin flap is held by the N-terminal end of the cantilever and this arrangement is important to control the swinging of the flap [Harte et al., (1990); Hornak et al., (2006)].

157

HIV-1 protease functions as a homo-dimer as shown in Fig. A.2A. The dimer has a single active site and 99 residues per monomer. Each monomer has three domains: a terminal domain (residues 1-4 and 95-99 of each chain), which is important for the dimerization and stabilization; a core domain (residues 10-32 and 63-85 of each chain), for dimer stabilization and catalytic site stability; and a flap domain that includes two solvent accessible loops

(residues 33-43 of each chain) followed by two flexible flaps (residues 44-62 of each chain) important for ligand binding interactions. The conserved Asp25-Thr26-Gly27 active site triad is located at the interface between parts of the core domains. The active site of HIV-1 protease is formed at the homo-dimer interface. Each monomeric unit has important structural components (as identified in Fig. A.2B) that are important for its functional dynamics. The principal advantage of this structural arrangement is that the hinge residue E

34 causes the up-down swinging motion of the flap elbow (residues 37 – 42), which transmits the motion generated in the fulcrum (residues 9 – 21) to drive the dynamics of the flap domain (residues 42 – 58), whose conformation switches between open and closed states to facilitate substrate trapping in the catalytic pocket and product release following hydrolysis [Harte et al., (1990); Hornak et al., (2006)].

A.3.2 HIV-1 protease structure set (X-ray-329)

We have used 329 PDB structures of HIV-1 protease for the computations to extract protein dynamics from experimental structures. The PDB IDs of the data set are shown in Table A.1.

Only the Cα atoms of each residue were used for the performing PCA (Details including the scripts used can be found in the original book chapter and have been omitted here for the sake of simplicity).

158

A.3.3 Aligning PDB structures using MUSTANG

There are several successful multiple structural alignment programs such as MUSTANG

[Konagurthu et al., (2006)], TM-align [Zhang and Skolnick, (2005)], DaliLite [Holm and

Sander, (1996)], etc. We have used MUSTANG for multiple structural alignments of the selected PDB structures. MUSTANG does not consider sequence information in its alignment algorithm. Rather, it performs the alignment by finding maximal similar substructures. Thus it can capture the conformational variations among the structures much better than the alignment algorithms that use sequence similarity information. Moreover, in its alignment MUSTANG uses the Cα backbone atoms only.

Table A.1. List of 329 structures of HIV-1 protease used for perform PCA 1A30 1A8G 1A8K 1A94 1A9M 1AAQ 1AID 1AJV 1AJX 1AXA 1B6J 1B6K 1B6L 1B6M 1B6P 1BDL 1BDQ 1BDR 1BV7 1BV9 1BWA 1BWB 1C6X 1C6Y 1C6Z 1C70 1D4S 1D4Y 1DAZ 1DIF 1DMP 1DW6 1EBK 1EBW 1EBY 1EBZ 1EC0 1EC1 1EC2 1EC3 1F7A 1FEJ 1FF0 1FFF 1FFI 1FG6 1FG8 1FGC 1FQX 1G2K 1G35 1GNM 1GNN 1GNO 1HBV 1HIH 1HIV 1HOS 1HPO 1HPS 1HPV 1HPX 1HSG 1HSH 1HTE 1HTF 1HTG 1HVH 1HVI 1HVJ 1HVK 1HVL 1HVR 1HVS 1HWR 1HXW 1IIQ 1IZH 1IZI 1K1U 1K2B 1K2C 1K6C 1K6P 1K6T 1K6V 1KJ4 1KJ7 1KJF 1KJG 1KJH 1LZQ 1M0B 1MER 1MES 1MET 1MEU 1MRW 1MRX 1MSM 1MSN 1MT7 1MT8 1MT9 1MTB 1MTR 1MUI 1N49 1NH0 1NPA 1NPV 1NPW 1ODW 1ODX 1PRO 1QBR 1QBS 1QBT 1QBU 1RL8 1RPI 1RQ9 1RV7 1SDT 1SDU 1SDV 1SGU 1SH9 1SP5 1T3R 1T7I 1T7J 1T7K 1TCX 1TW7 1U8G 1VIJ 1VIK 1XL2 1XL5 1YT9 1YTG 1YTH 1Z8C 1ZBG 1ZLF 1ZPK 1ZSF 1ZSR 2A1E 2A4F 2AID 2AOF 2AQU 2AVM 2AVO 2AVS 2AVV 2AZC 2B7Z 2BB9 2BBB 2BPV 2BPW 2BPX 2BPY 2BPZ 2BQV 2CEJ 2CEM 2CEN 2F3K 2F80 2F81 2F8G 2FDD 2FDE 2FGU 2FGV 2FNS 2FNT 2FXD 2FXE 2HB3 2HC0 2HS1 2HS2 2I4D 2I4U 2I4V 2I4W 2I4X 2IDW 2IEN 2IEO 2J9J 2J9K 2JE4 2NMZ 2NNK 2NNP 2O4K 2O4L 2O4P 2O4S 2P3A 2P3B 2P3C 2P3D 2PK5 2PK6 2PQZ 2PWC 2PWR 2PYM 2PYN 2Q3K 2Q63 2Q64 2QAK 2QCI 2QD6 2QD7 2QD8 2QHC 2QHY 2QHZ 2QI0 2QI1 2QI3 2QI4 2QI5 2QI6 2QI7 2QMP 2QNN 2QNP 2QNQ 2R38 2R3T 2R3W 2R43 2R5P 2R5Q 2RKF 2UPJ 2UXZ 2UY0 2Z4O 3A2O 3AID 3BGB 3BGC 3BVA 3BVB 3CKT 3CYW 3CYX 3D1X 3D3T 3FX5 3GGA 3GGV 3GGX 3GI4 3GI5 3GI6 3I7E 3KF0 3KFN 3KFR 3KFS 3LZS 3LZU 3MWS 3NDU 3NDX 3NU3 3NU4 3NU5 3NU6 3NU9 3NUJ 3NUO 3O9F 3O9G 3O9H 3O9I 3OK9 3OTS 3OXC 3PWM 3PWR 3QAA 3R4B 3S43 3S53 3S54 3S56 3S85 3SO9 3T11 3U7S 3UCB 3UF3 3UHL 4DQB 4DQC 4DQE 4DQG 4DQH 4EJ8 4EJK 4EJL 4FAE 4FL8 4FLG 4FM6 4HVP 4I8W 4I8Z 4J54 4J55 4J5J 4PHV 7HVP 7UPJ 8HVP 9HVP

159

Figure A.3. Distributions of the 329 PDB structures by PCA. (A) Distribution of the structures on a PC1-PC2 plot. (B) Distribution of the structures on a PC1-PC3 plot. (C) Distribution of the structures on a PC2-PC3 plot. In plots A and B, open structures are located on the left side; closed structures are located on the right side; and the intermediate structures fall in between. Distribution of structures on PC2-PC3 plot (panel C) is based on primarily on the conformational differences along the flap elbow region. PC1, PC2, and PC3 capture 30%, 20%, and 7% of the variances in the dataset, respectively.

160

Figure A.4. Visualization of the first three PCs of HIV-1 protease on the structures. (A) Structures showing the closed form (left, PDB 1ebw) and open form (right, PDB 1rpi) of HIV-1 protease. The two subunits are shown in red and blue color and in ribbon diagram. (B) Snapshots of the structures displaced along the directions of PC1 shown in connected line segment. The direction of motions of the protein along each PC is shown with a black arrow. It can be seen that the opening- closing motion of the flaps can be easily identified from the extrema of PC1. Two extrema are shown for each motion in each row, together with arrows that indicate the directions for transition to the other structure. (C) PC2 images are shown looking down from the top of those in PC1 and PC3. PC2 is a twisting of the flap regions whereas (D) PC3 is a hinge motion between the core and flaps, with the core and flaps moving to and fro relative to one another.

161

A.4. Results

A.4.1 Significance of principal components (PCs)

Figure A.3 shows the distribution of the 329 PDBs Ids projected onto the space of the first few PCs from three separate views – PC1-PC2 (panel A), PC1-PC3 (panel B), and PC2-

PC3 (panel C). In panels A and B, open and closed structures are clearly separated in two regions (open structures on the left side and closed structures on the right side) and the intermediate conformations (1aid, 3t11, 4ej8, etc) spanning the middle region. The PC2-PC3 views in panel C shows that the structures are distributed based on conformational differences in the flap elbow region.

The motion of the structure along PC1, PC2, and PC3 can be observed using PyMOL visualization software. It is evident that PC1 is closely related to the opening and closing (or expansion/contraction) of the flaps and the ligand binding cavity as shown in Fig. A.4A. The two extreme ends of PC1 motion correspond closely to the closed (+) and the open (-) experimental structures (closed: PDB 1ebw, open: PDB 1rpi). The PC2 and PC3 correspond to twisting motions that are best seen in a perpendicular direction to those of PC1. PC2 is predominantly a twisting motion of the flap domains (panel C), whereas PC3 is predominantly a hinge motion of the core domains moving towards and away from the flaps

(panel D).

A.4.2 Comparing PC based and ANM computed dynamics

We then computed the overlap and the cumulative overlaps of the modes with the previously computed PCs as described in Methods section. Figure A.5 shows the overlaps between the first 10 PCs and the first 10 ANM modes. The highest overlap is 60% found between PC1

162 and ANM mode 3. Table A.2 shows the cumulative overlaps between PCs and the ANM modes. The cumulative overlap between each of the first and the second PCs and the first 20 modes is above 80%. Interestingly, the cumulative overlap reaches 80% between the second

PC and the first 6 modes. This clearly indicates that given an appropriate experimental dataset the motions captured by the PCs conform quite closely with the ANM motions.

Figure A.4. Overlap between PCs and ANM modes. PC1 and mode 3 gives the highest overlap of 0.6. Both of these correspond to the opening-closing motion of the flaps.

Table A.2. Cumulative Overlap between the first 3 PCs and sets of ANM modes

Modes PC1 PC2 PC3

3 modes 0.62 0.71 0.44

6 modes 0.64 0.80 0.54

10 modes 0.77 0.83 0.59

20 modes 0.80 0.85 0.65

*CO between a PC and ANM modes is shown in bold type if it is greater than 0.80

163

A.4.3 New internal distance based ANM motions

The use of structural information in ANM improves the modeling of the protein dynamics.

Section A.2.3 describes a way to derive spring constants from the structures. Here, we compute the inverse of the variance of the internal distances from the aligned structures in the MUSTANG aligned file. Table A.3 shows the overlaps of PCs based on these internal distances and the new ANM modes. The highest overlap is 79% that occurs between PC1 and mode 2, which is much higher than the highest overlap (60%) that occurred between PCs and conventional ANM modes (Fig. A.5).

Table A.3. Overlaps between PCs of X-ray-329 and the new ANM modes

PC/Mode Mode 1 Mode 2 Mode 3

PC1 0.09 0.79 0.40

PC2 0.34 0.01 0.24

PC3 0.34 0.01 0.10

Table A.4. Cumulative overlaps between PCs of X-ray-329 and the new ANM modes

Modified ANM modes PC1 PC2 PC3

3 modes 0.90 0.42 0.35

6 modes 0.91 0.44 0.41

20 modes 0.95 0.89 0.84

Values in bold indicate cumulative overlaps above 80%

Table A.4 shows the cumulative overlap between PCs and the new ANM modes. We can see that cumulative overlap between PC1 and the first three modes reaches 90% which is

164 quite high compared to the cumulative overlap between PC1 and the first three modes (62% as shown in Table A.2). However, the cumulative overlap between PC2 and the first three modified ANM 42%; on the other hand this value between PC2 and the first three conventional ANM modes is 71%, a much higher value. Therefore, in some cases cumulative overlap between a PC and the new ANM modes gets improved compared to the similar values between a PC and the conventional ANM modes. But when 20 new ANM modes are included, the values are constantly higher. Taken together, this suggests that modified ANM can improve the performance of the ANM models.

Figure A.5. Depiction of entropies of HIV-1 protease structure (PDB 1rpi) computed from PCs. Residues are colored spectrally according to the entropy values – coloring from red for the highest entropy to blue for the lowest entropy. Some of the residues along the flap and flap elbow regions on the subunit A (right subunit) have higher entropies than the same residues on subunit B (left subunit).

165

A.4.4. Computing entropy using PCs

We compute the entropy of the HIV-1 protease system using Eq. 2 described in section

A.2.5. We compute the entropy from the principal components of the 329 aligned HIV-1 protease structures. The residues of HIV-1 protease are colored in Fig. A.6 by the entropy values in VIBGYOR format. It is clear from the figure that the entropies are asymmetrically distributed in the two HIV-1 protease subunits. Subunit A (right subunit) has higher entropies along the flap and flab elbow regions.

A.5 Conclusions

This chapter gives the background of two important methods – PCA and ENM. By following the steps with the set of 329 HIV-1 PDB structures, one can get a hands-on experience on how to apply PCA to extract dynamics and mechanism information by capturing the conformational variability buried in different PDB structures of the same protein. One can also learn how to model the functionally important global motions of the protein using the widely accepted ANM model and compare the dynamics and mechanism found from experimental structures by PCA and from the ANM model. The higher overlaps between PCs and modified ENM modes indicate that a rich dataset of protein structures can play an important role in understanding functional dynamics and mechanism of the protein.

Moreover, the PC’s represent the variability apparent within the sets of structures, and hence these are used as a direct measure of the conformational entropy of the protein structure.

This approach can also be extended to other highly diverse protein structure sets. The

PDB database continues to grow rapidly – in 2008 there were ~43,000 protein structures and

166 now in 2013 there are more than 90,000 structures [Berman et al., (2000)]. In the future if new technologies for X-ray structure determination are developed that are much more efficient and very rapid, then there will be truly abundant structures of related proteins, including aberrant protein structures from patients. Among the various structures there are many single proteins with multiple X-ray structures determined under different conditions, as well as NMR structures. Generally proteins are robust and not easily disturbed by different environments or mutations; and the preponderance of evidence suggests that proteins have a limited range of conformations that are essential for their function. Therefore, the approach described here can generally be used to extract dynamics of any protein with significant numbers of available experimental structures.

A.6 Acknowledgments

We gratefully acknowledge the support provided by NIH Grant R01GM072014 and NSF

Grant MCB-1021785. We used several MATLAB functions from MAVEN by Zimmerman

M, et al [Zimmermann et al., (2011a)] as mentioned in section A.4. MAVEN is also used to compute PCs, PC-plot, ANM modes, overlap between PCs and ANM modes.

167

APPENDIX B. INVESTIGATIONS ON ELASTIC NETWORK MODELS OF

COARSE-GRAINED MEMBRANE PROTEINS

(Originally presented as an oral talk at the International Symposium on Bioinformatics

Research and Applications (ISBRA) 2012 at University of Texas at Dallas)

Kannan Sankar, Michael T. Zimmermann and Robert L. Jernigan

Abstract

Despite their overwhelming importance, still relatively few structures of membrane proteins have been experimentally solved. Given that the majority of drugs target membrane proteins, insights from structural and functional analysis of existing structures with computational tools can be extremely useful. The dynamics and function of membrane proteins relate closely to the membrane to which they are embedded. Here we use anisotropic elastic network models (ANMs) to investigate the motions of a G-protein coupled receptor

(GPCR), β-2 adrenergic receptor, in the absence and presence of membranes where the surrounding patch of membrane has various shapes and sizes and also using different parameters of the ANM. Our results indicate that the normal modes of the protein are significantly modified by the presence of the membrane. The extent of membrane-induced modifications and the membrane’s impact upon proposed functional motions is investigated.

168

B.1. Introduction

Membrane proteins play a crucial role in cells by playing diverse functions ranging from signal transduction and cell adhesion to small molecule transport and catalysis. They are also the largest class of protein drug targets [Terstappen and Reggiani, (2001)]. However the structures of only a few membrane proteins have been solved experimentally due to difficulties in expressing and crystallizing them [Ostermeier and Michel, (1997)]. This makes computational approaches and simulations particularly important for understanding their structure-function relationships.

Computational analysis of membrane proteins is complicated by their large size and also by the fact that the membrane itself could play a significant role in modulating their effective dynamics. Elastic Network Models (ENMs) including Gaussian (GNMs) and Anisotropic

Network models (ANMs) offer a fast and convenient way for analyzing such large systems

[Atilgan et al., (2001); I Bahar et al., (1997); Tirion, (1996)]. By modeling complex systems as a set of particles that are interconnected by springs (if two particles are within a particular distance cutoff Rc) and applying normal mode analysis, ENMs can capture the collective most important motions of the parts of a system. Experimental studies early in protein science have demonstrated that substantial structural fluctuations occur in proteins, and that these fluctuations are essential to protein function [Careri et al., (1979); Weber, (1975)].

Previous studies have shown that the mean square fluctuations of atoms obtained from ENMs correlate well with experimentally determined temperature factors and NMR ensembles of structures, as well as with the results of principal component analysis of ensembles of independently determined structures [Bakan and Bahar, (2011); Yang et al., (2008)].

169

Customarily, the first few slowest normal modes capture the functional motions for a wide range of proteins [Atilgan et al., (2001); I Bahar et al., (1997)].

Our study focuses on the human β-2 adrenergic receptor (ADRB2) which is a G-protein coupled receptor (GPCR) involved in response to adrenaline-mediated smooth muscle relaxation. We investigate differences in the normal modes obtained from ANMs [Atilgan et al., (2001)] of the protein in the presence and absence of membrane, using membrane models of different shapes (cubic and cylindrical) and different sizes and also by using different parameters of the ANM.

B.2. Methods

The X-ray crystal structure of ADRB2 was obtained from the Protein Data Bank (PDB ID:

2RH1) [Cherezov et al., (2007)]. The cubic POPC (1-palmitoyl-2-oleoyl phosphatidyl choline) membrane with sides of length 100 Å are built using Membrane Builder in the

VMD1.9 (Visual Molecular Dynamics) [Humphrey et al., (1996)] package. The protein is coarse-grained to only Cα atoms and the POPC atoms were ‘vertically’ coarse-grained (along the length of POPC) to retain the atoms N, P1 and O21 in the polar head group and C32,

C24, C28, C212, C216, C36, C310 and C314 in the hydrophobic tails. This provides a membrane with somewhat more detail than the protein. So we further utilized a spherically coarse-grained membrane by iteratively removing atoms within a 5Å cutoff to ensure uniform density.

ENMs are generated using a subset of the atomic coordinates (coarse-grained structures) connected by harmonic springs with unit stiffness γ = 1 kcal/(mol.Å-2) where points are within a cutoff radius of Rc = 13 Å (unless otherwise stated). Similarities between modes of

170 motion from different models are measured in terms of overlap (O), cumulative overlap (CO) and Root Mean Squared Inner Product (RMSIP) between the normal mode vectors as described in detail elsewhere [Leo-Macias et al., (2005); Tama and Sanejouand, (2001)].

B.3. Results and Discussion

The first 10 normal modes of the free ADRB2 protein show little overlap with the first 10 modes of the protein from models where the membrane (cubic or cylindrical) is included.

The models which include the membrane yield higher mean-square fluctuations in various regions of the protein, especially the loop regions (Fig. B.2). Upon visualizing the modes, we find that, in the presence of the membrane, the motions exhibited by the free protein are highly damped.

Figure B.1 Comparison of modes using different types of membranes. (a) Overlaps between the first 10 normal modes of the model with cubic membranes and cylindrical membrane of radius 26.5Å is only moderate (gray squares) indicating a significant influence of these details on the motion. (b) Overlap between the first 10 normal modes of the model with spherically coarse-grained membrane and vertically coarse grained membrane is significantly high (dark squares) showing that the motions are similar. The gray-scale reflects the extent of overlap in the directions of the motions between modes as shown in the legend bar on right.

171

Also, there was only a moderate overlap between the modes generated using cubic in comparison with cylindrical membranes (Fig. B.1a); perhaps indicating that the local membrane environment around the protein can have a major impact on the protein's functional motions that may affect, for example, the formation of membrane rafts. Also, cylindrical membranes of increasing radii from 27Å to 41Å (in steps of 2Å) yield modes of decreasing overlaps with the modes of the free protein. Although spherical coarse-graining

(cg) of the membrane yields similar motions, some specific individual modes in the vertical cg are absent in the spherical c-g motions (Fig. B.1b).

α We varied the values of γ and Rc for the protein C , as well as for the head and tail atoms of the lipid molecules in the bilayer. Similar behaviors are observed between the modes obtained with such models and the ones reported here, demonstrating insensitivity to these details (results not shown). Our results, however, suggest that the membrane may does play an important role in affecting the functional motions of a membrane protein.

Figure B.2 Mean square fluctuations (MSFs) from ANMs. (a) with and (b) without membrane are mapped onto the C backbone of ADRB2 structure in a spectral coloring scheme. Most of the significant differences are in the extra- and intra-cellular loops and the N- and C-termini. Red represents regions with high MSF while blue represents regions with low MSF.

172

APPENDIX C. INVESTIGATIONS ON THE MECHANISM OF CARBON

CONCENTRATION BY Lci1

(Part of a manuscript in preparation, in collaboration with the Yu & Spalding Labs at ISU)

Tsung-Han Chou, Kannan Sankar, Sayane Shome, Robert L. Jernigan, Edward W. Yu,

Martin H. Spalding

Abstract

Carbon concentrating mechanisms (CCMs) play an important role in photosynthetic algae like Chlamydomonas reinhardtii during CO2-limiting stress conditions by helping to increase cellular levels of inorganic carbon (Ci; CO2 or bicarbonate ion). However the molecular mechanism of carbon concentration has remained elusive because of missing structural data for any of the constituent proteins. Since the first crystal structure of one of the components

(Lci1) is available, is now becomes possible to begin investigating the mechanism by which inorganic carbon is concentrated by CCMs. In this work, we investigate which amino acids are likely to be crucial for the concentrating mechanism from steered molecular dynamics simulations on Lci1 protein with CO2 and bicarbonate as ligands. Results from this study yield some insights for attempts in introducing CCMs to improve the productivity of higher plants.

C.1. Introduction

Harnessing carbon dioxide for photosynthesis is a challenge for all plants; mainly due the inefficiency of ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO), the enzyme that catalyzes the first step in carbon fixation [Spreitzer and Salvucci, (2002)]. The

173

concentration of CO2 in the air is only at trace levels, much lower than that of O2. Rubisco has a higher affinity for O2 than CO2, leading to photorespiration. It becomes a more serious problem for aquatic plants because (a) the rate of diffusion of CO2 is much lower in water; and (b) they experience drastic changes in pH and Ci levels which alter the availability of

- HCO3 and CO2 for photosynthesis [Pedersen et al., (2013)]. As a result, most algae like

Chlamydomonas reinhardtii have developed highly efficient carbon concentrating mechanisms (CCMs) that can increase Ci concentrations and make it available for photosynthesis [Moroney and Rouge, (2007)]. Neither the exact nature of this concentration mechanism nor the players in the CMM have been fully determined [Thoms et al., (2001)].

However several candidate genes have been identified that could code for the various components of the system [Moroney and Ynalvez, (2007); Wang et al., (2011)]. These include several carbonic anhydrases and transporters/channels that may control the

- movement of HCO3 and CO2 across various cellular membranes before they can reach the chloroplast.

One such candidate gene is lci1, which codes for the protein low CO2-inducible 1 (Lci1).

Lci1 was first identified to be over-expressed in cells grown under CO2-starved conditions

[Burow et al., (1996); Yamano et al., (2008)]. The protein has no homologs in the NCBI database but was predicted to possess four transmembrane (TM) domains and hypothesized to be involved in transport of Ci [Mitchell et al., (2014); Ohnishi et al., (2010)]. The crystal structure of Lci1 was recently solved by the Yu Lab and verified to possess four TM domains with a central channel (unpublished). The protein consists of four transmembrane helices,

TM1-4 with two extracellular loops ECL1-2 and one intracellular loop ICL1 (see Fig. C1) and constitute a four-helix bundle.

174

The crystal structure of Lci1 provides a snapshot of the transporter in some ‘state’ of the transport cycle. However, a more comprehensive picture of the transport process can be obtained from the application of computational simulations to learn about likely conformational changes within the protein and the interactions between the ligand and key residues of the protein. Computational methods such as molecular dynamics (MD) provide a direct means for studying this process. However, since the timescale of these membrane transport processes is beyond the reach of usual MD, other suitable methods are needed to accelerate this process. One such approach is steered molecular (SMD) [Izrailev et al.,

(1997)], where the substrate is pulled at a constant velocity (or constant force) to drive it in a specific direction (e.g. from the outside to the inside of the cell). SMD has been widely employed to study ligand binding and transport for a variety of transporter proteins such as the water channel aquaporin [Ilan et al., (2004); Jensen et al., (2002); Wang et al., (2005)], leucine transporter LeuT [Celik et al., (2008)], lactose transporter LacY [Jensen et al.,

(2007)] and the ADP/ATP carrier (AAC) protein [Wang and Tajkhorshid, (2008)].

Figure C.1. Structure of Lci1. (a) Lci1 consists of four transmembrane helices: TM1 (blue), TM2 (green), TM3 (orange) and TM4 (red) connected to each other by two extracellular loops, ECL1 (cyan) and ECL2 (magenta) and an intracellular loop ICL1 (yellow). Both the N- and C- termini are on the cytoplasmic side. (b) Rotated view of the protein showing the central channel through which carbon dioxide or bicarbonate migrates.

175

- Here, we use SMD simulations of Lci1 with ligands CO2 and HCO3 to identify the key residues involved in their transport through Lci1. Our analysis provides compelling evidence that

- the protein can indeed transport CO2 and HCO3 in only one direction (from the extracellular

(EC) side to the cytoplasm), forming a Ci valve. It also sheds light on the how the protein forms the Ci valve, forcing the ligands to be drawn into the cytoplasm by a coordinated movement of its TM1 and TM2 helices.

C.2. Methods

C.2.1. Preparation of membrane-protein-ligand system

A 120 X 120 Å2 palmitoyl-oleyl-phosphatidylethanolamine (POPE) membrane bilayer is constructed using the membrane builder plugin in VMD [Humphrey et al., (1996)] with the membrane normal to the z-axis. The protein (along with ligand) is inserted into the center of the membrane carefully by taking into consideration the position of the TM helices with respect to the bilayer. To embed the protein into the membrane, any heavy atoms of the lipid within 0.75 Å of a heavy atom of the protein are removed. The system is solvated with the addition of 5 Å of water on either side of the bilayer and Na+ and Cl- ions are added to ensure charge balance, resulting in a total of 112,470 atoms (112,472 atoms in the case of bicarbonate) in the system. Following minimization for 200,000 steps, the system is equilibrated by starting at 10K and the temperature raised slowly to 310 K in 1̊ K steps with 1 ps runs at each step. A harmonic restraint of 1 kcal/mol Å2 is applied to the protein during the temperature increase, following which a 100 ps equilibration run was performed at 310̊̊̊ K with constant temperature and pressure (NPT) and all atoms unconstrained. Langevin dynamics (damping coefficient of 1 ps-1) is used to maintain the constant temperature. A

176 cutoff distance of 12 Å is used for van der Waals interactions, with the pair list distance increased to 14 Å. A periodic boundary condition is imposed with long-range full electrostatic interactions evaluated using the particle mesh Ewald method with a grid spacing of 1 Å. For all simulations, NAMD [Phillips et al., (2005)] with the CHARMM27 parameter set [MacKerell et al., (1998); Schlenkrich et al., (1996)] is used. The topology and parameter files for CO2 are generated using the SwissParam server [Zoete et al., (2011)] and for bicarbonate ion the Paratool plugin in VMD [Humphrey et al., (1996)] is used.

C.2.2. Steered Molecular Dynamics

Two SMD simulations are carried out to investigate the pathway of the ligand transport: from

− EC to IC side as well as IC to EC side. CO2 or HCO3 is placed accordingly at either end of the channel and constant velocity steered MD (cv-SMD) is applied, specifically with the

− carbon atom of CO2 or HCO3 being pulled in a direction perpendicular to the membrane

(along the z-axis) by using a harmonic constraint and moving at a velocity of 30 Å/ns and with a force constant (푘) of 5 kcal mol-1Å-2. To prevent overall translational motion of the system, specific residues at the periphery of the extracellular side: T39, V59, P73, N135 and

D143 (intracellular side: V10, V99, M107 and S170) of the protein are harmonically restrained with a force constant of 5 kcal/mol Å2 in the z-direction for the EC to IC (IC to

EC) simulation. Similarly, for the simulation from IC to EC side, residues V10, V99, M107 and S170 at the periphery of the IC side were restrained.

C.3. Results

To study the pathway of migration of CO2 within the channel, we have performed SMD simulations on the channel embedded in the POPE membrane. In order to understand the interactions of the amino acids that stabilize the CO2 ligand, we also measure the magnitude

177 of applied force necessary to maintain constant velocity of the ligand through the channel.

SMD simulations from the EC to the IC side reveal the permeation pathway through which

CO2 can migrate into the interior of the cell (shown in red in Fig. C.1).

Figure C.2. Migration of carbon dioxide/bicarbonate ion from steered MD simulations. − Pathway of CO2 or HCO3 (red tube) migration from the EC to the IC side as observed in cv-SMD simulations. The ligand moves through the central channel in each protomer up to the ‘valve’ region, where the ligand is stabilized inside the channel and drifts to the cytoplasmic side accompanied by a scissoring motion of the helices TM1 and TM2. The z-axis is shown on the right.

CO2 moves through the central channel (of each protomer) to the IC side up to a distance of ~20 Å from the center of the channel, which corresponds to the location where the electron density of CO2 has been observed in the crystal structure. Once the CO2 reaches this place, a noticeable reduction in the magnitude of applied force was observed (Fig. C.3a). This suggests that the residues lining the region tend to push the CO2 further into the channel, and prevent it from migrating out of the cell; thereby constituting a ‘valve’. Beyond the valve region, the CO2 moves freely into the cytoplasm accompanied by a scissoring motion involving separation of TM1 and TM2. This is also evident from the residue level root mean square fluctuation (RMSF) plot in Fig. C.3b where there is strong increase in RMSF of

178 residues at the IC side of TM2 (residues 84-99) and intracellular loop 2 (ICL2; residues 100-

107) in the presence of ligands versus in the absence of ligand.

(a) (b)

Figure C.3. Characterization of steered MD simulations. (a) Variation of applied force along − the course of the simulation with CO2 (blue) and HCO3 (red). As the ligand enters the valve region, there is a substantial reduction in the applied force, suggesting that the valve tends to push  the CO2 toward the IC side. (b) Root mean square fluctuation (RMSF) of residue C atoms for − protein alone (green line), protein with CO2 (blue line) and HCO3 (red line). A substantial increase in RMSF is observed for the residues at the IC terminus of TM2 as well as ICL2 that is moves relative to TM1 to allow ligand to move out of the Lci1 into the cell.

SMD simulations in the opposite direction, from the IC to the EC using identical parameters, but with the direction of the force reversed show that the CO2 molecule migrates only a short distance into the channel (up to the valve) and afterwards gets pushed sideways

(parallel to them membrane, through the space between TM1 and TM2), thereby reinforcing the fact that the valve strongly favors the movement of CO2 towards the IC side.

Residues that interact with and stabilize the ligand within the channel are directly identified by analyzing the frames in the SMD trajectory corresponding to peaks in the force- position plot in Fig. C.3a (blue curve). Structures at four specific locations with a peak in the applied force identified as follows (locations measured in terms of distance from center of mass of the channel as indicated in Fig. C.3a): -11 Å, -2.5 Å, 10 Å and 15 Å and 20 Å. The oxygen atoms of CO2 form hydrogen bond interactions with the hydrogens in the side chains

179 of Q80, N161 and Q16 (Fig. C.4), suggesting a crucial role for these residues in the transport mechanism.

a b c

Figure C.4. Interactions of carbon dioxide with residues lining the channel. (a) H-bonds formed between oxygen atom of CO2 and HE2 atom of Q80. (b) H-bond formed between CO2 and HD2 atom of N161. (c) H-bond formed between CO2 and HE2 atom of Q16.

a b

Figure C.5. Interactions of bicarbonate with residues lining the channel. (a) H-bonds formed − − between HCO3 and HG1 atoms of T19 and T94. (b) H-bond formed between HCO3 and HE2 atom of Q16.

− Similar SMD simulations were also performed to investigate the migration of HCO3 through the channel. SMD simulations from the EC to the IC side reveal that the permeation pathway is similar to that for CO2 migration (shown in red in Fig. C.2). A plot of the applied force versus the position of ligand (red curve in Fig. C.3a) shows peaks corresponding to interaction with residues lining the channel at the following locations: -12 Å, 5 Å, 14 Å and

18 Å. Hydrogen bonds are also seen between the bicarbonate ion and the side chains of Q16,

T19 and T164 (shown in Fig. C.5).

180

C.4. Conclusions

Crop plants like rice and wheat use the C3 pathway for photosynthesis and rely on

RuBisCO for carbon fixation. Since O2 is present in much higher concentrations in the environment, it competes with CO2 for the active site of RuBisCO, leading to photorespiration and reduced yield (due to reduced carbon fixation). In other words, the crop yield can be tremendously improved if the plants can be modified to concentrate CO2 in their cells efficiently, using mechanisms similar to those present in certain algae like

Chlamydomonas. Recently, there have been attempts to introduce algal CCMs into higher plants to improve their productivity [Atkinson et al., (2016)]. Understanding the structure and functional mechanism of components of the CCMs can provide important insights for such studies.

By using SMD simulations we have been able map out the pathway for ligand migration in the Lci1 transporter. Our study gives compelling evidence that Lci1 is indeed responsible for transport of Ci from the periplasm to the cytoplasm. By forming a ‘valve’ region, it prevents Ci from going out of the cell, thereby making it available for photosynthesis. We

- have also been able to identify the key residues involved in interactions with CO2 and HCO3 during the transport process.

C.5. Acknowledgements

This work was supported by the NIH grant R01AI114629.

181

APPENDIX D. SUPPLEMENTARY INFORMATION FOR CHAPTER 3

Table D.1. List of 167 protein structure pairs (from MolMovDB) used for extraction of amino acid contact change information. Structure pairs are separated by a ‘:’ and corresponding PDB IDs and chain IDs of each structure are separated by a ‘_’. 1CBU_B : 1C9K_B 1LCC_A : 1LQC_A 1GU0_A : 1GU1_A 1TDE_A : 1F6M_E 1E0S_A : 2J5X_A 1J74_A : 1J7D_A 1JEJ_A : 1JG6_A 1FGU_A : 1JMC_A 1BRD_A : 2BRD_A 2C4Q_A : 2IZ9_A 1DKX_A : 1DKY_A 1MSW_D : 1QLN_A 4ICB_A : 1CLB_A 1I69_A : 1I6A_A 1DDT_A : 1MDT_A 1BP5_A : 1A8E_A 5DFR_A : 4DFR_A 1F8A_B : 1PIN_A 1DPE_A : 1DPP_A 1TTP_A : 1TTQ_A 1PFH_A : 1HDN_A 9AAT_A : 1AMA_A 1N0U_A : 1N0V_C 1AKZ_A : 1SSP_E 1A8V_A : 2A8V_A 8ADH_A : 6ADH_A 2EFG_A : 1FNM_A 1K93_A : 1K8T_A 1RCL_A : 1RCK_A 1JKY_A : 1J7N_A 1ERK_A : 2ERK_A 1K92_A : 1KP2_A 1K9K_A : 1K9P_A 1BU7_A : 1JPZ_A 2NAC_A : 2NAD_A 1HOO_B : 1QF5_A 1BJY_A : 1BJZ_A 3COX_A : 1COY_A 1D9V_A : 1MRP_A 1CF4_A : 1GRN_A 3TMS_A : 2TSC_A 1NJG_A : 1NJF_A 1E88_A : 1E8B_A 1EFT_A : 1ETU_A 2RAN_A : 1AXN_A 1CTS_A : 4CTS_A 1JBV_A : 1JBW_A 1GDT_A : 2RSL_A 1F3Y_A : 1JKN_A 1A2V_A : 1AVK_A 1GGG_A : 1WDN_A 1GPW_C : 1THF_D 1B0O_A : 1BEB_A 1EFW_A : 1G51_B 1PSD_A : 1GDH_A 1G7T_A : 1G7S_A 1A03_B : 1CNP_A 2EIA_A : 1EIA_A 1EX6_A : 1EX7_A 1DFK_A : 1B7T_A 5CRO_A : 6CRO_A 5ER2_E : 4APE_A 1OXU_C : 1OXS_C 1I7D_A : 1D6M_A 1A67_A : 1CEW_I 2EZA_A : 3EZA_A 1FTO_B : 1FTM_B 1MCP_L : 1NCA_L 3ENL_A : 7ENL_A 1GLN_A : 1G59_A 1HRD_A : 1K89_A 1IDI_A : 1IDG_A 1ECB_A : 1ECC_A 2GD1_O : 1GD1_O 1AON_A : 1OEL_A 1GP2_A : 1CIP_A 3HVP_A : 4HVP_A 1BE1_A : 1B1A_A 8OHM_A : 1CU1_A 7AHL_A : 1LKF_A 3ICD_A : 1AI2_A 1N8Z_C : 1N8Y_C 1IBO_A : 1IBN_A 3HVT_A : 2HMI_A 6LDH_A : 1LDM_A 1I6I_A : 1I5S_A 1EQ0_A : 1HKA_A 1KLQ_A : 1DUJ_A 1CRL_A : 1THG_A 1LCI_A : 1BA3_A 1AA7_A : 1EA3_A 1PSI_A : 7API_A 1BMD_A : 4MDH_A 1H9K_A : 1H9M_A 1IPD_A : 1OSJ_A 5AT1_A : 8ATC_A 1DQZ_A : 1DQY_A 1JYS_A : 1NC3_A 1BCC_A : 2BCC_A 1EER_A : 1BUY_A 1ORO_A : 1STO_A 1DV7_A : 1DVJ_A 2LAO_A : 1LAF_E 1TIP_A : 1FBT_A 1RRP_A : 1BYU_A 1JQJ_A : 2POL_A 1LFG_A : 1LFH_A 9GPB_A : 1GPB_A 6Q21_A : 4Q21_A 1PVU_A : 1PVI_A 1P7Q_D : 1G0X_A 2HCO_A : 4HHB_A 1SER_B : 1SES_A 1UJ1_A : 1UK2_A 1L97_B : 1L96_A 1JMJ_A : 1JMO_A 2KTQ_A : 3KTQ_A 1NG1_A : 1FFH_A 1OMP_A : 3MBP_A 2PFK_A : 1PFK_A 1TRE_A : 6TIM_A 1WRP_R : 3WRP_A 1MML_A : 1QAI_A 1DKR_A : 1DKU_A 1WAS_A : 1JMW_A 1EVK_A : 1EVL_A 1LUA_A : 1LU9_A 1G6O_B : 1NLZ_B 1CHN_A : 3CHY_A 1AMN_A : 1EA5_A 1EJD_A : 1A2N_A 1II0_A : 1IHU_A 1DCM_A : 1D5W_A 1ANK_A : 1AKE_A 1RKM_A : 2RKM_A 1BHM_A : 1BAM_A 1LGR_A : 2GLS_A 1M8P_A : 1I2D_A 1OIB_A : 1QUK_A 1Q1B_A : 1Q12_A 1PRV_A : 1PRU_A 1BNC_A : 1DV2_A 1PHP_A : 13PK_A 1FU1_A : 1IK9_A 2PHY_A : 3PYP_A 1B47_B : 2CBL_A 1CMK_E : 1JLU_E 1IWO_A : 1SU4_A 1A32_A : 1AB3_A 1HNG_A : 1HNF_A 1BPD_A : 2BPG_A 1PJR_A : 3PJR_A 1CAQ_A : 1CQR_A 1CLL_A : 1CTR_A 1K23_A : 1K20_A 1LWT_A : 1VDE_A 1THV_A : 1THI_A 1CRX_A : 4CRX_A 1NYL_A : 1GTR_A 1IH7_A : 1IG9_A 1LB5_A : 1LB4_A 1L5E_A : 1L5B_A 1URP_A : 2DRI_A 1BJT_A : 1BGW_A 1FOX_A : 2FOW_A 1DAP_B : 3DAP_A 1JSA_A : 1IKU_A

182

Table D.2. Number of changed and unchanged contacts between amino acid pairs in the dataset. Upper matrix represents number of unchanged contacts whereas lower matrix represents number of changes contacts. Amino acids are identified by their one-letter codes and symbols are as mentioned in SI Materials and Methods. The fractional and normalized cc matrices can be computed using the raw counts in this table as per formulae in SI Materials and Methods. 푐 푛푖 A I L V M F W G P C N Q S T Y D E R H K 4189 3034 4710 3556 1115 1981 607 3695 2296 626 2086 1840 2701 2727 1677 2908 3569 2601 1066 3179 푛푖 1524 1311 2171 1481 490 980 302 819 472 216 443 424 584 727 723 534 649 717 293 667 A A 3709 264 1894 2724 1947 597 1241 346 707 479 259 398 418 559 755 869 490 577 693 290 591 I I 4495 323 564 4558 2643 890 1956 585 1105 789 363 601 699 801 1116 1274 646 936 1090 465 830 L L 6899 462 701 1214 2204 560 1256 379 788 520 280 444 457 634 944 881 516 736 750 310 739 V V 4382 325 408 667 468 254 462 137 272 198 88 162 169 203 282 326 195 230 275 127 192 M M 1663 108 143 239 138 48 1052 312 534 405 206 335 333 475 623 664 357 458 521 272 436 F

F 3046 175 281 439 251 122 202 110 193 206 71 131 122 150 193 253 140 185 210 80 167 W W 966 60 78 129 60 21 58 22 616 343 148 409 310 470 502 516 438 456 582 234 507 G G 3070 202 173 235 178 87 134 60 196 300 107 265 261 327 335 467 269 394 438 186 311 P P 2357 147 133 243 168 65 109 36 125 106 306 115 91 128 165 163 73 105 133 84 127 C C 923 55 67 105 65 22 50 18 53 46 40 360 274 368 366 374 385 399 404 165 375 N N 2163 116 130 166 120 52 106 21 139 81 32 84 248 322 354 330 301 344 386 138 368 Q Q 2053 130 106 185 124 49 94 33 116 84 32 85 108 426 462 379 401 476 485 223 447 S S 2735 164 155 265 168 57 115 47 172 122 47 119 103 152 612 509 399 503 549 222 500 T

182 T 3374 196 218 319 236 103 156 49 194 137 63 152 109 174 240 514 447 519 586 259 500 Y

Y 2673 137 198 279 180 74 148 51 150 130 41 116 94 119 147 92 274 323 819 232 784 D D 2804 161 160 213 161 81 113 44 170 112 30 122 103 151 174 129 134 500 1126 280 1016 E E 3498 208 172 324 186 74 126 54 175 127 33 153 146 157 204 169 153 214 606 263 405 R R 3721 202 188 280 185 78 156 60 235 162 56 165 157 189 227 173 250 372 288 156 188 H H 1275 59 83 137 88 25 60 18 54 64 26 52 47 58 73 74 65 87 98 26 466 K K 3644 215 214 297 206 77 151 47 222 160 42 152 148 201 203 172 278 364 200 81 214 A I L V M F W G P C N Q S T Y D E R H K Totals are: 푛 = 50,613; N = 236,085; 푁푐 = 29,725. 푐 Upper matrix represents 푢푖푗 whereas lower matrix represents 푛푖푗 = 푏푖푗 + 푓푖푗 Scaling factors are 푛푖 × 2, 푛 × 2

183

Table D.3. Contact-change (cc) matrices. Lower matrix shows fractional contact-change matrix whereas the upper matrix shows the normalized contact-change matrix. Amino acids are identified by their one-letter codes. A I L V M F W G P C N Q S T Y D E R H K 0.28 0.23 0.21 0.25 0.21 0.17 0.18 0.42 0.41 0.24 0.35 0.38 0.38 0.30 0.17 0.43 0.44 0.27 0.24 0.38 A A 0.08 0.16 0.15 0.14 0.13 0.12 0.12 0.25 0.22 0.15 0.25 0.19 0.23 0.19 0.12 0.27 0.25 0.16 0.20 0.25 I I 0.11 0.13 0.15 0.17 0.15 0.12 0.12 0.22 0.25 0.16 0.22 0.20 0.27 0.19 0.12 0.28 0.29 0.15 0.20 0.25 L L 0.10 0.11 0.12 0.18 0.17 0.13 0.10 0.28 0.31 0.16 0.26 0.24 0.26 0.20 0.13 0.32 0.26 0.17 0.24 0.24 V V 0.10 0.09 0.11 0.10 0.11 0.14 0.08 0.31 0.26 0.14 0.25 0.21 0.23 0.23 0.12 0.33 0.26 0.16 0.14 0.27 M M 0.10 0.11 0.12 0.11 0.09 0.10 0.10 0.24 0.21 0.13 0.24 0.20 0.19 0.16 0.11 0.26 0.22 0.17 0.15 0.23 F F 0.08 0.10 0.10 0.09 0.12 0.09 0.10 0.28 0.14 0.13 0.13 0.19 0.23 0.16 0.10 0.25 0.23 0.15 0.15 0.19 W W 0.09 0.10 0.10 0.07 0.07 0.09 0.09 0.55 0.50 0.35 0.47 0.47 0.51 0.44 0.27 0.57 0.55 0.39 0.29 0.53 G G 0.11 0.11 0.10 0.10 0.14 0.11 0.13 0.14 0.40 0.33 0.35 0.34 0.42 0.37 0.21 0.49 0.38 0.30 0.33 0.48 P

P 0.13 0.12 0.13 0.14 0.14 0.12 0.08 0.15 0.15 0.08 0.22 0.25 0.29 0.25 0.13 0.33 0.26 0.23 0.21 0.23 C C 0.11 0.11 0.13 0.10 0.11 0.11 0.11 0.15 0.18 0.06 0.27 0.32 0.37 0.37 0.23 0.38 0.44 0.32 0.31 0.40 N

N 0.12 0.14 0.12 0.12 0.14 0.14 0.07 0.15 0.13 0.12 0.10 0.40 0.34 0.27 0.20 0.38 0.45 0.30 0.30 0.37 Q Q 0.13 0.11 0.12 0.12 0.13 0.12 0.12 0.16 0.14 0.15 0.13 0.18 0.41 0.35 0.23 0.45 0.40 0.31 0.26 0.44 S S 0.12 0.12 0.14 0.12 0.12 0.11 0.14 0.15 0.16 0.16 0.14 0.14 0.15 0.30 0.18 0.42 0.39 0.27 0.27 0.33 T

T 0.12 0.13 0.13 0.11 0.15 0.11 0.11 0.16 0.17 0.16 0.17 0.13 0.16 0.16 0.09 0.23 0.25 0.16 0.18 0.22 Y 183 Y 0.09 0.10 0.10 0.09 0.10 0.10 0.09 0.13 0.12 0.11 0.13 0.12 0.14 0.13 0.08 0.59 0.57 0.27 0.30 0.38 D D 0.13 0.14 0.14 0.13 0.17 0.14 0.14 0.16 0.17 0.17 0.14 0.15 0.16 0.18 0.13 0.20 0.51 0.28 0.32 0.38 E E 0.14 0.13 0.15 0.11 0.14 0.12 0.13 0.16 0.14 0.14 0.16 0.18 0.14 0.17 0.14 0.19 0.18 0.26 0.26 0.34 R R 0.12 0.12 0.11 0.11 0.12 0.13 0.13 0.17 0.16 0.17 0.17 0.17 0.16 0.17 0.13 0.13 0.14 0.19 0.15 0.36 H H 0.09 0.13 0.13 0.12 0.09 0.10 0.10 0.10 0.15 0.13 0.14 0.15 0.12 0.14 0.13 0.12 0.13 0.16 0.08 0.40 K K 0.14 0.15 0.15 0.12 0.17 0.15 0.12 0.18 0.20 0.14 0.17 0.17 0.18 0.17 0.15 0.15 0.15 0.20 0.18 0.19

184

Table D.4. Entropy and potential MJ potential matrices in reduced alphabet. Fractional cc entropy matrix (left) and Miyazawa-Jernigan potential matrix (right) shown in reduced alphabet of amino acids (H: hydrophobic, P: polar, A: acidic/negatively charged, B: basic/positively charged). Fractional cc entropy MJ2h potential energy

H P A B H P A B H 0.11 0.12 0.14 0.13 H -6.33 -5.09 -3.80 -4.14 P 0.12 0.14 0.15 0.16 P -5.09 -4.24 -3.33 -3.45 A 0.14 0.15 0.19 0.14 A -3.80 -3.33 -2.06 -3.15 B 0.13 0.16 0.14 0.18 B -4.14 -3.45 -3.15 -2.35

Table D.5. List of 105 CASP11 targets used for training the knowledge based free energy functions (KBFs). Targets are indicated with their target IDs and domain number separated by a hyphen. T0759-D1 T0780-D1 T0801-D1 T0820-D1 T0840-D1 T0759-D2 T0780-D2 T0803-D1 T0820-D2 T0840-D2 T0760-D1 T0781-D1 T0805-D1 T0821-D1 T0841-D1 T0761-D1 T0781-D2 T0806-D1 T0822-D1 T0843-D1 T0761-D2 T0782-D1 T0807-D1 T0823-D1 T0845-D1 T0762-D1 T0783-D1 T0808-D1 T0824-D1 T0845-D2 T0763-D1 T0783-D2 T0808-D2 T0827-D1 T0847-D1 T0764-D1 T0784-D1 T0810-D1 T0827-D2 T0848-D1 T0765-D1 T0785-D1 T0810-D2 T0829-D1 T0848-D2 T0766-D1 T0786-D1 T0811-D1 T0830-D1 T0849-D1 T0767-D1 T0789-D1 T0812-D1 T0830-D2 T0851-D1 T0767-D2 T0789-D2 T0813-D1 T0831-D1 T0852-D1 T0768-D1 T0790-D1 T0814-D1 T0831-D2 T0852-D2 T0769-D1 T0790-D2 T0814-D2 T0832-D1 T0853-D1 T0770-D1 T0791-D1 T0814-D3 T0833-D1 T0853-D2 T0771-D1 T0791-D2 T0815-D1 T0834-D1 T0854-D1 T0772-D1 T0792-D1 T0816-D1 T0834-D2 T0854-D2 T0773-D1 T0794-D1 T0817-D1 T0835-D1 T0855-D1 T0774-D1 T0794-D2 T0817-D2 T0836-D1 T0856-D1 T0776-D1 T0796-D1 T0818-D1 T0837-D1 T0857-D1 T0777-D1 T0800-D1 T0819-D1 T0838-D1 T0858-D1

185

Table D.6. Optimized weights for each term obtained using particle swarm optimization for each KBF. 푤4푏 = weight of four body sequential term, 푤푔푓 = weight of four body non-sequential term, 푤푠푟 = weight of short-range potential term, 푤푐푐 = weight of contact change based entropy term. Optimal Value Function 푤 푤 푤 푤 of Function 4푏 푔푓 푠푟 푐푐 KBF_corrPa -0.49 1.00 0.99 0.13 7.35 KBF_corrSa -0.44 1.00 0.61 0.18 9.31 KBF_corrKa -0.32 1.00 0.57 0.17 9.54 KBF_rankNa 29.71 1.00 0.00 0.74 10.84 KBF_bdRMSDa 6.66 1.00 3.33 0.56 19.17 KBF_bdZscorea -2.54 1.00 0.00 0.00 1.79 KBF_corrPb -0.55 1.00 0.80 0.11 4.66 KBF_corrSb -0.45 1.00 0.56 0.11 3.38 KBF_corrKb -0.32 1.00 0.58 0.10 3.59 KBF_rankNb 21.12 1.00 0.00 0.44 4.04 KBF_bdZscoreb 8.42 1.00 8.34 0.36 13.28 KBF_bdRMSDb -3.24 1.00 0.00 0.00 20.00 aOptimization performed using entropy term computed using fractional contact change matrix bOptimization performed using entropy term computed using normalized contact change matrix

186

Table D.7. Performance of the novel optimized KBFs in comparison with 23 other residue-level statistical potential functions and 3 all-atom statistical potential functions. Metrics represent average values across 143 푹푴푺푫 protein targets: 푵_풏풖풎 = total number of natives (out of 143) extracted, 흆푷 = Pearson correlation with 푹푴푺푫 푹푴푺푫 RMSD from native, 흆푺 = Spearman’s rank correlation with RMSD from native, 흉푲 = Kendall’s rank 푻푴푺 correlation with RMSD from native, 푵_풓풂풏풌 = rank of native structure, 흆푷 = Pearson correlation with TM- 푻푴푺 푻푴푺 score with native, 흆푺 = Spearman’s rank correlation with TM-score with native, 흉푲 = Kendall’s rank correlation with TM-score with native, 푵_풁풔풄풐풓풆 = Z-score of native structure, 풃풅_푹푴푺푫 = RMSD of best decoy from native structure, 풃풅푻푴푺 = TM-score of best decoy with native structure, 풃풅_풁풄풐풓풆 = Z-score of best decoy. 푹푴푺푫 푹푴푺푫 푹푴푺푫 푻푴푺 푻푴푺 푻푴푺 Function 푵_풏풖풎 흆푷 흆푺 흉푲 푵_풓풂풏풌 흆푷 흆푺 흉푲 푵_풁풔풄풐풓풆 풃풅_푹푴푺푫 풃풅_푻푴푺 풃풅풁풔풄풐풓풆 KBF_corrSa 97 0.58 0.61 0.47 1.76 -0.59 -0.62 -0.49 -1.92 5.92 0.73 1.59 KBF_corrKa 96 0.58 0.60 0.47 1.77 -0.59 -0.61 -0.48 -1.92 5.94 0.73 1.60 KBF_corrPa 91 0.59 0.63 0.48 1.87 -0.62 -0.65 -0.51 -1.86 5.91 0.72 1.59 KBF_rankNa 91 0.48 0.50 0.38 1.80 -0.49 -0.51 -0.39 -1.73 6.82 0.68 1.48 KBF_bdRMSDa 90 0.58 0.62 0.47 1.88 -0.61 -0.64 -0.50 -1.86 5.94 0.73 1.56 KBF_corrKb 87 0.57 0.61 0.47 1.91 -0.63 -0.65 -0.52 -1.77 6.07 0.71 1.56 KBF_corrPb 87 0.58 0.62 0.48 1.92 -0.63 -0.65 -0.52 -1.80 5.84 0.73 1.58 KBF_corrSb 87 0.57 0.61 0.47 1.92 -0.63 -0.65 -0.52 -1.76 6.08 0.71 1.56 KBF_rankNb 86 0.47 0.52 0.40 1.87 -0.52 -0.55 -0.43 -1.66 6.83 0.68 1.48 KBF_bdZscoreb 79 0.47 0.46 0.35 2.26 -0.41 -0.43 -0.33 -1.65 7.06 0.67 1.70 KBF_bdRMSDb 78 0.55 0.59 0.45 2.42 -0.62 -0.63 -0.49 -1.59 5.74 0.74 1.56 4BOPTc 72 0.44 0.51 0.39 2.38 -0.54 -0.57 -0.45 -1.46 6.71 0.70 1.47 KBF_bdZscorea 71 0.52 0.56 0.42 2.37 -0.58 -0.59 -0.46 -1.61 6.11 0.71 1.65 BT 58 0.41 0.44 0.33 3.75 -0.47 -0.47 -0.36 -1.49 5.53 0.75 1.80 Qp 57 0.39 0.40 0.30 4.00 -0.41 -0.42 -0.32 -1.32 6.25 0.74 1.94 BFKV 57 0.42 0.46 0.35 3.41 -0.48 -0.49 -0.37 -1.52 5.51 0.74 1.86 TD 57 0.39 0.40 0.30 4.32 -0.43 -0.41 -0.32 -1.30 5.78 0.73 1.82 SKJG 55 0.34 0.41 0.31 3.04 -0.42 -0.45 -0.34 -1.37 6.06 0.72 1.62 fourBodyd 53 0.42 0.48 0.36 3.04 -0.52 -0.52 -0.40 -1.30 7.16 0.69 1.59 genFoure 53 0.46 0.52 0.39 3.48 -0.56 -0.57 -0.44 -1.25 6.41 0.72 1.55 VD 53 0.39 0.43 0.31 3.87 -0.45 -0.46 -0.34 -1.37 6.18 0.73 1.82 MJ3h 51 0.40 0.43 0.32 4.17 -0.46 -0.45 -0.35 -1.34 5.64 0.73 1.82 HLPL 50 0.38 0.39 0.29 4.01 -0.40 -0.41 -0.31 -1.28 6.38 0.73 2.00 SKOa 49 0.31 0.38 0.28 3.20 -0.38 -0.42 -0.31 -1.27 7.34 0.72 1.57 SKOb 48 0.37 0.42 0.31 3.02 -0.44 -0.46 -0.34 -1.45 6.50 0.72 1.88 4BOPT_GNMf 46 0.20 0.33 0.26 3.18 -0.32 -0.41 -0.33 -1.06 14.03 0.58 1.80 TEl 46 0.38 0.42 0.31 4.07 -0.46 -0.45 -0.35 -1.36 5.84 0.73 1.86 TEs 46 0.39 0.43 0.31 3.82 -0.47 -0.47 -0.35 -1.39 5.75 0.74 1.88 MJ3 46 0.35 0.41 0.30 3.43 -0.43 -0.44 -0.33 -1.36 6.26 0.71 1.73 MJPL 45 0.31 0.29 0.22 5.05 -0.29 -0.29 -0.22 -1.02 6.87 0.69 1.86 MJ2h 45 0.33 0.31 0.24 5.10 -0.31 -0.32 -0.24 -1.05 6.94 0.69 1.84 Qm 42 0.28 0.36 0.27 3.85 -0.36 -0.39 -0.29 -1.21 7.53 0.71 1.64 MS 42 0.36 0.41 0.30 3.63 -0.44 -0.44 -0.33 -1.34 6.16 0.72 1.79 shortRangeg 39 0.24 0.29 0.22 3.76 -0.30 -0.32 -0.24 -1.05 9.63 0.57 1.40 GKS 38 0.31 0.35 0.25 4.43 -0.37 -0.38 -0.28 -1.18 6.47 0.71 1.80 Qa 37 0.27 0.36 0.27 4.10 -0.36 -0.39 -0.29 -1.16 6.79 0.71 1.62 TS 36 0.30 0.28 0.21 5.36 -0.27 -0.28 -0.21 -1.00 6.89 0.68 1.89 MSBM 17 0.01 0.00 0.00 8.24 0.00 0.02 0.01 -0.11 9.86 0.61 1.17 RO 7 0.12 0.18 0.13 8.06 -0.18 -0.20 -0.15 -0.36 10.64 0.64 1.49 MJ1 0 -0.24 -0.20 -0.15 14.34 0.19 0.19 0.14 0.76 17.51 0.50 1.80 calRW 110 0.59 0.61 0.48 1.71 -0.63 -0.65 -0.51 -1.69 4.91 0.78 1.32 calRWplus 106 0.59 0.61 0.48 1.78 -0.63 -0.65 -0.52 -1.69 4.87 0.78 1.34 dDFIRE 100 0.54 0.58 0.45 2.87 -0.61 -0.63 -0.50 -1.41 4.79 0.78 1.40 a KBF obtained from optimization performed using entropy term computed using fractional contact change matrix b KBF obtained from optimization performed using entropy term computed using normalized contact change matrix c Optimized potential function from [Gniewek et al., (2011)] d Sequential four-body potential function developed in [Feng et al., (2007)] e Non-sequential four-body potential function introduced in [Feng et al., (2010)] f Free energy function combining optimized potential with entropies from elastic network models introduced in [Zimmermann et al., (2012)] g Short range potentials introduced in [I. Bahar et al., (1997b)] h Side-chain orientation dependent all-atom statistical potentials introduced in [Zhang and Zhang, (2010)]. i All-atom statistical potential introduced in [Yang and Zhou, (2008)]. All other abbreviated potentials as described in [Pokarowski et al., (2005)]

187

Figure D.1. Schematic workflow of the methodology for extracting contact change matrices. Structures for 167 pairs of alternate conformations of proteins are obtained from the MolMovDB. Contact maps are constructed for each form using the distance cutoff of 4.5 A between heavy atoms. Contact maps of the two forms of each protein are compared and statistics of changed and unchanged contacts are collected for every pair of amino acids and summed up for all proteins to obtain the contact change matrices.

188

Figure D.2. Heatmap of the fractional amino acid contact change matrix. (a) Heatmap showing the fractional amino acid contact changes between pairs of amino acids expressed as ratio of number of changed contacts to total number of contacts (blue = low entropy, red = high entropy). Amino acids are represented by the corresponding one-letter codes, clustered hierarchically using average linkage method and ordered in each matrix as shown in the dendrograms.

189

Figure D.3. Heatmap of the normalized amino acid contact change matrix. (a) Heatmap showing the normalized amino acid contact changes between pairs of amino acids expressed as ratio of observed to expected probability of contacts (blue = low entropy, red = high entropy). Amino acids are represented by the corresponding one-letter codes, clustered hierarchically using average linkage method and ordered in each matrix as shown in the dendrograms.

190

Figure D.4. Principal component analysis of contact change matrices. (a) Percentage of variance captured by each individual PC (blue) of the normalized contact change matrix. Cumulative percentage of variance is shown in red lines. The first PC captures 43.8% of the variance and the subsequent PCs capture much lesser variance.

Figure D.5. Value of each principal component eigenvector for each of the 20 amino acids. Heatmap showing the value of each eigenvector (columns) along each of the 20 amino acids (rows).

191

APPENDIX E. SUPPLEMENTARY MATERIAL FOR CHAPTER 4

Table E.1. PC standard deviation of wild type and mutant cadherin dimers in the MD simulations

Cadherin dimers PC1 standard Standard deviation deviation (nm) normal to PC1 (nm)

W2A X-dimer 1.8 4.4

W2F Ncad S-dimer 7.7 6.9

WT S-dimer 10.3 5.5

K14E S-dimer 5.1 6.8

Table E.2. Separation angle between opposing EC1 domains observed in the MD simulations of wild type and mutant cadherin dimers.

Cadherin dimers Initial Angle (i) Final Angle (f) i-f

W2A X-dimer 23 21 2

W2F S-dimer 80 45 35

WT S-dimer 85 29 56

K14E S-dimer 103 77 26

192

Table E.3. Fitted parameters for WT and K14E S-dimers. The data was fit to the slip bond model described in (44). The WT-Ecad data at 3.0 s and 0.3 s contact times were globally fit to obtain best-fit parameters across both datasets.

∗ ∗ Cadherins 흉ퟎ (풔) 풙 (풏풎) ∆푮 (풌푩푻) 

K14E 9.2 0.78 52.6 0.51 0.3 s contact time

WT (Global Fit) 0.31 0.31 18.2 0.51 3.0 s and 0.3 s contact time

Table E.4. Fitted parameters for W2F E-cadherins. The W2F data at 3.0 s and 0.3 s contact times were globally fit to the slip bond model described in (44) to obtain best-fit parameters across both datasets. In the fitting procedure, all parameters were shared across both data sets except for 휏0.

∗ ∗ Cadherins 흉ퟎ (풔) 풙 (풏풎) ∆푮 (풌푩푻) 

W2F 0.04 0.0 9.0 0.51 3.0 s contact time

W2F 0.02 0.0 9.0 0.51 0.3 s contact time

193

Figure E.1. Accessible surface area of wild type and mutant cadherin dimers during MD simulations. Total solvent accessible surface area (SASA) of (a) W2A X-dimer (b) W2F Ncad S- dimer (c) wild type (WT) S-dimer and (d) K14E S-dimer, show that W2A and K14E mutants have the lowest and highest average SASA, respectively, remaining fairly constant throughout their simulations. In contrast W2F Ncad and WT S-dimer cadherins SASAs decrease to the X-dimer SASA value.

194

Figure E.2. Typical single molecule force clamp measurement. (a) Specific unbinding events correspond to the stretching of two PEG tethers clamped at a constant force. (b) Specific events are directly filtered by converting force clamp events (example in Figure 5b) to force vs distance data and selecting events that correspond to the stretching of two PEG tethers. As a result, (c) the survival probabilities of specific cadherin interactions decay as a single exponential.

195

Figure E.3. At intercellular junctions, cadherins form a 2D lattice composed of alternating trans and cis interactions. Crystal structures show that trans-binding of cadherins from opposing cells do not interfere with cis-binding of cadherins on the same cell surface (49). The orientation of the trans- and the cis- binding interface limit the dynamic motion of cadherins in directions orthogonal to force application. Cadherins from the same cell surface are represented by the same color (blue or red). Cyan amino acid residues (V81 and L175) participate in cis interaction while W2 involved in S-dimer interactions are shown in yellow.

196

Figure E.4. Force and end-to-end distance measured as a function of time as cadherin dimers dissociate in 150-SMD simulations. SMD simulations of (a,b) W2A Ecad X-dimer (c,d) W2F Ncad dimer intermediate state (e,f) WT Ecad dimer intermediate state, and (g,h) K14E Ecad S-dimer. Each SMD simulation was performed at a constant velocity of 0.4 nm/ns.

197

Figure E.5. Force and end-to-end distance measured as a function of time as cadherin dimers dissociate in 140-SMD simulations. SMD simulations of (a,b) W2A Ecad X-dimer (c,d) W2F Ncad dimer intermediate state (e,f) WT Ecad dimer intermediate state, and (g,h) K14E Ecad S-dimer. Each SMD simulation was performed at a constant velocity of 0.4 nm/ns.

198

Figure E.6. Force and end-to-end distance measured as a function of time as cadherin dimers dissociate in 100-SMD simulations. SMD simulations of (a,b) W2A Ecad X-dimer (c,d) W2F Ncad dimer intermediate state (e,f) WT Ecad dimer intermediate state, and (g,h) K14E Ecad S-dimer. Each SMD simulation was performed at a constant velocity of 0.4 nm/ns.

199

BIBLIOGRAPHY

Abdi, H., Williams, L.J., (2010). Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2(4): 433–459. Adamczyk, K., Candelaresi, M., Robb, K., Gumiero, A., Walsh, M.A., Parker, A.W., Hoskisson, P.A., Tucker, N.P., Hunt, N.T., (2012). Measuring protein dynamics with ultrafast two-dimensional infrared spectroscopy. Meas. Sci. Technol. 23(6): 062001. Alder, B.J., Wainwright, T.E., (1959). Studies in Molecular Dynamics. I. General Method. J. Chem. Phys. 31(2): 459. Altis, A., Nguyen, P.H., Hegger, R., Stock, G., (2007). Dihedral angle principal component analysis of molecular dynamics simulations. J. Chem. Phys. 126(24): 244111. Amadei, A., Ceruso, M.A., Di Nola, A., (1999). On the convergence of the conformational coordinates basis set obtained by the essential dynamics analysis of proteins’ molecular dynamics simulations. Proteins 36(4): 419–24. Amadei, A., Linssen, A.B., de Groot, B.L., van Aalten, D.M., Berendsen, H.J., (1996). An efficient method for sampling the essential subspace of proteins. J. Biomol. Struct. Dyn. 13(4): 615–625. Amadei, A., Linssen, A.B.M., Berendsen, H.J.C., (1993). Essential Dynamics of Proteins. Proteins 17: 412–425. Andricioaei, I., Karplus, M., (2001). On the calculation of entropy from covariance matrices of the atomic fluctuations. J. Chem. Phys. 115(14): 6289–6292. Anheim, N., Inouye, M., Law, L., Laudin, A., (1973). Chemical Studies on the Enzymatic Specificity of Goose Egg White Lysozyme. J. Biol. Chem. 248(1): 233–236. Arnold, G.E., Manchester, J.I., Townsend, B.D., Ornstein, R.L., (1994). Investigation of domain motions in bacteriophage T4 lysozyme. J. Biomol. Struct. Dyn. 12(2): 457–474. Arnold, G.E., Ornstein, R.L., (1997). Protein hinge bending as seen in molecular dynamics simulations of native and M61 mutant T4 lysozymes. Biopolymers 41(5): 533–544. Atilgan, A.R., Durell, S.R., Jernigan, R.L., Demirel, M.C., Keskin, O., Bahar, I., (2001). Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys. J. 80(1): 505–15. Atilgan, C., Atilgan, A.R., (2009). Perturbation-response scanning reveals ligand entry-exit mechanisms of ferric binding protein. PLoS Comput. Biol. 5(10): e1000544. Atkinson, N., Feike, D., Mackinder, L.C.M., Meyer, M.T., Griffiths, H., Jonikas, M.C., Smith, A.M., Mccormick, A.J., (2016). Introducing an algal carbon-concentrating mechanism into higher plants: Location and incorporation of key components. Plant Biotechnol. J. 14(5): 1302–1315. Ayton, G.S., Noid, W.G., Voth, G.A., (2007). Multiscale modeling of biomolecular systems: in serial and in parallel. Curr. Opin. Struct. Biol. 17(2): 192–198. Bahar, I., (2010). On the functional significance of soft modes predicted by coarse-grained models for membrane proteins. J. Gen. Physiol. 135(6): 563–73.

200

Bahar, I., Atilgan, A.R., Erman, B., (1997). Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Fold. Des. 2(3): 173–81. Bahar, I., Erman, B., Haliloglu, T., Jernigan, R.L., (1997a). Efficient characterization of collective motions and interresidue correlations in proteins by low-resolution simulations. Biochemistry 36(44): 13512–13523. Bahar, I., Jernigan, R.L., (1997). Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. 266(1): 195–214. Bahar, I., Jernigan, R.L., (1994). Cooperative Structural Transitions Induced by Non- Homogeneous Intramolecular Interactions in Compact Globular Proteins. Biophys. J. 66(2): 467–481. Bahar, I., Kaplan, M., Jernigan, R.L., (1997b). Short-range conformational energies, secondary structure propensities, and recognition of correct sequence-structure matches. Proteins Struct. Funct. Genet. 29(3): 292–308. Bahar, I., Lezon, T.R., Yang, L.-W., Eyal, E., (2010). Global dynamics of proteins: bridging between structure and function. Annu. Rev. Biophys. 39: 23–42. Bahar, I., Rader, A.J., (2005). Coarse-grained normal mode analysis in structural biology. Curr. Opin. Struct. Biol. 15(5): 586–592. Bakan, A., Bahar, I., (2011). Computational generation inhibitor-bound conformers of p38 map kinase and comparison with experiments. Pac. Symp. Biocomput. (Md): 181–192. Bakan, A., Meireles, L.M., Bahar, I., (2011). ProDy: protein dynamics inferred from theory and experiments. Bioinformatics 27(11): 1575–7. Barik, A., C, N., P, M., Bahadur, R.P., (2012). A protein-RNA docking benchmark (I): nonredundant cases. Proteins 80(7): 1866–1871. Bartesaghi, A., Merk, A., Banerjee, S., Matthews, D., Wu, X., Milne, J.L.S., Subramaniam, S., (2015). 2.2 Å resolution cryo-EM structure of β-galactosidase in complex with a cell-permeant inhibitor. Science 348(6239): 1147–1152. Bastolla, U., Farwer, J., Knapp, E.W., Vendruscolo, M., (2001). How to Guarantee Optimal Stability for Most Representative Structures in the Protein Data Bank. Proteins Struct. Funct. Bioinforma. 44(2): 79–96. Bateman, A., (2007). ClustalW and ClustalX version 2.0. Bioinformatics 21(23): 2947–8. Bayas, M. V, Schulten, K., Leckband, D., (2004). Forced dissociation of the strand dimer interface between C-cadherin ectodomains. Mech. Chem. Biosyst. 1(2): 101–111. Berendsen, C., Edholmt, O., (1984). Free Energy Determination of Polypeptide Conformations Generated by Molecular Dynamics. Macromolecules 17(10): 2044– 2050. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E., (2000). The Protein Data Bank. Nucleic Acids Res. 28(1): 235–42. Bermingham, A., Derrick, J.P., (2002). The folic acid biosynthesis pathway in bacteria: evaluation of potential for antibacterial drug discovery. Bioessays 24(7): 637–48.

201

Betancourt, M.R., Thirumalai, D., (1999). Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes. Protein Sci. 8(2): 361–369. Bhattacharyya, R., Chakrabarti, P., (2003). Stereospecific interactions of proline residues in protein structures and complexes. J. Mol. Biol. 331(4): 925–940. Blake, C.C.F., Koenig, D.F., Mair, G.A., North, A.C.T., Phillips, D.C., Sarma, V.R., (1965). Structure of hen egg-white lysozyme, a three dimensional fourier synthesis at 2~Ångstroms resolution. Nature 206(4986): 757–761. Blanco, F.J., Montoya, G., (2011). Transient DNA / RNA-protein interactions. FEBS J. 278(10): 1643–1650. Boggon, T.J., Murray, J., Chappuis-Flament, S., Wong, E., Gumbiner, B.M., Shapiro, L., (2002). C-cadherin ectodomain structure and implications for cell adhesion mechanisms. Science 296(5571): 1308–13. Bolla, J.R., Su, C.-C., Delmar, J.A., Radhakrishnan, A., Kumar, N., Chou, T.-H., Long, F., Rajashankar, K.R., Yu, E.W., (2015). Crystal structure of the Alcanivorax borkumensis YdaH transporter reveals an unusual topology. Nat. Commun. 6: 6874. Bourne, C., (2014). Utility of the Biosynthetic Folate Pathway for Targets in Antimicrobial Discovery. Antibiotics 3(1): 1–28. Bradley, P., Malmstrom, L., Qian, B., Schonbrun, J., Chivian, D., Kim, D.E., Meiler, J., Misura, K.M.S., Baker, D., (2005). Free modeling with Rosetta in CASP6, in: Proteins: Structure, Function and Genetics. pp. 128–134. Brady, G.P., Sharp, K.A., (1997). Entropy in protein folding and in protein-protein interactions. Curr. Opin. Struct. Biol. 7(2): 215–221. Brandl, M., Weiss, M.S., Jabs, A., Sühnel, J., Hilgenfeld, R., (2001). C-H...π-interactions in proteins. J. Mol. Biol. 307(1): 357–377. Brasch, J., Harrison, O.J., Honig, B., Shapiro, L., (2012). Thinking outside the cell: How cadherins drive adhesion. Trends Cell Biol. 22(6): 299–310. Brooks, B., Brooks, B., Karplus, M., Karplus, M., (1983). Harmonic dynamics of proteins: normal modes and fluctuations in bovine pancreatic trypsin inhibitor. Proc Natl Acad Sci U S A 80(21): 6571–6575. Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., Karplus, M., (1983). CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4(2): 187–217. Brooks, C.L., Gruebele, M., Onuchic, J.N., Wolynes, P.G., (1998). Chemical physics of protein folding. Proc. Natl. Acad. Sci. 95(19): 11037–11038. Brooks, C.L., Onuchic, J.N., Wales, D.J., (2001). Statistical thermodynamics. Taking a walk on a landscape. Science 293(5530): 612–613. Bryngelson, J.D., Onuchic, J.N., Socci, N.D., Wolynes, P.G., (1995). Funnels, pathways, and the energy landscape of protein folding: A synthesis. Proteins Struct. Funct. Genet. 21(3): 167–195.

202

Bu, Z., Callaway, D.J.E., (2011). Proteins move! Protein dynamics and long-range allostery in cell signaling. Adv. Protein Chem. Struct. Biol. 83: 163–221. Buchete, N. V., Straub, J.E., Thirumalai, D., (2004). Development of novel statistical potentials for protein fold recognition. Curr. Opin. Struct. Biol. 14(2): 225–232. Burow, M.D., Chen, Z.Y., Mouton, T.M., Moroney, J. V, (1996). Isolation of cDNA clones of genes induced upon transfer of Chlamydomonas reinhardtii cells to low CO2. Plant Mol Biol 31(2): 443–448. Burton, B., Zimmermann, M.T., Jernigan, R.L., Wang, Y., (2012). A computational investigation on the connection between dynamics properties of ribosomal proteins and ribosome assembly. PLoS Comput. Biol. 8(5): e1002530. Cailliez, F., Lavery, R., (2006). Dynamics and stability of E-cadherin dimers. Biophys. J. 91(11): 3964–71. Careri, G., Fasella, P., Gratton, E., (1979). Enzyme dynamics: the statistical physics approach. Annu. Rev. Biophys. Bioeng. 8(10): 69–97. Carter, C.W., LeFebvre, B.C., Cammer, S.A., Tropsha, A., Edgell, M.H., (2001). Four-body potentials reveal protein-specific correlations to stability changes caused by hydrophobic core mutations. J. Mol. Biol. 311(4): 625–638. Carter, D.C., He, X., Munson, S.H., Twigg, P.D., Gernert, K.M., Broom, M.B., Miller, T.Y., (1989). Three-Dimensional Structure of Human Serum Albumin. Science 244: 1195– 1198. Carter, E.L., Jager, L., Gardner, L., Hall, C.C., Willis, S., Green, J.M., (2007). Escherichia coli abg genes enable uptake and cleavage of the folate catabolite p-aminobenzoyl- glutamate. J. Bacteriol. 189(9): 3329–3334. Casu, F., Duggan, B.M., Hennig, M., (2013). The arginine-rich RNA-binding motif of HIV-1 Rev is intrinsically disordered and folds upon RRE binding. Biophys. J. 105(4): 1004– 1017. Celik, L., Schiøtt, B., Tajkhorshid, E., (2008). Substrate binding and formation of an occluded state in the leucine transporter. Biophys. J. 94: 1600–1612. Chan, H.S., Dill, K.A., (1998). Protein folding in the landscape perspective: chevron plots and non-Arrhenius kinetics. Proteins 30(1): 2–33. Chang, C.A., Trylska, J., Tozzini, V., Andrew McCammon, J., (2007). Binding pathways of ligands to HIV-1 Protease: Coarse-grained and atomistic simulations. Chem. Biol. Drug Des. 69: 5–13. Chang, C.-E., Shen, T., Trylska, J., Tozzini, V., McCammon, J.A., (2006). Gated binding of ligands to HIV-1 protease: Brownian dynamics simulations in a coarse-grained model. Biophys. J. 90: 3880–3885. Cheluvaraja, S., Meirovitch, H., (2005). Calculation of the entropy and free energy by the hypothetical scanning Monte Carlo method: Application to peptides. J. Chem. Phys. 122(5).

203

Chen, J., Brooks, C.L., (2007). Can molecular dynamics simulations provide high-resolution refinement of protein structure? Proteins Struct. Funct. Genet. 67(4): 922–930. Chennubhotla, C., Rader, a J., Yang, L.-W., Bahar, I., (2005). Elastic network models for understanding biomolecular machinery: from enzymes to supramolecular assemblies. Phys. Biol. 2(4): S173–80. Cherezov, V., Rosenbaum, D.M., Hanson, M.A., Rasmussen, S.G.F., Thian, F.S., Kobilka, T.S., Choi, H.-J., Kuhn, P., Weis, W.I., Kobilka, B.K., Stevens, R.C., (2007). High- resolution crystal structure of an engineered human beta2-adrenergic G protein-coupled receptor. Science 318(5854): 1258–1265. Cheung, M.S., Chavez, L.L., Onuchic, J.N., (2004). The energy landscape for protein folding and possible connections to function. Polymer 45(2): 547–555. Chovancova, E., Pavelka, A., Benes, P., Strnad, O., Brezovsky, J., Kozlikova, B., Gora, A., Sustr, V., Klvana, M., Medek, P., Biedermannova, L., Sochor, J., Damborsky, J., (2012). CAVER 3.0: a tool for the analysis of transport pathways in dynamic protein structures. PLoS Comput. Biol. 8(10): e1002708. Ciatto, C., Bahna, F., Zampieri, N., VanSteenhouse, H.C., Katsamba, P.S., Ahlsen, G., Harrison, O.J., Brasch, J., Jin, X., Posy, S., Vendome, J., Ranscht, B., Jessell, T.M., Honig, B., Shapiro, L., (2010). T-cadherin structures reveal a novel adhesive binding mechanism. Nat. Struct. Mol. Biol. 17(3): 339–347. Clarke, D.M., Loo, T.W., MacLennan, D.H., (1990a). Functional consequences of alterations to polar amino acids located in the transmembrane domain of the Ca2+-ATPase of sarcoplasmic reticulum. J. Biol. Chem. 265(11): 6262–6267. Clarke, D.M., Loo, T.W., MacLennan, D.H., (1990b). Functional consequences of mutations of conserved amino acids in the β-strand domain of the Ca2+-ATPase of sarcoplasmic reticulum. J. Biol. Chem. 265(24): 14088–14092. Clarke, D.M., Maruyama, K., Loo, T.W., Leberer, E., Inesi, G., MacLennan, D.H., (1989). Functional consequences of glutamate, aspartate, glutamine, and asparagine mutations in the stalk sector of the Ca2+-ATPase of sarcoplasmic reticulum. J. Biol. Chem. 264(19): 11246–11251. Conte, L. Lo, Ailey, B., Hubbard, T.J.P., Brenner, S.E., Murzin, A.G., Chothia, C., (2000). SCOP: a Structural Classification of Proteins database. Nucleic Acids Res. 28(1): 257– 259. Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz, K.M., Ferguson, D.M., Spellmeyer, D.C., Fox, T., Caldwell, J.W., Kollman, P.A., (1995). A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 117(19): 5179–5197. Cozzetto, D., Kryshtafovych, A., Fidelis, K., Moult, J., Rost, B., Tramontano, A., (2009). Evaluation of template-based models in CASP8 with standard measures. Proteins Struct. Funct. Bioinforma. 77(SUPPL. 9): 18–28.

204

Creamer, T.P., Rose, G.D., (1992). Side-chain entropy opposes a-helix formation but rationalizes experimentally determined helix-forming propensities. Nat. Sci. 89(250): 5937–5941. Czaplewski, C., Rodziewicz-Motowidło, S., Da̧bal, M., Liwo, A., Ripoll, D.R., Scheraga, H.A., (2003). Molecular simulation study of cooperativity in hydrophobic association: Clusters of four hydrophobic particles. Biophys. Chem. 105(2-3): 339–359. Czaplewski, C., Rodziewicz-Motowidło, S., Liwo, A., Ripoll, D.R., Wawak, R.J., Scheraga, H.A., (2000). Molecular simulation study of cooperativity in hydrophobic association [ In Process Citation ] Molecular simulation study of cooperativity in hydrophobic association. Protein Sci. 9(6): 1235–1245. Daniels, N.M., Gallant, A., Peng, J., Cowen, L.J., Baym, M., Berger, B., (2013). Compressive genomics for protein databases. Bioinformatics 29(13): i283–i290. de Groot, B.L., Hayward, S., van Aalten, D.M., Amadei, A., Berendsen, H.J., (1998). Domain motions in bacteriophage T4 lysozyme: a comparison between molecular dynamics and crystallographic data. Proteins 31(2): 116–127. de Juan, D., Pazos, F., Valencia, A., (2013). Emerging methods in protein co-evolution. Nat. Rev. Genet. 14(4): 249–61. De Vries, S.J., Schindler, C.E.M., Chauvot De Beauchêne, I., Zacharias, M., (2015). A web interface for easy flexible protein-protein docking with ATTRACT. Biophys. J. 108(3): 462–465. Denison, M.R., (2008). Seeking Membranes: Positive-Strand RNA Virus Replication Complexes. PLoS Biol 6(10): e270. Dill, K.A., Phillips, A.T., Rosen, J.B., (1997). Protein structure and energy landscape dependence on sequence using a continuous energy function. J. Comput. Biol. 4(3): 227–239. Dixit, A., Verkhivker, G.M., (2011). The energy landscape analysis of cancer mutations in protein kinases. PLoS One 6(10): e26071. Dixon, M.M., Nicholson, H., Shewchuk, L., Baase, W.A., Matthews, B.W., (1992). Structure of a hinge-bending bacteriophage T4 lysozyme mutant, Ile3→Pro. J. Mol. Biol. 227(3): 917–933. Dockal, M., Carter, D.C., Rüker, F., (1999). The three recombinant domains of human serum albumin. Structural characterization and ligand binding properties. J. Biol. Chem. 274(41): 29303–10. Doig, A.J., Sternberg, M.J., (1995). Side-chain conformational entropy in protein folding. Protein Sci. 4(11): 2247–2251. Domingues, F.S., Koppensteiner, W.A., Jaritz, M., Prlic, A., Weichenberger, C., Wiederstein, M., Floeckner, H., Lackner, P., Sippl, M.J., (1999). Sustained performance of knowledge-based potentials in fold recognition. Proteins Struct. Funct. Genet. 37(SUPPL. 3): 112–120.

205

DuBay, K.H., Geissler, P.L., (2009). Calculation of Proteins’ Total Side-Chain Torsional Entropy and Its Influence on Protein-Ligand Interactions. J. Mol. Biol. 391(2): 484–497. Dudko, O.K., Hummer, G., Szabo, A., (2008). Theory, analysis, and interpretation of single- molecule force spectroscopy experiments. Proc. Natl. Acad. Sci. U. S. A. 105(41): 15755–15760. Dudko, O.K., Hummer, G., Szabo, A., (2006). Intrinsic rates and activation free energies from single-molecule pulling experiments. Phys. Rev. Lett. 96: 108101. Dunker, A.K., Lawson, J.D., Brown, C.J., Williams, R.M., Romero, P., Oh, J.S., Oldfield, C.J., Campen, A.M., Ratliff, C.M., Hipps, K.W., Ausio, J., Nissen, M.S., Reeves, R., Kang, C., Kissinger, C.R., Bailey, R.W., Griswold, M.D., Chiu, W., Garner, E.C., Obradovic, Z., (2001). Intrinsically disordered protein. J. Mol. Graph. Model. 19(1): 26–59. Durand, P., Trinquier, G., Sanejouand, Y.H., (1994). A new approach for determining low- frequency normal modes in macromolecules. Biopolymers 34(6): 759–771. Durrant, J.D., McCammon, J.A., (2011). Molecular dynamics simulations and drug discovery. BMC Biol. 9: 71. Dyson, H.J., (2011). Roles of intrinsic disorder in protein–nucleic acid interactions. Mol. Biosyst. 8(1): 97–104. Earl, D.J., Deem, M.W., (2008). Monte Carlo simulations. Methods Mol. Biol. 443: 25–36. Echols, N., Milburn, D., Gerstein, M., (2003). MolMovDB: analysis and visualization of conformational change and structural flexibility. Nucleic Acids Res. 31(1): 478–482. Edholm, O., Berendsen, H.J.C., (1984). Entropy estimation from simulations of non-diffusive systems. Mol. Phys. 51(4): 1011–1028. Efron, B., (1982). The Jackknife, the Bootstrap and Other Resampling Plans, in: Society of Industrial and Applied Mathematics CBMS-NSF Monographs. p. 103. Ellis, J.J., Jones, S., (2008). Evaluating conformational changes in protein structures binding RNA. Proteins 70(4): 1518–1526. Emekli, U., Schneidman-duhovny, D., Wolfson, H.J., Nussinov, R., Haliloglu, T., (2008). HingeProt: automated prediction of hinges in protein structures. Proteins 70(4): 1219– 1227. Faber, H.R., Matthews, B.W., (1990). A mutant T4 lysozyme displays five different crystal conformations. Nature 348(6298): 263–266. Feliu, E., Aloy, P., Oliva, B., (2011). On the analysis of protein-protein interactions via knowledge-based potentials for the prediction of protein-protein docking. Protein Sci. 20(3): 529–541. Feng, Y., Kloczkowski, A., Jernigan, R.L., (2010). Potentials “R” Us web-server for protein energy estimations with coarse-grained knowledge-based potentials. BMC Bioinformatics 11: 92.

206

Feng, Y., Kloczkowski, A., Jernigan, R.L., (2007). Four-body contact potentials derived from two protein datasets to discriminate native structures from decoys. Proteins 68(1): 57– 66. Fenwick, R.B., van den Bedem, H., Fraser, J.S., Wright, P.E., (2014). Integrated description of protein dynamics from room-temperature X-ray crystallography and NMR. Proc. Natl. Acad. Sci. U. S. A. 111(4): E445–54. Fischer, N., Neumann, P., Konevega, A.L., Bock, L. V., Ficner, R., Rodnina, M. V., Stark, H., (2015). Structure of the E. coli ribosome-EF-Tu complex at less than 3 angstrom resolution by Cs-corrected cryo-EM. Nature 520: 567–570. Fleming, A., (1922). On a Remarkable Bacteriolytic Element found in Tissues and Secretions. Proc. R. Soc. B Biol. Sci. 93(653): 306–317. Fleming, A., Allison, V.D., (1922). Further Observations on a Bacteriolytic Element Found in Tissues and Secretions. Proc. R. Soc. B Biol. Sci. Flores, S., Echols, N., Milburn, D., Hespenheide, B., Keating, K., Lu, J., Wells, S., Yu, E.Z., Thorpe, M., Gerstein, M., (2006). The Database of Macromolecular Motions: new features added at the decade mark. Nucleic Acids Res. 34(Database issue): D296–301. Folster, J.P., Shafer, W.M., (2005). Regulation of mtrF expression in Neisseria gonorrhoeae and its role in high-level antimicrobial resistance. J. Bacteriol. 187(11): 3713–3720. Forrest, L.R., Rudnick, G., (2009). The rocking bundle: a mechanism for ion-coupled solute flux by symmetrical transporters. Physiology (Bethesda). 24(23): 377–386. Fraser, J.S., Clarkson, M.W., Degnan, S.C., Erion, R., Kern, D., Alber, T., (2009). Hidden alternative structures of proline isomerase essential for catalysis. Nature 462(7273): 669–673. Frauenfelder, H., Sligar, S.G., Wolynes, P.G., (1991). The Energy Landscapes and Motions of Proteins. Science 254(5038): 1598–1603. Friesner, R.A., Abel, R., Goldfeld, D.A., Miller, E.B., Murrett, C.S., (2013). Computational methods for high resolution prediction and refinement of protein structures. Curr. Opin. Struct. Biol. 23(2): 177–184. Fu, L., Niu, B., Zhu, Z., Wu, S., Li, W., (2012). CD-HIT: Accelerated for clustering the next- generation sequencing data. Bioinformatics 28(23): 3150–3152. Galicia-Vázquez, G., Lindqvist, L., Wang, X., Harvey, I., Liu, J., Pelletier, J., (2009). High- throughput assays probing protein–RNA interactions of eukaryotic translation initiation factors. Anal. Biochem. 384(1): 180–188. Gendoo, D.M.A., Harrison, P.M., (2012). The landscape of the prion protein’s structural response to mutation revealed by principal component analysis of multiple NMR ensembles. PLoS Comput. Biol. 8(8): e1002646. Gerlt, J.A., Babbitt, P.C., (2009). Enzyme (re)design: lessons from natural evolution and computation. Curr. Opin. Chem. Biol. 13(1): 10–18.

207

Gibrat, J.F., Go, N., (1990). Normal mode analysis of human lysozyme: study of the relative motion of the two domains and characterization of the harmonic motion. Proteins 8(3): 258–279. Gniewek, P., Leelananda, S.P., Kolinski, A., Jernigan, R.L., Kloczkowski, A., (2011). Multibody coarse-grained potentials for native structure recognition and quality assessment of protein models. Proteins 79(6): 1923–9. Go, N., Noguti, T., Nishikawa, T., (1983). Dynamics of a small globular protein in terms of low-frequency vibrational modes. Proc. Natl. Acad. Sci. U. S. A. 80(June): 3696–3700. Goldstein, H., Poole, P.C., Safko, J.L., (2002). Classical Mechanics, Pearson. Gore, J., Bryant, Z., Nöllmann, M., Le, M.U., Cozzarelli, N.R., Bustamante, C., (2006). DNA overwinds when stretched. Nature 442(7104): 836–839. Grant, B.J., Rodrigues, A.P.C., ElSawy, K.M., McCammon, J.A., Caves, L.S.D., (2006). Bio3d: an R package for the comparative analysis of protein structures. Bioinformatics 22(21): 2695–6. Gumbiner, B.M., (2005). Regulation of cadherin-mediated adhesion in morphogenesis. Nat. Rev. Mol. Cell Biol. 6(8): 622–34. Guo, D., Liu, S., Huang, Y., Xiao, Y., (2012). Preorientation of protein and RNA just before contacting. J. Biomol. Struct. Dyn. 31(7): 716–728. Halbleib, J.M., Nelson, W.J., (2006). Cadherins in development: Cell adhesion, sorting, and tissue morphogenesis. Genes Dev. 20(23): 3199–3214. Haliloglu, T., Bahar, I., (2015). Adaptability of protein structures to enable functional interactions and evolutionary implications. Curr. Opin. Struct. Biol. 35: 17–23. Hamacher, K., McCammon, J. a., (2006). Computing the amino acid specificity of fluctuations in biomolecular systems. J. Chem. Theory Comput. 2(3): 873–878. Hamelryck, T., Borg, M., Paluszewski, M., Paulsen, J., Frellsen, J., Andreetta, C., Boomsma, W., Bottaro, S., Ferkinghoff-Borg, J., (2010). Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS One 5(11): e13714. Hansmann, U.H., Okamoto, Y., (1999). New Monte Carlo algorithms for protein folding. Curr. Opin. Struct. Biol. 9(2): 177–83. Hanson, A.D., Gregory, J.F., (2011). Folate biosynthesis, turnover, and transport in plants. Annu. Rev. Plant Biol. 62: 105–25. Harris, T.J., Tepass, U., (2010). Adherens junctions: from molecules to morphogenesis. Nat Rev Mol Cell Biol 11(7): 502–514. Harrison, J.S., Higgins, C.D., O’Meara, M.J., Koellhoffer, J.F., Kuhlman, B.A., Lai, J.R., (2013). Role of electrostatic repulsion in controlling pH-dependent conformational changes of viral fusion proteins. Structure 21(7): 1085–1096. Harrison, O.J., Bahna, F., Katsamba, P.S., Jin, X., Brasch, J., Vendome, J., Ahlsen, G., Carroll, K.J., Price, S.R., Honig, B., Shapiro, L., (2010). Two-step adhesive binding by classical cadherins. Nat. Struct. Mol. Biol. 17(3): 348–57.

208

Harrison, O.J., Jin, X., Hong, S., Bahna, F., Ahlsen, G., Brasch, J., Wu, Y., Vendome, J., Felsovalyi, K., Hampton, C.M., Troyanovsky, R.B., Ben-Shaul, A., Frank, J., Troyanovsky, S.M., Shapiro, L., Honig, B., (2011). The extracellular architecture of adherens junctions revealed by crystal structures of type I cadherins. Structure 19(2): 244–56. Harte, W.E., Swaminathan, S., Mansuri, M.M., Martin, J.C., Rosenberg, I.E., Beveridge, D.L., (1990). Domain communication in the dynamical structure of human immunodeficiency virus 1 protease. Proc. Natl. Acad. Sci. U. S. A. 87(22): 8864–8. Häussinger, D., Ahrens, T., Aberle, T., Engel, J., Stetefeld, J., Grzesiek, S., (2004). Proteolytic E-cadherin activation followed by solution NMR and X-ray crystallography. EMBO J. 23(8): 1699–708. Hayward, S., de Groot, B.L., (2008). Normal modes and essential dynamics. Methods Mol. Biol. 443: 89–106. He, X.M., Carter, D.C., (1992). Atomic structure and chemistry of human serum albumin. Nature 358(6383): 209–215. Hinsen, K., (1998). Analysis of domain motions by approximate normal mode calculations. Proteins Struct. Funct. Genet. 33(3): 417–429. Hinsen, K., Petrescu, A.J., Dellerue, S., Bellissent-Funel, M.C., Kneller, G.R., (2000). Harmonicity in slow protein dynamics. Chem. Phys. 261(1-2): 25–37. Hogan, D.J., Riordan, D.P., Gerber, A.P., Herschlag, D., Brown, P.O., (2008). Diverse RNA- binding proteins interact with functionally related sets of RNAs, suggesting an extensive regulatory system. PLoS Biol. 6(10): e255. Holm, L., Sander, C., (1996). Mapping the protein universe. Science 273(5275): 595–603. Hong, H., Szabo, G., Tamm, L.K., (2006). Electrostatic couplings in OmpA ion-channel gating suggest a mechanism for pore opening. Nat. Chem. Biol. 2(11): 627–635. Hong, S., Troyanovsky, R.B., Troyanovsky, S.M., (2011). Cadherin exits the junction by switching its adhesive bond. J. Cell Biol. 192(6): 1073–1083. Hopf, T.A., Schärfe, C.P.I., Rodrigues, J.P.G.L.M., Green, A.G., Kohlbacher, O., Sander, C., Bonvin, A.M.J.J., Marks, D.S., (2014). Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 3: e03430. Hornak, V., Okur, A., Rizzo, R.C., Simmerling, C., (2006). HIV-1 protease flaps spontaneously open and reclose in molecular dynamics simulations. Proc. Natl. Acad. Sci. U. S. A. 103: 915–920. Hotelling, H., (1933). Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24(6): 417–441. Howe, P.W., (2001). Principal components analysis of protein structure ensembles calculated using NMR data. J. Biomol. NMR 20(1): 61–70. Huang, S.-Y., Zou, X., (2012). A nonredundant structure dataset for benchmarking protein- RNA computational docking. J. Comput. Chem. 34(4): 311–318.

209

Hub, J.S., Groot, B.L. De, (2009). Detection of Functional Modes in Protein Dynamics. PLoS Comput. Biol. 5(8): e1000480. Hubbard, S., Thornton, J., (1993). “NACCESS”, computer program. Hubbard, T.J.P., Ailey, B., Brenner, S.E., Murzin, A.G., Chothia, C., (1999). SCOP: a Structural Classification of Proteins database. Nucleic Acids Res. 27(1): 254–256. Huey, R., Morris, G.M., Olson, A.J., Goodsell, D.S., (2007). A Semiempirical Free Energy Force Field with Charge-Based Desolvation. J. Comput. Chem. 28(6): 1145–1152. Humphrey, W., Dalke, A., Schulten, K., (1996). VMD: Visual molecular dynamics. J. Mol. Graph. 14(1): 33–38. Hussein, M.J., Green, J.M., Nichols, B.P., (1998). Characterization of mutations that allow p- aminobenzoyl-glutamate utilization by Escherichia coli. J. Bacteriol. 180(23): 6260– 6268. Hyeon, C., Thirumalai, D., (2007). Measuring the energy landscape roughness and the transition state location of biomolecules using single molecule mechanical unfolding experiments. J. Phys. Condens. Matter 19: 113101. Ichiye, T., Karplus, M., (1991). Collective motions in proteins: a covariance analysis of atomic fluctuations in molecular dynamics and normal mode simulations. Proteins 11(3): 205–217. Ikeguchi, M., Ueno, J., Sato, M., Kidera, A., (2005). Protein structural change upon ligand binding: linear response theory. Phys Rev Lett 94(7): 78102. Ilan, B., Tajkhorshid, E., Schulten, K., Voth, G.A., (2004). The Mechanism of Proton Exclusion in Aquaporin Channels. Proteins Struct. Funct. Genet. 55: 223–228. Ishima, R., Torchia, D., (2000). Protein dynamics from NMR. Nat. Struct. Mol. Biol. 76(2-3): 145–152. Izrailev, S., Stepaniants, S., Balsera, M., Oono, Y., Schulten, K., (1997). Molecular Dynamics Study of Unbinding of the Avidin-Biotin Complex. Biophys. J. 72(April). Jensen, M.Ø., Park, S., Tajkhorshid, E., Schulten, K., (2002). Energetics of glycerol conduction through aquaglyceroporin GlpF. Proc. Natl. Acad. Sci. U. S. A. 99: 6731– 6736. Jensen, M.Ø., Yin, Y., Tajkhorshid, E., Schulten, K., (2007). Sugar transport across lactose permease probed by steered molecular dynamics. Biophys. J. 93: 92–102. Jernigan, R.L., Yang, L., Song, G., Kurkckuoglu, O., Doruker, P., (2009). Elastic Network Models of Coarse-Grained Proteins Are Effective for Studying the Structural Control Exerted over Their Dynamics, in: Voth, G.A. (Ed.), Coarse-Graining of Condensed Phase and Biomolecular Systems. CRC Press, Boca Raton, FL, pp. 237–254. Johnson, L.N., Phillips, D.C., (1965). Structure of some crystalline lysozyme-inhibitor complexes determiend by x-ray analysis at 6 A resolution. Nature 208(4986): 761–763. Jones, S., Daley, D.T., Luscombe, N.M., Berman, H.M., Thornton, J.M., (2001). Protein- RNA interactions: a structural analysis. Nucleic Acids Res. 29(4): 943–954.

210

Jorgensen, W.L., TiradoRives, J., (1996). Monte Carlo vs molecular dynamics for conformational sampling. J. Phys. Chem. 100(96): 14508–14513. Jorgensen, W.L., Tirado-Rives, J., (1988). The OPLS [optimized potentials for liquid simulations] potential functions for proteins, energy minimizations for crystals of cyclic peptides and crambin. J. Am. Chem. Soc. 110(6): 1657–1666. Karplus, M., Kushick, J.N., (1981). Method for estimating the configurational entropy of macromolecules. Macromolecules 14(2): 325–332. Karplus, M., McCammon, J.A., (2002). Molecular dynamics simulations of biomolecules. Nat. Struct. Biol. 9(9): 646–652. Katebi, A.R., Sankar, K., Jia, K., Jernigan, R.L., (2015). The use of experimental structures to model protein dynamics, in: Molecular Modeling of Proteins. pp. 213–236. Katsamba, P., Carroll, K., Ahlsen, G., Bahna, F., Vendome, J., Posy, S., Rajebhosale, M., Price, S., Jessell, T.M., Ben-Shaul, A., Shapiro, L., Honig, B.H., (2009). Linking molecular affinity and cellular specificity in cadherin-mediated adhesion. Proc. Natl. Acad. Sci. U. S. A. 106(28): 11594–9. Kay, L.E., (2005). NMR studies of protein structure and dynamics. J. Magn. Reson. 173(2): 193–207. Ke, A., Doudna, J.A., (2004). Crystallization of RNA and RNA–protein complexes. Methods 34(3): 408–414. Kechavarzi, B., Janga, S.C., (2014). Dissecting the expression landscape of RNA-binding proteins in human cancers. Genome Biol. 15(1): R14. Kendrew, J.C., Perutz, M.F., (1957). X-ray studies of compounds of biological interest. Annu. Rev. Biochem. 26: 327–372. Kennedy, J., Eberhart, R., (1995). Particle Swarm Optimization. Proc. 1995 IEEE Int. Conf. Neural Networks : 1942–1948. Keskin, O., Bahar, I., Badretdinov, A.Y., Ptitsyn, O.B., Jernigan, R.L., (1998). Empirical solvent-mediated potentials hold for both intra-molecular and inter-molecular inter- residue interactions. Protein Sci. 7: 2578–2586. Kihara, D., Chen, H., Yang, Y.D., (2009). Quality assessment of protein structure models. Curr. Protein Pept. Sci. 10(3): 216–228. Kim, H., Abeysirigunawarden, S.C., Chen, K., Mayerle, M., Ragunathan, K., Luthey- Schulten, Z., Ha, T., Woodson, S.A., (2014). Protein-guided RNA dynamics during early ribosome assembly. Nature 506(7488): 334–338. Kim, J.G., Kim, T.W., Kim, J., Ihee, H., (2015). Protein Structural Dynamics Revealed by Time-Resolved X-ray Solution Scattering. Acc. Chem. Res. 48(8): 2200–2208. Kim, M.H., Seo, S., Jeong, J. Il, Kim, B.J., Liu, W.K., Lim, B.S., Choi, J.B., Kim, M.K., (2013). A mass weighted chemical elastic network model elucidates closed form domain motions in proteins. Protein Sci. 22(5): 605–13. Kim, M.K., Jernigan, R.L., Chirikjian, G.S., (2003). An elastic network model of HK97 capsid maturation. J. Struct. Biol. 143(2): 107–117.

211

Kim, M.K., Jernigan, R.L., Chirikjian, G.S., (2002). Efficient generation of feasible pathways for protein conformational transitions. Biophys. J. 83(3): 1620–30. Kim, O.T.P., Yura, K., Go, N., (2006). Amino acid residue doublet propensity in the protein- RNA interface and its application to RNA interface prediction. Nucleic Acids Res. 34(22): 6450–6460. Kleckner, I.R., Foster, M.P., (2011). An introduction to NMR-based approaches for measuring protein dynamics. Biochim. Biophys. Acta - Proteins Proteomics 1814(8): 942–968. Klosterman, P.S., Tamura, M., Holbrook, S.R., Brenner, S.E., (2002). SCOR: a Structural Classification of RNA database. Nucleic Acids Res. 30(1): 392–394. Kolinski, A., Skolnick, J., (1994a). Monte Carlo simulations of protein folding. I. Lattice model and interaction scheme. Proteins 18(4): 338–352. Kolinski, A., Skolnick, J., (1994b). Monte Carlo simulations of protein folding. II. Application to protein A, ROP, and crambin. Proteins 18(4): 353–66. Kon Ho Lee, Xie, D., Freire, E., Amzel, L.M., (1994). Estimation of changes in side chain configurational entropy in binding and folding: General methods and application to helix formation. Proteins Struct. Funct. Genet. 20(1): 68–84. Konagurthu, A.S., Whisstock, J.C., Stuckey, P.J., Lesk, A.M., (2006). MUSTANG: a multiple structural alignment algorithm. Proteins 64(3): 559–74. Kondrashov, D.A., Cui, Q., Phillips, G.N., (2006). Optimization and evaluation of a coarse- grained model of protein motion using x-ray crystal data. Biophys. J. 91(8): 2760–7. Kondrashov, D.A., Van Wynsberghe, A.W., Bannen, R.M., Cui, Q., Phillips, G.N., (2007). Protein Structural Variation in Computational Models and Crystallographic Data. Structure 15(2): 169–177. Kouranov, A., Xie, L., de la Cruz, J., Chen, L., Westbrook, J., Bourne, P.E., Berman, H.M., (2006). The RCSB PDB information portal for structural genomics. Nucleic Acids Res. 34(Database issue): D302–D305. Kožíšek, M., Bray, J., Řezáčová, P., Šašková, K., Brynda, J., Pokorná, J., Mammano, F., Rulíšek, L., Konvalinka, J., (2007). Molecular Analysis of the HIV-1 Resistance Development: Enzymatic Activities, Crystal Structures, and Thermodynamics of Nelfinavir-resistant HIV Protease Mutants. J. Mol. Biol. 374(4): 1005–1016. Krishnamoorthy, B., Tropsha, A., (2003). Development of a four-body statistical pseudo- potential to discriminate native from non-native protein conformations. Bioinformatics 19(12): 1540–1548. Kryshtafovych, A., Fidelis, K., (2009). Protein structure prediction and model quality assessment. Drug Discov. Today 14(7-8): 386–393. Kuriyan, J., Petsko, G.A., Levy, R.M., Karplus, M., (1986). Effect of anisotropy and anharmonicity on protein crystallographic refinement. An evaluation by molecular dynamics. J. Mol. Biol. 190(2): 227–254.

212

Kurkcuoglu, O., Doruker, P., Sen, T.Z., Kloczkowski, A., Jernigan, R.L., (2008). The ribosome structure controls and directs mRNA entry, translocation and exit dynamics. Phys. Biol. 5(4): 046005. Kurkcuoglu, O., Kurkcuoglu, Z., Doruker, P., Jernigan, R.L., (2009). Collective dynamics of the ribosomal tunnel revealed by elastic network modeling. Proteins Struct. Funct. Bioinforma. 75(4): 837–845. Lange, O.F., Lakomek, N.-A., Farès, C., Schröder, G.F., Walter, K.F.A., Becker, S., Meiler, J., Grubmüller, H., Griesinger, C., de Groot, B.L., (2008). Recognition dynamics up to microseconds revealed from an RDC-derived ubiquitin ensemble in solution. Science 320(5882): 1471–5. Leckband, D., Sivasankar, S., (2012a). Cadherin recognition and adhesion. Curr. Opin. Cell Biol. 24(5): 620–627. Leckband, D., Sivasankar, S., (2012b). Biophysics of cadherin adhesion. Subcell Biochem 60: 63–88. Lennard-Jones, J.E., (1924). On the Determination of Molecular Fields. II. From the Equation of State of a Gas. Proc. R. Soc. A 106(738): 463–477. Leo-Macias, A., Lopez-Romero, P., Lupyan, D., Zerbino, D., Ortiz, A.R., (2005). An Analysis of Core Deformations in Protein Superfamilies. Biophys. J. 88(2): 1291–1299. Levitt, M., (2007). Growth of novel protein structural data. Proc. Natl. Acad. Sci. U. S. A. 104(9): 3183–3188. Levitt, M., Lifson, S., (1969). Refinement of protein conformations using a macromolecular energy minimization procedure. J. Mol. Biol. 46(2): 269–279. Levitt, M., Sander, C., Stern, P.S., (1985). Protein normal-mode dynamics: Trypsin inhibitor, crambin, ribonuclease and lysozyme. J. Mol. Biol. 181(3): 423–447. Levitt, M., Warshel, A., (1975). Computer Simulation of protein folding. Nature 253: 694– 698. Levy, R.M., Karplus, M., Kushick, J., Perahia, D., (1984). Evaluation of the Configurational entropy for Proteins: Application to Molecular Dynamics Simulations of an alpha-Helix. Macromolecules 17(1): 1370–1374. Lewis, B.A., Walia, R.R., Terribilini, M., Ferguson, J., Zheng, C., Honavar, V., Dobbs, D., (2010). PRIDB: a protein-RNA interface database. Nucleic Acids Res. 39(Database): D277–D282. Li, D.W., Khanlarzadeh, M., Wang, J., Huo, S., Brüschweiler, R., (2007). Evaluation of configurational entropy methods from peptide folding-unfolding simulation. J. Phys. Chem. B 111(49): 13807–13813. Li, W., Godzik, A., (2006). Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13): 1658–1659. Li, Y., Altorelli, N.L., Bahna, F., Honig, B., Shapiro, L., Palmer, a G., (2013). Mechanism of E-cadherin dimerization probed by NMR relaxation dispersion. Proc. Natl. Acad. Sci. U. S. A. 110(41): 16462–16467.

213

Li, Z., Nagy, P.D., (2011). Diverse roles of host RNA binding proteins in RNA virus replication. RNA Biol. 8(2): 305–315. Liu, Y., Bahar, I., (2012). Sequence evolution correlates with structural dynamics. Mol. Biol. Evol. 29(9): 2253–63. Lomize, M.A., Pogozheva, I.D., Joo, H., Mosberg, H.I., Lomize, A.L., (2012). OPM database and PPM web server: Resources for positioning of proteins in membranes. Nucleic Acids Res. 40(D1): D370–D376. Lu, H., Skolnick, J., (2001). A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins Struct. Funct. Genet. 44(3): 223–232. Lu, M., Dousis, A.D., Ma, J., (2008). OPUS-PSP: An Orientation-dependent Statistical All- atom Potential Derived from Side-chain Packing. J. Mol. Biol. 376(1): 288–301. Mackerell, A.D., (2004). Empirical force fields for biological macromolecules: Overview and issues. J. Comput. Chem. 25(13): 1584–1604. MacKerell, A.D., Bashford, D., Bellott, M., Dunbrack, R.L., Evanseck, J.D., Field, M.J., Fischer, S., Gao, J., Guo, H., Ha, S., Joseph-McCarthy, D., Kuchnir, L., Kuczera, K., Lau, F.T., Mattos, C., Michnick, S., Ngo, T., Nguyen, D.T., Prodhom, B., Reiher, W.E., Roux, B., Schlenkrich, M., Smith, J.C., Stote, R., Straub, J., Watanabe, M., Wiórkiewicz-Kuczera, J., Yin, D., Karplus, M., (1998). All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102(18): 3586– 616. MacLennan, D.H., Rice, W.J., Green, N.M., (1997). The mechanism of Ca2+ transport by sarco(endo)plasmic reticulum Ca2+-ATPases. J. Biol. Chem. 272(46): 28815–28818. Maetschke, S.R., Yuan, Z., (2009). Exploiting structural and topological information to improve prediction of RNA-protein binding sites. BMC Bioinformatics 10(1): 341. Maisuradze, G., Liwo, A., Scheraga, H.A., (2009). How Adequate are One- and Two- Dimensional Free Energy Landscapes for Protein Folding Dynamics? Phys. Rev. Lett. 102(23): 238102. Maisuradze, G.G., Leitner, D.M., (2006). Principal component analysis of fast-folding λ- repressor mutants. Chem. Phys. Lett. 421(1-3): 5–10. Maisuradze, G.G., Liwo, A., Scheraga, H.A., (2009). Principal component analysis for protein folding dynamics. J. Mol. Biol. 385(1): 312–29. Makhatadze, G.I., Privalov, P.L., (1996). On the Entropy of Protein-Folding. Protein Sci. 5(3): 507–510. Mancusso, R., Gregorio, G.G., Liu, Q., Wang, D.-N., (2012). Structure and mechanism of a bacterial sodium-dependent dicarboxylate transporter. Nature 491(7425): 1–6. Mandell, D.J., Kortemme, T., (2009). Computer-aided design of functional protein interactions. Nat. Chem. Biol. 5(11): 797–807. Manibog, K., Li, H., Rakshit, S., Sivasankar, S., (2014). Resolving the molecular mechanism of cadherin catch bond formation. Nat. Commun. 5: 3941. Manly, B.F.J., (1986). Multivariate Statistical Methods: A primer, Chapman & Hall/CRC.

214

Maragakis, P., Karplus, M., (2005). Large amplitude conformational change in proteins explored with a plastic network model: adenylate kinase. J. Mol. Biol. 352(4): 807–22. Marks, D.S., Colwell, L.J., Sheridan, R., Hopf, T.A., Pagnani, A., Zecchina, R., Sander, C., (2011). Protein 3D structure computed from evolutionary sequence variation. PLoS One 6(12): e28766. Marks, D.S., Hopf, T.A., Sander, C., (2012). Protein structure prediction from sequence variation. Nat. Biotechnol. 30(11): 1072–80. Marques, O., Sanejouand, Y.H., (1995). Hinge-bending motion in citrate synthase arising from normal mode calculations. Proteins Struct. Funct. Genet. 23(4): 557–560. Marsh, J.A., Teichmann, S.A., (2014). Parallel dynamics and evolution: Protein conformational fluctuations and assembly reflect evolutionary changes in sequence and structure. BioEssays 36(2): 209–218. Matthews, B.W., Remington, S.J., (1974). The three dimensional structure of the lysozyme from bacteriophage T4. Proc. Natl. Acad. Sci. U. S. A. 71(10): 4178–4182. McCammon, J.A., Geliin, B.R., Karplus, M., Wolynes, P.G., (1976). The hinge-bending mode in lysozyme. Nature 262: 325–326. McCammon, J.A., Gelin, B.R., Karplus, M., (1977). Dynamics of folded proteins. Nature 267(5612): 585–90. Meireles, L., Gur, M., Bakan, A., Bahar, I., (2011). Pre-existing soft modes of motion uniquely defined by native contact topology facilitate ligand binding to proteins. Protein Sci. 20(10): 1645–1658. Meirovitch, H., (2007). Recent developments in methodologies for calculating the entropy and free energy of biological systems by computer simulation. Curr. Opin. Struct. Biol. 17(2): 181–186. Meirovitch, H., Koerber, S.C., Rivier, J.E., Hagler, A.T., (1994). Computer simulation of the free energy of peptides with the local states method: Analogues of gonadotropin releasing hormone in the random coil and stable states. Biopolymers 34(7): 815–839. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E., (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 21(6): 1087. Metropolis, N., Ulam, S., (1949). The Monte Carlo method. J. Am. Stat. Assoc. 44(247): 335–341. Miller, D.W., Dill, K.A., (1997). Ligand binding to proteins: the binding landscape model. Protein Sci. 6(10): 2166–2179. Misquitta, C.M., Mack, D.P., Grover, A.K., (1999). Sarco/endoplasmic reticulum Ca2+ (SERCA)-pumps: link to heart beats and calcium waves. Cell Calcium 25(4): 277–290. Mitchell, M.C., Meyer, M.T., Griffiths, H., (2014). Dynamics of carbon-concentrating mechanism induction and protein relocalization during the dark-to-light transition in synchronized Chlamydomonas reinhardtii. Plant Physiol. 166(2): 1073–82.

215

Miyazawa, S., Jernigan, R.L., (1999). Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins Struct. Funct. Genet. 34(1): 49–68. Miyazawa, S., Jernigan, R.L., (1996). Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256(3): 623–644. Miyazawa, S., Jernigan, R.L., (1985). Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18(3): 534–552. Moroney, J. V., Ynalvez, R.A., (2007). Proposed carbon dioxide concentrating mechanism in Chlamydomonas reinhardtii. Eukaryot. Cell. Moroney, J. V, Rouge, B., (2007). Algal Carbon Dioxide Concentrating Mechanisms. J. Bacteriol. : 1–7. Morris, G.M., Goodsell, D.S., Halliday, R.S., Huey, R., Hart, W.E., Belew, R.K., Olson, A.J., Al, M.E.T., (1998). Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function. J. Comput. Chem. 19(14): 1639–1662. Morris, G.M., Huey, R., Lindstrom, W., Sanner, M.F., Belew, R.K., Goodsell, D.S., Olson, A.J., (2009). AutoDock4 and AutoDockTools4 : Automated Docking with Selective Receptor Flexibility. J. Comput. Chem. 30(16): 2785–2791. Mouawad, L., Perahia, D., (1993). Diagonalization in a mixed basis: A method to compute low-frequency normal modes for large macromolecules. Biopolymers 33(4): 599–611. Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T., Tramontano, A., (2014). Critical assessment of methods of protein structure prediction (CASP)--round x. Proteins 82 Suppl 2(October): 1–6. Mu, Y., Nguyen, P.H., Stock, G., (2005). Energy landscape of a small peptide revealed by dihedral angle principal component analysis. Proteins 58(1): 45–52. Mu, Y., Stock, G., (2006). Conformational Dynamics of RNA-Peptide Binding: A Molecular Dynamics Simulation Study. Biophys. J. 90(2): 391–399. Mulligan, C., Fenollar-Ferrer, C., Fitzgerald, G.A., Vergara-Jaque, A., Kaufmann, D., Li, Y., Forrest, L.R., Mindell, J.A., (2016). The bacterial dicarboxylate transporter VcINDY uses a two-domain elevator-type mechanism. Nat. Struct. Mol. Biol. 23(3): 256–263. Munson, P.J., Singh, R.K., (1997). Statistical significance of hierarchical multi-body potentials based on Delaunay tessellation and their application in sequence-structure alignment. Protein Sci. 6(7): 1467–1481. Murakami, Y., Spriggs, R. V, Nakamura, H., Jones, S., (2010). PiRaNhA: A Server for the Computational Prediction of RNA-Binding Residues in Protein Sequences. Nucleic Acids Res. 38(suppl 2): W412–W416. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., (1995). SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247(4): 536–540.

216

Nagar, B., Overduin, M., Ikura, M., Rini, J.M., (1996). Structural basis of calcium-induced E-cadherin rigidification and dimerization. Nature 380(6572): 360–364. Nagarajan, A., Andersen, J.P., Woolf, T.B., (2012). The role of domain: Domain interactions versus domain: Water interactions in the coarse-grained simulations of the E1P to E2P transitions in Ca-ATPase (SERCA). Proteins Struct. Funct. Bioinforma. 80(8): 1929– 1947. Navia, M.A., Fitzgerald, P.M., McKeever, B.M., Leu, C.T., Heimbach, J.C., Herber, W.K., Sigal, I.S., Darke, P.L., Springer, J.P., (1989). Three-dimensional structure of aspartyl protease from human immunodeficiency virus HIV-1. Nature 337(6208): 615–620. Nienhaus, G.U., Heinzl, J., Huenges, E., Parak, F., (1989). Protein crystal dynamics studied by time-resolved analysis of X-ray diffuse scattering. Nature 338: 665–666. Niessen, C.M., Leckband, D., Yap, A.S., (2011). Tissue organization by cadherin adhesion molecules: dynamic molecular and cellular mechanisms of morphogenetic regulation. Physiol. Rev. 91(2): 691–731. Nikaido, H., (2009). Multidrug resistance in bacteria. Annu. Rev. Biochem. 78: 119–46. Nussinov, R., Wolynes, P.G., (2014). A second molecular biology revolution? The energy landscapes of biomolecular function. Phys. Chem. Chem. Phys. 16(14): 6321–2. Oesterhelt, F., Rief, M., Gaub, H.E., (1999). Single molecule force spectroscopy by AFM indicates helical structure of poly ( ethylene-glycol ) in water. New J. Phys. 1: 1–11. Ohnishi, N., Mukherjee, B., Tsujikawa, T., Yanase, M., Nakano, H., Moroney, J. V, Fukuzawa, H., (2010). Expression of a low CO₂-inducible protein, LCI1, increases inorganic carbon uptake in the green alga Chlamydomonas reinhardtii. Plant Cell 22(9): 3105–17. Ohtaka, H., Schön, A., Freire, E., (2003). Multidrug Resistance to HIV-1 Protease Inhibition Requires Cooperative Coupling between Distal Mutations. Biochemistry 42(46): 13659– 13666. Onuchic, J.N., Luthey-Schulten, Z., Wolynes, P.G., (1997). Theory of protein folding: the energy landscape perspective. Annu. Rev. Phys. Chem. 48: 545–600. Oostenbrink, C., Villa, A., Mark, A.E., Van Gunsteren, W.F., (2004). A biomolecular force field based on the free enthalpy of hydration and solvation: The GROMOS force-field parameter sets 53A5 and 53A6. J. Comput. Chem. 25(13): 1656–1676. Ostermeier, C., Michel, H., (1997). Crystallization of membrane proteins. Curr. Opin. Struct. Biol. 7: 697–701. Parisini, E., Higgins, J.M.G., Liu, J., Brenner, M.B., Wang, J., (2007). The Crystal Structure of Human E-cadherin Domains 1 and 2, and Comparison with other Cadherins in the Context of Adhesion Mechanism. J. Mol. Biol. 373(2): 401–411. Park, B., Levitt, M., (1996). Energy functions that discriminate X-ray and near native folds from well-constructed decoys. J. Mol. Biol. 258(2): 367–392. Pearson, K., (1901). On lines and planes of closest fit to systems of points in space. Philos. Mag. 2(11): 559–572.

217

Pedersen, O., Colmer, T.D., Sand-Jensen, K., (2013). Underwater photosynthesis of submerged plants - recent advances and methods. Front. Plant Sci. 4: 140. Peng, Z., Oldfield, C.J., Xue, B., Mizianty, M.J., Dunker, A.K., Kurgan, L., Uversky, V.N., (2014). A creature with a hundred waggly tails: intrinsically disordered proteins in the ribosome. Cell. Mol. life Sci. C. 71(8): 1477–1504. Perahia, D., Mouawad, L., (1995). Computation of low-frequency normal modes in macromolecules: Improvements to the method of diagonalization in a mixed basis and application to hemoglobin. Comput. Chem. 19(3): 241–246. Perez-Cano, L., Fernandez-Recio, J., (2010). Optimal protein-RNA area, OPRA: a propensity-based method to identify RNA-binding sites on proteins. Proteins 78(1): 25– 35. Perez-Cano, L., Jimenez-Garcia, B., Fernandez-Recio, J., (2012). A protein-RNA docking benchmark (II): extended set from experimental and homology modeling data. Proteins 80(7): 1872–1882. Pertz, O., Bozic, D., Koch, A.W., Fauser, C., Brancaccio, A., Engel, J., (1999). A new crystal structure , Ca2+ dependence and mutational analysis reveal molecular details of E- cadherin homoassociation. EMBO J. 18(7): 1738–1747. Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel, R.D., Kalé, L., Schulten, K., (2005). Scalable molecular dynamics with NAMD. J. Comput. Chem. 26: 1781–1802. Pickett, S., Sternberg, M.J., (1993). Empirical scale of side-chain conformational entropy in protein folding. J. Mol. Biol. 231(3): 825–839. Piddock, L.J. V, (2006). Multidrug-resistance efflux pumps - not just for resistance. Nat. Rev. Microbiol. 4(8): 629–636. Pokarowski, P., Kloczkowski, A., Jernigan, R.L., Kothari, N.S., Pokarowska, M., Kolinski, A., (2005). Inferring ideal amino acid interaction forms from statistical protein contact potentials. Proteins 59(1): 49–57. Pokutta, S., Weis, W.I., (2007). Structure and mechanism of cadherins and catenins in cell- cell contacts. Annu. Rev. Cell Dev. Biol. 23: 237–261. Ponder, J.W., Case, D.A., (2003). Force fields for protein simulations. Adv. Protein Chem. Prakash, S., Cooper, G., Singhi, S., Saier, M.H., (2003). The ion transporter superfamily. Biochim. Biophys. Acta - Biomembr. 1618(1): 79–92. Price, D.J., Brooks, C.L., (2002). Modern protein force fields behave comparably in molecular dynamics simulations. J. Comput. Chem. 23(11): 1045–1057. Pumplin, N., Voinnet, O., (2013). RNA silencing suppression by plant pathogens: defence, counter-defence and counter-counter-defence. Nat. Rev. Microbiol. 11(11): 745–760. Puton, T., Kozlowski, L., Tuszynska, I., Rother, K., Bujnicki, J.M., (2012). Computational methods for prediction of protein-RNA interactions. J. Struct. Biol. 179(3): 261–268. Rakshit, S., Zhang, Y., Manibog, K., Shafraz, O., Sivasankar, S., (2012). Ideal, catch, and slip bonds in cadherin adhesion. Proc. Natl. Acad. Sci. U. S. A. 109(46): 18815–20.

218

Riccardi, L., Nguyen, P.H., Stock, G., (2009). Free-energy landscape of RNA hairpins constructed via dihedral angle principal component analysis. J. Phys. Chem. B 113(52): 16660–16668. Ringe, D., Petsko, G.A., (1985). Mapping protein dynamics by X-ray diffraction. Prog. Biophys. Mol. Biol. 45(3): 197–235. Ritchie, D.W., (2008). Recent progress and future directions in protein-protein docking. Curr. Protein Pept. Sci. 9(1): 1–15. Russ, W.P., Ranganathan, R., (2002). Knowledge-based potential functions in protein design. Curr. Opin. Struct. Biol. 12(4): 447–452. Ruvinsky, A.M., Kirys, T., Tuzikov, A. V., Vakser, I. a., (2012). Structure Fluctuations and Conformational Changes in Protein Binding. J. Bioinform. Comput. Biol. 10(02): 1241002. Rykunov, D., Fiser, A., (2010). New statistical potential for quality assessment of protein models and a survey of energy functions. BMC Bioinformatics 11: 128. Sali, A., Sali, A., (2006). Statistical potential for assessment and prediction of protein structures. Protein Sci. : 2507–2524. Samudrala, R., Moult, J., (1998). An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 275(5): 895–916. Sanejouand, Y.-H., (2011). Elastic Network Models: Theoretical and Empirical Foundations, in: Biomolecular Simulations. p. 26. Sankar, K., Liu, J., Wang, Y., Jernigan, R.L., (2015). Distributions of experimental protein structures on coarse-grained free energy landscapes. J. Chem. Phys. 143(24): 243153. Sankar, K., Walia, R.R., Mann, C.M., Jernigan, R.L., Honavar, V.G., Dobbs, D., (2014). An analysis of conformational changes upon RNA-protein binding. Proc. 5th ACM Conf. Bioinformatics, Comput. Biol. Heal. Informatics - BCB ’14 : 592–593. Sapra, K.T., Balasubramanian, G.P., Labudde, D., Bowie, J.U., Muller, D.J., (2008). Point Mutations in Membrane Proteins Reshape Energy Landscape and Populate Different Unfolding Pathways. J. Mol. Biol. 376(4): 1076–1090. Scheraga, H.A., Khalili, M., Liwo, A., (2007). Protein-folding dynamics: Overview of molecular simulation techniques. Annu. Rev. Phys. Chem. 58: 57–83. Schlenkrich, M., Brickmann, J., MacKerell, A.D., Karplus, M., (1996). An Empirical Potential Energy Function for Phospholipids: Criteria for Parameter Optimization and Applications, in: Biological Membranes: : A Molecular Perspective from Computation and Experiment. pp. 31–81. Schneider, T.R., (2004). Domain identification by iterative analysis of error-scaled difference distance matrices. Acta Crystallogr. D. Biol. Crystallogr. 60(Pt 12 Pt 1): 2269–2275. Schneider, T.R., (2002). A genetic algorithm for the identification of conformationally invariant regions in protein molecules. Acta Crystallogr. D. Biol. Crystallogr. 58(Pt 2): 195–208.

219

Schneider, T.R., (2000). Objective comparison of protein structures: error-scaled difference distance matrices. Acta Crystallogr. D. Biol. Crystallogr. 56(Pt 6): 714–721. Schug, A., Onuchic, J.N., (2010). From protein folding to protein function and biomolecular binding by energy landscape theory. Curr. Opin. Pharmacol. 10(6): 709–714. Schuyler, A.D., Jernigan, R.L., Qasba, P.K., Ramakrishnan, B., Chirikjian, G.S., (2009). Iterative cluster-NMA: A tool for generating conformational transitions in proteins. Proteins 74(3): 760–76. Scott, W.R.P., Hünenberger, P.H., Tironi, I.G., Mark, A.E., Billeter, S.R., Fennen, J., Torda, A.E., Huber, T., Krüger, P., van Gunsteren, W.F., (1999). The GROMOS Biomolecular Simulation Program Package. J. Phys. Chem. A 103(19): 3596–3607. Shapiro, L., Fannon, A.M., Kwong, P.D., Thompson, A., Lehmann, M.S., Grübel, G., Legrand, J.F., Als-Nielsen, J., Colman, D.R., Hendrickson, W.A., (1995). Structural basis of cell-cell adhesion by cadherins. Nature 374(6520): 327–337. Shapiro, L., Weis, W.I., (2009). Structure and biochemistry of cadherins and catenins. Cold Spring Harb. Perspect. Biol. 1(3): a003053. Shortle, D., Simons, K.T., Baker, D., (1998). Clustering of low-energy conformations near the native structures of small proteins. Proc. Natl. Acad. Sci. U. S. A. 95(19): 11158– 11162. Sicard, F., Senet, P., (2013). Reconstructing the free-energy landscape of Met-enkephalin using dihedral principal component analysis and well-tempered metadynamics. J. Chem. Phys. 138(23): 235101. Sinha, N., Kumar, S., Nussinov, R., (2001). Interdomain interactions in hinge-bending transitions. Structure 9(12): 1165–1181. Sinha, N., Smith-Gill, S.J., (2002). Electrostatics in protein binding and function. Curr. Protein Pept. Sci. 3(6): 601–614. Sippl, M.J., (1993). Boltzmann’s principle, knowledge-based mean fields and protein folding. An approach to the computational determination of protein structures. J. Comput. Aided. Mol. Des. 7(4): 473–501. Sippl, M.J., (1990). Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213(4): 859–883. Sivasankar, S., Zhang, Y., Nelson, W.J., Chu, S., (2009). Characterizing the Initial Encounter Complex in Cadherin Adhesion. Structure 17(8): 1075–1081. Skjaerven, L., Hollup, S.M., Reuter, N., (2009). Normal mode analysis for proteins. J. Mol. Struct. THEOCHEM 898(1-3): 42–48. Skolnick, J., (2006). In quest of an empirical potential for protein structure prediction. Curr. Opin. Struct. Biol. 16(2): 166–171. Spreitzer, R.J., Salvucci, M.E., (2002). Rubisco: structure, regulatory interactions, and possibilities for a better enzyme. Annu. Rev. Plant Biol. 53(1): 449–75.

220

Stone, M.J., (2001). NMR relaxation studies of the role of conformational entropy in protein stability and ligand binding. Acc. Chem. Res. 34(5): 379–388. Su, C., Bolla, J.R., Shafer, W.M., Yu, E.W., Su, C., Bolla, J.R., Kumar, N., Radhakrishnan, A., Long, F., Delmar, J.A., (2015). Structure and Function of Neisseria gonorrhoeae MtrF Illuminates a Class of Antimetabolite Efflux Pumps. Cell Rep. 11: 61–70. Sugio, S., Kashima, A., Mochizuki, S., Noda, M., Kobayashi, K., (1999). Crystal structure of human serum albumin at 2.5 A resolution. Protein Eng. 12(6): 439–446. Summa, C.M., Levitt, M., (2007). Near-native structure refinement using in vacuo energy minimization. Proc. Natl. Acad. Sci. U. S. A. 104(9): 3177–3182. Sun, J., Deng, Z., Yan, A., (2014). Bacterial multidrug efflux pumps: Mechanisms, physiology and pharmacological exploitations. Biochem. Biophys. Res. Commun. 453(2): 254–267. Sutto, L., Gervasio, F.L., (2013). Effects of oncogenic mutations on the conformational free- energy landscape of EGFR kinase. Proc. Natl. Acad. Sci. U. S. A. 110(26): 10616–21. Sutto, L., Marsili, S., Valencia, A., Gervasio, F.L., (2015). From residue coevolution to protein conformational ensembles and functional dynamics. Proc. Natl. Acad. Sci. 112(44): 13567–13572. Tama, F., Gadea, F.X., Marques, O., Sanejouand, Y.H., (2000). Building-block approach for determining low-frequency normal modes of macromolecules. Proteins Struct. Funct. Genet. 41(1): 1–7. Tama, F., Sanejouand, Y.H., (2001). Conformational change of proteins arising from normal mode calculations. Protein Eng. 14(1): 1–6. Tamura, M., Hendrix, D.K., Klosterman, P.S., Schimmelman, N.R.B., Brenner, S.E., Holbrook, S.R., (2004). SCOR: Structural Classification of RNA, version 2.0. Nucleic Acids Res. 32(suppl 1): D182–D184. Tanaka, S., Scheraga, H.A., (1976). Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 9(6): 945–50. Tekpinar, M., Zheng, W., (2010). Predicting order of conformational changes during protein conformational transitions using an interpolated elastic network model. Proteins 78(11): 2469–81. Teodoro, M.L., Phillips Jr., G.N., Kavraki, L.E., (2003). Understanding Protein Flexibility through Dimensionality Reduction. J. Comput. Biol. 10: 617–634. Teodoro, M.L., Phillips, G.N., Kavraki, L.E., (2002). A dimensionality reduction approach to modeling protein flexibility. Proc. sixth Annu. Int. Conf. Comput. Biol. RECOMB 02 : 299–308. Terstappen, G.C., Reggiani, A., (2001). In silico research in drug discovery. Trends Pharmacol. Sci. 22(1): 23–26. Thoms, S., Pahlow, M., Wolf-Gladrow, D. a, (2001). Model of the carbon concentrating mechanism in chloroplasts of eukaryotic algae. J. Theor. Biol. 208(3): 295–313.

221

Tirion, M.M., (1996). Large Amplitude Elastic Motions in Proteins from a Single-Parameter, Atomic Analysis. Phys. Rev. Lett. 77(9): 1905–1908. Tokuriki, N., Stricher, F., Schymkowitz, J., Serrano, L., Tawfik, D.S., (2007). The Stability Effects of Protein Mutations Appear to be Universally Distributed. J. Mol. Biol. 369(5): 1318–1332. Toyoshima, C., Nakasako, M., Nomura, H., Ogawa, H., (2000). Crystal structure of the calcium pump of sarcoplasmic reticulum at 2.6 A resolution. Nature 405(6787): 647– 655. Toyoshima, C., Sasabe, H., Stokes, D.L., (1993). Three-dimensional cryo-electron microscopy of the calcium ion pump in the sarcoplasmic reticulum membrane. Nature 362(6419): 467–471. Tozzini, V., (2010). Multiscale modeling of proteins. Acc. Chem. Res. 43(2): 220–230. Tozzini, V., McCammon, J.A., (2005). A coarse grained model for the dynamics of flap opening in HIV-1 protease. Chem. Phys. Lett. 413: 123–128. Tozzini, V., Trylska, J., Chang, C., McCammon, J.A., (2007). Flap opening dynamics in HIV-1 protease explored with a coarse-grained model. J. Struct. Biol. 157: 606–615. Trylska, J., Tozzini, V., Chang, C.A., McCammon, J.A., (2007). HIV-1 protease substrate binding and product release pathways explored with coarse-grained molecular dynamics. Biophys. J. 92: 4179–4187. Tzeng, S.-R., Kalodimos, C.G., (2012). Protein activity regulation by conformational entropy. Nature 488(7410): 236–240. Vajda, S., Kozakov, D., (2009). Convergence and combination of methods in protein-protein docking. Curr. Opin. Struct. Biol. 19(2): 164–170. Vakser, I.A., Kundrotas, P., (2008). Predicting 3D structures of protein-protein complexes. Curr. Pharm. Biotechnol. 9(2): 57–66. van den Bedem, H., Fraser, J.S., (2015). Integrative, dynamic structural biology at atomic resolution—it’s about time. Nat. Methods 12(4): 307–318. Van Der Spoel, D., Lindahl, E., Hess, B., Groenhof, G., Mark, A.E., Berendsen, H.J.C., (2005). GROMACS: Fast, flexible, and free. J. Comput. Chem. 26(16): 1701–1718. Veal, W.L., Shafer, W.M., (2003). Identification of a cell envelope protein (MtrF) involved in hydrophobic antimicrobial resistance in Neisseria gonorrhoeae. J. Antimicrob. Chemother. 51(1): 27–37. Vendome, J., Felsovalyi, K., Song, H., Yang, Z., Jin, X., Brasch, J., Harrison, O.J., Ahlsen, G., Bahna, F., Kaczynska, A., Katsamba, P.S., Edmond, D., Hubbell, W.L., Shapiro, L., Honig, B., (2014). Structural and energetic determinants of adhesive binding specificity in type I cadherins. Proc. Natl. Acad. Sci. 111(40): E4175–E4184. Vendome, J., Posy, S., Jin, X., Bahna, F., Ahlsen, G., Shapiro, L., Honig, B., (2011). Molecular design principles underlying β-strand swapping in the adhesive dimerization of cadherins. Nat. Struct. Mol. Biol. 18(6): 693–700.

222

Vergara-Jaque, A., Fenollar-Ferrer, C., Mulligan, C., Mindell, J.A., Forrest, L.R., (2015). Family resemblances: A common fold for some dimeric ion-coupled secondary transporters. J. Gen. Physiol. 146(5): 423–434. Vilsen, B., Andersen, J.P., MacLennan, D.H., (1991). Functional consequences of alterations to amino acids located in the hinge domain of the Ca2+-ATPase of sarcoplasmic reticulum. J. Biol. Chem. 266(24): 16157–16164. Vizcarra, C.L., Mayo, S.L., (2005). Electrostatics in computational protein design. Curr. Opin. Chem. Biol. 9(6): 622–626. Walia, R.R., Caragea, C., Lewis, B.A., Towfic, F.G., Terribilini, M., El-Manzalawy, Y., Dobbs, D., Honavar, V., (2012). Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art. BMC Bioinformatics 13(1): 89. Walia, R.R., Xue, L.C., Wilkins, K., El-Manzalawy, Y., Dobbs, D., Honavar, V., (2014). RNABindRPlus: A Predictor that Combines Machine Learning and Sequence Homology-Based Methods to Improve the Reliability of Predicted RNA-Binding Residues in Proteins. PLoS One 9(5): e97725. Wand, A.J., (2013). The dark energy of proteins comes to light: Conformational entropy and its role in protein function revealed by NMR relaxation. Curr. Opin. Struct. Biol. 23(1): 75–81. Wang, J., Brüschweiler, R., (2006). 2D entropy of discrete molecular ensembles. J. Chem. Theory Comput. 2(1): 18–24. Wang, Q., Zhang, P., Hoffman, L., Tripathi, S., Homouz, D., Liu, Y., Waxham, M.N., Cheung, M.S., (2013). Protein recognition and selection through conformational and mutually induced fit. Proc. Natl. Acad. Sci. : 201312788. Wang, Y., Duanmu, D., Spalding, M.H., (2011). Carbon dioxide concentrating mechanism in Chlamydomonas reinhardtii: Inorganic carbon transport and CO2 recapture, in: Photosynthesis Research. pp. 115–122. Wang, Y., Rader, a J., Bahar, I., Jernigan, R.L., (2004). Global ribosome motions revealed with elastic network model. J. Struct. Biol. 147(3): 302–14. Wang, Y., Schulten, K., Tajkhorshid, E., (2005). What makes an aquaporin a glycerol channel? A comparative study of AqpZ and GlpF. Structure 13: 1107–1118. Wang, Y., Tajkhorshid, E., (2008). Electrostatic funneling of substrate in mitochondrial inner membrane carriers. Proc. Natl. Acad. Sci. U. S. A. 105: 9598–9603. Warshel, A., (1976). Bicycle-pedal model for the first step in the vision process. Nature 260: 679–683. Warshel, A., Levitt, M., (1976). Theoretical studies of enzymic reactions: Dielectric, electrostatic and steric stabilization of the carbonium ion in the reaction of lysozyme. J. Mol. Biol. 103(2): 227–249. Weber, G., (1975). Energetics of Ligand Binding to Proteins. Adv. Protein Chem. 29: 1–83. Williamson, J.R., (2000). Induced fit in RNA-protein recognition. Nat. Struct. Mol. Biol. 7(10): 834–837.

223

Wolynes, P.G., (2005). Energy landscapes and solved protein-folding problems. Philos. Trans. A. Math. Phys. Eng. Sci. 363(1827): 453–464; discussion 464–467. Wolynes, P.G., (1996). Symmetry and the energy landscapes of biomolecules. Proc. Natl. Acad. Sci. U. S. A. 93(25): 14249–14255. Woods, D.D., (1940). The relation of p-aminobenzoic acid to the mechanism of the action of sulphonamide. Brit. J. Exptl. Path 21: 74–90. Wu, H., Finger, L.D., Feigon, J., (2005). Structure determination of protein/RNA complexes by NMR. Methods Enzymol. 394: 525–545. Wüthrich, K., (2001). The way to NMR structures of proteins. Nat. Struct. Biol. 8(11): 923– 925. Xu, D., Lin, S.L., Nussinov, R., (1997). Protein binding versus protein folding: the role of hydrophilic bridges in protein associations. J. Mol. Biol. 265(1): 68–84. Xu, J., Zhang, Y., (2010). How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26(7): 889–895. Yamano, T., Miura, K., Fukuzawa, H., (2008). Expression analysis of genes associated with the induction of the carbon-concentrating mechanism in Chlamydomonas reinhardtii. Plant Physiol. 147(May): 340–354. Yang, L., Song, G., Carriquiry, A., Jernigan, R.L., (2008). Close correspondence between the motions from principal component analysis of multiple HIV-1 protease structures and elastic network modes. Structure 16(2): 321–30. Yang, L., Song, G., Jernigan, R.L., (2009). Comparisons of experimental and computed protein anisotropic temperature factors. Proteins 76(1): 164–75. Yang, L., Song, G., Jernigan, R.L., (2007). How well can we understand large-scale protein motions using normal modes of elastic network models? Biophys. J. 93(3): 920–9. Yang, L.W., Eyal, E., Bahar, I., Kitao, A., (2009). Principal component analysis of native ensembles of biomolecular structures (PCA_NEST): Insights into functional dynamics. Bioinformatics 25(5): 606–614. Yang, L.-W., Eyal, E., Bahar, I., Kitao, A., (2009). Principal component analysis of native ensembles of biomolecular structures (PCA_NEST): insights into functional dynamics. Bioinformatics 25(5): 606–14. Yang, Y., Zhou, Y., (2008). Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins Struct. Funct. Genet. 72(2): 793–803. Yernool, D., Boudker, O., Jin, Y., Gouaux, E., (2004). Structure of a glutamate transporter homologue from Pyrococcus horikoshii. Nature 431(7010): 811–818. Zhang, J., Zhang, Y., (2010). A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PLoS One 5(10): e15386. Zhang, X.J., Wozniak, J.A., Matthews, B.W., (1995). Protein flexibility and adaptability seen in 25 crystal forms of T4 lysozyme. J. Mol. Biol. 250(4): 527–552.

224

Zhang, Y., (2008). Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol. 18(3): 342–348. Zhang, Y., Arakaki, A.K., Skolnick, J., (2005). TASSER: An automated method for the prediction of protein tertiary structures in CASP6. Proteins Struct. Funct. Genet. 61(SUPPL. 7): 91–98. Zhang, Y., Sivasankar, S., Nelson, W.J., Chu, S., (2009). Resolving cadherin interactions and binding cooperativity at the single-molecule level. Proc. Natl. Acad. Sci. U. S. A. 106(1): 109–114. Zhang, Y., Skolnick, J., (2005). TM-align: A protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33(7): 2302–2309. Zhang, Y., Skolnick, J., (2004a). SPICKER: A clustering approach to identify near-native protein folds. J. Comput. Chem. 25(6): 865–871. Zhang, Y., Skolnick, J., (2004b). Scoring function for automated assessment of protein structure template quality. Proteins 57(4): 702–10. Zhao, N., Pang, B., Shyu, C.R., Korkin, D., (2011). Charged residues at protein interaction interfaces: Unexpected conservation and orchestrated divergence. Protein Sci. 20(7): 1275–1284. Zheng, W., Brooks, B.R., Hummer, G., (2007). Protein conformational transitions explored by mixed elastic network models. Proteins Struct. Funct. Genet. 69(1): 43–57. Zhou, H., Zhou, Y., (2002). Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11(11): 2714–26. Zhu, G., Xia, Y., Nicholson, L.K., Sze, K.H., (2000). Protein dynamics measurements by TROSY-based NMR experiments. J. Magn. Reson. 143(2): 423–426. Zimmermann, M.T., Kloczkowski, A., Jernigan, R.L., (2011a). MAVENs: motion analysis and visualization of elastic networks and structural ensembles. BMC Bioinformatics 12: 264. Zimmermann, M.T., Leelananda, S.P., Gniewek, P., Feng, Y., Jernigan, R.L., Kloczkowski, A., (2011b). Free energies for coarse-grained proteins by integrating multibody statistical contact potentials with entropies from elastic network models. J. Struct. Funct. Genomics 12(2): 137–47. Zimmermann, M.T., Leelananda, S.P., Kloczkowski, A., Jernigan, R.L., (2012). Combining statistical potentials with dynamics-based entropies improves selection from protein decoys and docking poses. J. Phys. Chem. B 116(23): 6725–31. Zoete, V., Cuendet, M.A., Grosdidier, A., Michielin, O., (2011). SwissParam: A fast force field generation tool for small organic molecules. J. Comput. Chem. 32(11): 2359–2368.