COMPUTATIONAL MODELING OF STRUCTURAL HETEROGENEITY IN FOLDED PROTEINS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF MECHANICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Ankur Dhanik August 2010

© 2010 by Ankur Dhanik. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/vn290ds4143

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Jean-Claude Latombe, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Scott Delp, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Serafim Batzoglou

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Proteins are biomolecules that play a key role in a wide diversity of vital functions, such as metabolism and signal transmission. Each protein is a linear chain of amino acids that folds into a flexible three-dimensional structure. A protein’s flexibility is widely believed to be essential for its function. Indeed, proteins usually achieve their main functions by binding other molecules, called ligands. Binding requires shape and chemical complementarity of the two molecules at their binding interface. Conformation selection theory suggests that the protein and the ligand exist in an ensemble of continuously deforming conformations and that the most compatible conformations recognize each other and bind together. Binding conformations of proteins often differ significantly from non-binding ones. To understand a protein’s function, one must be able to determine or predict these binding conformations.

Motion of a protein occurs at timescales that span several orders of magnitude. Thermal fluctuations, which occur in picoseconds, are small-amplitude, uncorrelated, harmonic motions of the individual atoms. In contrast, conformational deformations closely related to the protein’s function occur in microseconds to milliseconds. These slow deformations are usually large-scale, correlated, anharmonic motions that correspond to transitions between meta-stable states, such as binding and non-binding states. In this dissertation we are mainly interested in modeling structural heterogeneity associated with such slow deformations.

This dissertation presents new computational methods to study the flexibility of folded proteins in the context of three important biological problems:

- Loop sampling: Loops between helices and/or strands are often highly flexible fragments of proteins that participate in binding sites. The problem is to determine if a given protein loop can achieve a shape where it may bind against a specified ligand.

- Interpretation of electron density maps: X-ray crystallography is the most common experimental method to determine the folded structure of a protein. The experiment provides an electron density map (EDM) from which the positions of the protein atoms can be determined. However, when there is heterogeneity among the protein conformations present in the crystal, the EDM is blurred and extremely difficult to interpret. The problem is to model structural heterogeneity present in the crystal.

- Determination of allosteric pathways: An allosteric protein is a protein whose shape changes when it binds an effector ligand at the protein’s allosteric site. This change alters the ability of the protein to bind another molecule at its functional site. The problem is to identify the sequences of side-chains through which the change in shape propagates from the allosteric site to the functional site.

Computational modeling of structural heterogeneity in the folded state of a protein is a challenging problem, mainly because of the high dimensionality of the protein’s conformation space and the very small relative volume of its feasible motion space. Although our methods are specific to each of the three problems, they share the same sample and select approach: they combine efficient sampling algorithms that allow us to represent structural heterogeneity in a folded protein by a collection of sampled conformations and selection algorithms that allow us to reliably pick the sampled conformations that provide a solution to the problem. In addition, they share several similar techniques, like efficient kinematic modeling, fast collision detection among atoms to handle van der Waals volume exclusion, and optimization techniques. This dissertation demonstrates the power of geometric computation and efficient sampling to model structural heterogeneity in the folded protein.

Acknowledgements

I would like to thank my adviser, Professor Jean-Claude Latombe, for his invaluable guidance and support throughout my PhD. I have benefited immensely from the discussions we had on scientific concepts, research methodologies, and life in general. His worldwide mountaineering trips also provided me a unique opportunity to learn from his travel experiences. He is a role model on how to build a productive and joyful research career.

I would like to thank my co-adviser, Professor Scott Delp, for his guidance that helped me achieve PhD milestones. He is a very kind spirit. I would also like to thank my committee members Professor Serafim Batzoglou, Professor Axel Brunger, Professor Gill Bejerano, Professor Russ Altman, and Professor Eric Darve for their feedback on my research.

Most of the work in this dissertation has been done in collaboration with Dr. Henry van den Bedem at the Joint Center for Structural Genomics. He always brought new insights into the research topics that are the focus of this dissertation. I have greatly benefited from his pointed critiques that always helped in improving and refining my research. Thanks for all the help.

Many thanks to the members of the Latombe group, Peggy Yao, Guanfeng Liu, Ruixiang Zhang, Liangjun Zhang, Kris Hauser, Tim Bretl, Mitul Saha, Philip Fong, Nathan Marz, Ryan Propper, and Charles Kou, with whom I enjoyed discussions on research and general fun topics. Peggy Yao worked with me on some of the research presented in this dissertation and was very helpful throughout. Nathan, Ryan, and Charles worked with me on early software development that proved very useful later. The administrative assistant of the Latombe group, Alex Sandra Pinedo, generously helped me with many things that saved me a lot of time. Special thanks to her.

I would also like to thank my friends Benoit, Tarun, Menaka, Arjun, Gaurav, Supreet, Shloke, Anshika, Sachin, Renu, Deepak, Mini, Rajeev, Ujvalla, Nitin, Gauri, Tirthankar, Beatrice, Sonti, Desingh, Naveen, Ashok, and Musu, with whom I shared fun moments of my student life.

My parents, brother Vyom, sister Taru, and sister-in-law Shweta have been a constant source of love and support. My parents were always concerned about when I would finish my PhD, but they never lost their patience. Their unwavering confidence in my abilities was always reassuring. I would also like to thank my in-laws, sister-in-law Niti, and brother-in-law Deepak for their love and best wishes.

This dissertation would not have been possible without the love and support of my wife Neha. She came into my life when I was getting into the hectic final year of my PhD and provided all the support that allowed me to focus solely on research. She always helped me feel relaxed and provided constant motivation. Her valuable comments on my research reports and especially on my PhD defense presentation helped in improving this dissertation. I share this dissertation with her.

Finally, I would like to thank Stanford University for providing me the facilities and a conducive environment for research. I would also like to thank the Indian Institute of Technology Kanpur and the National University of Singapore for their contribution to my early technical education. This research was partially funded by NSF grant DMS-0443939 and by two research grants from the KAUST-Stanford Academic Excellence Alliance (AEA) program. Thanks for the support.

Contents

Abstract

Acknowledgements

1 Introduction
  1.1 Motivation and Goals
  1.2 Structural Heterogeneity of a Folded Protein
  1.3 Why Model Structural Heterogeneity?
  1.4 Computational Challenge
  1.5 Main Contributions
  1.6 Relation to Previous Work

2 Exploring the Motion Space of Protein Loops
  2.1 Related Work
  2.2 Seed Sampling Algorithm
    2.2.1 Sampling front/back-end conformations
    2.2.2 Sampling mid-portion conformations
    2.2.3 Placing side-chains
  2.3 Deformation Sampling Algorithm
    2.3.1 Overview
    2.3.2 Computation of a basis of the tangent space
    2.3.3 Selection of a direction in the tangent space
    2.3.4 Placing side-chains
  2.4 Collision Detection
  2.5 Experiments
    2.5.1 Seed sampling
    2.5.2 Deformation sampling
    2.5.3 Placements of side-chains
    2.5.4 Calcium-binding site prediction
  2.6 Conclusion

3 Modeling Structural Heterogeneity From X-ray Data
  3.1 Related Work
  3.2 SAMPLE-SELECT Algorithm
    3.2.1 Selection step
    3.2.2 Conformation sampling
  3.3 Side-chain Driven Heterogeneity
    3.3.1 Sampling
    3.3.2 Validation with simulated data
    3.3.3 Results with experimental data
    3.3.4 Comparison to ensemble of independent conformations
  3.4 Main-chain Driven Heterogeneity
    3.4.1 Sampling
    3.4.2 Validation with simulated data
    3.4.3 Results with experimental data
  3.5 Conclusion

4 Determination of Allosteric Pathways
  4.1 Related Work
  4.2 RDPG Method
    4.2.1 Overview
    4.2.2 Node generation
    4.2.3 Arc generation
    4.2.4 Reduced RDPG and allosteric pathways
    4.2.5 Extension to multiple main-chain conformations
  4.3 Test Results
    4.3.1 CREB-binding protein
    4.3.2 PDZ domain family protein
  4.4 Conclusion

5 Conclusion
  5.1 Summary
  5.2 Problem-Specific Contributions
  5.3 Future Directions

Bibliography

List of Tables

2.1 Test set of 20 loops. Each row lists PDB id of the protein, number of residues in the protein, starting residue of the loop, number of residues in the loop, average sampling time (in seconds) for one closed, collision-free conformation using seed sampling and naive sampling algorithms.

2.2 Number of collision-free placements of side chains for five loops.

3.1 Sixteen structural models are rebuilt and subjected to modeling with SAMPLE-SELECT. Columns represent the PDB id of the model, resolution (Å), number of residues in the asymmetric unit, number of reflections, Rfree/R, and the root mean square deviations of the bond lengths and the bond angles of the single-conformer reference model and the multi-conformer model.

3.2 Details of calculated dual conformations for loop 104-112 of 2R4I. Each row lists occupancies for the conformations (Occ), map resolution (Res, in Å), RMSD of calculated conformations to PDB conformations, the cumulative calculated occupancies for the conformations (Calc Occ), and average temperature factor of calculated conformations (B̄calc, in Å²). Average observed temperature factors are 19.0 Å².

3.3 Details of calculated dual conformations for loop 104-112 of 2R4I. The true conformations were added in the sampling protocol. Each row lists occupancies for the conformations (Occ), map resolution (Res, in Å), RMSD of calculated conformations to PDB conformations, the cumulative calculated occupancies for the conformations (Calc Occ), and average temperature factor of calculated conformations (B̄calc, in Å²). Average observed temperature factors are 19.0 Å².

4.1 Number of pathways and number of involved residues for 20 main-chain conformations of CREB-binding protein (PDB id 2AGH) as computed by RDPG method.

4.2 Number of pathways and number of involved residues for 20 main-chain conformations of PDZ domain family protein (PDB id 1D5G) as computed by RDPG method.

List of Figures

1.1 An amino acid contains an amine group, a carboxylic acid group, and a side-chain (R) that defines the amino acid type.

1.2 Amino acid types.

1.3 A protein is a linear chain made up of amino acids (residues) joined together by peptide bonds.

1.4 The main-chain of a protein is the sequence of all the successive N, Cα, C, and O atoms contributed by the amino acids. Side-chains are represented by Ri, i = 1, 2, ...

1.5 Secondary structure elements in a folded protein. Helices, sheets, and loops are shown in red, yellow, and green colors, respectively.

1.6 Dihedral angles φ, ψ, and χ in a protein chain. Here χ angles are shown for MET and SER residues.

1.7 Calcium-binding proteins play an important role in the nervous system. Calcium ions bound to a calcium-binding protein are shown in white color.

2.1 Some backbone conformations generated by seed sampling for the loops in 1TIB, 3SEB, 8DFR, and 1THW.

2.2 Positions of the middle Cα atom (red dots) in 100 loop conformations computed by seed sampling for four proteins: 1K8U, 1MPP, 1COA, and 1G5A.

2.3 Conformations of the loop in 1HML.

2.4 Twenty conformations of the loop in 1MPP generated by deforming a given seed conformation along randomly picked directions.

2.5 Deformation of the loop in 1COA by pulling the N atom (white dot) of THR 58 along a specified direction.

2.6 Volume reachable by the 5th Cα atom in the loop of 1MPP.

2.7 Use of deformation sampling to remove collisions involving side-chains.

2.8 Parvalbumin loop ALA51-ILE58: The apo and holo conformations recorded in the PDB are shown blue and green, respectively. The loop conformation in red is the conformation generated by seed sampling and recognized by FEATURE as a calcium-binding site. The black dot is the position of the calcium ion recorded in the PDB. The green and red dots are the calcium positions predicted by FEATURE for the loop conformations of the same color.

2.9 Grancalcin loop ALA62-ASP69. The holo conformation in the PDB file is shown in green. The conformation in red is generated using deformation sampling. FEATURE correctly recognizes the red conformation as a calcium-binding site, but fails to do so on the green conformation (see Section 2.5.4).

3.1 Validation of the SAMPLE-SELECT algorithm. Fraction of the 29 side-chains in alternate conformations in the reference structure correctly identified and modeled by the SAMPLE-SELECT algorithm (squares) and false positives (triangles) at resolution levels ranging from 1.1 Å to 2.4 Å are shown.

3.2 Summary of the performance of the SAMPLE-SELECT algorithm on experimental data. Shown on top are Rfree values of the reference models (squares), the SAMPLE-SELECT models (circles), and an ensemble of four independent models (crosses) as a function of resolution. The bottom panel is similar to Figure 3.1, but with experimental data. The PDB ids of the sixteen test structures are listed in order of decreasing resolution along the horizontal axis. The fraction of side-chains with alternate conformations in each of the sixteen reference PDB structures correctly modeled by the algorithm is represented by diamonds. Additional conformations are represented by squares. The triangles represent the relative improvement in Rfree. A positive value indicates a drop in Rfree.

3.3 Structural heterogeneity around residues TYR18 and ARG77 in protein with PDB id 2NLV. A) The single-conformer reference model (in cyan color). Positive density of the difference EDM (observed density minus calculated density) corresponding to the single model (in lime color) is contoured at 1.75σ. B) The multi-conformer model (in gray color) neatly models alternate conformations of ASN17, TYR18 and ARG77 in the positive density, albeit at the cost of a small misfit in the B conformation of ARG77. The algorithm does not find sufficient evidence for the ASN17 side-chain conformation of the reference model. Examination of the difference EDM reveals a substantial negative density in this area.

3.4 Residues 104-112 of 2R4I. Top panel: Conformations from chain A and B in the EDM at 0.7/0.3 occupancy. At high contour levels, atoms from the chain at lower occupancy are no longer contained within the iso-surface. Lower panel: PDB fragment from chain A (left) in green and PDB fragment from chain B (right) in cyan together with the calculated conformations.

4.1 Schematic of allosteric communication: (a) a protein with two binding sites, (b) allosteric effector binds to the protein at the allosteric binding site and alters activity at the other binding site through allosteric communication, and (c) binding of ligand at the other binding site is facilitated.

4.2 A CREB-binding protein. KIX domain (in blue color) of the protein binds domains of Mixed lineage leukemia (MLL, in light green color) and CREB/c-Myb (in orange color). Upon MLL binding, affinity for c-Myb binding increases twofold.

4.3 Strain propagation is akin to domino effect: (a)-(e) Strain successively propagates from one residue to another with each residue pushing a spatially-close residue, and (f) Residues A, i, j, and B form an allosteric pathway.

4.4 Arc computation: (a) Strain is created at residue j by deformation of side-chain at residue i to rotamer ri, (b) Strain is reduced at residue j by deformation of side-chain at residue j to rotamer rj, and (c) an arc is added in the residue deformation propagation graph (RDPG) connecting nodes corresponding to rotamers ri and rj.

4.5 Arc deletion. The node corresponding to rotamer ri (in red color) is not connected to any node corresponding to rotamers at residue f.

4.6 Reduced RDPG and allosteric pathway. The reduced RDPG (b) is obtained from the RDPG (a). Potential allosteric pathways are computed using graph traversal over the reduced RDPG. Residues A, i, j, and B form an allosteric pathway because a path of nodes connects A to B via i and j.

4.7 Allosteric studies on CREB-binding protein. Allosteric residues (in yellow color) are determined in NMR-relaxation dispersion experiments described in (a) [12] and (b) [43]. Deformation propagates from the residue PHE612 (in pink color). The red arrows show approximate allosteric pathways connecting allosteric residues. Residues determined from RDPG do not contain ALA654 and TYR658 (in red circles).

4.8 Allosteric pathways (in red arrows) determined from reduced RDPG propagate deformation from residue PHE612 to residue TYR650 in the CREB-binding protein (PDB id 2AGH). (a) The pathway (in red arrows) is similar to the pathway hypothesized in [12] and shows the propagation of deformation through Helix3, (b)-(c) The pathways show that the propagation of deformation can also take place through Helix2 as observed in [8]. Allosteric residues that form the pathways are shown in yellow color. The pathways are made up of residues: (a) PHE612, ILE611, ILE660, LYS659, GLU655, HIS651, TYR650; (b) PHE612, ILE611, LEU607, LYS606, LEU603, TYR650; (c) PHE612, LEU628, ILE611, LEU607, LYS606, LEU603, TYR650.

4.9 Allosteric pathways (in red arrows) determined from reduced RDPG propagate deformation from residue PHE612 to residues in a novel binding site in the CREB-binding protein (PDB id 2AGH). Pathways are similar to the pathways hypothesized in [43]. Allosteric residues that form the pathways are shown in yellow color. The pathways are made up of residues: (a) PHE612, THR614, LEU620, LYS621; (b) PHE612, LEU628, ARG624, ASP622; (c) PHE612, LEU628, ASN627, ARG623, GLU626.

4.10 Allosteric studies on PDZ domain family protein. Allosteric residues (in yellow color) are determined by (a) NMR relaxation-dispersion [39] and (b) thermodynamic mutant analysis [75]. Deformation propagates from residues VAL26 and HIS71 (in pink color) in (a) and (b) respectively. The red arrows show approximate allosteric pathways connecting allosteric residues. Residues determined from RDPG do not contain ALA39, ALA46, ILE52, and ALA69 (in red circles).

4.11 Allosteric pathways (in red arrows) determined from reduced RDPG propagate deformation from residue ILE20 to residues on two distal surfaces in the PDZ domain family protein (PDB id 1D5G). Allosteric residues that form the pathways are shown in yellow color. The pathways are made up of residues: (a) ILE20, VAL40; (b) ILE20, LEU18, VAL85; (c) ILE20, LEU18, LEU78, THR81; (d) ILE20, LEU18, LEU78, VAL61, VAL64 and ILE20, LEU18, LEU78, VAL61, VAL66.

4.12 Allosteric pathways (in red arrows) determined from reduced RDPG propagate deformation from residue HIS71 to residues on two distal surfaces in the PDZ domain family protein (PDB id 1D5G). Allosteric residues that form the pathways are shown in yellow color. The pathways are made up of residues: (a) HIS71, VAL75, ARG79, LEU18, VAL85; (b) HIS71, VAL75, ARG79, LEU78, LEU18, SER17, LYS13, ASN16, ASP15, SER48.

Chapter 1

Introduction

1.1 Motivation and Goals

That all men and women are equal is the most cherished and valued tenet of modern societies. But does equality really hold between a person blessed with a healthy biological system and a person plagued with diseases due to genetic defects? By fully understanding the human biological system at the molecular level, one may hope that it will eventually be possible to develop therapeutics that will give all humans the same desirable opportunity to live their lives to the fullest. This dissertation is a small step in this direction: it presents new computational tools aimed at facilitating this understanding. It focuses on proteins.

Proteins are the workhorses of all living organisms, as they play a key role in a wide diversity of vital functions such as metabolism, signal transmission in the nervous system, storage of energy, defense against intruders, and muscle buildup. They are large biomolecules that range from hundreds to tens of thousands of atoms in size. Each protein is a linear chain of smaller molecules, called amino acids. As Figure 1.1 shows, each amino acid consists of an amine group (around a nitrogen atom N), a carboxylic acid group (around a carbon atom C), and a side-chain connected to a carbon atom denoted by Cα. There are 20 different types of amino acids (Figure 1.2), with two amino acids of different types differing only by their side-chains. Two consecutive amino acids in a protein are connected by a peptide bond between the C atom of the carboxylic acid group of one amino acid and the N atom of the amine group of the next amino acid (Figure 1.3). The sequence of all the successive N, Cα, C, and O atoms contributed by the amino acids forms the protein’s main-chain (Figure 1.4).

Figure 1.1: An amino acid contains an amine group, a carboxylic acid group, and a side-chain (R) that defines the amino acid type.

A protein is highly dynamic. First, under normal physiological conditions, it folds into a compact three-dimensional structure, called the folded or native state of the protein. This structure consists of secondary structure elements (helices and strands) connected by loops (Figure 1.5). But this folded structure is not fully rigid. Once folded, the protein keeps some flexibility and deforms continuously. Both the shape of the protein’s folded structure and its flexibility are widely believed to be essential for the protein to achieve its function [3, 83, 85]. Indeed, a protein usually achieves its function by binding another molecule, called a ligand. To be possible, binding requires shape and chemical complementarity between the binding sites of the two molecules.

Figure 1.2: Amino acid types.

The main goal of this dissertation is to develop computational tools to study the flexibility of folded proteins in the context of three important biological problems where such flexibility is critical:

1) Loop sampling (Chapter 2): Loops between helices and/or strands are often highly flexible fragments of proteins that participate in binding sites. The problem is to determine if a given protein loop can achieve a shape where it may bind against a specified ligand.

2) Interpretation of electron density maps (Chapter 3): X-ray crystallography is the most common experimental method to determine the folded structure of a protein. A beam of X-rays strikes a crystal made of many aligned copies of the protein. The diffraction pattern leads to an electron density map (EDM) in which the positions of the protein atoms can be determined. However, if there is too much heterogeneity in the crystal, the EDM is blurred and extremely difficult to interpret. The problem is to model structural heterogeneity present in the crystal.

3) Determination of allosteric pathways (Chapter 4): An allosteric protein is a protein whose shape changes when it binds an effector ligand at the protein’s allosteric site. This change alters the ability of the protein to bind another molecule at its active site where it performs its function. The problem is to identify the sequences of side-chains through which the change in shape propagates from the allosteric site to the active site.

Figure 1.3: A protein is a linear chain made up of amino acids (residues) joined together by peptide bonds.

Figure 1.4: The main-chain of a protein is the sequence of all the successive N, Cα, C, and O atoms contributed by the amino acids. Side-chains are represented by Ri, i = 1, 2, ...

Figure 1.5: Secondary structure elements in a folded protein. Helices, sheets, and loops are shown in red, yellow, and green colors, respectively.

In all three problems, our tools are based on both efficient sampling algorithms that allow us to represent structural heterogeneity in a protein’s folded state by a collection of sampled conformations and selection algorithms that allow us to reliably select the sampled conformations that provide a solution to the problem. Although our tools are specific to each problem, they share the same sample and select approach, as well as several similar concepts and techniques, like efficient kinematic modeling (including inverse kinematics), efficient collision detection among atoms, and optimization techniques.
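The sample and select approach shared by the three tools can be pictured with a deliberately generic sketch. The Python code below is only an illustration of the pattern, not the algorithms of Chapters 2 to 4: the names sample_and_select, toy_sample, toy_feasible, and toy_score, the two-angle toy conformation, the forbidden band, and the target angles are all hypothetical and chosen for brevity.

```python
import random
from typing import Callable, List, Tuple, TypeVar

C = TypeVar("C")  # a conformation, in whatever representation the sampler uses


def sample_and_select(
    sample: Callable[[random.Random], C],
    is_feasible: Callable[[C], bool],
    score: Callable[[C], float],
    n_samples: int,
    n_keep: int,
    seed: int = 0,
) -> List[C]:
    """Draw candidates, keep the feasible ones, return the n_keep best by score."""
    rng = random.Random(seed)
    feasible = []
    for _ in range(n_samples):
        candidate = sample(rng)          # sampling step
        if is_feasible(candidate):       # e.g. a collision test
            feasible.append(candidate)
    feasible.sort(key=score)             # selection step: lower score is better
    return feasible[:n_keep]


# Toy problem (hypothetical): a "conformation" is a pair of angles in degrees.
def toy_sample(rng: random.Random) -> Tuple[float, float]:
    return (rng.uniform(-180.0, 180.0), rng.uniform(-180.0, 180.0))


def toy_feasible(conf: Tuple[float, float]) -> bool:
    # Pretend that first angles within 30 degrees of zero cause a collision.
    return abs(conf[0]) > 30.0


def toy_score(conf: Tuple[float, float]) -> float:
    # Pretend that the selection criterion prefers angles near (60, -45).
    return abs(conf[0] - 60.0) + abs(conf[1] + 45.0)


selected = sample_and_select(toy_sample, toy_feasible, toy_score,
                             n_samples=1000, n_keep=5)
```

In this sketch the problem-specific parts are confined to the three callables, which mirrors the statement above that the tools differ per problem while sharing the overall approach.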

1.2 Structural Heterogeneity of a Folded Protein

As mentioned above, a folded protein remains flexible. Its continuous deformation is the aggregate result of complex interactions among its individual atoms. Various deformations occur at timescales that span several orders of magnitude. Thermal fluctuations, which occur in picoseconds (10⁻¹² seconds), are small-amplitude, uncorrelated, harmonic motions of the individual atoms. They are mostly random, but occasionally provide the protein enough momentum to overcome energy barriers between meta-stable states (which roughly correspond to basins of attraction of the energy landscape). In contrast, conformational deformations closely related to the protein’s function occur in microseconds to milliseconds. These “slow” deformations are usually large-scale, correlated, anharmonic motions that correspond to transitions between meta-stable states. For example, they may occur between binding and non-binding states. In this dissertation we are mainly interested in modeling structural heterogeneity associated with such slow deformations.

There are two well-established ways [70, 91] to represent the conformations of a protein, the atomic and the linkage representation:

• In the atomic representation, a conformation is described by a list of the 3-D coordinates of the atoms. This representation makes it possible to model deformations at all timescales and is typically used to perform Molecular Dynamics (MD) simulation. It yields a very high-dimensional conformational space.

• In the linkage representation, a conformation is represented by a list of dihedral angles, usually the φ and ψ angles, respectively around the N–Cα and Cα–C bonds along the protein’s main-chain, and the χ angles around the rotatable bonds in the side-chains (Figure 1.6). This representation is based on the observation that, once the high-frequency thermal fluctuations have been filtered out (through averaging), bond lengths and bond angles are approximately constant. It yields a conformational space of smaller dimensionality than the atomic representation. A conformation represented using the linkage model can be seen as the average of several conformations in the atomic model over a short interval of time.
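The two representations are linked by a simple computation: each dihedral angle of the linkage model is determined by the coordinates of four consecutive atoms of the atomic model. The Python sketch below shows one standard way to compute a signed dihedral angle. It is an illustration under stated assumptions, not code from this dissertation: the function names dihedral_deg and phi_psi are hypothetical, atoms are plain (x, y, z) tuples, and φ and ψ are taken with the usual atom quadruples C(i-1), N(i), Cα(i), C(i) and N(i), Cα(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane (p0, p1, p2)
    n2 = _cross(b2, b3)          # normal of the plane (p1, p2, p3)
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """Main-chain dihedrals of one residue from five atom positions."""
    phi = dihedral_deg(c_prev, n, ca, c)      # rotation about N-CA
    psi = dihedral_deg(n, ca, c, n_next)      # rotation about CA-C
    return phi, psi


# Four made-up points: the first and last atoms lie on the same side (cis).
cis = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
# Same bond, last atom on the opposite side (trans).
trans = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
# Last atom rotated a quarter turn about the bond.
quarter = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

A planar cis arrangement gives 0 degrees and a planar trans arrangement gives 180 degrees in magnitude. Using atan2 rather than an arccosine keeps the sign of the angle and avoids numerical trouble near 0 and 180 degrees.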

In this dissertation, we will represent structural heterogeneity by sampling plausible conformations using the linkage model. In Chapter 2 (loop sampling) we will aim at distributing the sampled conformations of a loop broadly across the folded state of a protein in order to explore the entire space of possible deformations of the loop. In Chapter 3 (interpretation of electron density maps), our goal will be to sample conformations that are well represented (i.e., occur frequently) in a crystal. In Chapter 4 (determination of allosteric pathways), we will sample favorable side-chain conformations to identify key interactions among them.

Figure 1.6: Dihedral angles φ, ψ, and χ in a protein chain. Here χ angles are shown for MET and SER residues.

1.3 Why Model Structural Heterogeneity?

Proteins usually achieve their functions by binding other molecules, called ligands, such as other proteins, small molecules, or ions (Figure 1.7). Binding requires good geometric and chemical complementarity of the binding sites in both molecules. For many years, the prevailing binding model was the key-and-lock model, which assumes that both molecules are rigid. Recently, two more realistic models have emerged, the induced fit [56, 79] and the conformation selection models [50]:

• Induced fit model: The induced fit model proposes that binding of a ligand to a substrate induces a change of conformation in the binding site to facilitate the binding process. The model is an extension of the key-and-lock model, except that the lock changes shape to accommodate the key.

• Conformation selection model: The conformation selection model proposes that a protein exists as an inter-converting ensemble of conformations at equilibrium. Upon binding, the equilibrium shifts towards conformations that favor formation of complexes with ligand conformations.

Most likely, both models contain part of the truth. The two molecules must have good initial compatibility to bind (conformation selection model), but this compatibility may be perfected during the binding process (induced fit model). Knowledge of a protein’s structural heterogeneity is critical to both models. Induced conformational change, for example in allosteric proteins upon binding of an effector ligand, involves slow deformations of the protein. Inter-conversion between meta-stable conformations, for example in calcium-binding loops, forms the basis of the conformation selection model and is a direct product of structural heterogeneity in proteins. Therefore, to understand a protein’s function, modeling of its structural heterogeneity is crucial.

Figure 1.7: Calcium-binding proteins play an important role in the nervous system. Calcium ions bound to a calcium-binding protein are shown in white.

1.4 Computational Challenge

The goal of conformation sampling is to quickly compute low-energy folded conformations. The computational challenge is twofold. On the one hand, the space of low-energy folded conformations is very small. Because a folded protein is highly compact, even small conformational changes can result in atomic collisions and therefore in high-energy conformations. On the other hand, a full energy function is made of many terms, usually representing the electrostatic and van der Waals interactions among pairs of atoms. Testing whether a sampled conformation is low-energy requires computing this function, which is costly. Throughout this dissertation we use a variety of techniques to address this challenge.

First, we simplify the energy function. The dominant energy terms in the folded state of a protein are the van der Waals (vdW) energy terms, as the atoms are compactly packed together. Although electrostatic and solvent interaction terms may affect motion kinetics, they have relatively small impact on conformation reachability. For this reason, our sampling algorithms, which do not deal with kinetics, consider only the vdW terms of the energy function, either by directly avoiding collisions among atoms or by minimizing the total vdW potential. Folded conformations with no collisions span a very small subset of the conformation space. For this reason, sampling plausible folded conformations is a difficult computational problem. But this is exactly the same reason why we can hope to represent the folded state of a protein by a collection of sampled conformations of reasonable size.

Second, we use sampling protocols and strategies that include (a) prioritization of constraints, e.g., sampling of closed and collision-free conformations of loops involves prioritization of closure constraints and collision-avoidance, and (b) alternation of sampling and selection of samples, e.g., interpretation of EDMs involves alternating between sampling a set of candidate conformations of a protein fragment and selecting the best conformations that explain the EDM.

Third, we use efficient techniques to achieve various tasks in our computational tools. Fast detection of atomic collisions is achieved using a grid-based method. The closure constraints are satisfied by combining analytical inverse kinematics (IK) methods with motion in the null space of protein fragments.

1.5 Main Contributions

Our contributions are twofold: general and problem-specific.

a) General contribution: This dissertation demonstrates that a general approach combining fast, massive conformation sampling with precise selection of relevant samples can solve several important biological problems. It also demonstrates that geometric algorithms based on vdW volume exclusion of atoms and linkage kinematics provide an efficient way to generate good samples.

b) Problem-specific contributions: The contributions to the three biological problems stated in Section 1.1 are as follows:

1) Loop sampling (Chapter 2): This dissertation develops a deformation sampling algorithm that finely samples the feasible null motion space of protein loops. Along with a seed sampling algorithm developed by Peggy Yao, the deformation sampling algorithm provides an effective tool to explore the feasible motion space of protein loops. The algorithm is implemented in a software toolkit called loopTK that is available at http://simtk.org/home/looptk.

2) Interpretation of electron density maps (Chapter 3): This dissertation develops a sample-select algorithm that models structural heterogeneity present in a protein crystal. The algorithm automatically computes an ensemble (collection) of atomic models and associated occupancies (the frequency of each atomic model in the protein crystal) that near-optimally fits the EDM. The algorithm is implemented in a software tool called qFit that is available at http://smb.slac.stanford.edu/qFitServer.

3) Determination of allosteric pathways (Chapter 4): This dissertation develops an algorithm that computes a sequence of residues that propagate deformations in the allosteric protein upon binding of an effector ligand. The algorithm is based on the concept of a residue deformation propagation graph (RDPG). This graph represents pairwise interactions between residues due to vdW volume exclusion. The main focus is on deformation due to structural heterogeneity in the side-chains.

1.6 Relation to Previous Work

The computational tools developed in this dissertation rely on efficient sampling and selection of samples. Sampling-based methods have proven effective over the years for analyzing the connectivity of the high-dimensional configuration spaces of moving objects such as robots [49, 59, 68, 69], digital characters [50], and mechanical assemblies [45].

Sampling methods for protein conformations have been broadly studied as well. They fall into two broad categories. In energy-based sampling methods the sampling process is driven by an energy minimization objective. Monte-Carlo based sampling methods [47, 74, 91] sample low-energy conformations with high probability. High-energy conformations are also sampled, but with low probability, to occasionally overcome energy barriers that separate meta-stable states of the protein. Molecular Dynamics methods [35, 82, 91] model the forces that the atoms of the protein experience under the effect of a potential field. These forces are then used to simulate motion. The outcome is a motion trajectory sampled at a fine timestep. A related, simplified method, based on normal-mode analysis [73, 96], models a protein as a set of spheres (atoms) connected with virtual springs. The normal modes of the harmonic vibrations of this spring-mass system are used to simulate motion. However, the method is limited to the study of harmonic motion.

The other category of sampling methods is based on kino-geometric sampling. These methods rely on simplified kinematics of the protein chains (e.g., the linkage model) and consider only geometric constraints, like avoidance of atomic collisions and the loop closure constraint. The degrees of freedom (DOFs) in the kinematic chain are sampled and the resulting conformations are checked for collisions and other geometric constraints. Kino-geometric methods have been used in [53] for sampling of folded protein conformations by detecting rigid subsets using H-bonds and in [20] for sampling of closed conformations of protein loops. The computational tools developed in this dissertation follow the kino-geometric sampling approach.

Chapter 2

Exploring the Motion Space of Protein Loops∗

Several applications require exploring the motion space of a flexible fragment (usually, a loop) of a protein. For example, in order to bind a small ligand, a fragment may have to achieve a certain shape [82]. Exploring the loop’s motion space can allow one to determine whether this shape is possible. Often the shape is not known in advance. Instead, it can be “recognized” by evaluating its energy, or by testing it using a trained classifier [104]. Incorporating such flexibility in binding simulation algorithms is a major challenge [95]. In X-ray crystallography experiments, EDMs are often blurred due to heterogeneity in the crystal, resulting in an initial model with missing loops as described in Chapter 3. Similarly, in homology modeling [90], only parts of a protein structure can be reliably inferred from known structures with similar sequences. Determination of structures of loops is therefore required for several applications and this necessitates exploring their motion spaces.

In this chapter, we model protein loops as kinematic chains and we describe sampling methods for exploring their motion spaces. Sampling the motion space of a protein loop

*The computational tools developed in this chapter have been previously presented in [107].

13 CHAPTER 2. MOTION SPACE OF PROTEIN LOOPS 14

requires the simultaneous satisfaction of two constraints:

• Loop closure: The two termini of the loop must be correctly connected to their an- chors in the rest of the protein.

• Collision-avoidance: No two atoms should be colliding.

We present a deformation sampling algorithm which, combined with a seed sampling algorithm, efficiently samples closed and collision-free conformations of a loop.

The kinematic model of the loop used in this chapter is as follows. A loop L is defined as a sequence of p > 3 consecutive residues in a protein P, such that neither of the two termini of L is also a terminus of P. The residues of L are numbered from 1 to p, starting from the N terminus. The backbone of L is modeled as a kinematic linkage whose DOFs are the n = 2p torsion angles φi and ψi around the bonds N–Cα and Cα–C, in residues i = 1, ..., p. The rest of the protein, denoted by P \L, is assumed rigid. Let LB denote the backbone of L. Here, it includes the Cβ and O atoms respectively bonded to the Cα and C atoms in the backbone.

A Cartesian coordinate frame $\Omega_1$ is attached to the N terminus of L and another frame $\Omega_2$ to its C terminus. When $L_B$ is connected to the rest of the protein, i.e., when it adopts a closed conformation, the pose (position and orientation) of $\Omega_2$ relative to $\Omega_1$ is fixed. This pose is denoted by $\Pi_g$. However, if the values of $\phi_i$ and $\psi_i$, i = 1 to p, are arbitrarily picked, then in general an open conformation of $L_B$ is obtained, where the pose of $\Omega_2$ differs from $\Pi_g$. The set Q of all open and closed conformations of $L_B$ is a space of dimensionality n = 2p. The subset $Q_{\mathrm{closed}}$ of closed conformations is a subspace of Q of dimensionality n − 6. Let $\Pi(q)$ denote the pose of $\Omega_2$ relative to $\Omega_1$ when the conformation of $L_B$ is q ∈ Q. The function $\Pi$ and its inverse $\Pi^{-1}$ are the “forward” and “inverse” kinematics maps of $L_B$, respectively.

A conformation of $L_B$ is collision-free if and only if no two atoms, one in $L_B$ and the other in $L_B$ or $P \setminus L$, have centers closer than ε times the sum of their van der Waals radii, where ε is a constant in (0, 1), usually on the order of 0.75. The set of closed and collision-free conformations of $L_B$ is denoted by $Q^{\mathrm{free}}_{\mathrm{closed}}$. It has the same dimensionality as $Q_{\mathrm{closed}}$, but its volume is usually a small fraction of that of $Q_{\mathrm{closed}}$.
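The collision criterion above can be written as a short predicate. The following is a minimal sketch, not the loopTK implementation: the function name `collides` and the example radii are assumptions made for illustration, and a real test would typically also skip pairs of atoms that are covalently bonded.

```python
import math

def collides(center_a, radius_a, center_b, radius_b, eps=0.75):
    """Return True if two atoms collide under the scaled-radius criterion.

    Two atoms collide when the distance between their centers is smaller
    than eps times the sum of their van der Waals radii.
    """
    threshold = eps * (radius_a + radius_b)
    return math.dist(center_a, center_b) < threshold  # math.dist needs Python 3.8+

# Illustrative atoms: (center in angstroms, vdW radius in angstroms).
# The radii are assumed values, not taken from the text.
carbon = ((0.0, 0.0, 0.0), 1.7)
nitrogen_far = ((3.0, 0.0, 0.0), 1.55)
nitrogen_near = ((2.0, 0.0, 0.0), 1.55)

print(collides(*carbon, *nitrogen_far))   # distance 3.0 vs threshold 2.4375
print(collides(*carbon, *nitrogen_near))  # distance 2.0 vs threshold 2.4375
```

With ε = 0.75 the atoms are allowed to overlap by up to a quarter of the sum of their radii before the pair is reported as colliding.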

2.1 Related Work

The problem considered in this chapter is a version of the “loop closure” problem studied in [14, 21, 30, 54, 63, 103]. Several works have specifically focused on kinematic closure. Analytical Inverse Kinematics (IK) methods are described in [21, 103] to close a fragment of 3 residues. For longer fragments, iterative techniques have been proposed, like the popular CCD (Cyclic Coordinate Descent) [14] and the “null space” technique [5, 93]. Several of these techniques are used in the sampling algorithms developed in this chapter. In particular, the seed sampling algorithm applies the analytical IK method described in [21] in a new way to close loops with more than 3 residues and the deformation sampling algorithm uses the null space technique to deform loops without breaking closure.

Procedures to sample closed collision-free conformations of loops by varying dihedral angles have been proposed in [20, 30, 54]. The goal of RAPPER [30] and the hierarchical method described in [54] is to generate near-native conformations by minimizing an energy function. Instead, the goal of our seed and deformation sampling algorithms and the one presented in [20] is to explore the closed collision-free conformation space of a loop by sampling conformations broadly distributed across this space. This ability to explore a conformation space is critical for a number of applications. For example, the conformation selection theory [9] suggests that a protein and a ligand exist in an ensemble of deforming folded conformations and that the most compatible conformations “recognize” each other and bind together. Binding conformations of proteins often differ significantly from native ones. To predict a protein’s function one must be able to sample these non-native but biologically relevant conformations. As another example, an electron-density map (EDM) obtained from an X-ray crystallographic experiment can be particularly difficult to interpret when the protein appears in the crystalline sample in multiple states. An ensemble of sampled conformations may then be needed to provide a satisfactory interpretation of the EDM [40, 72]. Nevertheless, the deformation sampling algorithm also allows energy minimization, when this is desirable. We show in Section 2.5.4 that the seed and deformation sampling algorithms can generate biologically important conformations.

RAPPER [30] iteratively generates a loop conformation from its N terminus toward its C terminus by selecting the values of the dihedral angles φ and ψ at random from a predefined discrete table of values. It also checks that the Cα atom in each residue is sufficiently close to the loop’s C anchor on the protein. In the end, to close the gap between the loop’s last residue and its anchor on the protein, RAPPER runs an iterative minimization procedure to reduce this gap. Unlike RAPPER, the seed sampling algorithm does not select dihedral angles from discrete tables, but picks them according to continuous probability distributions input by the user. In addition, the algorithm retains a sufficient number of dihedral angles (in the middle portion of the loop) to make it possible to close the loop using an exact IK method.

Like the seed sampling algorithm, the method presented in [54] also exploits the idea of loop decomposition. It breaks a loop into two fragments, then independently samples collision-free conformations for each fragment (by sampling dihedral angles starting from their respective anchors), and finally generates closed conformations by bridging close-enough fragment conformations. Like RAPPER, this method selects dihedral angles from predefined discrete tables. It uses IK and collision detection techniques that are very different from the techniques used in the seed and deformation sampling algorithms. Both RAPPER and this method have been tested on relatively short loops, between 2 and 12 residues in length.

The Random Loop Generator (RLG) method described in [20] is used to study the potential mobility of a loop in the presence and absence of certain side-chains. It successively samples closed conformations that it later tests for collisions. To sample closed conformations it divides the loop backbone into an “active” and a “passive” fragment. The latter has exactly 3 residues (hence, 6 dihedral angles). The dihedral angles in the active fragment are successively sampled at random using a geometric algorithm that increases the likelihood that a closed conformation will eventually be obtained. The 6 dihedral angles of the passive fragment are used to close the loop using an IK procedure. The generated closed conformations are then tested for collisions. To explore the conformation space of the loop, a tree of sampled conformations is built starting from a known structure (typically, the native structure), which is the root of the tree. Each node of the tree is a conformation generated using RLG in a neighborhood of its parent in the tree. The deformation sampling algorithm, which also generates each new conformation in the neighborhood of an already sampled conformation, has similarities with this method. However, unlike RLG, the deformation sampling algorithm perturbs the dihedral angles in such a way that it does not break closure.

Some sampling procedures try to sample conformations using libraries of fragments obtained from previously solved structures [27, 63, 101, 102]. For example, a divide-and-conquer approach is described in [101] that generates a database of fragments of different residue lengths and types, by using a Ramachandran plot distribution. These fragments are then concatenated to build conformations of a longer loop. However, collisions are not taken into account during this process.

Other works sample loop conformations directly by minimizing an energy function [26, 30, 37, 54, 93] or running a molecular dynamics simulation [10] with the goal of identifying loop fragments close to the native structure. However, as discussed above, in a number of applications it is preferable to explore the closed collision-free conformation space of a loop.

In the seed sampling and deformation sampling algorithms, collision detection is done using the efficient grid method described in [46]. A similar detection method is also used in RAPPER [30].

2.2 Seed Sampling Algorithm¹

¹This algorithm was developed primarily by Peggy Yao.

The goal of the seed sampling algorithm is to generate conformations of $L_B$ broadly distributed over $Q^{\mathrm{free}}_{\mathrm{closed}}$. The challenge comes from the interaction between the kinematic closure and collision-avoidance constraints. Computational tests (see Section 2.5) show that the algorithm (hereafter called the naive algorithm) that first samples conformations from $Q_{\mathrm{closed}}$ and then rejects those with collisions is often too time consuming, except for short loops, due to its huge rejection ratio. The reverse algorithm, which samples the angles $\phi_i$ and $\psi_i$ of $L_B$ to avoid collisions, will inevitably end up with open conformations, since $Q_{\mathrm{closed}}$ has lower dimensionality than Q.

These insights motivated a prioritized constraint-satisfaction approach, hereafter called the prioritized approach. $L_B$ is partitioned into three segments, the front-end F, the mid-portion M, and the back-end B. F starts at the N terminus of $L_B$ and B ends at its C terminus. M is the segment between them. Due to the immediate proximity of atoms in $P \setminus L$, the conformations of F and B are more limited by the collision-avoidance constraint than by the closure constraint; so, the dihedral angles are sampled in F and B to avoid collisions, ignoring the closure constraint. Then, for any pair of conformations of F and B, the possible conformations of M are mainly limited by the closure constraint; so, the naive algorithm is used to sample conformations of M, by running an IK procedure to close the gap between F and B and testing the collision-avoidance constraint afterward. In this way, the prioritized approach reduces the application of the naive approach to a short fragment in the middle part (M) of the loop.

The length of M must be large enough for the IK procedure to succeed with high probability, but not too large since collision-avoidance is only tested afterward. In the software implementation of the algorithm, the number of residues in M is usually set to half of that of $L_B$ or to 4, whichever of these two numbers is larger. The numbers of residues in F and B are then chosen to be equal (±1). Tests show that these choices are close to optimal on average for a wide range of loops. For unusually long loops, it may be suitable to set an upper bound on the length of M. The dihedral angles φ and ψ in the three fragments F, M, and B are selected to generate conformations of $L_B$ broadly distributed over $Q^{\mathrm{free}}_{\mathrm{closed}}$.
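The segment-length rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the loopTK code: the function name `partition_loop`, rounding half of an odd length upward, and giving the extra residue to the back-end are all choices made here for the example. The threshold of 8 residues below which the loop is not split is the one reported in Section 2.5.1.

```python
import math

def partition_loop(p, mid=None, min_mid=4, split_threshold=8):
    """Return (front, mid, back) residue counts for a loop of p residues.

    mid overrides the default mid-portion length when given.
    """
    if p < split_threshold:
        # Short loops are not split: the whole loop is treated as M.
        return 0, p, 0
    if mid is None:
        # Default rule: half of the loop or min_mid, whichever is larger.
        # Rounding up for odd p is an assumption of this sketch.
        mid = max(math.ceil(p / 2), min_mid)
    rest = p - mid
    front = rest // 2     # F and B differ by at most one residue
    back = rest - front   # assumption: the extra residue goes to B
    return front, mid, back

print(partition_loop(12))         # 12-residue loop, default rule
print(partition_loop(12, mid=8))  # same loop with a longer mid-portion
```

For a 12-residue loop the default rule gives 3, 6, and 3 residues for F, M, and B, which matches the split reported for the loop in 1COA in Section 2.5.1.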

2.2.1 Sampling front/back-end conformations

Consider the front-end F . The angles φ and ψ closest to the fixed terminus of F are the most constrained by possible collisions with the rest of the protein P \L. So, the angles are sampled in the order in which they appear in F , that is φ1, ψ1, φ2, etc. In this order, each angle φi (resp., ψi) determines the positions of the next two atoms Cβi and Ci (resp., the next three atoms Oi,Ni+1 and Cαi+1). The angle is sampled so that these atoms do not collide with any atom in P \L or any preceding atom in F . Its value is picked at random, either uniformly or according to a user-input probabilistic distribution (e.g., one based on Ramachandran tables). If no value of the angle prevents the two or three atoms it governs from colliding with other atoms, the algorithm backtracks and re-samples a previously sampled angle. Clash-free conformations of the back-end B are sampled in the same way, by starting from its fixed C terminus and proceeding backward.
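The angle-by-angle sampling with backtracking described above can be sketched as follows. This is a simplified illustration, not the author's implementation: `is_clash_free` is a hypothetical user-supplied predicate standing in for the placement and collision test of the two or three atoms governed by the newest angle, and "no value of the angle works" is approximated by a fixed number of uniform random tries.

```python
import random

def sample_angles(num_angles, is_clash_free, tries_per_angle=20,
                  max_backtracks=1000, rng=None):
    """Sample dihedral angles in order, backtracking on dead ends.

    is_clash_free(prefix) receives the angles chosen so far, with the
    candidate angle last, and returns True if the atoms placed by that
    candidate do not collide. Returns a list of angles in degrees, or
    None if no clash-free assignment was found.
    """
    rng = rng or random.Random()
    angles = []
    backtracks = 0
    while len(angles) < num_angles:
        for _ in range(tries_per_angle):
            candidate = rng.uniform(-180.0, 180.0)
            if is_clash_free(angles + [candidate]):
                angles.append(candidate)
                break
        else:
            # No acceptable value found: drop the previously sampled
            # angle so that it is re-sampled on the next pass.
            if not angles or backtracks >= max_backtracks:
                return None
            angles.pop()
            backtracks += 1
    return angles

# Toy predicate (assumption): the newest angle must stay within +/-170 degrees.
front_end = sample_angles(6, lambda prefix: abs(prefix[-1]) <= 170.0,
                          rng=random.Random(0))
print(len(front_end))
```

A user-input distribution, such as one derived from Ramachandran tables, would replace the call to `rng.uniform`.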

2.2.2 Sampling mid-portion conformations

Given two non-colliding conformations of F and B such that the gap between them does not exceed the maximal length that M can achieve, a conformation of M is sampled as follows. The values of the φ and ψ angles in M are picked at random, uniformly or according to a given distribution. This leads to a conformation q of M that is connected to F at one end and open at the other end. To close the gap between M and B, the IK method described in [21] is used. This method solves the IK problem analytically, for any sequence of residues in which exactly three pairs of (φ, ψ) dihedral angles are allowed to vary. These pairs need not be consecutive.

Let ANALYTICAL-IK(q, i, j, k) denote the IK method, where argument q is the initial open conformation of M and arguments i, j, and k are the integers identifying the three residues that contain the pairs of dihedral angles that are allowed to vary. Experiments show that, on average, the IK method is the most likely to succeed in closing the gap when one pair is the last one in M and the other two are distributed in M. Let r and s denote the integers identifying the first and last residue of M in LB. As the IK method is extremely fast, ANALYTICAL-IK(q, i, j, s) is called for all i = r, ..., s − 2 and j = i + 1, ..., s − 1, in a random order, until a closed conformation of M has been generated. If this conformation tests collision-free, then the seed sampling algorithm constructs a closed collision-free conformation of LB by concatenating the conformations of F , M, and B. If the above operations fail to generate a closed collision-free conformation of M, then they are repeated (with new initial values for the φ and ψ angles in M) until a predefined maximal number of iterations have been performed. Experiments were also performed with iterative IK techniques, like CCD, to close the gap between M and B. But in the software implementation they were found to be slower than the above algorithm based on analytical IK.
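The enumeration of residue triples described above can be sketched as follows. The analytical IK solver itself is not reproduced here: `analytical_ik` is a hypothetical callable standing in for ANALYTICAL-IK, assumed to return a closed conformation or `None` on failure, and the function names are chosen for this example only.

```python
import random

def ik_triples(r, s, rng=None):
    """List all (i, j, s) with i = r..s-2 and j = i+1..s-1, in random order."""
    rng = rng or random.Random()
    triples = [(i, j, s) for i in range(r, s - 1) for j in range(i + 1, s)]
    rng.shuffle(triples)
    return triples

def close_mid_portion(q, r, s, analytical_ik, rng=None):
    """Try residue triples until analytical_ik closes the mid-portion.

    Returns the first closed conformation found, or None if every
    triple fails for this initial open conformation q.
    """
    for i, j, k in ik_triples(r, s, rng):
        closed = analytical_ik(q, i, j, k)
        if closed is not None:
            return closed
    return None

# A mid-portion spanning residues 4..9 offers 10 candidate triples.
print(len(ik_triples(4, 9, random.Random(0))))
```

When `close_mid_portion` returns `None`, or when the closed conformation fails the collision test, the caller would re-sample the φ and ψ angles in M and try again, up to the predefined iteration limit.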

2.2.3 Placing side-chains

For each conformation of $L_B$ sampled from $Q^{\mathrm{free}}_{\mathrm{closed}}$, SCWRL3 [15] is used to place the side-chains. One way is to compute the placements of the side-chains in $L_B$ given the placements of the side-chains in $P \setminus L$. Alternatively, the placements of all the side-chains in the protein can be (re-)computed. In each case, SCWRL3 minimizes an energy function that contains volume-exclusion terms. But it does not fully guarantee that the conformations of the side-chains will be collision-free. If needed, deformation sampling can be used to slightly deform the conformation of $L_B$ in order to eliminate the collisions (see Section 2.5.3).

2.3 Deformation Sampling Algorithm

2.3.1 Overview

The deformation sampling algorithm is given a “seed” conformation q in $Q^{\mathrm{free}}_{\mathrm{closed}}$. It first selects a vector in the tangent space $TQ_{\mathrm{closed}}(q)$ of $Q_{\mathrm{closed}}$ at q. By definition, any vector in this space is a velocity vector $[\dot{\phi}_1, \ldots, \dot{\psi}_p]^T$ that maps to the null velocity of $\Omega_2$ (relative to $\Omega_1$); hence, it defines a direction of motion that does not instantaneously break loop closure. A new conformation of $L_B$ is then computed as $q' = q + \delta q$, where $\delta q$ is a short vector in $TQ_{\mathrm{closed}}(q)$. Since the tangent space is only a local linear approximation of $Q_{\mathrm{closed}}$ at q, the closure constraint is in fact slightly broken at $q'$. So, ANALYTICAL-IK($q'$, p − 2, p − 1, p) is called to bring the frame $\Omega_2$ back to its goal pose $\Pi_g$. Since $q'$ is already almost closed, the six DOFs used by ANALYTICAL-IK are the angles $\phi_{p-2}, \ldots, \psi_p$ corresponding to the last three residues of $L_B$ (recall that n = 2p). If ANALYTICAL-IK generates several solutions for these angles, the values closest to those in $q + \delta q$ are selected. Finally, the atoms in $L_B$ are tested for collisions among themselves and with the rest of the protein. If a collision is detected, the algorithm exits with failure.

The deformation sampling algorithm may be run several times with the same seed conformation q to explore the subset of $Q^{\mathrm{free}}_{\mathrm{closed}}$ around q. Alternatively, each run may use the conformation generated at the previous run as the new seed conformation to generate a “pathway” in the set $Q^{\mathrm{free}}_{\mathrm{closed}}$. More generally, one may also build a tree of pathways rooted at a seed conformation or a forest of trees rooted at multiple seeds, e.g., to optimize an objective function.
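One step of this procedure can be sketched as follows. The sketch is schematic and rests on several assumptions: `close_loop` is a hypothetical callable standing in for ANALYTICAL-IK on the last three residues (it returns a list of closed candidates), `is_collision_free` is a hypothetical collision test, the tangent-space basis is passed in as a matrix whose columns span the tangent space, and angle wrap-around is ignored when distances between conformations are compared.

```python
import numpy as np

def deformation_step(q, null_basis, close_loop, is_collision_free,
                     step=0.01, rng=None):
    """Attempt one deformation-sampling step from the seed q.

    Returns a new closed, collision-free conformation, or None on failure.
    """
    rng = rng or np.random.default_rng()
    # Pick a random direction in the tangent space and scale it to a
    # short increment delta_q.
    coeffs = rng.standard_normal(null_basis.shape[1])
    direction = null_basis @ coeffs
    delta_q = step * direction / np.linalg.norm(direction)
    q_open = q + delta_q               # closure is slightly broken here
    candidates = close_loop(q_open)    # restore closure exactly
    if not candidates:
        return None
    # Keep the IK solution whose angles are closest to q + delta_q.
    q_new = min(candidates, key=lambda c: np.linalg.norm(c - q_open))
    return q_new if is_collision_free(q_new) else None
```

Calling the function repeatedly with the same seed explores a neighborhood of that seed, while feeding each result back in as the next seed traces a pathway.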

2.3.2 Computation of a basis of the tangent space

To define a direction in $TQ_{\mathrm{closed}}(q)$, a basis for this space is first computed. This can be done as follows [5]. Let $J(q)$ be the $6 \times n$ Jacobian matrix that maps the velocity $\dot{q} = [\dot{\phi}_1, \ldots, \dot{\psi}_p]^T$ of the dihedral angles in $L_B$ at q to the velocity $[\dot{x}, \dot{y}, \dot{z}, \dot{\alpha}, \dot{\beta}, \dot{\gamma}]^T$ of $\Omega_2$, i.e.:

$$[\dot{x}, \dot{y}, \dot{z}, \dot{\alpha}, \dot{\beta}, \dot{\gamma}]^T = J(q)\,\dot{q}.$$

$J(q)$ can be computed analytically using techniques presented in [17]. For simplicity, assume that J has full rank (i.e., 6). A basis of $TQ_{\mathrm{closed}}(q)$ is built by first computing the Singular Value Decomposition $U \Sigma V^T$ of $J(q)$, where U is a $6 \times 6$ unitary matrix, $\Sigma$ is a $6 \times n$ matrix with non-negative numbers on the diagonal and zeros off the diagonal, and V is an $n \times n$ unitary matrix [42]. Since rows 7, ..., n of $V^T$ do not affect the product $J(q)\,\dot{q}$, their transposes form an orthonormal basis $N(q)$ of $TQ_{\mathrm{closed}}(q)$.
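The SVD-based construction of the tangent-space basis can be sketched with NumPy as follows. The function name `tangent_basis` and the rank tolerance are assumptions of this sketch, and a random matrix stands in for the actual Jacobian, whose analytical computation is not shown.

```python
import numpy as np

def tangent_basis(J, tol=1e-10):
    """Return an orthonormal basis of the null space of J as columns.

    For a full-rank 6 x n Jacobian the result has shape (n, n - 6).
    """
    # full_matrices=True yields the complete n x n matrix V^T.
    _, sigma, Vt = np.linalg.svd(J, full_matrices=True)
    rank = int(np.sum(sigma > tol * sigma.max()))
    # Rows of V^T beyond the rank are multiplied by zero singular values,
    # so their transposes span the null space of J.
    return Vt[rank:].T

# Stand-in Jacobian for a loop with n = 10 dihedral angles (p = 5).
J = np.random.default_rng(0).standard_normal((6, 10))
N = tangent_basis(J)
print(N.shape)
```

Any velocity of the form N c, for a coefficient vector c of length n − 6, leaves the pose of the C-terminal frame unchanged to first order.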

2.3.3 Selection of a direction in the tangent space

The deformation sampling algorithm may select a direction in $TQ_{\mathrm{closed}}(q)$ at random. However, in most cases, it is preferable to minimize an objective function $E(q)$. Let $y = -\nabla E(q)$ be the negated gradient of E at q and $y_N = N N^T y$ the projection of y into $TQ_{\mathrm{closed}}(q)$. The deformation sampling algorithm selects the increment $\delta q$ along $y_N$. In this way, all the DOFs left available in $L_B$ by the closure constraints are used to move the conformation in the direction that most reduces E.

$E(q)$ may be a function of the distances between the closest pairs of atoms at conformation q (where each pair consists of one atom in $L_B$ and one atom in either $P \setminus L$ or $L_B$). These pairs can be efficiently computed by the same grid method that is used to detect collisions (Section 2.4). Minimizing E then leads deformation sampling to increase the distances between these pairs of atoms, if this goal does not conflict with the closure constraint. In this way, deformation sampling picks increments $\delta q$ that have a small risk of causing collisions.
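The projection of the negated gradient into the tangent space can be sketched as follows. The function name `projected_step` and the normalization to a fixed step length are assumptions of this sketch; the text only requires the increment to be a short vector along the projected direction. The matrix N is assumed to have orthonormal columns spanning the tangent space.

```python
import numpy as np

def projected_step(grad_E, N, step=0.01):
    """Return an increment delta_q of length `step` along N N^T (-grad E)."""
    y = -np.asarray(grad_E, dtype=float)
    y_N = N @ (N.T @ y)            # projection onto the column space of N
    norm = np.linalg.norm(y_N)
    if norm == 0.0:
        # The gradient is orthogonal to the tangent space: no admissible
        # descent direction exists at this conformation.
        return np.zeros_like(y)
    return step * y_N / norm

# Toy example: 4 DOFs, tangent space spanned by the first two axes.
N = np.eye(4)[:, :2]
delta_q = projected_step([1.0, 2.0, 3.0, 4.0], N)
print(delta_q)
```

Because the increment has a negative inner product with the gradient, a sufficiently small step decreases E to first order while staying in the directions that preserve closure.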

Another interesting objective function leads to moving a designated atom A in $L_B$ toward a desired position $x_d$. This objective function can be defined as:

$$E(q) = \| x_A(q) - x_d \|^2 \qquad (2.1)$$

where $x_A(q)$ is the position of A when the conformation of $L_B$ is q. This function can be used to iteratively move an atom as far as possible along selected directions to explore the boundary of $Q^{\mathrm{free}}_{\mathrm{closed}}$. E can also be an energy function or any weighted combination of functions, each designed to achieve a distinct purpose.
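For this objective, the direction used by deformation sampling follows from the chain rule. The symbol $J_A(q)$ below is notation introduced here for illustration and does not appear in the original text: it denotes the $3 \times n$ matrix of partial derivatives of the position $x_A$ with respect to the dihedral angles.

\[
\nabla E(q) = 2\, J_A(q)^T \bigl( x_A(q) - x_d \bigr),
\qquad
y = -\nabla E(q),
\qquad
y_N = N N^T y .
\]

The increment $\delta q$ taken along $y_N$ therefore moves A toward $x_d$ as directly as the closure constraint allows.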

2.3.4 Placing side-chains

For each new conformation of LB, side-chains can be placed using SCWRL3, as described in Section 2.2.3. Another possibility is to provide an initial seed conformation that already contains the loop’s side-chains to the deformation sampling algorithm. These side-chains are then considered rigid and the algorithm deforms LB so that the produced conformation remains collision-free.

2.4 Collision Detection

Collision detection is done using the grid method [46]. This method takes advantage of the fact that, to avoid collisions, atoms must spread out, so that any cubic box of a fixed volume contains an upper-bounded number of atom centers, independent of the total number of atoms in the protein.

The method tessellates the three-dimensional space of the protein into an array of equally sized cubes. The edge length of a cube is chosen approximately equal to the largest diameter of the atoms. For a given conformation of the protein, each atom is indexed in the cube that contains its center. Whenever the position of an atom is modified, the grid structure is updated accordingly in constant time. The grid is implemented as a memory-efficient hash table. Only the grid cubes that contain atom centers are represented, each with the corresponding list of atoms.

The collision detection algorithm iterates through all atoms that need to be checked (e.g., the atoms in LB), asking for each atom if it is in collision. The atom only needs to be checked against the atoms indexed in its own grid cube and the 26 cubes surrounding it. Since the cubes of the grid are small, the number of atom centers in each cube is upper-bounded by a small constant. The number of pairs of atoms to check is thus upper-bounded by a constant. So, collision detection for a single atom runs in O(1) time, and the collision test for all O(n) atoms in LB or L runs in O(n) time, independent of the total number of atoms in the protein.

The same algorithm can be used to find the k closest atoms to a given atom (for a small value of k), simply by considering another layer of grid cubes. This ability allows us to efficiently compute objective functions E, like the one in Equation (2.1) that contains terms aimed at preventing deformation sampling from producing conformations with collisions (Section 2.3.3).
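The grid method can be sketched with a dictionary used as a sparse hash table. This is an illustration under stated assumptions, not the implementation of [46]: the class name `AtomGrid`, the atom identifiers, and the example radii are invented for the example, atom removal and updates are omitted, and covalently bonded pairs are not excluded from the test.

```python
import math
from collections import defaultdict

class AtomGrid:
    """Sparse uniform grid indexing atoms by the cube containing their center."""

    def __init__(self, cell_size):
        self.cell_size = cell_size       # about the largest atom diameter
        self.cells = defaultdict(list)   # only occupied cubes are stored
        self.atoms = {}                  # atom id -> (center, radius)

    def _cell(self, center):
        return tuple(math.floor(c / self.cell_size) for c in center)

    def add(self, atom_id, center, radius):
        self.atoms[atom_id] = (center, radius)
        self.cells[self._cell(center)].append(atom_id)

    def neighbors(self, atom_id):
        """Yield atoms in the atom's own cube and the 26 surrounding cubes."""
        cx, cy, cz = self._cell(self.atoms[atom_id][0])
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    key = (cx + dx, cy + dy, cz + dz)
                    for other in self.cells.get(key, ()):
                        if other != atom_id:
                            yield other

    def in_collision(self, atom_id, eps=0.75):
        center, radius = self.atoms[atom_id]
        for other in self.neighbors(atom_id):
            other_center, other_radius = self.atoms[other]
            if math.dist(center, other_center) < eps * (radius + other_radius):
                return True
        return False

grid = AtomGrid(cell_size=4.0)
grid.add("a", (0.0, 0.0, 0.0), 1.7)
grid.add("b", (2.0, 0.0, 0.0), 1.7)
grid.add("c", (10.0, 10.0, 10.0), 1.7)
print(grid.in_collision("a"), grid.in_collision("c"))
```

Checking only 27 cubes is sufficient because two colliding atoms have centers closer than the largest atom diameter, which is the edge length of a cube.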

2.5 Experiments

2.5.1 Seed sampling

Table 2.1 lists 20 loops, whose sizes range from 5 to 25 residues, which have been used to perform computational tests. Each row lists the id of the protein in the Protein Data Bank (PDB) [6], the number of residues in the protein, the number identifying the first residue in the loop, the number of residues in the loop, and the average time to sample one closed collision-free conformation of the loop using two distinct algorithms (the seed sampling algorithm and the “naive” algorithm outlined in Section 2.2). In some loops the two termini are close, while in others they are quite distant. Some loops protrude from the proteins and have much empty space in which they can deform without collision (e.g., 3SEB), while others are very constrained by the other protein residues (e.g., 1TIB). The loop in 1MPP is constrained in the middle by side-chains protruding from the rest of the protein (see Figure 2.2(b)). In the results presented below, all φ and ψ angles are picked uniformly at random (i.e., no biased distributions, like the Ramachandran ones, are used).

Figure 2.1: Some backbone conformations generated by seed sampling for the loops in (a) 1TIB (8-residue loop), (b) 3SEB (10-residue loop), (c) 8DFR (13-residue loop), and (d) 1THW (14-residue loop).

Each picture in Figure 2.1 displays a subset of backbone conformations generated by seed sampling for the loops in 1TIB, 3SEB, 8DFR, and 1THW. The loop in 1TIB, which resides in the middle of the protein, has very little empty space to move in. The PDB conformation of the loop in 1THW (shown in green in the picture) bends to the right, but the seed sampling algorithm also finds collision-free conformations that are very different.

Protein           Loop                Sampling time (s)
Id      Size      Start      Size     Seed      Naive
1XNB    185       SER 31     5        0.22      0.21
1TYS    264       THR 103    5        0.06      0.06
1GPR    158       SER 74     6        0.38      0.38
1K8U    89        GLU 23     7        0.21      0.20
2DRI    271       GLN 130    7        0.42      0.46
1TIB    269       GLY 172    8        2.49      13.03
1PRN    289       ASN 215    8        0.33      0.66
1MPP    325       ILE 214    9        0.53      99.85
4ENL    436       LEU 136    9        1.46      19.35
135L    129       ASN 65     9        0.77      1.54
3SEB    238       HIS 121    10       0.50      3.80
1NLS    237       ASN 216    11       1.30      5.51
1ONC    103       MET 23     11       2.26      5.66
1COA    64        VAL 53     12       19.02     67.49
1TFE    142       GLU 158    12       0.48      8.14
8DFR    186       SER 59     13       2.02      39.36
1THW    207       CYS 177    14       1.48      9.84
1BYI    224       GLU 115    16       2.52      >800
1G5A    628       GLY 433    17       3.28      >800
1HML    123       GLY 51     25       17.74     >800

Table 2.1: Test set of 20 loops. Each row lists the PDB id of the protein, the number of residues in the protein, the starting residue of the loop, the number of residues in the loop, and the average sampling time (in seconds) for one closed, collision-free conformation using the seed sampling and naive sampling algorithms.
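As a reading aid for Table 2.1, the following Python sketch computes the ratio of naive to seed sampling time for a handful of rows. The dictionary simply transcribes values from the table; the names `times` and `speedup` are illustrative and not part of the original software.

```python
# Average sampling times (seconds per conformation) transcribed from Table 2.1.
# Rows where the naive algorithm timed out (">800") are left out.
times = {
    "1XNB": (0.22, 0.21),
    "1TIB": (2.49, 13.03),
    "1MPP": (0.53, 99.85),
    "1COA": (19.02, 67.49),
    "8DFR": (2.02, 39.36),
}


def speedup(seed_time, naive_time):
    """Ratio of naive to seed sampling time (values above 1 favor seed sampling)."""
    return naive_time / seed_time


speedups = {pdb_id: speedup(seed, naive) for pdb_id, (seed, naive) in times.items()}

# Print the loops from smallest to largest speedup.
for pdb_id, ratio in sorted(speedups.items(), key=lambda item: item[1]):
    print(f"{pdb_id}: {ratio:.1f}x")
```

Ratios close to 1 correspond to short loops, where both algorithms behave essentially the same way.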

Each picture in Figure 2.2 shows the distributions of the middle Cα atom in 100 sampled conformations of the loops in proteins 1K8U, 1MPP, 1COA, and 1G5A along with a few backbone conformations. The loops in 1K8U and 1COA have relatively large empty space to move in, whereas the loops in 1MPP and 1G5A are restricted by the surrounding protein residues. These figures illustrate the ability of the seed sampling algorithm to generate conformations broadly distributed across the closed collision-free loop conformation space.

The average running time (in seconds) of the seed sampling algorithm to compute one closed collision-free conformation of each loop is shown in column 5 of Table 2.1. Each average time is obtained by running the algorithm until it has generated 100 conformations of the given loop and dividing the total running time by 100.² The last column of Table 2.1 gives the average running time of the “naive” algorithm that first samples closed conformations of the loop backbone and then rejects those that are not collision-free. In both algorithms, the factor ε used to define collisions (see chapter preamble) is set to 0.75. The seed sampling algorithm does not break a loop into 3 segments if it has fewer than 8 residues, so the running times of both algorithms for the first 5 proteins are essentially the same. For all other proteins, the seed sampling algorithm is faster than the naive algorithm, sometimes by a large factor (188 times faster for the highly constrained loop in 1MPP). For the last three proteins, the naive algorithm fails to sample 100 conformations after running for more than 80,000 seconds.

Not surprisingly, the running times vary significantly across loops. Short loops with much empty space around them take a few tenths of a second to sample, while long loops with little empty space can take a few seconds to sample. The loops in 1COA and 1HML take significantly more time to sample than the others. In the case of 1COA, it is difficult to connect the loop’s front-end and back-end (3 residues each) with its mid-portion (6 residues). As Figure 2.5 shows, the termini of the loop are far apart and the protein constrains the loop all along. Due to the local shape of the protein at the two termini of the loop, many sampled front-ends and back-ends tend to point in opposite directions, which often makes it impossible to close the mid-portion without collisions. In this case, a better average running time (4 seconds, instead of 19) is obtained by setting the length of the mid-portion to 8 (instead of 6). The loop in 1HML is inherently difficult to sample. Not only is it long, but there is also little empty space available for it. See Figure 2.3, where the red conformation of the loop is obtained from the PDB and the other three conformations are sampled by deformation sampling. Other experiments not reported here indicate that the running times reported in Table 2.1 vary moderately when parameters like the factor ε and the number of residues in the loop’s mid-portion M are slightly modified.

² The algorithms are written in C++ and run under Linux. Running times were obtained on a 3 GHz Intel Pentium processor with 1 GB of RAM.

(a) 1K8U 7-residue loop (b) 1MPP 9-residue loop

(c) 1COA 12-residue loop (d) 1G5A 17-residue loop

Figure 2.2: Positions of the middle Cα atom (red dots) in 100 loop conformations computed by seed sampling for four proteins: 1K8U, 1MPP, 1COA, and 1G5A.

For rather long loops, any seed sampling algorithm that samples $Q^{\text{free}}_{\text{closed}}$ broadly can only produce a coarse distribution of samples. Indeed, for a loop with n dihedral angles, a set of N evenly distributed conformations defines a grid with $N^{1/(n-6)}$ discretized values for each of the n−6 dimensions of $Q^{\text{free}}_{\text{closed}}$. If n = 18 (9-residue loop), a grid with 3 discretized values per axis requires sampling $3^{12}$ = 531,441 conformations. Deformation sampling makes it possible to sample more densely “interesting” regions of $Q^{\text{free}}_{\text{closed}}$.
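The grid argument above is plain arithmetic, and a short Python sketch makes the growth explicit. The function names are illustrative; the only assumption beyond the text is that the closure constraint always removes exactly 6 degrees of freedom.

```python
def samples_for_grid(per_axis, n_dihedral_angles, n_constraints=6):
    """Number of samples needed for a regular grid over the closed-loop space.

    Closure removes n_constraints degrees of freedom, which leaves
    n_dihedral_angles - n_constraints dimensions to discretize.
    """
    free_dims = n_dihedral_angles - n_constraints
    if free_dims <= 0:
        raise ValueError("no free dimensions left after closure")
    return per_axis ** free_dims


def grid_resolution(n_samples, n_dihedral_angles, n_constraints=6):
    """Values per axis, N^(1/(n-6)), reached by n_samples evenly spread samples."""
    return n_samples ** (1.0 / (n_dihedral_angles - n_constraints))


# 9-residue loop (18 dihedral angles), 3 values per axis.
print(samples_for_grid(3, 18))
# Resolution per axis reached with only 100 samples of the same loop.
print(grid_resolution(100, 18))
```

With 100 samples of a 9-residue loop, the resolution stays below 2 values per axis, which is why broad sampling alone is necessarily coarse.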

Figure 2.3: Conformations of the loop in 1HML.

2.5.2 Deformation sampling

Figure 2.4 shows 20 conformations of the loop in 1MPP generated by deformation sampling around a conformation computed by seed sampling. To produce each conformation, the deformation sampling algorithm starts from the same seed conformation and selects a short vector δq in $T Q_{\text{closed}}(q)$ at random. This figure illustrates the ability of deformation sampling to explore $Q^{\text{free}}_{\text{closed}}$ around a given conformation.

Figure 2.4: Twenty conformations of the loop in 1MPP generated by deforming a given seed conformation along randomly picked directions.
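The idea of picking a short random direction that preserves loop closure to first order can be sketched in Python as follows. This is only an illustration under stated assumptions: the matrix `J` stands for the Jacobian of the closure constraints with respect to the dihedral angles (its computation is not shown), and the projection uses Gram-Schmidt orthogonalization rather than whatever null-space routine the original C++ code uses. All function names are hypothetical.

```python
import math
import random


def orthonormal_rows(J, tol=1e-9):
    """Gram-Schmidt: orthonormal basis of the row space of J (a list of rows)."""
    basis = []
    for row in J:
        v = list(row)
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > tol:  # skip rows that are linearly dependent on earlier ones
            basis.append([x / norm for x in v])
    return basis


def random_null_space_step(J, step=0.01, rng=random):
    """Random vector dq of length `step` such that J dq = 0."""
    n = len(J[0])
    basis = orthonormal_rows(J)
    if len(basis) >= n:
        raise ValueError("the null space of J is trivial")
    while True:
        dq = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove every component that would change the constrained quantities.
        for b in basis:
            dot = sum(x * y for x, y in zip(dq, b))
            dq = [x - dot * y for x, y in zip(dq, b)]
        norm = math.sqrt(sum(x * x for x in dq))
        if norm > 1e-9:
            return [step * x / norm for x in dq]
```

A step produced this way keeps the loop closed only to first order, so a real implementation would still need to check closure and collisions after applying it.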

Figure 2.5 shows a series of closed collision-free conformations of the loop in 1COA successively sampled by pulling the N atom (shown as a white dot) of THR 58 away from its initial position along a given direction until a collision occurs (white circle). Note that the collision does not involve the pulled N atom. It occurs elsewhere, due to the deformation of the loop. The initial conformation shown in red is generated by seed sampling and the side-chains are placed without collisions using SCWRL3. Each other conformation is sampled by deformation sampling starting at the previously sampled conformation and using the objective function E defined by Equation (2.1) in Section 2.3.3. Only the backbone is deformed, and each side-chain remains rigid. Collisions are tested for all atoms in the loop.

Figure 2.5: Deformation of the loop in 1COA by pulling the N atom (white dot) of THR 58 along a specified direction.

Figure 2.6 shows (in green) an approximation of the volume reachable by the 5th Cα atom in the loop of 1MPP. This approximation is obtained by sampling 20 seed conformations of the loop and, for each of these conformations, pulling the 5th Cα atom along several randomly picked directions until a collision occurs somewhere in the loop. The volume shown in green is obtained by rendering the atom at all the positions it reached.

The running time of deformation sampling depends on the objective function. In the above experiments, it is less than 0.5 seconds per sample on average.

Figure 2.6: Volume reachable by the 5th Cα atom in the loop of 1MPP.

2.5.3 Placements of side-chains

In the software implementation, SCWRL3 [15] is used to place side-chains. The result, however, is not guaranteed to be collision-free. To generate Table 2.2, the seed sampling algorithm is first used to sample conformations of the backbones of the loops in 1K8U, 2DRI, 1TIB, 1MPP, and 135L, with the uniform and Ramachandran sampling distributions for the dihedral angles (see Sections 2.2.1 and 2.2.2). For each loop, 50 conformations with the uniform distribution and 50 with the Ramachandran distribution are sampled. SCWRL3 is then used to place side-chains in the loop (with the side-chains in the rest of the protein fixed) and each conformation is checked for collisions. Table 2.2 reports the number of collision-free conformations (out of 50) for each loop. As expected, the backbone conformations generated using the Ramachandran distribution facilitate the collision-free placement of the side-chains.

Protein               1K8U   2DRI   1TIB   1MPP   135L
Uniform                  7      9      1      0      9
Ramachandran plots      18     14      6      4     13

Table 2.2: Number of collision-free placements of side chains for five loops.

When seed sampling generates a conformation q of a loop backbone, such that SCWRL3 computes a side-chain placement that is not collision-free, deformation sampling can then be used to sample more conformations around q, to produce one where side-chains are placed without collisions. In Figure 2.7(a) a conformation (shown in blue) of the backbone of the loop in 1MPP is generated using seed sampling and the side-chains are placed by SCWRL3. However, there are collisions between two side-chains. In (b) a conformation (shown in yellow) is generated by the deformation sampling algorithm using the conformation shown in (a) as the start conformation. The new placement of the side-chains computed by SCWRL3 is free of collisions. Once such a collision-free conformation has been obtained, many other collision-free conformations can be quickly generated around it, again using deformation sampling, as shown in Figure 2.4.

(a) (b)

Figure 2.7: Use of deformation sampling to remove collisions involving side-chains.

2.5.4 Calcium-binding site prediction

Calcium-binding proteins play a key role in signal transduction. Many such proteins share the same functional domain, a helix-loop-helix structural motif called EF-hand [60]; the calcium ion binds at the loop region in this motif. As a loop is often flexible, its conformation with calcium bound (called the holo state) and its conformation without calcium (the apo state) can be significantly different [2]. Many functional site prediction methods, for example FEATURE [104], are based on structural properties of the binding site. However, if the conformation of the functional site changes upon calcium binding, these methods may not be able to recognize the binding site in the apo state due to the absence of the binding structural properties.

One way to overcome this problem is to sample many closed collision-free conformations of the loop and run the functional site prediction method on each of them. If a sampled conformation is recognized by FEATURE, not only does this indicate that the loop may be a calcium-binding site, it also tells us what the holo conformation may look like. In fact, molecular dynamics simulation has already been used successfully to generate conformations starting with apo proteins in order to identify unrecognized calcium-binding sites in them [41].

For example, Parvalbumin [16] is a calcium-binding protein, where the loop ALA51-ILE58 is a binding site that flips up upon calcium binding. The PDB ids for its apo and holo structures are 1B8C and 1B9A, respectively. In Figure 2.8, these conformations are shown in blue and green, respectively; the black dot is the center of the calcium ion in the holo PDB file. Successive conformations of this loop are sampled using the seed sampling algorithm and FEATURE is run on each of them, until FEATURE recognizes a loop conformation as a calcium-binding site. The recognized conformation, shown in red in Figure 2.8, is close to the holo structure 1B9A. The red dot represents the position of the calcium ion predicted by FEATURE in this recognized conformation. Similarly, the two green dots represent positions of the calcium ion predicted by FEATURE for the green holo conformation. Note that all these dots are very close to the calcium position recorded in the PDB. Correctly, FEATURE does not recognize the apo conformation shown in blue as a binding conformation; hence, there is no blue dot in the figure. The neighboring conformations of the seed conformation are then explored, trying to get conformations even closer to the PDB holo state. The seed conformation is deformed by deformation sampling until FEATURE returns a higher score than what was obtained with the seed. The final conformation only slightly improves the backbone RMSD to the holo conformation.

Figure 2.8: Parvalbumin loop ALA51-ILE58: The apo and holo conformations recorded in the PDB are shown in blue and green, respectively. The loop conformation in red is the conformation generated by seed sampling and recognized by FEATURE as a calcium-binding site. The black dot is the position of the calcium ion recorded in the PDB. The green and red dots are the calcium positions predicted by FEATURE for the loop conformations of the same color.

Deformation sampling can also be used to enhance the performance of FEATURE. To recognize a binding site, FEATURE counts atoms contained in concentric spherical shells. Therefore, it is somewhat sensitive to the values of the radii of the shells, as well as to the position of the center of the shells. This may cause FEATURE to fail to correctly recognize a functional state. For example, in the protein grancalcin, the loop ALA62-ASP69 is a calcium-binding site [55]. The holo structure has PDB code 1K94. It is shown in green in Figure 2.9, where the black dot is the position of the calcium ion recorded in the PDB. Surprisingly, FEATURE fails to recognize this structure as a binding site. So, deformation sampling is used to generate conformations around the holo structure 1K94, and FEATURE is run on each of them until it identifies one as a calcium-binding site. The resulting loop conformation is shown in red in Figure 2.9, where the red dot is the predicted calcium position. The main difference between the holo structure 1K94 and the conformation generated by deformation sampling is the location of ASP65, one of the four coordinating residues. Atoms from ASP65 are located slightly closer to the calcium binding site in the conformation obtained by deformation sampling. These small displacements are sufficient to change the atom counts in the spherical shells considered by FEATURE, thereby affecting the score of the entire site.

Figure 2.9: Grancalcin loop ALA62-ASP69. The holo conformation in the PDB file is shown in green. The conformation in red is generated using deformation sampling. FEATURE correctly recognizes the red conformation as a calcium-binding site, but fails to do so on the green conformation (see Section 2.5.4).
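The sensitivity of shell-based atom counts to small displacements can be illustrated with a toy Python sketch. It is not FEATURE's implementation or scoring function; it only counts points in concentric shells, and the function name, radii, and coordinates are made up for illustration.

```python
import math


def shell_counts(center, atoms, radii):
    """Count atoms in concentric spherical shells around `center`.

    `radii` is an increasing list of outer radii. Shell 0 covers distances
    below radii[0]; shell k covers distances from radii[k-1] up to radii[k].
    Atoms beyond the last radius are ignored.
    """
    counts = [0] * len(radii)
    for atom in atoms:
        d = math.dist(center, atom)
        for k, outer in enumerate(radii):
            if d < outer:
                counts[k] += 1
                break
    return counts


center = (0.0, 0.0, 0.0)
radii = [2.0, 4.0]
before = [(1.9, 0.0, 0.0), (2.1, 0.0, 0.0), (3.5, 0.0, 0.0)]
# Same atoms, with the second one moved 0.15 units closer to the center.
after = [(1.9, 0.0, 0.0), (1.95, 0.0, 0.0), (3.5, 0.0, 0.0)]

print(shell_counts(center, before, radii))
print(shell_counts(center, after, radii))
```

A displacement much smaller than the shell width moves one atom across a shell boundary and changes two counts at once, which is the kind of effect described for ASP65 above.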

2.6 Conclusion

This chapter presents two distinct algorithms to sample the space of closed collision-free conformations of a flexible loop. The seed sampling algorithm produces broadly distributed conformations. It is based on a novel prioritized constraint-satisfaction approach that interweaves the treatment of the collision-avoidance and closure constraints. The deformation sampling algorithm uses seed conformations as starting points to explore more finely certain regions of the space. It is based on the computation of the null space of the loop backbone at its current conformation. The sampling algorithms are implemented in a software toolkit called loopTK that is available at http://simtk.org/home/looptk.

Computational tests show that the algorithms can efficiently handle loops ranging from 5 to 25 residues in length. Additional tests demonstrate their ability to generate biologically interesting loop conformations, such as calcium-binding conformations. This critical ability could be used in the future to predict loop conformations and improve other structure prediction techniques, like homology, when functional information is known in advance. The algorithms have also been used successfully to model heterogeneity in EDMs obtained from X-ray crystallography experiments (Chapter 3).

The algorithms developed in this chapter demonstrate the power of kino-geometric sampling for solving a difficult problem in biology, that of exploring motion spaces of protein loops which contain closed and collision-free conformations of the loops.

Chapter 3

Modeling Structural Heterogeneity From X-ray Data∗

X-ray crystallography is the leading experimental technique for protein structure determination. Around 88% of the protein structures currently in the Protein Data Bank (PDB) have been obtained using this technique. A crystallography experiment starts with the crystallization of the protein whose structure is to be determined. A protein crystal contains regularly arranged protein molecules and is bombarded with X-rays. This results in a diffraction pattern due to interference of X-ray waves reflecting from multiple diffraction planes in the crystal. This pattern is recorded and mathematically transformed into an electron density map (EDM). An EDM is a 3-D array of voxels where each voxel encodes an electron density value. The dense areas in the EDM correspond to atom locations and an atomic structure is modeled such that it best explains the EDM. The full technique involves several iterations, mainly because the phase of the reflected waves is lost in the recorded diffraction pattern and has to be retrieved through a trial-and-error alternation of phase estimation and model refinement.

Most of the existing modeling methods in X-ray crystallography explain the EDM by a single protein conformation. The EDM, however, does not correspond to exactly one conformation of the protein. The protein crystal contains non-identical conformations of the protein owing to heterogeneity in the folded state of the protein, and this results in a blurred EDM. The single conformation that is used to explain the EDM by existing modeling methods is thus an average protein conformation. The uncertainty in the data is modeled with a Gaussian distribution of the positions for each atom around its average position. However, it has been suggested that the native state of a protein should be regarded as an ensemble of conformations [78, 79]. The presence of distinct side-chain and main-chain conformations in a crystal has been observed on many occasions [5, 24, 106] and the importance of accurately modeling structural heterogeneity has long been recognized [7, 29, 40, 89, 106], as even subtle conformational changes may have important functional consequences [57].

*This work was done in collaboration with the Joint Center for Structural Genomics (JCSG). The computational tools developed in this chapter have been previously presented in [4].

This chapter develops a new algorithm (called SAMPLE-SELECT) to model a blurred portion of an EDM by a collection of conformations. The idea is to sample many candidate conformations and to use an optimization algorithm to select the subset of sampled conformations that best explains the EDM. The SAMPLE-SELECT algorithm uses different sampling strategies for modeling heterogeneity that is driven mainly by side-chain deformation and heterogeneity that is driven mainly by main-chain deformation. In the former, the side-chain deformation results in alternate side-chain conformations in the copies of the protein molecule in the crystal. The alternate side-chain conformations often involve small anharmonic main-chain deviations to support them [24]. Main-chain driven heterogeneity mainly involves larger deformation of the flexible loops that results in alternate conformations of the loop’s main-chain. Some deviations of side-chain conformations may also be involved. Our algorithm performs very well for modeling side-chain driven heterogeneity, but it has some limitations for modeling main-chain driven heterogeneity.

3.1 Related Work

While several software programs are available for automatically building an accurate structural model into an EDM [33, 34, 52, 84, 97, 98], these methods aim at generating a single-conformation model, where uncertainty is modeled with an isotropic Gaussian distribution of the position for each atom. This distribution, which is parameterized by the temperature factor¹, accounts for small vibrations about each atom’s equilibrium position. At high resolution, when experimental data is abundant, fitting an anisotropic (trivariate) Gaussian function becomes possible [105]. A sparser anisotropic, but still harmonic, parameterization involves partitioning the protein into rigid bodies undergoing independent displacements [92]. Owing to their equilibrium-displacement nature, these models (henceforth called single-conformer models) are unable to accurately describe distinct conformational substates, such as those caused by the low-frequency, high-amplitude motion of the protein [51, 67, 106].

It has been suggested that an ensemble (collection) of independent models would be a more suitable representation of a crystal structure than a single model [40]. Distinct single-conformer models can be equally plausible interpretations of the diffraction data, as described in [29], specifically in cases where the distinct conformations in the protein crystal have similar occupancies. Several modeling methods have been developed that try to explain the EDM with an ensemble of conformations. In ensemble refinement methods [13, 66, 72], a fixed-size set of identical conformations is first computed and all torsion angles are then simultaneously optimized. The resulting ensemble of distinct conformations better explains the EDM. However, the number of conformations in the computed ensemble is fixed. Although this is a convenient approximation, ensemble-based modeling methods should try to determine the number of distinct conformations in the protein crystal automatically. Some other ensemble-based modeling methods described in [5, 99] use multiple runs of a single-conformer based modeling method. Each run starts with a randomly perturbed initial model. Multiple runs can thus result in distinct conformations. Such methods also assume a predefined number of conformations in the ensemble, e.g., 20 random initial models were used in [99]. Since these methods rely eventually on single-conformer based modeling, they are more likely to identify the conformation with the highest occupancy in the crystal.

There is a lack of algorithms and software to automatically and consistently model heterogeneity as correlated, coordinated motion. This chapter develops a new SAMPLE-SELECT algorithm to automatically identify and model heterogeneity in X-ray diffraction data using an occupancy-weighted ensemble of conformations that collectively best represents the input EDM. The occupancy of a conformation is defined as the number of copies of the conformation divided by the total number of copies of the protein in the crystal. The algorithm makes no assumption about the number of conformations in the ensemble and derives it automatically. It should be emphasized that each conformation in the ensemble obtained by our algorithm explains part of the EDM depending on its occupancy value. Thus, in contrast to methods that model heterogeneity with independent single-conformer models, our algorithm aims at identifying an ensemble of conformations that collectively models heterogeneity.

¹ The temperature factor, or B factor, is given by $B_j = 8\pi^2 U_j^2$, where $U_j$ is the mean displacement of atom j.
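Two scalar quantities recur throughout this chapter: the temperature factor defined in footnote 1 and the occupancy of a conformation. The Python sketch below restates both definitions as small helper functions. The function names are illustrative and the numbers in the usage lines are arbitrary.

```python
import math


def b_factor(displacement):
    """Isotropic B factor from the displacement U: B = 8 * pi^2 * U^2."""
    return 8.0 * math.pi ** 2 * displacement ** 2


def displacement_from_b(b):
    """Invert the relation above: U = sqrt(B / (8 * pi^2))."""
    return math.sqrt(b / (8.0 * math.pi ** 2))


def occupancy(copies_of_conformation, total_copies):
    """Fraction of the protein copies in the crystal that adopt a conformation."""
    if total_copies <= 0:
        raise ValueError("total_copies must be positive")
    return copies_of_conformation / total_copies


# Arbitrary illustrative values: B in square angstroms, copies as plain counts.
print(displacement_from_b(20.0))
print(occupancy(3, 10))
```

Because B grows with the square of the displacement, doubling the displacement multiplies the temperature factor by four.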

3.2 SAMPLE-SELECT Algorithm

The goal of the SAMPLE-SELECT algorithm is to compute an ensemble of conformations, the occupancy of each conformation, and the temperature factor of each atom in every conformation, which collectively represent the data in an input EDM E of a protein. The basic idea is to first sample a very large set of conformations and then select the best ensemble of conformations from this set. More precisely, the algorithm alternates two steps, SAMPLE and SELECT:

1. SAMPLE samples a large set Q of conformations that is highly likely to contain a subset S representing E well.

2. SELECT simultaneously identifies this subset S and computes the occupancy of each conformation in S.

The space sampled by SAMPLE has high dimensionality, so the algorithm alternates the above two steps iteratively. Each run of SAMPLE uses the conformation subset selected at the previous iteration to sample a new set of candidate conformations, which in turn is submitted to SELECT. The core of the algorithm is an efficient optimization algorithm that is able to select pertinent ensembles from very large sets of sampled conformations. It should be emphasized that the algorithm infers the non-zero occupancies from the data; it has no prior knowledge about the number of conformations.
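The alternation between the two steps can be sketched as a short Python driver. This is a structural outline only: `sample` and `select` are placeholder callables for the steps described in this section, and the fixed iteration count is an assumption, since the text does not state a stopping criterion.

```python
def sample_select(initial_conformations, sample, select, n_iterations=3):
    """Alternate SAMPLE and SELECT for a fixed number of iterations.

    sample(conformations) -> list of candidate conformations built from them
    select(candidates)    -> list of (conformation, occupancy) pairs
    Both callables are hypothetical stand-ins for the steps in the text.
    """
    share = 1.0 / len(initial_conformations)
    subset = [(q, share) for q in initial_conformations]
    for _ in range(n_iterations):
        # SAMPLE uses the conformations retained by the previous SELECT.
        candidates = sample([q for q, _ in subset])
        # SELECT keeps a weighted subset of the new candidates.
        subset = select(candidates)
    return subset
```

Passing the two steps as functions keeps the driver independent of the kind of heterogeneity being modeled, which matches the way the sampling strategy changes between side-chain and main-chain cases while selection stays the same.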

3.2.1 Selection step

SELECT is handed a large set $Q = \{q_1, \dots, q_N\}$ of N conformations, together with a vector $t_i$ specifying the temperature factor of each atom in every conformation $q_i$. It identifies the subset S of conformations that collectively provides the best explanation for the input EDM E, over all possible subsets of Q.

Let G be the grid over which E is defined. Let $E_i$ be the simulated EDM (computed using phenix [1] or clipper [22] tools) that corresponds to the configuration $q_i$ with the temperature factors in $t_i$. Let $E(p)$ and $E_i(p)$ denote the values of E and $E_i$, respectively, at point $p \in G$. The value at p of the EDM that corresponds to $Q = \{q_1, \dots, q_N\}$ with occupancies $\alpha_1, \dots, \alpha_N$ is $\sum_i \alpha_i E_i(p)$. SELECT minimizes the L1 or L2 difference between E and this EDM. Since each $E_i(p)$ is constant, this amounts to solving the following linear/quadratic programming problem (LP/QP):

$$
\begin{aligned}
\text{Minimize} \quad & \sum_{p \in G} \Bigl| E(p) - \sum_i \alpha_i E_i(p) \Bigr|_{1\text{ or }2} \\
\text{such that} \quad & \alpha_i \ge 0 \quad \text{for } i = 1, \dots, N, \\
& \sum_i \alpha_i = 1.
\end{aligned}
$$

The solution is the vector of optimal values for $\alpha_i$, i = 1,...,N. SELECT retains only the conformations $q_i$ whose occupancies are greater than a given threshold (set to 0.1 in the implementation). It returns the set S of retained conformations with occupancies re-normalized to sum to 1. We use Coin-OR libraries [48] to solve the above LP/QP problem. The choice between LP and QP has no significant impact on the results.
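The selection step can be illustrated with a small self-contained Python sketch of the L2 variant. It deliberately swaps the Coin-OR solver for plain projected gradient descent onto the probability simplex, so it is a stand-in for the optimization, not the implementation used in this work. The function names, iteration count, and toy maps are assumptions; only the constraints and the 0.1 threshold come from the text.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum(a) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t  # the last index satisfying the test fixes the shift
    return [max(x - theta, 0.0) for x in v]


def select(E, maps, threshold=0.1, n_iter=500):
    """Least-squares occupancies, then thresholding and renormalization.

    E:    observed density values, one per grid point.
    maps: simulated density lists E_i, each the same length as E.
    Returns {index: occupancy} for the retained conformations.
    """
    n = len(maps)
    alpha = [1.0 / n] * n
    # 2 * (sum of squared entries) bounds the Lipschitz constant of the
    # gradient, so its inverse is a safe (conservative) step size.
    step = 1.0 / (2.0 * sum(x * x for m in maps for x in m))
    for _ in range(n_iter):
        residual = [sum(a * m[p] for a, m in zip(alpha, maps)) - E[p]
                    for p in range(len(E))]
        grad = [2.0 * sum(r * x for r, x in zip(residual, m)) for m in maps]
        alpha = project_to_simplex([a - step * g for a, g in zip(alpha, grad)])
    kept = {i: a for i, a in enumerate(alpha) if a > threshold}
    total = sum(kept.values())
    return {i: a / total for i, a in kept.items()}


# Toy data: the observed map is a 70/30 mixture of the first two maps.
maps = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
E = [0.7, 0.3, 0.0, 0.0]
print(select(E, maps))
```

In this toy case the third map receives an occupancy below the threshold and is dropped, which mirrors how the number of conformers is derived from the data and not fixed in advance.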

3.2.2 Conformation sampling

The goal of SAMPLE is to generate a set $Q = \{q_1, \dots, q_N\}$ of candidate conformations, together with temperature-factor vectors $t_1, \dots, t_N$, such that a subset S of Q (with suitable occupancies) provides an optimal explanation of the EDM E. Each SAMPLE step uses the outcome of the previous SELECT step and samples a distinct subspace of reasonably small dimensionality. The details of SAMPLE are described in the following sections. They depend on the kind of heterogeneity that must be modeled: side-chain driven or main-chain driven.

3.3 Side-chain Driven Heterogeneity

Here we assume that structural heterogeneity is mainly due to alternate side-chain conformations present in the crystal. Only small deviations of the main-chain are involved.

3.3.1 Sampling

For modeling side-chain driven heterogeneity, the side-chain at each individual residue is independently sampled and the best side-chain conformations are selected using SELECT. The algorithm starts from a main-chain conformation derived from an initial model of the EDM with the side-chains truncated at the Cβ atom.

The following sampling procedure is used for each residue. A thermal ellipsoid is obtained from anisotropic refinement of a residue. A thermal ellipsoid indicates the magnitude and direction of the high-frequency thermal vibrations in atoms. These vibrations are usually anisotropic, i.e., the vibrations have different magnitudes in different directions.

Six trial positions for the Cα atom are selected by sampling the principal axes of its thermal ellipsoid at a surface of constant probability. To position the Cα at the trial position, the φ and ψ dihedral angles of a 7-residue fragment centered around the trial atom are adjusted using an inverse kinematics algorithm [5, 76] to maintain ideal geometry and closure. For each trial Cα position, the side-chain is added at rotameric positions [77], and furthermore sampled in a small neighborhood around each χ angle. For smaller side-chains the χ angles are currently discretized at 10 degree intervals. Temperature factors of side-chain atoms are assigned based on the temperature factor of the Cα atom of the residue. The temperature factors increase with the number of bonds between the side-chain atom and the Cα atom, and are slightly randomized.
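One plausible reading of the six trial positions is two points per principal axis of the ellipsoid, one on each side of the mean position. The Python sketch below generates them under that assumption. The function name and the inputs (unit axis vectors and semi-axis lengths at the chosen probability level) are hypothetical and would come from anisotropic refinement in practice.

```python
def trial_positions(center, axes, semi_axes):
    """Six points at plus and minus each principal semi-axis of an ellipsoid.

    center:    (x, y, z) mean position of the atom.
    axes:      three unit vectors giving the principal directions.
    semi_axes: three lengths along those directions at the chosen
               constant-probability surface.
    """
    positions = []
    for direction, length in zip(axes, semi_axes):
        for sign in (1.0, -1.0):
            positions.append(tuple(c + sign * length * d
                                   for c, d in zip(center, direction)))
    return positions


# Axis-aligned ellipsoid with made-up semi-axis lengths (in angstroms).
points = trial_positions(
    (0.0, 0.0, 0.0),
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    (0.5, 0.3, 0.2),
)
for point in points:
    print(point)
```

Each of these positions would then be handed to the inverse kinematics step that adjusts the surrounding fragment, which is not shown here.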

SAMPLE generates on the order of 500 candidate conformations per residue. The best set of conformations that represents the electron density around a residue is obtained using SELECT as described in Section 3.2.1. The experiments reported below were carried out by Henry van den Bedem at JCSG.

3.3.2 Validation with simulated data

Here the SAMPLE-SELECT algorithm is used to model a simulated EDM with added Gaus- sian noise for validation. The number of true alternate conformations in the simulated EDM is known. In contrast, the number is subject to the interpretation of a crystallographer for experimental data. The crystal structure of a XisI-like protein (YP 324325.1) from An- abaena Variabilis, solved at 1.30A˚ resolution by the JCSG and deposited in the PDB with

PDB id 2NLV, is a dimer of 112 amino acids in length. Twenty-nine residues have an al- ternate conformer modeled, eight residues are truncated at various levels beyond the Cβ atom, the C termini are not modeled and two residues are missing at the N terminus of the B chain for a total of 212 residues. Crystallographic waters are removed thus obtain- ing a reference model. The resulting temperature factors are retained, averaging 14.36A˚ 2. Molprobity [25], a widely used protein structure quality validation tool, reported three ro- tamer outliers for dual-conformation residues GLN49B, ARG56B, LYS64B, and one for a single-conformation residue LEU97B. Identification of such outliers is impeded by the nature of the algorithm, which relies on sampling neighborhoods of rotamers. Simulated

EDM is calculated from the reference model and gaussian noise with a standard deviation of 10% of the magnitude of the calculated EDM data is added to simulate experimental er- rors. To obtain a single-conformer model, the crystal structure is rebuilt into the simulated

EDM with noise using phenix.autobuild [100] at resolutions ranging from 1.1A˚ to 2.4A.˚ The rebuilt model is then used as an initial model for our algorithm. The resulting structural model at each resolution is compared to the reference model. Figure 3.1 summarizes the results. A side-chain is considered modeled correctly if it is within 1A˚ RMSD of the corresponding side-chain in the reference model. Throughout,

RMSDs are calculated over all pertinent atoms starting at the Cβ atom. Final χ angles of side-chains with non-unique orientations (TYR, PHE, GLU, ASP, ARG) are adjusted to minimize RMSD, as well as those of side-chains that are nearly identical (GLN, ASN, CHAPTER 3. HETEROGENEITY FROM X-RAY DATA 45

Figure 3.1: Validation of the SAMPLE-SELECT algorithm. Fraction of the 29 side-chains in alternate conformations in the reference structure correctly identified and modeled by SAMPLE-SELECT algorithm (squares) and false positives (triangles) at resolution levels ranging from 1.1A˚ to 2.4A˚ are shown.

HIS). RMSD values are calculated between equivalent atoms. Truncated residues as well as the first three residues at N termini and the final three residues at C termini are excluded in RMSD calculations.

At very high resolution, the algorithm correctly identifies and models 86% of the twenty-nine residues with alternate conformations (Figure 3.1). The success rate remains high at about 70% as the resolution decreases. Of the rotamer outliers, GLN49 is correctly modeled only at the highest resolution. Fewer are correctly modeled toward lower resolutions, with all three modeled incorrectly at 2.1 Å, but ARG56B is correctly included again at 2.4 Å.

A less desirable side-effect of identifying low-occupancy conformations is that ambiguous electron density can be falsely interpreted as a structural feature. A residue in this work is considered a false positive if for any of its side-chain conformations the RMSD to its (closest) corresponding conformation in the reference model exceeds 1 Å. Thus, if a residue with alternate conformations in the reference model is modeled with one conformation within 1 Å and the other exceeding 1 Å, this residue is considered a false positive. The false positive rate, calculated for all 212 residues, approximately doubles from 5% at high resolution to just below 10% at a resolution of 2.4 Å (Figure 3.1). Residue LEU97B is not correctly modeled at any of the resolutions, and is included as a false positive accordingly.
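The two criteria used in this validation (correctly modeled, and false positive) can be made concrete with a minimal sketch. The function names and the coordinate-list representation are illustrative assumptions, not part of the original software, and the χ-angle adjustment for symmetric side-chains described above is omitted.

```python
import math

def rmsd(a, b):
    """RMSD between two equally long lists of (x, y, z) atom positions."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))

def is_false_positive(modeled, reference, cutoff=1.0):
    """True if any modeled conformation is farther than `cutoff` (in Å)
    from its closest conformation in the reference model."""
    return any(min(rmsd(m, r) for r in reference) > cutoff for m in modeled)

def is_correctly_modeled(modeled, reference, cutoff=1.0):
    """True if every reference conformation has a modeled conformation
    within `cutoff` (in Å). Assumed reading of the 1 Å criterion."""
    return all(min(rmsd(m, r) for m in modeled) <= cutoff for r in reference)
```

Under these definitions, a residue modeled with one conformation within 1 Å and a second one beyond 1 Å is flagged as a false positive, as in the text.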

The rate at which true alternate conformations are identified in EDMs is high and only mildly affected by resolution. When ambiguity in side-chain density increases toward lower resolution levels, which happens frequently at the protein surface, the algorithm introduces low-occupancy structural features at an incremental rate to account for electron density resulting in part from noise, analogous to the increase in variability towards lower resolution levels observed in [99].

3.3.3 Results with experimental data

The SAMPLE-SELECT algorithm is employed to identify and model structural heterogeneity in experimental data starting from single-conformer models obtained by rebuilding sixteen final, PDB-deposited models with phenix.autobuild. Non-protein atoms are excluded in the rebuilding process. The rebuilt models are used to calculate an EDM from the experimental data [87]. The EDM and rebuilt model resemble what would typically be a starting point for manual refinement, and serve as input to the algorithm. The rebuilt single-conformer model is the reference model. Refinement statistics are shown in Table 3.1 and summarized in Figure 3.2. Geometry statistics, expressed by RMSDs of bond lengths and bond angles, of the model obtained by SAMPLE-SELECT are comparable with those of the single-conformer model (see Table 3.1). The model obtained by the SAMPLE-SELECT algorithm will be referred to as the multi-conformer model. It is an occupancy-weighted ensemble of conformations that collectively best represents the input EDM. At resolution levels better than 2 Å, cross-validation residuals Rfree [11] improve for all multi-conformer models (see the top

Figure 3.2: Summary of the performance of the SAMPLE-SELECT algorithm on experimental data. Shown on top are Rfree values of the reference models (squares), the SAMPLE-SELECT models (circles), and an ensemble of four independent models (crosses) as a function of resolution. The bottom panel is similar to Figure 3.1, but with experimental data. The PDB ids of the sixteen test structures are listed in order of decreasing resolution along the horizontal axis. The fraction of side-chains with alternate conformations in each of the sixteen reference PDB structures correctly modeled by the algorithm is represented by diamonds. Additional conformations are represented by squares. The triangles represent the relative improvement in Rfree. A positive value indicates a drop in Rfree.

Table 3.1: Sixteen structural models are rebuilt and subjected to modeling with SAMPLE-SELECT. Columns represent the PDB id of the model, resolution (res, in Å), number of residues in the asymmetric unit (rsd), number of reflections, and the Rfree/R values and root mean square deviations of the bond lengths (rms bond, in Å) and the bond angles (rms angle, in °) of the single-conformer reference model and the multi-conformer model.

[Table 3.1 body: per-structure values for 3EO6, 3F40, 2NLV, 3D02, 3E8O, 3F14, 3CCG, 2NVH, 1A0J, 3EBY, 3EN8, 3ECF, 2Q7B, 3ELE, 3B7F, and 9ILB, at resolutions from 0.97 Å to 2.3 Å; the row layout is not recoverable from the extracted text.]

panel of Figure 3.2) compared to the single-conformer models except one (3EN8). For this resolution range, Rfree values obtained from the multi-conformer models are on average 0.5% lower than those of the single-conformer model. The improvement in Rfree suggests that heterogeneity is indeed well-represented by the multi-conformer model that accounts for correlated, anharmonic motion, and that such a model is representative of the set of structures present in the crystal. At resolutions worse than 2 Å, Rfree no longer improves. Representing ambiguity in EDMs at poorer resolution levels with correlated structural features does not appear to be a better approximation of the experimental data than a single model plus an isotropic B factor. Thus, at resolutions worse than 2 Å it appears that ambiguity in the EDMs is no longer dominated by correlated motion. Variability in our model reflects uncertainty, which is consistent with the observations in [99].
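For readers less familiar with the cross-validation residual discussed above, the following is a minimal sketch of the standard R-factor and its restriction to a held-out test set of reflections (Rfree). The function names are illustrative, and the sketch assumes the observed and calculated amplitudes are already on a common scale; refinement programs also fit a scale factor, which is omitted here.

```python
def r_factor(f_obs, f_calc):
    """Crystallographic residual: sum | |Fobs| - |Fcalc| | / sum |Fobs|."""
    num = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    den = sum(abs(o) for o in f_obs)
    return num / den

def r_free(f_obs, f_calc, is_test):
    """R-factor computed only over reflections flagged as test reflections,
    i.e. reflections that were excluded from refinement."""
    obs = [o for o, t in zip(f_obs, is_test) if t]
    calc = [c for c, t in zip(f_calc, is_test) if t]
    return r_factor(obs, calc)
```

A drop in `r_free` between two models of the same data indicates that the second model predicts the held-out reflections better, which is the sense in which the multi-conformer models improve on the single-conformer models.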

The fraction of residues in the PDB entries for which the full set of alternate conformations is modeled to within 1 Å RMSD is only slightly below that observed in simulated data, ranging from approximately 50% to 80% at high resolution (bottom panel of Figure 3.2). Additional conformations are proposed for 10%-30% of residues, increasing as the resolution falls. Note that this fraction is elevated for PDB entries in which no alternate conformations are deposited (1A0J and 9ILB). Extrapolating from the results on simulated data, at lower resolutions an increasing fraction of additional conformations is likely to account at least partially for spurious density.

3.3.4 Comparison to ensemble of independent conformations

Simultaneous refinement of an ensemble of multiple non-interacting near-copies of a structural model provides an alternative method for representing structural variability with few assumptions about the origin (uncertainty or correlated motion) of the disorder (see Section 3.1). Across a wide range of resolution levels, ensemble refinement has been shown to improve agreement with diffraction data over an isotropic single-conformer structural model [13, 62, 72, 86, 106].

Four independent, diverse models, each consistent with the experimental data, are obtained by rebuilding the PDB entries using the multiple models option in phenix.autobuild. The four models are combined into a single ensemble of non-interacting conformations by fixing the occupancy of each model at 0.25. Table 2 in [72] demonstrates that at resolution levels 1.9 Å and better, the best Rfree statistic is attained at ensemble sizes four or eight, with on average over 80% of the improvement of the value occurring at ensemble size two. The ensemble size in this study is fixed at four; fewer conformations in the ensemble would leave the potential for Rfree to reach a substantially lower minimum, whereas increasing the number of conformations carries the risk of modeling noise. Isotropic temperature factors of the rebuilt models are retained.
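To illustrate what fixing the occupancy of each model at 0.25 means for the calculated diffraction data, here is a minimal sketch. It assumes that the complex structure factors of each model are available as a list indexed by reflection; the function name and data layout are illustrative and are not taken from phenix.

```python
def ensemble_structure_factors(model_sfs, occupancies=None):
    """Occupancy-weighted sum of per-model complex structure factors.

    model_sfs: list of lists; model_sfs[k][h] is the complex structure
    factor of model k for reflection h. With no occupancies given, every
    model receives the same weight 1/len(model_sfs) (0.25 for four models).
    """
    n_models = len(model_sfs)
    if occupancies is None:
        occupancies = [1.0 / n_models] * n_models
    n_refl = len(model_sfs[0])
    return [
        sum(q * sfs[h] for q, sfs in zip(occupancies, model_sfs))
        for h in range(n_refl)
    ]
```

Because the models do not interact, their contributions simply add; the amplitudes of the summed structure factors are what would be compared with the observed data.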

The Rfree values for the ensemble of independent conformations consistently improve those of the reference model compared with the structural models obtained by the SAMPLE-SELECT algorithm (Figure 3.2). For three models, at resolutions better than 2 Å, ensemble Rfree values are worse than those of the reference model. In three cases the ensemble Rfree is lower than the multi-conformer Rfree. But an important limitation of ensemble refinement is that it identifies only the dominant conformation (the conformation with the highest occupancy) in cases where the electron density is spatially well-separated for alternative conformations [62, 106]. Ensemble refinement achieves a better value of Rfree by providing a better model of the dominant conformation group. This limitation is demonstrated in Figure 3.3A. While an ensemble model (in salmon color) is quite heterogeneous and globally yielded a slightly lower Rfree than the multi-conformer model, it fails locally to account for difference density (observed density minus calculated density) corresponding to distinct side-chain conformations. The multi-conformer model correctly identifies the alternate conformations (Figure 3.3B).

Figure 3.3: Structural heterogeneity around residues TYR18 and ARG77 in the protein with PDB id 2NLV. A) The single-conformer reference model (in cyan color). Positive density of the difference EDM (observed density minus calculated density) corresponding to the single model (in lime color) is contoured at 1.75σ. B) The multi-conformer model (in gray color) neatly models alternate conformations of ASN17, TYR18 and ARG77 in the positive density, albeit at the cost of a small misfit in the B conformation of ARG77. The algorithm does not find sufficient evidence for the ASN17 side-chain conformation of the reference model. Examination of the difference EDM reveals a substantial negative density in this area.

3.4 Main-chain Driven Heterogeneity

Here we consider the case where significantly different conformations of fragments of the main-chain are present in the crystal.

3.4.1 Sampling

Fragments of a protein main-chain such as flexible loops can undergo correlated motion of the main-chain, resulting in main-chain driven heterogeneity. Changes in side-chain conformations take place as well to accommodate main-chain motion. Such motion often plays a critical role in binding, as in the calcium-binding protein calmodulin (see Chapter 2). Main-chain heterogeneity often leads to blurred density in relatively large regions of an EDM, making modeling of density extremely challenging. These EDM regions are often left uninterpreted, resulting in gaps in the protein models. Unfortunately, such gaps often correspond to functional fragments of a protein.

The SAMPLE-SELECT algorithm has proven to be very effective for modeling side-chain driven heterogeneity. So, we have extended the algorithm to model main-chain driven heterogeneity as well.

The algorithm alternates SAMPLE and SELECT steps with a different sampling strategy.

The protein fragment corresponding to the blurred region of the EDM is divided into a front and a back half, each with p = ⌈n/2⌉ residues. Conformations of these two halves are built incrementally and are connected using an inverse kinematics (IK) algorithm. The various steps in the sampling procedure are described in more detail below. The key idea is to consider fractions of the front and back halves of increasing size, so that the number of conformations sampled by each SAMPLE step can be handled by the next SELECT operation. It should be noted that there are many possible variants, some of which might work equally well.

a) Sampling the main-chains of the two halves. Temporarily ignoring side-chains and temperature factors, candidate partial conformations of the front half of the main-chain fragment are built by sampling one φ or ψ angle at a time, starting from the N terminus. The first φ angle is sampled at some uniform resolution ε (set to 2 degrees in our implementation). The bond angle centered at the N atom preceding this φ angle is also sampled at the same resolution ε in the 12-degree interval around its corresponding Engh-Huber value. A set of 6 × 2π/ε candidate positions is thus obtained for the following Cβ and C atoms and is submitted to SELECT. Let k1 be the number of partial conformations retained by SELECT. Next, the following dihedral angle (a ψ angle) and the two following bond angles centered at Cα and C atoms are sampled in the same way. A set of 12 × 2π/ε × k1 candidate positions is obtained for the following O, N, and Cα atoms. This set of conformations is submitted to SELECT and an ensemble of size k2 is obtained.

At this point, the two φ and ψ angles are re-sampled at a finer resolution (0.5 degrees) in small neighborhoods (±1 degree) of their values in the ensemble of size k2. This re-sampling step yields an expanded set of candidate conformations that is submitted to SELECT. The remaining p − 1 residues in the front half of the main-chain fragment are handled in the same way. The same procedure is applied in reverse to the back half, starting from its C terminus.

b) Inserting side-chains. Immediately after a pair of consecutive φ and ψ angles have been re-sampled, the side-chain of the residue containing those two angles is inserted. The rotamer library proposed in [77] is used to obtain the values of the χ angles. Adding the side-chain multiplies the number of partial conformations of the front half by the number of rotamers for the side-chain, and the set of conformations is submitted to SELECT. The same procedure is applied to the back half.

c) Assigning temperature factors. Temperature factors are assigned whenever a φ or ψ angle is sampled or a side-chain is inserted. Their values are taken from a finite set T input by the user. However, assigning a distinct temperature factor to every atom would quickly lead to large sets of candidate conformations. So, groups of atoms are defined, and are assigned the same temperature factors. The Cβ and C atoms following a φ angle form one group, as do the O, N, and Cα atoms following a ψ angle and the atoms in a side-chain. More specifically, consider the case where a φ angle is sampled in the front half. As described in paragraph (a), this gives a number of candidate conformations for the following Cβ and C atoms. Each of these conformations is paired with a distinct temperature factor from T. Similarly, when a side-chain is inserted, each rotamer is paired with a distinct temperature factor from T.

d) Connecting the front and back halves. All pairs of conformations of the fragment's front and back halves computed as above are enumerated. For each pair, complete closed conformations of the fragment's main-chain are obtained by computing six dihedral angles using the analytical IK algorithm described in [21]. More precisely, for each pair, every three consecutive residues are considered such that at least one belongs to the front half and another one to the back half, and the φ and ψ angles in those residues are re-computed using the IK algorithm, so that the fragment's main-chain gets perfectly closed. The side-chain conformations and temperature factors for each of these residues are set as in either the front or back half conformation.
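The size of the first candidate set in paragraph (a), and the way temperature factors in paragraph (c) multiply it, can be checked with a small sketch. All names are illustrative. The value 121.7 degrees used as the center of the bond-angle window and the choice of sampling bin centers are assumptions made only for this example.

```python
from itertools import product

def first_step_candidates(eps_deg=2.0, bond_window_deg=12.0,
                          bond_center_deg=121.7):
    """Enumerate (phi, bond angle) pairs for the first sampled phi angle.

    phi covers the full circle at resolution eps_deg; the bond angle covers
    a window of bond_window_deg around bond_center_deg (assumed value).
    """
    n_phi = round(360.0 / eps_deg)            # 2*pi / eps samples
    n_bond = round(bond_window_deg / eps_deg)  # 6 samples for 12 deg at 2 deg
    phis = [-180.0 + i * eps_deg for i in range(n_phi)]
    start = bond_center_deg - bond_window_deg / 2.0
    # Bin centers are used so that exactly n_bond values lie in the window.
    bonds = [start + (j + 0.5) * eps_deg for j in range(n_bond)]
    return list(product(phis, bonds))

def pair_with_temperature_factors(candidates, t_values):
    """Pair every candidate with every temperature factor in t_values."""
    return list(product(candidates, t_values))

cands = first_step_candidates()
print(len(cands))  # 6 x 180 = 1080
```

With a hypothetical set T of three values, `pair_with_temperature_factors(cands, [10.0, 20.0, 30.0])` contains 3240 entries, which shows why atoms are grouped rather than assigned individual temperature factors.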

All the closed conformations are collected into a candidate set that is submitted to SELECT. The result is the final multi-conformer model. If desired, a clustering algorithm can be run on this ensemble to merge conformations that are RMSD-wise very close.

The above method sometimes eliminates a pertinent partial conformation. This is due to the fact that partial conformations are retained based on their fit with only a subset of the EDM. So, a SELECT step might retain one conformation and discard another based on this local fit, while the inverse result could have been obtained if larger fractions of the fragment had been considered. Unfortunately, when a pertinent partial conformation has been discarded, it cannot be recovered later. So, to reduce the risk of eliminating a pertinent partial conformation, a greater number of partial conformations are retained at each selection step. This is done as follows. Let m be the size of the set of conformations given to a selection step and m′ the size of the ensemble retained by SELECT. The remaining m − m′ conformations are submitted again to SELECT, and this operation is repeated until a pre-specified number of conformations have been obtained.
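The repeated-selection scheme described above can be sketched as a simple loop. Here `select` stands for any callable that returns the subset of candidates it retains; it is a stand-in for the SELECT step, whose actual optimization is not reproduced. The names `retain_more` and `target` are illustrative.

```python
def retain_more(candidates, select, target):
    """Apply `select` repeatedly to the candidates not yet retained.

    Stops once at least `target` conformations have been retained, or when
    no candidates remain or `select` retains nothing. The result may contain
    more than `target` items because whole selections are kept.
    """
    retained = []
    remaining = list(candidates)
    while remaining and len(retained) < target:
        chosen = select(remaining)
        if not chosen:
            break
        retained.extend(chosen)
        # Resubmit only the m - m' conformations that were not retained.
        remaining = [c for c in remaining if c not in chosen]
    return retained
```

For example, with a toy `select` that keeps the two lowest scores, `retain_more([5, 3, 9, 1, 7, 2], lambda xs: sorted(xs)[:2], 4)` keeps the two best candidates in the first round and the next two in the second round.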

3.4.2 Validation with simulated data

The 123-residue protein with PDB ID 2R4I, an NTF-3 like protein, was solved by the JCSG at a resolution of 1.6 Å. The asymmetric unit contained four nearly identical copies of the molecule, distinguished by chain identifiers A-D in the PDB file. In each of the four chains the fragment spanning the residues 104-112 crystallized in slightly different conformations. The atoms from residues 104-112 from chain A are added to the corresponding residues from chain B (Figure 3.4). Indeed, the fragment can presumably adopt both of these conformations. The conformations are closely intertwined, separated by only 1.4 Å RMSD.

Simulated electron density data is generated at different resolutions and at various occupancies. Gaussian noise with a standard deviation of 10% of the magnitude of the calculated data was added to simulate experimental errors. The temperature factors of the individual PDB structures were retained, averaging 19.0 Å².

Figure 3.4: Residues 104-112 of 2R4I. Top panel: Conformations from chain A and B in the EDM at 0.7/0.3 occupancy. At high contour levels, atoms from the chain at lower occupancy are no longer contained within the iso-surface. Lower panel: PDB fragment from chain A (left) in green and PDB fragment from chain B (right) in cyan together with the calculated conformations.

The algorithm returns an ensemble in excellent agreement with the actual conformations, with a good estimate of the true occupancy values and average temperature factors (Table 3.2). The finite discretization of the sampling procedure results in multi-conformer models that contain more than two conformations. But every returned model contains two groups of very similar conformations that could be merged by an RMSD-based clustering algorithm.

Occ      Res  RMSD                      Calc Occ  B̄calc
0.5/0.5  1.3  0.26 0.34                 0.29      24.3
              0.38 0.64 0.77 1.29       0.71
0.5/0.5  1.5  0.32 0.64                 0.36      27.4
              0.38 0.49 0.52 0.69       0.64
0.5/0.5  1.7  0.29 0.42 0.42 0.43       0.50      25.5
              0.23 0.23 0.34            0.50
0.6/0.4  1.3  0.29 0.40 0.64            0.53      25.5
              0.31 0.35 0.64            0.47
0.6/0.4  1.5  0.30 0.44 0.54 0.58       0.61      25.7
              0.33 0.62                 0.39
0.6/0.4  1.7  0.23 0.24 0.35 0.60       0.58      22.0
              0.23 0.62                 0.41
0.7/0.3  1.3  0.27 0.34 0.34 0.34 0.64  0.70      24.1
              0.48                      0.30
0.7/0.3  1.5  0.33 0.33 0.40            0.64      29.0
              0.61 0.76                 0.36
0.7/0.3  1.7  0.31 0.37 0.42 0.47       0.65      21.7
              0.44 0.62                 0.35

Table 3.2: Details of calculated dual conformations for loop 104-112 of 2R4I. Each row lists occupancies for the conformations (Occ), map resolution (Res, in Å), RMSD of calculated conformations to PDB conformations, the cumulative calculated occupancies for the conformations (Calc Occ), and average temperature factor of calculated conformations (B̄calc, in Å²). Average, observed temperature factors are 19.0 Å².

The same test is run again, but this time the true conformations are added to the sample set. The results presented in Table 3.3 show that in most cases the SAMPLE-SELECT algorithm returns the true conformations. In some cases, it produced more than two conformations and in all cases occupancies and temperature factors are slightly inexact. These small discrepancies seem to be caused by the Gaussian noise added to the EDM. The greater discrepancies in the results presented in Table 3.2 are, thus, manifestations of both discretization errors and errors due to added noise.

Furthermore, coordinate error is larger for the lower occupancy conformation. It should be noted that at an occupancy of 0.3, a carbon atom only scatters at about twice the magnitude of a hydrogen atom. The signal of a hydrogen atom is distinguished from the background level only at resolution levels better than 1.3 Å (i.e., less than 1.3 Å). At resolution levels considered here, hydrogen atoms are not explicitly included in PDB files.
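The factor of about two mentioned above for a carbon atom at occupancy 0.3 follows from electron counts. The sketch below makes the simplifying assumption, introduced here only for illustration, that an atom's scattering at low angle is proportional to its number of electrons weighted by its occupancy; the names are hypothetical.

```python
# Number of electrons per neutral atom.
ELECTRONS = {"H": 1, "C": 6}

def effective_electrons(element, occupancy=1.0):
    """Occupancy-weighted electron count, a rough proxy for scattering power."""
    return occupancy * ELECTRONS[element]

# Carbon at occupancy 0.3 compared with a fully occupied hydrogen.
ratio = effective_electrons("C", 0.3) / effective_electrons("H")
print(round(ratio, 2))  # 1.8
```

A carbon atom in the 0.3-occupancy conformation is therefore only slightly more visible than a hydrogen atom, which helps explain the larger coordinate error for the lower-occupancy conformation.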

3.4.3 Results with experimental data

A structural model for TM0755 was obtained by the JCSG from data at 1.8 Å resolution. The asymmetric unit contains a dimer, with a short main-chain fragment around residue A320, and the same fragment around B320, bimodally disordered. Crystallographers had initially abandoned this fragment due to difficulty interpreting the EDM visually. A dual conformation for the fragment A316-A325, separated by 2.96 Å, was obtained from semi-automated methods at 0.5/0.5 occupancy [5]. The average, occupancy-weighted temperature factor was 24.9 Å². The structure together with the heterogeneous fragment was refined, subjected to the JCSG's quality control protocol (unpublished) and ultimately deposited in the PDB.

Occ      Res  RMSD            Calc Occ  B̄calc
0.5/0.5  1.3  0.00            0.47      25.2
              0.00            0.53
0.5/0.5  1.5  0.00            0.48      22.3
              0.00            0.52
0.5/0.5  1.7  0.00 0.29       0.49      22.0
              0.00 0.23       0.51
0.6/0.4  1.3  0.00            0.56      20.1
              0.00            0.44
0.6/0.4  1.5  0.00 0.54       0.61      24.7
              0.00            0.39
0.6/0.4  1.7  0.00            0.57      23.4
              0.00            0.43
0.7/0.3  1.3  0.00            0.65      25.9
              0.00            0.35
0.7/0.3  1.5  0.00 0.33 0.33  0.66      29.0
              0.00 0.61       0.34
0.7/0.3  1.7  0.00            0.65      20.8
              0.00            0.35

Table 3.3: Details of calculated dual conformations for loop 104-112 of 2R4I. The true conformations were added in the sampling protocol. Each row lists occupancies for the conformations (Occ), map resolution (Res, in Å), RMSD of calculated conformations to PDB conformations, the cumulative calculated occupancies for the conformations (Calc Occ), and average temperature factor of calculated conformations (B̄calc, in Å²). Average, observed temperature factors are 19.0 Å².

An experimental electron density map is calculated from diffraction images [87]. The SAMPLE-SELECT algorithm returned a multi-conformer model consisting of 5 conformations. Two conformations in the model are 0.47 and 1.24 Å RMSD away from one of the conformations obtained at JCSG, with occupancies 0.15 and 0.23. The other three calculated conformations are 0.64, 0.72, and 0.82 Å RMSD away from the other conformation obtained at JCSG, with occupancies 0.27, 0.23, and 0.12, respectively. The average, occupancy-weighted temperature factor of the ensemble is 30.3 Å².
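To compare this five-conformation result with the deposited 0.5/0.5 dual conformation, the occupancies of the calculated conformations can be summed per nearest deposited conformation, in the same way as the cumulative occupancies (Calc Occ) in Tables 3.2 and 3.3. The sketch below uses the occupancies quoted in the preceding paragraph; the labels `ref1` and `ref2` and the function name are illustrative.

```python
def cumulative_occupancies(assignments):
    """Sum occupancies of calculated conformations, grouped by the label of
    the reference conformation each one is closest to."""
    totals = {}
    for ref, occ in assignments:
        totals[ref] = totals.get(ref, 0.0) + occ
    return totals

# (closest deposited conformation, calculated occupancy) for TM0755
tm0755 = [("ref1", 0.15), ("ref1", 0.23),
          ("ref2", 0.27), ("ref2", 0.23), ("ref2", 0.12)]
totals = cumulative_occupancies(tm0755)
print({k: round(v, 2) for k, v in totals.items()})
```

The two groups sum to 0.38 and 0.62, compared with 0.5 and 0.5 in the deposited model.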

This result demonstrates that the SAMPLE-SELECT algorithm is also effective with experimental data which, in contrast to simulated data, may contain substantial phase angle errors. However, the algorithm fares much better in modeling side-chain driven heterogeneity. The main reason is that sampling errors in modeling main-chain driven heterogeneity propagate along the main-chain. The sampling procedure involves incrementally sampling partial conformations and submitting them to SELECT. The partial conformations retained by SELECT might contain errors and the errors propagate to subsequent sampling. The finite sampling discretization is also a cause of errors that propagate along the main-chain. In fact, reducing error propagation has been the main motivation for dividing the main-chain fragment into two halves. In spite of remaining errors, the SAMPLE-SELECT algorithm provides a useful computational tool to automatically identify an occupancy-weighted ensemble of conformations that collectively model main-chain driven heterogeneity in the EDM. Such conformations are very difficult to identify otherwise.

3.5 Conclusion

This chapter presents a SAMPLE-SELECT algorithm that models heterogeneity from X-ray data. It consists of alternating sampling and selection steps that produce an occupancy-weighted ensemble of conformations that provide a near-optimal explanation of an EDM. The software is implemented in C++, and uses the following packages: Clipper [22], LoopTK [31], CGAL (Computational Geometry Algorithms Library; http://www.cgal.org), and COIN-OR (Computational Infrastructure for Operations Research; http://www.coin-or.org). The algorithm applicable to side-chain driven heterogeneity is implemented in a software called qFit that is available at http://smb.slac.stanford.edu/qFitServer and is regularly used in JCSG's protein structure determination pipeline.

Experiments show that the SAMPLE-SELECT algorithm models side-chain driven heterogeneity particularly well, while modeling of main-chain driven heterogeneity is less accurate. Validation tests on simulated EDMs with noise establish that the algorithm identifies and models true alternate conformations at a consistently high rate, even as the resolution falls below 2 Å in the case of side-chain driven heterogeneity. The accuracy as measured by Rfree values decreases toward low resolution as ambiguity in the electron density increases. The convex optimization method used in the selection step of the algorithm is known to identify the global optimum of the target function. Within the limitation of a discrete set of samples, the conformations with non-zero occupancy optimally explain the electron density. Hence, alternate conformations modeled by the algorithm for EDMs with poor resolution mainly represent uncertainty rather than coordinated motion, consistent with the conclusion in [99]. In the case of main-chain driven heterogeneity, the results are less accurate mainly due to limitations of the sampling strategy. However, the algorithm provides a way to identify alternate main-chain conformations from EDMs that are difficult to interpret.

Modeling heterogeneity from EDMs is of major importance and may have a major impact on the way protein models are eventually stored in the Protein Data Bank. It is also important for better identification of possible functional states of a protein and thus development of therapeutics. This work is a step in the direction of automatically and consistently identifying structural heterogeneity in proteins from X-ray crystallography. It further shows that the power of kino-geometric sampling can be utilized to gain important biological knowledge.

Chapter 4

Determination of Allosteric Pathways

An allosteric protein is a protein in which a change of shape at one site, the allosteric site, propagates across the protein to another site, the active site, where the protein performs its function. The change of shape at the allosteric site is usually due to the binding of a ligand, e.g., as in CREB-binding proteins [64] and PDZ family proteins [44]. But it may have other causes, such as phosphorylation (e.g., in chemotaxis protein Y [38]) or change in illumination (e.g., in phototropins [107]). The resulting change of shape at the active site alters (either increases or decreases) the ability of the protein to bind another molecule at this site, hence the protein's function. The schematic shown in Figure 4.1 illustrates this process: an effector molecule binds the protein at the allosteric site; the change of shape propagates to the active site where it becomes easier for another molecule (the substrate) to bind. In the opposite case, the allosteric effect makes it more difficult for the substrate to bind at the active site.

CREB-binding proteins provide a well-known example of allosteric interaction. These proteins regulate the transcriptional process and therefore play an important role in cellular differentiation and development. The KIX domain of a CREB-binding protein binds both MLL (mixed lineage leukemia) and CREB/c-Myb (transcription regulator proteins) at two distant sites (Figure 4.2). Thermodynamic analysis shows that the binding of MLL increases the affinity of c-Myb to the other binding site twofold [43]. There is experimental evidence that by binding the allosteric site MLL deforms residue PHE612 and that the deformation propagates to the c-Myb binding site. It is believed that the propagation occurs over several pathways of spatially adjacent residues connecting the two sites.

Figure 4.1: Schematic of allosteric communication: (a) a protein with two binding sites, (b) allosteric effector binds to the protein at the allosteric binding site and alters activity at the other binding site through allosteric communication, and (c) binding of ligand at the other binding site is facilitated.

Figure 4.2: A CREB-binding protein. The KIX domain (in blue color) of the protein binds domains of mixed lineage leukemia (MLL, in light green color) and CREB/c-Myb (in orange color). Upon MLL binding, affinity for c-Myb binding increases twofold.

Many current pharmaceutical drugs are small molecules that act as inhibitors by binding proteins at their active sites to alter their functions [65, 80]. However, as such drugs compete with the ligands that naturally bind the proteins, they often have unintended and undesirable side-effects. Recent studies [18, 81] show that a drug molecule binding a protein's allosteric site may regulate, rather than inhibit, the protein function at the active site. In this way, such a drug may have decreased side-effects. This result has recently generated interest in developing drugs that bind allosteric sites, rather than active sites, especially in kinases [19] and GPCRs [36], which are targets for many marketed drugs. Allosteric drug design [61, 88] is an emerging paradigm in pharmaceutical drug research. This new approach requires the ability to determine how deformation at one site propagates to another site.

In this chapter, we present a new computational tool developed to determine allosteric pathways, i.e., sequences of residues through which deformation propagates in an allosteric protein. Our tool takes one or several folded conformations of a protein as input. It assumes that the main-chain undergoes only minor conformational changes during the allosteric deformation and that the deformation propagates through spatial rearrangements of side-chains. This assumption is valid for some, but not all, allosteric proteins. Our approach consists of (1) computing a collection of side-chain placements that achieve protein conformations with low van der Waals energy, (2) using these placements to find possible interactions among residues, and (3) extracting potential pathways from a graph representing interactions among residues.
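Step (3) above can be pictured with a minimal sketch of path enumeration in a residue-interaction graph. The sketch assumes the graph is given as a dictionary mapping each residue to the set of residues it interacts with, and it simply lists simple paths up to a length cap. The function name, the graph encoding, and the length cap are illustrative assumptions; the extraction procedure actually used in this work may differ.

```python
def simple_paths(graph, start, goal, max_len):
    """Return all paths from start to goal that visit no residue twice and
    contain at most max_len residues (depth-first enumeration)."""
    found = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            found.append(path)
            continue
        if len(path) == max_len:
            continue
        # Reverse-sorted push so that neighbors are explored in sorted order.
        for nxt in sorted(graph.get(node, ()), reverse=True):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return found

# Hypothetical interaction graph; A plays the allosteric site, E the active site.
toy = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"},
       "D": {"B", "C", "E"}, "E": {"D"}}
print(simple_paths(toy, "A", "E", max_len=4))
```

In this toy graph two candidate pathways connect the sites, which mirrors the expectation stated earlier that propagation may occur over several pathways of adjacent residues.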

4.1 Related Work

The study of allosteric pathways is a relatively recent research theme. Both experimental and computational methods have been developed in recent years.

a) Experimental methods: Experimental determination of allosteric pathways in a protein is notoriously difficult [28]. Existing methods first try to identify residues likely to propagate deformation and then hypothesize pathways connecting residues.

• In one method, called thermodynamic mutant analysis [32, 75], energetic effects of mutations are evaluated. The energetic effect of one mutation, m1, is measured (i) in the wild-type protein (∆G_{m1}) and (ii) in the presence of a second mutation (∆G_{m1|m2}). The energy difference gives the coupling energy (∆∆G_{m1,m2}) between the two mutations. If m1 does not have the same effect in the wild-type and in the presence of m2 (i.e., if ∆G_{m1|m2} ≠ ∆G_{m1}), then ∆∆G_{m1,m2} is non-zero, which implies thermodynamic coupling of the two mutations. Residues that are thermodynamically coupled are hypothesized to propagate deformation.

• In NMR (Nuclear Magnetic Resonance) relaxation-dispersion methods [12, 39, 43], transitions between conformational states can be observed. Residues that display structural rearrangements upon binding are considered to propagate deformation. However, rearrangements that take place over timescales on the order of 10⁻³ s or longer, which are critical for allosteric communication, are not observed. Most of the NMR relaxation-dispersion based methods consider only methyl side-chain dynamics and therefore non-methyl residues are often ignored.
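For the thermodynamic mutant analysis in the first item above, the coupling energy can be written compactly. The text only states that it is the difference of the two measured effects, so the sign convention below is an assumption made for illustration.

```latex
\[
\Delta\Delta G_{m_1,m_2} \;=\; \Delta G_{m_1 \mid m_2} \;-\; \Delta G_{m_1},
\qquad
\Delta\Delta G_{m_1,m_2} \neq 0
\;\Longleftrightarrow\;
\Delta G_{m_1 \mid m_2} \neq \Delta G_{m_1}.
\]
```

A value of zero means that m1 has the same energetic effect with or without m2, i.e., the two mutations are not thermodynamically coupled.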

Overall, experimental methods have provided interesting insights into allosteric pathways. However, different methods have led researchers to infer different pathways for the same protein [94]. This suggests that current experimental methods may provide incorrect results, or at least incomplete information about allosteric pathways.

b) Computational methods: Computational methods can be broadly classified into sequence-based and structure-based methods:

• In sequence-based methods, evolutionary data for a protein family is used [75]. Multiple sequence alignments provide information on evolutionarily correlated mutations. The residues involved in these mutations are assumed to propagate deformation. Allosteric pathways that connect these residues are subsequently hypothesized. So, sequence-based methods do not directly analyze structural rearrangements that propagate deformation.

• Structure-based methods, as their name suggests, directly study structural rearrangements of residues. In [23], contacts between residues are compared in given bound and unbound structures. Residues that display contact rearrangement are considered to propagate deformation. However, the method needs two structures (bound and unbound), which are not always available. Instead, in [71] an edge is constructed between two nodes of a graph if conformational variation at the residue corresponding to one node leads to conformational variation at the residue corresponding to the other node. To infer correlated conformational changes, the method analyzes sequences of conformations generated using Monte Carlo (MC) or Molecular Dynamics (MD) simulation. One drawback of this method is the high computational cost of using MC or MD simulation, as large energy barriers must be overcome in order to represent structural heterogeneity in residue conformations. Another drawback of MC/MD simulation is that it provides sequences of conformations that may incompletely represent all possible interactions among residues.

Our method, called RDPG (for Residue Deformation Propagation Graph), bears similarities with the method proposed in [71]. However, we use a different method to sample conformations. This method avoids the high computational cost of either MC or MD simulation. It allows us to find interactions among residues in a more general way than by using simulation techniques to generate sequences of conformations.

4.2 RDPG Method

4.2.1 Overview

Given a folded protein conformation, the RDPG method constructs a graph that represents possible interactions between side-chains in the conformation. The underlying idea is that allosteric deformation is akin to a domino effect (Figure 4.3): deformation at a residue propagates to spatially close residues, due to volume exclusion, and the deformation of these residues propagates in turn to more residues.

Figure 4.3: Strain propagation is akin to a domino effect: (a)-(e) strain successively propagates from one residue to another, with each residue pushing a spatially close residue, and (f) residues A, i, j, and B form an allosteric pathway.

Each node of the graph is a side-chain conformation extracted from a rotamer library. Each arc connects a rotamer $r$ for one residue to a rotamer $r'$ for another, spatially close residue. This arc indicates that changing the side-chain conformation of the former residue to $r$ in the given protein conformation is energetically prohibitive, unless the side-chain conformation of the second residue is also changed to $r'$. From this graph, we infer a reduced graph whose nodes represent residues. Paths in this graph form candidate sequences of residues along which deformations may propagate.

In Sections 4.2.2 and 4.2.3 we describe how the nodes and the arcs of the RDPG are generated from the input protein conformation, using a rotamer library and an energy function. In Section 4.2.4 we show how candidate allosteric pathways are computed from the RDPG. Finally, in Section 4.2.5 we extend the RDPG method to handle multiple slightly different main-chain conformations of the protein.

4.2.2 Node generation

We number the residues in the protein 1, 2, ..., n, as they appear in the main-chain sequence. Each node of the RDPG represents a side-chain conformation of some residue i. For each residue i the nodes are obtained by considering an initial set of conformations for the side-chain of i and then eliminating conformations using energy-based criteria. In our implementation of the method, we use the backbone-dependent (i.e., dependent on main-chain dihedral angles) rotamer library proposed in [58] to create the initial set of conformations. For a given main-chain conformation of a protein, this library provides a set of possible side-chain conformations for each residue. These side-chain conformations, called rotamers, have been determined by clustering side-chain conformations with similar shapes in known protein structures.

In the following, the side-chain of each residue consists of the residue’s Cβ atom and all the atoms beyond this Cβ atom. The atoms in the protein’s main-chain are all the atoms not contained in the side-chains. Our energy-based criteria to eliminate rotamers use two energy functions:

• For any given rotamer $r_i$ of residue i, we define $E(r_i)$ to be the sum of the internal energy of residue i and the interaction energy of residue i with the protein’s main-chain.

• For any pair of rotamers $r_i$ and $r_j$, where $r_i$ is a rotamer of residue i and $r_j$ a rotamer of residue j, we define $E(r_i, r_j)$ to be the interaction energy between residues i and j.

In our implementation, both functions are constructed by only using the repulsive term of the Lennard-Jones 6-12 potential that is commonly used to model the van der Waals energy and account for volume exclusion. For a given pair of atoms, this term is proportional to $d^{-12}$, where $d$ is the distance between the atom centers. This choice is consistent with the remarks made in Section 1.4 of Chapter 1: since side-chains are compactly packed in a folded conformation, the repulsive component of the van der Waals energy is dominant. However, the steps of our algorithm are independent of the choice of energy function.

We eliminate rotamers associated with each residue to eventually retain only energetically favorable rotamers. Let $R_i$ be the set of rotamers for residue i at any time during the elimination process. $R_i$ is initialized as defined above using the rotamer library given in [58]. The elimination process consists of two phases.
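As a rough illustration of the repulsive-only energy described above, the following Python sketch sums a $d^{-12}$ contribution over all atom pairs drawn from two sets of atom centers. The function name `repulsive_energy`, the single scaling coefficient `coeff`, and the coordinate-tuple representation are assumptions made for this example; a real force field would use atom-type-dependent coefficients and would exclude bonded pairs.

```python
import math


def repulsive_energy(atoms_a, atoms_b, coeff=1.0):
    """Sum coeff / d**12 over every pair (one atom from each set).

    atoms_a, atoms_b: iterables of (x, y, z) atom-center coordinates.
    coeff: hypothetical uniform coefficient (an assumption of this sketch).
    """
    total = 0.0
    for pa in atoms_a:
        for pb in atoms_b:
            d = math.dist(pa, pb)  # Euclidean distance between atom centers
            total += coeff / d ** 12
    return total
```

Because the term grows so steeply as $d$ shrinks, halving the distance between two atoms multiplies their contribution by $2^{12} = 4096$, which is why close contacts dominate the sum.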

In the first phase, for every residue i, we remove every rotamer $r$ from $R_i$ such that:

$$E(r) > T_{\text{self+bkb}}, \tag{4.1}$$

where $T_{\text{self+bkb}}$ is a given threshold. In our implementation, the value of $T_{\text{self+bkb}}$ has been determined empirically and set to 100.0 kcal/mol.

The second phase considers interactions among rotamers. First, for every pair of residues i and j, for every rotamer $r_i$ of i still in $R_i$, we determine the rotamer $r_j^i$ still in $R_j$ that is defined by:

$$r_j^i = \arg\min_{r_j \in R_j} \{E(r_j) + E(r_i, r_j)\}. \tag{4.2}$$

The minimum energy contribution $EC_i^{\min}$ of $r_i$ is then computed as:

$$EC_i^{\min} = E(r_i) + \sum_{j \neq i} \left( E(r_j^i) + E(r_i, r_j^i) \right). \tag{4.3}$$

Rotamer $r_i$ is removed from $R_i$ if $EC_i^{\min}$ is greater than the energy of the input protein conformation. The steps of the second phase are repeated until no more rotamers can be eliminated. Every remaining rotamer is then installed as a distinct node of the RDPG.

Node generation takes $O(mn^2)$ time, where $m$ is the maximum number of rotamers for a side-chain and $n$ is the number of residues in the protein. For a given rotamer library, $m$ is a fixed constant. The quadratic cost of the second phase may be reduced by using a spatial indexing of the residues to only consider the residues j that are spatially close to i in the computation of $EC_i^{\min}$.
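The two elimination phases can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name, the callables `e_self` and `e_pair` (standing for the two energy functions defined in this section), and the dictionary layout are hypothetical. The sketch loops over all residues j rather than only spatially close ones, and it assumes that a residue with no remaining rotamers contributes nothing to the sum in (4.3).

```python
def eliminate_rotamers(rotamers, e_self, e_pair, e_input, t_self_bkb=100.0):
    """Return the rotamers that survive both elimination phases.

    rotamers: dict mapping residue index -> list of rotamer ids
    e_self(i, r): E(r) for rotamer r of residue i
    e_pair(i, r, j, s): E(r, s) for rotamer r of residue i and s of residue j
    e_input: energy of the input protein conformation
    """
    # Phase 1 (Eq. 4.1): drop rotamers whose self + backbone energy is too high.
    kept = {i: [r for r in rs if e_self(i, r) <= t_self_bkb]
            for i, rs in rotamers.items()}

    # Phase 2 (Eqs. 4.2-4.3): repeat until a full pass removes nothing.
    changed = True
    while changed:
        changed = False
        for i in kept:
            survivors = []
            for r in kept[i]:
                ec_min = e_self(i, r)
                for j in kept:
                    if j == i or not kept[j]:
                        continue  # assumption: empty residues add nothing
                    # Best partner rotamer of residue j for r (Eq. 4.2).
                    ec_min += min(e_self(j, s) + e_pair(i, r, j, s)
                                  for s in kept[j])
                if ec_min > e_input:
                    changed = True  # r cannot beat the input conformation
                else:
                    survivors.append(r)
            kept[i] = survivors
    return kept
```

The outer `while` loop mirrors the statement that the second phase is repeated until no more rotamers can be eliminated: removing one rotamer can raise the minimum energy contribution of rotamers at other residues.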

4.2.3 Arc generation

Let $u_i$ be the side-chain conformation at residue i in the input protein conformation. Consider all pairs $(r_i, r_j)$ such that $r_i$ and $r_j$ are two nodes (rotamers) of the RDPG respectively associated with two distinct, but spatially close residues i and j. An arc (oriented edge) connecting $r_i$ to $r_j$ is inserted in the RDPG if all three of the following conditions are satisfied (Figure 4.4):

1) Changing $u_i$ into $r_i$ in the input protein conformation creates significant stress on side-chain conformation $u_j$ at residue j, a condition that we express by:

$$E(r_i, u_j) - E(u_i, u_j) > T_{\text{stress}}, \tag{4.4}$$

where $T_{\text{stress}}$ is a specified threshold.

2) Changing $u_j$ into $r_j$ reduces stress, i.e.:

$$E(r_i, r_j) < E(r_i, u_j). \tag{4.5}$$

3) The interaction energy of $r_i$ and $r_j$ is less than a specified threshold $T_{\text{pair}}$, i.e.:

$$E(r_i, r_j) < T_{\text{pair}}. \tag{4.6}$$

Figure 4.4: Arc computation: (a) strain is created at residue j by deformation of the side-chain at residue i to rotamer $r_i$, (b) strain is reduced at residue j by deformation of the side-chain at residue j to rotamer $r_j$, and (c) an arc is added in the residue deformation propagation graph (RDPG) connecting the nodes corresponding to rotamers $r_i$ and $r_j$.

In our implementation, two residues are considered spatially close if any atom in a rotamer at one residue is within 4.0 Å of an atom in a rotamer at the other residue. The empirically determined values of $T_{\text{stress}}$ and $T_{\text{pair}}$ are 0.1 kcal/mol and 20.0 kcal/mol, respectively.
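The three arc conditions (4.4)-(4.6) translate directly into a nested loop. The Python sketch below is only an illustration: `build_arcs`, the `(residue, rotamer)` node encoding, the precomputed list `close_pairs` of ordered spatially close residue pairs, and the callable `e_pair` are all hypothetical names introduced here. The default thresholds are the values quoted above.

```python
def build_arcs(nodes, input_conf, close_pairs, e_pair,
               t_stress=0.1, t_pair=20.0):
    """Return the set of arcs ((i, r_i), (j, r_j)) of the RDPG.

    nodes: dict residue -> iterable of surviving rotamer ids
    input_conf: dict residue -> rotamer id u_i of the input conformation
    close_pairs: iterable of ordered pairs (i, j) of spatially close residues
    e_pair(i, r, j, s): interaction energy E(r, s)
    """
    arcs = set()
    for i, j in close_pairs:
        u_i, u_j = input_conf[i], input_conf[j]
        base = e_pair(i, u_i, j, u_j)
        for r_i in nodes[i]:
            stressed = e_pair(i, r_i, j, u_j)
            # Condition 1 (Eq. 4.4): r_i must put significant stress on u_j.
            if stressed - base <= t_stress:
                continue
            for r_j in nodes[j]:
                relaxed = e_pair(i, r_i, j, r_j)
                # Conditions 2 and 3 (Eqs. 4.5 and 4.6).
                if relaxed < stressed and relaxed < t_pair:
                    arcs.add(((i, r_i), (j, r_j)))
    return arcs
```

Passing both `(i, j)` and `(j, i)` in `close_pairs` lets the sketch test each direction separately, since the arcs are oriented.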

So, an arc $(r_i, r_j)$ in the RDPG means that changing $u_i$ into $r_i$ leads to a significant increase of energy, unless the side-chain conformation of residue j is also changed, with $r_j$ being one possible conformation. However, it is possible at this stage that a node representing a rotamer $r_i$ at residue i is not connected to any node representing a rotamer at a residue k spatially close to i (Figure 4.5). This happens in either one of the following two situations:

• Changing $u_i$ into $r_i$ does not create significant stress on side-chain conformation $u_k$ at residue k, or

• Changing $u_i$ into $r_i$ does create significant stress on $u_k$, but no rotamer at residue k satisfies conditions 2 and 3 above.

In that case, it is possible that changing the side-chain conformation at i into $r_i$ is energetically prohibitive, a condition that we express by:

$$E(r_i, u_k) > T_{\text{pair}}. \tag{4.7}$$

If $r_i$ satisfies this condition, then the corresponding node is removed from the RDPG, along with all its incoming and outgoing arcs. Nodes (and arcs) are iteratively deleted in this way until none satisfies the above condition. The construction of the RDPG is then complete.
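The iterative deletion step can be sketched as a fixed-point loop. In this hedged Python illustration, "not connected to any node at residue k" is interpreted as "has no outgoing arc to a rotamer of k", which is an assumption of the sketch; the names `prune_nodes`, `neighbors`, and `e_pair` are likewise hypothetical.

```python
def prune_nodes(nodes, arcs, input_conf, neighbors, e_pair, t_pair=20.0):
    """Iteratively delete nodes that satisfy condition (4.7).

    nodes: dict residue -> iterable of rotamer ids
    arcs: set of ((i, r_i), (j, r_j)) oriented arcs
    input_conf: dict residue -> rotamer id of the input conformation
    neighbors: dict residue -> iterable of spatially close residues
    e_pair(i, r, k, s): interaction energy E(r, s)
    """
    nodes = {i: set(rs) for i, rs in nodes.items()}  # work on copies
    arcs = set(arcs)
    changed = True
    while changed:
        changed = False
        for i in nodes:
            for r_i in list(nodes[i]):
                for k in neighbors[i]:
                    linked = any(src == (i, r_i) and dst[0] == k
                                 for src, dst in arcs)
                    # Eq. 4.7: unlinked and clashing with u_k -> prohibitive.
                    if not linked and e_pair(i, r_i, k, input_conf[k]) > t_pair:
                        nodes[i].discard(r_i)
                        arcs = {(s, d) for s, d in arcs
                                if s != (i, r_i) and d != (i, r_i)}
                        changed = True
                        break
    return nodes, arcs
```

Deleting a node also deletes its incoming arcs, so a previously linked node at a neighboring residue can become unlinked and be removed in a later pass; this cascade is why the loop runs until no node satisfies the condition.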

Figure 4.5: Arc deletion. The node corresponding to rotamer $r_i$ (in red color) is not connected to any node corresponding to rotamers at residue f.

4.2.4 Reduced RDPG and allosteric pathways

To identify allosteric pathways, we first define the reduced RDPG G as follows. Each node of G represents a residue i such that at least one rotamer of i is represented by a node in the RDPG. An arc connects node i to node j in the reduced graph if there exists at least one arc from a rotamer of i to a rotamer of j in the RDPG. See Figure 4.6.

Figure 4.6: Reduced RDPG and allosteric pathway. The reduced RDPG (b) is obtained from the RDPG (a). Potential allosteric pathways are computed using graph traversal over the reduced RDPG. Residues A, i, j, and B form an allosteric pathway because a path of nodes connects A to B via i and j.

Let A and F be two given disjoint sets of residues, such that the residues in A and F are assumed to belong to the protein’s allosteric and functional sites, respectively. Every sequence of residues forming a cycle-free path of G connecting a residue in A to a residue in F can be regarded as a potential allosteric pathway through which deformation at the allosteric site may propagate to the functional site. All potential allosteric pathways are computed using a simple breadth-first graph traversal algorithm.
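The reduction and the path enumeration described in this section can be sketched in Python as below. The function names and the arc encoding `((i, r_i), (j, r_j))` are assumptions of this example. The sketch also assumes that a path stops at the first residue of F it reaches, and that residues without any arc are simply absent from the reduced graph; neither choice is specified in the text.

```python
from collections import deque


def reduce_graph(arcs):
    """Collapse rotamer-level arcs into a residue-level adjacency map."""
    graph = {}
    for (i, _), (j, _) in arcs:
        graph.setdefault(i, set()).add(j)
        graph.setdefault(j, set())
    return graph


def simple_paths(graph, sources, targets):
    """Breadth-first enumeration of cycle-free paths from sources to targets."""
    paths = []
    queue = deque([s] for s in sorted(sources) if s in graph)
    while queue:
        path = queue.popleft()
        last = path[-1]
        if last in targets:
            paths.append(path)  # assumption: stop at the first target residue
            continue
        for nxt in sorted(graph[last]):
            if nxt not in path:  # keep the path cycle-free
                queue.append(path + [nxt])
    return paths
```

Because the queue holds partial paths rather than single nodes, shorter pathways are reported before longer ones. The number of cycle-free paths can grow exponentially with graph size, so this enumeration is practical only for graphs as small as the reduced RDPGs considered here.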

4.2.5 Extension to multiple main-chain conformations

Our RDPG method is only applicable to proteins in which allosteric deformation mainly propagates through side-chain deformations. Nevertheless, small main-chain deformations can still be handled as follows. Let $S = \{c_1, \ldots, c_r\}$ be a given set of folded conformations of an allosteric protein, such that the protein’s main-chain undergoes small conformational changes across the set. We compute a reduced RDPG $G_i$, $i = 1, \ldots, r$, for each conformation $c_i$, as described in Sections 4.2.2 and 4.2.3. All potential allosteric pathways are identified in each reduced RDPG, as described in Section 4.2.4. The final set of allosteric pathways contains the union of all allosteric pathways thus identified.

Figure 4.7: Allosteric studies on CREB-binding protein. Allosteric residues (in yellow color) are determined in NMR relaxation-dispersion experiments described in (a) [12] and (b) [43]. Deformation propagates from the residue PHE612 (in pink color). The red arrows show approximate allosteric pathways connecting allosteric residues. Residues determined from RDPG do not contain ALA654 and TYR658 (in red circles).

Small perturbations of a protein main-chain can have significant impact on possible side-chain conformations. In the tests presented in Section 4.3, we found that using several slightly different main-chain conformations allows the RDPG method to more reliably identify allosteric pathways. In these tests we used multiple conformations obtained from NMR experiments. However, other methods could be used to get these conformations. For example, we could sample multiple main-chain conformations by slightly deforming the given one at random and retain only the sampled conformations that have low energy (after using a side-chain placement algorithm, like [15]).
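The union step over the conformations $c_1, \ldots, c_r$ amounts to de-duplicating residue sequences. A minimal Python sketch, with a hypothetical function name and with each pathway represented as a list of residue identifiers (an assumption of this example):

```python
def union_pathways(per_conformation_paths):
    """Merge the pathways found in each reduced RDPG into one distinct set.

    per_conformation_paths: iterable with one list of pathways per
    conformation, each pathway being a sequence of residue identifiers.
    """
    distinct = set()
    for paths in per_conformation_paths:
        distinct.update(tuple(p) for p in paths)  # tuples are hashable
    return sorted(distinct)
```

A pathway found in several conformations is counted once, so the number of distinct pathways can be smaller than the sum of the per-conformation counts.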

4.3 Test Results

The purpose of this section is to compare results obtained with the RDPG method and previous results obtained experimentally. Our tests were carried out on two allosteric proteins, the CREB-binding protein from mouse and the PDZ domain family protein. It should be noted, however, that the ground truth on allosteric pathways is unknown. Experimental methods give different results. So, comparison is difficult and should be made carefully.

4.3.1 CREB-binding protein

As mentioned at the beginning of this chapter, CREB-binding proteins provide a well-known example of allosteric interaction. Here, we consider the 87-residue CREB-binding protein from mouse, whose PDB id is 2AGH. The PDB structure contains 20 distinct main-chain conformations of the protein obtained through NMR experiment. The 20 conformations are very similar to one another. The KIX domain of the protein binds both MLL and CREB/c-Myb at two distant sites (Figure 4.2). Residues believed to be involved in allosteric deformation have been determined by NMR relaxation-dispersion experiments reported in [12, 43]. Figures 4.7a and 4.7b show in yellow the residues respectively determined by the experiments in [12] and in [43]. In both figures, the red dotted arrows show the approximate allosteric pathways that have been hypothesized from these experiments. The pathways found in [12] and [43] are different. The pathway shown in Figure 4.7a connects residue PHE612 to residue TYR650; it is believed to play a major role in altering activity at the c-Myb binding site upon MLL binding. The pathway in Figure 4.7b connects PHE612 to a different region of the protein that may be another binding site for other ligands.

Conformation  Pathways  Residues  |  Conformation  Pathways  Residues
1             1         4         |  11            1         2
2             0         0         |  12            9         14
3             13        11        |  13            7         9
4             74        21        |  14            3         5
5             6         8         |  15            3         7
6             2         3         |  16            14        15
7             2         5         |  17            2         3
8             56        40        |  18            42        24
9             113       37        |  19            0         0
10            4         5         |  20            0         0

Table 4.1: Number of pathways and number of involved residues for 20 main-chain conformations of CREB-binding protein (PDB id 2AGH) as computed by the RDPG method.

Using the RDPG method with the same parameter settings as described in Section 4.2, we computed potential allosteric pathways originating from residue PHE612, i.e., all paths in the reduced RDPG starting at this residue. Our method generated 324 distinct pathways in the 20 main-chain conformations of the protein. The total number of residues involved in all pathways is 65, with the maximum number of residues in a single pathway being 16. Table 4.1 lists the number of residues involved in the pathways for each conformation. The number varies from a maximum of 40 in main-chain conformation 8 to a minimum of 0 in main-chain conformations 2, 19, and 20.

Our method identifies a larger number of residues likely to be involved in allosteric pathways than the experiments in [12, 43]. Many of these residues are spatial neighbors of the residues identified in [12, 43]. Several of the computed pathways are very similar to, and involve almost the same residues as, the pathways identified in [12, 43]. All the residues determined in [12, 43] to be involved in allosteric deformation were also identified by the RDPG method, except ALA654 and TYR658. Here, it should be noted that ALA and GLY residues are small residues that do not have rotamers in the library used by our method [58]. So, they cannot be identified by the method as participating in allosteric interaction.

Figure 4.8 shows allosteric pathways determined from the reduced RDPG connecting PHE612 to TYR650. The pathways extracted from the reduced RDPG suggest that both Helix2 and Helix3 participate in the propagation of the deformation. Helix2 was not identified by NMR experiments, but there is some evidence reported in [8] that it actually participates in the deformation. Moreover, recent work emphasizes the importance of multiple pathways in allosteric deformation [28]. Figure 4.9 shows allosteric pathways to the other potential binding site; they are similar to the pathway hypothesized in [43].

Figure 4.8: Allosteric pathways (in red arrows) determined from the reduced RDPG propagate deformation from residue PHE612 to residue TYR650 in the CREB-binding protein (PDB id 2AGH). (a) The pathway is similar to the pathway hypothesized in [12] and shows the propagation of deformation through Helix3; (b)-(c) the pathways show that the propagation of deformation can also take place through Helix2, as observed in [8]. Allosteric residues that form the pathways are shown in yellow color. The pathways are made up of residues: (a) PHE612, ILE611, ILE660, LYS659, GLU655, HIS651, TYR650; (b) PHE612, ILE611, LEU607, LYS606, LEU603, TYR650; (c) PHE612, LEU628, ILE611, LEU607, LYS606, LEU603, TYR650.

Figure 4.9: Allosteric pathways (in red arrows) determined from the reduced RDPG propagate deformation from residue PHE612 to residues in a novel binding site in the CREB-binding protein (PDB id 2AGH). Pathways are similar to the pathways hypothesized in [43]. Allosteric residues that form the pathways are shown in yellow color. The pathways are made up of residues: (a) PHE612, THR614, LEU620, LYS621; (b) PHE612, LEU628, ARG624, ASP622; (c) PHE612, LEU628, ASN627, ARG623, GLU626.

4.3.2 PDZ domain family protein

PDZ domain family proteins mediate protein-protein interactions and are involved in signal transduction pathways. Binding of the RA-GEF2 peptide leads to deformation that propagates through the protein. Structural changes are observed at two distal surfaces away from the peptide binding site, and these changes are thought to be important for facilitating protein-protein interactions. Here, a PDZ domain family protein called human phosphatase, with PDB id 1D5G, is studied. It is 96 residues long. The PDB structure contains 20 distinct main-chain conformations of the protein obtained through NMR experiment. The residues that are involved in allosteric communication were determined by the thermodynamic mutant analysis method in [75] and by the NMR relaxation-dispersion method in [39]. Allosteric pathways connecting the residues were hypothesized but, as in the case of the CREB-binding protein, the pathways were not explicitly mentioned. Figures 4.10a and 4.10b show the residues (in yellow color) that were determined by the methods described in [39] and [75], respectively. The red dotted arrows show the approximate allosteric pathways that are thought to connect the residues. The pathways propagate deformation that starts at VAL26 (Figure 4.10a) and HIS71 (Figure 4.10b) and deform residues at two distal surfaces of the protein.

Figure 4.10: Allosteric studies on PDZ domain family protein. Allosteric residues (in yellow color) are determined by (a) NMR relaxation-dispersion [39] and (b) thermodynamic mutant analysis [75]. Deformation propagates from residues VAL26 and HIS71 (in pink color) in (a) and (b), respectively. The red arrows show approximate allosteric pathways connecting allosteric residues. Residues determined from RDPG do not contain ALA39, ALA46, ILE52, and ALA69 (in red circles).

Allosteric pathways starting at residues VAL26, HIS71, and ILE20, which are close to the RA-GEF2 binding site, are determined from the reduced RDPG. Our method generated 698 distinct pathways in the 20 main-chain conformations of the protein. The total number of residues involved in all pathways is 46, with the maximum number of residues in a single pathway being 13. Table 4.2 lists the number of residues involved in the pathways for each conformation. The number varies from a maximum of 31 in main-chain conformation 13 to a minimum of 5 in main-chain conformations 16 and 20. The RDPG method identified all residues identified in [39, 75], except ALA39, ALA46, ILE52, and ALA69. As mentioned before, ALA residues cannot be determined from the RDPG.

Conformation  Pathways  Residues  |  Conformation  Pathways  Residues
1             6         11        |  11            14        13
2             10        13        |  12            29        25
3             10        13        |  13            20        31
4             20        15        |  14            13        21
5             18        12        |  15            3         8
6             22        17        |  16            4         5
7             51        24        |  17            21        13
8             20        11        |  18            9         10
9             4         6         |  19            8         9
10            31        18        |  20            4         5

Table 4.2: Number of pathways and number of involved residues for 20 main-chain conformations of PDZ domain family protein (PDB id 1D5G) as computed by the RDPG method.

Figure 4.10a shows pathways hypothesized from the NMR relaxation-dispersion experiment [39]. Pathways from the relaxation-dispersion experiment are shown to start at residue VAL26, but the pathways (Figure 4.11) determined from the reduced RDPG start at residue ILE20. However, the pathways determined from the reduced RDPG propagate deformation to two distal surfaces, which is important for the protein’s function. Using thermodynamic mutant analysis and a sequence-based computational method, it is demonstrated in [75] that deformation from residue HIS71 propagates to two residues, VAL85 and ALA46, located on opposite sides of the binding site. A pathway (Figure 4.12a) determined from the reduced RDPG connects residue VAL85 to residue HIS71. Another pathway (Figure 4.12b) connects residue SER48, which is spatially close to ALA46, to residue HIS71.

Figure 4.11: Allosteric pathways (in red arrows) determined from the reduced RDPG propagate deformation from residue ILE20 to residues on two distal surfaces in the PDZ domain family protein (PDB id 1D5G). Allosteric residues that form the pathways are shown in yellow color. The pathways are made up of residues: (a) ILE20, VAL40; (b) ILE20, LEU18, VAL85; (c) ILE20, LEU18, LEU78, THR81; (d) ILE20, LEU18, LEU78, VAL61, VAL64 and ILE20, LEU18, LEU78, VAL61, VAL66.

4.4 Conclusion

Figure 4.12: Allosteric pathways (in red arrows) determined from the reduced RDPG propagate deformation from residue HIS71 to residues on two distal surfaces in the PDZ domain family protein (PDB id 1D5G). Allosteric residues that form the pathways are shown in yellow color. The pathways are made up of residues: (a) HIS71, VAL75, ARG79, LEU18, VAL85; (b) HIS71, VAL75, ARG79, LEU78, LEU18, SER17, LYS13, ASN16, ASP15, SER48.

The Residue Deformation Propagation Graph (RDPG) method is based on the idea that stress propagation is caused by volume exclusion among spatially close residues. Keeping in mind that there is no known ground truth for allosteric pathways and that different experimental methods and settings provide different results, the allosteric pathways determined from RDPG are in good agreement with various recent studies. The advantages of RDPG are that (a) bound and unbound conformations are not required to determine allosteric pathways and (b) arcs are computed quickly compared to Molecular Dynamics and Monte Carlo sampling-based graph methods.

But there are some limitations as well. Only a coarse approximation of side-chain motion, in the form of rotamers, is currently utilized. This does not allow determination of allosteric pathways involving small deformations of the side-chains. This limitation can be alleviated by sampling around rotamers. In addition, large main-chain deformation has not been considered so far. Only small main-chain deformation can be represented by sampling main-chain deformations (e.g., using the deformation sampling algorithm presented in Chapter 2) and/or by using alternate main-chain conformations determined by experimental techniques, like NMR or X-ray crystallography (using the sample-select algorithm presented in Chapter 3). The RDPG does not contain nodes corresponding to ALA and GLY residues because they do not have rotamers in the library used by our current software.

In spite of these limitations, the results obtained with the CREB-binding protein and the PDZ domain family protein are encouraging. Our method can complement experimental methods, such as NMR relaxation-dispersion and thermodynamic mutant analysis: although these methods can determine residues involved in allosteric deformation, direct observation of allosteric pathways connecting them is challenging. The RDPG method can quickly predict candidate allosteric pathways that can then be utilized by more focused experimental methods. Effective determination of allosteric pathways may lead to better understanding of allosteric communication between binding sites. It may hold the key for success in efforts to develop pharmaceutical drugs with lowered side-effects and increased effectiveness.

Chapter 5

Conclusion

5.1 Summary

The important role that proteins play in living organisms cannot be overstated. Proteins are the workhorses that carry out various physiological functions in living organisms. Understanding how proteins function remains a major challenge. The folded structure of a protein is critical to its function. There are millions of known protein sequences, but only a few structures (approximately 61,000) have been determined. That in itself is a major roadblock to obtaining a proper understanding of the functioning of proteins. Even if the structure is known, studying a protein’s function remains very challenging due to the dynamic and flexible nature of its folded state. The folded state of the protein is believed to exist as an ensemble of conformations that inter-convert depending on physiological conditions. For example, according to the conformation selection model, the ensemble of conformations shifts equilibrium towards conformations that are likely to bind molecules (ligands) involved in the protein’s function. The structural heterogeneity in folded proteins arises from their dynamic nature. It can range from high-frequency, harmonic vibrations to low-frequency, anharmonic deformations involving correlated motions of atoms. It is the latter that has been the focus of this dissertation because the correlated motions play a more significant role in the functioning of proteins.

This dissertation develops computational methods that analyze and solve problems related to structural heterogeneity in folded proteins. To solve these problems, this dissertation adopts a kino-geometric approach. The analysis of structural heterogeneity requires exploration of the protein’s conformation space. Because this space is high-dimensional, robotics-inspired algorithms are developed that model the protein as a kinematic linkage and efficiently sample its conformation space. The samples thus obtained are then carefully selected to solve specific biological problems. Computational methods have been developed for solving three complex problems using this sample-and-select approach. This dissertation demonstrates that a kino-geometric approach, though devoid of detailed considerations of the protein’s energy landscape, provides a powerful way to model structural heterogeneity.

5.2 Problem-Specific Contributions

The problem-specific contributions of this dissertation are as follows:

1) Loop sampling (Chapter 2): Flexible loops of proteins play an important role in ligand binding and thereby protein functions. For example, calcium-binding loops are crucial for the proper functioning of the nervous system. Modeling the structural heterogeneity in loops due to their flexibility requires exploration of the closed, clash-free conformation space of the loops. The simultaneous satisfaction of two constraints required for solving this problem makes it difficult. Along with a seed sampling algorithm (developed by Peggy Yao) that employs a prioritized constraint satisfaction approach, a deformation sampling algorithm provides an effective tool to explore the feasible conformation space of protein loops. The sampling algorithms are computationally efficient and, with the help of a functional site prediction software, are able to ascertain functionally active conformations in a calcium-binding loop. The algorithm is implemented in a software toolkit called loopTK available at http://simtk.org/home/looptk.

2) Interpretation of electron density maps (Chapter 3): EDMs obtained from X-ray crystallography experiments reflect atom positions in a protein structure. Since a protein crystal contains multiple non-identical conformations of the protein, the EDM also reflects structural heterogeneity in the folded state of the protein. Single-conformer models explain the EDM by a single protein conformation along with a measure of harmonic vibrations around atoms called the temperature factor. This dissertation develops SAMPLE-SELECT, an algorithm that models structural heterogeneity present in a protein crystal due to correlated, anharmonic motion of the protein. The algorithm automatically computes an ensemble (collection) of conformations and associated occupancies that near-optimally fits the EDM. It first efficiently samples protein conformations and then selects the ensemble of conformations that best explains the EDM using convex optimization. The algorithm is validated in separate experiments for modeling side-chain driven and main-chain driven heterogeneity. In the case of side-chain driven heterogeneity, the algorithm computes an ensemble of conformations that better fits the EDM at high resolutions as compared to a single-conformer model. At poorer resolutions, where the boundary between structural heterogeneity and uncertainty in the data dissolves, the ensemble does not improve the fit. In the case of main-chain driven heterogeneity, our algorithm models alternate main-chain conformations in a difficult-to-interpret EDM. The algorithm, however, has limitations due to the cumulative sampling errors along the main-chain. The SAMPLE-SELECT algorithm applied to side-chain driven heterogeneity is implemented in a software tool called qFit that is available at http://smb.slac.stanford.edu/qFitServer.

3) Determination of allosteric pathways (Chapter 4): Allosteric propagation of deformation from one binding site of the protein to another binding site alters activity at the latter. Understanding the mechanism of deformation propagation may hold the key to developing therapeutics with fewer side effects. This dissertation develops an algorithm that computes allosteric pathways that propagate deformations from an allosteric-effector binding site to another ligand binding site. The algorithm is based on a novel concept, the residue deformation propagation graph (RDPG), and mainly focuses on propagation of deformation due to structural heterogeneity in side-chain conformations. The nodes of the graph are rotamers (from a rotamer library) that sample the low-energy conformations of the side-chains. An arc that connects two nodes in the graph represents a propagation of deformation from one side-chain to the other. Paths in the graph are computed using graph traversal. The residues corresponding to the rotamers in the paths then form candidate allosteric pathways. In the experiments performed on two allosteric proteins, CREB-binding protein and a PDZ domain family protein, the RDPG method computes allosteric pathways that are consistent with other studies.
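The selection step of SAMPLE-SELECT summarized in item 2 can be illustrated with a small sketch. It assumes that each sampled conformation has already been converted to a density vector on a common grid, and it solves a nonnegative least-squares problem by projected gradient descent. This is a stand-in convex solver chosen for brevity, not necessarily the optimization used in qFit; the function name `fit_occupancies` and the toy density values are hypothetical.

```python
def fit_occupancies(candidates, observed, steps=2000):
    """Minimize ||sum_j w_j * candidates[j] - observed||^2 subject to w_j >= 0.

    candidates: list of density vectors, one per sampled conformation (assumed input).
    observed:   observed density vector sampled on the same grid.
    Returns one nonnegative weight (occupancy) per candidate.
    """
    n, m = len(candidates), len(observed)
    # The gradient is Lipschitz with constant 2 * lambda_max(A^T A) <= 2 * ||A||_F^2,
    # so this step size is small enough for the iteration to converge.
    frob2 = sum(v * v for c in candidates for v in c)
    step = 1.0 / (2.0 * frob2)
    w = [0.0] * n
    for _ in range(steps):
        # Residual between the modeled and the observed density.
        r = [sum(w[j] * candidates[j][i] for j in range(n)) - observed[i]
             for i in range(m)]
        # Gradient of the squared error with respect to each weight.
        grad = [2.0 * sum(candidates[j][i] * r[i] for i in range(m))
                for j in range(n)]
        # Gradient step followed by projection onto the constraint w >= 0.
        w = [max(0.0, w[j] - step * grad[j]) for j in range(n)]
    return w


# Toy data: three candidate conformers on a four-point grid. The observed
# density is built as 0.6 * conformer 0 + 0.4 * conformer 1.
conformers = [[1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]]
density = [0.6, 1.0, 0.4, 0.0]
occupancies = fit_occupancies(conformers, density)
```

In this toy case the third conformer receives an occupancy of essentially zero and drops out of the ensemble, which mirrors the "select" part of the algorithm: only conformations that help explain the density retain weight.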
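The graph traversal described in item 3 can be sketched as follows. The sketch assumes the RDPG is already available as an adjacency mapping whose nodes are `(residue, rotamer_index)` pairs; how arcs are derived is the subject of Chapter 4 and is not reproduced here. Breadth-first search is used as one possible traversal, and the residue labels and the function name `residue_pathway` are hypothetical.

```python
from collections import deque


def residue_pathway(arcs, source_residue, target_residue):
    """Return the residues along one shortest rotamer path, or None.

    arcs maps a node (residue, rotamer_index) to the list of nodes it
    propagates deformation to. Only rotamers that appear as keys of
    arcs are considered as starting nodes.
    """
    start_nodes = [node for node in arcs if node[0] == source_residue]
    parent = {node: None for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if node[0] == target_residue:
            # Walk back to a start node, then report residues in forward order.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            path.reverse()
            residues = []
            for residue, _ in path:
                if not residues or residues[-1] != residue:
                    residues.append(residue)
            return residues
        for nxt in arcs.get(node, ()):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical toy graph: two rotamers of residue A10, one of which
# propagates through B22 to D40.
toy_arcs = {
    ("A10", 0): [("B22", 1)],
    ("A10", 1): [("C5", 0)],
    ("B22", 1): [("D40", 2)],
    ("C5", 0): [],
    ("D40", 2): [],
}
pathway = residue_pathway(toy_arcs, "A10", "D40")
no_pathway = residue_pathway(toy_arcs, "C5", "D40")
```

The search returns a single shortest path in terms of arcs; enumerating all paths, or weighting arcs, would require a different traversal.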

5.3 Future Directions

Following are some of the directions for future work:

1) In Chapter 2, we developed algorithms for sampling closed, clash-free conformations of a protein loop. Developing sampling algorithms for larger proteins and protein complexes is an exciting direction in which this work can be extended. Protein complexes such as chaperonins play an important role in protein folding, so understanding their discrete meta-stable states is important. However, due to the large size of these molecules, developing sampling algorithms will require an efficient representation of kinematics and efficient collision avoidance schemes. One option is a kinematic representation in which secondary structures are treated as rigid bodies interconnected by flexible loops.

2) The RDPG method developed in Chapter 4 can be improved by modeling propagation of deformation due to main-chain deformations. Currently, the method considers only small main-chain deviations that were obtained from multiple NMR structures. In the future, the deformation sampling algorithm (see Chapter 2) and alternate main-chain conformations modeled from X-ray data (see Chapter 3) can be used to include main-chain deformations in the RDPG. Our current RDPG method computes allosteric pathways starting at a specified residue. An exciting future direction for research is the development of methods to identify and localize new allosteric binding sites. One way to achieve this is to compute all the pathways that lead to residues in the functional binding site. The source residues of these pathways can then be clustered to determine whether they form a putative allosteric binding site.

3) A distant future goal is to determine how deformations propagate in protein-protein interaction networks. Our current RDPG method explores intra-protein propagation of deformation. Determining the global propagation of deformation throughout the entire protein-protein network might provide a detailed understanding of such networks. This will take our work in a direction that will involve inputs from systems biology.