Iowa State University Capstones, Theses and Graduate Theses and Dissertations Dissertations

2012 On coarse-grained Normal Mode Analysis and refined aG ussian Network Model for protein structure fluctuations Jun-Koo Park Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd Part of the Bioinformatics Commons, and the Mathematics Commons

Recommended Citation Park, Jun-Koo, "On coarse-grained Normal Mode Analysis and refined Gaussian Network Model for protein structure fluctuations" (2012). Graduate Theses and Dissertations. 12668. https://lib.dr.iastate.edu/etd/12668

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected]. On coarse-grained Normal Mode Analysis and refined Gaussian Network Model

for protein structure fluctuations

by

Jun-Koo Park

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Applied Mathematics

Program of Study Committee:

Zhijun Wu, Major Professor

Jennifer Davidson

Karin Dorman

L. Steven Hou

Robert L. Jernigan

Iowa State University

Ames, Iowa

2012

Copyright c Jun-Koo Park, 2012. All rights reserved. ii

To Jiyoung iii

TABLE OF CONTENTS

LIST OF TABLES ...... v

LIST OF FIGURES ...... vi

ABSTRACT ...... vii

CHAPTER 1. INTRODUCTION ...... 1

CHAPTER 2. NORMAL MODE ANALYSIS AND GAUSSIAN NETWORK

MODEL ...... 6

2.1 Introduction ...... 6

2.2 Normal Mode Analysis ...... 7

2.3 Gaussian Network Model ...... 18

2.4 Concluding Remarks ...... 27

CHAPTER 3. EVALUATION OF GAUSSIAN NETWORK MODEL . . . . 30

3.1 Introduction ...... 30

3.2 Preparation for evaluation ...... 31

3.3 Implementation of algorithm ...... 31

3.4 Evaluation Results ...... 32

CHAPTER 4. COARSE-GRAINED NORMAL MODE ANALYSIS . . . . . 34

4.1 Introduction ...... 34

4.2 Issues of NMA and GNM ...... 35

4.3 New Approach ...... 36

4.4 Numerical Results ...... 37

4.5 Demonstrations ...... 44 iv

CHAPTER 5. REFINED GNM: STATISTICAL RESULTS FOR FORCE

CONSTANTS ...... 47

5.1 Introduction ...... 47

5.2 Refined GNM: Non-homogeneous GNM ...... 48

5.3 Test Results ...... 52

CHAPTER 6. CONCLUSIONS AND FUTURE WORK ...... 58

6.1 Summary ...... 58

6.2 Future Directions ...... 60

BIBLIOGRAPHY ...... 62

ACKNOWLEDGEMENTS ...... 72 v

LIST OF TABLES

Table 4.1 Correlation coefficients for small structuresa ...... 39

Table 4.2 Correlation coefficients for medium structuresa ...... 40

Table 4.3 Correlation coefficients for large structuresa ...... 41

Table 4.4 Comparison of correlation coefficients for small structures∗ ...... 42

Table 4.5 Comparison of correlation coefficients for medium structures∗ ..... 43

Table 4.6 Comparison of correlation coefficients for large structures∗ ...... 44

Table 5.1 B-factor correlations of nhGNM for small structuresa ...... 53

Table 5.2 B-factor correlations of nhGNM for medium structuresa ...... 54

Table 5.3 B-factor correlations of nhGNM for large structuresa ...... 55 vi

LIST OF FIGURES

Figure 2.2 Example elastic network model for protein 1HEL ...... 19

Figure 2.3 An example contact matrix ...... 19

Figure 3.1 Distribution of correlations for protein structures using GNM . . . . . 32

Figure 4.1 Comparison of residue mean-square-fluctuations for protein 1HJE . . . 44

Figure 4.2 Comparison of residue mean-square-fluctuations for protein 2BF9 . . . 45

Figure 4.3 Comparison of residue mean-square-fluctuations for protein 2HQK . . 45

Figure 5.1 Comparison of 4 different residue pairs ...... 50

Figure 5.2 Comparison of all residue pairs ...... 51

Figure 5.3 An example nhGNM contact matrix ...... 52

Figure 5.4 Distribution of correlations from GNM and nhGNM ...... 56

Figure 5.5 Distribution of correlation improvement ...... 57 vii

ABSTRACT

The functions of biological structures are related to the dynamics of the structures, es- pecially various kinds of large-amplitude molecular motions. With some assumptions, those motions can be investigated by Normal Mode Analysis (NMA) and Gaussian Network Model

(GNM). However, despite their contributions to many applications, the relationship between

NMA and GNM requires a further discussion. In this work, we review the Normal Mode Anal- ysis and Gaussian Network Model and address their common applications in structural biology.

We evaluate the GNM, based on how well it predicts the structural fluctuations, compared to experimental data for a large set of protein structures.

Then, we propose several ways of coarse-graining for NMA on protein residue-level struc- tural fluctuations by choosing different approaches to represent the amino acids and the forces between them. Using backbone atoms such as Cα, C, N, and Cβ, single-atom representa- tions are considered. Combinations of some of these atoms are also tested as a representative point for the residue. The force constants between the representative atoms are extracted from the Hessian matrix of the potential energy and used as the force constants between the corresponding residues. The residue mean-square-fluctuations and their correlations with the experimental B-factors are calculated for a large set of proteins. The results are compared with all-atom normal mode analysis and residue-level GNM based on the choice of different kinds of representative backbone atoms. The coarse-grained methods perform more efficiently than all-atom normal mode analysis, and also agree better with the B-factors. B-factor correlations are comparable or better than with those estimated with conventional GNM. The extracted force constants are surveyed for different pairs of residues with different extents of separation in sequence. The statistical averages are used to build a finer-grained GNM here called non- homogeneous GNM, which is able to predict residue-level mean square fluctuations significantly better than conventional GNM for many test cases. 1

CHAPTER 1. INTRODUCTION

The relationship between the structure and dynamics of biological molecules has been stud- ied for decades. Dynamics information can also shed light on the function of biological assem- blies. Among biological structures, proteins consisting of sequences of amino acids carry out many functions. Thus, it is essential to study the structures of proteins to obtain a better idea of their dynamical properties, and furthermore the knowledge of dynamics especially internal motions helps us to understand the functional roles of proteins. More specifically, the lowest frequency molecular motions are known to have a deep relation to the function of proteins.

Normal Mode Analysis (NMA) and Gaussian Network Model (GNM) are important meth- ods for analyzing protein structural fluctuations. In 1959, Alder et al. [59] proposed the general method of studying . Before the NMA and GNM were introduced, molecular dynamics (MD) simulations were the principal method used to understand structural

fluctuations. In the MD simulations, the system of the equations of motion for the structure is solved numerically since the potential energy function is a nonlinear form. However, as the size of biological structures becomes larger, MD simulations consume large amount of compu- tation time. Thus, the application to large-scale molecular assemblies is restricted to relatively short time scales even though computational techniques and processing power have been im- proved significantly. In other words, the utilization of standard molecular dynamics simulation is limited, as the timescales of biologically important are on the order of microseconds. Moreover, due to the limitation of X-ray crystallography, a high resolution structure is not always available and only low-resolution structural information is available. In that situation, atomic molecular dynamics simulations cannot be applied. Furthermore, be- cause of the complex nature of the potential energy surface and the existence of many energy minima, this method was not well-adapted. 2

Therefore an alternative method is needed that does not depend on MD simulations. In

Normal Mode Analysis, a complete analytical solution of the system of equations of motion can be derived with an assumption that the potential energy of the structure can be approxi- mated by using the second derivative, the Hessian matrix around the potential energy minimum and molecular motions are decomposed into vibrational modes. Despite this assumption, low- frequency modes obtained by this method appeared to describe sufficiently the conformational changes that are observed experimentally [37,62,28,76,90]. The solution of the system of equa- tions, the atomic fluctuation vector, can be computed analytically through the singular value decomposition (SVD) of the Hessian matrix. Using the fluctuation vector, the mean square

fluctuations of each atom can be computed. Especially, large amplitude motions, correspond- ing to low frequency modes are especially important to understand the conformational changes which often accompany functional activity. From the formula for the atomic mean square

fluctuations, the large-amplitude fluctuations are computed in a few small singular values of the Hessian matrix. Therefore, the mean square fluctuations can be approximated with only several small singular values. The predicted mean square fluctuations give important informa- tion about the structural fluctuations because the mean square fluctuations are proportional to the B-factors, and the B-factors can be measured experimentally, we can see how well the model predicts the fluctuations by computing correlations between the predicted mean square

fluctuations and experimental B-factors.

In 1950, Goldstein introduced the theory of normal modes in classical mechanics [4,64,79,86].

Since then, normal mode analysis has been widely used in several fields such as mechanical en- gineering, wave theory and optics. It was first applied in the early 1980’s to examine mechanical aspects of small biological systems such as bovine pancreatic trypsin inhibitor (BPTI) [34,60].

Afterwards, research was carried out on larger proteins [73,74], however, the number of possible residues for the analysis was still less than 300. With the progress of X-ray crystallography, formations of larger biological systems have become possible and new methods for NMA were needed to study fluctuations of these larger systems.

A coarse-grained elastic network is one of the methods which enabled analysis of large structures for the structural fluctuations. The coarse-grained models enable microsecond time 3 scale to be reached for small proteins [78]. Because the coarse-grained models are sufficient to reproduce the low frequency normal modes of structures, one can assume that there is no need for an atomic description of the molecule. When describing a protein structure (which is a chain of amino acids), one often approximates the location of each amino acid as the location of its Cα, the first carbon after the carbon that attaches to the functional group or the backbone carbon next to the carbonyl carbon. Thus, in these simulations, typically only the Cα atoms are considered, which considerably reduces the number of atoms necessary for simulation. However, these calculations are still computationally expensive for large biological assemblies. To simulate large and slow conformational rearrangements of large macromolecular assemblies, alternative methods that do not depend on the framework of molecular dynamics simulations need to be employed.

As we discussed above, the Hessian matrix of the potential energy should be obtained before the calculation of the mean square fluctuations. However, the size of the Hessian matrix is large. Therefore, predicted mean square atomic fluctuations may not be accurate because the atomic errors may be accumulated while the potential energy is estimated approximately using the Hessian matrix. The Gaussian Network Model is a residue-level model not requiring the

Hessian matrix of the potential energy function. In the GNM, the fluctuation mechanics of large biological molecules can be modeled as those of elastic networks composed of nodes and springs; the nodes are the amino acids, and the springs are the inter-residue forces. In other words, in this model amino acids (often called residues) are the basic units of representation, and usually the Cα atom is used to represent the residue location. The spatial interactions between nodes are modeled with a uniform harmonic spring. And instead of Hessian matrix, the contact matrix is defined using the distances among the residues with some cutoff distance (usually 7A˚). GNM only requires the singular value decomposition of the residue sized contact matrix, therefore it is very straightforward and computationally inexpensive compared to classical Normal Mode

Analysis. In GNM, the residue fluctuations are assumed isotropic and to follow the Gaussian distributions around the mean position, meaning that the model does not differentiate the X,

Y and Z directions and the residue fluctuations follow the normal distribution around the equilibrium position. In other words, the model does not predict the directions of fluctuations 4 and assumes that the fluctuations are randomly distributed with a single mean value.

Bahar et al. introduced Gaussian Network Model (GNM) in 1997 [29]. The GNM was based on the previous works by Tirion [89] and Go et al. [60]. Go et al. [60] showed that the dynamic structure of the globular protein can be described as a linear combination of high-frequency motions and low-frequency motions of collective variables. In 1996, Tirion [89] showed that a single-parameter potential is sufficient to reproduce the slow dynamics in good detail. From the work of Tirion [89], a single parameter potential energy is enough to derive the residue mean square fluctuations by computing singular value decomposition of contact matrix which only considers inter-residue distances. Therefore, it makes the calculations computationally inexpensive compared to molecular dynamic simulation.

The thesis is organized as follows. We provide a comprehensive review on the Normal

Mode Analysis and Gaussian Network Model in Chapter 2. We provide a brief explanation on the biological background of the methods, and give a brief introduction for each method. We describe the key steps of deriving mean square fluctuations by providing mathematical theory and proofs. We also include a brief description on the relationship between NMA and GNM, physical meanings behind the mathematical equations, and the application of NMA and GNM briefly. Finally, we close Chapter 2 by giving concluding remarks on the theory of normal modes and GNM.

In Chapter 3, we give an extensive evaluations of the GNM for a large set of protein struc- tures. We discuss a preparation step for evaluating the GNM, and provide some comments about the implementation of the algorithm for comparing B-factor correlations between pre- dicted mean square fluctuations and experimental data. We finish Chapter 3 by providing the numerical results and discussion on the comparisons.

In Chapter 4, we introduce some issues on NMA and GNM, and propose a new model so called coarse-grained Normal Mode Analysis (cgNMA), for the reside-mean square fluctuations with the possible choice of some combination of backbone atoms. We compute the residue mean square fluctuations for a large set of protein structures with the different choices of representative atoms for the residue, and compare them with conventional GNM with the same choice of representative atoms as cgNMA. We group the given structures as small, medium, and 5 large structures and compare the predicted mean square fluctuations to experimental B-factors.

We also provide demonstrations for NMA, GNM and cgNMA methods.

In Chapter 5, we present a new Gaussian Network Model, here called non-homogeneous

GNM (nhGNM) based on the statistical results of coarse-grained NMA discussed in Chapter

4. We discuss the test results of nhGNM using different choice of force constants by comparing

B-factor correlations between nhGNM and classical GNM for a large set of protein structures.

In Chapter 6, we summarize and describe the entire thesis work. We finish the thesis by discussing some important issues for future investigation and directions. 6

CHAPTER 2. NORMAL MODE ANALYSIS AND GAUSSIAN NETWORK MODEL

In this chapter, we study Normal Mode Analysis and the Gaussian Network Model. First, we give a brief introduction to the biological background. We provide the important steps of deriving mean square fluctuations by describing mathematical theories and proofs. We also give a brief description of the relationship between NMA and GNM, physical meanings behind the mathematical equations, and the application of NMA and GNM. Finally, we close Chapter

2 by providing concluding remarks on theory of normal modes and GNM.

2.1 Introduction

The study of the structural fluctuation of biological molecules has shed light on the func- tional role of the structure. In other words, the 3D structure determines the biological function of the given structure. Therefore, information about the three dimensional structure gives the understanding of the structure, especially structure-function relations. Normal mode is a powerful tool for analyzing the structural fluctuation for a given structure. A normal mode is a pattern of fluctuations of a system where all components of the system oscillate with a constant phase and fixed frequency. Such an oscillating system has multiple normal modes, therefore the motion of the system is a collection of all its normal modes. For example, from the simple harmonic spring model, a solution of the system can be found, and thus the solution of equation of motion for the given structure also can be obtained by using the Euler-Lagrange equation. This complete and analytical solution of equation of motion give better ideas of structural fluctuation than those an approximated numerical solution can give. For the normal mode analysis, it is necessary to calculate the singular value decomposition of the Hessian ma- 7 trix of the potential energy function. However, since the computational complexity is high for the all-atom model and the energy function is approximated using second derivative Hessian matrix, the estimation on fluctuation may not necessarily always be accurate. As a result, a simplified residue-level model called Gaussian Network Model (GNM) has been proposed in recent years. In this residue model, the Hessian matrix of the energy function need not be cal- culated. Instead, the contact matrix can be defined considering the distances between residues.

Similar to Normal Mode Analysis, the singular value decomposition of the contact matrix can be computed, to give the residue fluctuations for a given structure. The fluctuation calculated using Gaussian Network Model is proved to be more efficient as well as accurate than NMA .

2.2 Normal Mode Analysis

Proteins vibrate around their native state, meaning all the atoms of proteins fluctuate around their equilibrium position, as if all atoms are connected by a network springs. Therefore, structural fluctuations of the protein can be calculated approximately via normal mode analysis as well as molecular dynamics simulations. One of the first application of normal mode analysis to a small protein was done by Brooks and Karplus [95] and Go, Noguti, and Nishikawa [60].

They apply the normal mode analysis to understand the structural fluctuations of bovine pancreatic trypsin inhibitor (BPTI). Later, Levitt, Sander, and Stern [52] performed a detailed analysis on larger proteins including BPTI, Crambin, Ribonuclease, Lysozyme. Ever since, the method of normal mode has been applied extensively for structural fluctuations in protein modeling.

In normal mode analysis, the dynamics of a molecule can be approximately represented as those of a set of simple springs with some forces. In this sense, because the motion of a simple spring can be analytically described, this expression is very useful [65]. From the classical mechanical model, a simple harmonic oscillator of a mass m connected to a spring with the spring constant k, moves according to d2x F = −kx = m , (2.1) dt2 8 where F represents the total force between the two given particles and x represents its distance between two particles.

The solution of above equation is able to describe the dynamics of the particle.

x = α cos(ωt + φ), (2.2)

q k where α and φ are the amplitude and the phase at time t=0, respectively and ω = m the angular frequency associated with the vibrational mode [81].

Remark 2.1. (Euler-Lagrange equation) Let L be a continuously differentiable functional.

Then the Euler-Lagrange equation is given by

∂L(x, x0, t) d ∂L(x, x0, t) − = 0 (2.3) ∂x dt ∂x0

For the general case, based on the theory of classical mechanics, we have following lemma.

Lemma 2.2. Suppose that a biological molecule has n atoms. Let r(t) be the collection of the coordinates of the atoms in the given molecule at time t such that r(t) = {ri(t): i = 1,..., 3n} where r3j−2(t), r3j−1(t), r3j(t) are the x, y, z coordinates of atom j at time t, j = 1, . . . , n.

Let Ep(r) be the potential energy function. Then the molecular motion can be described as a collection of movements of the atoms in the molecule, as given in the following system of equations of motion:

00 Mr = −∇Ep(r) (2.4) where M is a diagonal matrix with the diagonal element mii being the mass of atom i. [56].

Proof. Let the Lagrangian of the physical system for the molecule be defined as

1 L(r, r0, t) = (r0)T M(r0) − E (r). 2 p Then ∂L ∂L = −∇E (r), = Mr0, ∂r p ∂r0 9 and d  ∂L  = Mr00. dt ∂r0

Based on the Euler-Lagrange equation,

d  ∂L  ∂L = , dt ∂r0 ∂r

we then have

00 Mr = −∇Ep(r).

Because a large biological molecules is composed of many atoms, the potential energy of a biological molecule is not simple, but usually given in a complicated nonlinear form. Thus, the system of equation (2.4) cannot be solved analytically. However, if one considers the motion near the equilibrium position, the potential energy can be approximated by a simple form, allowing the system of equation (2.4) to be solved numerically. While many methods have been developed for solving the system, they are all computationally expensive even for the calculation of very short time period trajectories (say in nano-seconds), because the methods have to use very small time steps (in femto-seconds) to match the fastest atomic motions and maintain the accuracy of the calculation [96, 14].

Suppose that we have a molecule with n atoms, and let (xi, yi, zi) be the coordinate of

0 atom i. Then the coordinates of all atoms are given by r = (x1, y1, z1, ··· , zn). Let r be the equilibrium configuration of the molecule which means that at r0 the potential energy is minimized. Then a Taylor expansion of the potential energy function Ep(r) around a minimum r0 gives:

2 0 X ∂Ep 0 1 X ∂ Ep 0 0 Ep(r) = Ep(r ) + (ri − ri ) + (ri − ri )(rj − rj ) + ··· ∂r 0 2! ∂r ∂r 0 i i r=r ij i j r=r 10

Since the r0 is a minimum of the energy function,

∂E p (r0) = 0. ∂ri In addition, without loss of generality, the potential energy in connection with this structure can be resealed so that

0 Ep(r ) = 0.

Finally, if one considers sufficiently small displacements, terms greater than or equal to the second order may be neglected. Therefore, the approximate potential energy function is given by the following harmonic approximation.

2 1 X ∂ Ep 0 0 Ep(r) ≈ (ri − ri )(rj − rj ). (2.5) 2 ∂r ∂r 0 ij i j r=r Let ∆r = r − r0. The function can also be written as

1 E (r) ≈ (∆r)T [∇2E (r0)](∆r). (2.6) p 2 p

By using this approximation, the system of equations (2.4) becomes

1  M∆r00 = Mr00 = −∇E (r) = −∇ (∆r)T [∇2E (r0)](∆r) = −∇2E (r0)∆r. (2.7) p 2 p p

That is,

00 2 0 M∆r = −∇ Ep(r )∆r. (2.8)

Comparing this equation (2.8) to equation (2.1), we see that the protein is fluctuating as if the protein is governed by a harmonic potential energy around the equilibrium position, and the force on the particles becomes approximately linear. Therefore, the equation can be

2 0 solved analytically through the singular value decomposition of the Hessian matrix, ∇ Ep(r ). Suppose that we have a protein with n atoms. Then the dimension of the Hessian matrix,

H is 3n by 3n and we are able to find 3n sets of eigenvalue, eigenvector solutions. Among those solutions, only 3n−6 normal modes are meaningful, because the first six smallest normal modes have eigenvalues equal to 0 and correspond to 3 translational and 3 rotational motions of the whole system [14].

The detailed formula for obtaining the normal modes are provided in the following lemma. 11

Lemma 2.3. Suppose that the singular-value-decomposition of the mass-weighted Hessian

−1 2 0 T matrix, H = M ∇ Ep(r ), is given by UΛU . Then, the solution of the system of equations (2.8) is given by the following functions:

3n X p ∆ri(t) = Uijαj cos(ωjt + βj), ωj = Λjj, i = 1,..., 3n, (2.9) j=1 where αj and βj can be determined by the given conditions of the system [52,56].

−1 2 0 Proof. Let H = M ∇ Ep(r ). Then,

00 2 0 M∆r = −∇ Ep(r )∆r = −MH∆r, and 3n 00 X ∆ri = − Hij∆rj. j=1 P3n Let ∆ri(t) = j=1 Uijαj cos(ωjt + βj). Then,

3n 00 X 2 ∆ri = − Uijαjωj cos(ωjt + βj), (2.10) j=1 and

3n 3n 3n X X X − Hij∆rj = − Hij Ujkαk cos(ωkt + βk) (2.11) j=1 j=1 k=1 3n 3n X X = − [ HijUjk]αk cos(ωkt + βk) (2.12) k=1 j=1 3n X = − UikΛkkαk cos(ωkt + βk), (2.13) k=1 by (2.10) and (2.13) proving that

3n X p ∆ri(t) = Uijαj cos(ωjt + βj), ωj = Λjj i = 1,..., 3n j=1

Note that 3n X p ∆ri(t) = Uijαj cos(ωjt + βj), ωj = Λjj i = 1,..., 3n (2.14) j=1 12

is a solution to the system of equations (2.8), where αj is called the amplitude, ωj the angular frequency, and βj the phase of the jth normal-mode of motion. Therefore, based on the above Lemma 2.3, because the atomic fluctuation of a protein, (2.9) can be expressed as the sum of a simple oscillator, the dynamics of the system can be described as a linear combination of independent normal mode oscillators. The calculated fluctuations show the structural flexibilities by adding over the significant modes. They give the beneficial dynamical properties of the given protein. Now let qj = αj cos(ωjt + βj). Then,

3n X ∆ri(t) = Uijqj, ∆r = Uq. j=1 Also,

1 1 1 1 E (r) =∼ (∆r)T [∇2E (r0)](∆r) = (∆r)T H(∆r) = (Uq)T UΛU T (Uq) = qT Λq p 2 p 2 2 2 3n 3n 1 X 1 X => E (r) =∼ Λ q2 = ω2q2. p 2 jj j 2 j j j=1 j=1 Remark 2.4. For any constant ω > 0 and β,

1 Z n 1 lim cos2(ωt + β)dt = hcos(ωt + β), cos(ωt + β)i = n→∞ n 0 2 Then, the time-averaged potential energy for each mode is:

1 1 1 ω2 hq , q i = ω2α2 hcos(ω t + β ), cos(ω t + β )i = ω2α2. (2.15) 2 j j j 2 j j j j j j 4 j j

Remark 2.5. Based on classical dynamics, each atom will have a time-averaged potential

1 energy Epta(ri) of 2 kBT , where kB is Boltzmann constant and T is absolute temperature [86], i.e. 1 E (r ) = k T pta i 2 B

Since at thermal equilibrium, the averaged potential energy should be equal to the kinetic energy kBT/2 for each mode, i.e.

1 1 ω2 hq , q i = k T (2.16) 2 j j j 2 B then

2 2kBT −1 αj = 2 = 2kBT Λjj . (2.17) ωj 13

Once the solution ∆r(t) for the system of equations (2.8) is obtained, the average fluctuation of r(t) from its equilibrium position r0(t) can be calculated using a simple equation, as given

2 th in the following theorem. Here, ωj = Λjj is called the j mode of the motion. Based on the above solution, the larger value the ωj has, the faster the corresponding mode is [73].

Remark 2.6. If the system is in the thermal equilibrium, by (2.16) we have

kBT hqj, qji = 2 (2.18) ωj

Remark 2.7. If ωi 6= ωj, then

hcos(ωit + βi), cos(ωjt + βj)i = 0

Atomic mean square fluctuations has a linear relationship with the experimentally detected average atomic fluctuations. Thus, it is important to calculate the atomic mean square fluc- tuations to understand how well the model predicts the dynamics of protein structures by comparing the experimental data.

Theorem 2.8. Let the solution to the system of equations (2.8) be given by

P3n 2 2 −1 T ∆ri(t) = j=1 Uijαj cos(ωjt + βj), ωj = Λjj, αj = 2kBT Λjj , i = 1,..., 3n. Let UΛU be the −1 2 0 singular-value-decomposition of the mass-weighted Hessian matrix H = M ∇ Ep(r ). Then, the mean-square-fluctuation of the ith mode motion of the molecule is [52]

3n X −1 h∆ri, ∆rii = kBT UijΛjj Uij, i = 1,..., 3n. (2.19) j=1 14

Proof.

3n 3n X X h∆ri, ∆rii = < Uijαj cos(ωjt + βj), Uijαj cos(ωjt + βj) > j=1 j=1 3n X = hUijαj cos(ωjt + βj),Uijαj cos(ωjt + βj)i j=1 3n X 2 = Uijαj Uij hcos(ωjt + βj), cos(ωjt + βj)i j=1 3n 1 X = U α2U 2 ij j ij j=1 3n 1 X = U 2k T Λ−1U 2 ij B jj ij j=1 3n X −1 = kBT UijΛjj Uij j=1

If we use vj(t) to represent the position vector of atom j at time t,

T then vj(t) = (r3j−2(t), r3j−1(t), r3j(t)) , and the mean-square-fluctuation of atom j can be calculated as

h∆vj, ∆vji = h∆r3j−2, ∆r3j−2i + h∆r3j−1, ∆r3j−1i + h∆r3j, ∆r3ji . (2.20)

Note that 3n 3n 2 X X Uij h∆v , ∆v i = 3k T U [Λ−1] U = k T (2.21) i i B ij jj ij B ω2 j=1 j=1 j

If the singular value Λjj is small, then the frequency ωj is also small, and therefore the corre- sponding mode is slow. Thus, the smaller the singular value is, the slower the corresponding mode; and the slower a mode is, the larger mean square fluctuation it makes as we see in equation (2.21).

Also, the atomic mean square fluctuation has a connection with the experimentally detected average atomic fluctuation, so called the temperature factor, Debye-Waller factor or B-factor in

X-ray crystallography, and the order parameter in NMR. Especially, using the above equation 15

(2.21), the B-factor, Bi of each atom is given by

3n 8π2 8π2 8π2 X B = (v − v0)2 = h∆v , ∆v i = k T U [Λ−1] U i 3 i i 3 i i 3 B ij jj ij j=1 (2.22) 2 3n 2 8π X Uij = k T 3 B ω2 j=1 j

From the above equation (2.22), we observe that the B-factors are proportional to the atomic mean square fluctuations and thus inversely proportional to all the singular values Λjj (or

2 squared frequencies ωj ). Therefore, the largest contributions to the atomic displacements come 2 from the lowest frequency normal modes (small Λjj or ωj ). In addition, the singular vectors corresponding to the lowest frequency normal modes represent the most globally distributed or collective motions. For these reasons, NMA studies usually focus on only a few low frequency normal modes, and in real applications, only several terms in (2.21) corresponding to the smallest singular values are used for the evaluation of the atomic mean-square-fluctuations [90].

Normal Mode Computations

When the theory of NMA is applied to biological molecules, energy minimization is required

first to ensure that the structure is indeed at a minimum of the potential energy function. This minimization is followed by the diagonalization of the Hessian matrix, the second derivatives of the potential energy with respect to the Cartesian coordinates. The size of this matrix increases as the square of the number of atoms. Generally, the diagonalization process is com- putationally expensive. Since the 1980’s, much research has focused on reducing the number of degrees of freedom. The first attempt used the dihedral angle space, ignoring the other degrees of freedom [52,60]. This approach reduces the degrees of freedom more than ten times, but the Hessian matrix size still increases as the square of the number of dihedral angles. Elastic

Network Model (ENM) methods neglect the coupling between the retained degrees of freedom and those that were ignored, although such a coupling may be essential to describe the confor- mational change. Thus, it is important to consider other methods that take into account all the degrees of freedom. One way is using iterative methods to diagonalize the Hessian matrix

[70,84,10]. There are three main iterative methods. One is based on the Rayleigh quotient, 16 another is a perturbation method and the last is a diagonalization in a mixed basis (DIMB) method [84,61,1]. The DIMB method was implemented in the CHARMM [30]. Moreover,

Tama et.al proposed the rotation translation block (RTB) methods in which a set of residues was regarded as a rigid block having six translation-rotation degrees of freedom [23]. This approach reduces drastically the size of the Hessian matrix and thus the computational time.

In conclusion, computational techniques such as DIMB or RTB enable NMA calculations in a very short amount of time for large biological molecules.

Potential energy functions used for NMA

When NMA is applied to biological molecules, the quality of the normal modes usually de- pends on reaching a true minimum of the potential energy, but for large biological molecules, the minimization procedure discussed above is not trivial. Generally the minimization process needs to be performed until it is sufficient to ensure that real frequencies for all modes are not connected with translation or rotation of the entire molecule.

Tirion proposed a simplified model in 1996 which can be used to overcome the major issues regarding the minimization procedure [89]. In this simplified model, the biological molecule can be described as a three-dimensional elastic network on which the probability distribution of atomic fluctuation is normal in an experimentally determined structure. One of the models which satisfies this condition is the Gaussian Network Model [1]. In this model, because the energy function is at the minimum, it is possible to eliminate the need for minimization prior to NMA. As a result, NMA can be applied directly on crystal coordinates of NMR structures.

Hence, it is now possible to determine whether two crystal forms of a protein, open form and closed form, are interconvertible using the slow modes as coordinates [89]. Those procedures reduce much of the computational expense by eliminating the need to minimize and make

NMA an ideal candidate for study of conformational change pathways between two states

[6,21,40,48,68,69,72,74]. For example, the direction of low frequency normal modes can be used to deform the structure from its original conformation to other known conformations. Due to the nonlinearity of the conformational change, one needs to deform the structure iteratively so that it fits optimally. This concept was introduced in the normal mode flexible fitting (NMFF) 17 method [46,75]. Those iteration procedure may require hundreds of NMA. Such deformations would cause the structure to move from its minimum, and therefore a minimization process would be necessary to perform NMA on the deformed structure. However, with elastic network, one can directly apply NMA without additional minimization.

More importantly, the elastic network model can be used for protein models without atomic detail. Using the coarse-grained representation of a biological molecule reduces the size of the

Hessian matrix. Thus, the model improves the computational efficiency dramatically, which is essential for the study of large biological molecules [13,25,26].

Normal modes are properties of the shape of biomolecules

As advanced models and computational techniques developed in many studies, the role of shape and form has become clear. From all-atom models to multi-resolution elastic network models, these advances have shown the connection between the shape and dynamics of a biological molecule. In fact, the elastic network model can be used to express the slow dynamical proper- ties of proteins with high degree of accord [24] with experimental data. Around high frequencies the match breaks down. However, many studies show that the collective motions found in the low frequency normal modes effectively describe biologically relevant conformational changes.

The good agreement between experimental data and the modes constructed from these meth- ods suggests that low frequency normal modes are a predominant property of the shape of the molecular systems [5,39,91]. In other words, the character of these low frequency normal modes is mainly due to the properties of the shape of the biological molecules [65]. Thus, an atomic representation of the biological molecules is not needed to obtain its dynamical properties, rather only its shape would be necessary. These observations indicates that larger biological assemblies may have been designed in a way that they adopt a specific form from which specific dynamics can arise [49,82]. Then, these machines use these motions that are the property of the shape. Hence, the shape corresponding to dynamical properties of the structure is the key to understand the function of biological system. 18

2.3 Gaussian Network Model

Tirion [89] introduced the idea of a single parameter potential energy for all the atomic pairs in a close distance, and demonstrated that the analysis with such an approximation is enough to reproduce the slow dynamics in good detail. Along this line, Bahar et al later proposed the Gaussian Network Model (GNM). In GNM, a biological structure is represented as the set of particles connected with elastic network of springs. This model is a coarse grained model, taking the amino acids (residues) as the basic units. Each residue can be represented by one of its backbone atoms, say Cα. All the springs are assumed to have the same force constant.

th th 0 0 th For i and j residues, let Ri and Rj denote the distance from the origin to the i and th 0 0 j residue of protein, respectively. So Ri and Rj are equilibrium position vectors, respectively. 0 0 0 Similarly, Rij denote the equilibrium distance vector between Ri and Rj . Also, let ∆Ri and th th ∆Rj be the instantaneous fluctuation vectors of each i , j node. Then

0 0 ∆Rj = Rj − Rj , ∆Ri = Ri − Ri

Instantaneous position vectors of these nodes are defined by Ri and Rj, and Rij is instantaneous distance vector between ∆Ri and ∆Rj. Therefore,

0 0 0 0 0 ∆Rij = (∆Rj − ∆Ri) = (Rj − Rj ) − (Ri − Ri ) = (Rj − Ri) − (Rj − Ri ) = Rij − Rij

In the GNM method, a three dimensional protein structure is usually described as an elastic network connected by harmonic springs with a certain cutoff distance. Generally, the residue of a protein is considered as a location of Cα atoms of the residue, and a sequence of such residues connected with strings forms an elastic network. In the mathematical field of graph theory, the Laplacian matrix, sometimes called admittance matrix or Kirchhoff matrix, is a matrix representation of a graph. This matrix can be used to express the connectivity of the residues of the protein structure. the contact matrix of Cα atoms of a protein is constructed 19

Figure 2.2: Example elastic network model for protein 1HEL

The picture on the left shows the Cα trace of 1HEL. The one on the right shows all con- nections between Cα nodes for 1HEL to indicate the nature of the elastic network analyzed by GNM [1]. by using the Kirchhoff matrix Γ, defined by   −1 if i 6= j and R < r  ij c  Γij = 0 if i 6= j and Rij > rc (2.23)   PN  − j,j6=i Γij if i = j where rc is a cutoff distance for spatial interactions, usually 7A˚ for proteins [80]. The energy function can be used to describe the interaction among the residues. When a protein changes from an arbitrary position to an equilibrium state, the connected residues can be considered as a set of masses contacting with springs. Thus around their equilibrium positions, using the harmonic potential approximation, potential energy of the network can be defined in terms of ∆Ri and ∆Rj.

Figure 2.3: An example contact matrix

 2 −1 0 0 −1 0   −1 3 −1 0 −1 0     0 −1 2 −1 0 0      (2.24)  0 0 −1 3 −1 −1     −1 −1 0 −1 3 0  0 0 0 −1 0 −1

From the integral of Hooke’s Law, Ep can be expressed in terms of ∆Xi, ∆Yi and ∆Zi 20

components of ∆Ri as follows

N γ X  E = Γ [(∆X − ∆X )2 + (∆Y − ∆Y )2 + (∆Z − ∆Z )2] p 2 ij j i j i j i i,j for some force constant γ.

If we express the X, Y and Z components of the fluctuation vectors ∆Ri as

T T T ∆X = [∆X1, ∆X2, ··· , ∆XN ] , ∆Y = [∆Y1, ∆Y2, ··· , ∆YN ] and ∆Z = [∆Z1, ∆Z2, ··· , ∆ZN ] , then above equation simplifies to [1]

γ E = [∆XT Γ∆X + ∆Y T Γ∆Y + ∆ZT Γ∆Z] (2.25) p 2

Furthermore, in the field of , each fluctuation vector ∆X, ∆Y and ∆Z has to follow the Maxwell-Boltzmann distribution, respectively

1 −E (∆X) 1  γ  p(∆X) = exp p = exp − ∆XT Γ∆X (2.26) N kBT N 2kBT where kB is the Boltzmann constant, T is the absolute temperature and N is the normalization factor [58]. Similar equations exist for p(∆Y ) and p(∆Z).

Note that because p(∆X) is the probability density function(PDF),

Z Z 1 γ p(∆X)dX = exp(− ∆XT Γ∆X) = 1 Rn Rn N 2kBT

Thus, Z γ N = exp(− ∆XT Γ∆X) Rn 2kBT

This model, which also referred to as the Gaussian Network Model [88], can be used to under- stand the movement of each residue in the protein mechanically. Especially, we are interested in calculating expected values of a particular residue fluctuations so-called mean-square fluc- tuation or the correlations between the fluctuations of two different residues. Based on the statistical physics foundations of GNM, mean-square fluctuation for each residue h∆Ri, ∆Rii, 21

and cross-correlations, h∆Ri, ∆Rji can be evaluated. The mean square fluctuations for a particular residue i, is the sum of component fluctuations,

h∆Ri, ∆Rii = h∆Xi, ∆Xii + h∆Yi, ∆Yii + h∆Zi, ∆Zii

Cross-correlations between the fluctuation of two different residues, i and j, is the sum of component cross-correlations,

h∆Ri, ∆Rji = h∆Xi, ∆Xji + h∆Yi, ∆Yji + h∆Zi, ∆Zji

2 2 Definition 2.1. Suppose that X1, ··· , Xn has a normal distribution N(µ1, σ1), ··· , N(µn, σn), respectively. Then an n-dimensional random variable X with mean vector µ and covariance matrix Σ is said to have a nonsingular multivariate normal distribution, if (i)Σ is positive definite, and (ii) the probability density function (PDF) of X is of the form   1 1 T −1 fX (x1, ··· , xn) = n 1 exp − (x − µ) Σ (x − µ) (2π) 2 |Σ| 2 2 where |Σ| is the determinant of the covariance matrix [87,9].

If we combine the nonsingular and singular cases together, we have following definition.

Definition 2.2. An n-dimensional random variable X with mean vector µ and covariance matrix Γ is said to have a multivariate normal distribution ( in symbols Nn(µ,Σ)) if either

X ∼ Nn(µ, Σ), Σ > 0, or X ∼ Nn(µ, Σ), |Σ| = 0

Under certain conditions, an n-dimensional random variable with mean vector µ and covariance matrix Σ can have a singular multivariate normal distribution [9].

From the above discussion, 1  γ  1  1 k T  p(∆X) = exp − ∆XT Γ∆X = exp − ∆XT ( B Γ−1)−1∆X N 2kBT N 2 γ

By the Definition 2.1, we know that ∆X has a n-dimensional multivariate normal distribution

kB T −1 with mean zero and the covariance matrix, γ Γ k T ∆X N(0, B Γ−1) ∼ γ 22

Therefore, the probability density function (PDF) for ∆X around the equilibrium positions is given by

1  1 k T  p(∆X) = exp − ∆XT ( B Γ−1)−1∆X (2.27) N k T 1 1 2 B 2 −1 2 2 γ (2π) ( γ ) |Γ |

Because the smallest eigenvalue of the contact matrix Γ is zero, the determinant of Γ is zero.

Thus, Γ−1 does not exist. Instead, the pseudo inverse can be computed by using singular-value decomposition of Γ. Γ−1 is constructed using the N-1 non-zero eigenvalues and associated eigenvectors.

Remark 2.9. From calculus, for some constant a>0,

Z r Z r 2 π 2 2 π exp(−ax )dx = , x exp(−ax )dx = 3 R a R 4a

Theorem 2.10. Suppose that X1, ··· , Xn is the equilibrium component positions of the residues in a protein, Γ is the corresponding contact matrix, and ∆X1, ··· , ∆Xn is the corre- sponding component residue fluctuations. Suppose that the singular value decomposition of Γ is given by UΛUT where U is orthogonal matrix and Λ is diagonal matrix with singular values.

Then,[56] n kBT X h∆X , ∆X i = U [Λ−1] U (2.28) i i γ ij jj ij j=1 23

Proof. Z 2 h∆Xi, ∆Xii = (∆Xi) p(∆Xi)d∆Xi R Z 2 1 −Ep(∆Xi) = (∆Xi) exp( )d∆Xi R Ni kBT Z 2 1 2 −γΓii(∆Xi) = (∆Xi) exp( )d∆Xi Ni R 2kBT 2 R 2 −γΓii(∆Xi) (∆Xi) exp( )d∆Xi R 2kB T = 2 R −γΓii(∆Xi) exp( )d∆Xi R 2kB T r r π c −γΓii = 3 , c = 4c π 2kBT k T = B [Γ−1] γ ii k T = B [UΛ−1U T ] γ ii n kBT X = U [Λ−1] U γ ij jj ij j=1

The above theorem shows that the mean-square fluctuation of residue can be evaluated with only simple calculations of singular-value decomposition [29,57,88].

In GNM, residue fluctuations are assumed to have uniformity in all directions. Thus,

1 h∆X , ∆X i = h∆Y , ∆Y i = h∆Z , ∆Z i = h∆R , ∆R i i i i i i i 3 i i the mean square fluctuation of the residue i is given by

3k T h∆R , ∆R i = B [Γ−1] i i γ ii and the cross-correlations between residue i and j is given by

3k T h∆R , ∆R i = B [Γ−1] i j γ ij

Mode decomposition

Based on the above discussion, the cross-correlations between residue i and j is 24

3k T 3k T 3k T h∆R , ∆R i = B [Γ−1] = B [(UΛU T )−1] = B [UΛ−1U T ] i j γ ij γ ij γ ij

3kBT X 1 T = [ ukuk ]ij γ λk k

The frequency and shape of a mode is represented by its eigenvalue and eigenvector, respec- tively. Since the Kirchhoff matrix is positive semi-definite, the first eigenvalue, λ1, is zero and the corresponding eigenvector have all its elements equal to √1 . This shows that the network N model is translation invariant. Here, the summation is performed over all non-zero eigenvalues of Γ. It is clear that the above equations allow us to calculate the amplitudes of fluctuations for individual residues without providing information regarding their absolute orientations or directions. However, in reality the fluctuations are anisotropic in general. And it is important to access the directions of collective motions, as these can be directly relevant to biological function and mechanisms. It is not indeed possible to acquire an understanding of the mecha- nism of motion unless the fluctuation vectors, in addition to their magnitudes, are elucidated.

An extension of the GNM, called the anisotropic network model (ANM) address this issue [1,2].

The main difference between the GNM and ANM is that all residue fluctuations are isotropic in the GNM, while ANM does have directional preferences [32].

In the GNM, for the X, Y and Z directions, fluctuations of X, Y and Z are independent and identically distributed each other, therefore the joint probability distribution function (PDF) of all fluctuation can be expressed by

P (∆R) = P (∆X, ∆Y, ∆Z) = p(∆X)p(∆Y )p(∆Z) = [p(∆X)]3 where P denotes the joint PDF of X, Y and Z and p denotes the PDF.

This formula requires inverse of the Kirchhoff matrix. In the GNM, the determinant of Kirchhoff matrix is zero, hence calculation of its inverse requires singular value decomposition. Therefore, the probability distribution of all fluctuations in GNM becomes

1 3 k T P (∆R) = p(∆X)p(∆Y )p(∆Z) = exp(− (∆XT ( B Γ−1)−1∆X)) 3N k T 3 3 2 B 2 −1 2 2 γ (2π) ( γ ) |Γ | 25

The relationship between NMA and GNM

Theorem 2.11. The Gaussian Network Model is equivalent to the Normal Mode Analysis for evaluating the mean square residue fluctuations of a protein around equilibrium state, with the energy function defined for the residues instead of the atoms by [52,56]

γ E ≈ [∆XT Γ∆X] (2.29) p 2

Proof. In the proof of Lemma 1.2.,

1 E (r) = QT ΛQ p 2

Thus, suppose that Ep(r) is defined by

γ E (r) = QT ΛQ p 2

Then, 6k T α2 = B [Λ−1] j γ jj Therefore, by Theorem 1.1.

3n 3kBT X h∆r , ∆r i = U [Λ−1] U i i γ ij jj ij j=1 and Theorem 1.2., n kBT X h∆X , ∆X i = U [Λ−1] U i i γ ij jj ij j=1 Therefore, n 3kBT X h∆R , ∆R i = U [Λ−1] U i i γ ij jj ij j=1

Therefore, for both NMA and GNM, the mean square fluctuation for residue i can be evaluated by using the singular value decomposition. Hence the Gaussian Network Model is equivalent to the Normal Mode Analysis for evaluating the mean square residue fluctuations of a protein around equilibrium state given potential energy. 26

Applications

As we know, in order to know a variety of biological functions of proteins, the knowledge of three dimensional structures of proteins and their dynamic behaviors are indispensable. Based on the GNM, to understand the dynamic behaviors of proteins, fluctuations of residue can be cal- culated around the equilibrium state. On the other hand, equilibrium fluctuations of biological molecules can be measured experimentally, too. One of the way to measure the fluctuations is based on the X-ray crystallography. In X-ray crystallography, B-factor (or temperature factor) of each atom is a measure of mean-squared fluctuation of the native structure. The other way to obtain the fluctuations is established on the work of nuclear magnetic resonance (NMR). In

NMR experiments, this measure can be obtained by calculating root-mean-squared differences between different models. In many applications and publications [17,22,27], including the orig- inal articles [29], it has been shown that expected residue fluctuations obtained from GNM is in good agreement with the experimentally measured native state fluctuations. The relation between B-factors, for example, and expected residue fluctuations obtained from GNM is as follows 8π2 8π2k T B = h∆R , ∆R i = B (Γ−1) (2.30) i 3 i i γ ii There is a single parameter in the theory, the force constant γ. This parameter is a measure of the strength of inter-molecular potential energy that stabilize the native folding. In the GNM theory, γ has been estimated for each protein by comparing above equation.

Physical meanings of slow and fast modes

Remark 2.12. The cross-correlations between residue i and j is

3kBT X 1 T h∆Ri, ∆Rji = [ ukuk ]ij γ λk k

th Then, consider the correlation h∆Ri, ∆Rjik, contributed by the k mode

3kBT 1 h∆Ri, ∆Rjik = [uk]i[uk]j γ λk

th where [uk]i is the i element of uk. 27

Similarly, we can consider mean square fluctuation for residue i, contributed by the kth mode

3kBT 1 2 h∆Ri, ∆Riik = [uk]i γ λk T 2 Because U is unitary, uk uk=1, uk is normalized. Thus, the plot of [uk]i for each residue i=1,··· ,n generalizes the normalized distribution of mean square fluctuations in the kth mode.

This plot is called the kth mode shape. Because the residue fluctuations are related with the B

th 1 factors, [uk]i represents the mobilities of residue i in the k mode. Due to the factor , the λk slowest mode has the largest contribution. Moreover, in general, the slowest motions have the highest degree of collectivity. In other words, many residues are involved in the slow motion [1].

Shapes of the slowest modes reveal the mechanisms of global motions, and the most constrained residues (minima of residue fluctuation) in these modes, play a important role that govern the correlated movements of entire domains [17,85]. Although these motions are slow, they involve substantial conformational change over several residues. On the other hand, the fastest modes involve the most tightly packed and constrained residues in the molecule.

Diagonalization of the Kirchhoff matrix decomposes the normal modes of collective motions of the Gaussian network model of a bimolecular. The expected values of fluctuations and cross- correlations are obtained from linear combinations of fluctuations along these normal modes.

The contribution of each mode is scaled with the inverse of that modes frequency. Hence, slow

(low frequency) modes contribute most to the expected fluctuations. Along the few slowest modes, motions are shown to be collective and global and potentially relevant to functionality of the biomolecules. Fast (high frequency) modes, on the other hand, describe uncorrelated motions not inducing notable changes in the structure.

2.4 Concluding Remarks

Since the establishment of the normal mode theory, it has been used in a variety of fields such as B-factor refinement in crystallography, NMR order parameters, the study of large-scale conformational changes of proteins [48,68,69,72,74].

Until the development of diagonalization in mixed basis (DIMB) method, normal mode studies 28 remained limited to proteins of less than 300 residues. By using DIMB method, studying on large molecular assembly comprising 2760 residues was successful [1] contrary to methods based on Rayleigh quotient [70] or on the perturbation approach [10]. Evolution in diagonalization techniques such as DIMB or RTB with the coarse-grained elastic network model has reduced computational expense on large biological systems.

In particular, as techniques in X-ray crystallography have progressed, structures of larger bio- logical molecules have become available and new techniques for NMA were needed to address dynamics of these larger molecules. A few lowest frequency normal modes are related to the motions of structural domains [20]. Thus, NMA is useful for studying the mechanical details of the motions if the motion has a collective character and amplitude is large enough [23,47].

Moreover, perturbing the structure along its lowest-frequency modes seems promising tool for molecular replacement problems [12], structure refinement [44] and fitting atomic structures into low resolution electron density maps [19].

Regarding the theory of normal mode, mean square fluctuation was reviewed mathematically.

In other words, residue fluctuation can be estimated as a sum of simple oscillator provided that singular value decomposition of the Hessian matrix is given.

For the larger biological molecules, classical molecular dynamics simulation has been limited due to their complexity. Based on the normal mode analysis approach, analysis for conformational rearrangement and low resolution experimental data of protein can be possible. Furthermore, a key to understand the dynamics of protein is the shape dependent on the structure.

The other method to understand dynamics of protein is the Gaussian Network Model based on the theory of elastic network and Gaussian distribution combined with the assumption that the structure is near its equilibrium state [1]. Based on the statistical physics foundations, residue mean-square fluctuation can be evaluated with only simple calculations referred to as singular-value decomposition.

More importantly, the Gaussian Network Model is equivalent to the Normal Mode Analysis for evaluating the mean square residue fluctuations of a protein around equilibrium state com- bined with same energy function. Therefore, the GNM can be also used in several areas that the NMA applied such as B-factor refinement in crystallography, NMR order parameters and 29 so on [17,22,27,29].

Furthermore, GNM can be used to understand the relationship between the degree of collec- tivity and the modes. Shapes of slow modes reveal the dynamics of global motions and fastest modes involve the most constrained residues [1]. 30

CHAPTER 3. EVALUATION OF GAUSSIAN NETWORK MODEL

In this chapter, we evaluate the Gaussian Network Model (GNM) which was reviewed in the previous chapter, for a set of protein structures. We downloaded a set of protein structures from the Protein Data Bank. And we explain how we implemented an algorithm in MATLAB

(Matrix Laboratory), which is a numerical computing program widely used in academic in- stitutions. In this work, we clarify the procedures for one structure first, and then illustrate those for the sequence of protein structures. Finally, based on the implemented algorithm, correlations with experimental data are calculated. And then we demonstrate the distribution of correlations.

3.1 Introduction

Proteins are one of the most important biological compound. Chains of sequence of amino acids forms a three dimensional protein structure. In other words, the order of the amino acids sequences determines the particular shape of three dimensional structure, and the shape defines the functional roles of the protein structure. Therefore, study of the relationship between the formation of protein structure and dynamics has been performed for decades. We can understand the biological functions better when we understand the dynamics of the entire structure better. Therefore, it is not only essential but also important to research the structural

fluctuation of proteins to get better ideas of biological functions as we discussed in the previous chapter. Normal Mode analysis requires the potential energy to reach minimum, in which the protein is in a stable state. Then, the potential energy can be approximated by a quadratic function around this minimum state. The system of equation of motion (2.4) can then be computed analytically, and the normal modes of the fluctuation can be extracted from the 31 solution for the analyzing atomic fluctuations as well as overall structural vibrations. However, calculation of the energy minimum is usually very expensive because the Hessian matrix is huge. In addition, the potential energy function is also estimated approximately, hence the function has atomic detailed errors. Therefore, the need for a model which does not require the energy minimization but also reflects the structural vibrations has been proposed. Gaussian

Network Model (GNM) was proposed and it turns out that GNM is useful especially when we investigate the structural fluctuations. In this chapter, we plan to examine how well the GNM predicts the residue level fluctuations compared to experimental data.

3.2 Preparation for evaluation

We downloaded a set of protein structures from the Protein Data Bank satisfying certain conditions: We chose protein structures based on X-ray crystallography experimental method with high resolution (higher than 1.5A˚). Also we filter the search results based on sequence similarity at 30% identity. In other words, multiple structures whose sequences have at least

30% of sequence identity will be represented by a single structure. Finally, we obtained 2,052 protein structures satisfying the previous conditions.

Since GNM is residue-level model, we consider Cα as the representative atom of the residue for given structure. Thus, a contact matrix of given structure can be obtained by considering the distances between Cα atoms. After that, we compute the residue mean square fluctuations using singular value decomposition of contact matrix for the given set of protein structures. Finally, because mean square fluctuation has a linear relationship with B-factor by equation (2.30), we can calculate correlation coefficients between residue mean square fluctuations computed by GNM and X-ray crystallography experimental B-factors given in PDB file.

3.3 Implementation of algorithm

We implemented an algorithm in MATLAB running on a standard desktop workstation.

By definition (2.23), the contact matrix is a so-called Laplacian matrix. The Laplacian matrix is always positive-semi-definite and therefore the matrix has a zero eigenvalue. Because a 32 contact matrix of GNM always has a zero eigenvalue, the inverse of the matrix does not exist.

Therefore, the pseudo-inverse should be computed by using the singular value decomposition

(SVD) of the matrix. In other words, while implementing the algorithm, we only collected the singular values greater than 1e-5 to compute mean square fluctuations by using equation

(2.28). After calculating the mean square fluctuations, we compute correlation coefficients of mean square fluctuations with experimental B-factors in PDB file. As we discussed in the

Chapter 2, since the B-factors are only constant multiplication of the mean square fluctuations

(2.22), correlation coefficients between predicted mean square fluctuations and experimental

B-factors can be directly computed. Once the correlation coefficients for one structure is obtained, we repeat the same procedures to have correlation coefficients for the given set of protein structures. And those calculated correlation coefficient values are used for showing the distribution for the chosen set of protein structures.

3.4 Evaluation Results

Figure 3.1: Distribution of correlations for protein structures using GNM

Figure 3.1 shows the distributions of correlation coefficients for the given set of protein structures. Among 2,052 protein structures, only correlations of 1,817 structures were com- puted; We remove some structures that the number of Cα for predicting mean square fluc- tuations is not equal to that of Cα for experimental B-factors in PDB file. Sometimes, the number of Cα atoms representing a residue does not equal to the number of Cα atoms having 33 the X-ray crystallography B-factors. Also, we exclude the chosen structure if the structure is not a protein such as DNA or RNA structure. Figure 3.1 illustrates the smooth distribution over the correlation coefficients for the 1817 structures. We can see that almost 70% of the structures have the correlations greater than or equal to 0.5, which means that GNM predicts the residue mean square fluctuations pretty well. In Chapter 5 and Chapter 6, we will com- pare our proposed models to this benchmark to understand how well the model predicts the structural fluctuations. 34

CHAPTER 4. COARSE-GRAINED NORMAL MODE ANALYSIS

In this chapter, we propose a new approach called the nonhomogeneous GNM, which can have a better correlation coefficients of mean square fluctuations for each residue with experi- mental B-factor than GNM. In this model, we define nonhomogeneous contact matrix from the

Hessian matrix instead of homogeneous contact matrix depending on the cutoff distance. We compare correlation coefficients of residue mean square fluctuations based on different meth- ods with experimental B-factor in PDB file. First, we calculate the correlation coefficients of residue mean square fluctuations (MSF) using GNM based on different backbone atoms, C,

N, Cα, Cβ with experimental B-factor in PDB file. Then, correlation coefficients of residue MSF using nonhomogeneous GNM based on different backbone atoms can be calculated. Fi- nally, correlation coefficients of residue MSF using average method which assigns sum of all values of Hessian matrix corresponding to each residue to contact constants can be calculated, and also all atom based normal mode analysis is used to calculate correlation coefficients with experimental B-factor.

4.1 Introduction

In chapter 2, we discussed the Normal Mode Analysis (NMA) and Gaussian Network Model

(GNM). In Normal Mode Analysis, the availability of the potential energy function is required and also the computation of the Hessian matrix of the function is needed [1]. The work of

Tirion showed that Hessian matrix can be approximated by using a single parameter model for all the atomic pairs in a close distance [89]. And it turned out that such an approximation can be as accurate as using the exact Hessian matrix. Furthermore, the Gaussian Network

Model was proposed by Bahar et al for further simplification on the conventional normal mode 35 analysis [29, 88, 17]. In GNM, a protein structure is described at a coarse grained level with the amino acid residues as the basic units, and is considered as connected network of elastic springs with the residues as particles. Each residue can be represented by one of its backbone atoms, say Cα, or a selected point in the residue, say the geometric center of the side-chain. Two residues are said to be in contact if they are close in distance (usually, in a 7A˚ cutoff distance).

All the springs are assumed to be the same, i.e., have the same force constant. However, a more accurate model with the introduction of different values of force constants based on the physical interaction between two representative atoms could be considered.

4.2 Issues of NMA and GNM

As analyzed in chapter 2, in conventional NMA, the system of equations of motion (2.4) can be solved with a quadratic approximation to the potential energy function. This approximation is accurate enough only in a small neighborhood of the energy minimum state (or structure).

The approximation also depends heavily on the original potential energy function, which itself is an approximation obtained semi-empirically [96, 14]. Besides, the conventional NMA is usually conducted at the atomic level, which is subject to all the atomic level errors induced in energy calculations. As a result, the structural fluctuations predicted by NMA are often even worse than those by GNM, which is in fact a coarser model than NMA [18, 1].

As describe in chapter 2, the GNM method can be considered as a type of residue level

NMA. In GNM, a residue contact matrix is defined only using the distance between residues

(the distances between the representative atoms Cα, N, C or Cβ). If the distance between two different residues is less or greater than the given cutoff value, say 7 A˚, then a constant, -1 or

0 (or more accurately, -k or 0) is allocated for the entry in the matrix, respectively. In other words, constants are assigned to the entries throughout the matrix according to the contact distances. The advantage of using GNM is that the model is simple and easy to construct, and the dimension of the model is much smaller (proportional to the number of residues), and the computational cost is not as expensive as atomic level NMA. Moreover, GNM does not require the exact potential energy and the Hessian matrix, reducing not only the computational cost but also possible errors in atomic energy calculations. 36

Physically speaking, in GNM, the protein is viewed as an elastic network with the residues as nodes and the contact distances as the links. The links are approximated by springs and assigned with some spring constants. In other words, the forces between the residues in con- tact are approximated by some harmonic forces. Mathematically speaking, this is equivalent to saying that a quadratic approximation is used to represent the potential energy for the residue interactions of the protein. As a first approximation, in GNM, a single spring con- stant is assigned to all the residue pairs in contact. However, different pairs of residues may interact differently with different force constants. A more accurate model may be built if these differences can be considered.

4.3 New Approach

In this section, we investigate an alternative approach to coarse-grained normal mode anal- ysis. Instead of using a homogeneous constant for all the interactions among the residues in contact, we assign different force constants for different types of interactions in terms of re- lated residue types and distances. In certain sense, it also can be considered as a refined, non-homogeneous Gaussian Network Model. There have been several efforts made in the past to refine GNM: For example, Sen and Jernigan [97] defined the force constants in terms of the residue contact distances, and Kondrashov, Cui, and Phillips [13] assigned different force constants for residue pairs of different sequence distances. On the other hand, Gur et.al [100] obtained spring constant matrix from the fluctuation correlation matrix by using Molecular dynamics simulations to study the two bound peptides. We derive the force constants from the atomic level Hessian matrix of the potential energy function. Therefore, our approach can be considered to be more physics-based.

For a given structure, we first perform sufficient steps of energy minimization in an atomic- level force field. Once an energy minimum is reached, we save the Hessian matrix of the energy function and use it to derive the force constants for residue interactions. Let Ri and Rj be two residues represented by two atoms Ai and Aj, where Ai and Aj must have two position

T T (ij) vectors (vi1, vi2, vi3) and (vj1, vj2, vj3) and correspond to a 3 by 3 submatrix S of the

Hessian H. We then define the force constant for residues Ri and Rj to be the average value 37

(ij) of the entries in S . In other words, if kij is the force constant for residues Ri and Rj, then

(ij) kij = −kS kF , where k · kF is the matrix Frobenius norm. More accurately,  (ij)  −kS kF if i 6= j kij = (4.1)  Pm  − j,j6=i kij if i = j

where m is the number of residues in the protein. Then, the matrix Γ = {kij : i, j = 1, . . . , m} can serve as a reduced Hessian matrix for coarse-grained NMA.

4.4 Numerical Results

We have applied this new model to a set of protein structures downloaded from the PDB

[98], [11]. The resolutions of the structures are 1.5A˚ or higher and the sequence similarity is 30% or lower. The sizes of the structures are from small to large, with the numbers of atoms ranging from 35 to 2387. We have employed the CHARMM version 27b2 [99] for energy minimization and implemented a MATLAB Normal Mode Analysis code running on a standard desktop workstation for residue normal mode calculations. For each structure, we have used different representative atoms Cα, N, C, or Cβ for residues, to obtain the reduced Hessian matrix. We have computed the residue mean-square-fluctuations and their correlations with the B-factors of the corresponding representative atoms of the residues (B-factor correlations). We call our model the coarse-grained NMA or cgNMA for short. A cgNMA with Cα, N, C, or Cβ as the representative atoms for the residues is denoted as cgNMA(Cα), cgNMA(N), cgNMA(C), cgNMA(Cβ), respectively. For comparison, we have also computed the residue mean-square-fluctuations and their B- factor correlations for each structure, using GNM with Cα, N, C, Cβ as the representative atoms for the residues. A GNM with Cα, N, C, or Cβ as the representative atoms for the residues is denoted as GNM(Cα), GNM(N), GNM(C), GNM(Cβ), respectively. An atomic level NMA has also been performed for each structure with the CHARMM NMA routine. The mean-square-

fluctuations of Cα atoms are recorded and compared with their experimental B-factors as well.

A more sophisticated cgNMA is to use a set of atoms to represent each residue. Let two residues 38

Ri and Rj be represented by the same types of N atoms, Ai1,...,AiN and Aj1 ...,AjN , re- spectively. Let S(ij) be the 3N by 3N submatrix of the Hessian matrix H corresponding to the

(ij) (ij) representative atoms of Ri and Rj. We suppress S to a 3 by 3 matrix T , with

N N ! (ij) X X (ij) 2 Txy = sqrt [S3(k−1)+x,3(l−1)+y] , x, y = 1, 2, 3. k=1 l=1

We then define the force constant kij for residues Ri and Rj to be the average value of the entries in T (ij), i.e.,  (ij)  −kT kF if i 6= j kij =  Pm  − j,j6=i kij if i = j

where m is the number of residues in the protein. We denote this model by cgNMA(M), meaning the coarse-grained NMA with multiple representative atoms. In particular, we have used Cα, N, C for cgNMA(M) in our test. We have divided the test structures into three groups of relatively small-, medium-, and large-sized 104 structures. We then summarize the test results for each of them separately by us- ing three different tables. Table 4.1, 4.2, 4.3 show the test results on small, medium, and large structures, respectively. In these tables, we have listed the names of the proteins (ID), the num- bers of the atoms and residues in the proteins (TA and TR), and the correlation coefficients be- tween the computed residue mean-square-fluctuations and X-ray experimental B-factors for the given structures using different kinds of methods. The methods include GNM based on differ- ent representative atoms (cutoff distance 7A˚), GNM(Cα), GNM(N), GNM(C), GNM(Cβ), the coarse-grained NMA with different representative atoms, cgNMA(Cα), cgNMA(N), cgNMA(C), cgNMA(Cβ), cgNMA(M), and conventional NMA using CHARMM force field. The B-factor correlation coefficients are calculated between the B-factors of the representative atoms except for NMA for which the B-factors of Cα atoms are compared with. For cgNMA(M), the averages of fluctuations of representative atoms of residues are compared with the averages of B-factors of representative atoms of residues. In column GNM-Cβ, there is no result for 1OB4 and 1OB7 because the distances between Cβ atoms are greater than the cutoff value. 39 0.881 0.081 0.056 0.699 0.463 0.600 0.463 0.847 0.609 0.896 0.750 0.209 0.405 0.698 0.430 0.807 0.670 0.690 0.570 0.472 0.254 0.384 0.507 0.609 0.553 0.421 0.564 -0.149 -0.001 -0.027 -0.041 -0.079 0.965 0.423 0.771 0.814 0.193 0.543 0.926 0.071 0.706 0.606 0.790 0.462 0.923 0.838 0.813 0.654 0.274 0.701 0.271 0.584 0.808 0.789 0.925 0.415 0.637 0.487 0.746 0.322 0.747 0.512 -0.490 -0.236 ) cgNMA(M) NMA β 0.722 0.549 0.507 0.010 0.845 0.381 0.141 0.664 0.761 0.081 0.488 0.652 0.092 0.396 0.168 0.233 0.322 0.670 0.291 0.549 0.216 0.075 0.221 0.107 0.760 0.111 0.579 0.700 -0.033 -0.358 -0.258 -0.157 a 0.898 0.524 0.722 0.405 0.947 0.025 0.725 0.583 0.701 0.400 0.873 0.686 0.885 0.647 0.358 0.618 0.398 0.566 0.806 0.830 0.849 0.332 0.653 0.536 0.719 0.374 0.808 0.509 -0.164 -0.091 -0.298 -0.132 ) - residue B-factor correlations using GNM with different β 0.989 0.238 0.971 0.707 0.956 0.843 0.621 0.538 0.019 0.540 0.897 0.496 0.779 0.741 0.805 0.480 0.653 0.511 0.447 0.763 0.614 0.938 0.369 0.417 0.407 0.654 0.285 0.658 0.314 -0.104 -0.070 -0.003 , N, C, C ) cgNMA(N) cgNMA(C) cgNMA(C α α 0.886 0.445 0.458 0.567 0.056 0.930 0.952 0.784 0.355 0.888 0.027 0.900 0.562 0.603 0.850 0.235 0.088 0.712 0.822 0.646 0.939 0.304 0.868 0.433 0.628 0.377 0.570 0.555 -0.229 -0.537 -0.203 -0.399 , N, C atoms as the representative atom, NMA - correlations using NMA ) cgNMA(C α β NaN NaN 0.741 0.596 0.149 0.593 0.586 0.288 0.655 0.710 0.287 0.812 0.057 0.524 0.515 0.285 0.384 0.313 0.365 0.748 0.509 0.789 0.487 0.478 0.272 0.053 0.492 0.772 0.578 -0.054 -0.005 -0.230 ) - residue B-factor correlations using cgNMA with different choices of representative atoms, cgNMA(M) β 0.831 0.543 0.354 0.305 0.225 0.625 0.214 0.828 0.782 0.701 0.606 0.487 0.367 0.507 0.178 0.872 0.864 0.900 0.593 0.717 0.630 0.544 0.555 0.624 -0.099 -0.402 -0.711 -0.377 -0.135 -0.193 -0.174 -0.046 Table 4.1: Correlation coefficients for small structures , N, C, C α 0.970 0.568 0.474 0.508 0.406 0.483 0.863 0.658 0.810 0.365 0.881 0.526 0.678 0.689 0.425 0.658 0.420 0.250 0.790 0.699 0.853 0.497 0.006 0.639 0.674 0.422 0.316 0.651 -0.721 -0.371 -0.107 -0.010 ) GNM(N) GNM(C) GNM(C α 0.689 0.434 0.562 0.523 0.270 0.185 0.754 0.628 0.808 0.432 0.844 0.616 0.625 0.656 0.431 0.530 0.327 0.155 0.821 0.656 0.901 0.594 0.706 0.127 0.671 0.434 0.591 0.674 -0.744 -0.463 -0.274 -0.142 6 6 8 5 5 13 16 16 15 12 12 18 12 20 13 29 35 36 36 13 31 39 43 46 45 40 51 52 64 55 52 65 51 55 69 96 109 110 111 112 138 147 147 153 160 172 214 225 255 257 273 277 295 335 345 361 367 382 387 399 452 470 491 503 ID TA TR GNM(C ID-protein ID, TA-total number of atoms, TR-total number of residues, GNM(C 1P9I 1U06 1O06 1AIE 1FF4 2OL9 2NLS 1HJE 1BX7 1OB4 1OB7 1Q9B 1USE 1XY2 1RJU 1PEF 1ETL 1YJO 2JKU 1GK7 2DSX 1VRZ 1PEN 2OLX 35 4 0.885 0.857 0.4101ETN 0.925 0.776 0.867 0.664 0.859 0.738 -0.933 a 6RXN 1NOT 1KYC 1UOY 1YZM 1AKG 1GVD 1ETM - residue B-factor correlations using cgNMA with a combination of C choices of representative atoms, cgNMA(C 40 0.714 0.367 0.330 0.749 0.351 0.372 0.177 0.680 0.613 0.561 0.353 0.301 0.322 0.707 0.609 0.292 0.756 0.248 0.648 0.428 0.581 0.694 0.492 0.130 0.789 0.540 0.612 0.531 0.665 0.248 0.758 0.297 0.734 0.307 -0.245 0.812 0.762 0.761 0.498 0.539 0.355 0.542 0.741 0.865 0.695 0.334 0.534 0.733 0.775 0.679 0.515 0.762 0.543 0.649 0.244 0.482 0.763 0.167 0.647 0.821 0.490 0.742 0.615 0.581 0.431 0.587 0.610 0.878 0.726 -0.247 ) cgNMA(M) NMA β 0.312 0.197 0.386 0.216 0.489 0.072 0.339 0.488 0.350 0.482 0.283 0.074 0.473 0.336 0.290 0.269 0.530 0.058 0.438 0.325 0.166 0.618 0.057 0.117 0.509 0.352 0.400 0.213 0.515 0.327 0.286 0.048 0.508 0.479 -0.257 a 0.845 0.840 0.720 0.387 0.516 0.255 0.514 0.665 0.840 0.648 0.325 0.409 0.772 0.743 0.697 0.530 0.643 0.493 0.636 0.344 0.506 0.803 0.110 0.582 0.757 0.484 0.711 0.589 0.550 0.458 0.546 0.620 0.871 0.651 -0.279 0.736 0.556 0.717 0.568 0.554 0.298 0.409 0.654 0.761 0.421 0.355 0.583 0.644 0.622 0.659 0.314 0.825 0.450 0.602 0.277 0.382 0.683 0.176 0.611 0.673 0.403 0.561 0.485 0.584 0.402 0.526 0.403 0.834 0.626 -0.220 ) cgNMA(N) cgNMA(C) cgNMA(C α 0.579 0.521 0.565 0.308 0.362 0.796 0.453 0.632 0.780 0.223 0.380 0.385 0.826 0.688 0.508 0.485 0.581 0.187 0.057 0.078 0.389 0.774 0.351 0.432 0.643 0.289 0.535 0.485 0.517 0.102 0.533 0.497 0.685 0.628 -0.165 ) cgNMA(C β 0.775 0.147 0.570 0.540 0.441 0.317 0.550 0.540 0.719 0.548 0.334 0.431 0.493 0.483 0.549 0.653 0.557 0.161 0.554 0.489 0.511 0.682 0.562 0.432 0.778 0.626 0.693 0.406 0.314 0.435 0.563 0.327 0.738 0.326 -0.042 0.510 0.302 0.463 0.378 0.382 0.694 0.652 0.738 0.354 0.466 0.454 0.549 0.659 0.190 0.530 0.288 0.410 0.547 0.405 0.662 0.807 0.356 0.284 0.787 0.253 0.770 0.651 0.661 0.266 0.617 0.546 0.696 0.629 -0.021 -0.288 Table 4.2: Correlation coefficients for medium structures 0.249 0.269 0.491 0.414 0.559 0.606 0.775 0.669 0.387 0.608 0.511 0.404 0.631 0.270 0.432 0.388 0.360 0.556 0.377 0.670 0.800 0.344 0.421 0.770 0.214 0.764 0.578 0.711 0.333 0.580 0.407 0.756 0.710 -0.078 -0.209 ) GNM(N) GNM(C) GNM(C α 0.690 0.419 0.747 0.583 0.485 0.398 0.654 0.605 0.798 0.495 0.549 0.497 0.572 0.695 0.348 0.517 0.421 0.481 0.613 0.368 0.632 0.741 0.466 0.397 0.820 0.433 0.710 0.615 0.631 0.331 0.620 0.594 0.771 0.529 -0.286 75 35 75 82 93 85 80 81 77 87 83 91 95 87 89 85 93 88 98 87 90 96 88 99 97 96 93 99 99 112 100 113 103 112 104 551 562 598 623 626 642 642 651 661 677 683 688 700 702 705 721 723 726 727 728 729 731 731 742 746 769 771 775 778 780 800 812 816 818 828 ID TA TR GNM(C 1I71 See descriptions in Table 4.1 2IP6 1Z21 1R7J 1V05 1LR7 5222BF9 73 0.620 0.667 0.751 0.624 0.795 0.704 0.628 0.449 0.712 0.629 2FQ3 2CE0 2QJL 1FK5 2E3H 1N7E 2RB8 2PLT 2EHS 1ZVA 1X3O 3BZQ 2BRF 1ULR 1W2L 2PKT 2EAQ 5CYT 1ABA 1OPD 1USM 1UHA 1CYO 1NOA 1NNX 1QAU 2NUH 1GXU a 2MCM 41 0.115 0.300 0.379 0.542 0.243 0.296 0.519 0.462 0.355 0.702 0.459 0.552 0.672 0.492 0.262 0.542 0.583 0.741 0.579 0.645 0.491 0.281 0.491 0.465 0.591 0.575 0.715 0.618 0.223 0.813 0.687 0.392 0.709 0.663 0.200 0.324 0.509 0.693 0.357 0.417 0.462 0.534 0.364 0.873 0.568 0.496 0.670 0.290 0.715 0.492 0.539 0.684 0.487 0.760 0.541 0.521 0.742 0.323 0.669 0.538 0.716 0.642 0.372 0.628 0.820 0.523 0.745 0.572 ) cgNMA(M) NMA β 0.272 0.295 0.343 0.200 0.104 0.119 0.255 0.405 0.256 0.731 0.261 0.217 0.462 0.068 0.282 0.287 0.383 0.380 0.136 0.308 0.020 0.296 0.351 0.188 0.440 0.411 0.333 0.346 0.184 0.484 0.392 0.351 0.538 0.216 a 0.216 0.338 0.541 0.671 0.282 0.436 0.486 0.524 0.315 0.861 0.638 0.474 0.622 0.222 0.699 0.535 0.476 0.625 0.444 0.732 0.552 0.562 0.769 0.233 0.644 0.485 0.588 0.596 0.369 0.554 0.781 0.491 0.699 0.471 0.305 0.278 0.288 0.645 0.312 0.334 0.335 0.413 0.349 0.801 0.476 0.349 0.599 0.348 0.530 0.281 0.525 0.620 0.545 0.642 0.514 0.408 0.547 0.288 0.593 0.522 0.733 0.644 0.290 0.568 0.727 0.396 0.666 0.663 ) cgNMA(N) cgNMA(C) cgNMA(C α 0.546 0.365 0.530 0.447 0.221 0.441 0.720 0.330 0.308 0.844 0.380 0.339 0.725 0.414 0.702 0.468 0.322 0.620 0.411 0.739 0.401 0.616 0.584 0.154 0.594 0.574 0.743 0.524 0.133 0.385 0.514 0.593 0.720 0.594 ) cgNMA(C β 0.277 0.273 0.323 0.625 0.347 0.288 0.710 0.652 0.286 0.766 0.385 0.504 0.422 0.382 0.478 0.476 0.411 0.716 0.355 0.621 0.565 0.452 0.614 0.409 0.613 0.495 0.677 0.566 0.489 0.502 0.587 0.556 0.709 0.514 0.257 0.618 0.331 0.357 0.160 0.515 0.773 0.590 0.436 0.809 0.214 0.413 0.293 0.224 0.660 0.579 0.437 0.727 0.580 0.773 0.513 0.410 0.600 0.543 0.443 0.539 0.272 0.640 0.411 0.486 0.564 0.483 0.859 0.424 Table 4.3: Correlation coefficients for large structures 0.616 0.512 0.322 0.314 0.094 0.443 0.721 0.689 0.455 0.838 0.254 0.393 0.374 0.307 0.575 0.590 0.409 0.662 0.625 0.768 0.503 0.308 0.506 0.553 0.609 0.559 0.553 0.690 0.365 0.580 0.524 0.550 0.844 0.578 ) GNM(N) GNM(C) GNM(C α 0.418 0.607 0.351 0.547 0.212 0.494 0.742 0.637 0.379 0.843 0.417 0.562 0.334 0.270 0.685 0.668 0.368 0.859 0.616 0.761 0.514 0.309 0.560 0.497 0.576 0.549 0.365 0.696 0.552 0.515 0.512 0.515 0.826 0.528 90 64 108 101 111 106 104 113 102 113 113 107 122 122 123 107 122 188 175 206 203 221 205 231 204 204 213 227 224 237 233 237 238 135 846 855 855 863 864 872 873 878 879 890 906 925 934 937 937 963 991 1423 1476 1609 1700 1722 1736 1749 1750 1774 1810 1847 1848 1872 1885 1893 1948 2387 ID TA TR GNM(C 2I24 See descriptions in Table 4.1 2C71 2R16 1V70 832 105 0.162 0.1661IFR 0.190 0.424 0.285 0.279 0.281 0.395 0.282 0.265 1O08 1PZ4 1BYI 2IMF 1NLS 3SEB 1E5K 2V9V 2CG7 2VIM 2VPA 2PPN 1WHI 1ATG 1CCR 3VUB 1EW4 1RRO 1UKU 1AHO 1QTO 1NKO 2VYO 2HYK 2AGK a 2HQK 2CWS 1PMY 1WPA 1WBE 42

In Table 4.1, 4.2, 4.3, we see consistently that the results from cgNMA methods are comparable with GNM except when Cβ atoms are used as the representative atoms. Both types of methods performed much better than classical NMA using CHARMM. Among all these methods, it seems that cgNMA(M) performed the best for all three categories of structures.

Table 4.4: Comparison of correlation coefficients for small structures∗

cgNMA(Cα)-GNM(Cα) cgNMA(N)-GNM(N) cgNMA(C)-GNM(C) cgNMA(M)-GNM(Cα) + 13 17 17 16 − 17 13 12 14 = 3 3 4 3 cgNMA(Cα)-NMA cgNMA(N)-NMA cgNMA(C)-NMA cgNMA(M)-NMA + 22 19 22 23 − 9 12 11 10 = 2 2 0 0 ∗+ : number of structures whose B-factor correlations for the two given models are positive; − : number of structures whose B-factor correlations for the two given models are negative; = : number of structures whose B-factor correlations for the two given models are compara- ble.

Table 4.4, 4.5, 4.6, compare the test results for different pairs of methods for small, medium, and large structures, respectively. In these tables, we compare the B-factor corre- lation coefficients of the residue mean-square-fluctuations of the structures produced by each pair of methods, method A vs. method B, denoted as A-B. We list the numbers of B-factor correlations of method A that are significantly higher (+) or lower (−) than that of method

B. If the difference of two B-factor correlations is within 1.0e-02, we consider them to be com- parable (=). From these tables, we can see more clearly how different methods performed and compared to each other. It seems that GNM using Cα as the representative atom performed the best among all GNM methods; cgNMA using N or C as the representative atom performed better than using Cα or Cβ; cgNMA(M) performed better than GNM(Cα) and classical NMA consistently.

From the difference of GNM(Cα)-cgNMA(Cα), GNM(Cα)-cgNMA(M) the number of struc- tures that difference of two methods is positive is the same as the number of structures that 43

Table 4.5: Comparison of correlation coefficients for medium structures∗

cgNMA(Cα)-GNM(Cα) cgNMA(N)-GNM(N) cgNMA(C)-GNM(C) cgNMA(M)-GNM(Cα) + 10 19 19 23 − 24 16 13 9 = 2 1 4 4 cgNMA(Cα)-NMA cgNMA(N)-NMA cgNMA(C)-NMA cgNMA(M)-NMA + 15 18 21 25 − 20 16 13 9 = 1 2 2 2 ∗ See descriptions in Table 4.4

the difference is negative. From this fact, we may conclude that cgNMA(Cα) and cgNMA(M) is as good as GNM(Cα) in the sense that mean- square fluctuations using cgNMA(Cα) and cgNMA(M) correlates well with experimental temperature factors.

In Table 4.2, from the comparison of GNM(Cα) and cgNMA in Table 4.5, only 10 struc- tures were the case that the correlation coefficients with temperature factors using GNM(Cα) is less than those with B-factors using cgNMA(Cα) and cgNMA(N). However, 25 structures were the case that correlation coefficients using cgNMA(M) is greater than those using GNM(Cα).

The similar pattern showed in the next row. cgNMA(Cα) may be not as good as average to determine correlation coefficients with temperature factors. However, cgNMA(N), cgNMA(C) and cgNMA(M) induces better correlation coefficients than average. Furthermore, we can see even big difference between average and cgNMA(M) regarding the number of structures. For the comparison of NMA and cgNMA, we can see a quite difference for number of structures between NMA and cgNMA(M). GNM(Cα) has a tendency having a better correlation coeffi- cients than NMA.

In Table 4.6, unlike the small-sized and medium-sized structures, only cgNMA(M) repre- sents more structures that correlates better with B-factors than GNM(Cα) between GNM(Cα) and cgNMA with various choices of backbone atoms. cgNMA(Cα), cgNMA(C) and cgNMA(M) also represent more structures than average, and moreover, cgNMA(M) represents better result than cgNMA(Cα), cgNMA(C). cgNMA(M) shows still more structures than cgNMA(N) and cgNMA(C) for the comparison of NMA and cgNMA. 44

Table 4.6: Comparison of correlation coefficients for large structures∗

cgNMA(Cα)-GNM(Cα) cgNMA(N)-GNM(N) cgNMA(C)-GNM(C) cgNMA(M)-GNM(Cα) + 13 13 19 18 − 19 22 15 15 = 3 0 1 2 cgNMA(Cα)-NMA cgNMA(N)-NMA cgNMA(C)-NMA cgNMA(M)-NMA + 15 15 19 22 − 18 15 16 11 = 2 5 0 2 ∗ See descriptions in Table 4.4

4.5 Demonstrations

As described in chapter 2, mean square fluctuations are experimentally measurable (e.g.

X-ray crystallographic B-factors), and they have often been used as criterion for improving computational models and methods. Experimental temperature factors have a linear relation- ship with mean square fluctuation by (??). In this section, we investigate the agreement and difference of correlation coefficients between the predicted fluctuations and experimental observations by providing graphical figures.

Figure 4.1: Comparison of residue mean-square-fluctuations for protein 1HJE

(Left) The residue mean-square-fluctuations calculated by GNM(Cα) for 1HJE are com- pared with the experimental B-factors of Cα. (Middle) The residue mean-square- fluctuations calculated by cgNMA(M) for 1HJE are compared with the experimental B- factors of Cα. (Right) The residue mean-square-fluctuations calculated by NMA for 1HJE are compared with the experimental B-factors of Cα.

Figure 4.1, 4.2 and 4.3 demonstrate the test results on three example structures, 1HJE,

2BF9 and 2HQK, each representing a small, medium, and large structure. For each struc- 45

ture, the residue mean-square-fluctuations calculated by GNM(Cα), cgNMA(M), and NMA are shown in red curves. The experimental B-factors of the representative atoms are shown in black curves.

Figure 4.2: Comparison of residue mean-square-fluctuations for protein 2BF9

(Left) The residue mean-square-fluctuations calculated by GNM(Cα) for 2BF9 are compared with the experimental B-factors of Cα. (Middle) The residue mean-square-fluctuations calculated by cgNMA(M) for 2BF9 are compared with the experimental B-factors of Cα. (Right) The residue mean-square-fluctuations calculated by NMA for 2BF9 are compared with the experimental B-factors of Cα.

For the comparison of GNM(Cα) and NMA with experimental data, the residue experi- mental B-factors corresponding to Cα atoms are extracted. For the comparison of cgNMA(M) with experimental data, we use the mean of the residue experimental B-factors corresponding to Cα, N and C atoms as representative residue experimental B-factors.

Figure 4.3: Comparison of residue mean-square-fluctuations for protein 2HQK

(Left) The residue mean-square-fluctuations calculated by GNM(Cα) for 2HQK are com- pared with the experimental B-factors of Cα. (Middle) The residue mean-square- fluctuations calculated by cgNMA(M) for 2HQK are compared with the experimental B- factors of Cα. (Right) The residue mean-square-fluctuations calculated by NMA for 2HQK are compared with the experimental B-factors of Cα. 46

In Figure 4.1, the B-factor correlation coefficients of the residue mean-square-fluctuations of the small-sized structure 1HJE calculated by GNM(Cα), cgNMA(M), and NMA are 0.616, 0.838, and 0.209, respectively.

In Figure 4.2, the B-factor correlation coefficients of the residue mean-square-fluctuations of the medium-sized structure 2BF9 calculated by GNM(Cα), cgNMA(M), and NMA are 0.419, 0.762, and 0.367, respectively.

In Figure 4.3, the B-factor correlation coefficients of the residue mean-square-fluctuations of the large-sized structure 2HQK calculated by GNM(Cα), cgNMA(M), and NMA are 0.365, 0.716, and 0.715, respectively. 47

CHAPTER 5. REFINED GNM: STATISTICAL RESULTS FOR FORCE CONSTANTS

In this chapter, we propose a refined Gaussian Network Model based on statistical collection of force constants between two connecting residue pairs. As we analyzed in chapter 4, cgNMA are comparable with conventional GNM except using Cβ atom as the representative atom. Since cgNMA is the model based on Hessian matrix of the potential energy, the constructed contact matrix reflects the physical forces between the two residue pairs. However, to construct the contact matrix, the Hessian matrix of the potential energy function should be obtained at the minimum level. So computational cost is much higher than conventional Gaussian Network

Model. Therefore, in this chapter we consider a more refined GNM, which contains the physical forces between given two residues, but which does not require the Hessian matrix of potential energy function.

In introduction, we explain the motivation for the refined Gaussian Network Model. In section 5.2, a more physically based refined Gaussian Network Model is proposed based on the previous work of cgNMA in Chapter 4. Also in this section, the distribution of forces between all possible two residue pairs in contact but separated by some number of residues in between the two end residues is showed. Finally in section 5.3, the test results on the previous large set of protein structures are presented.

5.1 Introduction

Accurate predictions for the structural fluctuations has been studied for decades since the accurate predictions on the fluctuations provide better ideas of how the dynamics of the given structures relate to their functional abilities. There are more than eighty thousand biological 48 structures deposited in Protein Data Bank [98], and so it is important and essential not only to determine the 3D structure but also predict the structural fluctuations for the structures.

We present a new Gaussian Network Model based on our study on coarse-grained Normal

Mode Analysis described in Chapter 4. In cgNMA, a reduced Hessian matrix of energy func- tion for the energy-minimized structure needs to be obtained to construct the contact matrix.

However, in GNM, only the distances between contact residue pairs are required for developing the contact matrix. Therefore, we consider the requirement on the potential energy function as a key distinction between Gaussian Network Model and Normal Mode Analysis. Indeed, the coarse-grained NMA in Chapter 4 still requires the availability of a potential energy function to obtain the Hessian matrix, although it conducts the same analysis as GNM at the residue level.

However, we do see that with non-homogeneous force constants extracted from the Hessian matrix, the coarse-grained NMA such as cgNMA(M) performs better than GNM. The question then is whether or not GNM can be improved by introducing some non-homogeneous force constants, yet without requiring the Hessian matrix from an energy function.

Therefore, in this chapter, we collect all force constant elements in the reduced Hessian matrix extracted from the energy function for the same set of protein structures as in Chapter

4. Once we have a distribution of all the forces, we propose a nonhomogeneous GNM based on the results of distribution. It turns out that the bidiagonal element of the contact matrix should have a large magnitude than the other off diagonal elements because two directly connected residue pairs seems to have a strong forces than the other connected residues.

5.2 Refined GNM: Non-homogeneous GNM

In order to answer the previous question of whether GNM can be improved by introducing non-homogeneous force constants without the Hessian matrix, we have examined the Hessian matrices of our downloaded protein structures. We collected all the force constants (the entries in the Hessian matrices) defined in coarse-grained NMA(M). For each structure, we have a matrix of force constants {kij : i, j = 1, . . . , m} where m is the number of residues in the structures.

Because there are 20 amino acids which are encoded by the universal genetic code, the 49 number of possible residue pair is 400. And also there are many possible cases regarding the connectivity of residue pairs. Residue pairs can be connected directly, or can be connected throughout one or more residues between two given residues. If two residues are connected directly, we represent the case as ”separated by 0” and if two residues are connected throughout one or more residues, then we represent the cases as ”separated by 1” or ”separated by the number of residues between the two residues”. In other words, we grouped the constants for all the residue pairs in contact but separated by s residues in sequence, where s = 0, 1, 2, 3, etc: s = 0 means that two residues are directly connect, and s = 1 means that two residues are in contact but there is a residue in between the two contact residue, and so on. When we extract the force constants from the Hessian matrices, only the force constants of two residues whose distance is less than 7A˚ are extracted.

Figure 5.1 shows the distribution of force constants collected from Hessian matrices (defined in cgNMA(M)) corresponding to two directly connected (s = 0) residue pairs (ALA-GLU, ALA-

GLY, ALA-LEU and ASP-GLU) for the large set of protein structures. From this Figure 5.1, we can see that the collected force constants range from magnitude 2 to 26. Even though they are connected directly (s = 0), they show not only wide range of force constants, but also a little different shape of distributions. Therefore, we need to see the distribution of forces between all possible residue pairs for the structures to determine the overall forces constants based on the physical environment.

Figure 5.2 shows the distribution of force constants for residue pairs in contact but separated by 0, 1, 2, 3 residues in sequence for the large set of downloaded structures. The graph on the left above of Figure 5.2 shows a relatively even distribution of force constants for the two residues connecting directly (s = 0). Most of the pairs have the force constants between 4 and

20 range and there is a high peak between 8 and 12 range. The right above graph shows a distribution of residue pairs separated by 1 residue (s = 1). It shows that most pairs have their force constants between 0 and 1.5 range, which are strictly less than those of previous pairs. The two bottom graphs show relatively small number of residue pairs because only force constants of residue pairs whose distance is less than 7A˚ are indicated. The bottom left graph

(s = 2) shows that most force constants are located between 0.5 and 1.5 range. From the 50

Figure 5.1: Comparison of 4 different residue pairs

The graph on the above shows the number of structures(vertical axis) of different values of force constants(horizontal axis) for the 4 different residue pairs ALA-GLU, ALA-GLY, ALA-LEU and ASP-GLU separated by 0 of 104 structures. bottom right graph (s = 3), we can see that most force constants are located between 0 and 1 range, which are relatively less than those of previous cases. It seems that the more residues have between two residue pairs, the weaker the force constants between two given residue pairs.

To sum up, for a fixed s, we can see that different residue pairs do not have significantly different force constants in the distribution. However, for different s, the force constants are different, especially between the directly connected pairs (s = 0) and other types of pairs

(s > 0). Clearly, the constants are around 12 units when s = 0, and are all around 1 unit when s > 0.

Based on the above survey, we have a reason to suggest that the forces between the residue 51

Figure 5.2: Comparison of all residue pairs

The graph on the above shows the number of structures(vertical axis) of different values of force constants(horizontal axis) for all the residue pairs separated by 0, 1, 2 and 3 of 104 structures. pairs in contact (s = 0) should not be treated as the same as (s = 1, 2, . . . etc.). At the very least, the forces between directly connected residue pairs should be stronger than others in a magnitude. In other words, the force constants for directly connected residue pairs should be at least one magnitude larger than others. These constants correspond exactly to the bidiagonal elements of the contact matrices. Therefore, a simple way to build a new Gaussian Network

Model that incorporates this non-homogeneity of the force constants is to define all the entries of the contact matrix in the same way as the conventional GNM except for the bidiagonal entries set to a number of larger magnitude, say -10. Such a model may reflect the real residue interactions in proteins more accurately. We call it a GNM model with non-homogeneous force 52 constants or nhGNM for short. An example nhGNM contact matrix can be found in Figure

5.3.

Figure 5.3: An example nhGNM contact matrix

 11 −10 0 0 −1 0   −10 21 −10 0 −1 0     0 −10 21 −10 0 0      . (5.1)  0 0 −10 21 −10 −1     −1 −1 0 −10 22 −10  0 0 0 −1 −10 11

5.3 Test Results

We have tested the nhGNM model on the large downloaded set of protein structures that discussed in Chapter 4, each time with the force constants for directly connected residue pairs set to −8, −9, −10, −11, −12, −13, −14, and −15, to see if the results vary with varying these constants. Again, we divide the given protein structures into three groups corresponding to relatively small-, medium-, and large-sized structures. The test results on each group are shown in Table 5.1, 5.2, 5.3, respectively. In each table, we show the B-factor correlations of the residue mean-square-fluctuations of the protein structures calculated by nhGNM. For compar- ison, we have also listed the B-factor correlations obtained from using GNM in each table. For both types of models, Cα atoms were used as the representative atoms for the residues. We use “+” to mean that nhGNM with one of the selected force constants has a higher B-factor correlation than GNM, “−” to mean the opposite. We use “=” to mean that the difference in the B-factor correlations computed by GNM and nhGNM is within 1.0e-02. 53 + = = + − = = − = = − + − + + − − = + − + = + + − = + − + = − + 0.43 0.75 0.86 0.36 -0.34 0.903 0.506 0.727 0.191 0.419 0.869 0.067 0.919 0.601 0.624 0.395 0.499 0.819 0.823 0.946 0.505 0.825 0.079 0.689 0.423 0.314 0.783 -0.744 -0.463 -0.115 -0.526 -0.023 0.43 0.82 0.69 0.78 0.902 0.512 0.728 0.196 0.749 0.426 0.868 0.075 0.918 0.862 0.602 0.624 0.398 0.508 0.351 0.824 0.946 0.509 0.823 0.075 0.424 0.318 a -0.744 -0.463 -0.107 -0.524 -0.006 -0.337 0.43 0.82 0.901 0.518 0.729 0.201 0.748 0.435 0.866 0.084 0.917 0.863 0.603 0.623 0.402 0.517 0.012 0.343 0.824 0.945 0.512 0.821 0.071 0.691 0.424 0.322 0.777 -0.744 -0.463 -0.099 -0.523 -0.333 0.9 0.43 0.82 -0.09 0.525 0.729 0.207 0.748 0.444 0.864 0.094 0.916 0.864 0.604 0.622 0.406 0.525 0.032 0.333 0.824 0.944 0.516 0.818 0.067 0.691 0.425 0.328 0.773 -0.744 -0.463 -0.521 -0.328 0.43 0.82 0.52 0.898 0.532 0.728 0.213 0.747 0.455 0.862 0.106 0.914 0.864 0.605 0.622 0.409 0.532 0.055 0.323 0.823 0.944 0.815 0.064 0.691 0.426 0.334 0.769 -0.744 -0.463 -0.079 -0.518 -0.323 ” - nhGNM has a lower B-factor correlation than GNM, “=” - nhGNM has a comparable − 0.43 0.54 0.22 0.82 0.897 0.726 0.746 0.468 0.859 0.121 0.912 0.862 0.607 0.621 0.413 0.539 0.079 0.312 0.821 0.943 0.524 0.811 0.061 0.691 0.427 0.341 0.765 -0.744 -0.463 -0.068 -0.515 -0.318 0.3 0.43 0.91 0.86 0.82 0.35 0.76 -0.51 0.894 0.548 0.724 0.228 0.745 0.483 0.856 0.138 0.608 0.621 0.416 0.545 0.105 0.818 0.941 0.529 0.806 0.059 0.691 0.428 -0.744 -0.463 -0.054 -0.311 Table 5.1: B-factor correlations of nhGNM for small structures 0.5 0.43 0.72 0.61 0.42 0.55 0.94 0.43 0.891 0.557 0.236 0.744 0.853 0.159 0.907 0.856 0.621 0.134 0.287 0.821 0.814 0.534 0.801 0.057 0.691 0.361 0.755 -0.744 -0.463 -0.038 -0.503 -0.303 0.689 0.434 0.562 0.523 0.270 0.185 0.754 0.628 0.808 0.432 0.844 0.616 0.625 0.656 0.431 0.530 0.327 0.155 0.821 0.656 0.901 0.594 0.706 0.127 0.671 0.434 0.591 0.674 -0.744 -0.463 -0.274 -0.142 6 6 8 5 5 13 16 16 15 12 12 18 12 20 13 29 35 36 36 13 31 39 43 46 45 40 51 52 64 55 52 65 51 55 69 96 109 110 111 112 138 147 147 153 160 172 214 225 255 257 273 277 295 335 345 361 367 382 387 399 452 470 491 503 ID TA TR GNM nhGNM-8 nhGNM-9 nhGNM-10 nhGNM-11 nhGNM-12 nhGNM-13 nhGNM-14 nhGNM-15 nhGNM-GNM ID-protein ID, TA-total number of atoms, TR-total number of residues, GNM - residue B-factor correlations using GNM, nhGNM-k - residue B-factor 1P9I 1U06 1O06 1AIE 1FF4 2OL9 2NLS 1HJE 1BX7 1OB4 1OB7 1Q9B 1USE 1XY2 1RJU 1PEF 1ETL 1YJO 2JKU 1GK7 2DSX 1VRZ 1PEN a 2OLX 35 4 0.885 0.885 0.8851ETN 0.885 0.885 0.885 0.885 0.885 0.885 = 6RXN 1NOT 1KYC 1UOY 1YZM 1AKG 1GVD 1ETM correlations using nhGNM with“+” -k - as the nhGNM value has for a the higher bidiagonal B-factor elements correlation in the than contact GNM, matrix, “ nhGNM-GNM - comparison between nhGNM and GNM: B-factor correlation with GNM 54 − + − = − + − + − − − = − + + − + − = − − + = + = + − = + − − = − + + -0.22 0.563 0.608 0.586 0.544 0.398 0.661 0.537 0.661 0.683 0.349 0.524 0.476 0.694 0.786 0.221 0.562 0.322 0.456 0.291 0.191 0.608 0.721 0.505 0.367 0.878 0.262 0.699 0.662 0.504 0.165 0.602 0.517 0.765 0.634 a 0.571 0.609 0.593 0.549 0.403 0.653 0.543 0.663 0.688 0.359 0.522 0.475 0.689 0.788 0.225 0.563 0.325 0.459 0.306 0.194 0.613 0.722 0.504 0.369 0.878 0.267 0.701 0.663 0.511 0.173 0.604 0.523 0.769 0.631 -0.223 0.55 0.79 0.23 0.53 0.579 0.609 0.601 0.554 0.409 0.644 0.665 0.693 0.371 0.519 0.475 0.684 0.564 0.328 0.462 0.321 0.197 0.618 0.724 0.503 0.372 0.878 0.272 0.703 0.663 0.519 0.183 0.605 0.773 0.628 -0.226 0.2 -0.23 0.588 0.608 0.609 0.558 0.415 0.634 0.557 0.665 0.699 0.383 0.517 0.475 0.679 0.792 0.235 0.564 0.332 0.466 0.337 0.622 0.725 0.502 0.374 0.878 0.277 0.706 0.663 0.528 0.192 0.607 0.537 0.777 0.625 0.24 0.78 0.597 0.607 0.618 0.563 0.421 0.623 0.564 0.666 0.705 0.395 0.515 0.474 0.673 0.794 0.564 0.337 0.469 0.354 0.204 0.627 0.727 0.501 0.377 0.878 0.283 0.708 0.663 0.537 0.202 0.609 0.544 0.621 -0.234 0.38 0.29 0.606 0.605 0.627 0.568 0.427 0.611 0.572 0.666 0.712 0.407 0.512 0.474 0.667 0.796 0.246 0.565 0.341 0.472 0.372 0.208 0.632 0.729 0.499 0.877 0.711 0.663 0.546 0.212 0.611 0.552 0.784 0.616 -0.237 0.58 0.42 0.51 0.56 0.616 0.601 0.637 0.573 0.434 0.598 0.665 0.719 0.474 0.661 0.797 0.252 0.564 0.347 0.476 0.392 0.231 0.636 0.731 0.497 0.384 0.876 0.297 0.713 0.662 0.556 0.223 0.613 0.788 0.612 -0.242 Table 5.2: B-factor correlations of nhGNM for medium structures 0.44 0.626 0.596 0.648 0.578 0.583 0.589 0.663 0.727 0.433 0.508 0.473 0.654 0.797 0.259 0.564 0.352 0.479 0.412 0.219 0.641 0.733 0.495 0.388 0.875 0.305 0.716 0.661 0.566 0.235 0.615 0.569 0.791 0.606 -0.246 0.690 0.419 0.747 0.583 0.485 0.398 0.654 0.605 0.798 0.495 0.549 0.497 0.572 0.695 0.348 0.517 0.421 0.481 0.613 0.368 0.632 0.741 0.466 0.397 0.820 0.433 0.710 0.615 0.631 0.331 0.620 0.594 0.771 0.529 -0.286 75 35 75 82 93 85 80 81 77 87 83 91 95 87 89 85 93 88 98 87 90 96 88 99 97 96 93 99 99 112 100 113 103 112 104 551 562 598 623 626 642 642 651 661 677 683 688 700 702 705 721 723 726 727 728 729 731 731 742 746 769 771 775 778 780 800 812 816 818 828 ID TA TR GNM nhGNM-8 nhGNM-9 nhGNM-10 nhGNM-11 nhGNM-12 nhGNM-13 nhGNM-14 nhGNM-15 nhGNM-GNM 1I71 See descriptions in Table 5.1 . 2IP6 1Z21 1R7J 1V05 1LR7 5222BF9 73 0.620 0.693 0.7 0.706 0.712 0.717 0.722 0.727 0.731 + 2FQ3 2CE0 2QJL 1FK5 2E3H 1N7E 2RB8 2PLT 2EHS 1ZVA 1X3O 3BZQ 2BRF 1ULR 1W2L 2PKT 2EAQ 5CYT 1ABA 1OPD 1USM 1UHA 1CYO 1NOA 1NNX 1QAU 2NUH 1GXU a 2MCM 55 + + + − + − + − = + + − + + − − − − − + − + + + + + − = − + + + + + 0.33 0.48 0.59 0.701 0.566 0.425 0.299 0.252 0.466 0.773 0.501 0.371 0.846 0.444 0.489 0.384 0.636 0.624 0.326 0.789 0.564 0.763 0.397 0.595 0.501 0.545 0.208 0.686 0.397 0.568 0.538 0.562 0.832 0.561 0.57 0.57 0.56 0.693 0.573 0.421 0.306 0.253 0.467 0.773 0.508 0.371 0.848 0.444 0.493 0.381 0.326 0.639 0.625 0.329 0.793 0.767 0.482 0.392 0.596 0.504 0.591 0.546 0.212 0.688 0.404 0.543 0.563 0.835 a 0.37 0.85 0.77 0.69 0.685 0.581 0.418 0.314 0.253 0.468 0.773 0.516 0.445 0.497 0.378 0.323 0.643 0.627 0.332 0.798 0.576 0.483 0.388 0.598 0.507 0.592 0.548 0.216 0.411 0.572 0.549 0.565 0.838 0.559 0.6 0.51 0.55 0.676 0.589 0.414 0.323 0.254 0.469 0.773 0.524 0.369 0.852 0.445 0.502 0.375 0.319 0.647 0.629 0.335 0.804 0.582 0.773 0.484 0.383 0.593 0.221 0.692 0.418 0.574 0.554 0.566 0.842 0.558 0.47 0.665 0.597 0.411 0.333 0.254 0.772 0.532 0.368 0.853 0.446 0.506 0.372 0.314 0.651 0.631 0.338 0.809 0.588 0.777 0.486 0.378 0.602 0.513 0.594 0.551 0.225 0.694 0.426 0.575 0.561 0.567 0.845 0.557 0.31 0.78 0.654 0.605 0.407 0.343 0.254 0.472 0.772 0.541 0.368 0.855 0.446 0.511 0.369 0.656 0.633 0.341 0.815 0.594 0.488 0.373 0.604 0.515 0.594 0.553 0.231 0.696 0.435 0.577 0.567 0.568 0.848 0.556 0.49 0.641 0.614 0.402 0.355 0.253 0.474 0.771 0.551 0.367 0.857 0.446 0.516 0.366 0.305 0.661 0.635 0.344 0.821 0.601 0.784 0.368 0.606 0.518 0.594 0.555 0.237 0.697 0.444 0.578 0.573 0.569 0.851 0.556 Table 5.3: B-factor correlations of nhGNM for large structures 0.3 0.77 0.52 0.626 0.622 0.398 0.369 0.252 0.476 0.561 0.366 0.858 0.446 0.521 0.362 0.667 0.638 0.347 0.827 0.608 0.787 0.492 0.362 0.607 0.593 0.557 0.244 0.699 0.454 0.579 0.579 0.569 0.853 0.555 0.418 0.607 0.351 0.547 0.212 0.494 0.742 0.637 0.379 0.843 0.417 0.562 0.334 0.270 0.685 0.668 0.368 0.859 0.616 0.761 0.514 0.309 0.560 0.497 0.576 0.549 0.365 0.696 0.552 0.515 0.512 0.515 0.826 0.528 90 64 108 101 111 106 104 113 102 113 113 107 122 122 123 107 122 188 175 206 203 221 205 231 204 204 213 227 224 237 233 237 238 135 846 855 855 863 864 872 873 878 879 890 906 925 934 937 937 963 991 1423 1476 1609 1700 1722 1736 1749 1750 1774 1810 1847 1848 1872 1885 1893 1948 2387 ID TA TR GNM nhGNM-8 nhGNM-9 nhGNM-10 nhGNM-11 nhGNM-12 nhGNM-13 nhGNM-14 nhGNM-15 nhGNM-GNM See descriptions in Table 5.1 . 2I24 2C71 2R16 1V70 832 105 0.162 0.1711IFR 0.17 0.17 0.17 0.17 0.17 0.169 0.169 = 1O08 1PZ4 1BYI 2IMF 1NLS 3SEB 1E5K 2V9V 2CG7 2VIM 2VPA 2PPN 1WHI 1ATG 1CCR 3VUB 1EW4 a 1RRO 1UKU 1AHO 1QTO 1NKO 2VYO 2HYK 2AGK 2HQK 2CWS 1PMY 1WPA 1WBE 56

We see from Table 5.1, 5.2, 5.3 that in many test cases, the nhGNM performed signifi- cantly better than GNM. Out of total 33 relatively small-sized structures, nhGNM predicted the residue mean-square-fluctuations no worse than GNM for 23 cases (Table 5.1). nhGNM outperformed GNM significantly for 12 cases. Out of total 36 relatively medium-sized struc- tures, nhGNM predicted the residue mean-square-fluctuations no worse than GNM for 19 cases. nhGNM outperformed significantly GNM for 13 cases. Out of total 35 relatively large-sized structures, nhGNM predicted the residue mean-square-fluctuations no worse than GNM for 23 cases. nhGNM outperformed GNM significantly for 20 cases.

Figure 5.4: Distribution of correlations from GNM and nhGNM

Distribution of correlations between experimental and the computed B-factors from nhGNM and GNM. It can be seen clearly that nhGNM yields higher correlations than GNM.

We provide a more detailed comparison of the B-factor correlations using the histograms of the distribution of correlation coefficients between the experimental B-factors and the predicted

B-factors from nhGNM-10 and GNM in Figure 5.4.

We also compute the percentage of B-factor correlations improved by using nhGNM over

GNM for further comparison of results between GNM and nhGNM. Figure 5.5 demonstrates 57 the improvement with nhGNM over the classical GNM for B-factor prediction. For more than

73% of protein structures, the B-factor correlations from nhGNM is at least comparable or better than the original GNM correlations.

Figure 5.5: Distribution of correlation improvement

Percentage (%) of correlation improvement in B-factors by using nhGNM in comparison with GNM. 58

CHAPTER 6. CONCLUSIONS AND FUTURE WORK

6.1 Summary

In this thesis, we have reviewed a well-known method in and simulation, so called Normal Mode Analysis and Gaussian Network Model with a given set of protein structures obtained from X-ray crystallography experiment. We also evaluated the Gaussian

Network Model with a large set of protein structures. A more physically based coarse-grained model was proposed and tested with the set of protein structures. Finally, a more refined and physics based Gaussian Network Model has been proposed and evaluated with the set of protein structures and compared to all the other models.

A normal mode is a pattern of oscillation system having a specific frequency and phase.

The theory of normal mode were applied to a small biological interests BPTI [95] a few years after the first molecular dynamics simulation. Since then, the normal mode analysis has been a useful method to understand the structural fluctuations of biological interests.

Let n be the number of atoms in a given protein, and r(t)={ri(t): i = 1,..., 3n} be the collection of the coordinate of the atoms at time t where r3j−2(t), r3j−1(t), r3j(t) are the x, y, z coordinates of atom i at time t, i = 1, . . . , n. Let Ep(r) be the potential energy function. Then system of equations of motion is given by

00 Mr = −∇Ep(r) (6.1) where M is a diagonal matrix with the diagonal element mii being the mass of atom i. [56]. The system of equation (2.4) cannot be solved analytically but numerically when consid- ering the potential energy approximation at the minimum as described in (2.6).

By using this approximation, the solution of the system of equations (2.4) can be computed analytically through the singular value decomposition of the mass-weighted-Hessian matrix, 59 and finally the mean-square-fluctuation of the ith mode motion of the molecule is given by the following equation[52]

3n X −1 h∆ri, ∆rii = kBT UijΛjj Uij, i = 1,..., 3n. (6.2) j=1

The mean square fluctuations are related to experimentally detected average fluctuation so called B-factor in X-ray crystallography. Thus, predicted fluctuations and experimental

fluctuations can be compared together.

Bahar et al. introduced later the single parameter potential so called Gaussian Network

Model [29]. In this residue-level model, the mean-square fluctuations can be calculated via the singular value decomposition of the contact matrix (2.23) which does not require the energy function but only requires the distances between residues. Therefore, Gaussian Network Model is much less expensive than Normal Mode Analysis regarding the computational cost.

We have examined a few possible approaches to coarse-grained normal mode analysis. Sim- ilar to GNM, we take the amino acids so called residues as the basic units in a protein structure and use a single backbone atom N, C, or Cα (or a combination of several backbone atoms) to represent each residue. Different from GNM, the force constants, entries of the contact matrix are not the same. In other words, the force constants for pairs of representative atoms are not the same as those of GNM, and force constants are instead extracted from the Hessian matrix of the potential energy function of the given structure as described in equation (4.1). Our results showed that cgNMA(M) performed relatively consistently greater than when a single backbone atoms was used to obtain the contact matrix.

Using the new models, we have calculated the mean-square-fluctuations of the residues and their correlations with the experimental B-factors (called the B-factor correlations) for a large set of protein structures (see Table 4.1, 4.2, 4.3). We have compared them with the all-atom normal mode analysis and the residue-level Gaussian Network Model (see Table

4.4, 4.5, 4.6). We have shown that our models performed more efficiently than the all-atom normal mode analysis, and the B-factor correlations are also higher. The B-factor correlations are comparable with those estimated by the Gaussian Network Model and in some cases better.

Following the development of the coarse-grained NMA, we have conducted a statistical 60 survey on the extracted force constants, for different pairs of residues with different numbers of separation residues in sequence (see 5.1, 5.2). We then based on their statistical averages to build a refined Gaussian Network Model so called non-homogeneous Gaussian Network

Model. A nhGNM is proposed to provide a finer-grained model for residue fluctuations than the conventional GNM, yet not requiring all the detailed atomic level information as NMA. A stronger physical basis than GNM is used in nhGNM to obtain a more accurate description of residue-residue interactions based on the actual physical interactions among the representative atoms in the corresponding pairs of residues.

We have shown that the force constants for the neighboring residue pairs are always about one-magnitude larger than other pairs in contact. Therefore, in the refined GNM, the entries of the contact matrix could be defined in the same way as the conventional GNM except that the bidiagonal entries have large magnitudes. We have computed mean-square-fluctuations of the residues and their B-factor correlations for a large set of protein structures using various bidiagonal elements, and also compared them with conventional residue-level GNM (see Table

5.1, 5.2, 5.3). We have shown that such a simply refined GNM could predict residue-level structural fluctuations significantly better than the conventional GNM in many test cases.

6.2 Future Directions

The coarse-grained Normal Mode Analysis and refined Gaussian Network Model do not always outperform conventional methods such Gaussian Network Model. In our future efforts, we would like to investigate each of our test results more carefully and find the reasons for the disagreements between the predicted structural fluctuations and the experimental temperature factors, and to further improve our predictions. For example, for nhGNM, if we differentiate further the tri-diagonal elements with the other elements of the contact matrix based on the distribution results, we may see some improvements in the model with more physical basis.

However, in order to determine the reasons of the disagreements, we may look through the test cases more carefully, and try to find some common factors may be in the X-ray crystallography structures.

Another future direction to explore is to develop a relatively accurate residue-level energy 61 function so that the residue-level Normal Mode Analysis can be carried out exactly as the atomic-level Normal Mode Analysis. This could be difficult but challenging because of the lack of complete physical understanding of the residue interactions.

We would also like to apply our methods to a few proteins of specific biological interest and importance such as the HIV-1 protease and analyze the results in great details. The structures of these proteins have just been modeled in recent studies. Accurate predictions on their structural fluctuations may provide great insights into how the dynamics of the structures relate to their functions. Moreover, we also would like to apply our methods to very large supra-molecular assemblies such as the , viral , and motor proteins.

In the previous work we considered 20 amino acids pairs to extract the force constants between the residue pairs. The 20 amino acids can be divided as 8 groups depending on the property of side chains. Those are amino acids with electrically positively or negatively charged, polar uncharged, hydrophobic side chains and 4 special cases. So our another work is instead of considering 20 amino acids pairs to consider the force constants between the 8 groups of representatives. Then without knowledge of amino acids, only group name can be needed to obtain force constants.

Finally, our future work to consider is to document these statistical results for more struc- tures. In the previous work, only more than one hundred structures were examined to get the force constants. There are more than tens of thousand structures deposited in Protein Data

Bank [98]. More structures can be examined to extract more force constants for the residue pairs in order to obtain more reliable the diagonal element of force constants. Furthermore, a database can be made so that if the two residue pairs with the number of residues that connect two residues are given, then the force a constant between them can be obtained. 62

Bibliography

[1] Cui, Q., Bahar, I. (2006). Normal Mode Analysis: Theory and Application to biological

and chemical systems. Chapman & Hall/CRC

[2] Eran Eyal, Lee-Wei Yang and Ivet Bahar (2006). Anisotropic Network Model: Systematic

Evaluation and a New Web Interface

[3] Wako, H., Otsuka, M., Tomizawa, Y., Kato, M., Endo, S. (2005). Comparisons between

Dynamic Properties of Homologous Protein Structures in ProMode.

[4] (2003). Classical Normal Mode Analysis: Harmonic Approximation. (lecturenote)

http://www.colby.edu/chemistry/PChem/notes/NormalModesText.pdf

[5] Tama, F., Valle, M., Frank, J., Brooks C.L. (2003). Dynamic reorganization of the func-

tionally active ribosome explored by normal mode analysis and cryo-electron microscopy.

PNAS, 100(16), 9319-9323

[6] Valadie, H., Lacapc?re, J.J., Sanejouand, Y.-H., Etchebest, C. (2003). Dynamical Prop-

erties of the MscL of Escherichia coli: A Normal Mode Analysis, Journal of Molecular

Biology, 332, 657-674

[7] Yang,L.W., Liu,X., Jursa,C.J., Holliman,M., Rader,A.J., Karimi,H.A. and Bahar,I. (2005).

iGNM: a database of protein functional motions based on Gaussian Network Model. Bioin-

formatics, 21, 2978-2987.

[8] Van Wynsberghe, A. W., Cui, Q. (2006). Interpreting Correlated Motions Using Normal

Mode Analysis. Structure, 14, 1647-1653

[9] Tong, Y. (1990). The Multivariate Normal Distribution. Srpinger-Verlag 63

[10] Durand, P., Trinquier, G., Sanejouand, Y. (1994). A new approach for determining low

frequency normal modes in , Biopolymers, 34, 759

[11] Yang, L.W., Rader, A. J., Liu, X., Jursa, C.J., Chen, S.C., Karimi, H.A., Bahar, I. (2006).

oGNM: online computation of structural dynamics using the Gaussian Network Model.

Nucleic Acids Research, 34, W24-W31

[12] Suhre,K. and Sanejouand,Y.-H. (2004) On the potential of normal-mode analysis for solv-

ing difficult molecular-replacement problems. Acta Crystallogr., D60, 796-799.

[13] Kondrashov, D.A., Cui, C., Phillips, G.N. (2006). Optimization and Evaluation of a

Coarse-Grained Model of Protein Motion Using X-Ray Crystal Data. Biophysical Journal,

91, 2760-2767

[14] Schlick, T.(2003). Molecular Modeling and Simulation, An Interdisciplinary Guide,

Springer

[15] Eman, B. (2006). The Gaussian Network Model: Precise Prediction of Residue Fluctua-

tions and Application to Binding Problems. Biophysical Journal, 91, 3589-3599

[16] Li, G., and Q. Cui. (2002). A coarse-grained normal mode approach for macromolecules:

an efficient implementation and application to Ca21-ATPase. Biophys. J. 83:2457-2474.

[17] Bahar, I., Atilgan, A., Demirel, C., Erman, B. (1998). Vibrational Dynamics of Folded

Proteins: Significance of Slow and Fast Motions in Relation to Function and Stability.

Physical Review Letters, 80(12), 2733-2736.

[18] Micheletti, C., Carloni, P., Maritan, A. (2004). Accurate and efficient description of protein

vibrational dynamics: comparing molecular dynamics and Gaussian models. Proteins, 55,

635-645

[19] Tama, F., Miyashita, O., Brooks III, C. (2004). Flexible multi-scale fitting of atomic

structures into low-resolution electron density maps with elastic network normal mode

analysis. J. Mol. Biol, 337, 985-999 64

[20] Hinsen, K. (1998). Analysis of Domain Motions by Approximate Normal Mode Calcula-

tions, Proteins, 33:417-429

[21] Hinsen, K., Thomas, A, Field, M.J. (1999). Analysis of Domain Motions in Large Proteins,

Proteins, 34:369-382

[22] Atilgan, A. R. , Durell, S. R., Jernigan, R. L., Demirel, M. C., Keskin, O. and Bahar I.

(2001). Anisotropy of Fluctuation Dynamics of Proteins with an Elastic Network Model,

Biophysical Journal, 80:505-515

[23] Tama, F., F. X. Gadea, O. Marques, and Y. Sanejouand. (2000). Building block approach

for determining low-frequency normal modes of macromolecules. Prot. Struct. Funct. Gen.

41:1-7.

[24] Petrone, P., Pande, V.S. (2006). Can Conformational Change Be Described by Only a Few

Normal Modes? Biophysical Journal, 90, 1583-1593

[25] Bahar,I. and Rader,A.J. (2005) Coarse-grained normal mode analysis in structural biology.

Curr. Op. Struct. Biol., 15, 586-592.

[26] Eom, K., Na, S. (2008). Coarse-Grained Structural Model of Protein Molecules. Preprint

to the chapter of the book ”Computational Biology: New Research”

[27] Keskin, O., Jernigan, R., Bahar, I. (2000). Proteins with Similar Architecture Exhibit

Similar Large-Scale Dynamic Behavior. Biophysical Journal, 78, 20932106

[28] Frauenfelder, H., Parak, F., Young, R. (1988). Conformational substates in proteins. Ann.

Rev. Biophys. Chem, 17, 451-479

[29] Bahar, I., Atilgan, A.R., Erman, B. (1997). Direct evaluation of thermal fluctuations in

proteins using a single-parameter harmonic potential, Folding & Design, 2, 173-181

[30] Brooks, B., Bruccoleri, R. Olafson, B., States, D., Swaminathan, S., Karplus, M. (1983).

CHARMM: a program for macromolecular energy, minimization, and dynamics calcula-

tions. J. Comp. Chem, 4, 187 65

[31] Tama, F., Valle, M., Frank, J., Brooks, C.L. (2003). Dynamic reorganization of the func-

tionally active ribosome explored by normal mode analysis and cryo-electron microscopy.

PNAS, 100(16), 9319-9323

[32] Doruker, P., A. R. Atilgan, and I. Bahar. (2000). Dynamics of proteins predicted by molec-

ular dynamics simulations and analytical approaches: application to a-Amylase inhibitor.

Prot. Struct. Funct. Gen. 40:512-524.

[33] Suhre,K. and Sanejouand,Y.H. (2004) ElNemo: a normal mode web server for protein

movement analysis and the generation of templates for molecular replacement. Nucleic

Acids Res., 32, W610-W614.

[34] Brooks BR and Karplus M (1983) Harmonic dynamics of proteins: normal mode and

fluctuations in bovine pancreatic trypsin inhibitor. Proceedings of the National Academy

of Sciences of the USA 80: 6571-6575.

[35] Wako, H., Kato, M., Endo, S. (2003). Improvements in ProMode (a Database of Normal

Mode Analyses of Proteins). Genome Informatics, 14, 663-664

[36] Lindahl, E., Azuara, C., Koehl, P., Delarue, M. (2006). NOMAD-Ref: visualization, de-

formation and refinement of macromolecular structures based on all-atom normal mode

analysis. Nucleic Acids Research, 34, W52-W56

[37] Perahia, D., Mouawad, L. (1995). Computation of low-frequency normal modes in macro-

molecules: improvements to the method of diagonalization in a mixed basis and application

to hemoglobin. Comput. Chem, 19, 241

[38] Krebs,W.G., Alexandrov,V., Wilson,C.A., Echols,N., Yu,H. and Gerstein,M. (2002) Nor-

mal mode analysis of macromolecular motions in a database framework: developing mode

concentration as a useful classifying statistic. Proteins: Struct. Funct. and Genet., 48,

682-695.

[39] Lu, M., Ma, J. (2005). The role of shape in determining molecular motions. Biophys. J.,

89, 2395-2401 66

[40] Tama F. (2003). Normal mode analysis with simplified models to investigate the global

dynamics of biological systems. Protein and Peptide Letters, 10(2), 119-132

[41] Alexandrov,V., Lehnert,U., Echols,N., Milburn,D., Engelman,D. and Gerstein,M. (2005)

Normal modes for predicting protein motions: a comprehensive database assessment and

associated Web tool. Prot.Sci., 14, 633-643.

[42] Van Wynsberghe, A., Li, G., Cui, Q. (2004). Normal-Mode Analysis Suggests Protein

Flexibility Modulation throughout RNA Polymerase’s Functional Cycle. Biochemistry, 43,

13083-13096

[43] Harary, F. (1971) Graph theory. Addison-Wesley.

[44] Delarue,M. and Dumas,P. (2004) On the use of low-frequency normal modes to enforce

collective movements in refining macromolecular structural models. Proc. Natl. Acad. Sci.,

101, 6957-6962.

[45] Wako, H., Endo, S. (2002). ProMode: A Database of Normal Mode Analysis of Proteins.

Genome Informatics, 13, 519-520

[46] Gorba, C., Miyashita, O., Tama, F. (2008). Normal mode flexible fitting of high resolution

structure of biological molecules toward one dimensional low resolution data. Biophysical

Journal, 94, 1589-1599

[47] Delarue,M. and Sanejouand,Y.-H. (2002) Simplified normal mode analysis of conforma-

tional transitions in DNA-dependent polymerases: the elastic network model. J. Mol.

Biol., 320, 1011-1024.

[48] Thomas, A., K. Hinsen, M. J. Field, and D. Perahia. (1999). Tertiary and quaternary

conformational changes in aspartate transcarbamylase: a normal mode study. Prot. Struct.

Funct. Gen. 34:96-112.

[49] Ma JP (2005) Usefulness and limitations of normal mode analysis in modeling dynamics

of biomolecular complexes. Structure 13(3): 373-380. 67

[50] Hollup,S.M., Salensminde,G. and Reuter,N. (2005) WEBnm@: a web application for nor-

mal mode analysis of proteins. BMC Bioinformatics, 6, 52.

[51] Cui, Q., Li, G., Ma, J., Karplus, M. (2004). A Normal Mode Analysis of Structural

Plasticity in the Biomolecular Motor F1-ATPase. Journal of Molecular Biology, 340, 345-

372

[52] Levitt, M., Sander, C., Stern, P.S. (1985). Protein Normal-mode Dynamics: Trypsin In-

hibitor, Crambin, Ribonuclease and Lysozyme. Journal of Molecular Biology, 181, 423-447

[53] Li, G., Cui, Q. (2004). Analysis of Functional Motions in Brownian Molecular Machines

with an Efficient Block Normal Mode Approach: Myosin-II and Ca21-ATPase. Biophysical

Journal, 86, 743-763

[54] Yu, X., Park, J., Leitner, D. (2003). Thermodynamics of Protein Hydration Computed by

Molecular Dynamics and Normal Modes. Journal of physical chemistry, 107, 12820-12828

[55] Chacon, P., Tama, F., Wriggers, W. (2003). Mega-Dalton biomolecular motion captured

from electron microscopy reconstructions. J. Mol. Biol, 326, 485-492

[56] Wu, Z. (2006). Linear Algebra in Biomolecular Modeling. Handbook of Linear Algebra,

Chapman/Hall CRC Press

[57] Kloczkowaski, A., Mark, J., Erman, B. (1989). Chain Dimensions and Fluctuations in

Random Elastomeric Networks. 1. Phantom Gaussian Networks in the Undeformed State.

Macromolecules, 22, 1423-1432

[58] Huang, Kerson (1990). Statistical Mechanics. Wiley, John & Sons, Inc. ISBN 0-471-81518-

7.

[59] Alder, B. J.; Wainwright, T. E. (1959). Studies in Molecular Dynamics. I. General Method.

Journal of Chem. Phys. 31(2), 459

[60] Go, N., T. Noguti, and T. Nishkawa. (1983). Dynamics of a small globular protein in terms

of low-frequency vibrational modes. Proc. Natl. Acad. Sci. USA. 80:3696-3700. 68

[61] Perahia, D., Mouawad, L. (1995). Computation of low-frequency normal in macro-

molecules: improvements to method of diagonalization in a mixed and application to

hemoglobin. Computers Chem., 19(3), 241-246

[62] Marques, O., and Y. Sanejouand. (1995). Hinge-bending motion in citrate synthase arising

from normal mode calculations. Prot. Struct. Funct. Gen. 23:557-560.

[63] Darst, S., Opalka, N., Chacon, P., Polyakov, A., Richter, C., Zhang, G. (2002). Confor-

mational flexibility of bacterial RNA polymerase. PNAS. 99(7), 4296-4301

[64] Hayward S. (2001). Normal mode analysis of biological molecules. In Computational bio-

chemistry and biophysics. New-York, Marcel Dekker, Inc., 153-168.

[65] Miyashita, O., Tama, F. (2007).Normal Mode Analysis Techniques in Structural Biology.

Encyclopedia of life Sciences, John Wiley & Sons.

[66] http://emotion.biomachina.org/about.html

[67] Delarue M, Lindahl E. (2004). Normal mode calculation and visualization using Pymol.

[http://lorentz.immstr.pasteur.fr/nma/].

[68] Reuter N, Hinsen K, Lacapere JJ. (2003). Transconformations of the SERCA1 Ca-ATPase:

a normal mode study. Biophys J, 85:2186-2197.

[69] Brink J, Ludtke SJ, Kong YF,Wakil SJ, Ma JP, Chiu W. (2004). Experimental verification

of conformational variation of human fatty acid synthase as predicted by normal mode

analysis. Structure 12:185-91

[70] Brooks, B., and Karplus, M. (1985). Normal Modes for Specific Motions of Macromolecules:

Application to the Hinge-Bending Mode of Lysozyme. Proc. Natl. Acad. Sci. USA 82, 4995-

4999.

[71] Kidera, A., Inaka, K., Matsushima, M., Go, N. (1991). Normal mode refinement: Crystallo-

graphic refinement of protein dynamic structure applied to human lysozyme. Biopolymers,

32(4), 315-319. 69

[72] Ma, J., and Karplus, M. (1997). Ligand-induced conformational changes in ras p21: a

normal mode and energy minimization analysis. J. Mol. Biol. 274, 114-131.

[73] Wu, Z. (2008). Lecture notes on computational structural biology. Singapore ; Hackensack,

NJ : World Scientific

[74] Seno, Y., and Go, N. (1990b). Deoxymyoglobin studied by the conformational normal

mode analysis. I. Dynamics of globin and the heme-globin interaction. J. Mol. Biol. 216,

95-109.

[75] Tama, F., Miyashita, O., Brooks, C.L. (2004). Normal mode based flexible fitting of high-

resolution structure into low-resolution experimental data from cryo-EM. Journal of Struc-

tural Biology, 147, 315-326

[76] Thomas, A., Field, M.J., and Perahia, D. (1996b). Analysis of the Low-frequency Normal

Modes of the R State of Aspartate Transcarbamylase and a Comparison with the T State

Modes. J. Mol. Biol. 261, 490-506.

[77] Tirion, M.M., ben-Avraham, D., Lorenz, M., and Holmes, K.C. (1995). Normal modes as

refinement parameters for the F-actin model. BiAcad. ophys. J. 68, 5-12.

[78] Tozinni, V., Mccammon JA. (2005). A coarse grained model for the dynamics of ap opening

in HIV-1 protease. Chem. Phys. Lett. 413,123-128.

[79] Moore, W. J.(1972). Physical Chemistry, 4th Ed., Prentice-Hall: Englewood Cliffs, NJ, ,

Chapter 17, Sec. 14., 775-776.

[80] Godsil,C. (2001). Algebraic graph theory. New York : Springer ISBN:0387952209

[81] Serway, R., Jewett, J. (2003). Physics for Scientists and Engineers. Brooks/Cole. ISBN

0-534-40842-7.

[82] Ma, J. (2004). New Advances in Normal Mode Analysis of Supermolecular Complexes and

pplications to Structural Refinement. Current Protein and Peptide Science, 5(2), 119-123 70

[83] Wako, H., Otsuka, M., Tomizawa, Y., Kato, M., Endo, S. (2005). Reflection of Knowledge

Information in ProMode.

[84] Mouawad, L., Perahia, D. (1993). Diagonalization in a mixed basis: a method to compute

low-frequency normal modes for large macromolecules. Biopolymers, 33, 599-611.

[85] Bahar, I., Jernigan, R. L. (1998). Vibrational dynamics of transfer RNAs: comparison of

the free and synthetase bound forms. J. Mol. Biol. 281, 871-884

[86] Goldstein, H. (2002). Classical mechanics. Addison Wesley

[87] Casella, G., Berger, R. (2002). Statistical inference. Duxbury.

[88] Haliloglu, T., Bahar, I., Erman, B. (1997). Gaussian Dynamics of Folded Proteins. Physical

Review Letters, 79(16), 3090-3093.

[89] Tirion, M. (1996). Large Amplitude Elastic Motions in Proteins from a Single-Parameter,

Atomic Analysis. Physical Review Letters, 77(9), 1905-1908.

[90] Tama, F., Sanejouand, Y. (2001). Conformational change of proteins arising from normal

mode calculations. Protein Engineering, 14(1), 1-6

[91] Tama, F., Brooks, C. (2006). Symmetry, Form, and Shape: Guiding Principles for Ro-

bustness in macromolecular Machines. Annu. Rev. Biophys. Biomo. Struct, 35: 115-133

[92] Wilson, K., Malcolm, B., Matthews, B. (1992). Structural and thermodynamic analysis

of compensating mutations within the core of chicken egg white lysozyme. J. Biol. Chem,

267, 10842

[93] Xu, C., Tobi, D., Bahar, I. (2003). Allosteric Changes in Protein Structure Computed by

a Simple Mechanical Model: Hemoglobin T ↔ R2 Transition, J. Mol. Biol, 333, 153-168

[94] Tama, F., Wriggers, W., Brooks III, C.(2002). Exploring Global Distortions of Biological

Macromolecules and Assemblies from Low-resolution Structural Information and Elastic

Network Theory. J. Mol. Biol, 321, 297305 71

[95] Brooks, B. R. and Karplus, M., Harmonic dynamics of proteins: normal mode and fluc-

tuations in bovine pancreatic trypsin inhibitor, Proceedings of the National Academy of

Sciences of the USA 80, 6571-6575, 1983.

[96] Brooks III, C. L., Karplus, M., and Pettitt, B. M., Proteins: A Theoretical Perspective of

Dynamics, Structure, and Thermodynamics, Wiley, 1989.

[97] Sen, T. Z. and Jernigan, R. L., Optimizing the parameters of the Gaussian Network

Model for ATP-binding proteins, in Normal Mode Analysis: Theory and Applications to

Biological and Chemical Systems, Cui, Q. and Bahar, I., Eds., 171-186, 2006.

[98] Berman, H. et al, PDB Data Bank Annual Report, http:www.rcsb.orgpdb, 2010.

[99] Brooks, B. R., et al., CHARMM: The biomolecular simulation program, Journal of Com-

putational Chemistry 30, 1545-1614, 2009.

[100] Gur, M., Erman, B. (2012). Quasi-Harmonic Fluctuations of Two Bound Peptides, Pro-

teins, 72

ACKNOWLEDGEMENTS

First and foremost, I would like to deeply thank my advisor Dr. Zhijun Wu for his guidance, patience and support throughout this research and the writing of this thesis. I would also like to thank Dr. Robert Jernigan for his collaboration and valuable contributions to this work.

I would also like to thank my other committee members: Dr. Jennifer Davidson, Dr. Karin

Dorman, and Dr. Steven Hou for their efforts and contributions to this work.