Finding the Free Energy Profiles of Protein Transitions

Olle Hedkvist KTH September 22, 2015

Abstract

The rapid discovery of new, highly resolved protein structures, made possible by developments in x-ray crystallography as well as NMR spectroscopy and cryo-electron microscopy, has created a situation where in many cases the structures of several stable or meta-stable states of a protein are well known, while the question of how the protein transitions between these states remains open. This area is of great interest to the pharmaceutical industry, as it may in the future reveal novel target areas for drugs. As no real-world experiment can as yet provide data at a sufficiently detailed level, the question is well suited to molecular dynamics simulations. Our knowledge of the terminal states of the motion should in principle make it possible to simply let a simulation proceed until the system has transitioned from the initial to the end state, providing complete knowledge of the transitional motion. For large systems, however, this would take far more computational power than is currently available even on modern supercomputers. For this reason a number of approaches have been developed to decrease the needed computation, either by forcing the system to transition by applying external forces, or by reducing the complexity of the system using coarse graining. This work investigates the free energy landscapes traversed during transitions of the SERCA P-type ion pump, using umbrella sampling of intermediate structures along pathways arrived at through a method building on elastic network modelling, Brownian dynamics and principal component analysis.

Acknowledgements

I want to thank my supervisor Magnus Andersson as well as all the wonderful people at TCB lab for their help and company during the time of this master thesis project.

Contents

I Introduction and method

1 Introduction
  1.1 Sarcoplasmic Reticulum Ca2+ ATPase (SERCA)
    1.1.1 Function
  1.2 Structure
  1.3 Generating a Transition Pathway
    1.3.1 Brownian Dynamics
    1.3.2 Elastic Network Modelling
  1.4 Umbrella sampling
  1.5 Simulation
    1.5.1 Coupling Algorithms
    1.5.2 Gromacs
    1.5.3 Plumed

2 Method
  2.1 Simulations
    2.1.1 Forcefield
    2.1.2 Lipid model
    2.1.3 Simulation series
    2.1.4 Equilibration
    2.1.5 Production
  2.2 Umbrella sampling
    2.2.1 Validation

II Results and Discussion

3 Results
  3.1 Free energy profiles
  3.2 Validation
    3.2.1 Histograms
    3.2.2 PCA
    3.2.3 RMSD

4 Discussion
  4.1 Results
    4.1.1 Free energy profiles
    4.1.2 Principal component analysis
    4.1.3 RMSD
  4.2 Reliability
  4.3 Applications

Part I Introduction and method

1 Introduction

1.1 Sarcoplasmic Reticulum Ca2+ ATPase (SERCA)

1.1.1 Function

SERCA is a membrane-bound P-type protein abundant in the sarcoplasmic reticulum (SR) of muscle cells. Its function is to pump calcium against its electrochemical gradient into the SR and out of the cytosol using ATP hydrolysis. The resulting low cytoplasmic calcium concentration allows the sudden release of calcium from the SR, triggered by an incoming action potential, to serve as a signal for the muscle to contract. The cyclical action of SERCA is generally modelled as a seven-state model as in the figure below, first proposed by Meis et al. [1]. Each of these states corresponds to a distinct protein conformation.

Figure 1: A modern version of the flowchart for the SERCA pumping action as described by Meis et al., produced by Møller et al. [2] (the numbering was added by us).

The most important changes that occur over the course of this cyclical action are the interchange of ions, the binding and release of ATP and ADP, the phosphorylation of the protein and the shift of the protein between the two energy states E1 and E2. To complement the information represented in Fig 1, the changes that occur are also described in words below, beginning from the HnE2-ATP state at the top left and proceeding clockwise around the cycle:

(1-2) The protein relaxes into the low energy E1 state.
(2-3) The protein picks up two calcium ions from the cytoplasm.
(3-4) The protein then hydrolyses the ATP and is excited to an intermediate state where the calcium ions are occluded and cannot return to the cytoplasm. The protein binds the phosphate released by the ATP hydrolysis, changing the global minimum of the free energy landscape to correspond to the E2 state.
(4-5) The protein relaxes into its new minimum energy state E2, releasing ADP and binding ATP. In the E2 state the path for the calcium into the cytoplasm is still occluded, whereas a new path is opened into the SR.
(5-6) The calcium ions escape through this path and are replaced by hydrogen ions.
(6-7) As the bound phosphate group is highly energetic it is only a matter of time before it is released by the protein, at which time the original E1 state is once again favoured.
(7-1) The protein relaxes into its new minimum energy state E1.

Several of the protein conformations corresponding to these states have been identified using x-ray crystallography. A search of the protein data bank found the following unique conformations, identified here by their protein data bank id tags:

Structures
State            PDB id
[Ca2]E1 PADP     1T5T
Ca2E2P-ATP       3B9B
Ca2E1-ATP        1T5S
HnE2-ATP         1IWO

1.2 Structure

SERCA has four functionally distinct domains: the nucleotide binding domain (N), the phosphorylation domain (P), the actuator domain (A) and the transmembrane domain (TM). The N-domain contains the binding pocket where the nucleotides, i.e. ATP and ADP, are bound to the protein. The P-domain contains the protein residue which receives the phosphate group cleaved by hydrolysis from the ATP molecule. The A-domain is instrumental in transferring the conformational change from the energy released through the ATP hydrolysis to the TM-domain, making the ion binding sites inside the TM-domain accessible to the environment on either the SR or the cytoplasmic side. These domains are shown in Fig 2 for the SERCA protein in the 1T5T conformation.

Figure 2: A cartoon representation of the SERCA protein coloured to show the different domains: red = N-domain, blue = A-domain, yellow = P-domain, cyan = TM-domain.

The SERCA structure is also closely related to those of many other ATP-powered ion transporters such as the Na+,K+-ATPase and plant H+-ATPases, and results pertaining to the SERCA protein have been used as a framework in the work with these other proteins [3].

At the time this was written (2015-01-18) there were 53 different x-ray structures of the SERCA protein in the protein data bank, covering the states listed in the table above and also many different ligand-conjugated variants of these states. It should be noted that x-ray structures are never fully representative of the protein conformation in vivo. This is because, in order to achieve a strong enough x-ray interference pattern, the signal from a single protein must be enhanced by many orders of magnitude, which is achieved by producing a protein crystal. However, a crystal is a very different environment from that which the protein naturally occupies, and this is especially true when additional salts and other compounds are added in order to stabilize the crystal. For states that are thought to bind ligands in vivo, in the case of SERCA ATP or ADP, these ligands are often exchanged for homologues in order to increase the stability of the crystal.

1.3 Generating a Transition Pathway

A transition pathway is the path that a protein follows as it transitions between two different stable conformational states. Many different ideas and recipes for finding transition pathways can be found in the literature, utilising many different simulation methods; one example is the ANPpathway method [4]. For this project we made use of ready-made pathways generated by Laura Orellana using a method relying on Brownian dynamics and elastic network modelling.

1.3.1 Brownian Dynamics

Brownian dynamics is a simulation method which, rather than solving Newton's equations for all particles in the system as in molecular dynamics, treats the system in a more abstract way. Instead of simulating water molecules colliding with the protein, and thereby supplying the energy the protein needs to overcome energy barriers and adopt new conformations, it simply treats these collisions as random additions of energy. This approach has been used to simulate a wide variety of systems, such as ionic currents through calcium channels [5] and the effects of phosphorylation of polypeptides [6]. In Brownian dynamics the equations of motion are written as a simplified version of the Markovian Langevin equation, assuming a diagonal friction tensor and neglecting the inertial term:

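\[
m_i \dot{\vec{v}}_i = -\frac{\partial V^{mf}}{\partial \vec{r}_i} - \sum_j \zeta_{ij} \vec{v}_j(t) + \vec{\eta}_i(t)
\]

\[
\Downarrow
\]

\[
\dot{x} = \frac{1}{\zeta} \vec{F}^{sys} + \frac{\eta(t)}{\zeta}
\]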

Here η(t) is the random term that adds energy to the system at point x along any generalized coordinate, ζ is the friction coefficient and F^sys is the systematic force acting on the system.
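As an illustration of the overdamped equation above, the following is a minimal one-dimensional sketch and not the code used to generate the pathways. It assumes a simple Euler-Maruyama discretisation and a noise amplitude fixed by the fluctuation-dissipation relation, neither of which is stated in the text; the function name and all parameters are hypothetical.

```python
import numpy as np

def brownian_trajectory(force, x0, zeta, kT, dt, n_steps, rng):
    """Integrate dx/dt = F(x)/zeta + eta(t)/zeta with Euler-Maruyama steps.

    Assumption: <eta(t) eta(t')> = 2 * zeta * kT * delta(t - t'), so each step
    adds a Gaussian displacement of standard deviation sqrt(2 * kT * dt / zeta).
    """
    x = np.empty(n_steps + 1)
    x[0] = x0
    noise_scale = np.sqrt(2.0 * kT * dt / zeta)
    for n in range(n_steps):
        drift = force(x[n]) / zeta * dt                 # systematic force term
        kick = noise_scale * rng.standard_normal()      # random term
        x[n + 1] = x[n] + drift + kick
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Harmonic well F(x) = -k x with k = 2 (arbitrary units).
    traj = brownian_trajectory(lambda x: -2.0 * x, 1.0, zeta=1.0, kT=1.0,
                               dt=0.01, n_steps=10000, rng=rng)
    print("mean position:", traj.mean())
```

Setting kT to zero removes the random term, and the trajectory then relaxes deterministically towards the energy minimum, which is a convenient sanity check.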

1.3.2 Elastic Network Modelling

An elastic network model of a protein is made by connecting the particles with harmonic springs. The particles can be either individual atoms or, in a coarse-grained approach, groups of atoms. The elastic network model used to find the trajectories that served as the starting point for the umbrella sampling in this project was coarse grained, so that each particle corresponded to an entire protein residue.
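The idea of a spring network can be sketched as follows. This is a generic illustration and not the model used to generate the pathways: the distance cutoff that decides which particles are connected and the single uniform spring constant are assumptions, and all names are hypothetical.

```python
import numpy as np

def enm_energy(coords, ref_coords, cutoff, k_spring):
    """Harmonic energy of a deformed structure in an elastic network.

    Assumptions: particles closer than `cutoff` in the reference structure are
    joined by a spring of stiffness `k_spring`, whose rest length is the
    reference distance.
    """
    coords = np.asarray(coords, dtype=float)
    ref_coords = np.asarray(ref_coords, dtype=float)
    ref_d = np.linalg.norm(ref_coords[:, None, :] - ref_coords[None, :, :], axis=-1)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.triu_indices(len(ref_coords), k=1)   # each pair counted once
    connected = ref_d[i, j] < cutoff
    stretch = d[i, j][connected] - ref_d[i, j][connected]
    return 0.5 * k_spring * np.sum(stretch ** 2)
```

With one particle per residue, as in the coarse-grained model described above, `coords` would hold one position per residue.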

Figure 3: Image showing transition pathways of SERCA projected on a two-dimensional space using principal component analysis. This image and the shown pathways were generated by Laura Orellana.

1.4 Umbrella sampling

Umbrella sampling is a technique for measuring the free energy of a transition by applying harmonic restraints to the system at a series of snapshot structures along the transition pathway. It has been successfully used to find free energy profiles along transition pathways for many different types of biomolecules [7] [8]. It has also been used to investigate the binding free energy of ligands [9]. Umbrella sampling can be applied to any collective variable of the system under investigation; some examples are the angles between protein domains, the positions of protein domains or atoms, and the root mean square deviation (RMSD) between the protein and some reference structure. Since we know both the strength of the applied restraining potential and the positional data of the system along the restrained collective variable, we are able to find the free energy profile of the transition by applying Boltzmann statistics. In this work the positional data was analysed using the weighted histogram analysis method (WHAM) [10], which is a common method for analysing umbrella sampling data. The chosen collective variable (CV) for this work was ∆RMSD: the difference between the RMSD between the simulated and first structure and the RMSD between the simulated and final structure, as will be discussed further in the methods section.
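The Boltzmann-statistics step can be illustrated for a single window. The sketch below assumes the standard single-window relation F(ξ) = -kT ln P_biased(ξ) - w(ξ) + C with a harmonic restraint w(ξ) = κ(ξ - ξ0)²/2; it is not the analysis code used in this work, and the function and parameter names are hypothetical. Combining several windows into one profile is what WHAM is for.

```python
import numpy as np

def unbiased_profile(cv_samples, center, kappa, kT, bins):
    """Free energy along the CV from one harmonically restrained window.

    Histogram the sampled CV values, Boltzmann-invert the biased distribution
    and subtract the known restraint energy. The result is only defined up to
    an additive constant, so the minimum is shifted to zero.
    """
    hist, edges = np.histogram(cv_samples, bins=bins, density=True)
    mid = 0.5 * (edges[:-1] + edges[1:])
    visited = hist > 0                      # empty bins carry no information
    restraint = 0.5 * kappa * (mid[visited] - center) ** 2
    free = -kT * np.log(hist[visited]) - restraint
    return mid[visited], free - free.min()
```

If the underlying free energy is flat, the sampled distribution is exactly the Boltzmann distribution of the restraint and the function returns a flat profile, which is a useful check of the sign conventions.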

1.5 Molecular Dynamics Simulation

Molecular dynamics (MD) is a powerful tool used in a large variety of fields, from statistical mechanics to drug development. It operates by integrating Newton's equations of motion for all particles in the system, advancing the system in discrete time steps small enough to capture all important events; typically values of one or a few femtoseconds are used, corresponding to the fast vibrational modes of water molecules. Using MD simulation implies a few important assumptions. The first is that the Born-Oppenheimer approximation holds for the investigated system, or at least that any quantum mechanical contributions can be sufficiently corrected for. The second is that the observed system is sufficiently large to avoid influences from non-physical boundaries or periodicities. The third is that the forcefield parameters used in the molecular dynamics engine are good enough to allow the simulation to yield relevant results.
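How Newton's equations are advanced in discrete time steps can be sketched with the velocity Verlet scheme. This is one common integrator chosen for illustration; it is an assumption that it resembles what the MD engine does internally, and the function name and parameters are hypothetical.

```python
import numpy as np

def velocity_verlet(force, x, v, mass, dt, n_steps):
    """Advance positions x and velocities v by n_steps steps of length dt.

    `force` is a function returning the force for given positions. Works for
    scalars or numpy arrays of coordinates.
    """
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass    # half-step velocity update
        x = x + dt * v_half                 # full-step position update
        f = force(x)                        # forces at the new positions
        v = v_half + 0.5 * dt * f / mass    # complete the velocity update
    return x, v

if __name__ == "__main__":
    # Harmonic oscillator with unit mass and spring constant.
    x, v = velocity_verlet(lambda x: -x, 1.0, 0.0, mass=1.0, dt=0.01, n_steps=1000)
    print("x(t=10) =", x, "exact:", np.cos(10.0))
```

The time step has to resolve the fastest motion in the system, which is why values of a few femtoseconds are needed when bond and water vibrations are present.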

1.5.1 Coupling Algorithms

Coupling algorithms are an essential feature of molecular dynamics simulations. They are mathematical tools used to keep the system at a constant value of some environmental variable. In this work we have made use of the Parrinello-Rahman pressure coupling scheme and the Nosé-Hoover temperature coupling scheme, both built-in features of the molecular dynamics suite.

In the Nosé-Hoover approach the temperature of the system is kept constant by adding a heat-reservoir term to the equations of motion of the system particles. The modified equations of motion can be stated as:

\[
\frac{d^2 \vec{r}_i}{dt^2} = \frac{\vec{F}_i}{m_i} - \frac{p_\xi}{Q} \frac{d\vec{r}_i}{dt}
\]

where the equation of motion for the heat bath parameter ξ is:

\[
\frac{dp_\xi}{dt} = (T - T_0)
\]

In Parrinello-Rahman pressure coupling the pressure of the system is kept constant by varying the system box vectors according to the equation of motion below:

\[
\frac{d^2 \vec{b}}{dt^2} = V \vec{W}^{-1} \vec{b}'^{-1} (\vec{P} - \vec{P}_{ref})
\]

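The equations of motion of the particles are also modified by the addition of a box-vector-dependent term, as below:

\[
\frac{d^2 \vec{r}_i}{dt^2} = \frac{\vec{F}_i}{m_i} - \vec{M} \frac{d\vec{r}_i}{dt}
\]

where

\[
\vec{M} = \vec{b}^{-1} \left[ \vec{b} \frac{d\vec{b}'}{dt} + \frac{d\vec{b}}{dt} \vec{b}' \right] \vec{b}'^{-1}
\]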

1.5.2 Gromacs

The simulations in this work were run using the molecular dynamics package gromacs [11]. There were several reasons for choosing this particular suite: firstly, this thesis project was done at TCB lab in Stockholm, which is also where gromacs is developed, and secondly the suite has considerable strengths of its own. One of these is its scalability across multiple nodes, which for normal molecular dynamics simulations is nearly linear. Another is the use that gromacs makes of the GPU. Together these mean that gromacs is very fast when run on large clusters. Gromacs is open source and is compatible with a wide range of programs such as vmd and plumed. It is also well documented, which is a major boon to anyone new to the suite. Importantly for this work, gromacs supports a wide range of force fields of both the coarse-grained and all-atom variety, and provides a good number of analysis and file-parsing tools that allow the user to easily manipulate trajectory and structure files and extract the data of interest. In addition there is quite a large amount of user-created content such as modified forcefields, ligand topologies and tutorials. Together these benefits make gromacs a very user-friendly and efficient alternative among the molecular dynamics suites.

1.5.3 Plumed

In general, umbrella sampling simulations are handled in gromacs using the built-in pull code. However, RMSD pulling is not yet implemented in the gromacs pull code, so in order to make use of the ∆RMSD collective variable gromacs was patched with PLUMED, an open source library for free energy calculations in molecular systems [12] [13]. Using this CV meant, however, that for the umbrella sampling simulations we were not able to benefit fully from the scalability of the gromacs code, since the RMSD values, being global variables, had to be updated at each simulation step in order to correctly assign the restraint potential forces. In practice this meant that we had to use a low number of nodes in order to get a decent computational efficiency, which of course also led to a longer simulation time. The results of the simulations were then analysed using the wham code developed by A Grossfield [14].

2 Method

2.1 Simulations

In all, four different umbrella sampling simulations were run: one using the initial systems obtained by adding hydrogens to the protein structures taken from the initial transition pathways, and three systems differing from the first by the inclusion of ligands and modified amino-acid residues. All of this is discussed in more detail below. Choosing how to perform a simulation is not straightforward; there are many parameters that must be chosen and it is not always obvious how these choices should be made.

2.1.1 Forcefield

In this work the amber99sb-ildn forcefield was used, with extensions to provide forcefield parameters for the lipids. This is a very commonly used set of forcefield parameters, and it was especially convenient for this project since we wished to take advantage of some pre-equilibrated lipid bilayer systems provided by Erik Lindvall for use in the different simulations, as well as the available forcefield parameters for ADP and ATP [15]. We made use of the tip3p water model, which is what gromacs suggests for the amber99sb-ildn forcefield.

2.1.2 Lipid model

The membrane into which SERCA was inserted consisted wholly of POPC lipids, in a pre-equilibrated membrane comprising 288 lipids. The forcefield parameters used for these lipids are a modified version of the Berger parameters [16], which are still widely used for MD and have also been used in dedicated lipid membrane simulations [17]. POPC is much used in membrane protein simulations since it has well-known properties and these properties are similar to those of eukaryotic cell membranes.

2.1.3 Simulation series

As shown in the flowchart in Fig 1, the pumping action of the SERCA protein comprises several structural transitions accompanied by the binding and release of several different ligands. In order to fully characterize this motion all transitions must therefore be covered, with and without the relevant ligands. If this is done, the free energy landscapes may be compared and the function of the protein elucidated.

However, covering all of these transitions would be a large endeavour and is outside the scope of this master's degree project, so instead a single transition was chosen. This project was concerned with the characterization of the free energy profile of the 1T5T [18] - 3B9B [19] transition (see Fig 3), which was seen as especially interesting as it covers the transition between the low energy E1 state, from which the protein is excited by ATP hydrolysis, and the high energy E2 state of the protein. This is a difficult transition for molecular dynamics since the release of ADP and binding of ATP cannot be reproduced on a computationally attainable time-scale.

The first approach taken was to simulate a simplified version of the protein, by allowing gromacs to build the all-atom protein from the coarse-grained snapshot structures from the transition pathways described above. In this case the protein was unphosphorylated, no ligands were added and no calcium ions were placed in the calcium binding pocket. This simulation series will from here on be referred to as the "unmodified case", since no additional modifications were made to the protein structures from the transition pathway.

The second approach was to introduce as much as possible of what is known about the reality of the transition. This was done by modifying the protein to contain the phosphorylated ASP 351 residue, by introducing the calcium ions into the calcium binding pocket and by aligning ATP and ADP into the nucleotide binding pocket. Since the actual release and binding of ADP and ATP was not to be simulated, three separate simulations were made for the ATP-bound, ADP-bound and unbound cases respectively. These cases will from here on be referred to as the modified ATP-bound case, the modified ADP-bound case and the modified unbound case.

The structure of the phosphorylated ASP 351 was built in UCSF Chimera [20] and its forcefield parameters were taken directly from the atomtypes (.atp) file of the amber99sb-ildn forcefield in gromacs. The ADP and ATP molecules were made by passing the forcefield parameters cited above under Forcefield to a program called acpype [21], which translated these amber forcefield parameter files for use in gromacs.
One issue was that aligning the nucleotides in the nucleotide binding pocket in such a way as to avoid clashes and other energetic interactions turned out to be challenging. To overcome this several methods were assessed, including alignment by hand and the use of the online resource DOCK Blaster. In the end alignment by hand turned out to be the most viable method. The alignment was mainly guided by maximal conservation of the pi-stacking interaction between the nucleotide and the Phe 487 residue and of the hydrogen bonds formed between the nucleotide phosphate groups and the LYS 492 residue. An example of the alignment is shown in Fig 4.

2.1.4 Equilibration

Although, as was mentioned, the membrane bilayer systems themselves were pre-equilibrated, the insertion of a protein into the membrane necessarily disturbs the bilayer, which means that the system must be re-equilibrated after the insertion of the protein.

The equilibration of lipid membranes is time-consuming work, and for that reason it is desirable to use some tricks to cut down the time spent on these equilibrations. One common approach when dealing with a series of simulations of similar structures, and the one adopted in this work, is to simply "cut and paste" the new protein structure into the hole in the lipid membrane left by the previous structure in the series. This method reduces the amount of necessary equilibration since the membrane is less disturbed by a change in protein position than it is by the sudden insertion of a protein. This is especially true if, as is often the case, the motion of the protein is small in the transmembrane domain.

Figure 4: The nucleotide binding site with bound ATP.

Though this scheme cuts down the computation for the membrane equilibration, it may also cause overlap between protein atoms and water or lipid atoms in the system, causing infinite energies and making the simulation explode. To avoid this, overlapping waters and lipids must be removed. This must in turn be compensated for, either by reinserting waters or simply by starting from a system with more water in it, so that the loss of water may be neglected. The removed lipids are far more difficult to compensate for, and in general all that can be done is to insert the protein carefully at each step, so as to cause as few lipid-protein clashes as possible.

In order to achieve this we elected, in each case, to insert the protein into the membrane using the g_membed [22] functionality. g_membed works by inserting an unphysically thinned protein into the membrane, where at first all distances between atoms and atom extensions in the xy-direction are reduced by one half. The protein is then "reinflated" over the course of 1000 simulation steps, allowing the protein to push any previously overlapping lipids aside. For each consecutive window g_membed was also used, but starting from an xy-extension of 0.75. In this way we were to a large extent able to avoid removing lipids and waters when changing between different simulation windows.

Once the systems had been built and the proteins embedded in the membrane, an energy minimization (EM) was run. EM is not itself a molecular dynamics simulation but is used to find a local potential energy minimum near the starting structure. In this case we used the steepest descent minimization algorithm with an energy step size of 0.00001, and the systems generally found an energy minimum with forces less than 1000 kJ/mol/nm after a few thousand steps.

Once the system had been energy minimized it was equilibrated in a canonical ensemble (NVT) simulation using Nosé-Hoover temperature coupling with a reference temperature of 310 K. This equilibration was only run for 20 ps, as this was sufficient to relax the system for the next step: NPT equilibration. In the NPT equilibration, Parrinello-Rahman pressure coupling was added in addition to the temperature coupling. The equilibration of the initial system was performed for 100 ns, after which, as laid out above, the consecutive structures were inserted into this equilibrated membrane; we therefore found it sufficient to equilibrate these systems for 5 ns. The same 100 ns equilibration was used for each of the four simulation series. During each of these steps the protein conformation was conserved by the addition of positional restraints on the alpha-carbons.

2.1.5 Production

The production runs were done using Parrinello-Rahman pressure coupling and Nosé-Hoover temperature coupling as in the NPT equilibration. The simulations were run for 50 ns per simulation window in the unmodified case and for 20 ns per simulation window in the modified cases.

The production run is where the umbrella sampling takes place. This can be done with the pull code parameters available in the molecular dynamics parameters (.mdp) file in gromacs, but at this time that code only implements distance and direction restraints, so for our purposes we patched gromacs with plumed. The following is a snippet of the code used in the plumed file.

    WHOLEMOLECULES STRIDE=1 ENTITY0=1-15347
    CV1: RMSD REFERENCE=firststate.pdb TYPE=SIMPLE
    CV2: RMSD REFERENCE=endstate.pdb TYPE=SIMPLE
    CV3: COMBINE ARG=CV1,CV2 POWERS=1,1 COEFFICIENTS=1,-1 PERIODIC=NO
    RESTRAINT ARG=CV3 KAPPA=1000 AT=pos LABEL=umbrella
    PRINT ARG=CV3 STRIDE=10 FILE=COLVAR

The first line is there to prevent the molecule from being split across the periodic boundaries of the simulation box, which can otherwise interfere with the calculation. The next two lines define the RMSD between the present structure and the first and end structure respectively. The fourth line combines these into the ∆RMSD CV. The fifth line adds the umbrella restraint, where pos stands for the ∆RMSD value at which the window is restrained, and the sixth tells the program to print the position of the system along CV3 to the file COLVAR at every tenth simulation step.
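Since only the AT value changes between windows, the plumed files for a whole series can be produced from one template. The following is a minimal sketch of how this might be scripted, not the script used in this work; the directory layout, the function names and the example centre values are hypothetical, while the plumed keywords are those of the snippet above.

```python
PLUMED_TEMPLATE = """WHOLEMOLECULES STRIDE=1 ENTITY0=1-15347
CV1: RMSD REFERENCE=firststate.pdb TYPE=SIMPLE
CV2: RMSD REFERENCE=endstate.pdb TYPE=SIMPLE
CV3: COMBINE ARG=CV1,CV2 POWERS=1,1 COEFFICIENTS=1,-1 PERIODIC=NO
RESTRAINT ARG=CV3 KAPPA=1000 AT={center:.4f} LABEL=umbrella
PRINT ARG=CV3 STRIDE=10 FILE=COLVAR
"""

def plumed_input(center):
    """Return the plumed file contents for one window restrained at `center`."""
    return PLUMED_TEMPLATE.format(center=center)

def window_inputs(centers):
    """Map a hypothetical per-window file path to its plumed input text."""
    return {
        f"window_{index:02d}/plumed.dat": plumed_input(center)
        for index, center in enumerate(centers)
    }

if __name__ == "__main__":
    # Example centres only; in practice these would be the DeltaRMSD values
    # of the intermediate structures along the pathway.
    for path, text in window_inputs([-0.5, 0.0, 0.5]).items():
        print(path)
        print(text)
```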

2.2 Umbrella sampling

Umbrella sampling uses harmonic restraints to restrain the protein at a predetermined position. In our case this position was the value of the ∆RMSD collective variable for each of the intermediate structures. ∆RMSD is the RMSD between the simulated intermediate structure and the first structure of the pathway minus the RMSD between the simulated structure and the last structure of the pathway:

\[
\Delta\mathrm{RMSD} = \mathrm{rmsd}(X_i, X_{start}) - \mathrm{rmsd}(X_i, X_{end})
\]

Plumed calculates RMSD using the formula below:

\[
d(X, X') = \sqrt{ \sum_i \sum_{\alpha}^{x,y,z} \frac{w_i}{\sum_j w_j} \left( X_{i,\alpha} - com_\alpha(X) - X'_{i,\alpha} + com_\alpha(X') \right)^2 }
\]

where

\[
com_\alpha(X) = \sum_i \frac{w'_i}{\sum_j w'_j} X_{i,\alpha}
\]

Here w_i and w'_i are the weights assigned to each atom in the reference structure for the RMSD and the center of mass calculation respectively. These are assigned in the reference structure pdb file. For this job we decided to weight only the Cα atoms, giving these weight one and all other atoms weight zero, for both weight parameters. This made sense since the method by which the transition pathway was arrived at included only Cα atoms, so restraining further would be a baseless addition of information to the system.

The basic idea behind umbrella sampling is to take a number of structures generated through some non-equilibrium method that forces the system along a certain route, and then perform measurements on each of these structures in equilibrium. In each of these simulation windows the system will fluctuate around the restrained value of the CV under investigation. In order to get full sampling of the path over the CV there must be overlap in the values of the CV between several windows. How many windows and what extent of overlap are needed for convergence is not predetermined but must be investigated in each case. In essence, if including more intermediate structures or extending the simulation time appreciably changes the free energy profile then it has not converged. It is always necessary, however, for adjacent windows to overlap in order to relate the free energy measurements from adjacent simulation windows; the term umbrella sampling is derived from this necessary overlap.
The system is restrained only in the chosen CV or CVs, which means that it can still move freely along any dimension perpendicular to the restrained CV(s). This is considered a strength since it allows the system to explore more of the free energy landscape around the chosen trajectory, making it more likely that the energy profile gathered through the simulation corresponds to the energy profile of the true trajectory. In this way umbrella sampling can also be used as a refinement tool for a pathway finding method [23].
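The ∆RMSD collective variable and the weighted RMSD defined earlier in this section can be written out in a few lines. The sketch below is for illustration and is not the plumed implementation: it removes only the weighted centre of mass (no rotational fit), uses the same weights for the centre of mass and the displacement as was done in this work, and all names are hypothetical.

```python
import numpy as np

def weighted_rmsd(x, ref, weights):
    """RMSD after removing each structure's weighted centre of mass.

    x, ref: arrays of shape (n_atoms, 3); weights: array of shape (n_atoms,).
    Atoms with weight zero (here everything except the C-alpha atoms) do not
    contribute.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    x_centered = x - (w[:, None] * x).sum(axis=0)
    ref_centered = ref - (w[:, None] * ref).sum(axis=0)
    squared = ((x_centered - ref_centered) ** 2).sum(axis=1)
    return np.sqrt((w * squared).sum())

def delta_rmsd(x, x_start, x_end, weights):
    """RMSD to the first structure minus RMSD to the last structure."""
    return weighted_rmsd(x, x_start, weights) - weighted_rmsd(x, x_end, weights)
```

With this sign convention ∆RMSD is negative near the first structure and positive near the last one, in line with the COEFFICIENTS=1,-1 setting in the plumed file.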

It should also be noted that the results of the umbrella sampling are entirely dependent on the choice of transition pathway. There are certain weaknesses to our chosen approach: using the ∆RMSD collective variable implicitly assumes that there is a linear relationship between the movement of the protein along the transition pathway and the distance from the first and last structures in the series, which is needed for the surrounding energy landscape to be properly sampled. This is not necessarily the case. However, upon review of how the two RMSD values changed over the course of the transition, we concluded that in the case of the 1T5T - 3B9B transition there was indeed a linear relationship between the two RMSD values.

On the positive side, using the ∆RMSD CV is less punishing to small-scale movement of residues than most alternative 'whole molecule' restraints, such as backbone restraints, leading to a relatively better sampling of the energy landscape. There is also an attractive generality to this CV as it can describe any protein motion: even if nothing is known or can be guessed about how the protein moves, the movement will result in changes in RMSD. Another benefit of this approach is that it has been tried before, so we have something to compare our results to [7] [8].

The WHAM method is the most commonly used way of analysing the results of umbrella sampling. The method works by first finding the free energy profile for the values of the CV spanned by each individual window. The free energy is calculated as:

\[
\exp(-f_j) = \sum_{V,\xi} P_{\lambda_j,\beta_j}(V, \xi)
\]

where P is the probability of the system occupying the CV value ξ at a certain restraining potential V, coupling parameter (in FEP) λ and temperature. It is gathered from the positional data from the simulation output as [10]:

\[
P_{\lambda,\beta}(V, \xi) = \frac{\sum_{k=1}^{R} N_k(V, \xi) \exp\left(-\beta \sum_{j=0}^{L} \lambda_j V_j\right)}{\sum_{m=1}^{R} n_m \exp\left(f_m - \beta_m \sum_{j=0}^{L} \lambda_{j,m} V_j\right)}
\]

where n_m is the number of snapshots taken from the mth simulation. That is, for each discrete value along the CV we sum up the probability calculated from each of the simulation windows and divide by the total number of measurements. This straightforward approach has become very widely used as an analysis tool for free energy simulations.
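For umbrella sampling along a single CV at one temperature, the two WHAM equations are solved self-consistently by iteration. The following is a minimal sketch of that special case, written for illustration and not taken from the wham code used in this work; it assumes one histogram per window on a common grid and a known restraint energy at each bin centre, and all names are hypothetical.

```python
import numpy as np

def wham_1d(counts, bias, beta, n_iter=5000, tol=1e-12):
    """Iterate the one-dimensional WHAM equations to self-consistency.

    counts[k, b]: number of snapshots of window k falling in bin b.
    bias[k, b]:   restraint energy of window k at the centre of bin b.
    Returns the free energy per bin, shifted so that its minimum is zero.
    Bins that were never visited come out with infinite free energy.
    """
    counts = np.asarray(counts, dtype=float)
    boltz = np.exp(-beta * np.asarray(bias, dtype=float))
    n_samples = counts.sum(axis=1)      # snapshots per window (n_m)
    total = counts.sum(axis=0)          # snapshots per bin, all windows
    f = np.zeros(len(n_samples))        # dimensionless window free energies

    def probability(f_values):
        denom = (n_samples[:, None] * np.exp(f_values)[:, None] * boltz).sum(axis=0)
        return total / denom

    for _ in range(n_iter):
        p = probability(f)
        f_new = -np.log((p[None, :] * boltz).sum(axis=1))
        f_new -= f_new[0]               # fix the arbitrary additive constant
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break

    p = probability(f)
    p /= p.sum()
    with np.errstate(divide="ignore"):
        free = -np.log(p) / beta
    return free - free.min()
```

Whether the iteration converges in a reasonable number of steps depends on how well adjacent windows overlap, which is the same overlap requirement discussed above.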

2.2.1 Validation

Because of the piecewise nature of umbrella sampling there is an inherent risk that, if the chosen pathway does not correspond to a minimum energy corridor, the system may explore conformational areas far away from the intended pathway. In this case, though we may be able to produce an energy profile for a transition pathway, it is by no means clear that this pathway is identical to the one we presume to be exploring. It may equally be the case that the explored areas of conformational space do not form a single pathway but are only unconnected local minima.

Two different tests were performed in order to see whether the umbrella sampling simulations did indeed stay close to the intended pathway, and therefore whether the resulting free energy profile truly corresponds to that of the proposed transition pathway. Firstly, the RMSD between the structures recorded in each of the umbrella frames and the average structure of the previous frame was computed. If these RMSD values are small the frames form a continuum, whereas if they are large there must be a discontinuity. Secondly, principal component analysis (PCA) was performed by Cathrine Bergh on the trajectory files of the unmodified case simulation series. The principal components used were the same as those of the coordinate system in Fig 3. PCA can be used to validate the umbrella sampling series in the same way as the RMSD test, by showing that the trajectories are continuous in collective variables other than the restrained one.
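The PCA test amounts to projecting each trajectory onto a fixed pair of principal components and checking that consecutive windows connect. A minimal sketch of the two steps is given below, for illustration only and not the analysis code used here; it assumes that the frames have already been aligned and flattened to one row of coordinates per frame, and all names are hypothetical.

```python
import numpy as np

def fit_pca(frames, n_components=2):
    """Principal components of a set of structures.

    frames: array of shape (n_frames, 3 * n_atoms), already aligned.
    Returns the mean structure, the components (one per row) and the
    fraction of the total variance carried by each returned component.
    """
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(frames - mean, full_matrices=False)
    variance = singular_values ** 2
    explained = variance / variance.sum()
    return mean, vt[:n_components], explained[:n_components]

def project(frames, mean, components):
    """Coordinates of each frame in the space spanned by the components."""
    return (np.asarray(frames, dtype=float) - mean) @ components.T
```

Here `fit_pca` would be applied once to the structures defining the coordinate system, and `project` to the frames of each umbrella window, so that all windows are compared in the same two-dimensional space.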

Part II Results and Discussion

3 Results

3.1 Free energy profiles

The main results of this investigation are of course the free energy profiles of the 1T5T-3B9B transition as generated using the four different case systems described in the methods section. In each case there are high free energy barriers at the edges of the simulated ∆RMSD range; this is a feature of the umbrella sampling method. The free energy profile for the unmodified case, shown in Fig 5, has a global minimum around ∆RMSD 0.3 and four local energy maxima near 20 kJ/mol around ∆RMSD -0.75, -0.5, -0.25 and 0.1.

Figure 5: The free energy profile from the 50 ns/window simulation of the unchanged protein. (Plot axes: ∆RMSD in nm, free energy in kJ/mol.)

The free energy profile for the modified ADP-bound case, shown in Fig 6, has free energy minima at ∆RMSD of about 0.55 and 0.8, and two free energy peaks reaching 10 kJ/mol at ∆RMSD -0.4 and 0.7, as well as a number of peaks at about 5 kJ/mol. The free energy profile for the modified ATP-bound case, shown in Fig 7, has free energy minima at ∆RMSD near -0.5 and 0.8, and a number of free energy maxima of about 5 kJ/mol. The free energy profile for the modified unbound case, shown in Fig 8, has a rugged appearance with free energy minima at ∆RMSD near -0.75, -0.5, -0.25, 0, 0.5 and 0.75, separated by energy maxima of about 5 kJ/mol at ∆RMSD -0.7, -0.4, -0.1, 0.1, 0.4 and 0.6.

Figure 6: The free energy profile for the 20 ns/window modified ADP-bound protein. (Plot axes: ∆RMSD in nm, free energy in kJ/mol.)

Figure 7: The free energy profile for the 20 ns/window modified ATP-bound protein. (Plot axes: ∆RMSD in nm, free energy in kJ/mol.)

Figure 8: The free energy profile for the 20 ns/window modified unbound protein. (Plot axes: ∆RMSD in nm, free energy in kJ/mol.)

3.2 Validation

3.2.1 Histograms

The histogram plots used by the WHAM method to derive the free energy profiles in Fig 5 - 8 are shown in Fig 9 - 12. The histogram plot for the unmodified case, Fig 9, shows a lack of overlap at ∆RMSD of approximately -0.5, which makes the relation between the early and late parts of that free energy profile uncertain.
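A lack of overlap of this kind can also be flagged numerically. The sketch below uses the overlap coefficient between neighbouring normalised histograms; the function names, the toy histograms and the threshold of 0.05 are assumptions made for illustration and are not part of the analysis in this work.

```python
def normalise(hist):
    """Convert raw bin counts to probabilities."""
    total = float(sum(hist))
    return [h / total for h in hist]

def adjacent_overlaps(histograms):
    """Overlap coefficient sum_b min(p_i[b], p_{i+1}[b]) for neighbouring
    windows sharing the same bins: 1.0 for identical histograms, 0.0 when
    no bin is populated by both windows."""
    probs = [normalise(h) for h in histograms]
    return [sum(min(a, b) for a, b in zip(p, q))
            for p, q in zip(probs, probs[1:])]

def weak_links(histograms, threshold=0.05):
    """Indices i where windows i and i+1 overlap less than the threshold."""
    return [i for i, ov in enumerate(adjacent_overlaps(histograms))
            if ov < threshold]

# Toy histograms on a shared grid of eight bins.
h0 = [5, 10, 5, 0, 0, 0, 0, 0]
h1 = [0, 5, 10, 5, 0, 0, 0, 0]
h2 = [0, 0, 0, 0, 0, 5, 10, 5]
overlaps = adjacent_overlaps([h0, h1, h2])
gaps = weak_links([h0, h1, h2])
```

A weak link splits the profile into sections whose relative free energies are poorly determined, even if each section is well sampled on its own.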

Figure 9: The histogram plot for the 50 ns/window unmodified protein. (Plot axes: ∆RMSD in nm, counts.)

Figure 10: The histogram plot for the 20 ns/window modified ADP-bound protein. (Plot axes: ∆RMSD in nm, counts.)

Figure 11: The histogram plot for the 20 ns/window modified ATP-bound protein. (Plot axes: ∆RMSD in nm, counts.)

Figure 12: The histogram plot for the 20 ns/window modified unbound protein. (Plot axes: ∆RMSD in nm, counts.)

3.2.2 PCA

Fig 13 shows the projection of the protein conformations from the unmodified case, taken at integer nanosecond values, shown as dots of color ranging from blue to red depending on the position of the simulation window in the series, blue being early and red late. The blue line represents the structures of the transition pathway generated by Laura Orellana, some of which were used as starting structures for the umbrella simulation. These are plotted in a coordinate system of the same principal components as in the transition pathways of Fig 3 on page 6.

Figure 13: The trajectory described by the protein over the simulations projected onto two principal components. The figure was generated by Cathrine Bergh.
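Projecting a structure onto existing principal components amounts to a dot product of the mean-centred coordinates with each component vector. The sketch below assumes flattened coordinate vectors that have been fitted to the same reference used when the components were derived; the component vectors, the mean and the frames are toy values, not the ones behind Fig 13.

```python
import math

def project(frame, mean, components):
    """Project a flattened 3N coordinate vector onto unit-length
    principal component vectors after subtracting the mean structure."""
    centred = [x - m for x, m in zip(frame, mean)]
    return [sum(c * v for c, v in zip(comp, centred)) for comp in components]

def largest_step(points):
    """Largest Euclidean distance between consecutive projected points,
    a crude indicator of a discontinuity along the series."""
    return max(math.dist(a, b) for a, b in zip(points, points[1:]))

# Toy system: two atoms (six coordinates) and two orthonormal components.
s = 1.0 / math.sqrt(2.0)
pc1 = [s, 0.0, 0.0, s, 0.0, 0.0]
pc2 = [0.0, s, 0.0, 0.0, -s, 0.0]
mean = [0.0] * 6
frames = [[1.0, 2.0, 0.0, 1.0, -2.0, 0.0], [0.0] * 6]
points = [project(fr, mean, [pc1, pc2]) for fr in frames]
```

Plotting the projected points window by window, colored by window index, gives a picture of the same kind as the one described above.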

3.2.3 RMSD

Fig 14 - 17 show the RMSD plots used to validate the four different umbrella sampling series described above. They show histograms of the RMSD between the simulated structures in each window and the average structure of the previous window. The figures are not meant to show the detail of these histograms but rather to show which RMSD values between consecutive simulation windows are reached by the systems. In general, for all the RMSD plots the bulk of the measured RMSD values lie between about 0.2 and 0.4 nm. In the RMSD distribution for the unmodified case, Fig 14, we notice two outliers with measured values reaching an RMSD of approximately 0.6 nm. In the RMSD distribution for the modified ADP-bound case the same is true for one simulation series, whereas the modified ATP-bound and unbound cases show maximum values of only about 0.5 nm, with one outlier in the ATP-bound case and two in the unbound case.

Figure 14: Graph showing the histograms of the RMSD between adjacent simulation windows for the 50 ns/window unmodified protein.

Figure 15: Graph showing the histograms of the RMSD between adjacent simulation windows for the 20 ns/window modified ADP-bound protein. (Plot axes: RMS in nm, counts.)

Figure 16: Graph showing the histograms of the RMSD between adjacent simulation windows for the 20 ns/window modified ATP-bound protein. (Plot axes: RMS in nm, counts.)

Figure 17: Graph showing the histograms of the RMSD between adjacent simulation windows for the 20 ns/window modified unbound protein. (Plot axes: RMS in nm, counts.)

4 Discussion

4.1 Results

4.1.1 Free energy profiles

Looking at the free energy profile from the unmodified case, Fig 5 on page 17, we see that it has a rather rugged curve, with peaks at about 20 kJ/mol. We also note that there are no stable minima close to either end point of the simulation series, that is, the crystal structures; instead the profile shows its global minimum at a ∆RMSD value of about 0.3. From this it may be concluded that this simplified version of the protein, which lacks bound ligands, the phosphorylated residue and calcium ions, would not complete the 1T5T-3B9B transition. This is not necessarily surprising: the phosphorylation of the protein is after all what is supposed to trigger the transition from the E1 state to the E2 state. However, simplified systems such as this are used for practicality in many simulation experiments, and beforehand it could have been argued that the unmodified system is closer to the system used to generate the transition pathway that served as the starting point, and was therefore to be preferred. The simple fact of the matter is that the transition from Brownian dynamics and coarse grained systems to molecular dynamics and all atom systems is a huge step up in complexity. Whereas in the first case the overall structural information about the protein was sufficient to find likely transition pathways, in the second further detail is needed for the simulation to yield relevant results.

The comparison in Fig 18 between the unmodified case and the modified ADP-bound case, which here is taken to be representative of the modified cases, shows that in the modified cases the entire free energy profile is considerably lowered as compared to the unmodified case. This is likely due to the added electrostatic forces from the ions and the phosphorylated amino acid residue, stabilizing the protein conformation in positions closer to the initial transition pathway.
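As a rough scale for the barrier heights quoted here, the ratio of equilibrium populations at two values of the collective variable follows from the Boltzmann factor. The temperature of 300 K used below is an assumption made for illustration; it is not taken from this section.

```latex
\[
\frac{p(\xi_a)}{p(\xi_b)} = \exp\!\left(-\frac{F(\xi_a) - F(\xi_b)}{RT}\right),
\qquad
RT \approx 8.314 \times 10^{-3}\,\mathrm{kJ\,mol^{-1}\,K^{-1}} \times 300\,\mathrm{K}
\approx 2.49\,\mathrm{kJ/mol}
\]
\[
\Delta F = 20\,\mathrm{kJ/mol}:\; e^{-8.02} \approx 3.3 \times 10^{-4},
\qquad
\Delta F = 10\,\mathrm{kJ/mol}:\; e^{-4.01} \approx 1.8 \times 10^{-2},
\qquad
\Delta F = 5\,\mathrm{kJ/mol}:\; e^{-2.00} \approx 0.13
\]
```

Under this assumption a 20 kJ/mol peak is populated several hundred times less than a neighbouring 5 kJ/mol peak, which gives a sense of how much the lowered profiles of the modified cases matter.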
Comparing the modified ADP-bound and ATP-bound cases as in Fig 19 we see considerable similarities between the cases. The main difference is the absence in the ATP-bound case of the free energy peak at ∆RMSD of approximately -0.2. Reviewing the free energy profiles for the modified cases we see that in general, relative to the unmodified case, they show lower local maxima and have lower local minima near the crystal structures. In particular, the ADP-bound case has its global minimum at a ∆RMSD value close to the 3B9B crystal structure. The ADP-bound case also stands out in that it has local maxima of about 10 kJ/mol, whereas the other two cases have maxima of about half that height. As for the order of events over the course of the transition, not much can be said on the basis of the free energy profiles. If anything, the position of the global minimum for the ADP-bound case may indicate that the entire conformational transition is in fact made with bound ADP. On the other hand, the lower energy maxima of the ATP-bound case may be seen to indicate that the nucleotide exchange does indeed take place over the course of the conformational change. To properly answer this question one would need to perform some kind of dynamics experiment.

Figure 18: A comparison between the free energy profiles of the modified ADP-bound case (red) and the unmodified case (black). (Plot axes: ∆RMSD in nm, free energy in kJ/mol.)

Note also that the starting point of the ATP-bound case differs from the other two cases. This is because in the early stages of the transition the nucleotide binding pocket was too narrow to fit the ATP molecule. Upon review of the conformational states corresponding to the free energy maxima, it was found that these often corresponded to a relative movement between transmembrane alpha-helices. This may be due to the nature of the restraints: the helices may shift relative to one another in order for the protein conformation to correspond more closely to the bias value of the ∆RMSD. If this is the case, using RMSD-based umbrella sampling on protein transitions may risk underestimating the heights of the free energy maxima; a way to temper this may be to use weaker restraint potentials.
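The trade-off involved in weakening the restraints can be estimated with a short sketch. It assumes a harmonic restraint U = 0.5 * kappa * (xi - xi0)^2 acting on a locally flat free energy profile; the force constants and the temperature of 300 K are hypothetical values chosen for illustration, not those used in this work.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)

def window_width(kappa, temperature=300.0):
    """Thermal standard deviation (nm) of the collective variable under a
    harmonic restraint with force constant kappa in kJ/(mol nm^2),
    assuming the underlying free energy profile is locally flat."""
    return math.sqrt(R_KJ * temperature / kappa)

# Hypothetical force constants: quartering kappa doubles the window width.
stiff = window_width(4000.0)
soft = window_width(1000.0)
```

A weaker restraint lets each window spread over a wider range of ∆RMSD, which improves overlap between neighbours but lets the system relax further from the intended window centre, so more sampling per window may be needed.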

4.1.2 Principal component analysis

Looking at Fig 13 on page 23 we see that in general the simulation trajectory is continuous and is not far removed from the transition pathway. The main differences seem to be near the ends of the transition pathway, where the protein seems to shy away from the crystal structures. In the first (blue) part of the transition, near the 1T5T crystal structure, the protein seems to find a local minimum removed from the crystal structure, whereas at the end (red) part, near the 3B9B crystal structure, it does not seem to find any deep local minimum and the result is a very nebulous appearance of the projection dot envelope. This is the behaviour we would expect considering the free energy profile of Fig 5 on page 17. We have already speculated on the reason for this behaviour above; in short, it is likely due to the less than realistic conditions of the unmodified case simulation and to the crystal structures differing from the in vivo protein state conformations.

The reason that there are dots separated from the rest, on or near the transition pathway line, is that all structures at integer ns values were projected onto the plot, including those at the start of the simulation; these will naturally be close to the starting point on the transition pathway line.

Figure 19: A comparison between the free energy profiles of the modified ADP-bound case (black) and the modified ATP-bound case (red). (Plot axes: ∆RMSD in nm, free energy in kJ/mol.)

4.1.3 RMSD

These plots were made and shown in order to argue for the validity of the presented free energy profiles. As was argued above, there is an inherent danger in any method such as umbrella sampling which depends on data from a series of simulations, namely that the simulations in the series are not connected as one might wish. In this case, it may well have been that although the simulation windows overlapped in terms of the ∆RMSD variable, they did not overlap in protein conformational space. To alleviate such concerns it is therefore important to make sure that there is an overlap also in some unrestricted dimensions. The RMSD plots shown above show how far from each other the simulated protein conformations were in terms of RMSD. The colored curves are histograms showing the RMSD between the simulated protein in each simulation window at integer nanosecond values and the mean structure of the previous simulation window, as calculated using the gromacs g_cluster function. The proteins generally stay in an RMSD range of 0.2-0.4 nm, which is on par with the fluctuation in ∆RMSD between the windows. As is seen in the plots above, there are cases that deviate from this quite considerably, reaching RMSD values of 0.6-0.7 nm. These are in all cases the windows at the extreme parts of the trajectories, that is, the ones close to the crystal structures. Possibly this is due to differences between the crystal structures, which can only be obtained by imposing unnatural conditions on the protein, and the true minimum energy conformation, which will be close to but not identical with the crystal structure.

4.2 Reliability

The reliability of simulation data is always a complicated question, so how can we be sure that the performed simulations resulted in reliable data? As regards the basic question of the reliability of the simulations, we may respond that we have made use of force fields, lipid parameters and methods which have previously been shown to give results closely resembling those of real world experiments, and that the generated free energy profiles show reasonable characteristics.

There is however an issue specific to this investigation which may have negatively influenced the results: the placement of the bound ATP and ADP molecules, which was performed by hand. This means that the simulations above for the nucleotide-bound cases are not easily repeatable, since other people may dock the nucleotides to the protein in different ways. For this reason it would have been better if some sort of docking with a strictly defined set of docking criteria had been used, as was attempted with DOCK Blaster. The main reason for the difficulty in docking the ligands is that the starting structure of the binding pocket is changed over the course of the Brownian dynamics simulation, which is not meant to take into account the detail of the transitional motion. In order to somewhat alleviate this concern the placement was done using the criteria detailed in the method section above and illustrated in Fig 4 on page 11. A point was also made of checking the simulation windows after equilibration to make sure that the nucleotides and nucleotide binding pockets looked reasonable.

Another factor to consider when speaking of the reliability of the above results is the umbrella histogram graphs, which as was pointed out earlier should show overlap between adjacent windows. There are some cases where there is only little overlap between adjacent windows, as in Fig 9 on page 20. The result of this is that in these cases the relation between the free energy information in the different sections separated by the region of little overlap becomes uncertain. However, within each of these sections the information is not compromised.

4.3 Applications

The above investigations showcase two different applications of umbrella sampling. The first is the refinement of transition pathways arrived at by other means, as in the case of the unmodified protein simulation, where we can see that the transition pathway followed over the course of the umbrella simulations is slightly different from the one arrived at using Brownian dynamics and elastic network modelling. In this capacity umbrella sampling can also be seen as a validation tool for methods aiming at deriving protein transition pathways. The second is as a method of discovering additional information about protein transitions, as was done in all cases above, for instance free energy profiles for protein transitions.

References

[1] L. Meis and A. L. Vianna, “Energy interconversion by the Ca2+-dependent ATPase of the sarcoplasmic reticulum,” Annual Review of Biochemistry, vol. 481(1), pp. 275–292, 1979.

[2] J. Vuust Møller, . Olesen, A.-M. Lund Winther, and P. Nissen, “What can be learned about the function of a single protein from its various X-ray structures: The example of the sarcoplasmic calcium pump,” Methods in Molecular Biology, vol. 654, pp. 119–40, 2010.

[3] M. Nyblom, H. Poulsen, P. Gourdon, L. Reinhard, M. Andersson, E. Lindahl, N. Fedosova, and P. Nissen, “Crystal structure of Na+,K(+)-ATPase in the Na(+)-bound state,” Science, vol. 342(6154), pp. 123–27, 2013.

[4] A. Das, M. Gur, M. H. Cheng, S. Jo, I. Bahar, and B. Roux, “Exploring the conformational transitions of biomolecular systems using a simple two-state anisotropic network model,” PLoS Comput Biol, vol. 10, p. e1003521, doi:10.1371/journal.pcbi.1003521, 2014.

[5] B. Corry, T. W. Allen, S. Kuyucak, and S.-H. Chung, “A model of calcium channels,” BBA - Biomembranes, vol. 1509(1), pp. 1–6, 2000.

[6] T. Shen, C. F. Wong, and J. A. McCammon, “Atomistic Brownian dynamics simulation of peptide phosphorylation,” Journal of the American Chemical Society, vol. 123(37), pp. 9107–11, 2001.

[7] N. K. Banavali and B. Roux, “Free energy landscape of A-DNA to B-DNA conversion in aqueous solution,” Journal of the American Chemical Society, vol. 127, pp. 6866–76, 2005.

[8] K. Arora and C. L. Brooks III, “Large-scale allosteric conformational transitions of adenylate kinase appear to involve a population-shift mechanism,” Proceedings of the National Academy of Sciences of the United States, vol. 104, 2007.

[9] D. L. Mobley, J. D. Chodera, and K. A. Dill, “The confine-and-release method: Obtaining correct binding free energies in the presence of protein conformational change,” Journal of Chemical Theory and Computation, vol. 3, pp. 1231–1235, 2007.

[10] S. Kumar, J. M. Rosenberg, D. Bouzida, R. H. Swendsen, and P. A. Kollman, “The weighted histogram analysis method for free-energy calculations on biomolecules. I. The method,” Journal of Computational Chemistry, vol. 13, pp. 1011–21, 1992.

[11] S. Pronk, S. Pall, R. Schultz, P. Larsson, P. Bjelkmar, R. Apostolov, M. R. Shirts, J. C. Smith, P. M. Kasson, D. van der Spoel, B. Hess, and E. Lindahl, “Gromacs 4.5: a high-throughput and highly parallel open source molecular simulation toolkit,” Bioinformatics, vol. 29(7), pp. 845–54, 2013.

[12] G. Tribello, M. Bonomi, D. Branduardi, C. Camilloni, and G. Bussi, “Plumed2: New feathers for an old bird,” Computer Physics Communications, vol. 185, pp. 604–13, 2014.

[13] M. Bonomi, D. Branduardi, G. Bussi, C. Camilloni, D. Provasi, P. Raiteri, F. Marinelli, F. Petrucci, R. Broglia, and M. Parrinello, “Plumed: a portable plugin for free energy calculations with molecular dynamics,” Computer Physics Communications, vol. 180, pp. 1961–72, 2009.

[14] A. Grossfield, “WHAM: an implementation of the weighted histogram analysis method.” http://membrane.urmc.rochester.edu/content/wham/, version 2.0.9.

[15] K. Meagher, L. Redman, and H. Carlson, “Development of polyphosphate parameters for use with the Amber force field,” Journal of , vol. 24, p. 1016, 2003.

[16] O. Berger, O. Edholm, and F. Jähnig, “Molecular dynamics simulations of a fluid bilayer of dipalmitoylphosphatidylcholine at full hydration, constant pressure, and constant temperature,” Biophysical Journal, vol. 72, pp. 2002–13, 1997.

[17] P. M. Kasson, B. Hess, and E. Lindahl, “Probing microscopic material properties inside simulated membranes through spatially resolved three-dimensional local pressure fields and surface tensions,” Chemistry and Physics of Lipids, vol. 169, pp. 106–12, 2013.

[18] T. Lykke-Møller Sørensen, J. Vuust Møller, and P. Nissen, “Phosphoryl transfer and calcium ion occlusion in the calcium pump,” Science, vol. 304, pp. 1672–75, 2004.

[19] C. Olesen, M. Picard, A.-M. Lund Winther, C. Gyrup, J. P. Morth, C. Oxvig, J. Vuust Møller, and P. Nissen, “The structural basis of calcium transport by the calcium pump,” Nature, vol. 450, pp. 1036–42, 2007.

[20] E. Pettersen, T. Goddard, C. Huang, G. Couch, D. Greenblatt, E. Meng, and T. Ferrin, “UCSF Chimera - a system for exploratory research and analysis,” Journal of Computational Chemistry, vol. 25(13), pp. 1605–12, 2004.

[21] A. W. S. da Silva and W. F. Vranken, “ACPYPE - AnteChamber PYthon Parser interfacE,” BMC Research Notes, vol. 5, p. 367, 2012.

[22] M. G. Wolf, M. Hoefling, C. Aponte-Santamaría, H. Grubmüller, and G. Groenhof, “g_membed: Efficient insertion of a membrane protein into an equilibrated lipid bilayer with minimal perturbation,” Journal of Computational Chemistry, vol. 31(11), pp. 2169–74, 2010.

[23] M. Moradi and E. Tajkhorshid, “Computational recipe for efficient description of large-scale conformational changes in biomolecular systems,” Journal of Chemical Theory and Computation, vol. 10, 2014.