Reliable Ligand Docking to Flexible Receptors by Dual Alanine Scanning and Refinement (SCARE) Algorithm

Giovanni Bottegoni,‡ Irina Kufareva,‡ Maxim Totrov,¶ and Ruben Abagyan‡¶ *

‡ The Scripps Research Institute, 10550 N Torrey Pines Rd., La Jolla, CA 92037, USA

¶ Molsoft, LLC, 3366 N Torrey Pines Ct. Suite 300, La Jolla, CA 92037, USA

* Corresponding Author. Dept. of Molecular Biology, TPC28, The Scripps Research Institute, 10550 N Torrey Pines Rd., La Jolla, CA 92037, USA. E-mail: [email protected], Tel: 00 1 (858) 784 8543, Fax: 00 1 (858) 784 8299

Supplementary Information

Materials and Methods

Structures preparation

Benchmark selection. A protein was included in the set if at least two X-ray structures were available, either co-crystallized with different ligands or solved with and without a ligand. Furthermore, only structures that had already been reported in recent cross-docking studies were considered for the set [1; 2; 3; 4; 5; 6]. To enhance the diversity of the set, no more than two structures of the same protein were collected; when possible, holo structures were preferred over apo structures, and complexes with small organic molecules over complexes with endogenous binders, peptides, or peptidomimetic drugs.

Metalloproteins in which metal ion(s) directly interacted with the ligand were excluded from the selection. Thirty-two structures (two apo and thirty holo), namely two conformations of each of sixteen different proteins, were eventually selected. The atomic coordinates were retrieved from the RCSB Protein Data Bank (PDB) [7].

Receptor. The receptor was set up as follows. The chains not involved in defining the binding region were deleted. When multiple copies of the biological entity were included in the asymmetric unit, the best resolved one was chosen. HET groups were kept only if they interacted directly with the ligand. After the assignment of the correct atom types, hydrogen atoms and missing heavy atoms were added to the structure. For the side chains whose occupancy was lower than one, the lowest-energy conformation was selected.

If available, histidines were assigned the tautomeric state reported in the primary literature; otherwise, the most energetically favourable state was calculated and assigned. The positions of the asparagine and glutamine side chains were optimized to improve the hydrogen-bond pattern. The positions of the polar hydrogen atoms underwent a refinement procedure. Water molecules were deleted unless they were involved in at least three specific and conserved contacts with non-water atoms. Finally, the original ligand was deleted from the holo structures. The residues were renumbered according to the protein sequence retrieved from UniProt [8].

Ligand. Ligand atomic coordinates were extracted from the crystallographic complexes. Bond order, tautomeric form, stereochemistry, and protonation state were assigned based on the primary literature description. Each ligand was assigned the MMFF [9] force field atom types and charges, and hydrogen atoms were added.

Reference. The reference co-crystal structure was prepared using the same procedures reported above. The binding pocket region in the reference and the corresponding region in the receptor were superimposed. The superimposition algorithm [10] was based on an iterative procedure that, through an unbiased weight assignment to different atomic subsets, gradually found the best 'alignable' core in both structures. In this way, highly mobile but confined regions, like loops and tails, were less likely to affect the global superimposition outcome. The minimal fraction of equivalent atom pairs to be superimposed with significant weights was set equal to 50%. The maximum number of iterations was set equal to 100.

Receptor Pocket Definition

Experimental Binding Pocket. A mesh representing the ligand molecular surface [11] at the binding site of the reference structure was generated. All the residues with at least one side chain non-hydrogen atom within 3.0 Å of the molecular surface were considered part of the experimental pocket.
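The 3.0 Å selection rule can be illustrated with a short sketch. It is not the ICM implementation: the mesh is approximated here by a plain list of surface points, and the function name `pocket_residues` and the input layout are assumptions made for illustration only.

```python
import math

def pocket_residues(side_chain_atoms, surface_points, cutoff=3.0):
    """Select residues with at least one side-chain heavy atom near the surface.

    side_chain_atoms: iterable of (residue_id, (x, y, z)) for side-chain
                      non-hydrogen atoms (hypothetical input layout).
    surface_points:   list of (x, y, z) points sampled on the ligand surface,
                      standing in for the mesh (an assumption).
    cutoff:           distance threshold in angstroms (3.0 in the text).
    """
    selected = set()
    for residue_id, xyz in side_chain_atoms:
        if residue_id in selected:
            continue  # one qualifying atom is enough for the whole residue
        if any(math.dist(xyz, point) <= cutoff for point in surface_points):
            selected.add(residue_id)
    return sorted(selected)

# Toy coordinates, not taken from any structure in the benchmark.
atoms = [(10, (0.0, 0.0, 0.0)), (10, (9.0, 9.0, 9.0)),
         (11, (5.0, 0.0, 0.0)), (12, (2.5, 0.0, 0.0))]
surface = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
print(pocket_residues(atoms, surface))
```

The same rule applies unchanged when the surface points come from a predicted pocket mesh instead of the ligand surface.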

Predicted Binding Pocket. Putative binding sites were identified by means of the IcmPocketFinder [12; 13] tool as implemented in ICM 3.5. The tolerance value was set equal to 4.6. The macro provided a mesh associated with every detected pocket. If more than one pocket was detected, the graphical object closest to the ligand position in the reference structure was selected. All the residues with at least one side chain non-hydrogen atom within 3.0 Å of the selected mesh were considered part of the pocket. The activation loop region of JNK3 was not completely resolved in either of the two crystal structures from the test set (PDB IDs 1PMN and 1PMV), and it was excluded from the region explored by the IcmPocketFinder macro.

Single Grid Docking

ICM addresses the docking issue as a global optimization problem, implementing a biased probability Monte Carlo (BPMC) global stochastic optimizer [14; 15]. Since the BPMC method was previously reported and thoroughly described, it is only briefly summarized here. During docking, one of the ligand torsional or roto-translational variables is randomly changed. A local refinement is carried out on the analytically differentiable terms by means of a conjugate-gradient minimization. The complete energy is then calculated by adding the contributions of the solvation energy and of the conformational entropy. The new conformation is accepted or rejected according to the Metropolis criterion [16]. A new random change is introduced and the whole procedure is repeated until a preset number of steps has been performed.
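The acceptance step can be sketched in a few lines. This is the generic Metropolis test only, not the BPMC move generator or the ICM energy function; the function name and the choice of kcal/mol energy units with the gas constant are assumptions made for illustration.

```python
import math
import random

GAS_CONSTANT = 0.0019872  # kcal/(mol*K); assumes energies are in kcal/mol

def metropolis_accept(e_old, e_new, temperature, rng=random):
    """Return True if a move from e_old to e_new is accepted.

    Downhill moves are always accepted; uphill moves are accepted with
    probability exp(-(e_new - e_old) / (R * T)).
    """
    if e_new <= e_old:
        return True
    probability = math.exp(-(e_new - e_old) / (GAS_CONSTANT * temperature))
    return rng.random() < probability

class FixedRandom:
    """Deterministic stand-in for the random module, used only for the demo."""
    def __init__(self, value):
        self.value = value
    def random(self):
        return self.value

print(metropolis_accept(-5.0, -6.0, 300.0))                   # downhill move
print(metropolis_accept(0.0, 50.0, 300.0, FixedRandom(0.5)))  # large uphill move
```

Raising the temperature increases the probability of accepting uphill moves, which is why a high simulation temperature broadens sampling.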

The molecular conformation was described by means of internal coordinate variables. The adopted force field was a modified version of the ECEPP/3 force field [17] with a distance-dependent dielectric function. In order to decrease the calculation time, a reduced receptor representation was used. The residues that define the binding pocket were selected; gaps up to three positions wide between the selected segments were filled and the bridging amino acids were included in the selection. All the residues outside the selection were deleted. The binding pocket was described as a combination of five potentials, which accounted for two van der Waals boundaries, electrostatics, hydrophobicity, and hydrogen bonding. Each potential was expressed on a pre-calculated grid with 0.5 Å spacing. Since the standard 6-12 van der Waals potential was considered too sensitive to steric clashes for the purpose of the simulations, a truncated soft version was introduced and the other potentials were rescaled accordingly. Two grid docking simulations employing two different van der Waals potentials were carried out for each complex in the set, capping the potential at 4.0 kcal/mol and 1.0 kcal/mol, respectively.
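To illustrate what capping a 6-12 potential means, the sketch below applies one smooth truncation to a standard Lennard-Jones term. The text does not give the functional form of the soft potential, so the formula E*Emax/(E + Emax) for positive energies, the function names, and the toy parameters are all assumptions, not the published ICM definition.

```python
def lennard_jones_6_12(r, r_min, well_depth):
    """Standard 6-12 potential with its minimum (-well_depth) at r = r_min."""
    ratio = (r_min / r) ** 6
    return well_depth * (ratio * ratio - 2.0 * ratio)

def soft_cap(energy, e_max):
    """Smoothly truncate repulsive energies so they never exceed e_max.

    Attractive (non-positive) energies are left untouched; positive energies
    approach e_max asymptotically. The functional form is an assumption.
    """
    if energy <= 0.0:
        return energy
    return energy * e_max / (energy + e_max)

# Toy parameters: minimum at 3.8 angstroms, well depth 0.1 kcal/mol.
for distance in (4.0, 3.0, 2.0, 1.0):
    raw = lennard_jones_6_12(distance, 3.8, 0.1)
    print(distance, round(raw, 3), round(soft_cap(raw, 4.0), 3),
          round(soft_cap(raw, 1.0), 3))
```

With a lower cap, severe overlaps cost less, so poses that clash with a misplaced side chain are penalized more mildly during the grid docking stage.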

The ligand was docked into the grid representation of the pocket using the BPMC method described above. The basic number of BPMC steps to be carried out was calculated by an adaptive algorithm from the number of rotatable bonds in the ligand. In the present study, the actual number of steps was obtained by increasing the default number ten-fold.

The global optimization provided a set of geometrically diverse ligand conformations. The free energy of binding of each conformation was assessed by means of the standard ICM empirical scoring function [18; 19; 20].

The SCARE algorithm:

Alanine Scanning. In order to overcome the limitations imposed by the rigid structure approach, a protocol based on multiple receptor representations was developed. Multiple receptor representations were generated by adapting to the present task the GAP model, originally proposed by Eisenmenger and co-authors in 1993 [21]. According to the GAP model, each residue in the binding pocket, except for glycines, prolines, and alanines, was considered a candidate for alanine mutation. Cysteines involved in a disulphide bond were excluded from the selection as well. The mutation process did not involve any change in the backbone conformation. Dual mutations were carried out only if the residues were no more than 5 Å apart. The distance between two residues was calculated as the distance between the closest pair of side chain non-hydrogen atoms. Likewise, triple mutations were carried out only if the distance between the third residue and a previously considered couple was within 3 Å.
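The enumeration of single and dual mutants can be sketched as follows. The data layout (a dictionary of residues with a name, an optional disulphide flag, and side-chain heavy-atom coordinates) and the function names are hypothetical, and the triple-mutation rule is left out of this sketch.

```python
import math
from itertools import combinations

EXCLUDED_NAMES = {"GLY", "PRO", "ALA"}

def mutation_candidates(residues):
    """Residue ids eligible for mutation to alanine."""
    eligible = []
    for residue_id, residue in sorted(residues.items()):
        if residue["name"] in EXCLUDED_NAMES:
            continue
        if residue["name"] == "CYS" and residue.get("disulfide", False):
            continue  # cysteines in a disulphide bond are kept
        eligible.append(residue_id)
    return eligible

def residue_distance(res_a, res_b):
    """Distance between the closest pair of side-chain heavy atoms."""
    return min(math.dist(p, q) for p in res_a["atoms"] for q in res_b["atoms"])

def mutation_sets(residues, pair_cutoff=5.0):
    """Return (single mutants, dual mutants) as lists of residue-id tuples."""
    eligible = mutation_candidates(residues)
    singles = [(residue_id,) for residue_id in eligible]
    pairs = [(a, b) for a, b in combinations(eligible, 2)
             if residue_distance(residues[a], residues[b]) <= pair_cutoff]
    return singles, pairs

# Toy pocket with made-up coordinates (angstroms).
pocket = {
    1: {"name": "LEU", "atoms": [(0.0, 0.0, 0.0)]},
    2: {"name": "GLY", "atoms": [(1.0, 0.0, 0.0)]},
    3: {"name": "PHE", "atoms": [(4.0, 0.0, 0.0)]},
    4: {"name": "CYS", "disulfide": True, "atoms": [(2.0, 0.0, 0.0)]},
    5: {"name": "ASP", "atoms": [(10.0, 0.0, 0.0)]},
}
singles, pairs = mutation_sets(pocket)
print(singles, pairs)
```

Each tuple corresponds to one receptor representation, so the number of rigid docking runs grows with the number of eligible residues and of close pairs in the pocket.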

Refinement. Each ligand docked conformation was assumed to be a good prediction of the native binding mode and was strongly tethered during the whole procedure. The refinement process is a three-step procedure based on the BPMC global energy optimization. In each step, different groups of residues were considered. Each group was defined in terms of the distance of the non-hydrogen atoms from the ligand docked pose. If at least one non-hydrogen atom was within the specified cut-off distance, the residue was included in the group. Harmonic restraints pulling each heavy atom in the complex to its original position were introduced [10]. Every restraint took the form of a well-shaped penalty function; the bottom of the well was a flat area where no penalty was imposed. Different weights and upper and lower bounds were specified for each individual restraint. For all the atoms in the system, the weight and lower bound were set equal to 10 kcal/mol and 0 Å, respectively. During the entire refinement procedure, the temperature was set equal to 1200 K and the van der Waals potential was capped at 7.0 kcal/mol. For the residues involved in a salt bridge interaction, a 10 kcal/mol distance restraint with a range of 5 Å was imposed between the charged oxygen and nitrogen atoms.
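A flat-bottom restraint of the kind described above can be sketched as a piecewise function. The text specifies the weight and the bounds but not the exact shape of the walls, so the quadratic growth outside the flat region and the function name are assumptions.

```python
def tether_penalty(displacement, upper, lower=0.0, weight=10.0):
    """Flat-bottom harmonic penalty for one restrained atom.

    displacement: distance of the atom from its original position (angstroms).
    upper, lower: bounds of the flat region where no penalty is applied.
    weight:       force constant (10 in the text; quadratic walls are assumed).
    """
    if displacement > upper:
        return weight * (displacement - upper) ** 2
    if displacement < lower:
        return weight * (lower - displacement) ** 2
    return 0.0

# A receptor atom with a 1 angstrom flat region versus a ligand atom with none.
for moved in (0.0, 0.5, 1.0, 2.0):
    print(moved, tether_penalty(moved, upper=1.0), tether_penalty(moved, upper=0.0))
```

Widening the upper bound lets an atom move freely over a larger region before the tether starts to pull it back, which is how the successive steps grant more freedom to the receptor than to the ligand.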

In the first refinement step, only the side chain variables of a very narrow set of residues, namely those within a sphere of 2 Å from the ligand, were sampled. The most severe clashes between ligand and receptor, mainly caused by the side chains of those residues that were temporarily mutated into alanine during the docking phase, were quickly relaxed. The number of local minimizations and energy evaluations during BPMC was limited to 100 and 10000, respectively. After sampling, the fully flexible receptor was minimized in the space of the atomic Cartesian coordinates. The area where the atoms were allowed to move without triggering the energy penalty due to the tethers was set equal to 1 Å for the receptor and 0 Å for the ligand. After the first refinement step, each conformation was re-scored using the scoring function reported above. Only the ten best scoring solutions proceeded to the next refinement steps.

In the second step, the whole binding site was allowed to move. Thanks to the tethering function, not only the side chain displacements but also the local rearrangements of the protein backbone could be effectively optimized by means of Monte Carlo sampling. Two concentric regions surrounding the ligand were defined. The inner region was defined by first selecting all the residues within 5 Å from the ligand and then filling the gaps of up to three contiguous positions in the selection. The outer region was defined by selecting the residues within 2 Å from the inner region. During the BPMC optimization, the side chain and backbone variables of the inner region were sampled and minimized, while the outer region variables were only minimized. The number of local minimizations and energy evaluations was limited to 200 and 100000, respectively. The restraints upper bound was set equal to 3 Å for the backbone atoms and 4 Å for the side chain atoms.

In the third and last step, the side chain variables of the inner region, as defined in the previous refinement step, were sampled and minimized together with the free variables of the ligand. The ligand, even if strongly tethered to its starting conformation, was allowed to move. This final step was mainly intended to optimize the hydrogen-bond pattern. The restraints upper bound was set equal to 5 Å and 0.5 Å for the receptor and ligand atoms, respectively. The number of local minimizations and energy evaluations was limited to 200 and 50000, respectively. The energy of each conformation was re-assessed and the final results were ranked accordingly.
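The gap-filling rule used to build the inner region (and, earlier, the reduced receptor) can be sketched on residue numbers. The sketch assumes a single chain with consecutive numbering, and the function name `fill_gaps` is hypothetical.

```python
def fill_gaps(selected_numbers, max_gap=3):
    """Add bridging residues wherever the gap between two selected residues
    spans at most max_gap positions (single chain, consecutive numbering)."""
    ordered = sorted(set(selected_numbers))
    filled = set(ordered)
    for left, right in zip(ordered, ordered[1:]):
        missing = right - left - 1
        if 0 < missing <= max_gap:
            filled.update(range(left + 1, right))
    return sorted(filled)

# Residues within the distance cut-off of the ligand (toy numbers).
print(fill_gaps([10, 12, 16, 21]))
```

Here residues 11 and 13 to 15 are added because they bridge gaps of one and three positions, while the four-residue gap between 16 and 21 is left open.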

Re-scoring. Each complex in the stack underwent a three-step refinement procedure. After each step, a re-scoring was performed. Re-scoring was based on a version of the standard ICM scoring function modified so as to include the receptor contributions to the free energy of binding. In a standard simulation, since its composition and its conformation do not change, the receptor provides constant contributions to the binding energy. Therefore, when different ligand binding modes are compared, those contributions can be omitted. In the present study, since each ligand was docked to different receptor representations and the receptor was allowed to move during refinement, that assumption no longer held, and the receptor contributions to the free energy of binding had to be explicitly accounted for. For each conformation of the ligand/receptor complex, the ligand binding score was calculated using a full-atom representation of all molecules. The ICM binding score [10] provides a reasonable approximation of the free energy of binding and is calculated using the following formula:

$$\mathrm{Score} = E_{int} + T\Delta S_{Tor} + E_{vw} + \alpha_1 E_{el} + \alpha_2 E_{hb} + \alpha_3 E_{hp} + \alpha_4 E_{sf}$$

Here, $E_{int}$ is the internal torsional strain of the ligand, calculated as the difference between its internal force-field energy in the bound state and its ground state energy. The ground state energy was estimated by local gradient minimization of the ligand free torsional variables in vacuum. $T\Delta S_{Tor} = 0.60n$ is the ligand conformational entropy loss upon binding, which is assumed to be proportional to the number $n$ of free torsional variables. $E_{vw}$ is the van der Waals interaction energy. It was calculated using a softened van der Waals potential that behaves as the regular one at distances close to or exceeding the sum of the van der Waals radii of the two interacting atoms, but is capped so as not to exceed 1 kcal/mol at smaller distances. $E_{el} = E_{el}^{complex} - (E_{el}^{rec} + E_{el}^{lig})$ is the electrostatic interaction energy. The electrostatic energies of the receptor alone, the ligand alone, and the complex in an aqueous environment were calculated using the generalized Born solvent model [22]. $E_{hb}$ is the hydrogen bonding term. $E_{hp}$ is the desolvation energy of the non-polar atoms, proportional to the decrease in their solvent-accessible area upon complex formation. $E_{sf}$ is the desolvation penalty of the polar atoms. It was calculated together with $E_{hb}$ and corrected to down-weight the negative contribution of the polar atoms of the receptor and the ligand that formed hydrogen bonds upon complex formation. The $\alpha$ coefficients are empirical weights introduced to balance the different contributions.
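As a worked illustration, the score can be assembled from its components as a weighted sum. The sketch assumes that the strain, entropy, and van der Waals terms enter unweighted and that the four remaining terms are each multiplied by one α coefficient and added. The α values are not reported in the text, so the weights used below are placeholders, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ScoreTerms:
    e_int: float         # ligand internal strain
    n_torsions: int      # number of free torsional variables of the ligand
    e_vw: float          # softened van der Waals interaction energy
    e_el_complex: float  # electrostatic energy of the complex
    e_el_rec: float      # electrostatic energy of the receptor alone
    e_el_lig: float      # electrostatic energy of the ligand alone
    e_hb: float          # hydrogen bonding term
    e_hp: float          # non-polar desolvation term
    e_sf: float          # polar desolvation penalty

def binding_score(terms, alphas):
    """Weighted sum of the score components; alphas = (a1, a2, a3, a4)."""
    a1, a2, a3, a4 = alphas
    entropy_loss = 0.60 * terms.n_torsions
    e_el = terms.e_el_complex - (terms.e_el_rec + terms.e_el_lig)
    return (terms.e_int + entropy_loss + terms.e_vw
            + a1 * e_el + a2 * terms.e_hb + a3 * terms.e_hp + a4 * terms.e_sf)

# Made-up component values and placeholder weights, for illustration only.
example = ScoreTerms(e_int=1.0, n_torsions=5, e_vw=-10.0,
                     e_el_complex=-30.0, e_el_rec=-20.0, e_el_lig=-8.0,
                     e_hb=-3.0, e_hp=-4.0, e_sf=2.0)
print(binding_score(example, alphas=(1.0, 1.0, 1.0, 1.0)))
```

Because the receptor electrostatic energy is subtracted explicitly, scores obtained with different receptor conformations remain comparable, which is the point of the modified re-scoring.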

Calculation Time. Each single rigid receptor docking run took approximately three to ten minutes, depending on the ligand size. The refinement procedure took around one hour for each pose. On a single CPU, an average SCARE simulation would take around fifteen hours to complete: three hours for the posing phase and twelve for refinement. Since rigid docking and refinement runs are carried out independently, the calculation load can easily be split between different CPUs. A multiprocessor computer could reduce the overall calculation time to as little as one hour.

Software and Hardware

The preparation of receptors and ligands, the docking simulations, the refinement protocols, and the energy evaluations were all carried out by means of ICM 3.5 (Molsoft L.L.C., La Jolla, CA).

The docking simulations ran on a workstation with an Intel Core 2 Duo 2.40 GHz CPU and 2 GB of memory. The refinement ran on Bluefish, the Linux cluster of the Scripps Research Institute (La Jolla, CA), with 1196 3.4 GHz Intel XEON-EMT CPUs and 2.38 TB of memory.

Bibliography

[1] C.N. Cavasotto and R.A. Abagyan, J Mol Biol, 337 (2004) 209.

[2] J. Meiler and D. Baker, Proteins, 65 (2006) 538.

[3] T. Polgar, A. Baki, G.I. Szendrei and G.M. Keseru, J Med Chem, 48 (2005) 7946.

[4] T. Polgar and G.M. Keseru, J Chem Inf Model, 46 (2006) 1795.

[5] S.-Y. Huang and X. Zou, Proteins, 66 (2007) 399.

[6] W. Sherman, T. Day, M.P. Jacobson, R.A. Friesner and R. Farid, J Med Chem, 49 (2006) 534.

[7] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov and P.E. Bourne, Nucl. Acids Res., 28 (2000) 235.

[8] The UniProt Consortium, Nucl. Acids Res., 35 (2007) D193.

[9] T.A. Halgren, Journal of Computational Chemistry, 17 (1996) 490.

[10] R. Abagyan, A. Orry, E. Raush, L. Budagyan and M. Totrov, ICM Manual 3.5, Molsoft LLC, La Jolla, CA, 2007.

[11] M. Totrov and R. Abagyan, J Struct Biol, 116 (1996) 138.

[12] J. An, M. Totrov and R. Abagyan, Genome Inform, 15 (2004) 31.

[13] J. An, M. Totrov and R. Abagyan, Mol Cell Proteomics, 4 (2005) 752.

[14] R. Abagyan and M. Totrov, J Mol Biol, 235 (1994) 983.

[15] R.A. Abagyan and M. Totrov, Journal of Computational Physics, 151 (1999) 402.

[16] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller, J Chem Phys, 21 (1953) 1087.

[17] G. Nemethy, K.D. Gibson, K.A. Palmer, C.N. Yoon, G. Paterlini, A. Zagari, S. Rumsey and H.A. Scheraga, J. Phys. Chem., 96 (1992) 6472.

[18] M. Totrov and R. Abagyan, RECOMB '99. Proceedings of the Third Annual International Conference on Computational Molecular Biology, ACM Press - New York, Lyon (France), 1999.

[19] M. Totrov and R. Abagyan, In R.B. Raffa (Ed.), Drug-receptor thermodynamics: introduction and applications, Wiley, Chichester; New York, 2001, pp. 603-624.

[20] M. Schapira, R. Abagyan and M. Totrov, J Med Chem, 46 (2003) 3045.

[21] F. Eisenmenger, P. Argos and R. Abagyan, J Mol Biol, 231 (1993) 849.

[22] M. Totrov, Journal of Computational Chemistry, 25 (2004) 609.
