
PhD Thesis

Molecular Determinants of Serine and Cysteine Protease Substrate Recognition and Implications for Rational Drug Design

by

Birgit Waldner

submitted to the Faculty of Chemistry and Pharmacy of the Leopold Franzens University of Innsbruck in partial fulfillment of the requirements for the degree of Doctor rerum naturalium (Dr. rer. nat.)

LEOPOLD-FRANZENS-UNIVERSITY INNSBRUCK

FACULTY OF CHEMISTRY AND PHARMACY

INSTITUTE OF GENERAL, INORGANIC AND THEORETICAL CHEMISTRY

Innsbruck, October 2017

Acknowledgements

Many people accompanied me on the way to finishing my PhD thesis and helped me become the person and researcher I am today.

First, I would like to thank Univ.-Prof. Dr. Dr. Klaus Liedl for giving me the opportunity to be a PhD student in his group and for all of the discussions during the course of my PhD thesis, without whom this work would not be what it is today. His supervision allowed me to grow as a researcher and could not have prepared me better for all the challenges that are to come. Secondly, I would like to thank Prof. Gabriele Cruciani of the University of Perugia and Prof. Rafaela Ferreira of the UFMG in Belo Horizonte, who both allowed me to spend research stays in their laboratories and shared their knowledge with me to help me improve my work. I would also like to thank Prof. Hans Brandstetter for giving me the possibility to travel to his lab and produce factor Xa. I am grateful to the Austrian Academy of Sciences (ÖAW) for funding my PhD through the DOC-grant.

Thank you to all the members and former members of the Liedl group, not only for scientific discussions, but also for sledging fun and cooking evenings: Alex, Anna, Christian, Dennis, Florian, Hannes, Julian, Maren, Meli, Michael, Moni, Niko, Radu, Roland, Stefania, Sonja, Suse, Ursula and Wang Yin. Thank you to Sarah and Martina for taking me to aqua boxing, magician shows and Italian aperitivo, and to Giovanni for going climbing with me in Perugia.

Thank you also to Daniela, Fabrizio, Laura, Massimo, Paulo, Simon and Susi of the Cruciani group for making my exchange stay in Perugia such a good experience. I am also glad to have met Silvia during my stay in Perugia, who is still a dear friend to me.

I don’t know what I would have done without Elany of the Ferreira group, who showed me around Brazil and, together with Viviane, Lorrena, Rafael and Glaecia, made me feel welcome from the first moment I arrived. Thank you also to Lucianna for connecting me with her sister Marianna-Luiza, without whom travelling to Rio would not have been as much fun.

Life as a PhD student would have been only half as good if it were not for the “Bombsquad” filling every lunch break with action. Christoph, David, Gabs, Johann, Johannes, Kathi, Richie, Theresa, Sti, Wolle and all of the guest players, thank you for all the laughs we had. Thank you also to the members of the Thursday lunch group: Christiane, Dani, Danny, Lisa, Maren, Theresa, Sebastian and Willi.

I would like to thank my parents and my little brother for their endless support and patience. I know how lucky I am to have them.

Last but not least, I would like to thank Tobi, whom I love more than words could describe. He lived with me through more than one PhD crisis and never got tired of putting the smile back on my face. For his constant love and support I am grateful every day.


The work presented in this thesis led to the following three first-author articles in peer-reviewed academic journals:

"Electrostatic Recognition as First Step of Binding to Serine Proteases", Waldner, B. J., Kramml, J., Kahler, U., Spinn, A., Schauperl, M., Podewitz, M., Fuchs, J. E., Cruciani, G., & Liedl, K. R., Bioinformatics (2017), submitted June 2017.

"Protease Inhibitors in View of Peptide Substrate Databases", Waldner, B. J., Fuchs, J. E., Schauperl, M., Kramer, C., & Liedl, K. R., Journal of Chemical Information and Modeling (2016). DOI: 10.1021/acs.jcim.6b00064

"Quantitative Correlation of Conformational Binding Enthalpy with Substrate Specificity of Serine Proteases", Waldner, B. J., Fuchs, J. E., Huber, R. G., von Grafenstein, S., Schauperl, M., Kramer, C., & Liedl, K. R., The Journal of Physical Chemistry B (2015). DOI: 10.1021/acs.jpcb.5b10637

Furthermore, the work presented in this thesis led to the following contributions as co-author to articles in peer-reviewed journals:

"Benzimidazoles as potent inhibitors: characterization against rhodesain and molecular basis for structure-activity relationships and selectivity between trypanosomal ", Santos, L.; Pereira, G.; Villela, F.; Dessoy, M.; Dias, L.; Andricopulo, A.; Costa, M.; Nagem, R.; Caffrey, C.; Waldner, B. J.; Fuchs, J.; Liedl, K. R.; Caffarena, E.; Ferreira, R., Journal of Medicinal Chemistry (2017), submitted September 2017.

"A Binding Pose Flip Explained via Enthalpic and Entropic Contributions", Schauperl, M., Czodrowski, P., Fuchs, J. E., Huber, R. G., Waldner, B. J., Podewitz, M., Kramer, C., & Liedl, K. R., Journal of Chemical Information and Modeling (2017). DOI: 10.1021/acs.jcim.6b00483

"Enthalpic and Entropic Contributions to Hydrophobicity", Schauperl, M., Podewitz, M., Waldner, B. J., & Liedl, K. R., Journal of Chemical Theory and Computation (2016). DOI: 10.1021/acs.jctc.6b00422

"Dynamics Govern Specificity of a Protein-Protein Interface: Substrate Recognition by Thrombin", Fuchs, J. E., Huber, R. G., Waldner, B. J., Kahler, U., von Grafenstein, S., Kramer, C., & Liedl, K. R., PLoS ONE (2015), 10(10), e0140713. DOI: https://doi.org/10.1371/journal.pone.0140713

"Characterizing Protease Specificity: How Many Substrates Do We Need?", Schauperl, M., Fuchs, J. E., Waldner, B. J., Huber, R. G., Kramer, C., & Liedl, K. R., PLoS ONE (2015), 10(11), e0142658. DOI: https://doi.org/10.1371/journal.pone.0142658

"Independent Metrics for Protein Backbone and Side-Chain Flexibility: Time Scales and Effects of Ligand Binding", Fuchs, J. E., Waldner, B. J., Huber, R. G., von Grafenstein, S., Kramer, C., & Liedl, K. R., Journal of Chemical Theory and Computation (2015), 11(3), 851-860. DOI: 10.1021/ct500633u


Abstract

Proteases are enzymes that catalyze the cleavage of peptide bonds and play a crucial role in a plethora of biological pathways. They recognize their substrates in eight subpockets termed S4-S4’ according to the corresponding substrate positions (P4-P4’), with the peptide’s scissile bond lying between P1 and P1’. Despite having the same fold, proteases often show distinct specificities, depending on the biological function(s) they fulfill. Starting with a test set of nine serine proteases with chymotrypsin fold, this thesis is concerned with elucidating the key molecular drivers of substrate recognition in proteases and applying the findings in rational drug design. The methods devised for the test set are subsequently applied to parasitic and human cysteine proteases, both in selective drug design efforts and for the investigation of molecular mechanisms in the (de)activation of enzymatic function upon ligand binding.

Substrate binding in biomolecular recognition is governed by the interplay of entropic and enthalpic factors. Therefore, within this thesis, an ensemble of computational methods is employed to characterize entropic and enthalpic contributions to substrate recognition.

Quantification of protease substrate specificity is achieved through the so-called cleavage entropy metric, which takes a value between 0 and 1, with 0 indicating a completely specific subpocket, accepting only a single type of amino acid, and 1 indicating a completely unspecific subpocket, accepting each of the 20 amino acids at the corresponding substrate position. As entropic factors, dynamics are investigated at different time scales by means of molecular dynamics (MD) simulations. Enthalpic factors are investigated with the program GRID, which calculates molecular interaction fields (MIFs) by moving selected probe molecules along an equidistant grid placed over the binding site and calculating the interaction potential of each probe with the binding interface of the serine proteases at each grid point. In addition, water thermodynamics, in terms of both water enthalpy and entropy, are investigated using Grid Inhomogeneous Solvation Theory (GIST).


While for some examples of the serine protease test set, enzyme dynamics, quantified as backbone and sidechain flexibility of subpockets, seem to be the determining factor in substrate recognition, in other cases water thermodynamics and binding enthalpy were found to also play a crucial role in defining substrate specificity. Looking solely at binding enthalpy for X-ray structures of three selected serine protease examples resulted in low correlation with substrate specificity quantified as cleavage entropy. Considering conformational variability through calculation of a conformational binding enthalpy, which incorporates the binding enthalpy of several representative conformations extracted from MD simulations, improved the correlation, but was still not able to explain substrate specificity satisfactorily. In collaboration with Prof. Gabriele Cruciani from the University of Perugia, Italy, concentrating exclusively on electrostatic interactions of the binding site with a positive probe and a negative probe, bearing charges of +1 and -1 respectively, resulted in high correlation between electrostatic molecular interaction field (eMIF) similarity and electrostatic substrate readout similarity, a metric based on the binning of substrate amino acids into three bins according to their charge. Based on the knowledge gained on serine proteases, a peptide substrate-based, shape-based virtual screening approach using vROCS, solely requiring information on the sequences of cleaved peptide substrates, was developed and validated for four test cases. In collaboration with Prof. Rafaela Ferreira from the UFMG in Belo Horizonte, Brazil, the approach developed was employed to screen for new potential small molecule inhibitors of the parasitic cysteine protease cruzain, an important drug target in fighting the neglected tropical Chagas disease. In addition to cruzain, rhodesain, a drug target against African Sleeping Sickness, was also investigated. Differences in subpocket dynamics of cruzain and rhodesain to the human cathepsins B and L were explored as a means to achieve selective targeting of the parasitic cysteine proteases over the human cysteine proteases. The results revealed that differences in both backbone and sidechain flexibility of to cruzain and rhodesain may be exploited in targeted drug design, while flexibility differences to are too insubstantial.

In collaboration with Dieter Brömme from the Life Sciences Centre in Vancouver, the human cysteine protease cathepsin K, important in physiological and pathological bone degradation, was investigated in terms of the influence of single and double mutations and the binding of different types of inhibitors as well as glycosaminoglycan ligands on subpocket and exosite-1 flexibility. Results confirm the of collagenolytic activity in cathepsin K.

In summary, this thesis provides insight into the key molecular drivers of substrate recognition at various protein-protein interfaces and uses the knowledge gained in targeted drug design.


Graphical Abstract


Contents

1. Motivation and Aims
1.1 Motivation
1.2 Aims
2. Background
2.1 Proteases
2.2 Serine Proteases
2.3 Cysteine Proteases
2.4 Specificity of Serine and Cysteine Proteases
2.5 Molecular Recognition Models
3. Methods
3.1 Generic Binding Site Definition
3.2 Quantification of Protease Substrate Specificity and Substrate Readout Similarity
3.3 Preparation of X-Ray Structures
3.4 Molecular Dynamics (MD) Simulations
3.5 Metrics for Calculation of Subpocket Flexibility from MD Trajectories
3.6 Clustering of MD Trajectories
3.7 Grid Inhomogeneous Solvation Theory (GIST)
3.8 Calculation of (Electrostatic) Molecular Interaction Fields ((e)MIFs)
3.9 Calculation of Subpocket Interaction Potentials
3.10 Calculation of Electrostatic Molecular Interaction Fields and Overlap
3.11 Shape Based Virtual Screening
4. Investigation of Chymotrypsin Fold Serine Protease Model Systems
4.1 Protease Test Set
4.2 Quantitative Correlation of Subpocket Flexibility with Subpocket Specificity of Serine Proteases
4.3 Quantitative Correlation of GIST Results with Subpocket Specificity of Serine Proteases
4.4 Quantitative Correlation of (Conformational) Binding Enthalpy with Substrate Specificity of Serine Proteases
4.5 Comparison of eMIF Similarity with Electrostatic Substrate Readout Similarity
5. Transferring the Information on Protease Substrate Specificity to the Small Molecule Space
6. Investigation of Parasitic Cysteine Protease Model Systems
6.1 Cruzain as Drug Target in Chagas Disease
6.2 Rhodesain as Drug Target in Human African Trypanosomiasis (HAT)
6.3 Selective Targeting of Cruzain and Rhodesain in Rational Drug Design
6.4 Peptide Substrate-Based Virtual Screening to Find New Cruzain Inhibitors
7. Investigation of the Human Collagenolytic/Elastolytic Cysteine Protease Cathepsin K
7.1 Cathepsin K
7.2 Impact of Single and Double Mutations on the Flexibility of Cathepsin K
7.3 Impact of the Mechanism of Inhibition on the Flexibility of Cathepsin K
7.4 Impact of Chondroitin Sulfate Binding on the Flexibility of Cathepsin K
7.5 Impact of Dimerization in the Presence of Chondroitin Sulfate on the Flexibility of Cathepsin K
7.6 Impact of Tetramerization in the Presence of Chondroitin Sulfate on the Flexibility of Cathepsin K
7.7 Discussion
8. Conclusion and Prospects
9. Bibliography
Appendix A
A1. List of PDB Structures for Generic Binding Site Definition
A2. Pocket Residues in Generic Pocket Definition of Chymotrypsin-Like Serine Protease Test Set
A3. Cα RMSD Values Between X-Ray and Representative Cluster Structures Extracted From MD Simulations and Cluster Occupancies
Appendix B
B1. Generic Subpocket Definition for Cruzain, Rhodesain, Cathepsin B and Cathepsin L
B2. Flexibility Differences Between Cruzain, Rhodesain, Cathepsin B and Cathepsin L
Appendix C
C1. Subpocket Definition for Cathepsin K
C2. Influence of Single and Double Mutations, Type of Inhibition, Chondroitin Sulfate Binding and Di- and Tetramerization on Cathepsin K Flexibility
Statutory Declaration

1. Motivation and Aims

1.1 Motivation

Proteases are a class of enzymes that cleave proteins and are key players in numerous fundamental cellular reactions [1]. They represent potential drug targets for diseases ranging from cardiovascular disorders [2] to cancer [3],[4] as well as for fighting many viruses [5],[6] and parasites [7]. The main challenge in developing drugs for new protease targets lies in the difficulty in achieving selectivity when targeting a certain protease binding site [8],[9]. Despite being amongst the most studied targets in drug design, the mechanisms of substrate recognition in proteases have not yet been fully understood. Therefore, the focus of this work is on the determination of the molecular driving forces of substrate specificity in proteases. The knowledge of the key drivers of substrate recognition in proteases will not only provide implications for small molecule inhibitor design, but also provide a basis for the design of new therapeutic peptides and peptidomimetics, which are an emerging class of drugs in the pharmaceutical industry [10].

1.2 Aims

When researching the molecular determinants of protease specificity, serine proteases are intriguing candidates because, depending on the biological processes they are part of, they show differences in their substrate specificities despite having the same fold [11]. The main part of this work is thus concerned with investigating the key molecular drivers of substrate specificity for a selected set of serine proteases with chymotrypsin fold. A combination of theoretical methods is used to analyze experimental structural data of the selected targets as well as sequence data on the cleaved protease substrates, both of which are available in freely accessible online databases (the PDB [12] database for structural data and the MEROPS [13] database for data on protease peptide substrate sequences). The concepts and methods developed in the course of the investigation of the serine protease test systems within this work are also applied to selected cysteine proteases to show transferability of the devised approaches.

Differences in substrate preferences are caused by differences in binding free energy $\Delta\Delta G$, with $\Delta G = \Delta H - T\Delta S$. The preference of a protease for a certain substrate is thus determined by a combination of enthalpic and entropic factors. Within this thesis, enthalpic interactions are explored by calculating and comparing molecular interaction fields (MIFs) with selected probe molecules. Entropic factors are studied by means of molecular dynamics (MD) simulations and appropriate metrics for describing conformational variability on different time scales.
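As an illustration of how a difference in binding free energy translates into a substrate preference, the relation can be written out for two substrates A and B binding to the same protease. This is a sketch under the simplifying assumption that preference is governed by equilibrium binding; the association constants $K_A$ and $K_B$, the gas constant $R$ and the temperature $T$ are symbols introduced here for illustration only.

```latex
\begin{align}
\Delta G_X &= \Delta H_X - T\,\Delta S_X, \qquad X \in \{A, B\} \\
\Delta\Delta G &= \Delta G_A - \Delta G_B
  = (\Delta H_A - \Delta H_B) - T\,(\Delta S_A - \Delta S_B) \\
\Delta\Delta G &= -RT \ln \frac{K_A}{K_B}
\end{align}
```

At $T = 298$ K, $RT \approx 0.59$ kcal/mol, so a tenfold preference for substrate A over substrate B corresponds to $\Delta\Delta G \approx -1.4$ kcal/mol. Even small enthalpic or entropic differences between binding sites can therefore produce pronounced differences in specificity.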

The aim of this thesis is to provide a comprehensive overview of the interplay between enthalpic and entropic factors leading to diverse substrate specificities even if only subtle differences in binding site structure are present, as in the case of the investigated serine proteases with chymotrypsin fold.

Chapter 2 provides background information on serine and cysteine proteases and respective protease substrate specificity and introduces the reader to existing models for binding mechanisms. Chapter 3 gives an overview of the methods applied within this work. Chapter 4 summarizes the results obtained for the investigated serine protease systems with chymotrypsin fold. Chapter 5 introduces a newly developed method for virtual screening, based on the knowledge gained during the investigation of serine proteases. Chapter 6 focuses on the investigation of the parasitic cysteine proteases cruzain and rhodesain, while Chapter 7 investigates the human cysteine protease cathepsin K, important in physiological and pathological bone degradation. Concluding remarks and future prospects are provided in Chapter 8.


2. Background

2.1 Proteases

Proteases are enzymes that catalyze the cleavage of peptide bonds and play an essential role in a great number of fundamental physiological processes. They are not only key enzymes in complex signaling cascades such as the apoptosis pathway [14] or the blood coagulation cascade [15], but they are also involved in various degradation processes such as the digestion of food proteins or of antigens during immune response [16]. The importance of proteases in human biology is reflected in the fact that 2% of all human genes encode proteases or protease inhibitors. Proteases also account for 1-5% of the genome of infectious organisms such as bacteria, parasites and viruses, making them an especially attractive target in drug design [17].

Proteases catalyze the hydrolysis of peptide bonds by accelerating the nucleophilic attack on the peptide amide group, which is enabled by either the deprotonation of a serine, threonine or cysteine residue or the activation of a water molecule [18]. They recognize their substrates in eight subpockets, which are named according to the convention of Schechter and Berger [19], with the peptide’s scissile bond lying between P1 and P1’ as shown in Figure 1.


Figure 1. Convention of Schechter and Berger [19] for terming of protease subpockets. The peptide’s scissile bond lies between P1 and P1’.
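The naming convention can be made concrete with a short sketch that labels the residues of a peptide relative to a given scissile bond. The function name, its parameters and the example peptide are hypothetical and serve only to illustrate the numbering scheme: positions are counted outwards from the cleaved bond, P1 to P4 towards the N-terminus and P1' to P4' towards the C-terminus.

```python
def schechter_berger_labels(peptide, scissile_index, span=4):
    """Label residues around a scissile bond as P4..P1 and P1'..P4'.

    scissile_index is the number of residues N-terminal to the cleaved bond,
    i.e. the bond lies between peptide[scissile_index - 1] and
    peptide[scissile_index]. (Hypothetical helper for illustration.)
    """
    if not 0 < scissile_index < len(peptide):
        raise ValueError("scissile bond must lie between two residues")
    labels = {}
    for k in range(1, span + 1):
        n_pos = scissile_index - k      # P_k, counted towards the N-terminus
        c_pos = scissile_index + k - 1  # P_k', counted towards the C-terminus
        if n_pos >= 0:
            labels[f"P{k}"] = peptide[n_pos]
        if c_pos < len(peptide):
            labels[f"P{k}'"] = peptide[c_pos]
    return labels


# Hypothetical octapeptide cleaved between its fourth and fifth residue
labels = schechter_berger_labels("IEGRSATV", 4)
print(labels["P1"], labels["P1'"])
```

Each substrate position Pn is accommodated by the protease subpocket Sn with the same index, so the same labels can be used to refer to the S4-S4' subpockets.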

When classifying proteases according to the catalytic mechanism, one can name the protease families after the crucial residues involved in the catalytic process. Important protease family members are the serine, threonine, cysteine, aspartic and metalloproteases [13]. If the cleavage site is used as a classifier, one can distinguish between endo-peptidases, which attack internal peptide bonds, and exo-peptidases, which cut amino acids from the N- or C-terminal end of proteins [1].

2.2 Serine Proteases

Serine proteases are amongst the best studied enzyme classes. Their catalysis is achieved through the so-called “catalytic triad” or “charge relay system” consisting of an aspartate, a histidine and a serine residue [20]. The triad motif has evolved from at least twelve ancestors leading to the known clans of homologous serine proteases, which are named according to the most prominent representative, such as for example chymotrypsin, and cytomegalovirus protease [21, 22]. The chymotrypsin-like serine proteases are the most abundant clan, consisting of 23 known families [13] and comprising not only serine proteases, but also viral cysteine proteases, which show a homologous structural fold [23],[24]. The serine protease members of the chymotrypsin clan range from digestive proteases such as trypsin, chymotrypsin or elastase-1 to proteases involved in complex signaling pathways such as and factor Xa (fXa), which are both enzymes of the blood coagulation cascade [15]. Their catalytic serine protease site consists of about 250 amino acids that form two beta barrels flanking the active site [25].

The residues of the catalytic triad are His-57, Asp-102 and Ser-195 according to chymotrypsin numbering. During the catalytic process, the deprotonation of Ser-195 is achieved through protonation of His-57, which is stabilized by Asp-102 [26]. The catalytic process is shown in Figure 2.


Figure 2. Catalytic triad in serine proteases consisting of His-57, Asp-102 and Ser-195 according to chymotrypsin numbering. During the catalytic process, Ser-195 is deprotonated through protonation of His-57, which in turn is stabilized by Asp-102.

As the chymotrypsin-like serine proteases represent the most abundant clan and there is a vast amount of experimental structural data and information on peptide substrate sequences present in the PDB [12] and MEROPS [13] databases, they were selected in this work as model systems for the determination of the molecular driving forces of protease specificity. Applying the criteria of availability of apo X-ray structures with a resolution better than 2 Å and availability of substrate sequence information in the MEROPS [13] database, nine serine proteases were chosen as model test set: trypsin, factor VIIa (fVIIa), factor Xa (fXa), thrombin, kallikrein-1, chymotrypsin, elastase-1, granzyme M and granzyme B (see Figure 3).


Figure 3. Protease test set of nine serine proteases with chymotrypsin fold. Trypsin (A), factor VIIa (fVIIa) (B), factor Xa (fXa) (C), thrombin (D), kallikrein-1 (E), chymotrypsin (F), elastase-1 (G), granzyme M (H) and granzyme B (I). The active site is flanked by two beta barrels in all examples.

2.3 Cysteine Proteases

While in serine proteases the nucleophilic attack on the peptide bond is performed by a catalytic serine residue, in cysteine proteases it is performed by a catalytic cysteine residue, deprotonated by a histidine residue, which, depending on the type of cysteine protease, is stabilized by an asparagine or aspartate residue. In contrast to serine proteases, the catalytic cysteine residue is already deprotonated prior to the attack, making cysteine proteases a priori activated enzymes [27]. The MEROPS [13] database lists ten known clans of homologous cysteine proteases. The biggest clan is that of the papain-like cysteine proteases, comprising 41 known families, followed by the CD clan, which comprises 7 known families and has no name-giving representative [13]. Important members of the CD clan are the legumains, which are present in mammals and plants and have recently gained interest due to their ability to show protease, carboxylase or activity, depending on the surrounding milieu [28], and the caspases, which are important mediators in the apoptotic pathway [29]. The catalytic center of members of the CD clan is formed only by a catalytic dyad, consisting of the catalytic cysteine and a histidine residue that deprotonates it.


Among the papain-like cysteine proteases, the most familiar members are the cathepsins [30],[31], which are important in a wide range of biological processes. In mammals, lysosomal cathepsins are important in protein degradation, bone resorption, proenzyme activation, hormone maturation as well as epidermal homeostasis and antigen presentation and processing [32]. In parasites, they are important in growth, development and replication, evasion of the host immune response, skin penetration and host organism invasion [32]. Cysteine proteases are important drug targets especially in the fight against (neglected) tropical diseases [33] such as malaria [34], Chagas disease [35] and African trypanosomiasis [36, 37].

As in serine proteases with chymotrypsin fold (see Section 2.2), the catalytic site of papain-like cysteine proteases is highly conserved. It consists of the three residues Cys-25, His-159 and Asn-175, which form a catalytic triad as shown in Figure 4.

Figure 4. Catalytic triad in papain-like cysteine proteases (papain numbering). Deprotonation of the catalytic Cys-25 is achieved through His-159, stabilized by Asn-175, facilitating nucleophilic attack on the peptide bond. In contrast to the catalytic mechanism in serine proteases, the catalytic cysteine is already deprotonated before attacking the peptide bond, making cysteine proteases a priori activated enzymes.

The catalytic domains of papain-like cysteine proteases are between 220 and 260 amino acids in length, with several parasite-derived cysteine proteases constituting exceptions as they possess a C-terminal extension of unknown function [32].

Within this thesis, the parasitic cysteine proteases cruzain and rhodesain, which are important targets for combating Chagas disease and Human African trypanosomiasis (HAT), also called African Sleeping Sickness [35], [36], [37], are investigated (see Section 6). As selectivity [38] against the human cathepsins B and L is an issue when designing inhibitors against those targets, the goal is to apply the methods developed for serine proteases to find new strategies for selective inhibitor design. The parasitic cysteine proteases cruzain and rhodesain and the human cathepsins B and L as well as their fold-identity are illustrated in Figure 5. The human cysteine protease cathepsin K is a papain-like cysteine protease like cathepsins B and L. Within this thesis it is investigated with respect to changes in subpocket and exosite flexibility upon binding of different types of inhibitors, di- and tetramer formation, mutations and the binding of glycosaminoglycans (see Section 7). The results will provide insight into the effect of the investigated activity modulators on cathepsin K collagenolytic and elastolytic activity.


Figure 5. Cysteine proteases are important when fighting the neglected tropical diseases Chagas disease and Human African trypanosomiasis (HAT). Cruzain (A) and rhodesain (B) are the parasitic cysteine proteases targeted in drug design strategies. Selectivity problems arise due to the similarity with the human cathepsins B (C) and L (D).

2.4 Specificity of Serine and Cysteine Proteases

Specificity of serine and cysteine proteases is driven by chemical interactions at the protein-protein interface between protease and substrate. The individual residues of the peptide substrates interact with defined subpockets as classified following the convention of Schechter and Berger [19] (see Figure 1, Section 2.1). The information on substrates that are cleaved by proteases, the so-called “degradome” [39], is collected in the MEROPS [13] database. Within the MEROPS database, the substrate preference of a protease is both visualized in a cleavage site sequence logo [40] and numerically listed in a specificity matrix.

A method to quantitatively describe the specificity of serine proteases using the cleavage pattern data present in the MEROPS database was developed by Fuchs et al. [41]. The so-called cleavage entropy indicates subpocket specificity with a score ranging from 0 (completely specific) to 1 (completely unspecific). Summing up the cleavage entropy scores of all the individual subpockets yields an overall specificity score that can be used to rank the proteases according to specificity. For better understanding, subpocket cleavage entropy scores are mapped to the binding site for the three examples chymotrypsin (A), granzyme B (B) and fXa (C) in Figure 6.

Figure 6. Cleavage entropy mapped to the binding site of chymotrypsin (A), granzyme B (B) and fXa (C). A cleavage entropy of 0 (red) indicates a completely specific subpocket, accepting only one specific amino acid, while a cleavage entropy of 1 (green) indicates a completely unspecific subpocket, accepting every amino acid. The examples are ranked according to overall specificity, obtained by summing the cleavage entropy scores of individual subpockets. Chymotrypsin only shows mild specificity in the S1 subpocket, while being completely unspecific in other subpockets. Granzyme B shows moderate specificity in the S1 subpocket and also does not show pronounced specificity in other subpockets. FXa is the most specific example and shows high specificity in the S1 subpocket and moderate specificity in other subpockets.
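The behaviour of such a score can be illustrated with a minimal sketch. It assumes that the cleavage entropy of a subpocket is a Shannon entropy over the amino acid frequencies observed at the corresponding substrate position, taken with a base-20 logarithm so that it ranges from 0 to 1. The function names and the input format (a dictionary of residue counts per position) are hypothetical, and any correction for the natural abundance of amino acids is omitted here.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 proteinogenic amino acids


def cleavage_entropy(counts):
    """Normalized Shannon entropy (log base 20) for one substrate position.

    counts maps one-letter amino acid codes to the number of cleaved
    substrates showing that residue at the position (assumed input format).
    Returns 0 for a single accepted residue and 1 for a uniform distribution.
    """
    total = sum(counts.get(aa, 0) for aa in AMINO_ACIDS)
    if total <= 0:
        raise ValueError("no substrate observations for this position")
    entropy = 0.0
    for aa in AMINO_ACIDS:
        p = counts.get(aa, 0) / total
        if p > 0:  # terms with p = 0 contribute nothing
            entropy -= p * math.log(p, 20)
    return entropy


def overall_specificity(position_counts):
    """Sum of the subpocket scores, e.g. over the positions P4 to P4'."""
    return sum(cleavage_entropy(c) for c in position_counts.values())


# Hypothetical counts: a strict position versus a completely unspecific one
strict = {"R": 120}
uniform = {aa: 10 for aa in AMINO_ACIDS}
print(cleavage_entropy(strict), round(cleavage_entropy(uniform), 3))
```

With eight subpockets, the summed score in this sketch lies between 0 and 8, and lower values indicate a more specific protease.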

Overall substrate specificity does not only depend on the interactions between S1 and P1, but it is also determined by substrate specificity in other binding pockets. P4-S4 interactions were found to be highly specific in case of the non-homologous serine proteases [42].


Especially in the S4-S1 region, significant differences in cleavage specificity are found among close homologues, ranging from limited proteolysis to almost unspecific substrate cleavage. Several cleavage site prediction tools are based on such rules and are available online [43]. In contrast to most metalloproteases, serine proteases do not show high specificity in the S'-region. In the case of the highly specific thrombin, insertions in loop regions were shown to increase specificity in the P'-region [44].

Explaining the specificity score of a certain binding pocket is not an easy task. Taking the example of chymotrypsin and the homologous trypsin, the S1 specificity of chymotrypsin for hydrophobic residues can be explained by the lack of charged residues in the S1 subpocket [45]. In contrast, in the homologous trypsin, Asp-189 located at the bottom of the S1 subpocket determines specificity for Arg and Lys at P1 [46].

The famous work of Hedstrom [47] shows the limitation of a simplistic model considering only P1-S1 interactions. When trying to convert trypsin to chymotrypsin by site-directed mutagenesis, chymotrypsin specificity of trypsin could not be achieved by simply changing the trypsin S1 pocket to a chymotrypsin S1 pocket. Only when surface loops adjacent to the binding pocket were modified accordingly could the specificity exchange of the two homologous enzymes be achieved [45], [47], [48]. In several cases it was found that loop interactions and the dynamics of loops adjacent to the binding pocket considerably influence substrate specificity [49], [50]. According to Hedstrom [25], catalysis and specificity are determined not only by a few residues, but by motions and charge distributions over the whole protein framework.

The link between dynamics and catalysis has been made on several occasions, with α-lytic protease being a very prominent example [51], [52], [53], [54]. In the case of α-lytic protease, enzyme specificity is governed by enzyme dynamics [55]. For retroviral proteases, domain flexibility seems to be the cause of drug resistance in compensatory mutated forms [56], [57].

For snake venom metalloproteases, Wallnöfer et al. [58] showed that specificity can be predicted through analysis of conformational ensembles using computational techniques.

For the apoptotic proteases caspase 3 and 7, substrate specificity is found to be correlated to binding site rigidity, with one of the two being slightly more specific due to its narrower conformational space. Expanding the conformational space by in silico exchange mutation as performed by Fuchs et al. [59] is expected to also lead to expansion of substrate specificity.

While there are several publications linking flexibility or enzyme-substrate interactions to specificity [60], [59], so far there has been no approach that quantitatively rationalizes enthalpic and dynamic contributions to specificity. Developing such an approach would make it possible to move away from a substrate-driven explanation of specificity to a complete model defined by protease binding pocket characteristics.

2.5 Molecular Recognition Models

The first model of biomolecular recognition was proposed by Fischer in 1894 [61]. It explains enzyme-substrate binding as a complementary fit of substrate and enzyme, with both being static structures. Since then, the prevailing view has moved on from Fischer's “lock and key” model to the “induced fit” model put forward by Koshland [62]. According to Koshland, the binding of a substrate to a protein induces a conformational change in the protein leading to energetically favorable interactions between enzyme and substrate.

However, proteins are intrinsically flexible and undergo conformational changes independently of the presence of a ligand. For several systems the mechanism of the so-called “conformational selection” (also called population selection, fluctuation fit or selected fit, depending on the nomenclature used) has been suggested [63], with the initial experimental evidence going back to Zavodszky in 1966 [64]. Following the model of conformational selection, the binding partner selects the most appropriate conformation from pre-existing ensembles of conformations. While binding, the population of the conformational ensembles is shifted and the preferred conformation might become more dominant [65]. In 1999, the energy landscape-based view of dynamic ensembles led to the generalized theory of “conformational selection and population shift” [66]. The process of population shift has been extensively studied for Ubiquitin using a combination of NMR techniques, accelerated MD simulations and extended molecular dynamics [67], [68]. Long et al. recently even described a directional selection process preceding the conformational selection in Ubiquitin-Ubiquitin-interacting-motif binding [69].


3. Methods

3.1 Generic Binding Site Definition

Protease subpocket definitions are usually substrate based, and subpockets are named according to their corresponding ligand binding site [19]. An example of a detailed explicit definition of all subpockets of a serine protease is the definition of elastase-1 subpockets created by Bode in 1989 [70]. The drawbacks of this definition are that the included amino acid residues differ considerably between the two different forms of elastase he worked on and that there is a high degree of overlap within the subpocket definition. Both shortcomings make systematic statistical analysis difficult, creating the need for a generic binding site definition as a basis for the subsequent calculation of binding site characteristics and the assignment of calculated characteristics to specified subpockets.

For serine proteases, complex X-ray structures of the serine protease test set are used as starting points for generating a generic serine protease subpocket definition. The PDB codes for the serine protease complex structures used for the generation of a generic pocket definition are given in Appendix A1. In a first subpocket definition, amino acid residues within a radius of 4 Å from the corresponding ligand binding site are included. This first subpocket definition is refined by performing a multiple sequence alignment in MOE [71] and identifying corresponding residues and residues that are conserved amongst the different serine proteases with chymotrypsin fold. Considering the physico-chemical properties of the amino acids involved as well as the subpocket characteristics and subpocket definitions known from literature, a subpocket definition containing the same number of residues for all investigated serine proteases and minimum overlap of the different subpockets is created. The process of generating a generic subpocket definition for serine proteases with chymotrypsin fold is schematically illustrated in Figure 7.


Figure 7. Schematic illustration of generic pocket definition generation for serine proteases with chymotrypsin fold, mapped to the surface of chymotrypsin. All ligands of serine proteases with chymotrypsin fold found in the PDB are pooled (A). Subsequently pockets are selected based on proximity (4 Å) to the corresponding substrate binding site and refined by comparing geometrically equivalent amino acid positions in the serine protease test set. Starting from S1 (B), subsequently S2 (C), S3 (D) and S4 (E) were selected until finally a pocket definition for all eight subpockets was achieved (F). For all serine proteases with chymotrypsin fold, subpockets have the same number of residues and minimum overlap.

Within members of the same fold, the generic pocket definition can be transferred through alignment and subsequent assignment of amino acids to subpockets without needing to repeat all of the steps described above.

3.2 Quantification of Protease Substrate Specificity and Substrate Readout Similarity

Linking protease binding site characteristics quantitatively to serine protease substrate specificity requires a metric that quantifies protease substrate specificity.

One method to quantify substrate specificity is the cleavage entropy Si developed by Fuchs et al. [41]. Calculation of the cleavage entropy is described in Equation 1 with pa,i being the probability of amino acid a in subpocket i of known substrates. The probability pa,i is normalized to the natural occurrence of amino acids [72] to overcome the limitation that the dataset is biased towards cleavage sites occurring in tested proteomes. If fewer than 30 substrate sequences are available for calculation of the cleavage entropy, a correction algorithm developed by Schauperl et al. [73] has to be employed to obtain accurate results.

$$S_i = -\sum_{a=1}^{20} p_{a,i} \log_{20} p_{a,i}$$ Equation 1
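As a minimal sketch of Equation 1, the following Python function computes the cleavage entropy of a single substrate position from amino acid counts. The normalization scheme (dividing counts by the natural occurrence and rescaling to a sum of one), the function name and the uniform background used in the example are assumptions made for illustration; they are not taken from Fuchs et al. [41], and the small-sample correction of Schauperl et al. [73] is not included.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cleavage_entropy(counts, natural_occurrence):
    """Return S_i = -sum_a p_a log_20 p_a for one substrate position."""
    # Assumed normalization: weight each count by the inverse natural
    # occurrence, then rescale so that the probabilities sum to one.
    weighted = {aa: counts.get(aa, 0) / natural_occurrence[aa]
                for aa in natural_occurrence}
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("no substrate observations for this position")
    entropy = 0.0
    for weight in weighted.values():
        p = weight / total
        if p > 0.0:  # terms with p = 0 contribute nothing
            entropy -= p * math.log(p, 20)
    return entropy


# Placeholder background (uniform); not the occurrence values of [72].
uniform_background = {aa: 1.0 / 20.0 for aa in AMINO_ACIDS}

# A position that only accepts Arg is maximally specific (entropy 0).
specific = cleavage_entropy({"R": 50}, uniform_background)
# A position that accepts all residues equally is unspecific (entropy 1).
unspecific = cleavage_entropy({aa: 5 for aa in AMINO_ACIDS}, uniform_background)
```

Because the logarithm is taken to base 20, the entropy is bounded between 0 (a single accepted amino acid) and 1 (all 20 amino acids equally likely).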

Based on the probabilities used in Equation 1, substrate readout similarity s between two proteases is calculated as the scalar product of the corresponding substrate specificity vectors v1 and v2, as shown in Equation 2:

$$s = \vec{v}_1 \cdot \vec{v}_2 = \left(p_{S4,Ala}, p_{S4,Arg}, \ldots, p_{S4,Val}, p_{S3,Ala}, \ldots, p_{S4',Val}\right)_1 \cdot \left(p_{S4,Ala}, p_{S4,Arg}, \ldots, p_{S4,Val}, p_{S3,Ala}, \ldots, p_{S4',Val}\right)_2$$ Equation 2

The drawback of the approach designed by Fuchs et al. [41] is that characteristics such as charge and hydrophobicity of amino acids are not considered. Therefore, the method was adapted to obtain an electrostatic substrate readout similarity and a corresponding specificity measure that account for the different charges of substrate amino acids by sorting them into three bins according to their charge: positively charged, negatively charged and neutral amino acids.

Substrate sequences for protease peptide substrates are again extracted from the MEROPS [22] database. For each substrate position, amino acids are assigned to one of the three bins according to their electrostatic properties. For each substrate position, a vector is constructed via Equation 3.

$$\mathbf{v}_{Pi} = \left( N^{Pi}_{positive} = \frac{\sum_{a=1}^{3} n^{Pi}_a}{\sum_{a=1}^{3} p_a},\; N^{Pi}_{negative} = \frac{\sum_{a=1}^{2} n^{Pi}_a}{\sum_{a=1}^{2} p_a},\; N^{Pi}_{neutral} = \frac{\sum_{a=1}^{15} n^{Pi}_a}{\sum_{a=1}^{15} p_a} \right)$$ Equation 3

In Equation 3, NbinPi is the score in one of the three electrostatic bins (positive, negative or neutral) of substrate position Pi, a is the index of the amino acids within the respective bin, naPi is the number of occurrences of amino acid a in the substrate position Pi and pa is the natural occurrence of amino acid a. The weighting with the natural occurrence of the amino acids in human proteins [72], as done for the cleavage entropy, results in an intrinsic normalization of the bins.

Similar to Equation 2, the concatenation of vectors vPi for substrate positions P4-P4’ and bins gives a vector of dimension 3⋅8=24. This vector is subsequently normalized to 1 and contains the electrostatic substrate preferences for each protease as shown in Equation 4.

$$\mathbf{v}_{Protease} = \frac{(\mathbf{v}_{P4}, \ldots, \mathbf{v}_{P4'})}{\lVert(\mathbf{v}_{P4}, \ldots, \mathbf{v}_{P4'})\rVert}$$ Equation 4

The electrostatic substrate similarity of two proteases is calculated by forming the scalar product between the respective electrostatic substrate preference vectors. A scalar product of 0 indicates orthogonal substrate preferences. A scalar product of 1 implies identical electrostatic substrate preferences, as found when comparing an individual protease with itself, i.e. the self-similarity. Additionally, the product of each individual component of the two protease vectors may be interpreted as a contribution corresponding to either positive, negative or neutral substrate residues at a single substrate position. The sum over the individual substrate positions indicates the overall contribution of positive, negative and neutral substrate residues.
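The construction of the electrostatic preference vector (Equations 3 and 4) and the scalar product can be sketched in Python as follows. This is an illustration under stated assumptions: the assignment of Arg, Lys and His to the positive bin is inferred from the bin sizes 3, 2 and 15, and the function names, the uniform background and the two artificial proteases are hypothetical.

```python
import math

# Assumed bin membership, inferred from the bin sizes 3 / 2 / 15.
POSITIVE = "RKH"
NEGATIVE = "DE"
NEUTRAL = "ACFGILMNPQSTVWY"
BINS = (POSITIVE, NEGATIVE, NEUTRAL)


def position_vector(counts, natural_occurrence):
    """Equation 3: one score per charge bin for a single substrate position."""
    scores = []
    for members in BINS:
        observed = sum(counts.get(aa, 0) for aa in members)
        expected = sum(natural_occurrence[aa] for aa in members)
        scores.append(observed / expected)
    return scores


def protease_vector(per_position_counts, natural_occurrence):
    """Equation 4: concatenate the P4..P4' vectors and normalize to length 1."""
    flat = [score
            for counts in per_position_counts
            for score in position_vector(counts, natural_occurrence)]
    norm = math.sqrt(sum(x * x for x in flat))
    if norm == 0.0:
        raise ValueError("no substrate observations")
    return [x / norm for x in flat]


def similarity(v1, v2):
    """Scalar product of two normalized preference vectors."""
    return sum(a * b for a, b in zip(v1, v2))


# Placeholder background (uniform); not the occurrence values of [72].
background = {aa: 1.0 / 20.0 for aa in POSITIVE + NEGATIVE + NEUTRAL}

# Two artificial proteases with eight substrate positions (P4..P4') each.
basic_protease = protease_vector([{"R": 10}] * 8, background)
acidic_protease = protease_vector([{"D": 10}] * 8, background)

self_similarity = similarity(basic_protease, basic_protease)
cross_similarity = similarity(basic_protease, acidic_protease)
```

In this artificial case the self-similarity is 1 and the similarity between the purely basic and the purely acidic preference is 0, matching the two limiting cases described in the text.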

3.3 Preparation of X-Ray Structures

X-ray structures for molecular dynamics (MD) simulations (see Section 3.4) and calculation of molecular interaction fields (MIFs) (see Section 3.8) were downloaded from the PDB [12]. For the investigated serine protease test set, X-ray structures of chymotrypsin (PDB code 4CHA [74]), elastase (PDB code 1QNJ [75]), granzyme B (PDB code 1FQ3 [76]), granzyme M (PDB code 2ZGC [77]), fVIIa (PDB code 1KLI [78]), fXa (PDB code 1C5M [79]), kallikrein 1 (PDB code 1SPJ [80]), thrombin (PDB code 1FPH [81]) and trypsin (PDB code 1PQ7 [82]) were used. The criteria for the selection were that the structures were not bound to a ligand and that the resolution was better than 2 Å. Apo structures were chosen so that the binding site conformation would not be determined by a bound ligand or substrate. Only in the case of thrombin was a substrate-bound structure accepted, to ensure that the binding site was in the active conformation, as thrombin exists in several different forms [83]. The bound substrate was deleted using MOE [71]. If present in serine protease X-ray structures, the light chain was removed.

For cruzain (PDB code 3KKU [84]), rhodesain (PDB code 2P86, unpublished work), cathepsin L (PDB code 4AXL [85]) and cathepsin B (PDB code 1HUC [86]) of the investigated cysteine proteases, X-ray structures were downloaded from the PDB [12]. In each case, the highest quality structure available without a bound ligand was used. Investigated cathepsin K structures are described in detail in Section 7 as some of the structures used were directly obtained from experimental collaborators.

For all investigated systems, water molecules at a distance greater than 4 Å from any protein atom were removed from the X-ray structure. Buried water molecules were kept, as they are crucial for the stability of MD simulations. Protonation was carried out with the Protonate3D protocol implemented in MOE [71]. Protonation of histidines, aspartates and glutamates was manually checked and corrected, if necessary.

For calculation of molecular interaction fields (MIFs) (see Section 3.8) protein hydrogens were removed, as protonation is carried out automatically within the program GRID used for their determination.

3.4 Molecular Dynamics (MD) Simulations

As long as the temperature is above 0 K, proteins do not represent static structures, but are constantly undergoing dynamic changes leading to an ensemble of conformations as illustrated in Figure 8.


Figure 8. Example for a structural ensemble at temperature >0 K. The protein does not exist in only one static conformation, but constantly undergoes dynamic changes to form an ensemble of conformations, with each conformation indicated by a different color.

Molecular Dynamics (MD) simulations are one way of gaining access to the structural ensemble of a protein. The structural ensemble of a protein can also be obtained through Monte Carlo simulations [87-89], for example. However, in contrast to MD simulations, Monte Carlo simulations do not provide any information on the timeline of structural variability and are thus hard to interpret in the context of biological research questions. Nevertheless, Monte Carlo simulations are often applied to study protein folding (see for example [90-92]). Within this work MD simulations were performed to obtain information on the dynamics of the investigated systems on different time scales. The following paragraphs introduce the basics of MD simulations in the context of the simulation techniques applied in the course of this work.

MD simulations require a starting structure that can either be determined experimentally, through X-ray crystallography or NMR spectroscopy in solution, or obtained through computational techniques such as homology modeling [93, 94] if structural data is not accessible through experiments.

The basis of MD simulations is formed by the force field and Newton's equations of motion. The potential function of the force field considers bonding and non-bonding interactions as given in Equation 5. Bond stretching (Ebond), angle bending (Eangle) and dihedral angle (Edihedral) potentials describe bonding interactions, while van der Waals (Evan der Waals) and Coulomb (Eelectrostatic) potentials describe non-bonding interactions within the protein. The functional form of the individual potentials differs depending on the applied force field.

$$E_{total} = E_{bond} + E_{angle} + E_{dihedral} + E_{van\ der\ Waals} + E_{electrostatic}$$ Equation 5

Starting from the initial coordinates r0 and the initial arbitrarily assigned velocities v0 for each atom position, new velocities v1 and coordinates r1 are calculated, following these steps:

1. Calculation of the force from the potential function according to $F = -\frac{\partial E}{\partial r}$

2. Calculation of the acceleration a from F according to $a = \frac{F}{m}$

3. Calculation of new velocities v1 according to v1=v0+aΔt

4. Calculation of new coordinates r1 according to r1=r0+v0Δt
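The four steps can be sketched for a single particle in one dimension as shown below, using the standard conventions F = -dE/dr and a = F/m. The function names and the harmonic test potential are assumptions chosen for illustration. The update follows the simple scheme listed above; production MD codes usually rely on more stable integrators such as leapfrog or velocity Verlet.

```python
def md_step(r0, v0, force, mass, dt):
    """Advance one particle by one time step following the four steps above."""
    f = force(r0)        # step 1: force as negative gradient of the potential
    a = f / mass         # step 2: acceleration from Newton's second law
    v1 = v0 + a * dt     # step 3: new velocity
    r1 = r0 + v0 * dt    # step 4: new coordinate (uses the old velocity)
    return r1, v1


def harmonic_force(r, k=1.0, r_eq=1.0):
    """Force for the potential E = k (r - r_eq)^2, i.e. F = -2 k (r - r_eq)."""
    return -2.0 * k * (r - r_eq)


# A particle displaced from the equilibrium distance, initially at rest.
r1, v1 = md_step(r0=1.5, v0=0.0, force=harmonic_force, mass=1.0, dt=0.1)
```

Starting at rest, the position is unchanged after the first step, while the velocity already points back towards the equilibrium distance.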

The force field applied within this thesis is the Amber force field ff99SB-ILDN [95], as implemented in Amber12. In the following, the individual terms of the potential function as implemented in Amber12 are presented.

Bond Stretching. In Amber12, bond stretching is described by a harmonic potential as given in Equation 6.

$$E_{bond} = \sum_{bond} K_r (r - r_0)^2$$ Equation 6

In Equation 6, Kr is the force constant, r0 is the equilibrium bond length and r the current bond length. Both Kr and r0 are empirically derived parameters. Bond stretching and the harmonic potential are illustrated in Figure 9.


Figure 9. Harmonic potential to describe bond stretching. At equilibrium distance r0 the potential is at its minimum. Further compression or extension moves away from the energy minimum as repulsive forces dominate upon bond compression and attractive forces are not able to act effectively upon bond extension.

Angle Bending. Like bond stretching, angle bending is described by a harmonic potential, see Equation 7:

$$E_{angle} = \sum_{angle} K_\Theta (\Theta - \Theta_{eq})^2$$ Equation 7

In Equation 7, KΘ is again the force constant, Θeq is the equilibrium angle and Θ is the actual angle. Both KΘ and Θeq are empirically derived. Θ is illustrated in Figure 10.


Figure 10. Θ in angle bending potential.

Dihedral Angles. In Amber12 dihedral angles are described as given in Equation 8.

$$E_{dihedral} = \sum_{dihedral} \frac{V_n}{2} \left[1 + \cos(n\varphi - \gamma)\right]$$ Equation 8

In Equation 8, Vn is the barrier height parameter, n is the multiplicity (a natural number), φ is the dihedral angle and γ the phase angle. An illustration is given in Figure 11.
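To make the three bonded terms concrete, the sketch below evaluates a single contribution to each of Equations 6 to 8. The parameter values are illustrative only and are not force field parameters. In particular, describing ethane with one threefold term whose Vn equals the 12.5 kJ/mol from Figure 11 is a simplifying assumption.

```python
import math


def bond_energy(k_r, r, r0):
    """Single harmonic bond term of Equation 6."""
    return k_r * (r - r0) ** 2


def angle_energy(k_theta, theta, theta_eq):
    """Single harmonic angle term of Equation 7 (angles in radians)."""
    return k_theta * (theta - theta_eq) ** 2


def dihedral_energy(v_n, n, phi, gamma):
    """Single dihedral term of Equation 8 (angles in radians)."""
    return 0.5 * v_n * (1.0 + math.cos(n * phi - gamma))


# Illustrative threefold term: maximum at 0 deg (eclipsed), minimum at 60 deg.
V_N = 12.5  # kJ/mol, assumed single-term barrier for illustration
eclipsed = dihedral_energy(V_N, n=3, phi=math.radians(0.0), gamma=0.0)
staggered = dihedral_energy(V_N, n=3, phi=math.radians(60.0), gamma=0.0)

# Harmonic terms vanish at the equilibrium value and grow quadratically.
at_equilibrium = bond_energy(k_r=300.0, r=1.09, r0=1.09)
stretched = bond_energy(k_r=300.0, r=1.19, r0=1.09)
```

With a phase angle of zero and a multiplicity of three, the term repeats every 120° and the energy difference between its maximum and minimum equals Vn.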


Figure 11. Dihedral angle potential. As an example, the dihedral angle potential for ethane is shown; φ is illustrated in (A). (B) shows the course of the potential function when moving from the eclipsed (E) to the staggered (S) conformation. The energy difference between the two conformations of ethane is 12.5 kJ/mol.

Van der Waals Interactions. Van der Waals interactions are non-bonding interactions and in Amber12 are considered according to Equation 9:

$$E_{van\ der\ Waals} = \sum_{i<j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)$$ Equation 9

Aij and Bij are the empirically determined van der Waals parameters and rij is the distance between atoms i and j. The first term represents the repulsive interactions, which are proportional to $1/r^{12}$, the second term the attractive forces, proportional to $1/r^{6}$. As van der Waals interactions act between all pairs of atoms but decay rapidly with distance, a cut-off is applied during MD simulations to find the optimum between accuracy and required computational time. For van der Waals interactions, usually a cut-off distance of 8-10 Å is applied.
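A minimal sketch of Equation 9 with a distance cut-off is given below. It treats the cut-off as a distance in Å, in line with the 8 Å cut-off stated for the simulations in this work. Using one pair of A and B parameters for all atom pairs, the function name and the example coordinates are simplifying assumptions; a real force field assigns parameters per atom type pair.

```python
import math
from itertools import combinations


def vdw_energy(coords, a, b, cutoff=8.0):
    """Sum A/r^12 - B/r^6 over all atom pairs closer than the cut-off."""
    energy = 0.0
    for ri, rj in combinations(coords, 2):  # each pair i < j exactly once
        r = math.dist(ri, rj)
        if r <= cutoff:  # pairs beyond the cut-off are skipped
            energy += a / r ** 12 - b / r ** 6
    return energy


# With A = 1 and B = 2 the pair potential has its minimum of -1 at r = 1.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
total = vdw_energy(atoms, a=1.0, b=2.0, cutoff=8.0)
```

Only the first two atoms are within the cut-off, so the third atom does not contribute to the sum.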


Figure 12. Illustration of van der Waals interactions. If atoms i and j are brought together from a distance, attractive forces dominate until the equilibrium distance r0 is reached. If atoms are brought closer together than r0, repulsion of their electron shells starts dominating, causing the potential to increase again. Van der Waals interactions are evaluated for all atom pairs; to speed up the computation, usually a cut-off distance of 8-10 Å is applied.

Coulomb Interactions. The second type of non-bonding interactions considered in the Amber12 force field are Coulomb interactions, which are calculated according to Equation 10.

$$E_{electrostatic} = \sum_{i<j} \frac{q_i q_j}{\varepsilon r_{ij}}$$ Equation 10

In Equation 10, qi and qj are the charges of atoms i and j, ε is the dielectric constant and rij the distance between atoms i and j. Depending on the type of charges, Coulomb interactions can be repulsive or attractive. If charges qi and qj are of opposite sign, the resulting potential is attractive, if they are of the same sign, a repulsive potential results. Coulomb interactions decay with $1/r$ and are thus long-range interactions. If a cut-off were applied as done for van der Waals interactions, the resulting error would be unacceptably high. Therefore, a technique is required, which transforms the infinite sum of Equation 10 into a calculable sum. A method to achieve this is the Ewald summation. In Ewald summation, the infinite sum is Fourier-transformed and calculated as a finite sum in reciprocal space. For further reading the reader is referred to Darden et al. [96] and Salomon-Ferrer et al. [97].

Figure 13. Illustration of Coulomb interactions. If atoms i and j bear charges of the same sign, a repulsive potential results, if they are of different sign, the resulting potential is attractive. Coulomb interactions are long-range and decay with $1/r$. Therefore, the application of a cut-off would lead to an unacceptably high error.

The time step of an MD simulation is limited by the highest-frequency vibration, which is the C-H bond stretching. C-H bond stretching has a vibrational period of roughly 10 fs, which restricts the time step to about 1 fs. To increase the time step and thus lower the computational time required, the SHAKE [98] algorithm is applied. The SHAKE algorithm constrains the bond lengths involving hydrogen atoms during the simulation and allows an increase of the time step to 2.0 fs.

MD simulations can be performed at constant energy (microcanonical ensemble) or constant temperature (canonical ensemble). In the microcanonical and canonical ensemble particle exchange between different subsystems is not allowed. It is possible, however, in the grand-canonical ensemble. All MD simulations performed in the course of this thesis were performed at constant particle number, constant pressure and constant temperature (NpT ensemble).

During the simulations performed in this work, the temperature is kept constant using a Langevin thermostat. In Langevin dynamics, the temperature is maintained constant through modification of Newton's equation of motion $\dot{r}_i = \frac{p_i}{m_i}$ with:

$$\dot{p}_i = F_i - \gamma_i p_i + f_i$$ Equation 11

In Equation 11, Fi is the force acting on atom i due to the interaction potential, γi is a friction coefficient and fi is a random force with dispersion σi, which is related to the friction coefficient as given in Equation 12:

$$\sigma_i^2 = \frac{2 m_i \gamma_i k_B T}{\Delta t}$$ Equation 12

Δt is the time step used in MD simulations to integrate the equations of motion, kB is the Boltzmann constant and T the temperature. Pressure within the system is kept constant by using isotropic position scaling. All simulations within this work were performed applying periodic boundary conditions. In periodic boundary conditions, each atom in the computational cell is replicated to form an infinite lattice. Each particle in the computational cell interacts not only with particles within the computational cell, but also with particles in adjacent cells. If a particle moves out of the computational cell, in the course of the analysis of the MD trajectory it is imaged back to its equivalent position within the computational cell. This is done with tools such as cpptraj [99], employed for the analyses of all MD trajectories recorded in the course of this work.
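The imaging step and the interaction with periodic copies can be sketched as follows. For simplicity a cubic cell is assumed here, whereas the simulations in this work use an octahedral box, and the function names are hypothetical. The sketch is not the algorithm implemented in cpptraj [99].

```python
import math


def image_into_box(position, box_length):
    """Map a coordinate to its equivalent position inside [0, L) per axis."""
    return tuple(x % box_length for x in position)


def minimum_image_distance(p, q, box_length):
    """Distance between p and the closest periodic copy of q (cubic cell)."""
    squared = 0.0
    for x, y in zip(p, q):
        d = x - y
        d -= box_length * round(d / box_length)  # shift to nearest image
        squared += d * d
    return math.sqrt(squared)


# A particle that left a 10 Å cell along x and y is imaged back.
imaged = image_into_box((12.5, -1.0, 3.0), box_length=10.0)

# Two particles near opposite faces are close neighbours across the boundary.
distance = minimum_image_distance((0.5, 0.0, 0.0), (9.5, 0.0, 0.0), 10.0)
```

The second example shows why particles close to opposite faces of the cell interact strongly: their nearest periodic images are only 1 Å apart.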

In the following, the conditions applied within the MD simulations performed in this work are summarized: As stated above, all MD trajectory production runs were performed in NpT ensemble. Pressure was kept constant through constant pressure periodic boundary conditions with isotropic position scaling. An octahedral solvation box with the initial condition that the closest distance between any atoms of the solute and the box edges is 12 Å was used in all simulations performed in this work. The net charge of the periodic box was neutralized through application of uniform neutralizing background plasma. A van der Waals cutoff of 8 Å was used. The SHAKE algorithm [98] was used on hydrogen atoms, allowing for a time step of 2.0 fs. The simulation temperature was kept constant at 300 K, maintained by a Langevin thermostat [100].

For solvation of the system, the TIP3P model was used as explicit water model [101]. Before performing the MD production runs, the systems were thoroughly equilibrated using an extended protocol developed by Wallnöfer et al. [102]. Under the equilibration protocol of Wallnöfer et al. [102], the system is first minimized with harmonic constraints on the protein heavy atoms. Afterwards, the system is gradually heated to 300 K in NVT ensemble. An unrestrained density equilibration over a time of 1 ns is performed as last equilibration step. For a more extensive description of the equilibration protocol the reader is referred to Wallnöfer et al. [102]. In the production run, 500 million unrestrained sampling steps were performed, equivalent to a total simulation time of 1 μs for each system. 50,000 equally spaced snapshots were saved for subsequent analyses.

3.5 Metrics for Calculation of Subpocket Flexibility from MD Trajectories

Alignment Dependent Backbone and Side Chain Flexibility Metrics. To quantify flexibility from MD simulations, state-of-the-art trajectory analysis includes calculation of the residue-wise backbone B-factors based on a single alignment of each snapshot to a reference structure. As this standard B-factor relies on the simultaneous alignment of all Cα atoms in one step, this metric is referred to as backbone flexibility based on global Cα alignment (see Figure 14, A). Fuchs et al. [103] recently developed a metric for the calculation of local flexibility of the protease backbone and the side chains. This local measurement of backbone flexibility is created by performing multiple local alignments for each residue individually to the respective backbone atoms, including the hydrogen atoms (see Figure 14, B), for all snapshots of the trajectory. For side chain flexibility, multiple local alignments to the three heavy atoms in the peptide main chain (N, Cα, C in Amber12 naming) are performed for all trajectory snapshots, and the side chain flexibility is subsequently calculated as B-factor of the side chain atoms only (see Figure 14, C). Using this minimal set of three atoms required for alignment in three-dimensional space allows for an independent measurement of side chain flexibility as it minimizes the influence of backbone motions on side chain dynamics. Within this work, backbone flexibility based on global Cα alignment and side chain flexibility based on local alignment are investigated to determine flexibility on two different time scales, with side chain movements occurring at a faster time scale than backbone motions [103]. Both metrics were calculated for each protease residue with cpptraj [99] using 25,000 equally spaced snapshots. Using the subpocket definition created as described in Section 3.1, subpocket flexibility for backbone and side chain flexibility metrics is calculated as the average of the residue-wise B-factors of all residues present in the generic subpocket definition. Correlation of subpocket flexibilities with the subpocket cleavage entropy is calculated as Spearman correlation coefficient r using the statistics package R [104].
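The averaging and correlation steps can be sketched as follows. The sketch assumes that the snapshots are already aligned and that each residue is represented by a single position. It uses the standard relation B = (8π²/3)·⟨Δr²⟩ between B-factor and mean square fluctuation; whether this matches the exact cpptraj settings used in this work is an assumption. The rank correlation is a plain implementation without tie handling and merely stands in for the calculation done in R [104].

```python
import math


def b_factors(frames):
    """Residue-wise B-factors from aligned snapshots.

    frames: list of snapshots, each a list of (x, y, z) tuples per residue.
    """
    n_frames = len(frames)
    result = []
    for i in range(len(frames[0])):
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        # Mean square fluctuation around the average position.
        msf = sum(sum((f[i][k] - mean[k]) ** 2 for k in range(3))
                  for f in frames) / n_frames
        result.append(8.0 * math.pi ** 2 / 3.0 * msf)
    return result


def subpocket_flexibility(residue_b_factors, residue_indices):
    """Average B-factor over the residues assigned to one subpocket."""
    return sum(residue_b_factors[i] for i in residue_indices) / len(residue_indices)


def spearman(x, y):
    """Spearman rank correlation, valid when there are no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Two residues: the first is rigid, the second oscillates by +/- 1 Å along x.
frames = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
          [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]]
bf = b_factors(frames)
pocket = subpocket_flexibility(bf, [0, 1])
```

Here `spearman` would be applied to one list of subpocket flexibilities and one list of the corresponding subpocket cleavage entropies.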

Figure 14. Different strategies to calculate A) backbone flexibility after global alignment on Cα, B) local backbone flexibility requiring multiple residue-wise alignments and C) local side chain flexibility after alignment on backbone atoms (taken from Fuchs et al. [103]).

Alignment Independent Metric for Calculation of Backbone Flexibility. In addition to the previously described alignment dependent method for calculation of the backbone flexibility, subpocket flexibility can also be quantified as alignment-independent backbone dihedral entropies calculated from the distribution of dihedral angles φ and ψ (see Figure 11, Section 3.4 for an illustration of dihedral angles). The backbone dihedral angles φ and ψ define the local backbone conformation of a protein [105]. The distribution of these angles during the course of a molecular dynamics trajectory allows a state probability function to be derived and a local backbone entropy to be calculated subsequently. During the course of the simulation, backbone dihedral angles φ and ψ are recorded. Subsequently, non-parametric kernel density estimation is applied in order to obtain a state probability distribution function as demonstrated previously [106]. Data around -180°/180° is periodically duplicated to avoid a decrease of the kernel density estimate at the boundaries.

$$S_\alpha = -R \int p(\alpha) \ln p(\alpha)\, d\alpha$$ Equation 13

Equation 13 shows the integration leading to thermodynamic entropy based on conformational flexibility in the degree of freedom α. This metric gives an upper bound for total backbone entropy. It is only an upper bound, as correlation between the backbone angles is not accounted for because only one-dimensional probability density functions are obtained. Total thermodynamic entropy is lower, as correlation of individual degrees of freedom restricts the overall conformational space. Independent entropies for both dihedral angles φ and ψ were calculated as Sφ and Sψ. Ordered states correspond to low entropies. Therefore, a single dihedral peak with a width of 1° yields an entropy of zero, whereas disordered states yield positive values. 25,000 equally spaced snapshots were used for calculation of the dihedral entropies. Correlation of subpocket dihedral entropies with the subpocket cleavage entropy is calculated as Spearman correlation coefficient r using the statistics package R [104].
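A minimal numerical sketch of Equation 13 is given below. It uses a Gaussian kernel with a fixed bandwidth and a 1° integration grid, both of which are assumptions; the kernel and bandwidth selection of the method referenced as [106] are not reproduced. Angles are handled in degrees, so that a uniform peak of 1° width corresponds to an entropy of zero, as stated above.

```python
import math

R = 8.314462618  # gas constant in J/(mol K)


def dihedral_entropy(angles_deg, bandwidth=5.0, grid_step=1.0):
    """Estimate S = -R * integral p ln p for one dihedral angle (degrees)."""
    # Periodically duplicate the data so the density does not drop at +/-180.
    extended = [a + shift for a in angles_deg
                for shift in (-360.0, 0.0, 360.0)]
    norm = 1.0 / (len(angles_deg) * bandwidth * math.sqrt(2.0 * math.pi))
    entropy = 0.0
    x = -180.0 + grid_step / 2.0  # midpoint rule on [-180, 180)
    while x < 180.0:
        p = norm * sum(math.exp(-0.5 * ((x - a) / bandwidth) ** 2)
                       for a in extended)
        if p > 0.0:
            entropy -= R * p * math.log(p) * grid_step
        x += grid_step
    return entropy


# A rigid dihedral (all samples at 60 deg) versus a freely rotating one.
narrow = dihedral_entropy([60.0] * 100)
broad = dihedral_entropy([float(a) for a in range(-180, 180)])
```

For the uniformly sampled angle the estimate approaches R·ln(360), the value expected for a flat density over 360°, while the rigid dihedral yields a clearly lower entropy whose absolute value depends on the chosen bandwidth.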

3.6 Clustering of MD Trajectories

To obtain representative cluster structures for the most abundant conformations appearing during the course of a MD simulation, MD trajectories are clustered. During the clustering procedure, molecular structures are grouped together according to molecular similarity [107], which is measured by an appropriate metric, e.g. Cα or all-atom RMSD of the whole protein or just the binding site of the protein. A plethora of different clustering algorithms exist, which differ in the natures of generated clusters and the techniques and theories they are based on [108]. Roughly, clustering techniques can be classified into hierarchical and partitional clustering. In hierarchical clustering, data objects are grouped with a sequence of partitions, while in partitional clustering data objects are directly divided into a pre-specified number of clusters without the hierarchical structures [108]. For a general review on clustering algorithms and a comparison of different clustering algorithms applied for the clustering of MD trajectories, the reader is referred to [108] and [107], respectively.

The method of choice for clustering MD trajectories in this work is hierarchical-agglomerative clustering, which is the default clustering algorithm implemented in cpptraj [99]. Agglomerative clustering is a so-called bottom-up approach, as in the beginning each object forms a cluster and the formed clusters are subsequently combined to larger clusters [109]. The metric used for grouping molecular conformations into clusters is the all atom RMSD of the binding site residues. In the beginning, each conformation forms a cluster. Subsequently, the clusters which have the smallest distance in binding site all atom RMSD of cluster-centroids are combined to larger clusters until a pre-specified number k of clusters is reached. The number of clusters significantly impacts the result of the clustering procedure. Although it can be pre-specified, it has to be selected based on the nature of the dataset [108]. Within this work, the number of clusters was selected by performing several rounds of clustering, starting from a pre-defined number of ten clusters and subsequently lowering the number of clusters until the following criteria were satisfied: the relative occupancy of the least occupied cluster should be higher than 1 % and the clustering procedure should result in at least three clusters. Depending on the conformational variability of the selected protein, the requirement of at least three clusters might lead to an occupancy below 1 % for one or more of the three clusters. Representative cluster structures were selected as cluster centroids, i.e. the structures with the lowest all atom RMSD of the binding site to all other structures in the cluster. For each MD trajectory clustered in this work, 25,000 equally spaced snapshots of the MD trajectory were used in the clustering procedure using the default hierarchical agglomerative clustering algorithm implemented in cpptraj [99] as described above.
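The selection of the number of clusters can be sketched as shown below. The sketch is a naive stand-in and not the cpptraj implementation: conformations are represented by short feature vectors, and the Euclidean distance between cluster mean vectors replaces the binding site all atom RMSD of cluster centroids. All names and the artificial data are hypothetical.

```python
import math


def agglomerative(points, k):
    """Merge the two clusters with the closest mean vectors until k remain."""
    clusters = [[p] for p in points]  # every conformation starts alone
    while len(clusters) > k:
        means = [[sum(x) / len(c) for x in zip(*c)] for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(means[i], means[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def select_cluster_count(points, start=10, minimum=3, min_occupancy=0.01):
    """Lower k from `start` until the least occupied cluster exceeds
    `min_occupancy`, but never go below `minimum` clusters."""
    k = start
    while True:
        clusters = agglomerative(points, k)
        smallest = min(len(c) for c in clusters) / len(points)
        if smallest > min_occupancy or k == minimum:
            return k, clusters
        k -= 1


# Artificial data: three populated states and one rare outlier conformation.
data = ([[0.0 + 0.01 * i] for i in range(35)]
        + [[10.0 + 0.01 * i] for i in range(35)]
        + [[25.0 + 0.01 * i] for i in range(35)]
        + [[100.0]])
k, clusters = select_cluster_count(data)
sizes = sorted(len(c) for c in clusters)
```

In this artificial example the rare outlier keeps the least occupied cluster below 1 % for every tested k, so the procedure stops at the minimum of three clusters. This is the situation described above in which one of the three clusters has an occupancy below 1 %.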


3.7 Grid Inhomogeneous Solvation Theory (GIST)

Similar to Schauperl et al. [110], this section will give a brief introduction to the Grid Inhomogeneous Solvation Theory (GIST) developed by Nguyen et al. [111], [112] which allows for the determination of water thermodynamics in protease subpockets.

In its general form, the free energy of solvation ΔGsolv can be formulated as done in Equation 14:

$$\Delta G_{solv} = \int \Delta G_{solv}(\mathbf{q})\, p(\mathbf{q})\, d\mathbf{q}$$ Equation 14

In Equation 14, the solute is constrained to state q, with p(q) being the probability of finding the solute in this state. ΔGsolv is calculated as the integral over the free energy ΔGsolv(q) of the molecule. Under the GIST approach, the molecule is restrained to one conformation. For simplicity, in the following it is assumed that only one conformation exists. Extension of the theory to multiple conformations can simply be carried out using Equation 14. The impact of conformational variability of a protein can be investigated by performing GIST calculations for multiple conformations. ΔGsolv can be expressed as the difference between the solvation enthalpy ΔEsolv and the solvation entropy ΔSsolv multiplied by the temperature T:

$$\Delta G_{solv} = \Delta E_{solv} - T \Delta S_{solv}$$ Equation 15

The GIST approach uses a grid to discretize the analytical expressions used in inhomogeneous solvation theory, which approximates thermodynamic quantities by transformation of integrals over coordinates to integrals over distribution functions. In GIST, thermodynamic quantities are calculated at every grid point as discrete values from stored MD trajectory frames. The solvation enthalpy $\Delta E_{\mathrm{solv}}$ is expressed as the sum of the additional interaction between the solvent and the solute, $\Delta E_{SW}$, and the changes in enthalpic interactions between the solvent molecules due to the presence of a solute, $\Delta E_{WW}$:

$$\Delta E_{\mathrm{solv}} = \Delta E_{SW} + \Delta E_{WW}$$ Equation 16


Both terms of Equation 16 are calculated for each voxel of the grid. $\Delta E_{SW}$ expresses the average of the interaction of all water molecules with the solute during the simulation time, while $\Delta E_{WW}$ expresses the same for the interaction of all water molecules in a voxel with all water molecules of the simulated system.

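The solvation entropy $\Delta S_{\mathrm{solv}}$ is approximated by the translational entropy $\Delta S_{\mathrm{trans}}$ and the orientational entropy $\Delta S_{\mathrm{orient}}$:

$$\Delta S_{\mathrm{solv}} \approx \Delta S_{\mathrm{trans}} + \Delta S_{\mathrm{orient}}$$ Equation 17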

Equation 17 neglects water-water correlations as the calculation would slow down the convergence of the MD simulations. The calculation of the integrals is therefore only dependent on the coordinates of one water molecule.

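The water molecule distribution $g_{sw}(\mathbf{r}, \boldsymbol{\omega})$ depends on the position $\mathbf{r}$ of the water oxygen in three-dimensional space and the orientation $\boldsymbol{\omega}$ of the water molecule.

$$g_{sw}(\mathbf{r}, \boldsymbol{\omega}) = g_{sw}(\mathbf{r})\, g_{sw}(\boldsymbol{\omega}|\mathbf{r})$$ Equation 18

$g_{sw}(\mathbf{r})$ depends only on $\mathbf{r}$, and $g_{sw}(\boldsymbol{\omega}|\mathbf{r})$ solely on $\boldsymbol{\omega}$ under the constraint that $\mathbf{r}$ is fixed.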

With the help of Equation 18, the approximated entropy can be split into a translational and an orientational term. The translational part $\Delta S_{\mathrm{trans}}$ depends only on the position of the water molecule in space:

$$\Delta S_{\mathrm{trans}} = -k_{B}\,\rho \sum_{\mathbf{r}} g_{sw}(\mathbf{r}) \ln\left(g_{sw}(\mathbf{r})\right)$$ Equation 19
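A minimal sketch of how the sum in Equation 19 can be evaluated from voxel occupancies is given below. The function name, the input format and the numerical values in the example are assumptions made for illustration. In addition, the sketch multiplies the sum by the voxel volume, as is usual when discretizing the underlying integral. This factor is an assumption and does not appear explicitly in Equation 19.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def translational_entropy(counts, n_frames, voxel_volume, bulk_density):
    """Translational entropy in kcal/(mol K) relative to bulk water.

    counts       -- number of water oxygens observed in each voxel over all frames
    n_frames     -- number of analyzed trajectory frames
    voxel_volume -- voxel volume in cubic Angstrom (assumed discretization factor)
    bulk_density -- number density of bulk water in molecules per cubic Angstrom
    """
    expected = n_frames * bulk_density * voxel_volume  # count expected for bulk water
    total = 0.0
    for count in counts:
        g = count / expected  # g_sw(r): density relative to bulk
        if g > 0.0:           # g ln g tends to 0 for empty voxels
            total += g * math.log(g)
    return -K_B * bulk_density * voxel_volume * total
```

For a uniform distribution (g = 1 in every voxel) the function returns zero, which corresponds to the pure water reference state. Any deviation from uniformity with the same mean density yields a negative value, i.e. a loss of translational entropy.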

The reference state is pure water that shows a uniform distribution with the same density of water molecules in every voxel. This is characteristic for a system with low order and high entropy. The orientational part $\Delta S_{\mathrm{orient}}$ describes the orientation of the water molecule within a voxel:

$$\Delta S_{\mathrm{orient}} = \rho \sum_{\mathbf{r}} g_{sw}(\mathbf{r})\, S^{a}(\mathbf{r})$$ Equation 20

With:

$$S^{a}(\mathbf{r}) = K \int_{a} g_{sw}(\boldsymbol{\omega}|\mathbf{r}) \ln\left(g_{sw}(\boldsymbol{\omega}|\mathbf{r})\right) d\boldsymbol{\omega}$$ Equation 21

In Equation 21, K is a normalization factor. The underlying assumption is that the orientation of the water is independent of the water position in the voxel. Also for the orientational entropy a low value means low order and thus high entropy. In the reference state of pure water, water molecules do not show a preferred orientation, but are randomly oriented.

To calculate hydration thermodynamics from MD simulations for the serine protease test set, each standard MD trajectory generated as described in Section 3.3 was first clustered. Representative cluster conformations and cluster populations were determined as described in Section 3.6. For each representative cluster conformation, an MD simulation was performed in which the protein was restrained with harmonic restraints using a force constant of 100 kcal/(mol·Å²). 50 million restrained sampling steps were run, corresponding to a simulation time of 100 ns. 100,000 snapshots were extracted for subsequent analysis, as recommended by Nguyen et al. [111]. For each representative cluster structure, the GIST output for each subpocket was obtained by performing the analysis on a grid with an edge length of 6 Å and a spacing of 0.5 Å, centered on the center of mass of the respective subpocket residues. The results for the representative cluster structures were combined by calculating the weighted average of each value, using the cluster occupancy as weighting factor.
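The final combination step corresponds to a discrete version of Equation 14 and can be sketched as follows. The function name and the example numbers are hypothetical and serve only to illustrate the weighting.

```python
def weighted_average(values, occupancies):
    """Occupancy-weighted mean of one GIST quantity over representative cluster structures."""
    total = sum(occupancies)
    if total <= 0:
        raise ValueError("cluster occupancies must sum to a positive number")
    return sum(v * w for v, w in zip(values, occupancies)) / total


# Hypothetical subpocket values for three cluster representatives
# with relative occupancies of 50 %, 30 % and 20 %.
example = weighted_average([-1.0, -2.0, -4.0], [0.5, 0.3, 0.2])
```

Dividing by the sum of the occupancies makes the result independent of whether occupancies are given as fractions, percentages or absolute frame counts.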

3.8 Calculation of (Electrostatic) Molecular Interaction Fields ((e)MIFs)

Molecular interaction fields (MIFs) are calculated with the program GRID [113], [114]. In GRID, an equally spaced grid is placed over the binding site. A probe molecule is moved along the grid and the interactions of the probe molecule with the protein are calculated at each grid point. Depending on the selected probe molecule, the interactions accounted for are van der Waals interactions, hydrophobic interactions and Coulomb interactions.

When working with GRID on the serine and cysteine protease test sets, structures of the investigated test set were aligned with PyMOL [115] based on Cα atoms prior to MIF calculation. Regarding the selected probe molecules, in the first approach [116] the C3 probe, the O- probe and the N3+ probe, pre-specified within GRID, were used to calculate subpocket interaction potentials that can be correlated to the substrate specificity quantified as cleavage entropy [116]. The C3 probe mimics a methyl group and tests only for van der Waals interactions, i.e. for interactions with a hydrophobic substrate. The N3+ probe and the O- probe are charged probes. The N3+ probe mimics an sp3 cationic NH3 group and thus tests for van der Waals interactions, hydrogen bonding interactions and electrostatic interactions with its positive charge. The O- probe mimics an anionic phenolate oxygen atom, probing for van der Waals interactions, hydrogen bonding interactions and electrostatic interactions with its negative charge. The fact that the charged probes simultaneously test for electrostatic, van der Waals and hydrogen bonding interactions is their main drawback. Van der Waals interactions are highly dependent on the conformational state of the protein binding site; therefore, the choice of binding site conformation considerably influences the resulting MIFs. Even if conformational variability can be considered to a certain extent by choosing X-ray structures with different conformations and/or extracting conformations from MD trajectories, the whole conformational space of a protease is not accessible. As electrostatic interactions are long-range interactions and are less dependent on the conformational state of the binding site, focusing only on electrostatic interactions can be expected to lead to more conclusive results.

In a second approach, electrostatic GRID probes (a positive probe and a negative probe) were designed that test only for electrostatic interactions without simultaneously probing for van der Waals interactions. With the positive and the negative probes, electrostatic MIFs (eMIFs) can be calculated. eMIFs and eMIF overlap (see Section 3.10) were applied to describe electrostatic substrate readout and similarity (see Section 4.5).

3.9 Calculation of Subpocket Interaction Potentials

For correlation of the information contained in the MIFs with the cleavage entropy, subpocket interaction potentials are required [116]. For calculation of subpocket interaction potentials, first MIFs are calculated with GRID [113], [114] as described in Section 3.8 using a grid spacing of 0.25 Å. Postprocessing of the GRID output files was carried out with a C# script. A definition of the subpocket residues has to be available in order to assign grid points to subpockets using geometric constraints:


Grid points located between 3.5 and 6 Å from at least two subpocket residues and having an interaction potential <0 kcal/mol are selected in a first filtering step, in order to include only grid points positioned within the subpocket and to avoid unfavorable positions due to van der Waals clashes. In a second filtering step, only the 25% of points with the lowest interaction potentials are retained to represent the most favorable subpocket regions that binding partners would select.

Subpocket interaction potentials are calculated through averaging of the selected and filtered grid points. Through mapping of the selected points to the respective subpockets, areas with highly favorable interactions within the subpocket can be localized (see Figure 15).
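The two filtering steps and the final averaging can be sketched in Python as follows. This is not the C# script used in this work. The function name, the input format and the interpretation of the distance criterion as the minimum distance to any atom of a residue are assumptions made for illustration.

```python
import math


def subpocket_potential(grid_points, residues, d_min=3.5, d_max=6.0,
                        keep_fraction=0.25, min_residues=2):
    """Average interaction potential of the most favorable subpocket grid points.

    grid_points -- list of ((x, y, z), energy) tuples, energy in kcal/mol
    residues    -- list of subpocket residues, each a list of (x, y, z) atom positions
    """
    selected = []
    for xyz, energy in grid_points:
        if energy >= 0.0:  # first filter: favorable potentials only
            continue
        close = 0
        for atoms in residues:
            # assumed criterion: minimum distance to any atom of the residue
            d = min(math.dist(xyz, atom) for atom in atoms)
            if d_min <= d <= d_max:
                close += 1
        if close >= min_residues:  # first filter: proximity to the subpocket
            selected.append(energy)
    if not selected:
        return None
    selected.sort()  # most favorable (lowest) potentials first
    n_keep = max(1, int(len(selected) * keep_fraction))  # second filter: best 25 %
    best = selected[:n_keep]
    return sum(best) / len(best)
```

Returning `None` for an empty selection makes subpockets without any favorable grid point explicit instead of silently reporting a potential of zero.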


Figure 15. Selection of 25% of grid points with most favorable interaction potentials, exemplarily depicted for the S1 subpocket of fXa. In (A), white points indicate interaction potentials of 0 kcal/mol, representing redundant information. (B) shows the grid points after filtering and keeping only the 25% of points with the most favorable interaction potentials. Close to the illustrated Asp at the bottom of the S1 subpocket, a large area of favorable interactions with the N3+ probe is localized.

For calculation of weighted subpocket interaction potentials that incorporate the information on the conformational variability of a subpocket, representative conformations are first extracted from MD trajectories. Representative cluster conformations and cluster populations are obtained as described in Section 3.6.

For each representative cluster conformation, MIFs are calculated with GRID and subpocket interaction potentials are determined as described above. The weighted interaction potential is calculated as weighted mean from subpocket interaction potentials of representative cluster conformations using cluster populations from MD trajectories as weighting factors.

Correlations between X-ray subpocket interaction potentials or weighted interaction potentials with the subpocket cleavage entropy are calculated as Pearson correlation coefficient r using the statistics package R [104].
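For illustration, the Pearson correlation coefficient r can be written out explicitly. The correlations in this work were calculated with R. The following Python function is only a stand-in that shows the formula, and its name is an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)
```

Here `x` would hold the subpocket interaction potentials (X-ray or weighted) and `y` the corresponding subpocket cleavage entropies.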


In order to investigate the role of binding enthalpy in determining substrate specificity of serine proteases, subpocket interaction potentials were calculated for the three serine protease examples fXa, elastase-1 and granzyme B, as they are three serine proteases with chymotrypsin fold showing distinct specificities (see Section 4.4). The X-ray structures of fXa (PDB code 1C5M [79]), elastase-1 (PDB code 1QNJ [75]) and granzyme B (PDB code 1FQ3 [76]) were used as starting structures for molecular dynamics (MD) simulations and for calculation of interaction potential maps of X-ray structures. All three structures are free of a ligand and have a resolution <2 Å.

3.10 Calculation of Electrostatic Molecular Interaction Fields and Overlap

As described in Section 3.8, the use of GRID probes simultaneously testing for van der Waals and electrostatic interactions leads to problems due to the high shape dependence of van der Waals interactions. Therefore, a positive and a negative probe solely probing electrostatic interactions were developed for the determination of eMIFs. To explain differences in electrostatic substrate readout, eMIF similarity may be compared to electrostatic substrate readout similarity introduced in Section 3.3. As a measure for eMIF similarity between two eMIFs, the Gaussian overlap of the two eMIFs is calculated using the Gaussian product theorem [117]. Within this thesis, the eMIF overlap is calculated for the whole binding site to avoid the difficulties in detecting individual subpockets as described in Section 3.9 and to avoid the non-trivial task of assigning long-range electrostatic interactions to specific subpockets [118].

For calculation of the eMIF overlap, three molecular interaction fields (MIFs) were calculated on a grid for the entire binding interface of the proteases using the program GRID [113], [114]. For each probe, a grid spacing of 1 Å was used. For the first MIF, a hydrophobic H-probe was used in order to characterize the shape of the binding cleft. It was further restricted by a distance criterion (5 Å) to ligands (P4 to P4’) in aligned peptide complex structures, i.e. thrombin (PDB codes 1DE7 [119], 3LU9 [120], 4AYY [121]) and granzyme M (PDB code 2ZGH [77]). Only grid points that fulfilled the distance criterion and showed favorable interactions (<0 kcal/mol) with the H-probe were used in the further calculations of the eMIFs. In order to minimize van der Waals interactions and focus on electrostatic contributions alone, the eMIFs were calculated with user-defined GRID probes with charges of +1 and -1. Both the van der Waals radius of the probes and the cutoff for the van der Waals interactions were set to their smallest allowed input values of 0.01 Å and 3 Å, respectively, basically switching off the van der Waals interactions. Only points of the eMIFs that showed favorable interactions (<0 kcal/mol) were kept for further calculations.

The points of favorable interactions for the positive probe are points of unfavorable interactions for the negative probe and vice versa. Thus, the final eMIFs are represented by a grid whose points need to have a negative energy (favorable interaction) and furthermore have to be a subset of the previously selected grid points of the H-probe (proximity to the ligand, no overlap with the protease itself).

The proteases and their eMIFs were realigned slightly by overlaying the weighted centers of the grid points of the H-probe MIFs and aligning the first eigenvector [122] of the tensor of inertia of the H-probe MIFs, i.e. the one corresponding to the largest eigenvalue. Due to the high similarity of the binding clefts, this procedure resulted in an excellent alignment. Thus, the second and third eigenvectors were not used for further refinement, avoiding problems due to the near degeneracy of their eigenvalues. The resulting alignment was subsequently used for analyses with the electrostatic GRID probes.

The overlap of the eMIFs corresponding to the same charge was calculated by using spherical Gaussian functions with a σ of 2 Å centered at the grid points, according to Equation 22. Several options for the width of the Gaussian function were tested, ranging from 1 to 2 Å, but the resulting correlation with experimental substrate data was only minimally affected, which is in line with the long-range nature of electrostatic interactions.

Equation 22


In Equation 22, $N_{A,B}$ is the height of the spherical Gaussian function, $\sigma^{2}_{A,B}$ is the variance of the spherical Gaussian function and $\mathbf{r}_{A,B}$ is the center of the spherical Gaussian function, for A and B respectively.
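The overlap integral of two spherical Gaussian functions has a closed form that follows from the Gaussian product theorem. The following Python sketch uses this standard closed form together with the symbols defined above. How the pairwise overlaps are combined into the overlap of two complete eMIFs (here a plain double sum without energy weighting or normalization) is an assumption made for illustration, as are the function names.

```python
import math


def gaussian_overlap(r_a, r_b, sigma_a=2.0, sigma_b=2.0, n_a=1.0, n_b=1.0):
    """Overlap integral of two spherical Gaussians N * exp(-|r - r0|^2 / (2 sigma^2))."""
    var_sum = sigma_a ** 2 + sigma_b ** 2
    prefactor = n_a * n_b * (2.0 * math.pi * sigma_a ** 2 * sigma_b ** 2 / var_sum) ** 1.5
    d2 = sum((a - b) ** 2 for a, b in zip(r_a, r_b))  # squared distance of the centers
    return prefactor * math.exp(-d2 / (2.0 * var_sum))


def emif_overlap(points_a, points_b, sigma=2.0):
    """Assumed combination rule: sum of pairwise overlaps over all grid points."""
    return sum(gaussian_overlap(p, q, sigma, sigma)
               for p in points_a for q in points_b)
```

Because the overlap of two Gaussians of width σ decays with the distance between their centers on a length scale of 2σ, moderate changes of σ mainly rescale the result, which is consistent with the weak dependence on the Gaussian width reported above.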

3.11 Shape Based Virtual Screening

In virtual screening, computer-based methods are used to discover new peptide or small molecule ligands on the basis of biological structures [123]. Current virtual screening strategies to find new small molecule inhibitors are divided into two groups: ligand-based approaches and structure-based approaches. To apply a ligand-based approach, information on one or more ligands binding to the target is required. A set of known actives is used to discover structurally diverse compounds with similar bioactivity [124].

Structure-based methods require either an X-ray or NMR structure or a homology model of the target. In structure-based virtual screening, docking and scoring is the most popular method.

However, finding the correct binding conformation through a docking experiment remains a challenging task [125]. Even if flexible docking methods are applied, consideration of flexibility of protein and ligand is not easy to achieve [126]. Pharmacophore-based virtual screening is another form of structure-based virtual screening [127]. In pharmacophore-based virtual screening, functional groups are “stripped” to pharmacophores, which has the advantage that scaffold-hopping is possible [128].

Shape-based virtual screening with ROCS [129] is an alternative to docking and pharmacophore-based virtual screening [130]. ROCS allows for a combination of the chemical information and the information about the shape when screening for small molecule inhibitors.

Screening of the DUD [131] using a combination of shape and pharmacophore properties revealed a superior performance of ROCS compared to docking approaches [132].

With recently developed methods like PICS (proteomic identification of protease cleavage site specificity) [133] and TAILS (terminal amine isotopic labeling of substrates) [134] and the usage of proteome-derived substrate libraries [133], protease specificity profiles can be readily determined.

In PICS, the carboxy-peptide cleavage products of an oligopeptide library consisting of natural biological sequences derived from human proteomes are selectively isolated, and liquid chromatography-tandem mass spectrometry (LC-MS/MS) is used to identify the prime side sequences of the cleaved peptides. Non-prime side sequences are determined through automated database searches of the human proteome. PICS thus allows for simultaneous determination of prime and non-prime side sequences of cleaved peptides [133]. N-TAILS distinguishes between N-termini of proteins and N-termini of protease cleavage products. Dendritic polyglycerol aldehyde polymers are used to remove tryptic and C-terminal peptides. Tandem mass spectrometry is used to analyze unbound naturally acetylated, cyclized or labeled N-termini from proteins and their protease cleavage products [135]. C-TAILS complements N-TAILS and represents an isotope-encoded quantitative C-terminomics strategy to identify neo-C-terminal sequences and protease substrates [134]. With such efficient approaches for protease substrate profiling available, the amount of information on protease peptide substrates is steadily growing. In order to make use of this information, a virtual screening workflow based solely on protease peptide substrate sequences was developed within this work [136]. The workflow makes use of the information on protease peptide substrates present in the MEROPS database [13] to find new small molecule inhibitors. The types of possible interactions of the substrate peptides are the same as for small molecules. Therefore, it should be possible to find small molecules which form the same interactions with a protease as the corresponding peptide substrates. In developing a virtual screening workflow, transferring the information on peptide substrate specificity to small molecule specificity is a complex 3D problem. The relative position of the features of the amino acid side chains in the peptide substrates and the overall shape of the bound peptide substrates are of great importance. In addition, the relative frequencies of amino acids in peptide substrate sequences have to be considered. As a shape-based virtual screening method is most suited to addressing the problem and ROCS also offers the possibility to selectively weight pharmacophore features, shape-based virtual screening with ROCS is the method of choice.

In order to develop and test the method, four targets, thrombin, fXa, fVIIa and caspase-3 (casp-3), were selected according to substrate specificity profiles. In addition to showing different substrate specificities, the proteases also have different catalytic mechanisms. Thrombin, fXa and fVIIa are serine proteases, while casp-3 is a cysteine protease. Cleavage site sequence logos for all four targets are given in Figure 16. The developed workflow is schematically depicted in Figure 17.

Figure 16. Validation test set for peptide substrate-based virtual screening workflow. Thrombin (A), fXa (B), fVIIa (C) and casp-3 (D) cleavage site sequence logos, created with WebLogo [100], based on 168 substrates for thrombin, 59 substrates for fXa, 9 substrates for fVIIa and 651 substrates for casp-3. Thrombin, fXa and fVIIa are serine proteases, casp-3 is a cysteine protease. Members of the validation test set show varying substrate preferences.


Figure 17. Workflow for shape-based virtual screening with vROCS for thrombin [136]. First, a suitable template peptide substrate structure has to be extracted from the X-ray structure of a peptide substrate complex or manually generated. Only the amino acid positions P3-P1 are retained. With a MOE residue scan, each position P3-P1 is mutated into each of the 20 natural amino acids, leading to a mutational space of 60. In vROCS, first the single amino acid queries are created without considering the backbone. In a second step, the final query is created and the single amino acid queries for each position are combined according to the relative frequencies with which the amino acids are found in the protease substrates. The final query is used to perform a virtual screening with vROCS.

For the preparation of substrate sequences, sequences were downloaded from the MEROPS database [22], considering substrate positions P3-P1 in a first step, as most known inhibitors for the investigated proteases bind to the corresponding protease subpockets. For casp-3, tetrapeptides ranging from P4-P1 were also explored. Unique tri- or tetrapeptide sequences were downloaded from the MEROPS. As the MEROPS provides only substrate sequences but no information on substrate conformations, 2D sequences were converted to 3D conformations by making use of the fact that proteases universally recognize beta-strands in their active site [137]. To obtain peptides in beta-strand conformation, a mutation strategy based on a known X-ray structure of a protease-substrate complex downloaded from the PDB [12] was used. For fVIIa and fXa no suitable complex structures could be found; therefore, the same template (PDB code 1FPH [81]) was used for the three serine proteases fVIIa, fXa and thrombin.

For casp-3 a template protease-substrate structure was available (PDB code 2DKO [138]). The MOE software [71] was used for the preparation of substrate conformations. Only the template peptide-substrate positions P3-P1 or P4-P1 were considered. The mutation of the selected substrate positions was carried out using the residue scan functionality within the MOE software, which allows single point or multiple mutations to be performed within a peptide sequence. Mutating each peptide position into each of the 20 amino acids independently of the other positions leads to a mutational space of 60 in the case of a tripeptide. Using the peptide substrate sequence lists, individual amino acids present in the peptide substrates listed in the MEROPS were extracted from the 60 mutated sequences generated with the MOE residue scan for each position P3-P1 or P4-P1. The single amino acids for each substrate position were written to individual pdb-files.

The DUD-E [139] database was used as validation database for all of the four test cases.

Database preparation was performed with MOE. Duplicate entries were removed and both actives and decoys of all data sets were subjected to the MOE wash procedure to disconnect simple metal salts drawn in covalent notation, remove counter ions and solvent molecules, add or remove explicit hydrogen atoms and rebalance protonation states.

For each active and decoy, 25 conformations were created with OMEGA [140], [141]. The actives database for casp-3 required special attention, as several entries contained not the bioactive form but the prodrug form of the molecule. Prior to conformer generation, lactone-prodrug structures in the dataset were manually hydrolyzed in MOE. Potentially covalently bound molecules were kept, as the interactions directing the ligand into the subpockets should still be found.

In order to create the query for shape-based virtual screening, first each individual amino acid was loaded into vROCS and the backbone features were disabled after automatic assignment of pharmacophore features as illustrated in Figure 18.

Figure 18. Single amino acid query for example Arg. The backbone features are excluded to avoid overweighting of the backbone functional groups in the final query. Pharmacophores for the guanidine functionality are a positively charged moiety surrounded by three hydrogen bond donors.

For alanine, a hydrophobic feature had to be added, as this was not automatically done by vROCS. For each amino acid a separate single amino acid query was exported.

To create a query correctly representing the relative frequencies of amino acid side chains in the preferred substrates of the corresponding protease, first the relative frequencies were calculated according to the following scheme: Absolute frequencies were normalized according to the number of unique peptide substrate sequences and the natural occurrence of amino acids. In the same way as for the calculation of the cleavage entropy (see Section 3.2), the normalization by the natural occurrence of the amino acids is needed to remove the bias in the experimental results [142] of the MEROPS peptide substrate sequences. As vROCS does not allow the number of times a feature should appear in the final query to be set directly, each individual amino acid query has to be manually loaded according to its relative frequency in the protease peptide substrates. Since vROCS does not handle a large number of different amino acid queries loaded in large numbers, frequencies were further normalized in such a way that the most frequently occurring amino acid in the substrates has a frequency of 20.


Table 1. Calculation of relative amino acid frequencies for thrombin (a).

| AA | Nat. Occ. | Abs. Freq. P3 | Abs. Freq. P2 | Abs. Freq. P1 | Rel. Freq. P3 | Rel. Freq. P2 | Rel. Freq. P1 |
|---|---|---|---|---|---|---|---|
| VAL | 0.066 | 11 | 16 | 0 | 1 | 2 | 0 |
| ASN | 0.044 | 8 | 5 | 0 | 2 | 1 | 0 |
| GLY | 0.072 | 13 | 13 | 0 | 2 | 2 | 0 |
| ILE | 0.052 | 14 | 7 | 0 | 2 | 1 | 0 |
| LEU | 0.09 | 6 | 18 | 1 | 1 | 2 | 0 |
| LYS | 0.057 | 10 | 1 | 21 | 1 | 0 | 3 |
| MET | 0.024 | 3 | 1 | 0 | 1 | 0 | 0 |
| PHE | 0.039 | 5 | 4 | 0 | 1 | 1 | 0 |
| PRO | 0.051 | 3 | 62 | 0 | 0 | 10 | 0 |
| SER | 0.069 | 14 | 3 | 0 | 2 | 0 | 0 |
| THR | 0.058 | 13 | 4 | 0 | 2 | 1 | 0 |
| GLN | 0.04 | 8 | 6 | 0 | 2 | 1 | 0 |
| ARG | 0.057 | 11 | 0 | 135 | 2 | 0 | 20 |
| HIS | 0.022 | 3 | 0 | 1 | 1 | 0 | 0 |
| TRP | 0.013 | 2 | 0 | 0 | 1 | 0 | 0 |
| TYR | 0.032 | 4 | 0 | 0 | 1 | 0 | 0 |
| ASP | 0.053 | 8 | 0 | 0 | 1 | 0 | 0 |
| CYS | 0.017 | 6 | 0 | 0 | 3 | 0 | 0 |
| ALA | 0.083 | 14 | 19 | 1 | 1 | 2 | 0 |
| GLU | 0.062 | 3 | 0 | 0 | 0 | 0 | 0 |

(a) 159 thrombin substrates are recorded in the MEROPS database. Absolute frequencies were firstly normalized by the natural occurrence of amino acids and secondly normalized in such a way that the most frequently occurring amino acid results in a relative frequency of 20. For thrombin, this is Arg in P1 position.
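The normalization scheme can be sketched in Python as follows. The function name and data layout are hypothetical. Two details are inferred from the values in Table 1 rather than stated explicitly: the scaling uses the single highest weighted frequency over all positions, and results are rounded to the nearest integer. Dividing by the number of unique substrate sequences is omitted, as a constant factor cancels in the final rescaling.

```python
def relative_frequencies(abs_freq, nat_occ, top=20):
    """Scale absolute amino acid frequencies to integer query multiplicities.

    abs_freq -- {position: {amino acid: absolute frequency}}
    nat_occ  -- {amino acid: natural occurrence}
    top      -- relative frequency assigned to the most frequent amino acid
    """
    # Step 1: remove the bias caused by the natural occurrence of amino acids.
    weighted = {pos: {aa: n / nat_occ[aa] for aa, n in counts.items()}
                for pos, counts in abs_freq.items()}
    # Step 2: rescale so that the highest value over all positions becomes `top`.
    highest = max(v for counts in weighted.values() for v in counts.values())
    return {pos: {aa: round(top * v / highest) for aa, v in counts.items()}
            for pos, counts in weighted.items()}


# Subset of the thrombin data from Table 1.
nat_occ = {"ARG": 0.057, "LYS": 0.057, "PRO": 0.051, "CYS": 0.017}
abs_freq = {
    "P3": {"ARG": 11, "LYS": 10, "PRO": 3, "CYS": 6},
    "P2": {"ARG": 0, "LYS": 1, "PRO": 62, "CYS": 0},
    "P1": {"ARG": 135, "LYS": 21, "PRO": 0, "CYS": 0},
}
rel_freq = relative_frequencies(abs_freq, nat_occ)
```

The example illustrates the effect of the first normalization step: Cys occurs only six times in P3, but because of its low natural occurrence it receives a higher relative frequency than Arg, which occurs eleven times at the same position.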


In order to build the final query, each single amino acid query was loaded into vROCS according to the frequency table obtained. The query created for thrombin based on the relative frequencies given in Table 1 is illustrated in Figure 19.

Figure 19. Thrombin final vROCS query based on peptide substrate sequences P3-P1. The most frequently occurring amino acid functionalities are the functionalities of Arg in P1 position, which have a relative frequency of 20 (loaded 20x). All other functionalities are weighted in relation to the most frequently occurring amino acid functionalities.

The query was then used in a ROCS validation run using the previously prepared actives and decoys dataset. Of the 25 conformations for each active and decoy, only the highest ranked conformation was retained. Enrichment factors at X%, $EF_{X\%}$, were calculated according to the following metric [143]:

$$EF_{X\%} = \frac{Actives_{\mathrm{Sampled}}}{N_{\mathrm{Sampled}}} \cdot \frac{N_{\mathrm{total}}}{Actives_{\mathrm{total}}}$$ Equation 23

$Actives_{\mathrm{Sampled}}$ is the number of actives found at X% of the screened database, $N_{\mathrm{Sampled}}$ is the number of compounds at X% of the database, $N_{\mathrm{total}}$ is the number of compounds in the database and $Actives_{\mathrm{total}}$ is the number of actives in the database.
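As a worked example of Equation 23, the enrichment factor can be computed from a ranked list of active/decoy labels. The function name, the rounding of the sampled fraction to a whole number of compounds and the example numbers are assumptions made for illustration.

```python
def enrichment_factor(ranked_is_active, fraction):
    """Enrichment factor at the given fraction (e.g. 0.01 for EF1%) of a ranked database.

    ranked_is_active -- list of booleans, best scored compound first, True for actives
    """
    n_total = len(ranked_is_active)
    actives_total = sum(ranked_is_active)
    if actives_total == 0:
        raise ValueError("the database contains no actives")
    n_sampled = max(1, round(n_total * fraction))  # number of compounds at X%
    actives_sampled = sum(ranked_is_active[:n_sampled])
    return (actives_sampled / n_sampled) * (n_total / actives_total)


# Hypothetical screen: 100 compounds, 10 actives, 5 of them ranked in the top 10.
ranking = [True] * 5 + [False] * 5 + [False] * 85 + [True] * 5
ef_10 = enrichment_factor(ranking, 0.10)
```

In this example half of the top 10 % are actives, compared to 10 % actives in the whole database, which gives an enrichment factor of 5. A random ranking is expected to give an enrichment factor of 1.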

Query preparation for other targets follows the same procedure as described for the four test cases. Depending on the substrate specificity profiles of the respective targets, different substrate positions may be explored. To find new small molecule inhibitors out of a prescribed database, a simple ROCS run is performed and the molecules present in the database are ranked according to the desired metric. Selection of molecules for biological testing and optimization then occurs through predefined criteria, such as logP, molecular weight and potential toxicity.

For the cysteine protease cruzain, which is an important target in fighting Chagas disease, the developed substrate and shape-based virtual screening procedure was applied to find potential new small molecule inhibitors. The template peptide for the mutation strategy described was taken from a PDB structure of cathepsin B (PDB code 3K9M [144]). Substrate positions S2-S1’, S3-S1 and S3-S1’ were used to cover different sizes of potential small molecule inhibitors. The ZINC leads now database [145] was used, keeping only the molecules with a Tanimoto cutoff of 0.80 to avoid redundancy within the screened database. Queries based on the relative amino acid frequencies at the respective positions as described in previous paragraphs were generated for all three substrate position ranges, and potential small molecule inhibitors for biological testing were selected based on potential toxicity, xlogP and number of rotatable bonds, combined with chemical experience.


4. Investigation of Chymotrypsin Fold Serine Protease Model Systems

The main goal of this PhD thesis is to explain protease specificity based on binding site characteristics. A test set of nine serine proteases with chymotrypsin fold, selected based on the availability of apo crystal structures, was chosen to form the basis of all subsequent analyses. The test set is presented in Section 4.1. At the beginning of this work, protease specificity was quantified as cleavage entropy [41], indicating how many different amino acids a protease accepts in a specific subpocket. The cleavage entropy forms the basis for the correlation of subpocket characteristics to subpocket specificity. The subpocket characteristics of initial interest were subpocket flexibility on different time scales (see Section 3.5), water ordering, solute-water and water-water interactions in subpockets calculated by Grid Inhomogeneous Solvation Theory (GIST) (see Section 3.7) and subpocket interaction potentials calculated from molecular interaction fields (MIF) determined with GRID (see Sections 3.8 and 3.9).

At a later stage, the cleavage entropy metric was adapted to capture the differences in charge of the amino acids accepted in a subpocket (see Section 3.2). Similarities of electrostatic MIFs (eMIFs) (see Section 3.10) were compared to electrostatic substrate readout similarity. The knowledge on protease peptide substrates was applied to develop a shape-based virtual screening method (see Section 3.11) using the peptide substrate sequences present in the MEROPS [13] as a basis.

The correlation of subpocket flexibility, calculated by metrics capturing flexibility at different time scales, with subpocket specificity is the subject of Section 4.2.

A review of GIST results with respect to subpocket specificity is given in Section 4.3.

Quantitative correlation of conformational binding enthalpy determined with GRID is presented in Section 4.4. Explanation of electrostatic substrate similarities through similarities in the respective eMIFs is the main argument in Section 4.5.


4.1 Protease Test Set

In order to investigate the molecular determinants of the specificity of serine proteases, a test set of nine serine proteases with chymotrypsin fold was chosen. The criterion for the selection of the test set was the availability of an apo crystal structure with a resolution better than 2 Å.

Trypsin, fVIIa, fXa, thrombin, kallikrein-1, chymotrypsin, elastase-1, granzyme M and granzyme B are the members of the protease test set. Figure 20 shows the substrate specificity, quantified as cleavage entropy, mapped to the binding site of each member of the test set, together with cleavage site sequence logos for all nine proteases. In Figure 20, proteases are ordered according to their S1 specificity, ranging from a preference for positively charged to a preference for negatively charged amino acids. Trypsin represents an unspecific protease, showing specificity for Lys and Arg only in the S1. Trypsin is part of the digestive system and acceptance of a wide range of substrates is thus desirable [146]. For fVIIa, substrates listed in the MEROPS database always have an Arg in P1. In the S2, fVIIa accepts polar amino acids such as Thr and Gln and small aliphatic amino acids such as Ile, Gly and Val. In the S1’-S4’, mainly aliphatic amino acids are preferred, with a negatively charged Glu accepted in S2’ and a negatively charged Asp in S4’. In subpockets S4-S2 mainly aliphatic and polar amino acids are found. As the MEROPS database lists only nine substrates for fVIIa, the information might be biased. FXa prefers Arg in the S1 as well, also showing specificity for Lys. In the S2 fXa prefers small non-polar amino acids such as Gly and Pro, while the S3 is rather unspecific. In the S4 non-polar amino acids such as Ile, Ala and Phe are preferred [147]. At the prime site in the S1' mostly polar amino acids such as Ser and Thr are found. The S2'-S4' all prefer non-polar amino acids, with the S2' showing a preference for Val according to MEROPS data. Thrombin prefers Arg over Lys in the S1 as well. In the S2 it shows a high preference for small apolar amino acids, preferring Pro over Gly, Ala and Leu. In the S1’ it also prefers small apolar and polar amino acids with the highest preference for Ser, followed by Pro and Gly. Kallikrein-1 likewise prefers Arg in the S1, but also accepts Tyr. Chymotrypsin is another example of an unspecific protease, showing preference for Tyr, Phe and Leu only in the S1. Similar to trypsin, chymotrypsin is also part of the digestive system.

The elastase-1 investigated here preferentially cleaves C-terminal to amino acids with small alkyl side chains such as Ile, Val and Ala [148]. Elastases can destroy connective tissue proteins and may thus be very destructive if they are not regulated. They are therefore controlled either by compartmentalization or naturally circulating plasma protease inhibitors [149].

Granzyme M is expressed in human lymphocytes [150] and plays a role in immunity to infection [151] and as mediator of cell death [152]. It shows specificity for Leu and Met in the S1 and minor specificity in other subpockets.

Granzyme B plays a key role in cytotoxic T lymphocyte mediated apoptosis [153] and also shows antiviral and antitumor functions [154]. Granzyme B is unique among mammalian serine proteases and strictly requires an aspartic acid in P1 position of substrates similar to caspases [155]. Additionally, Granzyme B requires extended substrate interactions with preferences for Ile and Val at P4, Glu, Met or Gln at P3, broad preference at P2, an uncharged residue at P1' and Gly or Ala at P2' [156].

Figure 20. Protease test set comprising nine serine proteases with chymotrypsin fold. Ordered according to specificity ranging from their preference for positively to negatively charged amino acids in the S1, trypsin (A) with 13770, fVIIa (B) with 9, fXa (C) with 59, thrombin (D) with 168, kallikrein-1 (E) with 31, chymotrypsin (F) with 1057, elastase-1 (G) with 51, granzyme M (H) with 1363 and granzyme B (I) with 1673 substrates listed in the MEROPS database were selected as test set. Criteria were the availability of an apo crystal structure and a resolution higher than 2 Å. Specificity quantified as cleavage entropy is mapped to the binding site, with green indicating an unspecific, red a specific subpocket. Corresponding cleavage site sequence logos were created with WebLogo [157].

The basis for calculating subpocket characteristics that can be related to subpocket specificity is the generic pocket definition, generated as described in Section 3.1. The generic pocket definition for the nine protease examples is illustrated in Figure 21. Residues of the generic pocket definition for all examples are given in Appendix A2. Subpockets have the same number of residues in each protease and minimum overlap between subpockets to enable statistical analysis and comparison of subpocket characteristics. For the non-prime site more structural data on peptide ligands is available than for the prime site. The prime site subpocket definition is thus to be regarded as less reliable than the non-prime site subpocket definition, especially within subpockets S3’ and S4’.

Figure 21. Generic subpocket definition for serine protease test set with chymotrypsin fold. For trypsin (A), fVIIa (B), fXa (C), thrombin (D), kallikrein-1 (E), chymotrypsin (F), elastase-1 (G), granzyme M (H) and granzyme B (I) the subpockets all have the same number of residues and minimum overlap between subpockets. Subpocket definitions on the prime site are less reliable than subpocket definitions on the non-prime site, as less structural data on peptide substrates is available for the prime site.

4.2 Quantitative Correlation of Subpocket Flexibility with Subpocket Specificity of Serine Proteases

In order to identify the role that subpocket dynamics recorded at different time scales play in determining subpocket specificity, subpocket flexibility was calculated from MD trajectories using the metrics described in Section 3.5. Backbone flexibility was determined using both an alignment-dependent method (backbone B-factors based on Cα alignment) and an alignment-independent method (phi and psi dihedral entropies). Using the generic subpocket definition generated according to the procedure described in Section 3.1, subpocket averages were calculated for each metric and compared to the cleavage entropy.
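The subpocket averaging step can be sketched as follows. This is a minimal illustration only: the residue numbers, B-factor values and the function name `subpocket_averages` are hypothetical and do not reproduce the generic pocket definition of Section 3.1 or any value from this work.

```python
from statistics import mean


def subpocket_averages(per_residue, pocket_definition):
    """Average a per-residue metric over the residues assigned to each subpocket."""
    return {
        pocket: mean(per_residue[residue] for residue in residues)
        for pocket, residues in pocket_definition.items()
    }


# Hypothetical pocket definition (residue numbers) with equal subpocket sizes.
pocket_definition = {"S1": [189, 190, 216], "S2": [57, 99, 215]}
# Hypothetical per-residue backbone B-factors in Å^2.
b_factors = {57: 12.0, 99: 30.0, 189: 8.0, 190: 10.0, 215: 18.0, 216: 12.0}

averages = subpocket_averages(b_factors, pocket_definition)
```

The same helper could be applied unchanged to any other per-residue metric, such as dihedral entropies or side chain flexibilities, since only the dictionary of per-residue values changes.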

Figure 22 shows the subpocket flexibilities for the different metrics mapped to the binding sites of the serine protease test set and compared to the respective specificity quantified as cleavage entropy. The alignment dependent and independent metrics for backbone flexibility show high correlation. Both metrics identify the S1 as the most rigid subpocket in most cases. The peripheral subpockets S4 and S2’-S4’ are identified as being more flexible in most cases. Also in terms of side chain flexibility, the S1 subpocket is identified as the most rigid subpocket in most examples.

Figure 22. Global and local flexibility of the serine protease test set. Backbone flexibilities based on global Cα alignment and calculated as alignment independent dihedral entropies phi and psi and side chain flexibilities based on local alignment are mapped to serine protease subpockets and compared to the cleavage entropy. Backbone flexibility based on global Cα alignment is shown for a range of 0-50 Å², while dihedral entropies are shown for a range of -18—13 kcal/mol. Side chain flexibility is mapped to subpockets using a range of 0-100 Å². Both the alignment dependent and independent metrics for backbone flexibility identify the region around the S1 subpocket as the most rigid pocket. The side chain flexibility is also lowest in the region of the S1 subpocket.

The Spearman correlation between subpocket flexibility at different time scales and subpocket specificity quantified as cleavage entropy is given in Table 2. The Spearman correlation was used since the correlation between flexibility and specificity is not expected to be of linear form.

The correlation between flexibility and specificity for the applied metrics was investigated both for all subpockets S4-S4’ and for the non-prime site S4-S1 alone, as the generic pocket definition applied is more reliable on the non-prime site than on the prime site of serine proteases due to insufficient available structural data on the prime site (see Section 3.1). The best correlation in terms of backbone and side chain flexibility is found for kallikrein-1. Also for thrombin, correlation with backbone flexibility (backbone flexibility based on global Cα alignment and psi entropy) is high. This is in line with the findings of Fuchs et al. [60], who found that substrate recognition in thrombin is driven by active site dynamics. FXa shows mild correlation of backbone and side chain flexibility with specificity. For both granzymes M and B as well as trypsin and fVIIa, no correlation between flexibility and specificity can be derived from Table 2. Chymotrypsin shows mild correlation between specificity and backbone flexibility, and elastase-1 only shows good correlation with phi and psi entropy if substrate positions S4-S1 are used. In general, the investigation of only non-prime site subpockets improves correlation. However, four data pairs are too small a dataset for a Spearman rank correlation to yield conclusive results. The results suggest that substrate recognition depends on subpocket flexibility to a higher extent for thrombin, kallikrein-1 and fXa, whereas flexibility plays only a minor role in determining substrate specificity for the other examples. For trypsin and chymotrypsin it is known that substrate specificity is influenced by the dynamics of binding site adjacent loops [47]. The correlation analysis presented here considers only binding site subpockets, and the influence of dynamics adjacent to the binding site cannot be determined when using subpocket average values for flexibility. Several publications have pointed out the allosteric regulation in serine proteases [158], [25, 159]. Structural changes resulting from allosteric modification usually occur on a larger time scale than the simulation time in MD simulations. In addition, MD simulations give access to an ensemble of conformations, but cannot sample the whole conformational space of a protein. Results of MD simulations are highly dependent on the starting structure, and thus the good correlation for thrombin and kallikrein-1 may simply be due to a fortunate choice of starting structure. Longer simulation times and the use of enhanced sampling techniques, such as aMD [160] or REMD [161], may lead to more conclusive results for the other protease examples. For fXa, water thermodynamics were found to play a predominant role in substrate binding [162]. The results obtained for the backbone and side chain flexibilities for the protease test set thus have to be evaluated carefully in the context of other contributing factors, such as the mentioned hydration thermodynamics (see Section 4.3) and (conformational) binding enthalpy (see Section 4.4).
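The limitation of a Spearman rank correlation on four data pairs can be made concrete by enumerating all possible rankings. The sketch below assumes untied values; the helper `spearman_no_ties` is written for this illustration and is not the implementation used in this work.

```python
from itertools import permutations


def spearman_no_ties(x, y):
    """Spearman rank correlation for two samples without tied values."""
    n = len(x)
    rank_x = {value: rank for rank, value in enumerate(sorted(x), start=1)}
    rank_y = {value: rank for rank, value in enumerate(sorted(y), start=1)}
    d_squared = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Enumerate every possible ranking of four subpockets (S4, S3, S2, S1).
reference = (1, 2, 3, 4)
all_rho = [round(spearman_no_ties(reference, p), 1) for p in permutations(reference)]

# Distinct coefficients that can occur at all for four untied data pairs.
possible_values = sorted(set(all_rho))
# Fraction of random rankings that reach rho >= 0.8 purely by chance.
chance_of_0_8_or_more = sum(rho >= 0.8 for rho in all_rho) / len(all_rho)
```

With four untied pairs the coefficient can only take a small number of discrete values, and a sizeable fraction of random rankings already reaches 0.8, which is why the S4-S1 columns of Table 2 should be read with caution.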

Table 2. Correlation between subpocket flexibility at different time scales and subpocket specificity quantified as cleavage entropy^a.

Protease              Backbone Flexibility   Phi Entropy    Psi Entropy    Side Chain Flexibility
(S4-S4’/S4-S1)        / Å2                   / kcal/mol     / kcal/mol     / Å2
Trypsin               -0.36/0.2              -0.33/-0.8     0.13/0.4       -0.49/-0.2
FVIIa                 -0.24/0.0              -0.10/0.0      0.24/0.6       -0.24/0.0
FXa                   0.64/0.8               0.55/1.0       0.10/1.0       0.43/0.8
Thrombin              0.81/0.8               0.38/0.8       0.79/1.0       0.19/-0.2
Kallikrein-1          0.83/0.8               0.67/0.8       0.79/1.0       0.86/0.8
Chymotrypsin          0.39/0.21              0.82/0.95      0.54/0.95      0.40/0.11
Elastase-1            -0.57/0.0              0.36/0.8       -0.12/0.8      -0.24/0.4
Granzyme M            0.18/0.2               0.19/0.4       0.24/0.4       0.18/0.2
Granzyme B            -0.88/-0.6             -0.17/0.8      -0.57/0.4      -0.19/-0.4

^a Thrombin shows high positive correlation with substrate specificity for the backbone flexibility. Also kallikrein-1 and fXa show good correlation of backbone flexibility with substrate specificity.

4.3 Quantitative Correlation of GIST Results with Subpocket Specificity of Serine Proteases

In addition to protein dynamics, hydration thermodynamics are also known to play an important role in the binding of substrates and small molecules to their respective targets [163], [162]. Within this work, hydration thermodynamics were calculated using Grid Inhomogeneous Solvation Theory (GIST). For a brief introduction to GIST, please see Section 3.7. For each subpocket, the solute-water (Esw) and water-water (Eww) energies as well as the entropic measures of translational (dTS trans) and orientational entropy (dTS orient) are calculated as weighted averages of the results for each representative conformation extracted from MD trajectories (see Section 3.7). Cα RMSD values for each representative cluster structure are given in Appendix A3. Table 3 shows the Pearson correlation of GIST output to specificity quantified as cleavage entropy. It is striking that all proteases accepting positively charged amino acids in the S1 pocket show high correlation to either translational or orientational entropy. For the enthalpic measures of solute-water and water-water energy, no trend can be observed. The thermodynamics of water for fXa were already studied extensively by Nguyen et al. [162], who applied ligand scoring functions based on GIST. Interestingly, the displacement of energetically unfavorable water molecules was found to be the dominant factor in the fitted scoring functions, while water entropy played only a minor role. This contradicts the results found in this work when performing GIST calculations on apo structures, where water entropy seems to play a more dominant role than water enthalpy in substrate recognition, at least for the examples trypsin to chymotrypsin in Table 3.

Table 3. Quantitative correlation of GIST results with subpocket specificity quantified as cleavage entropy^a.

Protease        ESW           EWW           dTS Orient      dTS Trans
                / kcal/mol    / kcal/mol    / kcal/mol K    / kcal/mol K
Trypsin         0.19          -0.07         0.49            0.43
FVIIa           0.44          -0.19         0.75            0.51
FXa             0.34          0.25          0.59            0.83
Thrombin        -0.24         0.37          0.36            0.41
Kallikrein-1    0.26          -0.26         0.30            0.50
Chymotrypsin    -0.07         -0.05         0.31            0.64
Elastase-1      0.15          0.04          0.16            0.34
Granzyme M      0.13          -0.08         0.01            -0.41
Granzyme B      0.33          -0.26         0.26            0.08

^a Trypsin, fVIIa, fXa, thrombin, kallikrein-1 and chymotrypsin all have a negatively charged S1 subpocket and show high correlation with dTS orient and dTS trans. Elastase-1 shows mild correlation with dTS orient and dTS trans. Granzyme B has a positively charged S1 subpocket and shows correlation with dTS orient. Granzyme M shows negative correlation with dTS trans. Apart from granzyme M, all examples show negative correlation with the subpocket water population. Correlations with ESW and EWW are inconclusive.

Figure 23 shows the GIST results mapped to the binding sites of the protease test set and compared to the cleavage entropy and the backbone flexibility based on Cα alignment, which are also mapped to the binding site. The water entropy and backbone flexibility show correlating results. Apart from granzyme M, water entropy is lowest in and around the S1 subpocket, while it is higher in peripheral subpockets. This is in line with the results for the backbone flexibility (see Section 4.2 for a detailed description of subpocket flexibility on different time scales). The results are intuitive, as strong movements of the protein backbone and side chains may cause movements and exchanges of water molecules and thus disturb the ordering of water molecules. The results point towards a combination of water ordering and flexibility as important factors in substrate recognition. However, as only a few conformations can be considered due to the high computational cost of performing the restrained simulations for GIST analysis and the GIST analysis itself, further analyses are necessary to obtain more conclusive results. Another drawback is the use of the TIP3P model, which only gives an approximation of the charge distribution in water [101]. Improved and more conclusive results, especially for the water enthalpy, could be obtained by using a more sophisticated water model or by considering the polarizability of water molecules through polarizable force fields.

Figure 23. GIST results mapped to serine protease subpockets. Esw, Eww, dTS orient and dTS trans calculated with GIST compared to cleavage entropy and backbone flexibility based on global Cα alignment. A correlation between orientational and translational entropy with the substrate specificity quantified as cleavage entropy and backbone flexibility is visible.

4.4 Quantitative Correlation of (Conformational) Binding Enthalpy with Substrate Specificity of Serine Proteases

In order to investigate the role of binding enthalpy in determining substrate specificity of serine proteases, subpocket interaction potentials (see Section 3.9) were calculated for three serine protease examples and compared to the respective substrate specificities quantified as cleavage entropy [116]. FXa, elastase-1 and granzyme B were chosen as model systems for this approach, as they are three examples of serine proteases with chymotrypsin fold showing distinct specificity profiles. See Figure 24 for cleavage site sequence logos and the respective cleavage entropies, calculated with data from the MEROPS database [13].

Figure 24. MEROPS cleavage site sequence logo and subpocket cleavage entropies of fXa (A), elastase-1 (B) and granzyme B (C). Cleavage site sequence logos were generated with WebLogo [157].

First, subpocket interaction potentials were calculated for the X-ray structures of all three examples and compared to the subpocket-wise cleavage entropy as described in Section 3.9. In order to incorporate the information on the conformational variability of subpockets, MD trajectories were clustered (see Section 3.6) and weighted subpocket interaction potentials were used for comparison to the subpocket cleavage entropy. For a detailed discussion of the interaction potential maps obtained for X-ray structures with the N3+, O-, H2O and C3 probe, the reader is referred to [116]. All interaction potential maps for X-ray structures and representative cluster conformations are available online as Supporting Information to [116]. Within this work, the results presented in [116] for the correlation between (weighted) subpocket interaction potentials and cleavage entropy will be summarized and discussed in relation to more recent work.

Quantitative Correlation between Subpocket Interaction Potentials and Cleavage Entropy.

The comparison between the cleavage entropy and the interaction potentials calculated from X-ray structures is given in Figure 25A-C, the comparison of the cleavage entropy to the weighted interaction potentials calculated from MD simulations in Figure 25D-F.

Figure 25. Comparison of subpocket interaction potentials calculated for X-ray structures for fXa (A), elastase-1 (B) and granzyme B (C) and weighted subpocket interaction potentials calculated for representative cluster conformations from MD simulations for fXa (D), elastase-1 (E) and granzyme B (F).

The results of the correlation analysis are presented in Table 4.

Table 4. Correlation between cleavage entropy and subpocket interaction potentials^a.

Probe   Pockets   Values      FXa          Elastase     Granzyme B
N3+     S4-S4'    X-ray/MD    0.41/0.84    0.19/0.27    -0.07/0.15
N3+     S4-S1'    X-ray/MD    0.76/0.83    0.94/0.84    0.11/0.35
N3+     S4-S1     X-ray/MD    0.77/0.83    0.99/0.95    0.31/0.59
C3      S4-S4'    X-ray/MD    0.06/0.51    0.61/0.58    -0.21/0.11
C3      S4-S1'    X-ray/MD    0.37/0.53    0.83/0.90    -0.48/0.32
C3      S4-S1     X-ray/MD    0.31/0.49    0.84/0.90    -0.40/0.87
H2O     S4-S4'    X-ray/MD    0.29/0.79    0.40/0.60    -0.04/0.19
H2O     S4-S1'    X-ray/MD    0.57/0.83    0.95/0.84    0.10/0.40
H2O     S4-S1     X-ray/MD    0.56/0.82    0.98/0.95    0.40/0.78
O-      S4-S4'    X-ray/MD    0.37/0.46    0.78/0.84    -0.05/0.18
O-      S4-S1'    X-ray/MD    0.33/0.54    0.77/0.99    0.02/0.20
O-      S4-S1     X-ray/MD    0.42/0.52    0.76/1.00    0.24/0.38

^a Correlations are shown for X-ray structures and weighted average of subpocket interaction potentials using representative cluster structures obtained through MD simulations. The correlation coefficient r increases for each of the four GRID probes when using the weighted average of normalized interaction potentials of representative cluster conformations obtained through MD simulations and looking at all subpockets S4-S4', with elastase-1 when looking also at subpockets S4-S1' and S4-S1.

Factor Xa. For the X-ray structure of fXa, the linear correlation between the interaction potentials and the cleavage entropy over all subpockets S4-S4' (see Figure 25A) does not exceed r=0.41 for any probe (Table 4). In the S4' subpocket, peaks in the interaction potentials for the N3+ and the H2O probe are detected. If only subpockets S4-S1' or S4-S1 are considered, the correlation increases to r=0.76 and r=0.77, respectively, for the N3+ probe. For the other probes, the correlation also increases in most cases when the subpocket range is narrowed.
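The effect of narrowing the subpocket range on a linear correlation can be illustrated with a small sketch. All numbers below are hypothetical and chosen only to mimic a single deviating peripheral subpocket; they are not taken from Table 4 or Figure 25, and the helper names are introduced for this example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


subpockets = ["S4", "S3", "S2", "S1", "S1'", "S2'", "S3'", "S4'"]
# Hypothetical values, NOT taken from Table 4 or Figure 25.
cleavage_entropy = [0.9, 0.8, 0.6, 0.2, 0.7, 0.8, 0.9, 0.9]
# The last entry mimics an isolated deviation in the S4' subpocket.
potential = [0.8, 0.7, 0.5, 0.1, 0.6, 0.7, 0.8, 0.1]


def r_for_range(last_pocket):
    """Correlate only the subpockets from S4 up to and including last_pocket."""
    end = subpockets.index(last_pocket) + 1
    return pearson_r(cleavage_entropy[:end], potential[:end])


r_all = r_for_range("S4'")
r_to_s1_prime = r_for_range("S1'")
r_nonprime = r_for_range("S1")
```

In this constructed case a single deviating subpocket is enough to lower the coefficient over S4-S4' markedly, while the narrower ranges remain almost perfectly correlated.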

For the MD results the representative conformations are considered through weighting of subpocket interaction potentials for each representative conformation with the occupancies of the respective cluster. The N3+ probe almost perfectly follows the curve for the cleavage entropy (Figure 25D). The peak in the S4' disappears due to a better distribution of the areas with favorable interactions with the N3+ probe for the different representative conformations.
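The occupancy weighting described above amounts to a weighted mean over the representative conformations. The following is a minimal sketch with hypothetical potentials and cluster occupancies; the function name `occupancy_weighted_mean` is an assumption made for this illustration.

```python
def occupancy_weighted_mean(values, occupancies):
    """Weighted mean of per-conformation values, using cluster occupancies as weights."""
    if len(values) != len(occupancies):
        raise ValueError("one occupancy is required per representative conformation")
    total = sum(occupancies)
    if total <= 0:
        raise ValueError("occupancies must sum to a positive number")
    return sum(v * w for v, w in zip(values, occupancies)) / total


# Hypothetical interaction potentials of one subpocket for three cluster representatives.
potentials = [-4.0, -2.0, -1.0]
# Hypothetical cluster occupancies (fractions of the trajectory).
occupancies = [0.5, 0.3, 0.2]

weighted_potential = occupancy_weighted_mean(potentials, occupancies)
```

Dividing by the sum of the occupancies keeps the result well defined even if the selected clusters do not cover the whole trajectory.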

The highest correlation between subpocket interaction potentials and the cleavage entropy is r=0.84, found for the N3+ probe. The worst correlation is found for the O- probe with r=0.46. Considering only subpockets S4-S1' or S4-S1 leads to no more than a slight improvement for the H2O, C3 and O- probe and does not improve correlation values for the N3+ probe.

Elastase. When comparing the interaction potentials calculated from the X-ray structure to the cleavage entropy (Figure 25B), it is evident that the N3+ probe shows the lowest interaction potentials for subpockets S2-S4'. The interaction potentials for the N3+ and the H2O probe both spike in the S2' and S4' pocket. Both subpockets show strong hydrogen bonding interactions with the N3+ and H2O probe, but the number of grid points selected after the two filtering steps is smaller by a factor of almost 10 than in other subpockets. The values for these two subpockets are thus not directly comparable to the results for the other subpockets as the number of points is not considered by the subpocket interaction potentials.

The correlation between subpocket interaction potentials and cleavage entropy is highest for the C3 and the O- probe when looking at all subpockets S4-S4'. However, if only subpockets S4-S1' or subpockets S4-S1 are considered, the correlation between cleavage entropy and the interaction potentials for the N3+ and H2O probe increases to r≥0.94. The correlation for the C3 probe also increases, while the correlation between the O- probe and the cleavage entropy even shows a slight decrease if fewer subpockets are considered.

When looking at the weighted average of interaction potentials for representative cluster conformations, one sees that the peak in the S2' pocket disappears (see Figure 25E). This is because in the representative cluster conformations the pocket adopts a more open conformation and more grid points are selected. For the S4' subpocket, however, there is still only a small local region of very strong interaction, and the peaks in the interaction potentials for the N3+ and the H2O probe persist. For the S4 and S3 subpockets the H2O probe now shows the strongest interaction potentials. The N3+ probe shows stronger interactions than the O- probe in the S4 if the weighted average of the interaction potential is used; however, not much difference between the O- and the N3+ probe is detected.

Granzyme B. For granzyme B, the interaction potentials calculated from the X-ray structure even show an unexpected inverse correlation between cleavage entropy and interaction potentials (Figure 25C). In the S2' and S3' only small local interaction areas are observed, meaning that a lower number of grid points is selected in the two filtering steps than in the other subpockets. For the S2' pocket this results in a peak of the interaction potential for the O- probe.

Using the weighted average of interaction potentials calculated from the three representative cluster conformations slightly improves the results (see Figure 25F). Correlation coefficients between cleavage entropy and the weighted average of interaction potentials are now r≥0.11 for all probes. Considering only subpockets S4-S1 increases the correlation between the weighted average of interaction potentials and cleavage entropy considerably, and considering only subpockets S4-S1' at least doubles this correlation for the H2O, C3 and N3+ probe.

Discussion. The results for fXa, elastase-1 and granzyme B show how important it is to use different conformations in virtual screening approaches; conformational variability should likewise be considered in flexible docking approaches [164]. The approach of using weighted interaction potentials gives a more complete picture of the thermodynamic contributions to substrate specificity of proteases.

Even if the correlation between subpocket interaction potentials and substrate specificity is improved by using weighted subpocket interaction potentials, results are still inconclusive. The presented approach suffers from several drawbacks:

Firstly, GRID probes test for potential interactions of a substrate with the binding site of a protein. The cleavage entropy metric, however, does not distinguish between different types of amino acids, but rather looks at how many different amino acids are accepted in a subpocket without considering possible interactions in any way. The cleavage entropy is therefore not a suitable metric to correlate subpocket interaction potentials to. As described in Section 3.2, a metric that distinguishes between positively charged amino acids, negatively charged amino acids and neutral amino acids was developed to overcome this challenge.
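This first drawback can be illustrated with a small sketch. It assumes that the cleavage entropy is the Shannon entropy of the amino acid frequencies at a substrate position, normalized to the 20 proteinogenic amino acids; the exact definition and the charge-based metric are those of Section 3.2 and are not reproduced here. The substrate residues are hypothetical, and treating His as neutral is an assumption of this example.

```python
from collections import Counter
from math import log

POSITIVE = set("KR")  # assumption: His is treated as neutral in this sketch
NEGATIVE = set("DE")


def cleavage_entropy(residues):
    """Shannon entropy of residue frequencies at one position, log base 20 (assumed form)."""
    n = len(residues)
    counts = Counter(residues)
    return -sum((c / n) * log(c / n, 20) for c in counts.values())


def charge_profile(residues):
    """Fractions of positively charged, negatively charged and neutral residues."""
    n = len(residues)
    positive = sum(r in POSITIVE for r in residues) / n
    negative = sum(r in NEGATIVE for r in residues) / n
    return {"positive": positive, "negative": negative, "neutral": 1 - positive - negative}


# Hypothetical P1 residues of two proteases with opposite charge preference.
p1_basic = list("RRKKRKRK")
p1_acidic = list("DDEEDEDE")

entropy_basic = cleavage_entropy(p1_basic)
entropy_acidic = cleavage_entropy(p1_acidic)
profile_basic = charge_profile(p1_basic)
profile_acidic = charge_profile(p1_acidic)
```

Both positions receive the same entropy because each accepts two residue types equally often, whereas the charge profiles separate them completely. This is the information a charged GRID probe responds to.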

The second flaw in the approach is the nature of the GRID probes used. The probes testing for interactions with positively and negatively charged substrates, i.e. the N3+ and the O- probe, simultaneously also consider van der Waals interactions. Van der Waals interactions, however, are highly dependent on conformation, while electrostatic interactions depend less on the conformation due to their long-range nature. Considering both types of interactions at the same time might lead to a strong bias in the results. As a result of this realization, electrostatic GRID probes were designed that look only at electrostatic interactions (see Section 3.10).

In the presented approach, interaction potential maps were directly compared to the respective substrate specificity. However, when striving to achieve selective targeting of a protease in drug design, it makes more sense to focus on differences in specificity and the molecular drivers of these specificity differences. In the following, the approach was revised and differences in subpocket interaction potential maps became the new focus of interest. Section 4.5 will describe the comparison of eMIF similarities to electrostatic substrate readout similarities.

4.5 Comparison of eMIF Similarity with Electrostatic Substrate Readout Similarity

As described in Section 4.4, subpocket interaction potentials calculated from the MIFs of fXa, elastase-1 and granzyme B were compared directly to substrate specificity quantified as cleavage entropy [116]. In that approach, substrate specificity quantified as cleavage entropy could not be described satisfactorily by the subpocket interaction potentials calculated with the N3+, O-, C3 and H2O probe. The results were improved by considering the conformational variability of subpockets: interaction potentials were determined for each representative cluster structure extracted from MD trajectories and combined into a weighted mean, using the cluster occupancy as a weighting factor (see Section 4.4).

One reason for the previously inconclusive results is the use of GRID probes that test van der Waals interactions and electrostatics at the same time. To separate electrostatics from van der Waals interactions, GRID probes considering only electrostatic interactions were designed.

Electrostatic interactions are long-range interactions that change little with small differences in distance. This causes them to vary less with conformational changes, much in contrast to shape dependent recognition like van der Waals interactions and recognition that relies on precise exit vectors like hydrogen bonds. Due to their long-range character, however, it is more challenging to assign them to specific subpockets. In order to predict the specificity of proteases, Pethe et al. [165] used a structure-based approach that ranks possible substrates according to interaction energies and reorganization penalties. Their method outperforms conventional methods that focus solely on knowledge-based prediction of substrate preferences.

For the comparison of binding sites various methods are available. Often the methods for binding site comparison are designed for the purpose of off-target prediction and drug repurposing [166]. Such methods rely on molecular interaction fields (MIFs), e.g. BioGPS [167], [168] and IsoMIF [169], on shape and physico-chemical properties of the surface, e.g. protein functional surfaces [170], on graphs representing the 3D atomic similarities, e.g. IsoCleft [171], or on fingerprints describing the binding sites, e.g. PocketMatch [172]. In several databases the properties of binding sites are stored for comparison, such as pseudocenters with projected physico-chemical properties in CavBase, in CavSimBase [173] and in SiteEngine [174], sequence and structural similarity in CPASS [175], position of functional groups in SuMo [176] or surface geometrics and electrostatics in eF-site [177].

Since most of these methods are not designed to compare structurally very similar binding sites as found in the set of chymotrypsin-like proteases within this work, a new method optimized to compare similar binding interfaces, described in detail in Section 3.10, was developed. The method compares the binding sites of the proteases present in the test set based on the similarities of their electrostatic molecular interaction fields (eMIFs) and correlates the results to electrostatic substrate similarity [118]. The work is a result of the collaboration with Prof. Gabriele Cruciani at the University of Perugia, Italy.

In the following, first the electrostatic substrate similarities calculated for the nine proteases contained in the test set (see Section 4.1) are presented, before discussing the results of using electrostatic molecular interaction fields (eMIFs) and eMIF similarities for their description.

Electrostatic Substrate Similarities. Electrostatic substrate similarities, calculated as described in Section 3.2, show whether two proteases cleave similar substrates or whether they have opposing substrate preferences. The substrates are only characterized by their charge, i.e. positive, neutral or negative. The shape of the amino acids is neglected, therefore the recognition of glutamate and aspartate is considered to be exactly equivalent. The electrostatic substrate similarities can be broken down to the contributions of the single substrate positions summing up to the total value for all substrates. The self-similarities reflect the electrostatic substrate preference of single proteases and are normalized to give a total value of 1.

Figure 26 gives an overview of the similarities between two proteases in the substrate space after binning the amino acids into positively charged, negatively charged and neutral amino acids, as described in detail in Section 3.2. The electrostatic substrate similarity metric detects prominent similarities amongst the proteases with high preference for positively charged residues in the S1 subpocket, i.e. trypsin, factor VIIa, factor Xa, thrombin and kallikrein-1. The same holds true for the proteases reading primarily neutral amino acids, i.e. chymotrypsin, elastase-1 and granzyme M. Granzyme B shows low similarities to all other proteases, which is in line with the substrate preference difference to the rest of the protease test set, as granzyme B is the only chymotrypsin-like protease present in the test set that prefers negatively charged amino acids in the S1 subpocket.

Figure 26. Electrostatic substrate readout similarity. The heights of the bars indicate the electrostatic substrate similarities between all nine investigated proteases, ranging from P4 to P4’, and the resulting sum (Σ) on the right. Blue represents favoring of positively charged amino acids, yellow neutral ones and red negatively charged ones. The self-similarities are depicted as diagonal entries and placed on a grey background in the symmetric matrix [118].

Electrostatic substrate preferences are not only highlighted for the S1 subpocket, but also for all other subpockets. The self-similarity of granzyme M, for example, reveals a preference for negatively charged residues in the subpockets S3, S3’ and S4’, which is hardly identifiable in the cleavage site sequence logos (see Figure 20 in Section 4.1). Among the proteases that prefer positively charged amino acids in the S1 subpocket, fVIIa and kallikrein-1 are quite different in their substrate preferences in the other subpockets, yet they share a preference for positively charged substrate amino acids in S3’. Granzyme B favors negatively charged substrate residues over large parts of its binding site and in general shows minimal similarity to the proteases that prefer positively charged residues in the S1 subpocket. Within this group, the largest electrostatic substrate similarity with granzyme B is determined for trypsin, which is very specific for positive amino acids in the S1 subpocket, although it accepts negatively charged residues in remote subpockets. This is highlighted by the electrostatic substrate similarity when compared to granzyme B.

Electrostatic Molecular Interaction Fields (eMIFs). With the negatively and the positively charged GRID probes, the eMIFs of the proteases can be determined, as described in Section 3.10. Figure 27 illustrates the eMIFs for all members of the chymotrypsin-like protease test set.

Trypsin shows favorable interactions with the positive probe in its S1 subpocket, while in the more peripheral S4 and S4’ subpockets it prefers the negative probe. FVIIa, fXa, thrombin and kallikrein-1 favor the positive probe in large parts of their binding sites. Chymotrypsin interacts favorably with the positive probe in the prime site, while towards the outer non-prime site it prefers the negative probe. In the S2 and the S4’ subpocket elastase-1 favors the positive probe, while in the rest of the prime site and in S3 it favors the negative probe. Both granzymes seem to show a completely negative eMIF, but on closer inspection granzyme M favors the positive probe in the S1’ subpocket. The corresponding eMIF is hidden below a layer of grid points preferring the negative probe.


Figure 27. eMIFs. The electrostatic molecular interaction fields (eMIFs) are shown for the nine investigated proteases. The interactions with the positive probe are depicted in blue, whereas the interactions with the negative probe are depicted in red. A cutoff of -3 kcal/mol was used for the visualization of the fields [118].

Electrostatic Substrate Preferences and Electrostatic Molecular Interaction Fields. A comparison of electrostatic substrate preferences to the eMIFs of a protease demonstrates the presence of similar patterns in both metrics. Subpockets that are associated with a substrate readout preference for positively charged residues also give strong interactions with the positive probe. Complementarily, in subpockets with substrate readout preference for negatively charged amino acid residues, interactions with the negative probe are dominant. In general, where no strong electrostatic substrate preferences are detected, no strong electrostatic interactions are found in the corresponding eMIFs. Figure 28 gives the eMIFs of granzyme B, kallikrein-1, elastase-1 and trypsin.


Substrate self-similarities are in general well illustrated by the eMIFs. The eMIFs of granzyme B show the same pattern as the substrate self-similarity (see Figure 26), i.e. favorable interactions with the negative probe are visible over the entire binding site.

For kallikrein-1, primarily favorable interactions with the positive probe are detected. Only in the outer regions of the non-prime site, i.e., the S3 subpocket, larger areas which favor the negative probe are found. Again, this is very closely in line with the substrate self-similarity, as in the S3 and S4 subpockets negatively charged amino acids are preferred.


Figure 28. eMIFs of granzyme B, kallikrein-1, elastase-1 and trypsin. The eMIFs of granzyme B, kallikrein-1 (top), elastase-1 and trypsin (bottom) for the positive (blue) and the negative probe (red), using a cutoff of -3 kcal/mol and the electrostatic substrate self-similarity for positive amino acids (blue), neutral amino acids (yellow) and negative amino acids (red) [118].

Also for elastase-1, the eMIFs correspond well to the protease self-similarity, see Figure 28. As the S1 subpocket is specific for neutral amino acids, practically no electrostatic interactions are present. The prime site substrate self-similarities show preferences for the negative probe, whereas the non-prime site shows no distinct electrostatic preferences. In the S3 subpocket, which is rather unspecific in terms of electrostatics, both interactions are observed, although the positive eMIF at that position is barely visible as it is hidden behind the negative eMIF.

Trypsin strongly prefers positively charged amino acids in the S1 subpocket, which is also visible in the eMIF (see Figure 28). On the periphery of the binding site, the protease starts favoring negatively charged amino acids, which is also reflected by the calculated eMIFs, as the negative eMIF starts to dominate around the S3’ and S4’ subpockets.

Electrostatic Substrate Similarities and eMIF Overlaps. Calculation of the overlap between the eMIFs of different proteases identifies areas with the same dominating electrostatic interactions. The eMIF overlap can be compared directly to electrostatic substrate readout similarity. Figure 29 gives some representative examples of this comparison between eMIF overlap and electrostatic substrate similarities.


Figure 29. eMIFs, eMIF overlap and electrostatic substrate similarity for chymotrypsin and kallikrein-1 (top) and thrombin and fXa (bottom). The eMIFs and their overlaps are depicted in blue (positive probe) and red (negative probe). The eMIF overlap was calculated at a cutoff of 0 kcal/mol and visualized at a cutoff of 50 (kcal/mol)². The eMIF overlap on the right is depicted without protein surfaces, revealing overlapping eMIFs deep in the S1 subpocket hidden by the protein surfaces on the left. Above the overlap eMIF, the substrate similarity for the proteases is depicted for the positively charged amino acids (blue), for the neutral ones (yellow) and for the negatively charged ones (red) [118].
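The unit of the visualization cutoff, (kcal/mol)², suggests that the overlap is a product of interaction energies at shared grid points. The following minimal sketch works under that assumption; it is not the procedure of Section 3.10, and the function names and energy values are hypothetical.

```python
def emif_overlap(field_a, field_b, cutoff=0.0):
    """Pointwise overlap of two interaction fields sampled on the same grid.

    Grid points where both energies lie below `cutoff` (i.e. both are
    favorable) contribute the product of the two energies in (kcal/mol)^2.
    All other grid points contribute 0. This product form is an assumption.
    """
    if len(field_a) != len(field_b):
        raise ValueError("fields must be sampled on the same grid")
    return [
        ea * eb if ea < cutoff and eb < cutoff else 0.0
        for ea, eb in zip(field_a, field_b)
    ]


def total_overlap(field_a, field_b, cutoff=0.0):
    """Sum of the pointwise overlap over the whole grid."""
    return sum(emif_overlap(field_a, field_b, cutoff))


# Hypothetical energies (kcal/mol) at four shared grid points.
protease_a = [-8.0, -2.0, 1.5, -6.0]
protease_b = [-7.0, 0.5, -3.0, -10.0]
overlap = emif_overlap(protease_a, protease_b)
print(overlap)  # [56.0, 0.0, 0.0, 60.0]
```

With a visualization cutoff of 50 (kcal/mol)², both non-zero grid points of this toy example would be displayed.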

Regarding chymotrypsin and kallikrein-1 (see Figure 29), the differences in substrate similarity are explained by the eMIFs. Both proteases favor positively charged and neutral amino acids nearly over the entire binding site. However, on the non-prime site, kallikrein-1 favors negatively charged amino acids in the S3 subpocket. This hotspot is illustrated in the overlap eMIF of the two proteases. On the entire prime site, the eMIF overlap shows favorable positive interactions, the same way as the substrate similarity.

Thrombin and fXa are the most similar protease examples of the investigated test set in terms of their electrostatic substrate readout. Their eMIFs show little difference (see Figure 29). There is little contrast in the subpockets on the outer regions of the prime site as well as farther away on the non-prime site. Near the S1 subpocket the differences of the eMIFs are negligible.

An overview of the similarities of the nine proteases in positive substrate readout and positive eMIF overlap is given in the upper part of Figure 30. The lower part of Figure 30 is the corresponding equivalent showing the negative part of the substrate similarities and the eMIF overlaps from the negative GRID probe. The matrices of the substrate similarities match the electrostatic parts of the total substrate similarities already given in Figure 26.


Figure 30. Comparison between substrate similarity and eMIF overlap for positive (top) and negative (bottom) substrate space and probe. High similarity and overlap are shaded in dark blue (positive) and dark red (negative) respectively, while low similarity and overlap are depicted on a white background (both). For readability purposes the eMIF overlap was scaled logarithmically [118].

Figure 30 immediately shows that proteases which are similar in electrostatic substrate readout are also similar in their eMIFs. The correlation of the similarity in substrate space and the eMIF similarity was quantified by a Mantel test, resulting in a Pearson correlation of 0.82 for the positive and 0.57 for the negative probe, respectively. The correlation for the positive probe is excellent. The correlation for the negative probe is still surprisingly high, as the binding sites of most proteases of our set are dominated by interactions with the positive probe with minor contribution of the negative probe.
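A Mantel test correlates the off-diagonal entries of two symmetric matrices and assesses significance by permuting the rows and columns of one of them. The sketch below is a generic, minimal version of this test and is not necessarily the implementation used here; the 4×4 matrices are hypothetical, and the number of permutations and the seed are arbitrary choices.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def upper_triangle(matrix, order=None):
    """Off-diagonal upper triangle, optionally after relabeling rows/columns."""
    n = len(matrix)
    order = list(range(n)) if order is None else order
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(m1, m2, permutations=999, seed=0):
    """Return the observed Pearson r and a one-sided permutation p-value."""
    rng = random.Random(seed)
    x = upper_triangle(m1)
    r_obs = pearson(x, upper_triangle(m2))
    n = len(m1)
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)  # permute rows and columns of m2 together
        if pearson(x, upper_triangle(m2, order)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Hypothetical similarity matrices for four proteases.
substrate_similarity = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
emif_similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
r, p = mantel(substrate_similarity, emif_similarity)
print(f"r = {r:.2f}, p = {p:.3f}")
```

The permutation step is needed because the entries of a similarity matrix are not independent, so the usual significance test for a Pearson correlation does not apply.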

Discussion. As described in Section 4.4, a quantitative correlation between conformational binding enthalpy and substrate specificity could not be found when simultaneously considering van der Waals and electrostatic interactions. A purely physics-based comparison of eMIFs and electrostatic substrate readout similarity, in contrast, revealed a high correlation between the two. Electrostatics were thus identified as key drivers in substrate recognition in serine proteases. This is in line with work on thrombin by Huntington et al. [178] and with work by Batra et al. [179]. Due to their strong nature and their long-range behavior, electrostatic interactions behave quite differently from other aspects of substrate recognition, which are dominated by shape complementarity. As they vary little with small changes in distance, it is not surprising that considering only static X-ray structures already yields a high correlation between electrostatic substrate similarities and electrostatic interaction field similarities. Flexibility of the binding interface influences electrostatics only to a minor extent. Electrostatic contributions would vary substantially only with major conformational changes or with differences in protonation or ion coordination. The long-range behavior and continuous nature of electrostatic interactions make assignment to specific subpockets a challenging task. This is avoided through correlation of the electrostatic substrate recognition with the electrostatic interaction field of the entire binding cleft. However, for an efficient substrate prediction, partitioning of electrostatic contributions to subpockets of the binding site should be achieved in future work.

In contrast to electrostatics, shape complementarities are limited in their spatial range and only influence their immediate neighborhood, providing either weak attractive interactions or strong repulsion in case of clashes. The transition from attraction to repulsion and vice versa may happen due to small displacements, which is why specificity by shape complementarity, although governed by seemingly weak interactions, is strongly influenced by flexibility [180].


Flexibility allows binding of substrates with diverse shapes, whereas in a rigid pocket clashes would prevent binding of many potential binding partners.
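The different distance dependence of the two interaction types can be made concrete with a small numerical sketch. It uses a plain Coulomb term and a 12-6 Lennard-Jones term as stand-ins for electrostatics and shape complementarity; the unit charges, the dielectric constant of 1 and the Lennard-Jones parameters are illustrative assumptions and not values used in this work.

```python
COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2)


def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Coulomb energy in kcal/mol for charges in e and distance r in Angstrom."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def lennard_jones_energy(r, epsilon=0.15, sigma=3.5):
    """12-6 Lennard-Jones energy in kcal/mol (illustrative parameters)."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


# Displace an ion pair and a van der Waals contact in steps of 0.5 Angstrom.
for r in (3.0, 3.5, 4.0, 4.5):
    print(
        f"r = {r:.1f} A: Coulomb = {coulomb_energy(1, -1, r):8.2f}, "
        f"LJ = {lennard_jones_energy(r):6.3f} kcal/mol"
    )
```

In this toy setting the Coulomb term stays strongly attractive over the whole range, whereas the Lennard-Jones term switches from repulsive to attractive between 3.0 and 4.0 Å.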

This is in line with the notion that protein-protein recognition follows a two-step mechanism. Firstly, an initial encounter complex forms when enzyme and substrate meet. The association rates for this initial encounter complex are largely governed by electrostatics [181]. An energy funnel pulls substrate and enzyme together and directs the substrate towards the binding site, as given in Figure 31 for trypsin. In a second step, conformational changes lead to the formation of a compatible binding interface. Here, shape complementarity and flexibility are crucial for enabling weak van der Waals interactions and avoiding clashes.


Figure 31. The binding interface of trypsin depicted with the eMIFs of the positive (blue) and negative (red) probe. An energy cutoff of -0.5 kcal/mol was used for the visualization of far-reaching electrostatic interactions. The eMIF forms a funnel-like long-range interaction profile, which presumably guides substrates towards an initial encounter complex [118].

Electrostatics and shape complementarity in context of substrate recognition can be considered as orthogonal properties resulting in different aspects of substrate recognition [182]. Thus, the two aspects of substrate recognition can be separated very efficiently by the knowledge-based approach presented in order to analyze substrate readout data. Electrostatic substrate preferences are characterized by binning substrate residues according to their charge. On the other hand, shape complementarity may be characterized by analyzing substrate recognition within the three bins, especially within the neutral bin comprising 15 different neutral amino acid residues. Further studies will be aimed at the correlation of both the contributions of electrostatics and shape complementarity to differences and similarities in substrate recognition. If the contributions of electrostatics and shape complementarity can be described in a solely physics-based way, it will be possible to predict the localized specificity and promiscuity of proteases and most likely also of other biomolecular interfaces.


5. Transferring the Information on Protease Substrate Specificity to the Small Molecule Space

In order to make use of the information on protease peptide substrates collected in the MEROPS database [13] and of protease substrate profiles experimentally accessible via methods such as PICS [183] and TAILS [134], a method for virtual screening based on protease substrate sequences was created (see Section 3.11) [136]. The method finds potential new small molecule inhibitors through comparison of shape and pharmacophoric properties of a query based on the most specific substrate positions in protease substrates and incorporates the information on the frequencies at which amino acids are found in the protease substrates. To validate the method, four examples, fXa, thrombin, fVIIa and caspase-3 (casp-3), were investigated. The DUD-E [139] database was used for benchmarking. It contains a set of known actives for each example and decoys of unknown activity, which have physico-chemical properties similar to those of the known actives. Queries were generated as described in Section 3.11 and the DUD-E [139] was screened using vROCS [129].

Table 5 summarizes the results for the virtual screening. Enrichment factors at 1, 2 and 5 % are given. Figure 32 illustrates the corresponding ROC-curves.
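For reference, the two performance measures can be computed from a ranked screening list as in the following minimal sketch. It uses the common textbook definitions of the enrichment factor and the ROC AUC, which may differ in detail (e.g. tie handling) from the values reported by vROCS; the ranked list is hypothetical.

```python
def enrichment_factor(labels_ranked, fraction):
    """Enrichment factor at a given fraction of the ranked database.

    `labels_ranked` lists 1 (active) or 0 (decoy), best-scored molecule first.
    """
    n_total = len(labels_ranked)
    n_actives = sum(labels_ranked)
    n_top = max(1, round(fraction * n_total))
    hits_top = sum(labels_ranked[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)


def roc_auc(labels_ranked):
    """AUC as the fraction of active/decoy pairs ranked correctly (no ties)."""
    n_actives = sum(labels_ranked)
    n_decoys = len(labels_ranked) - n_actives
    decoys_seen = 0
    pairs_won = 0
    for label in reversed(labels_ranked):  # walk from worst to best rank
        if label == 0:
            decoys_seen += 1
        else:
            pairs_won += decoys_seen  # decoys ranked below this active
    return pairs_won / (n_actives * n_decoys)


# Hypothetical ranking of ten molecules with three actives.
ranked = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
ef_20 = enrichment_factor(ranked, 0.20)
auc = roc_auc(ranked)
print(f"EF 20% = {ef_20:.2f}, AUC = {auc:.2f}")
```

The maximum achievable enrichment factor depends on the ratio of actives to decoys. With 369 actives among 25543 molecules, as for thrombin, the upper bound at 1 % is 25543/369 ≈ 69, so enrichment factors are only comparable between datasets of similar composition.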


Table 5. ROCS results [136]a.

| Protease/Parameter | AUC | EF 1% | EF 2% | EF 5% | Nsubstrates (MEROPS) | Nactives | Ndecoys |
|---|---|---|---|---|---|---|---|
| Thrombin | 0.66 | 20.36 | 19.51 | 9.65 | 168 | 369 | 25174 |
| FXa | 0.74 | 15.98 | 13.08 | 7.80 | 59 | 413 | 24893 |
| FVIIa | 0.84 | 4.30 | 2.79 | 8.19 | 9 | 68 | 1782 |
| Casp-3 (P3-P1) | 0.75 | 0.50 | 0.50 | 1.69 | 651 | 199 | 10666 |
| Casp-3 (P4-P1) | 0.74 | 1.99 | 2.48 | 2.58 | 651 | 199 | 10666 |

a Thrombin and fXa show high early enrichment values, but lower AUC values than fVIIa. Despite the low number of known substrate sequences, a high AUC is obtained for fVIIa. For both substrate sequence ranges early enrichment for casp-3 is low, despite high AUC values. Results for AUC and early enrichment depend on the ratio of actives and decoys in the screened database.


Figure 32. ROC-curves for all examples and datasets corresponding to the data given in Table 5. High early enrichment is obtained in all cases except for casp-3. The figure is taken from [136].

Discussion.

Thrombin. Early enrichment for thrombin is highest, while the AUC is lowest compared to the other examples (see Table 5). The highest ranked decoys for thrombin all have the guanidine functionality in P1 position in common, which is also fundamental for substrate recognition in the thrombin peptide substrates [147].

With regard to shape as well as chemical functionalities, the highest ranked decoys are similar to classical thrombin inhibitors (see Figure 33) [184]. The reason for some of the actives being ranked lowest is the high shape penalty resulting from the difference in volume between the query and the active molecules and the lack of characteristic functional groups allowing for strong interactions with the binding site.


Figure 33. Highest ranked decoys and actives and lowest ranked actives for the example thrombin resulting from the virtual screening process of the DUD-E validation database. Both highest ranked actives and decoys possess guanidine-like functionalities in P1 position, which are also preferred in the peptide substrates of thrombin. The lowest ranked actives lack the typical functional groups present in the thrombin peptide substrates and are smaller than the query, resulting in a large shape penalty in the virtual screening procedure [136].

Factor Xa. The highest ranked decoys for fXa all contain the guanidine group and are similar in shape to classical fXa inhibitors. The number of peptides used for creation of the fXa-ROCS query is lower than for thrombin, as the number of substrates listed in the MEROPS [13] database is about a factor 3 lower (159 thrombin substrates:58 fXa substrates). However, the screening of the DUD-E [139] database resulted in high AUC values (see Table 5).

Factor VIIa. Also in fVIIa the highest ranked actives and decoys all possess the guanidine functionality in the S1 binding position. In the case of fVIIa, some of the lowest ranked actives bear negatively charged groups which are in contrast to the substrate specificity in S1 position (see Figure 34). Only nine substrates are listed in the MEROPS [13] database for fVIIa, therefore the vROCS [129] query might miss some important information due to incomplete substrate data. However, considering the low number of known substrates, it is impressive how well the method performs in terms of AUC and early enrichment (see Table 5).

Figure 34. Lowest ranked actives of fVIIa. The low ranking of some of the active molecules despite possessing the guanidine functionality important for binding to the S1 subpocket might be due to missing information in the vROCS query caused by the low number of substrates listed in the MEROPS [13] database for fVIIa [136].

Casp-3. For a high ranking of actives or decoys in the casp-3 virtual screening, the carboxylate group in S1 position seems to be the key functionality (see Figure 35). Several of the highest ranked actives are prodrugs [185], possessing a lactone functionality. This lactone functionality has to be hydrolyzed (see Section 3.11) to convert the molecule to its bioactive form and subsequently obtain a high ranking in the virtual screening process. Based on the typical DEVD specificity [138] of casp-3 and therefore also high specificity in the S4 subpocket, two models based on substrate positions P3-P1 and P4-P1 were employed. AUC and early enrichment were not improved significantly when using a broader substrate position range. However, the highest ranked actives and decoys differ depending on how many substrate positions are used, while the lowest ranked actives remain the same regardless of the substrate position range.

Importance of the Template Peptide. The results of the shape-based virtual screening runs depend to a large extent on the query conformation. Therefore, the importance of the template peptide was investigated. For fXa and fVIIa, for which no protease-substrate complexes are available in the PDB, either a thrombin or a casp-3 protease-substrate complex was employed as template for the mutation strategy in the virtual screening procedure. Table 6 gives the results for the virtual screening runs performed based on different protease-substrate complexes. For the fXa DUD-E validation runs, AUC and early enrichment are lowered if a casp-3 protease-substrate complex is used as template for the mutation strategy instead of a thrombin protease-substrate complex. For fVIIa, the AUC is not affected by the use of a different protease-substrate template complex, but early enrichment values for the casp-3 protease-substrate template complex are lower than for the thrombin protease-substrate template complex.


Table 6. Influence of different protease-substrate complex templates [136].a

| Protease (template complex) | AUC | EF 1% | EF 2% | Nsubstrates (MEROPS) | Nactives | Ndecoys |
|---|---|---|---|---|---|---|
| FXa (thrombin protease-substrate complex template) | 0.74 | 15.98 | 13.08 | 59 | 413 | 24893 |
| FXa (casp-3 protease-substrate complex template) | 0.68 | 6.54 | 10.66 | 59 | 413 | 24893 |
| FVIIa (thrombin protease-substrate complex template) | 0.84 | 4.23 | 2.79 | 9 | 68 | 1782 |
| FVIIa (casp-3 protease-substrate complex template) | 0.84 | 2.86 | 2.21 | 9 | 68 | 1782 |

a For fXa, AUC values and enrichment factors for the DUD-E screening decrease when using a casp-3 protease-substrate template complex instead of a thrombin protease-substrate template complex. For fVIIa, the choice of protease-substrate complex template has no effect on AUC, but early enrichment is slightly lower for the casp-3 protease-substrate template complex than for the thrombin protease-substrate template complex.

The results prove that the mutation strategy is successful even if the peptide substrate sequences and the template peptide show low sequence identity. As long as there is a template peptide in extended beta sheet conformation available, the virtual screening procedure can be applied [136].


Figure 35. Examples for high ranked actives and for the lowest ranked actives for casp-3 when using substrate positions P3-P1. Notice that CHEMBL329917, CHEMBL101545 and CHEMBL327298 are used in their bioactive form in the virtual screening experiments. The bioactive form is obtained through hydrolysis of the lactone-functionality, present in the prodrug form. The lowest ranked actives are small and lack the negatively charged group offering the possibility for favorable interaction with the S1 subpocket of casp-3. This results in a high shape penalty in shape-based virtual screening and in a mismatch of functional groups with the vROCS query [136].

Comparison to Alternative Virtual Screening Strategies. The main advantage of the method presented here is that neither a structure of the target nor any information on known small molecule ligands is required to apply the virtual screening procedure to a protease target. The only data required is the information on the protease substrate sequences. Substrate specificity profiling is done rather quickly, in comparison to generating a structure or finding small molecule inhibitors, therefore the method is interesting also for protease targets for which no information is listed in the MEROPS [13] database.


The advantage over the method of Sukuru et al. [186], who also performed virtual screening based on the knowledge on protease substrates, is that the information about the known peptide substrates for a protease is directly transferred to the small molecule space. No a priori knowledge on small molecule ligands is thus required to find new small molecule inhibitors.

ROCS uses a very fast and efficient algorithm for the virtual screening runs, therefore hundreds of thousands of molecules can be screened within hours [129]. The method has significant advantages over docking and other structure-based methods as well as over ligand-based approaches using small molecule ligands as a basis for virtual screening experiments.

Conclusion. The virtual screening procedure presented here allows for the fast and efficient generation of a query derived from protease peptide substrate data that incorporates information on shape, type of functionalities and frequency of functionalities present in natural protease substrates. The method can be readily applied to screen for potential new small molecule inhibitors. The application to four different proteases which cover different active site mechanisms, substrate specificities and binding site shapes proves that the method performs well in terms of AUC and early enrichment in all cases. Even if available substrate data is limited, as in the case of fVIIa and fXa, the method successfully recovered actives from the very challenging datasets prepared from the DUD-E [139]. The workflow described [136] represents the first approach using protease substrate sequences as training set for a virtual screening experiment. The query creation in vROCS allows for the inclusion of the information on the relative frequencies of amino acids present in substrates in the respective subpockets and is focused on the properties of side chains in substrates. This allows for scaffold hopping. As the method can easily be applied to different protease systems, it may also be applied to members of other enzyme types. In summary, the virtual screening procedure represents a new tool to be used for rational drug design and makes use of the huge amount of data on protease substrates in order to find new small molecule inhibitors.


6. Investigation of Parasitic Cysteine Protease Model Systems

Cysteine proteases play an important role in parasitic infections and are often targets when treating (neglected) tropical diseases. The cysteine proteases cruzain and rhodesain are important targets when treating Chagas disease and African Sleeping Sickness (Human African trypanosomiasis (HAT)) [37]. In the following, the pathology of Chagas disease and HAT will be described and the importance of cruzain and rhodesain in drug design will be emphasized.

6.1. Cruzain as Drug Target in Chagas Disease

Chagas disease is a chronic, systemic, parasitic infection caused by the protozoan Trypanosoma cruzi (T. cruzi) [187]. It is a vector-transmitted disease with infections occurring mainly in areas of North, Central and South America. Transmission also occurs through transfusions, organ and bone marrow transplantation or through contaminated foods or drinks [188]. Infection with Chagas disease follows two stages: The acute stage starts one to two weeks after vector-borne transmission. Symptoms include fever, malaise and in some cases skin nodules or painless eyelid edemas [189]. Most infections at this stage are not detected. After the acute stage, the infection moves on to the chronic stage. Most people are asymptomatic, but 20-30% develop cardiomyopathy over the course of decades [189].

The drugs most frequently used for the treatment of Chagas disease are nitroheterocyclic compounds, such as nifurtimox and benznidazole [190]. Their anti-T. cruzi activity was discovered over three decades ago. Nifurtimox and benznidazole make use of the fact that T. cruzi is more sensitive to oxidative stress than vertebrate cells. Both drugs act through the generation of free radicals. An overview of clinically available drugs used for the treatment of Chagas disease is given in Figure 36. The drawback of all of these drug examples is that they either have significant side effects and are efficient only in the acute phase of the disease or their efficacy against T. cruzi infection is controversial [191].


Figure 36. Chemical structures of clinically available drugs for the specific treatment of Chagas disease (adapted from [190]). Nifurtimox (Lampit®) is a product of Bayer, A.G. (http://www.bayer.com/) (A) and benznidazole (Rochagan®, Radanil®) is a product of Roche Pharmaceuticals (http://www.roche.com/) (B). These compounds act via the generation of free radicals and are mostly effective in the acute or early chronic stages of the disease. Itraconazole (Sporanox®, Janssen-Cilag; http://www.janssen-cilag.com/) is an inhibitor of fungal and protozoal sterol C14α-demethylase (C). Allopurinol blocks purine salvage by inhibiting the parasite's hypoxanthine-guanine phosphoribosyltransferase (HGPRT) (D). Claims regarding the efficacy of itraconazole and allopurinol as specific treatments for human Chagas disease have been controversial [192], [191].

The T. cruzi cysteine protease cruzain is important for metacyclogenesis and host cell invasion in Chagas disease. Therefore, current drug design strategies are aimed at targeting cruzain [193], [194]. Current cruzain inhibitors can be divided into peptidic and non-peptidic derivatives.


Different classes of non-peptidic derivatives are vinyl sulfones, thiosemicarbazones, purine nitriles and benzimidazoles [194].

6.2. Rhodesain as Drug Target in Human African Trypanosomiasis (HAT)

Human African trypanosomiasis (HAT), also called African Sleeping Sickness, is a neglected tropical disease caused by an infection with the protozoan Trypanosoma brucei (T. brucei). It is found mainly in sub-Saharan Africa and, similar to Chagas disease, manifests in two stages [195]. The parasites are transferred to humans by the tsetse fly. HAT is almost always fatal if untreated or inadequately treated. The two stages of HAT are hard to differentiate. In the first stage, parasites are restricted to the blood and lymphatic system, while in the second stage parasites invade the central nervous system [196]. The main characteristic symptom of second-stage HAT is the disturbance of the sleep-wake cycle due to the presence of parasites in the brain, which gives the disease its name in many languages. T. brucei has been extensively studied for the past three decades; however, no new drugs have been developed for the treatment of HAT since the 1970s [196]. The T. brucei cysteine protease rhodesain is a promising target for the treatment of HAT [37]. As it shows high structural similarity to cruzain, cruzain inhibitors might also affect rhodesain, making the development of dual inhibitors feasible [193].

6.3. Selective Targeting of Cruzain and Rhodesain in Rational Drug Design

Structural Similarity of Cruzain and Rhodesain to Cathepsins B and L. Cruzain and rhodesain are structurally very similar to the human cathepsins B and L. Therefore, achieving selectivity against the two human cysteine proteases when targeting cruzain and rhodesain is a challenging task. Figure 37 gives the cleavage site sequence logos for cruzain, rhodesain and cathepsins L and B. Peptide substrate specificity is highly similar in the S2 subpocket, which is the specificity-determining pocket in papain-like cysteine proteases [32].


Figure 37. Cleavage site sequence logos for parasitic and human cysteine proteases. Cruzain (A) prefers neutral amino acids in the S2 subpocket and positively charged amino acids in other subpockets (65 substrates), rhodesain (B) prefers neutral amino acids in the S2 subpocket and positively charged amino acids in the S1 (11 substrates), cathepsin L (C) prefers neutral amino acids in the S2 as well (2937 substrates) and cathepsin B (D) accepts a broad range of different amino acids, while showing slight preferences for neutral amino acids (632 substrates). In the S3’ subpocket in particular, cathepsin B strongly prefers Gly. Cleavage site sequence logos were downloaded from the MEROPS database [13].

Within this thesis, the methods developed for comparing serine proteases with chymotrypsin fold, which are mainly the definition of subpockets and the comparison of subpocket flexibility with regard to backbone flexibility based on Cα alignment and side chain flexibility, are applied to screen for differences between cruzain, rhodesain, cathepsin B and cathepsin L that can be exploited in rational drug design. The research question is a result of a collaboration with the group of Prof. Ferreira at the Federal University of Minas Gerais (UFMG), Brazil.

In the following, first the generation of a generic pocket definition in analogy to serine proteases with chymotrypsin fold is described, followed by an analysis of subpocket flexibility. The peptide and substrate-based virtual screening approach presented in Section 5 is finally applied to screen for potential cruzain inhibitors, which are to be tested in Prof. Ferreira’s group at UFMG in Brazil.


Subsite Definition for Cruzain, Rhodesain, Cathepsin B and Cathepsin L. As for the serine protease test set with chymotrypsin fold, a generic binding site definition was generated for the cysteine proteases cruzain, rhodesain, cathepsin B and cathepsin L. The subpocket definition was transferred from a subpocket definition of human cathepsin K, obtained by Bromme et al. [197], [198], described in Section 7. Subpockets S3’ and S4’ were not defined, as crystallographic data on substrates binding to those two subpockets was not sufficient. Subpocket residues for all four cysteine protease examples are given in Appendix B1. The generic pocket definition for S4-S2’ is given in Figure 38.


Figure 38. Generic pocket definition for parasitic and human cysteine proteases. Pockets S4-S2’ are shown for parasitic cysteine proteases cruzain (A) and rhodesain (B) and human cysteine proteases cathepsin B (C) and L (D).

Flexibility Differences between Cruzain, Rhodesain, Cathepsin B and Cathepsin L. One possible way to selectively target cruzain and rhodesain, without simultaneously hitting cathepsin B and cathepsin L, is to exploit flexibility differences in subpockets. Only flexibility differences in subpockets S4-S2’ are considered as for subpockets S3’ and S4’ the generation of a generic pocket definition was not possible due to insufficient data on protein-substrate complexes.

Subpocket flexibilities for subpockets S4-S2’ are compared with regard to B-factors based on global Cα alignment and dihedral entropy (phi and psi entropy), which are both metrics for backbone flexibility, and B-factors based on local alignment, which is a metric for side chain flexibility. The results for the different flexibility metrics mapped to the respective subpockets of the investigated cysteine proteases and the corresponding histograms showing absolute B-factors/entropies and the respective differences are given in Appendix B2. In the following, the differences in flexibilities detected by the applied metrics are discussed with respect to their implications for rational drug design.
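The B-factor metrics compared here can be illustrated with a minimal Python sketch. It assumes that the B-factors are derived from positional fluctuations in trajectory frames that have already been superimposed (globally on Cα atoms for the backbone metric, locally for the side chain metric), and it uses the standard relation B = (8π²/3)⟨Δr²⟩. The function names, the array layout and the subpocket dictionary are illustrative assumptions and not the implementation used in this thesis.

```python
import numpy as np


def bfactors_from_fluctuations(coords):
    """Convert positional fluctuations into per-atom B-factors.

    coords: array of shape (n_frames, n_atoms, 3) in Angstrom.
    Assumption: the frames were superimposed beforehand.
    Returns B-factors in Angstrom^2, B = (8 * pi^2 / 3) * <|r - <r>|^2>.
    """
    coords = np.asarray(coords, dtype=float)
    mean_pos = coords.mean(axis=0)
    # Mean square fluctuation of every atom around its average position.
    msf = ((coords - mean_pos) ** 2).sum(axis=2).mean(axis=0)
    return (8.0 * np.pi ** 2 / 3.0) * msf


def subpocket_means(per_residue_values, subpockets):
    """Average a per-residue metric over the residues of each subpocket.

    subpockets: hypothetical mapping such as {"S1": [0, 1], "S2": [2]},
    where the integers index into per_residue_values.
    """
    values = np.asarray(per_residue_values, dtype=float)
    return {name: float(values[idx].mean()) for name, idx in subpockets.items()}
```

Differences between two proteases can then be obtained by subtracting the per-subpocket averages, provided both use the same generic pocket definition.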

Differences in B-Factors Based on Global Cα Alignment (Backbone Flexibility). When comparing the parasitic cysteine protease cruzain to the human cysteine proteases cathepsin B and L in terms of backbone flexibility based on Cα alignment, cathepsin B shows a much higher flexibility than cruzain. Cathepsin L is also more flexible than cruzain in most subpockets, but the differences are less pronounced than for cathepsin B. In addition, in subpockets S3-S4, cruzain shows slightly higher flexibility than cathepsin L. The pronounced difference in backbone flexibility can be used to achieve selective targeting of cruzain without simultaneously targeting cathepsin B. As the binding site of cruzain shows less conformational variability, inhibitors adapted to the binding site shape and to the location of side chains that can interact with the inhibitors can be rationally designed. As the backbone flexibility differences to cathepsin L are less pronounced, using flexibility differences to achieve selectivity against cathepsin L might be more challenging.

Also compared to rhodesain, cathepsin B shows a much higher flexibility in all subpockets if backbone flexibility based on Cα alignment is investigated. For cathepsin L, the same trend as for cruzain can be observed: Cathepsin L backbone flexibility is higher than rhodesain backbone flexibility in all subpockets but S3-S4. In subpocket S2, rhodesain and cathepsin L show almost the same backbone flexibility. Apart from backbone flexibility differences in the S1’, the differences are too small to make use of them in selective targeting in rational drug design. Both parasitic cysteine proteases cruzain and rhodesain have similar backbone flexibilities.

Differences in Dihedral Entropies (Backbone Flexibility). Results for flexibility differences quantified as phi and psi dihedral entropies confirm the results for the comparison of backbone flexibility based on Cα alignment. The phi entropy of cathepsin B is higher than the phi entropy of cruzain in all subpockets, confirming the higher backbone flexibility of cathepsin B. For the psi entropy, only the S2 subpocket shows a slightly higher value for cruzain than for cathepsin B. The comparison of cruzain to cathepsin L with regard to backbone flexibility quantified as phi/psi entropy shows higher phi entropy for cruzain in subpockets S3-S4. The same result is obtained for the psi entropy. The conclusion for targeted drug design is therefore again that the differences in backbone flexibility between cruzain and cathepsin B can be exploited to achieve selective targeting of cruzain, whereas achieving selectivity against cathepsin L using backbone flexibility differences might be a challenging task.

Compared to rhodesain, cathepsin B also shows higher phi entropy in all subpockets. With regard to cathepsin L, subpockets S4-S1’ show higher phi entropy in rhodesain and the subpocket S2 shows almost the same phi entropy. Regarding the psi entropy, rhodesain, like cruzain, shows higher values than cathepsin B only in the S2. For cathepsin L, the psi entropy is higher for rhodesain in subpockets S3-S4.

The results from using dihedral entropies as a metric for quantifying backbone flexibility thus agree with the B-factor-based comparison of cruzain, rhodesain, cathepsin B and cathepsin L. They point to the possibility of selectively targeting cruzain and rhodesain without simultaneously targeting cathepsin B by making use of the distinct backbone flexibilities. For cathepsin L, backbone flexibility differences to cruzain and rhodesain are not as pronounced, which makes achieving selectivity based on backbone flexibility differences a challenging task.
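How a dihedral entropy can be obtained from a sampled phi or psi angle is sketched below. This is a simple histogram-based estimate, S = -R Σ p ln p, which is an assumption made for illustration; the estimator used in this thesis is described in the methods and may differ, for instance in the density estimation and binning. The function name and the bin count are hypothetical.

```python
import numpy as np

GAS_CONSTANT = 8.314462618  # J / (mol K)


def dihedral_entropy(angles_deg, n_bins=36):
    """Histogram estimate of the entropy of one dihedral angle.

    angles_deg: sampled phi or psi values in degrees, in [-180, 180].
    n_bins: number of equal-width bins (assumption: 10 degree bins).
    Returns -R * sum(p * ln p) over the populated bins in J / (mol K).
    """
    counts, _ = np.histogram(
        np.asarray(angles_deg, dtype=float), bins=n_bins, range=(-180.0, 180.0)
    )
    p = counts[counts > 0] / counts.sum()
    return float(-GAS_CONSTANT * np.sum(p * np.log(p)))
```

A dihedral that stays in a single bin yields an entropy of zero, while a uniformly sampled dihedral yields the maximum value R ln(n_bins). The absolute values therefore depend on the chosen bin width, and only differences computed with identical settings are comparable.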

Differences in B-Factors Based on Local Alignment (Side Chain Flexibility). Regarding the side chain flexibility quantified as B-factors based on local alignment, cathepsin B shows higher flexibility than cruzain only in the S1 and the S1’ subpocket. The high side chain flexibility difference in the S1 might be used for selectively targeting cruzain over cathepsin B. Cathepsin L shows higher side chain flexibility than cruzain in subpockets S1, S3 and S1’-S2’. Especially the marked difference in the S3 subpocket might be exploited to selectively target cruzain over cathepsin L.


Compared to rhodesain, cathepsin B also shows distinctly higher flexibility in the S1 subpocket. Also in the S1’ subpocket, side chain flexibility of cathepsin B is higher than in rhodesain. Cathepsin L follows the same trend: side chain flexibility is higher in the S1 and S1’ subpockets than in rhodesain, but the differences are less pronounced than those observed for cathepsin B. Regarding rhodesain, the higher side chain flexibilities of cathepsin B and L in the S1 and S1’ subpockets can be exploited to achieve selective targeting of rhodesain over the human cysteine proteases.

6.4. Peptide Substrate-Based Virtual Screening to Find New Cruzain Inhibitors

The method developed for finding new small molecule inhibitors based on protease substrate preferences (see Section 3.11) was applied to find new inhibitors for the cysteine protease cruzain, which is an important drug target for Chagas disease. The project is a result of the exchange stay in fall 2015 in the laboratory of Prof. Rafaela Ferreira at the Federal University of Minas Gerais (UFMG) in Belo Horizonte, Brazil. The selected compounds are currently being tested in biological assays at UFMG.

The vROCS [129] screen was performed as described in Section 3.11 using substrate positions P3-P1, P3-P1’ and P2-P1’. Apart from the Tanimoto combo score, which considers shape as well as pharmacophore similarity between query and database molecule, the criteria for compound selection were the potential toxicity of the compounds, the xlogP, the molecular weight and the number of H-bond donors/acceptors, combined with chemical experience. The best vROCS Tanimoto combo scores resulted when using substrate positions P3-P1. A total of 23 compounds was selected from the results based on the three different substrate position ranges. The 11 compounds selected based on virtual screening results for substrate positions P3-P1 are given in Figure 39.
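The scoring and selection logic can be sketched as follows. In ROCS, the Tanimoto combo score is the sum of the shape Tanimoto and the color (pharmacophore) Tanimoto and therefore ranges from 0 to 2. The property cutoffs in the sketch are placeholder values in the style of Lipinski's rule of five and are not the thresholds applied in this thesis; the `Hit` record and all function names are hypothetical, and the toxicity assessment and chemical experience that entered the actual selection are not modeled.

```python
from dataclasses import dataclass


def tanimoto(overlap_ab, self_a, self_b):
    """Tanimoto coefficient from an overlap and the two self-overlaps."""
    return overlap_ab / (self_a + self_b - overlap_ab)


def tanimoto_combo(shape_tanimoto, color_tanimoto):
    """Combined score: shape plus color Tanimoto, ranging from 0 to 2."""
    return shape_tanimoto + color_tanimoto


@dataclass
class Hit:
    name: str
    shape_tanimoto: float
    color_tanimoto: float
    xlogp: float
    mol_weight: float
    hbd: int  # number of H-bond donors
    hba: int  # number of H-bond acceptors


def select_hits(hits, max_xlogp=5.0, max_mw=500.0, max_hbd=5, max_hba=10):
    """Keep hits passing the (placeholder) property filters, best score first."""
    passing = [
        h for h in hits
        if h.xlogp <= max_xlogp and h.mol_weight <= max_mw
        and h.hbd <= max_hbd and h.hba <= max_hba
    ]
    return sorted(
        passing,
        key=lambda h: tanimoto_combo(h.shape_tanimoto, h.color_tanimoto),
        reverse=True,
    )
```

Such a filter only pre-sorts the screening output; the final choice among the top-ranked compounds remains a manual step.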


Figure 39. Potential cruzain inhibitors selected from the ZINC database based on virtual screening with vROCS using cruzain substrate positions P3-P1 as template.

The compounds selected from screening with substrate positions P2-P1’ are given in Figure 40.


Figure 40. Potential cruzain inhibitors selected from the ZINC database based on virtual screening with vROCS using cruzain substrate positions P2-P1’ as template.

The compounds selected using cruzain substrate positions P3-P1’ as template are given in Figure 41.

Figure 41. Potential cruzain inhibitors selected from the ZINC database based on virtual screening with vROCS using cruzain substrate positions P3-P1’ as template.


7. Investigation of the Human Collagenolytic/Elastolytic Cysteine Protease Cathepsin K

7.1 Cathepsin K

Human cathepsin K is a papain-like cysteine protease that is a major drug target in the treatment of osteoporosis [199], [200], [201]. Cathepsin K shows not only collagenolytic activity, but also potent elastolytic activity, as do the human cathepsins V and S [198]. Collagenolytic and elastolytic activity is regulated by two exosites, exosite-1 and exosite-2 [198]. Figure 43 illustrates the location of exosites 1 and 2 in cathepsin K. The inhibition of exosite-1 or exosite-2 inhibits the elastase activity of cathepsin K, while the inhibition of exosite-1 inhibits the collagenolytic activity of cathepsin K. Neither exosite has an influence on the catalytic activity or the subsite specificity of cathepsin K, and inhibition of the exosites does not block the degradation of other biological substrates [198]. Through its collagenolytic activity, cathepsin K facilitates physiological as well as pathological bone degradation [198]. The cleavage site sequence logo for cathepsin K is given in Figure 42.


Figure 42. Cathepsin K cleavage site sequence logo. Cathepsin K shows high specificity for neutral amino acids in the S2 subpocket, mainly Leu, Ile and Val (2207 substrates). The cleavage site sequence logo was taken from the MEROPS database[13].

Figure 43. Cathepsin K with colored exosites and allosteric inhibitor DHT bound to exosite-1 [198]. Exosite-1 is colored in green, exosite-2 is colored in violet. The inhibition of exosite-1 or exosite-2 inhibits the elastase activity of cathepsin K, while the inhibition of exosite-1 inhibits the collagenolytic activity [198]. The structure of cathepsin K with colored exosites 1 and 2 and bound inhibitor DHT was provided by Brömme et al. [198].

Collagenolytically active cathepsin K is organized into C-shaped protease dimers that contain a binding site for collagen. Collagen binding is aided by glycosaminoglycans. The non-active-site residues Glu-21 and Glu-92 participate in collagen unfolding [197]. Figure 44 shows cathepsin K in its dimeric form with the glycosaminoglycan aiding the binding of collagen to the collagen-binding interface. Inhibition of the binding interface or removal of the glycosaminoglycans blocks collagenolytic activity [197].

Figure 44. Cathepsin K in its dimeric form. The binding site interface is colored in blue and green to illustrate that the collagen binding interface consists of two different parts of the dimer-constituting monomers. The active site is colored in violet.

Based on the dimer crystal structure, Brömme et al. modeled a tetrameric structure, with monomer units arranged like beads on a string on glycosaminoglycan chains, as illustrated in Figure 45.


Figure 45. Cathepsin K in its tetrameric form with bound glycosaminoglycans. Within the tetramer, monomer units 1 and 3 and monomer units 2 and 4 are symmetric. The binding site interface is colored in blue, the active site is colored in red.

In collaboration with Brömme et al., the influence of different types of cathepsin K inhibitors on backbone and side chain flexibility of ligand-free cathepsin K was investigated. Backbone flexibility was calculated based on global Cα alignment, side chain flexibility based on local alignment (see Section 3.5). The influence of chondroitin sulfate binding on the flexibility of cathepsin K inhibited with the covalent inhibitor E64 [202] was also investigated. Also of interest were the effects of the single mutation D55N and the double mutation K119D/K176E on the flexibility of cathepsin K and the effect of chondroitin sulfate binding on ligand-free cathepsin K. The effects of tetramerization of E64-inhibited cathepsin K and of dimerization of ligand-free cathepsin K, both in the presence of chondroitin sulfate, were investigated as well. Backbone and side chain flexibility was calculated for the subpockets S4-S2’ as well as for the exosite-1. Subpocket and exosite-1 definitions were provided by Brömme et al. (compare [197] and [198]). In the following, the tasks for flexibility comparison are summarized:

● Effect of D55N single mutation on the flexibility of ligand-free cathepsin K.

● Effect of K119D/K176E double mutation on the flexibility of ligand-free cathepsin K.

● Effect of the reversible carbazone-based inhibitor CT 1 [203] on the flexibility of ligand-free cathepsin K.

● Effect of covalent inhibitor E64 [202] on the flexibility of ligand-free cathepsin K.

● Effect of allosteric NSC13445 inhibitor bound to site 6 [204] on the flexibility of ligand-free cathepsin K.

● Effect of CS binding on flexibility of the ligand-free cathepsin K.

● Effect of tetramerization of E64 inhibited cathepsin K in the presence of chondroitin sulfate on the flexibility.

● Effect of dimerization of ligand-free cathepsin K in the presence of chondroitin sulfate on the flexibility.

The structure of wild type cathepsin K as well as the structure of the E64-inhibited tetramer with bound chondroitin sulfate was obtained from Brömme et al. Single and double mutations were performed with MOE [71]. The reversible carbazone-based inhibitor was simulated starting from the crystal structure [203], while the covalent inhibitor was simulated starting from the crystal structure 1ATK [202]. The structure with the allosteric inhibitor NSC13445 bound to site 6 was 4LEG [204]. The dimeric structure of the chondroitin sulfate bound cathepsin K was 4N79 [197]. MD simulations were performed using the parameters described in Section 3.4. Parameters for ligands were obtained using the antechamber tools of Amber12 [95]. Parameters for the glycosaminoglycan chains of the chondroitin sulfate bound to cathepsin K were taken from the GLYCAM06 force field [205].


Figure 46 illustrates subpockets S4-S2’ as well as the exosite-1 of cathepsin K. Substrate specificity of cathepsin K is mainly located in S2-S2’. The residues present in the pocket definitions are given in Appendix C1, the flexibility differences for each investigated case are given in Appendix C2.


Figure 46. Cathepsin K subpockets. Subpockets S4-S2’ and the exosite-1 are illustrated. Cathepsin K specificity is mainly located in subpockets S2-S2’.

7.2 Impact of Single and Double Mutations on the Flexibility of Cathepsin K

Single Mutant D55N. The first mutant investigated was the single mutant D55N. The mutation is located adjacent to the binding site, as illustrated in Figure 47.


Figure 47. Cathepsin K D55N single mutant.

Apart from the S3 subpocket, backbone flexibility based on global Cα alignment is higher for the D55N mutant. Especially the exosite-1 seems to be mobilized by the presence of the D55N mutation. The mutation also causes a rupture in the helix structure, as illustrated in Figure 48.

Side chain flexibility is generally lower in the D55N mutant, whereas it is increased in subpockets S1’ and S2’ as well as in the exosite-1. The D55N mutation close to the exosite-1 and the S3-S2 subpockets of the binding site seems to highly mobilize the exosite-1.


Figure 48. Impact of D55N mutation on helix structure in cathepsin K. MD simulations reveal a disrupted helix structure in the D55N mutant.

Figure 49 shows the combined 2D-RMSD plot between snapshots obtained from MD trajectories of the wild type and the D55N mutant. Wild type and D55N mutant sample different conformational spaces, but the RMSD values between them do not exceed 2-3 Å. Structural differences between the wild type and the D55N mutant are therefore small.
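The construction of such a combined 2D-RMSD matrix can be sketched in a few lines of Python. The sketch assumes that the Cα coordinates of both trajectories have already been superimposed onto a common reference, so that no per-pair fitting is carried out; the function names and the array layout are illustrative assumptions rather than the tools used for Figure 49.

```python
import numpy as np


def pairwise_rmsd(traj_a, traj_b):
    """RMSD between every frame of traj_a and every frame of traj_b.

    traj_a, traj_b: arrays of shape (n_frames, n_atoms, 3) in Angstrom.
    Assumption: all frames are pre-aligned on the same reference.
    Returns an array of shape (n_frames_a, n_frames_b).
    """
    a = np.asarray(traj_a, dtype=float)
    b = np.asarray(traj_b, dtype=float)
    diff = a[:, None, :, :] - b[None, :, :, :]
    return np.sqrt((diff ** 2).sum(axis=3).mean(axis=2))


def combined_2d_rmsd(traj_a, traj_b):
    """Concatenate two trajectories and compute the full 2D-RMSD matrix.

    The diagonal blocks hold the RMSD within each trajectory, the
    off-diagonal blocks the RMSD between the two trajectories.
    """
    combined = np.concatenate(
        [np.asarray(traj_a, dtype=float), np.asarray(traj_b, dtype=float)], axis=0
    )
    return pairwise_rmsd(combined, combined)
```

Off-diagonal blocks with clearly higher values than the diagonal blocks indicate that the two systems sample different conformational space. For long trajectories the intermediate difference array becomes large, so frames would usually be subsampled first.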


Figure 49. Combined 2D-RMSD based on Cα alignment for comparison between wild type and D55N mutant of cathepsin K. Wild type and D55N mutant sample different conformational space, with Cα RMSD values not exceeding 2-3 Å.

Double Mutant K119D/K176E. Figure 50 illustrates the location of the double mutation K119D/K176E. The mutations turn the positive charges into negative charges and thus invert the electrostatics at the mutation positions.


Figure 50. Double mutation K119D/K176E.

Like the single mutant D55N, the double mutant K119D/K176E shows increased backbone flexibility in all subpockets S4-S2’ and the exosite-1. The backbone of the exosite-1 in particular is mobilized. With regard to side chain flexibility, subpockets S3 and S2 are more flexible in the wild type than in the double mutant K119D/K176E. As for the single mutant D55N, the side chain flexibility of the exosite-1 is increased the most by the mutations K119D/K176E.

7.3 Impact of the Mechanism of Inhibition on the Flexibility of Cathepsin K

Reversible Inhibitor. The reversible semicarbazone-based inhibitor CT 1 [203] and its effects on subpocket S4-S2’ backbone and side chain flexibility and exosite-1 flexibilities were investigated. Figure 51 illustrates the reversible inhibitor CT 1 bound to the active site of cathepsin K and the interacting residues within 4 Å distance of CT 1.

Figure 51. Reversible inhibitor CT 1[203] bound to active site of cathepsin K. The inhibitor is colored in green, interacting residues within 4 Å distance of CT 1 are colored in yellow.

Backbone flexibility is reduced for all subpockets but subpocket S2 in the reversibly inhibited cathepsin K structure. The exosite-1 backbone flexibility, however, is increased by binding of the reversible inhibitor. Side chain flexibility is also higher in the non-inhibited cathepsin K structure. Binding of the reversible inhibitor CT 1 seems to have no effect on the side chain flexibility of the exosite-1.

Covalent Inhibitor. The impact of the binding of the covalent inhibitor E64 [202] on cathepsin K subpocket and exosite-1 backbone and side chain flexibility was investigated as well. E64 covalently binds to Cys-25 of the catalytic center and inhibits the catalytic activity of cathepsin K. Figure 52 illustrates the binding of E64 to the active site of cathepsin K and shows the interacting residues of cathepsin K.


Figure 52. Covalent inhibitor E64 [202] bound to the binding site of cathepsin K. E64 is colored in green, the interacting residues of cathepsin K are colored in yellow. E64 covalently binds to Cys-25 and inhibits the catalytic activity of cathepsin K.

Apart from exosite-1, backbone flexibility is higher in the inhibitor-free structure than in the covalently inhibited structure. The backbone flexibility of the exosite-1 is slightly increased by the binding of E64. Side chain flexibility is lower in covalently inhibited cathepsin K when compared to inhibitor-free cathepsin K for all subpockets and exosite-1.

Allosteric Inhibitor. Figure 53 illustrates the binding of the allosteric inhibitor NSC13445 to site 6 of cathepsin K [204]. Site 6 is located on the opposite side of exosite-1 and comprises a patch of positively charged amino acids.


Figure 53. Allosteric inhibitor NSC13445 bound to site 6 of cathepsin K[204]. NSC13445 is colored in green, interacting residues of site 6 in cathepsin K are colored in yellow.

The allosteric inhibitor NSC13445 lowers backbone flexibility in all subpockets and the exosite-1. Side chain flexibility is lower in allosterically inhibited cathepsin K for all subpockets but subpockets S3-S2.


7.4 Impact of Chondroitin Sulfate Binding on the Flexibility of Cathepsin K

Figure 54 illustrates chondroitin sulfate binding to cathepsin K.

Figure 54. Chondroitin sulfate bound to cathepsin K. Chondroitin sulfate is colored in green, interacting residues of cathepsin K are colored in yellow.

Chondroitin sulfate binding only has a mild effect on backbone flexibility. Backbone flexibility within the exosite-1, however, is significantly lowered by chondroitin sulfate binding. Side chain flexibility is lowered for all subpockets and the exosite-1 by chondroitin sulfate binding.

7.5 Impact of Dimerization in the Presence of Chondroitin Sulfate on the Flexibility of Cathepsin K

With regard to dimerization, only the impact of dimerization in the presence of chondroitin sulfate on the backbone flexibility of cathepsin K was investigated.

Dimerization in the presence of chondroitin sulfate causes an increase in backbone flexibility for all subpockets and exosite-1 in both units of the dimer.


7.6 Impact of Tetramerization in the Presence of Chondroitin Sulfate on the Flexibility of Cathepsin K

Backbone flexibility is higher in all subpockets and exosite-1 for all units but unit 2. The difference in exosite-1 flexibility in comparison to monomeric cathepsin K is small, however.

Figure 55 shows the combined Cα RMSD plots for all four monomer units. The results indicate that all four units sample a different conformational space. However, RMSD values are not higher than 3 Å and thus not high, considering the size of the tetramer. The reason for the difference in units 1 and 3 and units 2 and 4, which should be equivalent, could be insufficient sampling due to limitations of the MD simulation methodology.

Figure 55. Combined 2D-RMSD plot based on Cα alignment for all four units of the tetrameric form of cathepsin K. All four subunits sample a different conformational space. However, the Cα RMSD values are not higher than 3 Å, indicating small differences in monomer structures.


7.7 Discussion

The elastolytic and collagenolytic activity of cathepsin K is known to be regulated by allosteric mechanisms [206], [204]. Exosites 1 and 2 regulate elastolytic activity, while exosite-1 regulates collagenolytic activity [198]. Glycosaminoglycans are known to increase the elastolytic activity of cathepsin K [206].

The binding of the allosteric inhibitor NSC13445 leads to a pronounced decrease in the flexibility of the exosite-1, but only a mild decrease in flexibility of subpockets both in terms of the backbone and the side chain flexibility. Allosteric inhibition thus does not affect the binding site of cathepsin K, which is in line with previous findings, where the inhibition of elastolytic or collagenolytic activity did not impair the recognition of other substrates [204]. Allosteric inhibitors targeting cathepsin K exosites for selective inhibition of elastolytic and collagenolytic activity have already been described by Panwar et al. [207].

In contrast, the binding of chondroitin sulfate leads to an increase in exosite-1 flexibility. This is in line with the findings of Novinec et al. [206], who identified glycosaminoglycans as crucial factors for the increase of the elastolytic activity of cathepsin K.

The effects of the single mutation D55N and the double mutation K119D/K176E on the flexibility of cathepsin K subpockets and exosites may be explained by changes in the hydrogen bonding network spanning over the entire protein.

The decrease in the flexibility of subpockets close to the binding site of the reversible inhibitor CT 1 and the covalent inhibitor E64 is not surprising, as the movement of the backbone and side chains is impaired by the presence of the inhibitors in the binding site.

Dimerization in the presence of chondroitin sulfate also leads to an increase in the flexibility of the exosite-1. The results indicate that the increased flexibility of exosite-1 aids the activation of the collagenolytic activity of cathepsin K [198].

Regarding the tetrameric structure, for all monomer units except unit 2 an increase in flexibility of exosite-1 was observed in MD simulations in comparison to E64-inhibited cathepsin K.

Despite being equivalent, units 1 and 3 as well as units 2 and 4 differ in their conformational space during MD simulations. As compared to the monomeric form, the tetrameric form of cathepsin K is a large system, so the simulations might not be fully converged during the simulation time of 1 μs. Longer simulation times and enhanced sampling methods might cause the equivalent monomer units to converge towards the same conformations. Further experiments on the activity of the tetramer would help understand the biological relevance of its behavior in terms of flexibility observed in MD simulations.


8. Conclusion and Prospects

The main goal of this thesis was the determination of the molecular drivers of serine protease specificity. A series of nine serine proteases with chymotrypsin fold was investigated in terms of backbone and side chain flexibility as measurements of flexibility on different time scales, water thermodynamics and conformational binding enthalpy of subpockets and binding sites.

While in the case of thrombin specificity seems to be mainly determined by backbone flexibility, the results for other protease examples point towards a combination of the characteristics investigated in determining protease specificity.

The first approach, which used probe molecules that simultaneously test for van der Waals and electrostatic interactions to determine the conformational binding enthalpy, was not able to explain serine protease substrate specificity satisfactorily. In a second approach, only electrostatic interactions were therefore considered. A negative and a positive probe testing only electrostatic interactions with the serine protease binding site were designed to determine electrostatic molecular interaction fields (eMIFs). Simultaneously, a metric for specificity was developed that groups amino acids into bins according to their charge. Based on that metric, electrostatic substrate similarity was calculated and compared to electrostatic molecular interaction fields. Due to the long-range character of electrostatics, the results for X-ray structures showed high correlation between eMIF similarity and electrostatic substrate readout similarity. The assignment of electrostatics to specific subpockets is still a challenging task. It is, however, necessary for the prediction of substrate readout similarity in serine protease subpockets. Therefore, future work will continue improving the approach presented in this work. Electrostatic substrate recognition is only the first step in substrate recognition. Complementary to this first step, shape recognition is important in the second step. Further investigations will be directed at the shape preferences of serine proteases for certain amino acids.
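The idea of a charge-based specificity metric can be illustrated with a small Python sketch. The bin assignment (Lys and Arg as positive, Asp and Glu as negative, all other residues including His as neutral) and the overlap-based similarity measure are assumptions chosen for illustration; they are not necessarily the definitions used in this thesis, and all names are hypothetical.

```python
from collections import Counter

# Assumed charge bins; every residue not listed here counts as neutral.
CHARGE_BIN = {"K": "positive", "R": "positive", "D": "negative", "E": "negative"}
BINS = ("positive", "negative", "neutral")


def charge_bin(residue):
    """Map a one-letter amino acid code to its charge bin."""
    return CHARGE_BIN.get(residue.upper(), "neutral")


def bin_frequencies(residues):
    """Relative frequency of each charge bin among the given residues."""
    counts = Counter(charge_bin(r) for r in residues)
    total = sum(counts.values())
    return {b: counts.get(b, 0) / total for b in BINS}


def electrostatic_similarity(residues_a, residues_b):
    """Overlap of two charge-bin distributions (0 = disjoint, 1 = identical)."""
    freq_a = bin_frequencies(residues_a)
    freq_b = bin_frequencies(residues_b)
    return sum(min(freq_a[b], freq_b[b]) for b in BINS)
```

Applied to the residues observed at one substrate position for two proteases, the overlap gives a per-position similarity that can be compared with the similarity of the corresponding eMIFs.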

Based on the knowledge on serine proteases and the information on protease substrate sequences present in the MEROPS database, a method for shape-based virtual screening was created that finds new small molecule inhibitors based on known protease substrates. The method was tested on four serine proteases with chymotrypsin fold, leading to good results in terms of AUC and early enrichment when screening the challenging DUD-E database.
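For readers unfamiliar with the two performance measures, the following minimal Python sketch shows how the AUC and an enrichment factor can be computed from a ranked screening result. It is a generic illustration under the assumption that higher scores are better and that each compound carries an active/decoy label; the function names and the tie handling are choices made here and not taken from the evaluation in this thesis.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen active scores higher
    than a randomly chosen decoy; ties are counted as one half.
    """
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    decoys = [s for s, is_active in zip(scores, labels) if not is_active]
    wins = sum(
        1.0 if a > d else 0.5 if a == d else 0.0
        for a in actives for d in decoys
    )
    return wins / (len(actives) * len(decoys))


def enrichment_factor(scores, labels, fraction=0.01):
    """Fraction of actives in the top-ranked subset relative to the whole set."""
    n_total = len(scores)
    n_top = max(1, int(round(fraction * n_total)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    actives_top = sum(1 for _, is_active in ranked[:n_top] if is_active)
    actives_total = sum(1 for is_active in labels if is_active)
    return (actives_top / n_top) / (actives_total / n_total)
```

An AUC of 0.5 and an enrichment factor of 1 correspond to random ranking. The pairwise AUC is quadratic in the number of compounds, so for databases of DUD-E size a rank-based implementation would be preferable.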

The methods developed for serine proteases were subsequently applied to the parasitic cysteine proteases cruzain and rhodesain, important drug targets when fighting Chagas disease and Human African Trypanosomiasis (HAT). As selectivity over the structurally similar human cathepsins B and L is still a challenging task in drug design, the parasitic and the human cysteine proteases were compared using three flexibility metrics: backbone flexibility based on global Cα alignment, backbone flexibility based on dihedral angles and side chain flexibility based on local alignment. While the significantly higher backbone flexibility of cathepsin B as compared to cruzain and rhodesain may be exploited in rational drug design, neither the backbone nor the side chain flexibility of cathepsin L differs notably enough from that of the parasitic cysteine proteases. Achieving selectivity against cathepsin L remains a challenging task.

The shape-based virtual screening approach built on peptide substrate sequences, which was developed for serine proteases, was applied to cruzain to find new potential cruzain inhibitors. A set of 24 potential small molecule inhibitors was selected after screening the ZINC Leads Now database, applying the criteria of commercial availability, molecular weight, xlogP, number of rotatable bonds, potential toxicity and chemical knowledge. Compounds are currently being tested for activity against cruzain and rhodesain at UFMG, Belo Horizonte, Minas Gerais.

The flexibility metrics for determining backbone flexibility based on global alignment and side chain flexibility based on local alignment were also applied to the human cysteine protease cathepsin K, important for physiological and pathological bone degradation. The impact of different types of inhibitors, namely a reversible, a covalent and an allosteric inhibitor, on the backbone and side chain flexibility of cathepsin K subpockets and exosite-1, important for regulating collagenolytic and elastolytic activity, was investigated. In addition, the effect of chondroitin sulfate binding to cathepsin K on the flexibility was of interest, as well as the effect of dimerization and tetramerization in the presence of chondroitin sulfate. The results point towards a decrease in flexibility in the exosite-1 for the allosteric inhibitor, which might be the cause of the inhibited collagenolytic and elastolytic activity. The binding of chondroitin sulfate leads to an increase in the flexibility of exosite-1, which is in line with the activation of the elastolytic activity of cathepsin K by the binding of glycosaminoglycans described in the literature. Dimerization seems to further increase exosite-1 flexibility, which is another indication that increased exosite-1 flexibility is important for collagenolytic and elastolytic activity. To understand the results obtained for the tetrameric form of cathepsin K, further simulations with longer sampling times as well as further experiments on the effect of tetramerization on cathepsin K activity are required.

The results presented in this thesis form the basis for an in-depth understanding of the key molecular drivers of specificity in proteases. With increasing knowledge on protease structures and an increasing amount of data on protease substrate sequences, the methods developed within this work can be applied for the generation of a model able to predict substrate specificity of (serine) proteases. The methods presented in this work may also be applied to other systems, such as kinases, and represent a major breakthrough in rational drug design.


9. Bibliography

1. Vanaman, T.C. and R.A. Bradshaw, Proteases in Cellular Regulation Minireview Series. J. Biol. Chem., 1999. 274: p. 20047-20047.

2. ten Cate, H., T.M. Hackeng, and P.G. de Frutos, Coagulation Factor and Protease Pathways in Thrombosis and Cardiovascular Disease. Thromb. Haemostasis, 2017. 117(7): p. 1265-1271.

3. Vandooren, J., et al., Proteases in Cancer Drug Delivery. Adv. Drug Delivery Rev., 2016. 97: p. 144-155.

4. Mullooly, M., et al., The ADAMs Family of Proteases as Targets for the Treatment of Cancer. Cancer Biol. Ther., 2016. 17(8): p. 870-880.

5. Tremblay, N., A.Y. Park, and D. Lamarre, HCV NS3/4A Protease Inhibitors and the Road to Effective Direct-Acting Antiviral Therapies, in Hepatitis C Virus II. 2016, Springer. p. 257-285.

6. Ghosh, A.K., H.L. Osswald, and G. Prato, Recent Progress in the Development of HIV-1 Protease Inhibitors for the Treatment of HIV/AIDS. J. Med. Chem., 2016. 59(11): p. 5172-5208.

7. McKerrow, J.H., et al., Development of Protease Inhibitors for Protozoan Infections. Curr. Opin. Infect. Dis., 2008. 21(6): p. 668.

8. Turk, B., Targeting Proteases: Successes, Failures and Future Prospects. Nat. Rev. Drug Discovery, 2006. 5(9): p. 785-799.

9. Drag, M. and G.S. Salvesen, Emerging Principles in Protease-Based Drug Discovery. Nat. Rev. Drug Discovery, 2010. 9: p. 690-701.

10. Santos, R., et al., A Comprehensive Map of Molecular Drug Targets. Nat. Rev. Drug Discovery, 2017. 16(1): p. 19-34.

11. Di Cera, E., Serine Proteases. IUBMB Life, 2009. 61: p. 510-515.


12. Berman, H., K. Henrick, and H. Nakamura, Announcing the Worldwide Protein Data Bank. Nat. Struct. Mol. Biol., 2003. 10: p. 980-980.

13. Rawlings, N.D., Peptidase Specificity from the Substrate Cleavage Collection in the MEROPS Database and a Tool to Measure Cleavage Site Conservation. Biochimie, 2016. 122: p. 5-30.

14. Hengartner, M.O., The Biochemistry of Apoptosis. Nature, 2000. 407: p. 770-776.

15. Davie, E.W., K. Fujikawa, and W. Kisiel, The Coagulation Cascade: Initiation, Maintenance, and Regulation. Biochemistry, 1991. 30: p. 10363-10370.

16. Rudd, P.M., et al., Glycosylation and the Immune System. Science, 2001. 291: p. 2370-2376.

17. Puente, X., et al., A Genomic View of the Complexity of Mammalian Proteolytic Systems. Biochem. Soc. Trans., 2005. 33: p. 331.

18. Radzicka, A. and R. Wolfenden, Rates of Uncatalyzed Peptide Bond Hydrolysis in Neutral Solution and the Transition State Affinities of Proteases. J. Am. Chem. Soc., 1996. 118: p. 6105-6109.

19. Schechter, I. and A. Berger, Protease Subsite Nomenclature. Biochem. Biophys. Res. Commun., 1967. 27: p. 157-162.

20. Kraut, J., Serine Proteases: Structure and Mechanism of Catalysis. Annu. Rev. Biochem., 1977. 46: p. 331-358.

21. Rawlings, N.D. and A.J. Barrett, MEROPS: The Peptidase Database. Nucleic Acids Res., 2000. 28: p. 323-325.

22. Rawlings, N.D., et al., MEROPS: The Database of Proteolytic Enzymes, their Substrates and Inhibitors. Nucleic Acids Res., 2014. 42: p. D503-D509.

23. Bazan, J.F. and R.J. Fletterick, Viral Cysteine Proteases are Homologous to the Trypsin-Like Family of Serine Proteases: Structural and Functional Implications. Proc. Natl. Acad. Sci., 1988. 85(21): p. 7872-7876.

24. Gorbalenya, A.E., et al., Cysteine Proteases of Positive Strand RNA Viruses and Chymotrypsin-Like Serine Proteases: A Distinct Protein Superfamily with a Common Structural Fold. FEBS Lett., 1989. 243(2): p. 103-114.

25. Hedstrom, L., Serine Protease Mechanism and Specificity. Chem. Rev. (Washington, DC, U. S.),

2002. 102: p. 4501-4524.

26. Carter, P. and J.A. Wells, Dissecting the Catalytic Triad of a Serine Protease. Nature, 1988. 332: p.

564-568.

27. Polgar, L. and P. Halasz, Current Problems in Mechanistic Studies of Serine and Cysteine Proteinases.

Biochem. J., 1982. 207(1): p. 1.

28. Dall, E. and H. Brandstetter, Structure and Function of Legumain in Health and Disease. Biochimie,

2016. 122: p. 126-150.

29. Thornberry, N.A., The Caspase Family of Cysteine Proteases. Br. Med. Bull., 1997. 53(3): p. 478-490.

30. Turk, B., D. Turk, and V. Turk, Lysosomal Cysteine Proteases: More than Scavengers. Biochim.

Biophys. Acta, Protein Struct. Mol. Enzymol., 2000. 1477(1): p. 98-111.

31. Turk, V., B. Turk, and D. Turk, Lysosomal Cysteine Proteases: Facts and Opportunities. The EMBO

Journal, 2001. 20(17): p. 4629-4633.

32. Lecaille, F., J. Kaleta, and D. Brömme, Human and Parasitic Papain-Like Cysteine Proteases: Their

Role in Physiology and Pathology and Recent Developments in Inhibitor Design. Chem. Rev., 2002.

102(12): p. 4459-4488.

33. Sajid, M. and J.H. McKerrow, Cysteine Proteases of Parasitic Organisms. Mol. Biochem. Parasitol.,

2002. 120(1): p. 1-21.

34. Bekono, B.D., et al., Targeting Cysteine Proteases from Plasmodium Falciparum: A General Overview,

Rational Drug Design and Computational Approaches for Drug Discovery. Curr. Drug Targets, 2016.

35. Latorre, A., et al., Dipeptidyl Nitroalkenes as Potent Reversible Inhibitors of Cysteine Proteases

Rhodesain and Cruzain. ACS Med. Chem. Lett., 2016. 7(12): p. 1073-1076.

36. Ettari, R., et al., The Inhibition of Cysteine Proteases Rhodesain and TbCatB: A Valuable Approach to

Treat Human African Trypanosomiasis. Mini-Rev. Med. Chem., 2016. 16(17): p. 1374-1391.

37. Ferreira, L.G. and A.D. Andricopulo, Targeting Cysteine Proteases in Trypanosomatid Disease Drug

Discovery. Pharmacol. Ther., 2017.

38. Siklos, M., M. BenAissa, and G.R. Thatcher, Cysteine Proteases as Therapeutic Targets: Does Selectivity Matter? A Systematic Review of Calpain and Cathepsin Inhibitors. Acta Pharm. Sin. B, 2015. 5(6): p. 506-519.

39. López-Otín, C. and C.M. Overall, Protease degradomics: A new challenge for proteomics. Nature

Reviews Molecular Cell Biology, 2002. 3: p. 509-519.

40. Schneider, T.D. and R.M. Stephens, Sequence Logos: A New Way to Display Consensus Sequences.

Nucleic Acids Res., 1990. 18: p. 6097–6100.

41. Fuchs, J.E., et al., Cleavage Entropy as Quantitative Measure of Protease Specificity. PLoS Comput.

Biol., 2013. 9: p. e1003007.

42. Perona, J.J. and C.S. Craik, Structural Basis of Substrate Specificity in the Serine Proteases. Protein

Sci., 1995. 4: p. 337-360.

43. Verspurten, J., et al., SitePredicting the Cleavage of Proteinase Substrates. Trends Biochem. Sci.,

2009. 34: p. 319-323.

44. Bode, W. and R. Huber, Proteinase-Protein Inhibitor Interaction, in Molecular Aspects of

Inflammation. 1991, Springer. p. 103-115.

45. Perona, J.J., et al., Structural Origins of Substrate Discrimination in Trypsin and Chymotrypsin.

Biochemistry, 1995. 34: p. 1489-1499.

46. Steitz, T., R. Henderson, and D. Blow, Structure of Crystalline α-Chymotrypsin: III. Crystallographic Studies of Substrates and Inhibitors Bound to the Active Site of α-Chymotrypsin. J. Mol. Biol., 1969. 46: p. 337-348.

47. Hedstrom, L., L. Szilagyi, and W.J. Rutter, Converting Trypsin to Chymotrypsin: The Role of Surface

Loops. Science, 1992. 255: p. 1249-1253.

48. Venekei, I., et al., Attempts to Convert Chymotrypsin to Trypsin. FEBS Lett., 1996. 379: p. 143-147.

49. Perona, J.J. and C.S. Craik, Evolutionary Divergence of Substrate Specificity within the

Chymotrypsin-Like Serine Protease Fold. J. Biol. Chem., 1997. 272: p. 29987-29990.

50. Ma, W., C. Tang, and L. Lai, Specificity of Trypsin and Chymotrypsin: Loop-Motion-Controlled

Dynamic Correlation as a Determinant. Biophys. J., 2005. 89: p. 1183-1193.

51. Bone, R., J.L. Silen, and D.A. Agard, Structural Plasticity Broadens the Specificity of an Engineered

Protease. Nature, 1989. 339: p. 191-195.

52. Bone, R., et al., Structural Basis for Broad Specificity in α-Lytic Protease Mutants. Biochemistry, 1991. 30: p. 10388-10398.

53. Schellenberger, V., C.W. Turck, and W.J. Rutter, Role of the S' Subsites in Serine Protease Catalysis. Active-Site Mapping of Rat Chymotrypsin, Rat Trypsin, α-Lytic Protease, and Cercarial Protease from Schistosoma Mansoni. Biochemistry, 1994. 33: p. 4251-4257.

54. Ota, N. and D.A. Agard, Enzyme Specificity under Dynamic Control II: Principal Component

Analysis of α-lytic Protease Using Global and Local Solvent Boundary Conditions. Protein Sci., 2001. 10: p. 1403-1414.

55. Miller, D.W. and D.A. Agard, Enzyme Specificity under Dynamic Control: A Normal Mode Analysis of α-Lytic Protease. J. Mol. Biol., 1999. 286: p. 267-278.

56. Rose, R.B., C.S. Craik, and R.M. Stroud, Domain Flexibility in Retroviral Proteases: Structural

Implications for Drug Resistant Mutations. Biochemistry, 1998. 37: p. 2607-2621.

57. Baron, R., P.H. Hünenberger, and J.A. McCammon, Absolute Single-Molecule Entropies from Quasi-Harmonic Analysis of Microsecond Molecular Dynamics: Correction Terms and Convergence Properties. J. Chem. Theory Comput., 2009. 5: p. 3150-3160.

58. Wallnoefer, H.G., et al., Backbone Flexibility Controls the Activity and Specificity of a Protein-Protein Interface: Specificity in Snake Venom Metalloproteases. J. Am. Chem. Soc., 2010. 132: p. 10330–10337.

59. Fuchs, J.E., et al., Specificity of a Protein–Protein Interface: Local Dynamics Direct Substrate

Recognition of Effector Caspases. Proteins: Struct., Funct., Bioinf., 2013.

60. Fuchs, J.E., et al., Dynamics Govern Specificity of a Protein-Protein Interface: Substrate Recognition by Thrombin. PloS one, 2015. 10(10): p. e0140713.

61. Fischer, E., Einfluss der Configuration auf die Wirkung der Enzyme. Berichte der deutschen chemischen Gesellschaft, 1894. 27: p. 2985-2993.

62. Koshland, D., Enzyme Flexibility and Enzyme Action. J. Cell. Comp. Physiol., 1959. 54: p. 245-258.

63. Kumar, S., et al., Folding and Binding Cascades: Dynamic Landscapes and Population Shifts. Protein

Sci., 2000. 9: p. 10-19.

64. Závodszky, P., L. Abaturov, and Y.M. Varshavsky, Structure of Glyceraldehyde-3-phosphate

Dehydrogenase and its Alteration by Coenzyme Binding. Acta Biochim. Biophys. Acad. Sci. Hung.,

1966. 1: p. 389-403.

65. Ma, B. and R. Nussinov, Enzyme Dynamics Point to Stepwise Conformational Selection in Catalysis.

Curr. Opin. Chem. Biol., 2010. 14: p. 652-659.

66. Frauenfelder, H., P.G. Wolynes, and R.H. Austin, Biological physics. 1999: Springer.

67. Lange, O.F., et al., Recognition Dynamics up to Microseconds Revealed from an RDC-Derived

Ubiquitin Ensemble in Solution. Science, 2008. 320: p. 1471-1475.

68. Fenwick, R.B., et al., Weak Long-Range Correlated Motions in a Surface Patch of Ubiquitin Involved in Molecular Recognition. J. Am. Chem. Soc., 2011. 133: p. 10336-10339.

69. Long, D. and R. Brüschweiler, Directional Selection Precedes Conformational Selection in Ubiquitin-

UIM Binding. Angew. Chem., Int. Ed., 2013.

70. Bode, W., E. Meyer Jr, and J.C. Powers, Human Leukocyte and Porcine Pancreatic Elastase: X-Ray Crystal Structures, Mechanism, Substrate Specificity, and Mechanism-Based Inhibitors. Biochemistry, 1989. 28: p. 1951-1963.

71. Chemical Computing Group Inc., Molecular Operating Environment (MOE). 2013: 1010 Sherbrooke St. West, Suite #919, Montreal, QC, Canada, H3A 2R7.

72. Kozlowski, L.P., Proteome-pI: proteome isoelectric point database. Nucleic Acids Research, 2017. 45: p. D1112-D1116.

73. Schauperl, M., et al., Characterizing Protease Specificity: How Many Substrates Do We Need? PloS one, 2015. 10(11): p. e0142658.

74. Tsukada, H. and D. Blow, Structure of α-Chymotrypsin Refined at 1.68 Å Resolution. J. Mol. Biol.,

1985. 184(4): p. 703-711.

75. Würtele, M., et al., Atomic Resolution Structure of Native Porcine Pancreatic Elastase at 1.1 Å. Acta Crystallogr., Sect. D: Biol. Crystallogr., 2000. 56: p. 520–523.

76. Estébanez Perpiñá, E., W. Bode, and R. Huber, Crystal Structure of Human Granzyme B. 2003.

77. Wu, L., et al., Structural Basis for Proteolytic Specificity of the Human Apoptosis-Inducing Granzyme

M. J. Immunol., 2009. 183(1): p. 421-429.

78. Sichler, K., et al., Crystal Structures of Uninhibited Factor VIIa Link its Cofactor and Substrate-Assisted Activation to Specific Interactions. J. Mol. Biol., 2002. 322(3): p. 591-603.

79. Katz, B.A., et al., Structural Basis for Selectivity of a Small Molecule, S1-Binding, Submicromolar Inhibitor of Urokinase-Type Plasminogen Activator. Chem. Biol. (Oxford, U. K.), 2000. 7: p. 299–312.

80. Laxmikanthan, G., et al., 1.70 Å X-Ray Structure of Human Apo Kallikrein 1: Structural Changes

Upon Peptide Inhibitor/Substrate Binding. Proteins: Struct., Funct., Bioinf., 2005. 58(4): p. 802-814.

81. Stubbs, M.T., et al., The Interaction of Thrombin with Fibrinogen. Eur. J. Biochem., 1992. 206: p.

187–195.

82. Schmidt, A., et al., Trypsin Revisited Crystallography at (Sub) Atomic Resolution and Quantum

Chemistry Revealing Details of Catalysis. J. Biol. Chem., 2003. 278(44): p. 43357-43362.

83. Di Cera, E., et al., Thrombin Allostery. Phys. Chem. Chem. Phys., 2007. 9: p. 1291-1306.

84. Ferreira, R.S., et al., Complementarity Between a Docking and a High-Throughput Screen in

Discovering New Cruzain Inhibitors. J. Med. Chem., 2010. 53(13): p. 4891-4905.

85. Ehmke, V., et al., Optimization of Triazine Nitriles as Rhodesain Inhibitors: Structure–Activity

Relationships, Bioisosteric Imidazopyridine Nitriles, and X-ray Crystal Structure Analysis with Human

Cathepsin L. ChemMedChem, 2013. 8(6): p. 967-975.

86. Musil, D., et al., The Refined 2.15 Å X-Ray Crystal Structure of Human Liver Cathepsin B: The Structural Basis for its Specificity. The EMBO Journal, 1991. 10(9): p. 2321.

87. Li, Z. and H.A. Scheraga, Monte Carlo-Minimization Approach to the Multiple-Minima Problem in

Protein Folding. Proc. Natl. Acad. Sci., 1987. 84(19): p. 6611-6615.

88. Schütte, C., et al., A Direct Approach to Conformational Dynamics Based on Hybrid Monte Carlo. J.

Comput. Phys., 1999. 151(1): p. 146-168.

89. Mitsutake, A., Y. Sugita, and Y. Okamoto, Replica-Exchange Multicanonical and Multicanonical

Replica-Exchange Monte Carlo Simulations of Peptides. I. Formulation and Benchmark Test. J. Chem.

Phys., 2003. 118: p. 6664–6675.

90. Kolinski, A. and J. Skolnick, Monte Carlo Simulations of Protein Folding. I. Lattice Model and

Interaction Scheme. Proteins: Struct., Funct., Bioinf., 1994. 18(4): p. 338-352.

91. Kolinski, A. and J. Skolnick, Monte Carlo Simulations of Protein Folding. II. Application to Protein

A, ROP, and Crambin. Proteins: Struct., Funct., Bioinf., 1994. 18(4): p. 353-366.

92. Liang, F. and W.H. Wong, Evolutionary Monte Carlo for Protein Folding Simulations. J. Chem.

Phys., 2001. 115(7): p. 3374-3380.

93. Rodriguez, R., et al., Homology Modeling, Model and Software Evaluation: Three Related Resources.

Bioinformatics (Oxford, England), 1998. 14(6): p. 523-528.

94. Xiang, Z., Advances in Homology Protein Structure Modeling. Curr. Protein Pept. Sci., 2006. 7(3): p. 217-227.

95. Case, D., Amber 12: Molecular Dynamics Package. 2012: UCSF.

96. Darden, T., D. York, and L. Pedersen, Particle Mesh Ewald: An N·log(N) Method for Ewald Sums in Large Systems. J. Chem. Phys., 1993. 98: p. 10089–10092.

97. Salomon-Ferrer, R., et al., Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald. J. Chem. Theory Comput., 2013. 9: p. 3878–3888.

98. Ryckaert, J.-P., G. Ciccotti, and H.J. Berendsen, Numerical Integration of the Cartesian

Equations of Motion of a System with Constraints: Molecular Dynamics of n-Alkanes. J. Comput.

Phys., 1977. 23(3): p. 327-341.

99. Roe, D.R. and T.E. Cheatham III, PTRAJ and CPPTRAJ: Software for Processing and Analysis of Molecular Dynamics Trajectory Data. J. Chem. Theory Comput., 2013.

100. Adelman, S., Quantum Generalized Langevin Equation Approach to Gas/Solid

Collisions. Chem. Phys. Lett., 1976. 40: p. 495–499.

101. Jorgensen, W.L., et al., Comparison of Simple Potential Functions for Simulating Liquid

Water. J. Chem. Phys., 1983. 79: p. 926.

102. Wallnoefer, H.G., K.R. Liedl, and T. Fox, A Challenging System: Free Energy Prediction for

Factor Xa. J. Comput. Chem., 2011. 32: p. 1743-1752.

103. Fuchs, J.E., et al., Independent Metrics for Protein Backbone and Side-Chain Flexibility:

Time Scales and Effects of Ligand Binding. J. Chem. Theory Comput., 2015. 11(3): p. 851-860.

104. Ramachandran, G., C. Ramakrishnan, and V. Sasisekharan, Stereochemistry of Polypeptide Chain Configurations. J. Mol. Biol., 1963. 7: p. 95–99.

105. Huber, R.G., et al., Entropy from State Probabilities: Hydration Entropy of Cations. J.

Phys. Chem. B, 2013.

106. Shao, J., et al., Clustering Molecular Dynamics Trajectories: 1. Characterizing the

Performance of Different Clustering Algorithms. J. Chem. Theory Comput., 2007. 3: p. 2312-2334.

107. Xu, R. and D. Wunsch, Survey of Clustering Algorithms. IEEE Transactions on Neural

Networks, 2005. 16(3): p. 645-678.

108. Milligan, G.W. and M.C. Cooper, Methodology Review: Clustering Methods. Applied

Psychological Measurement, 1987. 11(4): p. 329-354.

109. Schauperl, M., et al., Enthalpic and Entropic Contributions to Hydrophobicity. J. Chem.

Theory Comput., 2016. 12(9): p. 4600-4610.

110. Nguyen, C., M.K. Gilson, and T. Young, Structure and Thermodynamics of Molecular

Hydration via Grid Inhomogeneous Solvation Theory. arXiv preprint arXiv:1108.4876, 2011.

111. Nguyen, C.N., T.K. Young, and M.K. Gilson, Grid Inhomogeneous Solvation Theory: Hydration Structure and Thermodynamics of the Miniature Receptor Cucurbit[7]uril. J. Chem. Phys., 2012. 137: p. 044101.

112. Goodford, P.J., A Computational Procedure for Determining Energetically Favorable

Binding Sites on Biologically Important Macromolecules. J. Med. Chem., 1985. 28: p. 849-857.

113. Carosati, E., S. Sciabola, and G. Cruciani, Hydrogen Bonding Interactions of Covalently

Bonded Fluorine Atoms: From Crystallographic Data to a New Angular Function in the GRID

Force Field. J. Med. Chem., 2004. 47(21): p. 5114-5125.

114. DeLano, W., The PyMOL Molecular Graphics System (Palo Alto, CA: DeLano Scientific

LLC). 2008.

115. Waldner, B.J., et al., Quantitative Correlation of Conformational Binding Enthalpy with

Substrate Specificity of Serine Proteases. J. Phys. Chem. B, 2015.

116. R Development Core Team, R - A language and environment for statistical computing. 2013, Vienna, Austria.

117. Grant, J. and B. Pickup, Gaussian Shape Methods. Comput. Simul. Biomol. Syst., 1997: p.

150-176.

118. Waldner, B.J., et al., Electrostatic Recognition as First Step of Substrate Binding to Serine

Proteases. Bioinformatics, 2017.

119. Sadasivan, C. and V.C. Yee, Interaction of the Factor XIII Activation Peptide with α-

Thrombin Crystal Structure of its Enzyme-Substrate Analog Complex. J. Biol. Chem., 2000. 275: p.

36942–36948.

120. Gandhi, P.S., Z. Chen, and E. Di Cera, Crystal Structure of Thrombin Bound to the

Uncleaved Extracellular Fragment of PAR1. J. Biol. Chem., 2010. 285(20): p. 15393-15398.

121. Hilpert, K., et al., Design and Synthesis of Potent and Highly Selective Thrombin

Inhibitors. J. Med. Chem., 1994. 37(23): p. 3889-3901.

122. Guennebaud, G. and B. Jacob, Eigen. URL: http://eigen.tuxfamily.org, 2010.

123. Shoichet, B.K., Virtual Screening of Chemical Libraries. Nature, 2004. 432: p. 862–865.

124. Geppert, H., M. Vogt, and J. Bajorath, Current Trends in Ligand-Based Virtual Screening: Molecular Representations, Data Mining Methods, New Application Areas, and Performance Evaluation. J. Chem. Inf. Model., 2010. 50: p. 205-216.

125. Kitchen, D.B., et al., Docking and Scoring in Virtual Screening for Drug Discovery:

Methods and Applications. Nat. Rev. Drug Discovery, 2004. 3: p. 935-949.

126. Huang, S.Y. and X. Zou, Ensemble Docking of Multiple Protein Structures: Considering

Protein Structural Variations in Molecular Docking. Proteins: Struct., Funct., Bioinf., 2007. 66: p.

399-421.

127. Horvath, D., Pharmacophore-Based Virtual Screening, in Chemoinformatics and

Computational Chemical Biology. 2011, Springer: New York. p. 261-298.

128. Schneider, G., et al., “Scaffold-Hopping” by Topological Pharmacophore Search: A

Contribution to Virtual Screening. Angew. Chem., Int. Ed., 1999. 38: p. 2894–2896.

129. ROCS. OpenEye Scientific Software, Santa Fe, NM. http://www.eyesopen.com.

130. Hawkins, P.C., A.G. Skillman, and A. Nicholls, Comparison of Shape-Matching and

Docking as Virtual Screening Tools. J. Med. Chem., 2007. 50: p. 74-82.

131. Huang, N., B.K. Shoichet, and J.J. Irwin, Benchmarking Sets for Molecular Docking. J. Med.

Chem., 2006. 49: p. 6789-6801.

132. Kirchmair, J., et al., How to Optimize Shape-Based Virtual Screening: Choosing the

Right Query and Including Chemical Information. J. Chem. Inf. Model., 2009. 49: p. 678-692.

133. Schilling, O. and C.M. Overall, Proteome-Derived, Database-Searchable Peptide

Libraries for Identifying Protease Cleavage Sites. Nat. Biotechnol., 2008. 26: p. 685–694.

134. Schilling, O., et al., Proteome-Wide Analysis of Protein Carboxy Termini: C

Terminomics. Nat. Methods, 2010. 7: p. 508-511.

135. Kleifeld, O., et al., Isotopic Labeling of Terminal Amines in Complex Samples Identifies

Protein N-Termini and Protease Cleavage Products. Nat. Biotechnol., 2010. 28: p. 281-288.

136. Waldner, B.J., et al., Protease Inhibitors in View of Peptide Substrate Databases. J.

Chem. Inf. Model., 2016.

137. Tyndall, J.D., T. Nall, and D.P. Fairlie, Proteases Universally Recognize Beta Strands in their

Active Sites. Chem. Rev. (Washington, DC, U. S.), 2005. 105: p. 973-1000.

138. Ganesan, R., et al., Extended Substrate Recognition in Caspase-3 Revealed by High

Resolution X-Ray Structure Analysis. J. Mol. Biol., 2006. 359: p. 1378-1388.

139. Mysinger, M.M., et al., Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem., 2012. 55: p. 6582-6594.

140. Hawkins, P.C., et al., Conformer Generation with OMEGA: Algorithm and Validation

Using High Quality Structures from the Protein Databank and Cambridge Structural Database. J.

Chem. Inf. Model., 2010. 50: p. 572-584.

141. Hawkins, P.C.D., A.G. Skillman, G.L. Warren, B.A. Ellingson, and M.T. Stahl, OMEGA 2.5.1.4. OpenEye Scientific Software, Santa Fe, NM. http://www.eyesopen.com.

142. McCaldon, P. and P. Argos, Oligopeptide Biases in Protein Sequences and their Use in Predicting Protein Coding Regions in Nucleotide Sequences. Proteins: Struct., Funct., Bioinf., 1988. 4: p. 99-122.

143. Good, A.C. and T.I. Oprea, Optimization of CAMD Techniques 3. Virtual Screening

Enrichment Studies: A Help or Hindrance in Tool Selection? J. Comput.-Aided Mol. Des., 2008. 22: p. 169-178.

144. Renko, M., et al., Stefin A Displaces the Occluding Loop of Cathepsin B only by as much as Required to Bind to the Active Site Cleft. The FEBS Journal, 2010. 277(20): p. 4338-4345.

145. Irwin, J.J. and B.K. Shoichet, ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. J. Chem. Inf. Model., 2005. 45(1): p. 177-182.

146. Leiros, H.K.S., et al., Trypsin Specificity as Elucidated by LIE Calculations, X-Ray

Structures, and Association Constant Measurements. Protein Sci., 2004. 13: p. 1056-1070.

147. Powers, J.C., et al., Specificity of porcine pancreatic elastase, human leukocyte elastase and cathepsin G. Inhibition with peptide chloromethyl ketones. Biochimica et Biophysica Acta, 1977. 485(1): p. 156-166.

148. Bode, W., E. Meyer, and J.C. Powers, Human leukocyte and porcine pancreatic elastase: x-ray crystal structures, mechanism, substrate specificity, and mechanism-based inhibitors.

Biochemistry, 1989. 28(5): p. 1951-1963.

149. Sayers, T.J., et al., The Restricted Expression of Granzyme M in Human Lymphocytes. J.

Immunol., 2001. 166(2): p. 765-771.

150. Pao, L.I., et al., Functional Analysis of Granzyme M and its Role in Immunity to

Infection. J. Immunol., 2005. 175(5): p. 3235-3243.

151. Kelly, J.M., et al., Granzyme M Mediates a Novel Form of Perforin-Dependent Cell Death. J.

Biol. Chem., 2004. 279(21): p. 22236-22242.

152. Lord, S.J., et al., Granzyme B: a natural born killer. Immunological Reviews, 2003. 193: p.

31-8.

153. Trapani, J.A. and V.R. Sutton, Granzyme B: Pro-Apoptotic, Antiviral and Antitumor

Functions. Curr. Opin. Chem. Biol., 2003. 15: p. 533–543.

154. Fuchs, J.E., et al., Substrate-Driven Mapping of the Degradome by Comparison of Sequence

Logos. PLoS Computational Biology, 2013. 9(11): p. e1003353.

155. Harris, J.L., et al., Definition and Redesign of the Extended Substrate Specificity of

Granzyme B. J. Biol. Chem., 1998. 273: p. 27364-27373.

156. Crooks, G.E., et al., WebLogo: A Sequence Logo Generator. Genome Res., 2004. 14: p. 1188–

1190.

157. Dang, O., A. Vindigni, and E. Di Cera, An Allosteric Switch Controls the Procoagulant and

Anticoagulant Activities of Thrombin. Proc. Natl. Acad. Sci., 1995. 92: p. 5977–5981.

158. Hauske, P., et al., Allosteric Regulation of Proteases. ChemBioChem, 2008. 9: p. 2920-2928.

159. Hamelberg, D., J. Mongan, and J.A. McCammon, Accelerated Molecular Dynamics: A

Promising and Efficient Simulation Method for Biomolecules. J. Chem. Phys., 2004. 120(24): p. 11919-

11929.

160. Sugita, Y. and Y. Okamoto, Replica-Exchange Molecular Dynamics Method for Protein

Folding. Chem. Phys. Lett., 1999. 314: p. 141–151.

161. Nguyen, C.N., et al., Thermodynamics of Water in an Enzyme Active Site: Grid-Based

Hydration Analysis of Coagulation Factor Xa. J. Chem. Theory Comput., 2014. 10(7): p. 2769-2780.

162. Timasheff, S.N., Protein Hydration, Thermodynamic Binding, and Preferential

Hydration. Biochemistry, 2002. 41(46): p. 13473-13482.

163. Rao, M.S. and A.J. Olson, Modelling of factor Xa-inhibitor complexes: a computational flexible docking approach. Proteins, 1999. 34(2): p. 173-83.

164. Pethe, M.A., A.B. Rubenstein, and S.D. Khare, Large-Scale Structure-Based Prediction and Identification of Novel Protease Substrates Using Computational Protein Design. J. Mol. Biol.,

2017. 429(2): p. 220-236.

165. Ehrt, C., T. Brinkjost, and O. Koch, Impact of Binding Site Comparisons on Medicinal

Chemistry and Rational Molecular Design. J. Med. Chem., 2016. 59(9): p. 4121-4151.

166. Ferrario, V., et al., BioGPS Descriptors for Rational Engineering of Enzyme Promiscuity and Structure Based Bioinformatic Analysis. PloS one, 2014. 9(10): p. e109354.

167. Siragusa, L., et al., BioGPS: Navigating Biological Space to Predict Polypharmacology,

Off-Targeting, and Selectivity. Proteins: Struct., Funct., Bioinf., 2015. 83(3): p. 517-532.

168. Chartier, M. and R. Najmanovich, Detection of Binding Site Molecular Interaction Field

Similarities. J. Chem. Inf. Model., 2015. 55(8): p. 1600-1615.

169. Binkowski, T.A. and A. Joachimiak, Protein Functional Surfaces: Global Shape Matching and Local Spatial Alignments of Ligand Binding Sites. BMC Struct. Biol., 2008. 8(1): p. 45.

170. Najmanovich, R., N. Kurbatova, and J. Thornton, Detection of 3D Atomic Similarities and

Their Use in the Discrimination of Small Molecule Protein-Binding Sites. Bioinformatics, 2008. 24(16): p. i105-i111.

171. Yeturu, K. and N. Chandra, PocketMatch: A New Algorithm to Compare Binding Sites in

Protein Structures. BMC Bioinf., 2008. 9(1): p. 1.

172. Leinweber, M., et al., CavSimBase: A Database for Large Scale Comparison of Protein Binding

Sites. IEEE Transactions on Knowledge and Data Engineering, 2016. 28(6): p. 1423-1434.

173. Shulman-Peleg, A., R. Nussinov, and H.J. Wolfson, SiteEngines: Recognition and

Comparison of Binding Sites and Protein–Protein Interfaces. Nucleic Acids Res., 2005. 33(suppl_2): p.

W337-W341.

174. Powers, R., et al., Searching the Protein Structure Database for Ligand-Binding Site

Similarities Using CPASS v. 2. BMC Res. Notes, 2011. 4(1): p. 17.

175. Jambon, M., et al., The SuMo Server: 3D Search for Protein Functional Sites. Bioinformatics,

2005. 21(20): p. 3929-3930.

176. Kinoshita, K. and H. Nakamura, eF-site and PDBjViewer: Database and Viewer for Protein

Functional Sites. Bioinformatics, 2004. 20(8): p. 1329-1330.

177. Huntington, J., Molecular Recognition Mechanisms of Thrombin. J. Thromb. Haemostasis,

2005. 3: p. 1861–1872.

178. Batra, J., et al., Long-Range Electrostatic Complementarity Governs Substrate Recognition by

Human Chymotrypsin C, a Key Regulator of Digestive Enzyme Activation. J. Biol. Chem., 2013. 288(14): p. 9848-9859.

179. Harris, R.C., et al., Opposites Attract: Shape and Electrostatic Complementarity in Protein-

DNA Complexes. Innovations in Biomolecular Modeling and Simulations, 2012. 2: p. 53-80.

180. Schreiber, G., G. Haran, and H.-X. Zhou, Fundamental Aspects of Protein-Protein Association Kinetics. Chem. Rev., 2009. 109(3): p. 839-860.

181. Sulea, T. and E.O. Purisima, Profiling Charge Complementarity and Selectivity for Binding at the Protein Surface. Biophys. J., 2003. 84(5): p. 2883-2896.

182. Schilling, O., C.M. Overall, and others, Factor Xa Subsite Mapping by Proteome-Derived

Peptide Libraries Improved Using WebPICS, a Resource for Proteomic Identification of Cleavage Sites. Biol.

Chem., 2011. 392: p. 1031–1037.

183. Brandstetter, H., et al., X-ray Structure of Active Site-Inhibited Clotting Factor Xa

Implications for Drug Design and Substrate Recognition. J. Biol. Chem., 1996. 271: p. 29988-29992.

184. Winquist, J., et al., Identification of Structural–Kinetic and Structural–Thermodynamic

Relationships for Thrombin Inhibitors. Biochemistry, 2013. 52: p. 613-626.

185. Rautio, J., et al., Prodrugs: Design and Clinical Applications. Nat. Rev. Drug Discovery,

2008. 7: p. 255-270.

186. Sukuru, S.C.K., et al., A Lead Discovery Strategy Driven by a Comprehensive Analysis of

Proteases in the Peptide Substrate Space. Protein Sci., 2010. 19: p. 2096-2109.

187. Rassi Jr, A., A. Rassi, and J.A. Marin-Neto, Chagas Disease. The Lancet, 2010. 375(9723): p. 1388-1402.

188. Coura, J.R. and P.A. Viñas, Chagas Disease: A New Worldwide Challenge. Nature, 2010: p.

S6.

189. Bern, C., Chagas’ Disease. N. Engl. J. Med., 2015. 373(5): p. 456-466.

190. Urbina, J.A. and R. Docampo, Specific Chemotherapy of Chagas Disease: Controversies and

Advances. Trends Parasitol., 2003. 19(11): p. 495-501.

191. Coura, J.R. and S.L. De Castro, A Critical Review on Chagas Disease Chemotherapy. Mem.

Inst. Oswaldo Cruz, 2002. 97(1): p. 3-24.

192. Urbina, J.A., Specific Treatment of Chagas Disease: Current Status and New Developments.

Curr. Opin. Infect. Dis., 2001. 14(6): p. 733-741.

193. da Silva, E.B., G.A. do Nascimento Pereira, and R. Ferreira, Trypanosomal Cysteine

Peptidases: Target Validation and Drug Design Strategies. Comprehensive Analysis of Parasite

Biology: From Metabolism to Drug Discovery, 2016. 7: p. 121.

194. Martinez-Mayorga, K., et al., Cruzain Inhibitors: Efforts Made, Current Leads and a

Structural Outlook of New Hits. Drug Discovery Today, 2015. 20(7): p. 890-898.

195. Buguet, A., R. Cespuglio, and B. Bouteille, African Sleeping Sickness, in Sleep Medicine.

2015, Springer. p. 159-165.

196. Steverding, D., Sleeping Sickness and Nagana Disease Caused by Trypanosoma brucei, in

Arthropod Borne Diseases. 2017, Springer. p. 277-297.

197. Aguda, A.H., et al., Structural Basis of Collagen Fiber Degradation by Cathepsin K. Proc.

Natl. Acad. Sci., 2014. 111(49): p. 17474-17479.

198. Sharma, V., et al., Structural Requirements for the Collagenase and Elastase Activity of

Cathepsin K and its Selective Inhibition by an Exosite Inhibitor. Biochem. J., 2015. 465(1): p. 163-173.

199. Stoch, S. and J. Wagner, Cathepsin K Inhibitors: A Novel Target for Osteoporosis Therapy.

Clin. Pharmacol. Ther., 2008. 83(1): p. 172-176.

200. Brömme, D. and F. Lecaille, Cathepsin K Inhibitors for Osteoporosis and Potential Off-Target

Effects. Expert Opin. Invest. Drugs, 2009. 18(5): p. 585-600.

201. Costa, A.G., et al., Cathepsin K: Its Skeletal Actions and Role as a Therapeutic Target in

Osteoporosis. Nat. Rev. Rheumatol., 2011. 7(8): p. 447-456.

202. Zhao, B., et al., Crystal Structure of Human Osteoclast Cathepsin K Complex with E-64. Nat.

Struct. Mol. Biol., 1997. 4(2): p. 109-111.

203. Adkison, K.K., et al., Semicarbazone-Based Inhibitors of Cathepsin K, are they Prodrugs for

Aldehyde Inhibitors? Bioorg. Med. Chem. Lett., 2006. 16(4): p. 978-983.

204. Novinec, M., et al., A Novel Allosteric Mechanism in the Cysteine Peptidase Cathepsin K

Discovered by Computational Methods. Nat. Commun., 2014. 5: p. 3287.

205. Kirschner, K.N., et al., GLYCAM06: A Generalizable Biomolecular Force Field.

Carbohydrates. J. Comp. Chem., 2008. 29(4): p. 622-655.

206. Novinec, M., et al., Conformational Flexibility and Allosteric Regulation of Cathepsin K.

Biochem. J., 2010. 429(2): p. 379-389.

207. Panwar, P., et al., A Novel Approach to Inhibit Bone Resorption: Exosite Inhibitors Against

Cathepsin K. Br. J. Pharmacol., 2016. 173(2): p. 396-410.

Appendix A

A1. List of PDB Structures for Generic Binding Site Definition

1c5m

1cho

1de7

1dsu

1fle

1fph

1fq3

1ggd

1h9h

1iau

1klt

1mcv

1oxg

1ppf

1pq7

1qnj

1qr3

1t31

2uuy

2xtt

2z7f

2zgc

2zgh

2zgj

A2. Pocket Residues In Generic Pocket Definition of Chymotrypsin-Like Serine Protease Test Set

Trypsin.

S4 S3 S2 S1 S1’ S2’ S3’ S4’ TYR-169 TRP-212 HIS-57 ASP-189 TRP-40 PRO-39 GLY-37 SER-74 ALA-174 GLY-213 GLN-192 SER-190 CYS-41 TRP-138 GLY-38 ARG-75 ILE-175 SER-211 CYS-191 GLY-193 GLY-139 TRP-212 ASP-194 ASP-194 ALA-140 SER-195 SER-195 VAL-210 SER-211 GLY-215 CYS-216 GLY-225 TYR-227

Factor VIIa.

S4 S3 S2 S1 S1’ S2’ S3’ S4’ SER-171 TRP-215 HIS-57 ASP-189 LEU-28 GLN-27 GLY-25 GLU-61 ASN-173 GLY-216 LYS-192 SER-190 CYS-29 TRP-135 ALA-26 HIS-64 ILE-174 SER-214 CYS-191 GLY-193 GLY-136 TRP-215 ASP-194 ASP-194 GLN-137 SER-195 SER-195 VAL-213 GLN-217 GLY-218 CYS-219 GLY-226 TYR-228

Factor Xa.

S4 S3 S2 S1 S1’ S2’ S3’ S4’ TYR-99 TRP-215 HIS-57 ASP-189 PHE-41 GLY-40 ASN-35 THR-73 PHE-174 GLY-216 GLN-192 ALA-190 CYS-42 PHE-141 GLU-37 GLU-74 TRP-215 SER-214 CYS-191 GLY-193 GLY-142 TRP-215 ASP-194 ASP-194 ARG-143 SER-195 SER-195 VAL-213 SER-214 GLY-218 CYS-220 GLY-226 TYR-228


Thrombin.

S4 S3 S2 S1 S1’ S2’ S3’ S4’ ARG-168 TRP-217 HIS-57 ASP-189 LEU-24 LEU-23 GLN-21 LYS-61 ILE-169 GLY-218 GLU-192 ALA-190 CYS-25 TRP-138 GLU-22 ARG-65 GLU-219 SER-216 CYS-191 GLY-193 GLY-139 TRP-217 ASP-194 ASP-194 ASN-140 SER-195 SER-195 VAL-215 SER-216 GLY-220 CYS-221 GLY-228 TYR-230


Kallikrein-1.

S4 S3 S2 S1 S1’ S2’ S3’ S4’ HIS-170 TRP-211 HIS-57 ASP-189 GLN-35 PHE-34 SER-32 LEU-66 GLN-172 GLY-212 GLY-193 THR-190 CYS-36 TRP-139 THR-33 PHE-67 VAL-174 SER-210 CYS-191 GLY-193 GLY-140 TRP-211 ASP-194 ASP-194 SER-141 SER-195 SER-195 THR-209 SER-210 CYS-216 PRO-222 SER-223 VAL-224


Chymotrypsin.

S4 S3 S2 S1 S1’ S2’ S3’ S4’ TRP-172 TRP-215 HIS-57 SER-189 PHE-43 HIS-42 THR-39 GLU-72 ILE-175 GLY-216 MET-192 SER-190 CYS-44 TRP-143 PHE-41 GLN-75 LYS-176 SER-214 CYS-191 GLY-193 GLY-144 TRP-215 ASP-194 ASP-194 LEU-145 SER-195 SER-195 VAL-213 SER-214 SER-218 CYS-220 GLY-226 TYR-228


Elastase-1.

S4 S3 S2 S1 S1’ S2’ S3’ S4’ TRP-172 PHE-208 HIS-57 ARG-188 THR-41 HIS-40 SER-37 GLU-70 THR-175 VAL-209 GLN-192 SER-189 CYS-42 TRP-141 TRP-38 LEU-73 VAL-176 PHE-215 GLY-190 CYS-191 GLY-142 ALA-39 GLN-192 GLN-192 LEU-143 ASP-194 ASP-194 SER-195 SER-195 VAL-212 THR-213 CYS-220 THR-226 PHE-228


Granzyme M.

S4 S3 S2 S1 S1’ S2’ S3’ S4’ TRP-170 PHE-213 HIS-57 PRO-190 LEU-38 HIS-37 SER-30 LEU-72 SER-173 SER-214 GLY-193 CYS-191 CYS-39 TRP-138 GLN-32 SER-74 LEU-174 SER-212 ASP-194 GLY-193 GLY-139 PHE-213 SER-195 ASP-194 LEU-140 LEU-211 SER-195 SER-212 ARG-216 VAL-217 CYS-218 THR-219 PRO-225


Granzyme B.

S4 S3 S2 S1 S1’ S2’ S3’ S4’ ASP-170 TYR-215 HIS-57 THR-189 ARG-41 LYS-40 TYR-32 ILE-73 LEU-171 GLY-216 LYS-192 SER-190 CYS-42 TRP-141 MET-34 LYS-74 ARG-217 SER-214 PHE-191 GLY-193 GLY-142 ASN-218 TYR-215 GLY-193 ASP-194 GLN-143 ASP-194 SER-195 SER-195 VAL-213 SER-214 ARG-217 ASN-219 CYS-228

A3. Cα RMSD Values Between X-Ray and Representative Cluster Structures Extracted From MD Simulations and Cluster Occupancies.

Trypsin.

X-Ray / Å Cluster 1 / Å Cluster 2 / Å Cluster 3 / Å Occupancy / -

X-Ray / Å 0.000 -

Cluster 1 / Å 1.101 0.000 0.72

Cluster 2 / Å 1.315 0.886 0.000 0.25

Cluster 3 / Å 0.89 0.705 0.959 0.000 0.03

Factor VIIa.

X-Ray / Å Cluster 1 / Å Cluster 2 / Å Cluster 3 / Å Occupancy / -

X-Ray / Å 0.000 -

Cluster 1 / Å 0.868 0.000 0.90

Cluster 2 / Å 0.732 0.852 0.000 0.07

Cluster 3 / Å 0.923 1.053 1.047 0.000 0.03


Factor Xa.

X-Ray / Å Cluster 1 / Å Cluster 2 / Å Cluster 3 / Å Cluster 4 / Å Cluster 5 / Å Cluster 6 / Å Occupancy / -

X-Ray / Å 0.000 -

Cluster 1 / Å 1.470 0.000 0.44

Cluster 2 / Å 1.298 1.135 0.000 0.21

Cluster 3 / Å 1.379 1.268 1.291 0.000 0.15

Cluster 4 / Å 1.510 1.410 1.361 1.408 0.000 0.10

Cluster 5 / Å 1.436 0.912 1.185 1.427 1.315 0.000 0.09

Cluster 6 / Å 1.844 1.462 1.508 1.483 1.096 1.416 0.000 0.01

Thrombin.

X-Ray / Å Cluster 1 / Å Cluster 2 / Å Cluster 3 / Å Cluster 4 / Å Cluster 5 / Å Occupancy / -

X-Ray / Å 0.000 -

Cluster 1 / Å 0.901 0.000 0.51

Cluster 2 / Å 0.981 1.010 0.000 0.24

Cluster 3 / Å 0.875 0.835 0.930 0.000 0.15

Cluster 4 / Å 0.953 1.049 0.905 0.899 0.000 0.09

Cluster 5 / Å 1.025 0.869 1.134 0.901 1.002 0.000 0.01

Kallikrein-1.

X-Ray / Å Cluster 1 / Å Cluster 2 / Å Cluster 3 / Å Occupancy / -

X-Ray / Å 0.000 -

Cluster 1 / Å 1.106 0.000 0.96

Cluster 2 / Å 1.028 0.866 0.000 0.02

Cluster 3 / Å 0.853 0.850 0.923 0.000 0.02

Chymotrypsin.

X-Ray / Å Cluster 1 / Å Cluster 2 / Å Cluster 3 / Å Cluster 4 / Å Occupancy / -

X-Ray / Å 0.000 -

Cluster 1 / Å 0.966 0.000 0.58

Cluster 2 / Å 0.950 0.943 0.000 0.25

Cluster 3 / Å 1.039 0.986 0.890 0.000 0.10

Cluster 4 / Å 0.811 1.039 0.949 1.050 0.000 0.07

Elastase.

X-Ray / Å Cluster 1 / Å Cluster 2 / Å Cluster 3 / Å Occupancy / -

X-Ray / Å 0.000 -

Cluster 1 / Å 1.071 0.000 0.76

Cluster 2 / Å 1.131 1.065 0.000 0.13

Cluster 3 / Å 0.845 1.037 0.818 0.000 0.07

Granzyme M.

X-Ray / Å Cluster 1 / Å Cluster 2 / Å Cluster 3 / Å Occupancy / -

X-Ray / Å 0.000 -

Cluster 1 / Å 1.064 0.000 0.97

Cluster 2 / Å 0.972 0.980 0.000 0.02

Cluster 3 / Å 1.022 1.067 0.984 0.000 0.01


Granzyme B.

X-Ray / Å Cluster 1 / Å Cluster 2 / Å Cluster 3 / Å Occupancy / -

X-Ray / Å 0.000 -

Cluster 1 / Å 1.560 0.000 0.46

Cluster 2 / Å 1.333 1.256 0.000 0.43

Cluster 3 / Å 1.310 1.409 1.213 0.000 0.11
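The pairwise values tabulated in A3 are Cα RMSDs between superposed structures. As an illustration only, the following minimal sketch computes an RMSD after optimal rigid-body superposition with the Kabsch algorithm and assembles a symmetric pairwise matrix. The function names, the use of NumPy, and the choice of the Kabsch algorithm are assumptions made for this example; the superposition protocol actually used to produce the tables is not specified in this appendix.

```python
import numpy as np


def kabsch_rmsd(ref: np.ndarray, mob: np.ndarray) -> float:
    """RMSD between two (N, 3) sets of matched atoms after optimal superposition."""
    if ref.shape != mob.shape or ref.ndim != 2 or ref.shape[1] != 3:
        raise ValueError("expected two (N, 3) arrays of matched atoms")
    # Remove translation by centring both sets on their centroids.
    ref_c = ref - ref.mean(axis=0)
    mob_c = mob - mob.mean(axis=0)
    # The SVD of the covariance matrix yields the optimal rotation (Kabsch).
    u, _, vt = np.linalg.svd(mob_c.T @ ref_c)
    # Flip one axis if needed so the result is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rot = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = mob_c @ rot - ref_c
    return float(np.sqrt((diff ** 2).sum() / len(ref)))


def rmsd_matrix(structures: list) -> np.ndarray:
    """Symmetric matrix of pairwise RMSDs (hypothetical helper for this sketch)."""
    n = len(structures)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            out[i, j] = out[j, i] = kabsch_rmsd(structures[i], structures[j])
    return out


if __name__ == "__main__":
    # Synthetic coordinates stand in for Cα positions of an X-ray
    # structure and two cluster representatives.
    rng = np.random.default_rng(1)
    xray = rng.normal(size=(50, 3))
    clusters = [xray + rng.normal(scale=0.5, size=xray.shape) for _ in range(2)]
    print(np.round(rmsd_matrix([xray] + clusters), 3))
```

Only the lower triangle of such a matrix is listed in the tables above, followed by the occupancy of each cluster.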


Appendix B

B1. Generic Subpocket Definition for Cruzain, Rhodesain, Cathepsin B and Cathepsin L

Cruzain.

S4 S3 S2 S1 S1' S2'

THR-59 SER-61 LEU-67 GLY-23 GLY-23 GLN-19

ASP-60 LEU-67 MET-68 CYS-25 CYS-25 GLY-20

GLY-66 VAL-137 TRP-26 LEU-160 GLN-21

LEU-67 LEU-160 SER-64 ASP-161 CYS-22

GLY-163 GLY-65 GLY-23

GLU-208 GLY-66 SER-24

ASP-161

HIS-162


Rhodesain.

S4 S3 S2 S1 S1' S2'

ILE-59 PHE-61 LEU-67 GLY-23 GLY-23 GLN-19

ASP-60 LEU-67 MET-68 CYS-25 CYS-25 GLY-20

GLY-66 LEU-160 TRP-26 LEU-160 GLN-21

LEU-67 ASP-161 GLY-64 ASP-161 CYS-22

ILE-137 GLY-65 GLY-23

GLY-163 GLY-66 SER-24

ALA-208 ASP-161

HIS-162


Cathepsin B.

S4 S3 S2 S1 S1' S2'

MET-66 ASP-69 TYR-75 GLY-27 GLY-27 GLN-23

CYS-67 TYR-75 PRO-76 CYS-29 CYS-29 GLY-24

GLY-68 GLU-171 TRP-30 GLY-197 SER-25

GLY-74 GLY-172 ASN-72 GLY-198 CYS-26

ALA-200 GLY-73 GLU-245 GLY-27

GLU-245 GLY-74 SER-28

GLY-197

GLY-198

HIS-199


Cathepsin L.

S4 S3 S2 S1 S1' S2'

GLN-60 GLU-63 LEU-69 GLY-23 GLY-23 GLN-19

GLY-61 LEU-69 MET-70 CYS-25 CYS-25 GLY-20

ASN-62 VAL-134 TRP-26 ASP-160 GLN-21

GLY-68 GLY-164 ASN-66 MET-161 CYS-22

ALA-214 GLY-67 ASP-162 GLY-23

ASP-162 SER-24

HIS-163

B2. Flexibility Differences Between Cruzain, Rhodesain, Cathepsin B and Cathepsin L

The flexibility metrics applied are B-factors based on global Cα alignment and dihedral entropies (phi and psi entropy), which both measure backbone flexibility, and B-factors based on local alignment, which measure side chain flexibility.

Flexibility differences for subpockets S4-S2’ are mapped to the binding sites of cruzain, rhodesain, cathepsin B and cathepsin L, using the following color coding:

- Less Flexible

- Same Flexibility

- More Flexible

Absolute flexibilities/entropies and the respective differences for each cysteine protease pair are given in the corresponding histograms.
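To make the B-factor metric concrete, the sketch below derives per-atom B-factors from pre-aligned trajectory coordinates using the standard relation B = (8π²/3)·⟨Δr²⟩, averages them per subpocket, and takes the difference between two proteases. This is a minimal sketch under stated assumptions: the function names, the averaging over subpocket atoms, and the sign convention of the difference are chosen for illustration and are not taken from the thesis workflow. The alignment itself (global Cα or local) is assumed to have been done beforehand.

```python
import numpy as np


def bfactors_from_trajectory(coords: np.ndarray) -> np.ndarray:
    """Per-atom B-factors (Å^2) from aligned coordinates of shape (frames, atoms, 3)."""
    mean_pos = coords.mean(axis=0)
    # Mean squared fluctuation of each atom about its average position.
    msf = ((coords - mean_pos) ** 2).sum(axis=2).mean(axis=0)
    # Standard conversion: B = (8 * pi^2 / 3) * <dr^2>.
    return (8.0 * np.pi ** 2 / 3.0) * msf


def subpocket_means(bfactors: np.ndarray, subpockets: dict) -> dict:
    """Average B-factor over the atom indices assigned to each subpocket."""
    return {name: float(np.mean(bfactors[idx])) for name, idx in subpockets.items()}


def flexibility_difference(a: dict, b: dict) -> dict:
    """Per-subpocket difference a - b (the sign convention is an assumption)."""
    return {name: a[name] - b[name] for name in a if name in b}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic trajectories stand in for two aligned protease simulations.
    traj_a = rng.normal(scale=0.5, size=(100, 6, 3))
    traj_b = rng.normal(scale=0.8, size=(100, 6, 3))
    # Hypothetical atom-index assignment to subpockets.
    pockets = {"S1": [0, 1], "S2": [2, 3], "S3": [4, 5]}
    mean_a = subpocket_means(bfactors_from_trajectory(traj_a), pockets)
    mean_b = subpocket_means(bfactors_from_trajectory(traj_b), pockets)
    print(flexibility_difference(mean_a, mean_b))
```

With this convention, a negative difference marks a subpocket that is less flexible in the first protease than in the second.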


B-Factors Based on Global Cα Alignment (Backbone Flexibility)

Cruzain vs. Cathepsin B B-Factors Global Alignment.

[Figure: B-Factors Global Alignment / Å² per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_B, Cruzain, Cruzain_CatB.]

Cruzain vs. Cathepsin L B-Factors Global Alignment.

[Figure: B-Factors Global Alignment / Å² per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_L, Cruzain, Cruzain_CatL.]

Rhodesain vs. Cathepsin B B-Factors Global Alignment.

[Figure: B-Factors Global Alignment / Å² per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_B, Rhodesain, Rhodesain_CatB.]

Rhodesain vs. Cathepsin L B-Factors Global Alignment.

[Figure: B-Factors Global Alignment / Å² per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_L, Rhodesain, Rhodesain_CatL.]

Cruzain vs. Rhodesain B-Factors Global Alignment.

[Figure: B-Factors Global Alignment / Å² per subpocket (S1, S2, S3, S4, S1', S2'); series: Cruzain, Rhodesain, Cruzain_Rhodesain.]

Phi Entropy (Backbone Flexibility)

Cruzain vs. Cathepsin B Phi Entropy.

[Figure: Phi Entropy / kcal/mol per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_B, Cruzain, Cruzain_CatB.]

Cruzain vs. Cathepsin L Phi Entropy.

[Figure: Phi Entropy / kcal/mol per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_L, Cruzain, Cruzain_CatL.]

Rhodesain vs. Cathepsin B Phi Entropy.

[Figure: Phi Entropy / kcal/mol per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_B, Rhodesain, Rhodesain_CatB.]

Rhodesain vs. Cathepsin L Phi Entropy.

[Figure: Phi Entropy / kcal/mol per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_L, Rhodesain, Rhodesain_CatL.]

Cruzain vs. Rhodesain Phi Entropy

[Figure: Phi Entropy / kcal/mol per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_L, Rhodesain, Rhodesain_CatL.]

Psi Entropy (Backbone Flexibility)

Cruzain vs. Cathepsin B Psi Entropy

[Figure: Psi Entropy / kcal/mol per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_B, Cruzain, Cruzain_CatB.]

Cruzain vs. Cathepsin L Psi Entropy.

[Figure: Psi Entropy / kcal/mol per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_L, Cruzain, Cruzain_CatL.]

Rhodesain vs. Cathepsin B Psi Entropy.

[Figure: Psi Entropy / kcal/mol per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_B, Rhodesain, Rhodesain_CatB.]

Rhodesain vs. Cathepsin L Psi Entropy.

[Figure: Psi Entropy / kcal/mol per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_L, Rhodesain, Rhodesain_CatL.]

Cruzain vs. Rhodesain Psi Entropy.

[Figure: Psi Entropy / kcal/mol per subpocket (S1, S2, S3, S4, S1', S2'); series: Cruzain, Rhodesain, Cruzain-Rhodesain.]

B-Factors Based on Local Alignment (Side Chain Flexibility)

Cruzain vs. Cathepsin B B-Factors Based on Local Alignment.

[Figure: B-Factors Local Alignment / Å² per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_B, Cruzain, Cruzain_CatB.]

Cruzain vs. Cathepsin L B-Factors Based on Local Alignment.

[Figure: B-Factors Local Alignment / Å² per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_L, Cruzain, Cruzain_CatL.]

Rhodesain vs. Cathepsin B B-Factors Based on Local Alignment.

[Figure: B-Factors Local Alignment / Å² per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_B, Rhodesain, Rhodesain_CatB.]

Rhodesain vs. Cathepsin L B-Factors Based on Local Alignment.

[Figure: B-Factors Local Alignment / Å² per subpocket (S1, S2, S3, S4, S1', S2'); series: Cathepsin_L, Rhodesain, Rhodesain_CatL.]

Cruzain vs. Rhodesain B-Factors Based on Local Alignment.

[Figure: B-Factors Local Alignment / Å² per subpocket (S1, S2, S3, S4, S1', S2'); series: Cruzain, Rhodesain, Cruzain_Rhodesain.]

Appendix C

C1. Subpocket Definition for Cathepsin K

Exosite 1 S4 S3 S2 S1 S1' S2'

TYR-87 GLU-59 ASP-61 TYR-67 GLY-23 GLY-23 GLN-19

PRO-88 ASN-60 TYR-67 MET-68 CYS-25 CYS-25 GLY-20

TYR-89 GLY-66 ALA-133 TRP-26 ASN-158 GLU-21

VAL-90 LEU-157 GLY-64 HIS-159 CYS-22

GLY-91 ALA-160 GLY-65 GLY-23

GLN-92 LEU-205 ASN-158 SER-24

GLU-93 HIS-159

GLU-94

SER-95

CYS-96

MET-97

TYR-98

ASN-99

PRO-100

THR-101

GLY-102

C2. Influence of Single and Double Mutations, Type of Inhibition, Chondroitin Sulfate Binding and Di- and Tetramerization on Cathepsin K Flexibility

The flexibility metrics applied are B-factors based on global Cα alignment, which measure backbone flexibility, and B-factors based on local alignment, which measure side chain flexibility.

Flexibility differences for subpockets S4-S2’ and exosite-1 are mapped to the binding sites of cathepsin K, using the following color coding:

- Less Flexible

- Same Flexibility

- More Flexible


WT vs. D55N – Backbone Flexibility Global Alignment.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: WT, D55N, D55N-WT.]

WT vs. D55N – Side Chain Flexibility Local Alignment.

[Figure: Side Chain B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: WT, D55N, D55N-WT.]

WT vs. K119D/K176E – Backbone Flexibility Global Alignment.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: WT, K119D, K119D-WT.]

WT vs. K119D/K176E – Side Chain Flexibility Local Alignment.

[Figure: B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: WT, K119D, K119D-WT.]

Apo vs. Reversible Inhibitor (RI) – Backbone Flexibility Global Alignment.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: Apo, RI (2AUX), RI-Apo.]

Apo vs. Reversible Inhibitor (RI) – Side Chain Flexibility Local Alignment.

[Figure: Side Chain B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: Apo, RI, RI-Apo.]

Apo vs. CatK with Covalent Inhibitor (E64) - Backbone Flexibility Global Alignment

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: Apo, E64, E64-Apo.]

Apo vs. CatK with Covalent Inhibitor (E64) - Side Chain Flexibility Local Alignment

[Figure: Side Chain B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: Apo, E64, E64-Apo.]

Apo vs. Allosteric Inhibitor (AI) – Backbone Flexibility Global Alignment.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: Apo, AI, AI-Apo.]

Apo vs. Allosteric Inhibitor (AI) – Side Chain Flexibility Local Alignment.

[Figure: Side Chain B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: Apo, AI, AI-Apo.]

Apo vs. CS Bound CatK (Holo) – Backbone Flexibility Global Alignment.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: Apo, CS (holo), CS (holo)-Apo.]

Apo vs. CS Bound CatK (Holo) – Side Chain Flexibility Local Alignment.

[Figure: Side Chain B-Factors / Å² per subpocket/exosite (S4, S3, S2, S1, S1', S2', Exosite 1); series: Apo, CS (holo), CS (holo)-Apo.]

Impact of Dimerization and Tetramerization on CatK Flexibility

Apo vs. Dimer Unit 1.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S1, S2, S3, S4, S1', S2', Exosite 1); series: Unit 1, Apo, Unit 1-Apo.]

Apo vs. Dimer Unit 2.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S1, S2, S3, S4, S1', S2', Exosite 1); series: Unit 2, Apo, Unit 2-Apo.]

Apo vs. Tetramer Unit 1.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S1, S2, S3, S4, S1', S2', Exosite); series: Unit 1, Apo, Unit 1-Apo.]

Apo vs. Tetramer Unit 2.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S1, S2, S3, S4, S1', S2', Exosite); series: Unit 2, Apo, Unit 2-Apo.]

Apo vs. Tetramer Unit 3.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S1, S2, S3, S4, S1', S2', Exosite); series: Unit 3, Apo, Unit 3-Apo.]

Apo vs. Tetramer Unit 4.

[Figure: Backbone B-Factors / Å² per subpocket/exosite (S1, S2, S3, S4, S1', S2', Exosite); series: Unit 4, Apo, Unit 4-Apo.]

Statutory Declaration

I hereby declare that the presented thesis was written by myself as the sole author and that no sources and tools were used besides those mentioned in the thesis itself. All literal or paraphrased citations have been marked as such and attributed to their original authors.

Date Signature

Declaration in Lieu of Oath

I hereby declare in lieu of oath, by my own signature, that I wrote this thesis independently and used no sources or aids other than those indicated. All passages taken verbatim or in substance from the indicated sources are marked as such.

This thesis has not previously been submitted in the same or a similar form as a Magister, Master's, or Diploma thesis or as a dissertation.

Date Signature
