
DISS. ETH NO. 25527

Development and Application of Bespoke Machine Learning Lipophilicity Models for Peptides

A thesis submitted to attain the degree of

DOCTOR OF SCIENCES of ETH ZURICH (Dr. sc. ETH Zürich)

presented by

JENS-ALEXANDER FUCHS

Pharmacist (State Examination), University of Bonn

born on September 12, 1988
Citizen of Germany

accepted on the recommendation of
Prof. Dr. Gisbert Schneider, examiner
Prof. Dr. Stefanie Krämer, co-examiner

2018

© 2018 Jens A. Fuchs: Development and Application of Bespoke Machine Learning Lipophilicity Models for Peptides

This work is dedicated to my wife Lisa, my parents Gabi and Wolfgang, and sister Jasmin. Your sympathy fosters my personality, well-being, and my thoughts about life and science.

“Our species needs, and deserves, a citizenry with minds wide awake and a basic understanding of how the world works.”

Carl Sagan

Publications

Parts of this thesis were published in:

• J. A. Fuchs, F. Grisoni, M. Kossenjans, J. A. Hiss, G. Schneider, "Lipophilicity prediction of peptides and peptide derivatives by consensus machine learning", Medicinal Chemistry Communications 2018, 9, 1538-1546.

The discussed concepts of peptide quantification and generative modelling by artificial neural networks are published in:

• M. D. Allenspach*, J. A. Fuchs*, N. Doriot, J. A. Hiss, G. Schneider, C. Steuer, "Quantification of hydrolyzed peptides and proteins by amino acid fluorescence", Journal of Peptide Science 2018, e3113.

• A. Gupta, A. T. Müller, B. J. H. Huisman, J. A. Fuchs, P. Schneider, G. Schneider, "Generative recurrent networks for de novo drug design", Molecular Informatics 2018, 37, 1-2.

Contents

PUBLICATIONS

LIST OF FIGURES AND TABLES

LIST OF ABBREVIATIONS

SUMMARY

ZUSAMMENFASSUNG

1 INTRODUCTION
  1.1 Lipophilicity: A Fundamental Concept for Pharmacokinetic and Pharmacodynamic Assessment in Drug Discovery
      Partition and Distribution Coefficients
      Experimental Approaches to Determine Partition and Distribution Coefficients
      In Silico Calculation of Partition- and Distribution-Coefficients
  1.2 Peptides in Drug Discovery
      Advantages and Drawbacks of Peptides
      Overcoming the Drawbacks of Peptides by Combining Biotechnology and Medicinal Chemistry
      Lipophilicity of Peptides and Peptide-Mimetics
  1.3 Machine Learning for the Prediction of Pharmaceutically Relevant Properties
      Molecular Representation
      Unsupervised Algorithms
      Supervised Algorithms
      Model Evaluation
      Applicability Domain
  1.4 Evolutionary Algorithms
  1.5 Protein-Protein Interactions in Drug Discovery
      The Chemokine System
      CCR7 and CCL19/CCL21

2 AIMS OF THIS THESIS

3 MATERIALS AND METHODS
  3.1 Laboratory Methods
      Peptide Synthesis
      Peptide Analytics and Purification
      Shake-Flask Method
      Microscale Thermophoresis
  3.2 Computational Methods
      Software
      Molecular Representation and Descriptor Calculation
      Machine Learning
      Datasets
      De Novo Peptide Design

4 RESULTS AND DISCUSSION
  4.1 Baseline Models
      Introduction
      Feature Selection and Dimensionality Reduction
      Results for Modelling with Lasso Features vs. PCA Scores
      Predictions from Baseline Models for Peptides up to a Length of Six AA
      Discussion
  4.2 Expanded Models
      Introduction and Hypothesis
      Results for Modelling LIPOPEP vs. AZ
      Final Consensus Model based on the Pooled Data
      Domain of Applicability
      Discussion
  4.3 Benchmarking the Final Consensus Model
      Introduction and Hypothesis
      Methods
      Results
      Discussion
  4.4 Focussed De Novo Generated Peptide Libraries for Studying Chemokine-Receptor / Ligand Interactions
      Introduction
      Fragmentation of CCR7_C24A
      De Novo Peptide Generation by Simulated Molecular Evolution and Ranking
      Binding Affinities of Selected Offsprings
      Discussion

5 CONCLUSIONS AND OUTLOOK

6 ACKNOWLEDGEMENTS

BIBLIOGRAPHY

A SUPPLEMENTARY INFORMATION
  A.1 Supplementary Information to Chapter 1
  A.2 Supplementary Information to Chapter 4.1
  A.3 Supplementary Information to Chapter 4.2
  A.4 Supplementary Information to Chapter 4.3
  A.5 Supplementary Information to Chapter 4.4

List of Figures

1.1 The Drug Discovery Pathway
1.2 LogD vs. pH Profile of Buprenorphine
1.3 Molecular Structures of Cyclosporin A, Desmopressin, and Carbetocin
1.4 Dataset Preparation and Machine Learning Workflow
1.5 1D - 3D Molecular Representations
1.6 Principal Component Analysis
1.7 k-means Clustering
1.8 Support Vector Machines
1.9 Ensemble Prediction by Cascaded Jury Networks
1.10 Cross-Validation and y-Randomisation
1.11 Covered Chemical Space and Applicability Domain
1.12 NMR-Structures of CCL19 and CCL21
1.13 Sequence Alignment N-Termini of Homeostatic CC-Chemokine Receptors
1.14 Schematic Depiction of the CCR7/CCL19 Site 1 Interaction

4.1 Feature Selection by Lasso
4.2 Loading Plot of the Lasso-selected Features
4.3 PCA Scree Plot
4.4 Heatmaps of SVR hyper-parametrisation
4.5 Baseline Model Predictions for the In-House Peptides
4.6 Y-Randomisation
4.7 Performances LIPOPEP vs. AZ
4.8 Differences between LIPOPEP and AZ
4.9 Consensus Results
4.10 Williams Plots
4.11 Retraining ACDlabs and Chemaxon. Flagging Molecules by ADMET-Predictor
4.12 Benchmarking: Model Performances in Relation to Ionisability, Molecular Weight and Lipophilicity
4.13 Overview of the CCR7 Project
4.14 MST-Curves of CCR7_C24A, CCR7_10.1 and CCR7_6.4
4.15 Fragmentation of CCR7_C24A
4.16 Properties of Virtual Peptide Libraries

List of Tables

1.1 Selected Experimental Methods for Direct and Indirect Lipophilicity Determination
1.2 Selected "Classic" Methods for logP Prediction
1.3 Selected QSPR Methods for Lipophilicity Prediction
1.4 LogD7.2 of some Peptide Drugs
1.5 Prominent Kernel Functions

3.1 Summary of the Synthesized Peptides
3.2 SFM: Chromatographic Settings for Peptide-quantification
3.3 Summary Datasets

4.1 Performances of Baseline Models on the LIPOPEP Set
4.2 Performances of Extended Models and the Consensus Model on the Pooled Data
4.3 Structures and logD7.4 of the Test Compounds Predicted with an Absolute Error > 2 log Units
4.4 Results of the Benchmark Analysis
4.5 Summary of the Synthesised and Tested Offsprings

List of Abbreviations

1D  one-dimensional
2D  two-dimensional
3D  three-dimensional
AA  amino acid
AAM  arithmetic average model
ACN  acetonitrile
ACP  anticancer peptide
AD  applicability domain
AE  absolute error
AMP  antimicrobial peptide
ANN(-E)  artificial neural network (-ensemble)
AZ  AstraZeneca
CCL19, CCL21  CC-chemokine ligand 19/21
CCR7  chemokine receptor 7
CHI  chromatographic hydrophobicity index
CPC  centrifugal partition chromatography
CPP  cell penetrating peptide
CR  chemokine receptor
CV  cross validation
DCM  dichloromethane
DMF
EA  evolutionary algorithms
ECFP  extended connectivity fingerprints
EV  external validation
FA
FDA  Food and Drug Administration
Fmoc  9-fluorenylmethoxycarbonyl
GAG  glycosaminoglycans
GP  Gaussian process
GPCR  G-protein-coupled receptor
HCTU  2-(6-chloro-1H-benzotriazol-1-yl)-1,1,3,3-tetramethylaminium hexafluorophosphate
HPLC  high-performance liquid chromatography
HTS  high-throughput screening
IUPAC  International Union of Pure and Applied Chemistry
Kd  dissociation constant
Lasso  least absolute shrinkage and selection operator
LLE  lipophilic ligand efficiency
logDpH  logarithmic distribution coefficient at specific pH
logP  logarithmic partition coefficient
MD  molecular dynamics
MHC-1  major histocompatibility complex 1
MOE  molecular operating environment
ML  machine learning
MLR  multivariate linear regression
MS  mass spectrometry
MST  microscale thermophoresis
MW  molecular weight
NCE  new chemical entity
NMM  n-methyl-morpholine
NMR  nuclear magnetic resonance
NN  nearest neighbour
OCHEM  online chemical modelling environment
PB  phosphate buffer
PBS  phosphate-buffered
PC  principal component
PCA  principal component analysis
PD  pharmacodynamics
PK  pharmacokinetics
pKa  logarithmic acid dissociation constant
PPI  protein-protein interaction / peptide-protein interaction
PSGL-1  p-selectin glycoprotein ligand-1
QSAR  quantitative structure-activity relationship
QSPR  quantitative structure-property relationship
RF  random forest
RMSE  root mean squared error
Ro5  rule of five
SFM  shake-flask method
SMILES  simplified molecular input line entry specification
S/N  signal-to-noise ratio
SPPS  solid phase peptide synthesis
std  standard deviation
SVM  support vector machine
SVR  support vector regression
TFA  trifluoroacetic acid
TRH  thyrotropin-releasing hormone
TIS  triisopropylsilane
UHPLC  ultra-high-performance liquid chromatography
USP  United States Pharmacopeia
vdW  van der Waals

Summary

Lipophilicity is a key physicochemical property in drug design and discovery. In early-stage scenarios, lipophilicity is employed to rationalise the selection of molecules from a large pool of compounds, directing development into preferred regions of chemical space. The direct link between lipophilicity and the fate of a drug in various compartments of the human body also makes it a pivotal parameter for optimising pharmacokinetic properties. The experimental determination of lipophilicity in terms of partition and distribution coefficients is often tedious and expensive. Indirect measurements are faster and designed for high-throughput application. However, these methods require calibration, and most assays are accurate only for a restricted set of compounds.

As an alternative, the community began to develop computational approaches. The first atom- and fragment-based models, which interpret lipophilicity as an additive property, appeared in the 1960s. Some extended versions of these models are still regularly applied today. With the emergence of machine learning algorithms, which have no static program instructions but can improve themselves by learning from incorrect predictions, the field was complemented by more sophisticated data-driven approaches. Nowadays, researchers can pick from a plethora of in silico lipophilicity models for small molecules originating from traditional medicinal chemistry. Complex, nature-inspired molecules, such as peptides and macrocycles, long remained at the periphery of drug discovery. However, technological advances in synthesis, application forms, and drug delivery, as well as the drive to extend the limits of target ligandability, have put such structures into the spotlight. Yet their distinct differences in size, complexity, and physicochemical properties compared to small molecules impair the reliability of most lipophilicity predictors and call for the development of bespoke models. In this thesis, I address this need by developing and training such models with peptide data.

Initially, suitable machine learning algorithms for feature selection and non-linear modelling were employed. The resulting peptide-specific models of the logarithmic distribution coefficient (logD) at physiological pH were based on low-dimensional input to account for data scarcity. These models selected molecular representations that are intuitively meaningful for modelling partition behaviour between a polar and an apolar phase. Retrospective evaluation on a set of short, linear di- to pentapeptides consisting of natural amino acids revealed excellent performance of all employed algorithms (RMSEs between 0.39 and 0.54). Further validation on a set of 15 synthesised and tested linear hexapeptides demonstrated applicability up to this length. Ten of the 15 peptides were predicted with an absolute error lower than 0.5 log units, and the error for the remaining five peptides was lower than 1.0 log units.

As the development of pharmaceutically relevant peptides goes beyond linear, unmodified sequences, our initial models were extended to provide accurate predictions for such real-world examples of drug discovery. For this purpose, lipophilicity data of peptide-derived mimetics provided by AstraZeneca were used to expand the known chemical space and allow a re-parametrisation. The overall accuracy obtained by combining the model outputs into a performance-weighted consensus value demonstrated the applicability of this novel approach to peptides and peptide derivatives in a logD7.4 range of approximately -3 to 5.

By comparing our results to three commercially available and routinely applied lipophilicity models for small molecules, we found weaknesses in these tools, underscoring the need for dedicated lipophilicity predictors.

In the final part of this thesis, the consensus model was utilised to guide the de novo design of short peptide modulators of the protein-protein interaction between chemokine receptor 7 (CCR7) and its endogenous ligand CCL19. We confirmed the previously reported interaction between the receptor N-terminus and the chemokine and scrutinised the binding epitope by systematic fragmentation of the former. This approach led to the identification of a hexapeptide that served as the template for simulated molecular evolution to explore the sequence space for potential CCL19 binders. Generated peptides within the virtual libraries were ranked by pharmacophoric and lipophilic similarity to rationalise the selection for synthesis and testing. In total, five of 12 tested peptides exhibited binding affinities to CCL19 in the low micromolar range. These results demonstrated the validity of using logD as one component of a multidimensional approach to focus in silico libraries for the creation of novel peptide designs while simultaneously retaining target affinity. Yet the feasibility of evolving the proposed sequences into pharmaceutically relevant compounds, and their potential to interfere with the pathological lymph-node trafficking of CCR7-overexpressing cancer cells, needs to be investigated in further studies.

In summary, this work links the modelling of a pharmaceutically relevant property for a hitherto neglected compound class with the application of the gained knowledge to aid de novo drug design. This amalgamation of computational and experimental expertise represents a smart drug discovery strategy that fosters the creation of innovative medicines in the future.

Zusammenfassung

Lipophilicity is an important physicochemical property for the design and development of pharmaceuticals. In the early stages of development, lipophilicity serves as a rationale for selecting molecules from large compound collections. This process helps to direct drug development into preferred regions of chemical property space. The direct connection between lipophilicity and the fate of a drug in the various compartments of the human body makes this property a central parameter of pharmacokinetics. The experimental determination of lipophilicity by means of distribution coefficients is often laborious and expensive. Indirect measurements are less time-consuming and better suited to high throughput. However, these methods require calibration, and most assays are accurate only for structurally similar molecules.

Therefore, computer-aided approaches for predicting lipophilicity began to be developed. The first atom- and fragment-based models were introduced in the 1960s. These models interpret lipophilicity as an additive property. Some extended versions of these approaches are still regularly applied today. With the emergence of machine learning algorithms, the field was complemented by more sophisticated, data-driven approaches. These algorithms do not follow static program instructions. Rather, they are able to learn from poor predictions and improve themselves. Today, researchers can choose from a multitude of in silico lipophilicity models for drug molecules originating from traditional medicinal chemistry. More complex molecules, often modelled on a natural template, such as peptides and macrocycles, long appeared only at the margins of research. Technological advances in synthesis, dosage forms, and targeted drug delivery, as well as the will to extend the limits of addressable pharmacological targets, brought such molecules to the fore. Their marked differences in size, complexity, and physicochemical properties compared with small drug molecules impair reliable predictions by most models and call for the development of tailor-made approaches. In this dissertation, I respond to this need by developing and training such models with peptide data.

Initially, algorithms for selecting suitable molecular features and for non-linear regression were employed. The resulting peptide-specific models for predicting the logarithmic distribution coefficient (logD) at physiological pH were based on low-dimensional features to account for the small amount of available data. The algorithm chose molecular representations that appeared intuitively meaningful for modelling the distribution of a drug between a polar and an apolar phase. Retrospective evaluation on a dataset of short, linear di- to pentapeptides composed exclusively of natural amino acids yielded excellent predictions from all algorithms applied (RMSE between 0.39 and 0.54). Subsequent validation with a series of 15 synthesised and tested linear hexapeptides demonstrated the successful application of the models up to this length. Ten of the 15 peptides were predicted with an absolute error of less than 0.5 log units. The error for the other five peptides was less than 1.0 log units.

The development of pharmaceutically relevant peptides goes beyond linear, unmodified sequences. The initial models were therefore developed further to deliver accurate predictions for real-world examples of peptide research as well. For this purpose, lipophilicity data of peptide-derived mimetics provided by AstraZeneca were taken into account. These data broadened the chemical space known to the models and allowed their re-parametrisation. The accuracy across all data, obtained through a consensus value weighted by the precision of the predictions, demonstrated the applicability of the new approach to peptides and peptide derivatives in a logD7.4 range of approximately -3 to 5.

The results presented in this dissertation were compared with three commercially available models that are routinely applied to small drug molecules. These commercial models showed weaknesses, which underlined the need for tailor-made peptide models.

In the final part of this dissertation, the consensus model was used to guide the de novo design of short peptide modulators of the protein-protein interaction between chemokine receptor 7 (CCR7) and its endogenous ligand CCL19. We confirmed the previously described interaction between the N-terminus of the receptor and the chemokine and examined the binding epitope of the former by systematic fragmentation. This approach led to the identification of a hexapeptide that served as the template for simulated molecular evolution. In this way, the virtual sequence space could be probed for potential CCL19 binders. The virtually generated sequences were ranked by pharmacophoric and lipophilic similarity to justify the selection of some peptides for synthesis and testing. Five of the 12 tested peptides showed binding affinities to CCL19 in the low micromolar range. These results demonstrated that logD is a valid parameter in a multidimensional approach aimed at focusing in silico peptide libraries. New peptide sequences were discovered that simultaneously retained affinity for the target structure. Beyond this, further studies must investigate whether the proposed linear sequences can be developed further into pharmaceutically relevant structures. A further step is to clarify the potential of the peptides to interfere with the metastasis of CCR7-overexpressing cancer cells to the lymph nodes.

In summary, this work links the modelling of a pharmaceutically relevant property for a hitherto neglected compound class with the application of the knowledge gained to support de novo drug design. The merging of computational and experimental expertise stands for a smart drug development strategy that fosters the creation of innovative medicines of the future.


1 Introduction

1.1 Lipophilicity: A Fundamental Concept for Pharmacokinetic and Pharmacodynamic Assessment in Drug Discovery

It is essential to understand physicochemical properties for the rational molecular design of new chemical entities (NCEs) exhibiting pharmacological effects.[1–3] The assessment of physicochemical properties directs drug discovery into a region of chemical space that is likely to contain drug-like compounds [4], allowing the objective creation of chemical libraries. For example, the chemical space of orally available small molecules [5] and "frequent hitters" that generate positive read-outs in multiple activity assays [6] are defined by physicochemical properties. Substances with undesired properties can be excluded, and a maximally diverse pool with a minimum of compounds can be achieved. This "negative design" is important for contemporary early-stage drug development because pharmaceutical companies must prioritise hits from rapid, partly automated experimental screening [7] and vast virtual libraries (Figure 1.1). These modern technologies provide both challenges and unique opportunities for drug discovery. The concept changes to "positive design" in the further optimisation and identification of lead structures and in the final selection of clinical candidates. Iterative structure modification is accompanied by parallel monitoring of physicochemical properties to optimise pharmacokinetics (PK), pharmacodynamics (PD), safety, and novelty.

Key physicochemical properties in the drug discovery process are lipophilicity, solubility, and ionisation status (pKa).[8] These properties are interdependent and vary with the pH and polarity of the environment. Lipophilicity reflects molecular behaviour in aqueous and lipid-rich compartments of the body and affects absorption, distribution, metabolism, excretion and toxicology (ADMET).[9, 10] This interplay of PK properties determines bioavailability and drug safety [11] and has a major impact on the application form and dosage regimen:

• Absorption
  Solubility is a critical requirement for absorption. The solubility of ionisable compounds increases with the difference in pH and pKa [1]. However, the intrinsic lipophilicity of the neutral species is often employed as a reasonable characteristic to define soluble molecular starting points. A further requirement is the capability to cross cell membranes, which is referred to as permeability. A relationship between lipophilicity and permeability was established for various compound classes and types of biological membranes such as the blood-brain barrier and Caco-2 cell monolayers.[12, 13] In general, low lipophilicity implies low permeability.[14]

• Distribution
  Molecular binding to plasma proteins, in particular to serum albumin, or to certain tissues has a profound effect on other PK parameters because a bound drug is not freely available for distribution in the body. Lipophilicity is a major determinant of the binding capability of a compound to serum albumin and of other non-specific binding to human tissues.[15, 16]

• Metabolism and Excretion
  Drug metabolism occurs via hepatic, renal or biliary pathways. It strongly depends on structural aspects of the molecule [17], which makes it the most difficult ADMET process to describe by physicochemical properties. Gleeson found a non-linear, statistically significant relation between lipophilicity and in vivo clearance for approximately 11,000 GlaxoSmithKline compounds.[18] Physiologically, metabolism introduces polar structures to enhance renal and biliary clearance. Consequently, lowering the lipophilicity reduces metabolic clearance.[19, 20]

• Toxicology
  Compound binding to any target is partly driven by lipophilicity (hydrophobic effect).[21, 22] High lipophilicity enhances drug promiscuity [23–25], which in turn increases the likelihood of unwanted pharmacological effects. More specifically, lipophilicity correlates with drug-induced inactivation of the human ether-a-go-go-related gene (hERG) potassium channel [26], phospholipidosis [27], drug-induced liver injury (DILI) [28] and the inhibition of enzymatic cytochrome P450 proteins.[18]

The analysis of success rates from first-in-man studies to registration for ten major pharma companies between 1991 and 2000 revealed that 89% of the drug candidates did not pass clinical trials.[29] In 1991, poor PK and bioavailability were the major cause of drug attrition (approximately 40%). Pharmacokinetic profiling from the early stage of drug development onwards reduced this attrition to less than 10% by 2000, when the lack of efficacy and safety became the major cause of attrition. These numbers provide evidence of how the pharmaceutical industry identifies and rectifies causes of attrition through (physicochemical) profiling and restructuring of its pipeline.

More than a century ago, Meyer [30] and Overton [31] explained the biological activity of narcotic compounds by their lipophilicity. Ever since, this approach has been used for assessing PD. Hansch and Fujita, the pioneers of quantitative structure-activity relationship (QSAR) analysis, employed the lipophilic character of a compound as a determinant of activity from their earliest work on.[32] By analysing high-throughput screening (HTS) databases, Keserü and Makara recently showed that, owing to the nature of retrieved hits and common hit-to-lead optimisation practices, contemporary lead compounds are larger and more lipophilic than in the past.[33] This trend can be explained by the finding that entropy-driven compounds promote potency.[34] However, the "...tendency to build potency into molecules by the inappropriate use of lipophilicity..." ultimately leads to difficulties with compounds in clinical trials due to poor ADMET profiles. Hann termed this phenomenon "molecular obesity".[35] Therefore, improving ligand efficiency metrics instead of potency alone prevents over-emphasising potency and inflating physicochemical properties.[36] This concept includes the use of the smallest possible ligands and minimal lipophilicity to obtain desired outcomes. Such metrics quantify how efficiently molecular features contribute to target affinity; for instance, the lipophilic ligand efficiency (LLE) is one widely accepted index for combining in vitro potency and lipophilicity.[37]
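The idea behind LLE can be illustrated with a minimal sketch. The text above does not spell out a formula, so the commonly used definition LLE = pIC50 - logP (or logD) is an assumption here, and the function names and the two compounds with their values are hypothetical.

```python
import math


def pic50(ic50_molar: float) -> float:
    """Negative decadic logarithm of an IC50 given in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


def lipophilic_ligand_efficiency(ic50_molar: float, lipophilicity: float) -> float:
    """Assumed definition: LLE = pIC50 - logP (or logD)."""
    return pic50(ic50_molar) - lipophilicity


# Hypothetical compounds: A is more potent but far more lipophilic than B.
lle_a = lipophilic_ligand_efficiency(1e-8, 5.0)  # 10 nM, logD 5.0
lle_b = lipophilic_ligand_efficiency(1e-7, 2.0)  # 100 nM, logD 2.0

print(f"LLE A = {lle_a:.1f}, LLE B = {lle_b:.1f}")
```

Under this assumed definition, the less potent but less lipophilic compound B scores higher, which is the behaviour such a metric is meant to reward.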

Figure 1.1: Proposed Drug Discovery Pathway adapted from [2]. (a) Multiparametric guidelines are applied to narrow the number of hits to a reasonable amount (negative design). (b) Hit-to-lead optimisation, lead identification and further optimisation to clinical candidates are supported by pharmacokinetic assessment and ligand efficiency metrics (positive design).[36, 37] HTS: high-throughput screening

The optimisation of either the PK or the PD of a compound alone is not the preferred strategy in medicinal chemistry. Instead, drug discovery is driven by a multi-dimensional, multi-parametric approach. Recently, GlaxoSmithKline researchers stated that solubility, dose and lipophilicity are their guiding parameters for oral drug candidates.[38] The pioneering "rule of five" (Ro5) by Lipinski employs lipophilicity (as well as molecular weight and the number of H-bond acceptors and donors) as a marker for poor absorption and permeability of oral drug candidates.[5] Today, we know that the Ro5 applies only to compounds absorbed by passive mechanisms and that compliant compounds are not automatically good drugs. However, it stimulated work on a plethora of other rules for guiding drug discovery.[2] These rules are empirically derived from traditional small synthetic molecules, and as such they are inapplicable to larger compound classes like macrocyclic peptides.[39] Chameleonic properties of high-molecular-weight drugs, i.e. the conformational flexibility to adapt to the polarity of the environment, also interfere with the Ro5.[40] Such compound classes, often derived from natural products, gain importance as the pharmaceutical community strives to extend the "druggable genome" [41] with targets that cannot be sufficiently modulated by conventional medicinal chemistry. To employ lipophilicity as a rationale in drug discovery in such cases, it is helpful to discuss its reliable experimental and computational determination. This thesis takes a step in this direction by addressing the lipophilicity of short peptides and peptide-mimetics.
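How a rule such as the Ro5 acts as a filter can be sketched as follows. The thresholds are the commonly cited ones (molecular weight above 500, logP above 5, more than 5 H-bond donors, more than 10 H-bond acceptors) and are not stated in the text above; the class, the function name and both example property sets are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Properties:
    mol_weight: float        # g/mol
    log_p: float             # n-octanol/water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def ro5_violations(p: Properties) -> List[str]:
    """Return the commonly cited rule-of-five criteria that a compound violates."""
    violations = []
    if p.mol_weight > 500:
        violations.append("molecular weight > 500")
    if p.log_p > 5:
        violations.append("logP > 5")
    if p.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if p.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


# Hypothetical property sets, not measured data.
small_molecule = Properties(mol_weight=350.0, log_p=2.5, h_bond_donors=2, h_bond_acceptors=5)
large_peptide = Properties(mol_weight=1200.0, log_p=3.0, h_bond_donors=5, h_bond_acceptors=12)

for name, props in (("small molecule", small_molecule), ("large peptide", large_peptide)):
    print(name, ro5_violations(props))
```

The hypothetical peptide-like entry fails on size and acceptor count alone, which mirrors the point made above that such rules, derived from small synthetic molecules, say little about larger compound classes.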

Partition and Distribution Coefficients

According to the definition of the International Union of Pure and Applied Chemistry (IUPAC), lipophilicity represents the affinity of a molecule for an apolar environment.[42] In 1872, Berthelot and Jungfleisch proposed expressing this molecular characteristic by the partition behaviour between an aqueous phase and an immiscible apolar organic solvent.[43] By convention, the decadic logarithm of the ratio of concentrations in both phases (partition coefficient, logP) expresses this definition numerically:

$$\log P_x = \log \frac{[x]_{\mathrm{org}}}{[x]_{\mathrm{aq}}} \tag{1.1}$$

A negative value reflects a preference of the solute molecule for the aqueous phase, while a positive value reflects a preference for the organic phase. A molecule with logP = 0 distributes equally between both phases. A second convention specifies that logP describes the intrinsic property of a neutral molecule [42], implying that either the analyte is non-ionisable or the determination is conducted under pH conditions ensuring the uncharged state. This concept can be applied to any biphasic solvent system that is found to mimic partition events in nature. For example, hydrocarbon solvents, such as [...] and [...], result in correlations between logP and membrane/water partition for hydrophobic compounds.[44] With the aim of simulating natural membranes, Hansch and Fujita proposed the logP between n-octanol and water (or aqueous buffer).[32, 45] This system has become the commonly accepted standard for expressing lipophilicity in the contexts discussed in this thesis. Hence, all further mentions of partition and distribution coefficients refer to the n-octanol/water (buffer) system.

The concept of a constant logP is extended by introducing distribution coefficients (logD), which account for all neutral and ionised species of an analyte:

\[ \log D_x = \log \frac{\sum [\text{species } x]_{\mathrm{org}}}{\sum [\text{species } x]_{\mathrm{aq}}} \tag{1.2} \]

LogD varies depending on the charge state of the ionisable groups and can be expressed as a function of pH.[46]
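Equations 1.1 and 1.2 can be illustrated with a minimal Python sketch. The function names and all concentration values are hypothetical and serve only to show the arithmetic; they are not taken from any measurement in this thesis.

```python
import math
from typing import Sequence


def log_p(conc_org: float, conc_aq: float) -> float:
    """Equation 1.1: decadic log of the organic/aqueous concentration ratio
    of the neutral species."""
    return math.log10(conc_org / conc_aq)


def log_d(species_org: Sequence[float], species_aq: Sequence[float]) -> float:
    """Equation 1.2: the same ratio, but summed over all species
    (neutral and ionised) present in each phase."""
    return math.log10(sum(species_org) / sum(species_aq))


# Hypothetical concentrations in arbitrary but identical units.
print(log_p(100.0, 1.0))                # neutral species only
print(log_d([9.0, 1.0], [0.5, 0.5]))    # two species per phase
```

Equal concentrations in both phases give logP = 0, in line with the interpretation given above.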

Figure 1.2 depicts the calculated logDpH profiles as well as experimental values for buprenorphine, a zwitterionic opioid derivative, at given pH values. In this example, the logDpH differences between neutral, cationic and anionic forms amount to 4.7 log units, indicating that logDpH is the appropriate parameter to describe the behaviour of ionisable compounds.[47] Bhal et al. demonstrated that using logD5.5 instead of logP as one criterion in Lipinski's "rule of five" leads to an increase of 2.3% in the number of screening compounds passing the filter.[48] Going further, the logDpH concept facilitates the investigation and explanation of pharmacokinetic behaviour at any pH of pharmaceutical interest.

Figure 1.2: LogD vs. pH profile of buprenorphine. Curves were calculated with ACD/Labs (ACD Percepta 2015 Build 2726, Advanced Chemistry Development Inc., Toronto, Canada) (dark blue) and Instant J Chem (v.18.5.0, 2018, ChemAxon, Budapest, Hungary) (light blue). Blue crosses depict logD for the anion, neutral form, cation and at physiological pH calculated with ADMET Predictor (v8.5, Simulations Plus Inc., Lancaster, USA). Black circles and diamonds refer to experimental logD from [49]a and [50]b.

For example, the analgesic acetylsalicylic acid is a weak acid (pKa = 3.49) which is already absorbed in its neutral form in the stomach (pH = 1 - 1.5) and thus provides a faster effect than other COX inhibitors.[51] The negative charge, and hence lower logD7.4, in the blood prevents acetylsalicylic acid from passing back through the membrane (ion trapping).[52] Nevertheless, it has been common to use logP, in particular when this property is calculated. This practice is due to the availability of logP algorithms and to the computational assessment being simpler than for logDpH. Since the ionisation state is the only characteristic that changes the partition behaviour with regard to pH, logDpH can be calculated as a function of logP, pKa(s) and pH. A detailed description of the mathematical relationships for acidic and basic groups of mono- and multi-protic substances is beyond the scope of this thesis. As an example, equation 1.3 expresses logDpH as a function of pH and is applicable to zwitterionic species and ampholytic substances, such as buprenorphine:

\[ \log D_{\mathrm{pH}} = \log P - \log\left(1 + 10^{-\mathrm{p}K_{a2} + \mathrm{pH}} + 10^{\mathrm{p}K_{a1} - \mathrm{pH}}\right) \tag{1.3} \]
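Equation 1.3 can be evaluated directly, as in the following minimal Python sketch. The logP and pKa values are hypothetical placeholders chosen for illustration; they are not the constants of buprenorphine or of any other compound discussed here.

```python
import math


def log_d_ph(log_p: float, pka1: float, pka2: float, ph: float) -> float:
    """Equation 1.3 for an ampholyte.

    The term 10**(ph - pka2) grows with increasing pH, the term
    10**(pka1 - ph) grows with decreasing pH. Both lower logD below logP.
    """
    return log_p - math.log10(1.0 + 10.0 ** (ph - pka2) + 10.0 ** (pka1 - ph))


# Hypothetical ampholyte: logP = 4.0, pKa1 = 8.0, pKa2 = 10.0.
for ph in (2.0, 7.4, 9.0, 12.0):
    print(ph, round(log_d_ph(4.0, 8.0, 10.0, ph), 2))
```

In this sketch logDpH approaches logP only between the two pKa values, where the neutral form dominates, and drops by roughly one log unit per pH unit outside of this window.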

The combination of logP and pKa algorithms for predicting logDpH is discussed later in section 1.1.

Experimental Approaches to Determine Partition and Distribution Coefficients

Partition and distribution coefficients can be determined either directly or indirectly (Table 1.1). Direct approaches quantify the analyte in the biphasic system and insert the concentrations into equation 1.1 or 1.2. Indirect methods utilise experimental molecular parameters, such as chromatographic retention times and capacity factors, which are intrinsically linked to lipophilicity.[53]

Direct Methods

Besides introducing the n-octanol/water system, Hansch and Fujita also proposed the shake-flask method [45], which is considered the gold standard for the determination of logP and logDpH between -2 and 4.[54] A mechanical shaking procedure ensures analyte partitioning before quantification. The method is time-consuming, prone to emulsification, and requires relatively large amounts of a compound. Further, it is susceptible to impurities, compound concentrations must be below the aqueous or n-octanol solubility limits, and it is limited by the specificity of the applied detection method. Mitigating these drawbacks and scaling up throughput has been an ongoing process ever since. To avoid an increased transfer of n-octanol droplets into the excess aqueous phase, De Bruijn et al. proposed an alternative procedure: slowly stirring instead of shaking the assay.[55] Additionally, dialysis tubes can be used for shorter equilibration times and less emulsification at the solvent interface.[56] The traditional approach was developed further to enable simultaneous experiments for multiple compounds and lower compound consumption.[57] Automation enables rapid logDpH assessment for large compound arrays.[58] For example, Hitzel et al. transferred the experiment onto 96-well plates and performed preparation as well as sampling assisted by robotic systems. Notably, fast reversed-phase high-performance liquid chromatography (RP-HPLC) runs enable short analysis times.[59] Researchers from F. Hoffmann-La Roche introduced a carrier-mediated distribution system (CAMDIS) in 96-well plate format. Octanol is coated on a lipophilic membrane at the tube bottom to avoid emulsification.
In this approach, the analyte is quantified solely in the aqueous phase before and after shaking, and the concentration in n-octanol is back-calculated by the mass balance (equation 3.3). Recently, the use of 1H-NMR analysis with low-field (42.5 MHz) NMR instruments, which are affordable also for small laboratories, was proposed for lipophilicity determination.[60] The analyte is quantified solely in D2O before and after shaking the NMR tube. To account for the magnetic field drift, relative integrations standardised against the characteristic water peak were used. The method requires aqueous concentrations that enable accurate peak integration against the strong water peak in the spectra.
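The back-calculation from aqueous-phase measurements can be sketched as follows. This is an assumed, generic form of the mass balance (all analyte that leaves the aqueous phase is taken to reside in n-octanol); it is not a reproduction of equation 3.3, and the function names and numbers are hypothetical.

```python
import math


def octanol_concentration(c_aq_before: float, c_aq_after: float,
                          v_aq: float, v_oct: float) -> float:
    """Back-calculate the n-octanol concentration from the analyte
    lost by the aqueous phase (assumed closed mass balance)."""
    if not 0.0 < c_aq_after < c_aq_before:
        raise ValueError("expected 0 < c_aq_after < c_aq_before")
    return (c_aq_before - c_aq_after) * v_aq / v_oct


def log_d_from_aqueous(c_aq_before: float, c_aq_after: float,
                       v_aq: float, v_oct: float) -> float:
    """logD from aqueous concentrations measured before and after partitioning."""
    c_oct = octanol_concentration(c_aq_before, c_aq_after, v_aq, v_oct)
    return math.log10(c_oct / c_aq_after)


# Hypothetical assay: 90% of the analyte leaves 1.0 mL buffer into 0.1 mL n-octanol.
print(round(log_d_from_aqueous(1.0, 0.1, 1.0, 0.1), 2))
```

The sketch also makes the sensitivity of such set-ups visible: if the aqueous concentration barely changes, or drops close to the detection limit, the back-calculated value becomes unreliable.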

Indirect Methods

Chromatographic retention is strongly dependent on molecular lipophilicity and column chemistry. RP-HPLC in particular has advanced to a mainstream approach for indirect logP and logDpH determination.[61, 62] It has advantages over direct methods in terms of speed, ease of automation, on-line detection, insensitivity to impurities and degradation products (as long as the principal peak is known and separated), and reduced sample size. Moreover, RP-HPLC has proven to be useful in industrial settings.[63] However, for diverse sets incorporating neutral, acidic, and basic molecules, the correlation between retention and lipophilicity is often poor. Poole et al. state that molecular interactions contributing to RP-HPLC retention are similar but not identical to partition phenomena.[53] For a more realistic replication of the n-octanol/water partition system and to avoid unwanted intermolecular and molecule-sorbent interactions, the use of various masking agents has been investigated.[64] The most obvious choice is n-octanol as an additive to the mobile phase. Giaginis et al. highlighted its value for achieving improved correlations for acidic drugs.[65, 66] Lombardo et al. also used n-octanol as a masking agent for the indirect logD7.4 determination of neutral and basic drugs.[67] Further research utilised surfactants to create micro-emulsions in the mobile phase. In particular, a combination including butan-1-ol and heptane displays similarities to the n-octanol/water partition system.[68, 69] Often, the best correlations between retention and logP or logDpH are achieved with the capacity factor for water as the mobile phase (logKw). The value for logKw is usually extrapolated from isocratic capacity factors at several mobile-phase compositions. To reduce the screening time for isocratic systems and extend the applicability to wider retention ranges, Valko and coworkers experimented with short columns and fast gradient runs to determine the chromatographic hydrophobicity index (CHI).[70–72] They use the actual organic volume percentage in the mobile phase at the time the molecule exits the column as a hydrophobicity scale. It was shown that the CHI provides excellent correlation with aqueous solubility, permeation in the AMP assay, intrinsic clearance in human liver microsome preparations, hERG binding and promiscuity for tens of thousands of GlaxoSmithKline molecules.[10]
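The extrapolation of logKw from isocratic runs can be sketched in a few lines of Python. The sketch assumes a linear relation between log k and the organic modifier fraction, which is a common simplification rather than a statement about any specific published protocol; the retention data are hypothetical.

```python
def capacity_factor(t_r: float, t_0: float) -> float:
    """Capacity factor k from retention time t_r and column dead time t_0."""
    return (t_r - t_0) / t_0


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_log_kw(organic_fractions, log_k_values) -> float:
    """Intercept of the fitted line, i.e. log k at 0% organic modifier."""
    _slope, intercept = fit_line(organic_fractions, log_k_values)
    return intercept


# Hypothetical isocratic runs at 40, 50 and 60% organic modifier.
print(extrapolate_log_kw([0.4, 0.5, 0.6], [1.8, 1.4, 1.0]))
```

The extrapolated intercept then replaces the individual isocratic capacity factors as the lipophilicity index that is correlated with logP or logDpH.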

Table 1.1: Selected experimental methods for direct and indirect lipophilicity determination.

Method                         Implementation                                         Reference
-----------------------------  -----------------------------------------------------  --------------------
Direct Methods
Shake Flask Method (SFM)       Conventional                                           [45]
                               Miniaturised                                           [57]
                               Slow Stirring                                          [73]
                               High-Throughput Screening (HTS)                        [57, 59]
CAMDIS                                                                                [74]
1H-NMR                                                                                [60]
Indirect Methods
Chromatographic                Thin-Layer Chromatography (TLC)                        [75]
                               High-Performance Liquid Chromatography (HPLC)          [10, 63, 65, 66, 76]
                               Ultra-High-Performance Liquid Chromatography (UHPLC)   [77]
                               Centrifugal Partition Chromatography                   [50]
Two-Phase pH-Metric Titration                                                         [78]

Martel et al. employed isocratic and gradient modes on ultra-high-performance liquid chromatography (UHPLC) for short analysis times, improved separation and reduced sample size.[77] They provided a standardised logP set to train or benchmark in silico models by considering a maximally diverse selection of 759 compounds from the ZINC database.[79] A less popular indirect method is liquid-liquid chromatography, although the lack of a solid support avoids the problem of adsorption. In centrifugal partition chromatography (CPC), centrifugal forces keep one immiscible phase stationary while the other is pumped through the system as the mobile phase. El Tayar et al. evaluated several CPC techniques and were able to reproduce logP values, determined by SFM, for 89 building blocks such as small alcohols and benzoic acids.[80]

The discussed methods represent only a small selection of conceivable approaches because chromatographic retention comprises a complex interplay between molecule and column chemistry, mobile-phase composition, temperature, and flow rate. The nearly endless number of possible conditions makes accurate experimental replication and inter-laboratory comparison difficult. Despite the efforts that have been put into this field, a universal approach does not exist. In a personal communication, Michael Kossenjans (project manager of discovery sciences at AstraZeneca) told me that they prefer to rely on their SFM implementation rather than on chromatographic indices.

Generally, logP and logDpH are also related to the difference between the pKa of the solute in the aqueous phase and the apparent pKa (pKa^app) in the biphasic partition system. The difference can be measured potentiometrically.[81] This technique is an accurate alternative to SFM, but several measurements at different volume ratios must be conducted. It has been used occasionally in small-scale experiments.[82–84]

In Silico Calculation of Partition- and Distribution-Coefficients

The discussed experimental techniques for large-scale logP and logDpH determination (section 1.1) constitute one possible option to address the demand for lipophilicity assessment in drug discovery and design. Alternatively, these parameters can be computationally predicted, saving the material costs, manpower, and time that experiments incur. It is common practice to describe the lipophilicity of chemical datasets by "clogP", "AlogP" or various other calculated partition and distribution coefficients. LLE and multiparametric guidelines for drug discovery (discussed in section 1.1) also consider calculated values. Computational methods can be divided into "classic" fragment- or atom-based models (Table 1.2) and quantitative structure-property relationship (QSPR) models (Table 1.3), which are discussed in this order in this section.

"Classic Models"

In the 1960s (two years after establishing SFM), Hansch and Fujita were amongst the first to calculate logP values, assuming lipophilicity to be an additive property.[85] Their approach was suited for analogues with different residues decorating one molecular scaffold. The study laid the cornerstone for the development of "classic" models. Nys and Rekker introduced fragmental values, derived in a reductionist manner, which are summed up to yield the final logP value.[86] Their PROlogP model allowed calculations for novel scaffolds with unknown logP. The authors included a correction factor to account for the proximity effect: two close hydrophilic groups will bind fewer water molecules than each group in a non-nearest-neighbour configuration, resulting in a higher observed lipophilicity than calculated. Continuous method improvements include the introduction of more correction factors, for example for aromatic condensation, cross-conjugation, and hydrogen bonding.[87] A selection of classic models all sharing the same principle is shown in Table 1.2. The models differ in the use of (i) splitting rules to derive predefined fragments or atoms from a given molecule, (ii) training data, and (iii) correction factors.
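The additive principle with a proximity correction can be sketched as follows. All fragment values and the size of the correction are invented for illustration only; they are not the constants of PROlogP, ClogP or any other published fragment system.

```python
# Hypothetical fragment contributions (illustrative numbers only).
FRAGMENT_VALUES = {"CH3": 0.7, "CH2": 0.5, "C6H5": 1.9, "OH": -1.5}

# Hypothetical correction added per pair of close hydrophilic groups.
PROXIMITY_CORRECTION = 0.3


def fragment_log_p(fragment_counts: dict, n_proximity_pairs: int = 0) -> float:
    """Sum fragment contributions and add one correction per proximity pair.

    The correction is positive because two close hydrophilic groups are
    observed to be more lipophilic than the plain sum predicts.
    """
    additive = sum(FRAGMENT_VALUES[name] * count
                   for name, count in fragment_counts.items())
    return additive + n_proximity_pairs * PROXIMITY_CORRECTION


# A hypothetical molecule split into CH3, CH2 and OH fragments.
print(round(fragment_log_p({"CH3": 1, "CH2": 1, "OH": 1}), 2))
# The same fragment set with one pair of hydrophilic groups in close proximity.
print(round(fragment_log_p({"CH2": 2, "OH": 2}, n_proximity_pairs=1), 2))
```

Real fragment systems differ mainly in how the molecule is split, which data the fragment values were fitted to, and how many such corrections are applied.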

Table 1.2: Selected "classic" methods for logP prediction.

Method      Implementation                         Reference
----------  -------------------------------------  ---------
π-System    Fragment-Based                         [85]
PROlogP                                            [86]
clogP                                              [88]
PlogP                                              [89]
ACDlogP                                            [90]
AlogP       Atom-Based                             [91]
HlogP       Atom- and Fragment-Based Combination   [92]
KlogP                                              [93]

Possibly the most popular classic models are ClogP [88] and AlogP [91]. ClogP is fragment-based and includes numerous correction factors. AlogP sums up atomic contributions to lipophilicity and uses no correction factors. Viswanadhan et al. developed a combination of fragmental and atomic contributions (HlogP) and compared their predictions to AlogP and ClogP. In this analysis they emphasised the need for accurate models for "larger" structures, as they recognised that the prediction error increased with molecular weight. Mannhold et al. confirmed this observation in an exhaustive benchmark analysis of 30 logP predictors. They noticed a linear decrease in performance with an increasing number of non-hydrogen atoms (NHA). For molecules with NHA > 40, all models performed worse than an arithmetic average model (AAM) that takes the average logP of a given dataset as the prediction for each entry. ACD/Labs (Advanced Chemistry Development Inc., Toronto, Canada) provides a popular fragment-based lipophilicity calculator (ACDlogP) [90] that can also be applied to derive logDpH. ACDlogD is calculated as a function of ACDlogP and a predicted pKa. Besides ACDlogD, the logD predictor implemented in Instant J Chem (ChemAxon, Budapest, Hungary) is also investigated in this thesis. This method considers a consensus value from HlogP and KlogP plus a pKa model which is based on atomic partial charges and polarisability.
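The arithmetic average model serves as a baseline that any useful predictor has to beat. A minimal sketch, using the root-mean-square error as an assumed error measure and hypothetical logP values, could look like this:

```python
import math


def rmse(y_true, y_pred) -> float:
    """Root-mean-square error between experimental and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def aam_predictions(y_true):
    """Arithmetic average model: predict the dataset mean for every entry."""
    mean = sum(y_true) / len(y_true)
    return [mean] * len(y_true)


# Hypothetical experimental logP values and hypothetical model predictions.
experimental = [1.0, 2.0, 3.0]
model = [1.5, 2.0, 2.5]

baseline_error = rmse(experimental, aam_predictions(experimental))
model_error = rmse(experimental, model)
print(round(baseline_error, 3), round(model_error, 3))
print("model beats AAM:", model_error < baseline_error)
```

A predictor whose error exceeds that of the AAM on a given subset, as reported above for large molecules, carries no usable information for that subset.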

Quantitative structure-property relationship (QSPR) models

Building a relation between a numerical representation of the molecular structure and logP or logDpH is a different in silico approach from the classic models. The superordinate concept is called quantitative structure-property relationship (QSPR) [94, 95] and is applied to predict various properties. A detailed discussion of implementations and applications beyond lipophilicity is presented in section 1.3. Since logP and logDpH are continuous values, QSPR models for lipophilicity prediction are regression models.

Moriguchi et al. presented MlogP, which calculates logP by multivariate linear regression (MLR) from solely the sums of lipophilic and hydrophilic atoms and eleven correction factors.[96] At that time, MlogP was widely accepted because it was a cheap alternative to classic models in terms of computational cost. ADMET Predictor (Simulations Plus Inc., Lancaster, USA) provides logD calculations based on its s+logP and s+pKa algorithms. In principle, no limits are set regarding the complexity or dimensionality of the molecular representation. For example, Riniker utilised structural information retrieved from short molecular dynamics (MD) simulations in water and vacuum for logP prediction in octanol/water, hexadecane/water and cyclohexane/water.[97] Recent studies focus on the use of machine learning (ML) to solve the given regression task (Table 1.3). Artificial neural networks and their ensembles (ANN, ANNE), regularised regression (LASSO), random forests (RF), support vector regression (SVR), and Gaussian processes (GP) are frequently applied algorithms. A main topic of this thesis is the application of ML to obtain logDpH QSPR models which can be used as predictors in drug discovery. A detailed description and discussion of several ML techniques follows in section 1.3. To teach QSPR models the relation between molecular structure and the desired property, appropriate datasets are required.
The PHYSPROP database contains names, structures, and physicochemical properties of more than 41,000 small molecules and drug-like compounds (http://www.srcinc.com/what-we-do/environmental/scientific-databases.html). A database containing physicochemical properties is integrated in the online chemical modelling environment (OCHEM; http://www.ochem.eu) with the intention of providing a widely used platform for QSAR/QSPR studies and for sharing results and experience.[98] The CHEMBL database (https://www.ebi.ac.uk/chembl/) also allows searching for experimental lipophilicity data of bioactive drug-like small molecules.[99] Aside from publicly or commercially available data, pharmaceutical companies can work with their in-house data. For example, Bruneau et al. used data on 5,000 AstraZeneca compounds and Schroeter et al. data on 14,500 compounds from Bayer Schering for model training.[100, 101]
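The regression idea behind an MLR-type QSPR model can be sketched in pure Python. The example fits an intercept and two weights to counts of lipophilic and hydrophilic atoms by solving the normal equations. It is a toy illustration under stated assumptions, not the MlogP parameterisation: the descriptors, the training values and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        s = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - s) / m[row][row]
    return x


def fit_mlr(descriptors, targets):
    """Least-squares weights via (X^T X) w = X^T y; w[0] is the intercept."""
    x = [[1.0] + list(row) for row in descriptors]
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(weights, row):
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Hypothetical training set: (lipophilic atoms, hydrophilic atoms) -> logP.
descriptors = [(6, 1), (4, 2), (8, 0), (5, 3), (3, 1)]
targets = [2.4, 0.6, 4.2, 0.3, 0.9]
weights = fit_mlr(descriptors, targets)
print([round(w, 3) for w in weights])
print(round(predict(weights, (7, 2)), 2))
```

The ML algorithms listed in Table 1.3 replace the linear map by more flexible functions, but the workflow of fitting on compounds with known values and predicting new ones stays the same.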

Table 1.3: Selected QSPR methods for logP and logDpH (gray) prediction. LogD7.0 is calculated by Schroeter et al. The others calculate logD7.4. For a detailed introduction to molecular "descriptors" see section 1.3.

Method              Algorithm    Descriptors                                       Reference
------------------  -----------  ------------------------------------------------  ----------
s+logP              ANNE         217 uncorrelated 2D descriptors                   [102]
MlogP               MLR          13 1D descriptors                                 [96]
AlogPS2.0           MLR, ANN     E-state indices                                   [103]
Visconti et al.     PLS, SVR     volsurf+ and 2D descriptors                       [104–106]
Bruneau et al.      ANN          56 2D-3D descriptors                              [100]
Schroeter et al.    GP           904 constitutional, topological, geometrical,     [101, 107]
                                 WHIM, GETAWAY descriptors, functional group
                                 counts and molecular properties
Ognichenko et al.   RF           connectivity indices, partial charges,            [108]
                                 refraction
Wang et al.         PLS, SVR     30 2D descriptors                                 [109]
Riniker             LASSO, GTB   Molecular Dynamics Fingerprints (MDFP+)           [97]

In silico models and indirect experimental methods share one commonality: their success depends on the characteristics of the molecules with known logP or logDpH used for training and calibration. Reliable predictions or determinations cannot be expected for compounds outside the known ranges of logP or logDpH, of the given molecular representation (structural "similarity"), and of the calibrated retention times or capacity factors in HPLC measurements. On the computational side, the concept of the "Applicability Domain" (AD) aims either to assess the reliability of a prediction or to detect outliers with regard to the training data (cf. section 1.3).
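One of the simplest ways to express an applicability domain is a range check on every descriptor. The sketch below assumes such a bounding-box definition purely for illustration; it is only one of several possible AD definitions, and the descriptor values are hypothetical.

```python
def descriptor_ranges(training_rows):
    """Minimum and maximum of every descriptor column in the training data."""
    columns = list(zip(*training_rows))
    return [(min(column), max(column)) for column in columns]


def in_domain(row, ranges) -> bool:
    """True if every descriptor of the query lies within the training range."""
    return all(low <= value <= high for value, (low, high) in zip(row, ranges))


# Hypothetical training descriptors: (logP, molecular weight).
training = [(1.0, 200.0), (3.0, 450.0), (2.2, 310.0)]
ranges = descriptor_ranges(training)

print(in_domain((2.0, 300.0), ranges))   # query inside the training ranges
print(in_domain((2.0, 900.0), ranges))   # molecular weight outside the range
```

A query flagged as outside the domain is not necessarily predicted wrongly, but its prediction is an extrapolation and should be treated with caution.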

1.2 Peptides in Drug Discovery

The majority of marketed drugs are either small, synthetic molecules or belong to the class of genetically engineered bio-macromolecules. Today, we are witnessing emerging interest in peptides to fill the structural gap between both extremes. From 2012 to 2016, 7% of the drugs approved by the US Food and Drug Administration (FDA) were based on peptidic structures.[110] The same source reports that 80% of the FDA-approved peptides are no longer than 10 amino acids (AA). As of 2016, there are 60 approved peptide-based therapeutics on the market.[111] With approximately 140 compounds in different stages of clinical trials and over 500 in preclinical development, the number of approvals is likely to increase.[112, 113] The main indication areas are metabolic diseases and oncology, wherein the glucagon-like peptide-1 receptor, the somatostatin receptor and gonadotropin-releasing hormone receptors are the most frequent targets.[114]

In the beginning of the last century, wound healing effects of antimicrobial peptides (AMPs) named Gramicidins were noticed and clinically used.[115, 116] Today's AMPs are typically 10-100 AA long, possess positive charges and have an amphiphilic nature.[117] AMPs interact directly with bacterial membranes, although the actual mode of action is not fully understood.[118, 119] The membranolytic character of these peptides makes them particularly interesting in times of growing bacterial resistance. Recent developments comprise the numerical description and analysis of AMP chemical space [120], de novo generation of AMPs with improved potency over their natural template [121] and ML classification models to predict activity [122]. Membranolytic properties are reported for anticancer peptides (ACP) as well.[123] ACPs specifically target cancer cell membranes, making them less prone to undesired side effects and growing cellular resistance. Grisoni et al. were the first to apply a long short-term memory recurrent neural network, trained on the peptide alphabet and fine-tuned with known ACPs, for de novo sequence generation. Six of the twelve synthesised peptides were selectively active against MCF7 human breast adenocarcinoma cells with no membranolytic properties against human erythrocytes.[124]

Besides direct target interaction, cationic, amphipathic and hydrophobic peptides can provide appropriate delivery systems for macromolecular drugs into cells.[125, 126] Nevertheless, it remains difficult to penetrate only specific cells, and the underlying mechanisms are not sufficiently understood for the translation of cell-penetrating peptides (CPP) into clinics. Selective drug delivery into diseased tissues and organs can be achieved with peptide-drug conjugates as well.[127] For example, tumor necrosis factor α fused to the tumor-homing peptide Asn-Gly-Arg (NGR-TNFa) entered clinical trials. The tripeptide is known to selectively bind CD13, which is overexpressed on tumor blood vessels. NGR-TNFa stabilized 50% of treated patients, and weekly dosing maintained this stabilization for a median time of more than 9 months, with limited toxicity.[128, 129]

Short peptides are also seen as promising alternatives to small molecules for the modulation of protein-protein or peptide-protein interactions (PPI).[130] PPIs have large contact surfaces (approximately 1500-3000 Å²) compared with protein-small molecule interactions (approximately 300-1000 Å²); these surfaces are typically flat and lack grooves and pockets as well as distinct pharmacophoric features. These characteristics reduce the chance for small molecular agents to exhibit strong ligand properties.[131] Recent research investigates whether peptides can overcome these challenges owing to their structural flexibility, which allows them to mimic the protein surface structure. Peptides of 8 to 16 AA were found to antagonise the interactions of p53 with the ubiquitin ligases hDM2 [132] and MDM2 [133–135], two validated cancer targets. Other peptides for anticancer therapy interfere with the hypoxia inducible factor-1 (HIF-1) complex formation [136], disrupt the interaction of anti-apoptotic BCL-2 proteins [137–139] and prevent oncogenic activation of the Wnt pathway [140, 141]. Tavassoli et al. reported a cyclic pentapeptide that disrupts the interaction of the HIV Gag protein with the host protein tumor susceptibility gene 101, thereby inhibiting HIV budding.[142] For potential anti-inflammatory treatment, bicyclic peptide scaffolds were investigated for their ability to hinder the tumor necrosis factor α (TNF-α) / -receptor interaction.[143] Glas et al. designed constrained peptides inhibiting the pathogenic interaction between virulence factor exoenzyme S (ExoS) of Pseudomonas aeruginosa and the human protein 14-3-3. The cornerstone of the structure-based development was an 11 AA long ExoS stretch that contributes most to binding.[144]

Khazanov and Carlson calculated the median length of active protein-ligand binding sites as 11 residues.[145] The authors estimated that a tetrapeptide library of 83000 entries would cover all unique known protein binding regions. Experience has shown that positional scanning of peptide libraries with 6 to 12 residues leads to successful ligand identification.[146, 147]

Advantages and Drawbacks of Peptides

Peptides are effectors of most (patho-)physiological signal transduction processes and hence present reasonable starting points for compound optimisation. Often representing a small functional part of a protein, therapeutic peptides offer greater efficacy and selectivity than small molecules.[148] The close structural relationship between drug candidate and physiologically active template translates into a reasonable off-target profile.[149, 150] Since the degradation products are amino acids, a low risk of systemic toxicity and drug-drug interaction can also be expected. In comparison to recombinant proteins and antibody drugs, peptides possess reduced immunogenicity, and their smaller size and hydrophilic nature promote deeper tissue penetration.[151] Also, manufacturing is less expensive than recombinant production. In particular since the advent of solid-phase peptide synthesis (SPPS), initiated by Merrifield in 1963 [152], fast, automated synthesis following standard protocols has become a crucial part of preclinical research. The combinatorial nature of SPPS facilitates the creation of comprehensive chemical libraries for screening purposes, comprising linear and cyclic peptides that are based on any desired natural or non-natural AA building block.

In spite of all listed advantages, the development of applicable therapeutic peptides remains challenging. Most importantly, natural peptide sequences are exposed to rapid protease degradation in the digestive system and blood plasma. The short half-life, paired with low membrane permeability, results in low bioavailability and requires parenteral administration in most cases. Targeting the central nervous system is not feasible because of their inability to pass the blood-brain barrier. Such caveats are well studied, and chemical modification strategies for natural peptides have been proposed as a remedy.
In fact, peptidic therapeutics that move beyond early drug development incorporate non-natural components. The following section provides a summary of popular strategies to turn a natural peptide "hit" into a potential drug candidate that combines the benefits mentioned here with those of synthetic small molecules.

Overcoming the Drawbacks of Peptides by Combining Biotechnology and Medicinal Chemistry

Natural AA sequences from endogenous ligands, library screening or de novo design have inherently poor PK (cf. section 1.2). For developing actual therapeutic peptides, these limitations must be overcome, even for parenteral administration. In particular, the severely low systemic stability must be resolved. The following strategies are typically employed to translate natural peptide templates into clinical candidates.

L to D amino acid substitution

The majority of human proteolytic enzymes typically hydrolyse L-configured AAs. Introduction of D-configured AAs consequently improves stability.[153] The vasopressin analogue Desmopressin (Figure 1.3) is an approved and marketed example: arginine in position eight of the octapeptide was changed to the D-configuration, leading to slower metabolism in comparison to Vasopressin. Desmopressin can be administered nasally, orally and sublingually besides the common intravenous route.[154] The modification also results in enhanced antidiuretic activity over Vasopressin but approximately 1500-fold less vasoconstriction, affecting the clinical indication area.

Introduction of non-natural amino acids

Improved metabolic stability with simultaneously maintained potency and selectivity can be achieved by introducing synthetic, chemically modified AAs. For example, the cyclic oxytocin analogue Carbetocin contains a methyltyrosine and a thioether in place of a disulfide bond (Figure 1.3). It has a prolonged half-life of 85-100 minutes, which makes continuous intravenous infusion unnecessary.[155, 156] Another strategy to improve the stability is modification of the peptide backbone. Especially amide bond N-alkylation results in protected and conformationally restricted peptide backbones.[157] N-alkylation does not solely affect the conformation of the non-natural AA, but also that of the adjacent residues. The steric hindrance reduces entropic penalties upon binding, which can otherwise be the reason for low affinity and selectivity. Introducing rigidity into molecules with low rotational barriers is particularly powerful if the bioactive conformation is known and can be adopted. A recent study showed enhanced intestinal permeability of N-methylated cyclo(-D-Ala-Ala5-) peptides.[158] So-called N-alkyl scans (sequential alkylation of one amide bond at a time) can help to identify pharmacologically relevant residues. For adopting the less frequent cis conformation, the unique tertiary amide bond of proline is exploited. The conformational influence of proline is thus similar to that of N-alkylated AAs.[159] A further opportunity to avoid rapid proteolysis is the introduction of β-AAs or peptoids.[130] The latter have the side chain attached to the backbone nitrogen instead of to the α-carbon.

Cyclic peptides

Cyclisation also enhances conformational rigidity and the protection of C- and N-termini against peptidases. Cyclisation is feasible by chemical ligation in various positions such as (i) head-to-tail, (ii) side chain-to-side chain, or (iii) chain-to-tail formation of amide bonds, (thio-)ethers, lactones or disulfide bonds.[130, 160] In 2000, "hydrocarbon stapling" was introduced as a cyclisation technique that constrains α-helical peptides.[161, 162] The method utilises α-substituted non-natural AAs with chemically stable alkenyl moieties, linked together via olefin metathesis, commonly at a spacing of 4 or 7 residues between the AAs. Mimicking the α-helical structure was a design strategy in the previously discussed studies from Bernal et al., Grossmann et al. and Glas et al.[132, 140, 144]

Cyclised peptides are referred to as macrocycles because of the size of their ring system. For example, the macrocycle of the calcineurin inhibitor Cyclosporin A incorporates 11 AAs (Figure 1.3). Cyclosporin A has a natural origin; it was isolated from Tolypocladium inflatum and Cylindrocarpon lucidum. The cyclic structure enhances the enzymatic stability. Moreover, Cyclosporin A can be administered orally. The key factor for the passive membrane permeability is the structural adaptability to the polarity of the environment.[163] Besides Cyclosporin, more than 40 macrocyclic peptides were on the market in 2017. The majority come from natural templates such as antimicrobials or human peptide hormones.[164]

β-turn Mimetics

Mimicking protein epitopes enables the study of molecular recognition in physiological processes. β-turn structures in endogenous proteins and peptides are a major recognition motif for PPIs and G-protein-coupled receptor (GPCR) interactions.[165] Also, β-hairpin shaped peptides have been found to represent a novel class of antibiotics against Gram-negative Pseudomonas that block the outer-membrane biogenesis.[166] Polyphor AG (Allschwil, Switzerland) successfully identified the compound Murepavadin (POL7090) against Pseudomonas aeruginosa, which is currently under investigation in clinical phase III.[167] The same company found β-hairpin peptides that antagonise the p53/HDM2 interaction.[168] Such "protein epitope mimetics" attract considerable interest in the field of vaccinology for defining starting points for B cell epitope mimetic designs.[169]

Figure 1.3: Approved and marketed cyclic peptide mimetics exhibiting chemical modifications discussed in this section: L to D AA substitution (green), methylation of the AA nitrogen (blue), introduction of non-natural amino acid side chains (red). In Desmopressin, one cysteine is deaminated (also red).

The presented strategies are commonly combined, making peptide re-engineering a multidimensional process. The discussed drugs Carbetocin and Desmopressin are cyclic and possess D-configured or non-natural AAs. The cyclic structure of Cyclosporin A incorporates an N-methylated backbone. Daptomycin, an antibiotic lipopeptide, contains D- and D- as well as L- and L-3-methylglutamic acid (Figure 1.3). The structure comprises a ten-membered AA macrocycle and a three-membered exocyclic tail, which is coupled to decanoic acid. Walport et al. stated that cyclic peptides in particular present a striking starting point for therapeutic agents and predicted a bright future for this compound class owing to the possibility of furnishing existing scaffolds with more drug-like properties.[170]

Lipophilicity of Peptides and Peptide-Mimetics

Short, natural peptides are typically polar molecules owing to their amide linkages and ionisable side chains. Free C- and N-termini contribute additional polar groups, namely a carboxylic acid and an amine. Modifying these templates as described in the previous section can change the lipophilicity drastically. As for small molecules, knowledge of this key physicochemical property is of considerable interest for understanding and predicting PK/PD behaviour and for guiding peptide drug design.
The most comprehensive publicly available collection of peptide lipophilicity data was provided by Akamatsu and coworkers.[171–173] LogD7.0 and logD7.4 were determined by SFM for 210 linear di- to pentapeptides. The authors derived equations to correlate ∆logDpH with free-energy-related physicochemical parameters of the side chain substituents and defined an AA hydrophobicity index. In some PK studies, Caco-2 cell permeability [174–176] and hepatic uptake [177] were correlated with lipophilicity. LogD7.2 was also determined for a few well-known biologically active peptides, such as Angiotensin, Cyclosporin, Vasopressin and Thyrotropin Releasing Hormone (TRH) (Table 1.4).

Table 1.4: LogD7.2 of some peptide drugs or their analogs.

Peptide        Tested Analog          logD7.2    Reference

Angiotensin    [Leu8]angiotensin 2    0.18       [178]
Cyclosporin    -                      2.92       [179]
TRH            -                      -2.46      [180]
Vasopressin    125I-vasopressin       -2.15      [181]

In 2006, Thompson et al. published a logP collection that also incorporates the aforementioned experimental data.[182] The authors analysed the applicability of logP predictors to the peptides and observed high errors for cyclic and ionisable structures. The dataset was updated in 2013 and now comprises 428 entries.[183] This relatively small number of SFM data for peptides stands in stark contrast to the plethora available for drug-like small molecules.
In 1988, Mant et al. started to use AA retention times and hydrophobicity scales to predict peptide retention on RP-HPLC.[184] Later, a comprehensive dataset of more than 2100 peptides was introduced and used to derive individual AA group retention coefficients.[185] In one study, the retention times of a set of endothelin and neurokinin receptor antagonists were utilised to determine logP indirectly. Still, the indirect estimation of partition or distribution coefficients by HPLC did not find broad application for peptides.[186]
As a consequence of the scarcity of experimental data, the stock of computational approaches specifically designed for peptides is limited. The work of Akamatsu and coworkers yielded several empirical equations to predict logD of di- to pentapeptides.[171–173] However, they lose validity when the training set is extended. Another approach followed the concept of the "classic" lipophilicity predictors: 219 natural di- to pentapeptides were fragmented into single AAs. Each AA contributed additively to logP, and only two correction factors for blocked and unblocked C- and N-termini were employed.[89] The applicability of this model is questionable insofar as it was evaluated for only ten novel peptides.
One recent study presented QSPR models, based on machine learning, to predict logD7.4 for di- to pentapeptides.[104] The mechanistic interpretation of the models revealed that hydrogen-bond donor groups influence the hydrophilic character of a peptide more than hydrogen-bond acceptors do. The authors utilised molecular representations based on the three-dimensional structure, arguing that their small peptides “...can be structurally considered as small organic molecules and thus it is legit to use standard building tools for preparing their 3D structure”. Alternatively, peptide-specific conformation (ensemble) generators like PEP-FOLD [187] can be employed. PEP-FOLD is applicable to linear, natural peptides with a length between 5 and 50 AAs. Using several predicted conformations of the highly flexible peptides (or a consensus thereof) to retrieve their 3D information could be the more accurate mapping.

Measuring and modelling lipophilicity in terms of partition and distribution coefficients, properties with tremendous impact on drug discovery, has so far extended only marginally to the field of peptides and peptide-mimetics. Addressing this lack of bespoke models is motivated by the hypothesis that small molecule-derived models are unreliable for structurally dissimilar compounds. Testing this hypothesis is a key subject of this thesis (cf. chapter 2). Further, we advocate modelling logDpH because peptides and peptide-mimetics very likely incorporate ionisable functions and often possess zwitterionic character (cf. Figure 4.12).

1.3 Machine Learning for the Prediction of Pharmaceutically Relevant Properties

The previous section pointed out that bespoke lipophilicity models for peptides and peptide-mimetics are rare despite their significance for contemporary drug discovery. In this section, the machine learning (ML) techniques used in this thesis to derive such models are discussed. Apart from lipophilicity predictions (cf. section 1.1), ML has a tremendous impact on the general direction of research in the field of chemoinformatics.[188, 189] Much effort has been put into the optimisation of ADME-Tox modelling.[190, 191] Nowadays, machine learning models exist to predict aqueous solubility [192], pKa [193], oral bioavailability [194] and binding to human serum albumin [195], to name just a few applications.
The name machine learning, a subfield of artificial intelligence, was coined in 1959 by Arthur Samuel.[196] The question arose how "... to construct computer programs that automatically improve with experience", in order to give a computer the ability to learn from data "... without being explicitly programmed".[197] In chemoinformatics, these data are typically numerical representations of molecules. "Describing" molecular attributes in a meaningful, context-related way presents the first task of any modelling campaign (Figure 1.4 A). Next, the data are shown to an ML model, which learns to solve the given problem without static program instructions.[198] This data-driven approach can be differentiated into supervised and unsupervised learning (Figure 1.4 B).
In unsupervised learning, the input data X are not related to any response, meaning the algorithm is "learning without a teacher".[199] The applied algorithms search for patterns within X, for example to extract useful information and discover trends in large datasets, to cluster the objects according to some attribute, or to describe the data with novel latent variables.[199, 200]
In supervised learning, the input data X are connected to their desired output y, which is either a class label (classification problem) or a continuous value (regression problem). The latter is true for logP and logDpH predictions. The goal is to learn a rule that generalises the relation between X and y.[201] In contrast to unsupervised learning, the prediction $\hat{y}_i$ provided for each $x_i$ in the training set is associated with an error that is expressed by some loss function $L(\hat{y}_i, y_i)$. Thus, many supervised learning problems are formulated as the minimisation of such a loss function by iterative optimisation, for example with methods based on gradient descent.[202] The procedure starts by initialising the coefficients ("weights") for the input X and calculating $L(\hat{y}_i, y_i)$. By calculating the derivative of $L(\hat{y}_i, y_i)$ with respect to the coefficients, the algorithm obtains the slope of the loss and can update the coefficients to move towards the minimum. Supervised learning algorithms allow intervention from outside, broadly speaking in terms of model complexity, regularisation and update procedure. This is directed by model-specific parameters, whose exploration is part of the model development (hyper-parametrisation). In order to find the best solution for a given problem, the supervised models must be evaluated (cf. section 1.3).
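The iterative update described above can be illustrated with a minimal sketch. It assumes a linear model and the mean squared error as loss function; the function name, learning rate, iteration count and toy data are hypothetical choices for illustration and are not taken from this thesis.

```python
def gradient_descent(X, y, learning_rate=0.05, n_iter=2000):
    """Fit y ~ w.x + b by minimising the mean squared error (illustrative only)."""
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features  # initialise the coefficients ("weights")
    b = 0.0
    for _ in range(n_iter):
        # Prediction error (y_hat_i - y_i) for every training object
        errors = [
            sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for xi, yi in zip(X, y)
        ]
        # Derivative of the loss with respect to each coefficient (its slope)
        grad_w = [
            2.0 / n_samples * sum(e * xi[j] for e, xi in zip(errors, X))
            for j in range(n_features)
        ]
        grad_b = 2.0 / n_samples * sum(errors)
        # Step against the slope to move towards the minimum
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical values)
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [1.0, 3.0, 5.0, 7.0]
w_fit, b_fit = gradient_descent(X_toy, y_toy)
```

The learning rate and the number of iterations are examples of the model-specific parameters mentioned above: a step size that is too large lets the updates diverge, whereas one that is too small slows convergence.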

Figure 1.4: Dataset Preparation and Machine Learning Workflow. A: Information about each molecule is expressed in a computer-readable format ("features" or "descriptors") and scaled to avoid over-fitting single attributes. The final feature set serves as the input X. B: Optionally, ML techniques can be applied to select features that are highly meaningful for the given task. Exploring and describing the data X is the domain of unsupervised learning. Supervised algorithms learn to build a function that maps the input X to the given target y. CV: cross validation; RMSE: root mean squared error.

In the following, the single steps in a machine learning scenario are presented in detail. The discussion on ML algorithms focuses on supervised and unsupervised methods that are relevant in the context of this thesis.

Molecular Representation

Before building a relationship between a molecular structure and its activity (QSAR) or some property (QSPR), it is necessary to define that structure. A particular challenge in chemoinformatics is capturing the context-relevant information regarding the outcome to be predicted. In other words, the molecular information must be provided in a way that the machine is able to read, interpret and employ to solve a given task. Since the advent of computer-assisted drug discovery, several thousand such molecular "descriptors" or "features" have been proposed.[203] Machine learning algorithms like the LASSO regression (cf. sections 1.3 and 4.1) can be utilised for "feature selection".[204]
Molecular features are usually differentiated by the structure dimensionality necessary for their calculation:

One-Dimensional Features

One-dimensional (1D) or "constitutional" features are calculated from the chemical formula without any information about atom connectivity. They provide a global representation based on the complete molecule. Examples of 1D features are the molecular weight (MW), atom, ring and bond counts, and polar surface area approximations.[205] Typically, logP and logDpH are denoted as 1D features (Figure 1.5) due to their global character, although the "classic" models rely on fragmentation rules that also consider connectivity (cf. section 1.1).

Two-Dimensional Features

Two-dimensional (2D) features are based on topological molecular graphs that represent atom connections in molecules. The first topological descriptor, developed in 1947 and still applied, is the Wiener index. It is defined as the sum of the numbers of edges in the shortest paths between all pairs of non-hydrogen atoms in a given structure.[206] Further prominent examples are Zagreb indices [207], Balaban J indices (average distance sum connectivity) [208], the Kappa shape indices [209] and Moreau-Broto topological autocorrelation indices [210]. Pharmacophore descriptors like the chemically advanced template search (CATS) proposed by Reutlinger et al. are of topological nature as well.[211] In this example, the molecular structure is reduced to the molecular graph, and the pharmacophore types lipophilic (L), aromatic (R), hydrogen-bond acceptor (A) and hydrogen-bond donor (D) are assigned. Subsequently, atom pairs of all possible type combinations up to a distance of ten bonds are counted. The sum over the respective feature type combination λ serves for scaling (division).
A crucial step for high-speed molecular screening, similarity searching and substructure analysis has been the introduction of molecular fingerprints, which provide an abstract representation of structural features encoded in a boolean array or bit map. Fingerprints are generated from the molecule itself: each pattern (substructure) serves as a seed to a pseudo-random number generator, which results in a bit vector ("hashing"). The vectors created in this way are combined into the final fingerprint. Since each substructure creates its own set of bits, the fingerprint indicates with certainty whether a substructure is missing, and with some probability whether it is present. Fingerprint patterns can overlap, meaning that each pattern shares a portion of itself with other patterns.
Increasing structural complexity consequently leads to a more accurate characterisation by the fingerprint. Extended connectivity fingerprints (ECFP) define substructures starting from each non-hydrogen atom and "look" at the neighbourhood of this atom in circular layers up to a predefined distance.[212]
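As a small illustration of a topological feature, the Wiener index defined above can be computed from a hydrogen-suppressed molecular graph by a breadth-first search. The following is a minimal sketch; the adjacency-dictionary representation, the function name and the two example graphs are assumptions made for illustration.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path lengths (in bonds) over all pairs of heavy atoms."""
    total = 0
    for source in adjacency:
        # Breadth-first search yields shortest paths in an unweighted graph
        distance = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in distance:
                    distance[neighbour] = distance[atom] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    return total // 2  # every atom pair was visited from both ends


# Hydrogen-suppressed graphs of two C4H10 isomers (atom index -> neighbours)
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

w_linear = wiener_index(n_butane)     # 10
w_branched = wiener_index(isobutane)  # 9
```

Both isomers share the same 1D features (formula, molecular weight), but the branched graph is more compact and therefore receives the lower index, which illustrates the information that connectivity adds.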

Figure 1.5: Selection of 1D - 3D molecular features. This visualisation is adapted and extended from [213] and [214] for the model peptide YPWF-NH2.

Three-Dimensional Features

Three-dimensional (3D) or "geometrical" features are calculated from 3D atom coordinates and represent spatial properties. Thus, they depend on the molecular conformation. Examples are various molecular shape properties and distributions, and the weighted holistic invariant molecular (WHIM) descriptor, which captures 3D information on molecular shape, size, symmetry and atom distributions.[203] Since these 3D features require structurally optimised compounds, it is difficult to apply them in a reasonable fashion to non-rigid molecules like short peptides unless a designated biological conformation is known.

The presented QSPR lipophilicity models (Table 1.3) rely on various 1D to 3D molecular representations.
Danishuddin and Asad recently proposed the class of "thermodynamics descriptors", which relate a compound structure to the observed chemical behaviour.[205] In addition to physicochemical descriptors like logS (solubility) and molar refractivity (total polarisability of a mole of a substance), logP and logDpH also belong to this class.

Computing and Quantifying Chemical Similarity

A fundamental working hypothesis for machine learning models is the chemical similarity principle, coined by Johnson and Maggiora.[215] It states that structurally similar compounds are likely to exhibit similar properties. This concept can be employed to define the applicability domain of a model in terms of the position of novel query compounds in relation to the covered chemical space.[216]
Similarity searching is also a valuable concept in virtual screening for drug design purposes: ranking screening compounds by their similarity to a bioactive reference and focussing on the top-ranked ones proved to be a successful strategy both retrospectively and prospectively.[217–219] The distance between two molecules a and b in an n-dimensional feature space is typically calculated as the euclidean distance:

\[ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \tag{1.4} \]

The Manhattan or City Block distance between two points replaces the euclidean geometry and reports the sum of absolute differences between their Cartesian coordinates. This approach can be advantageous over the euclidean distance in the case of high-dimensional data and is less sensitive to outliers.[220] A scale-invariant distance measure that takes correlations within a data set into account was introduced by Mahalanobis, motivated by the problem of identifying similarities between skulls.[221] The Mahalanobis distance essentially captures how many standard deviations a point lies away from the mean of a given distribution in multi-dimensional space. If the features are uncorrelated and scaled to unit variance, it is identical to the euclidean distance in this space.
In large screening campaigns, binary fingerprints are particularly fit for purpose because of (i) their specificity for a given structure, (ii) their relatively short calculation time and (iii) the rapid distance calculation.[222, 223] The Tanimoto coefficient captures the molecular similarity between molecules a and b:

\[ T(a, b) = \frac{\sum_{j=1}^{n} x_{jA} x_{jB}}{\sum_{j=1}^{n} (x_{jA})^2 + \sum_{j=1}^{n} (x_{jB})^2 - \sum_{j=1}^{n} x_{jA} x_{jB}} \tag{1.5} \]

where x denotes any molecular feature or binary value (total number = n) and xjA is the jth attribute in molecule a. For binary fingerprints, the Tanimoto coefficient ranges between 0 (no bit set in common in fingerprints of molecule a and b) and 1 (identical molecules having identical binary fingerprints).
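The distance and similarity measures of this section can be written down in a few lines. The sketch below implements Equation 1.4, the Manhattan distance and Equation 1.5 for plain Python lists; the function names, the toy vectors and the convention of returning 0 for two all-zero vectors are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Equation 1.4: square root of the summed squared differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def manhattan(a, b):
    """City Block distance: sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def tanimoto(a, b):
    """Equation 1.5 for binary or real-valued feature vectors."""
    ab = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - ab
    # Assumed convention: two all-zero vectors are given a similarity of 0
    return ab / denominator if denominator else 0.0


# Two hypothetical 5-bit fingerprints sharing two of their set bits
fp_a = [1, 1, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1]

similarity = tanimoto(fp_a, fp_b)        # 2 / (3 + 3 - 2) = 0.5
d_euclid = euclidean([0, 0], [3, 4])     # 5.0
d_manhattan = manhattan([0, 0], [3, 4])  # 7
```

For binary fingerprints, the numerator reduces to the number of bits set in both fingerprints, and the denominator to the number of bits set in at least one of them.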

Amino Acid Scales and Peptide Features

Natural peptides display distinct structural characteristics, given that they are exclusively constructed from the limited set of 20 standard AAs. Zimmermann et al. hence proposed to define AA scales, in which every AA side chain is assigned a value for a given characteristic.[224] In his thesis, Müller presents a comprehensive overview of developed AA scales carrying information about side-chain size and shape, hydrophobicity and other physicochemical properties.[225] The side-chain information gained can either be summarised to obtain one global peptide feature or be related to the sequence position of the respective AA. The PPCALI descriptor is such a length-invariant, position-related approach (for more detailed information see section 1.3).[226]
PepCATS is a peptide adaptation of the previously discussed topological CATS pharmacophore descriptor.[226] Instead of characterising single atoms, each AA residue is assigned to a pharmacophore type (lipophilic (L), aromatic (R), acceptor (A), donor (D), positive (P), negative (N)). Again, the distances of all 21 possible binary type combinations between two potential pharmacophore points are counted and binned within a defined range. Cross-correlation with a distance of six AAs results in a 126-dimensional, length-independent feature. The resulting vector elements are again scaled by dividing each value by the sum of all occurrences of the same feature pair type.
The development of AA scales and peptide features is steadily being pushed forward because the features for small molecules might provide overly detailed representations of parts of the whole structure. As a result, information about the overall spatial arrangement can be missed.
This effect becomes more pronounced with increasing sequence length, for example in computational AMP design, where the peptides are typically larger than 15 - 20 AAs.[117] Hence, Müller's Python implementation for AMP feature calculation falls back on AA scales, global descriptors and peptide features that take AA residue positioning into account.[120] In this thesis, small molecule-derived features are utilised because the scrutinised peptides and peptide-mimetics are relatively small (cf. section 3.2). PepCATS is employed for the purpose of screening a de novo generated hexapeptide library (cf. section 4.4).

Feature Scaling

All presented machine learning techniques, as well as the distance measurement in k-means clustering, are not scale-invariant. That means they automatically overweight features with a large scale (e.g. molecular weight), independent of their actual relevance in context. Features that are orders of magnitude smaller (e.g. number of rotatable bonds) will have only marginal impact on the models. Scaling overcomes this bias by standardising each feature in the data to zero mean and a standard deviation (std) of 1.
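A minimal sketch of this standardisation is given below. The use of the population standard deviation, the handling of constant features and the two toy features (a molecular weight and a rotatable bond count) are assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardise(rows):
    """Scale every feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(column) for column in columns]
    # Population standard deviation; the sample estimate would work as well
    stds = [pstdev(column) for column in columns]
    return [
        # Constant features (std = 0) carry no information and are set to 0
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical molecules: [molecular weight, number of rotatable bonds]
raw = [[300.0, 2.0], [400.0, 4.0], [500.0, 6.0]]
scaled = standardise(raw)
```

Before scaling, the two features differ by two orders of magnitude; afterwards both columns hold the same values (about -1.22, 0 and 1.22) and contribute equally to a distance calculation.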

Unsupervised Algorithms

Principal Component Analysis

In 1901, Karl Pearson reported a mathematical concept, in analogy to the principal axis theorem in mechanics, for representing high-dimensional physical, statistical or biological data by a straight line or plane [227], laying the cornerstone for a widely used unsupervised algorithm called principal component analysis (PCA).[228] PCA converts a set of observations with n-dimensional features into a set of n linear combinations of these original features, called principal components (PCs). The first PC accounts for as much variance in the data as possible. The following PCs account for as much of the remaining data variance as possible under the constraint that they are orthogonal to their predecessors and hence linearly uncorrelated (Figure 1.6 A). PCA is performed by calculating the covariance matrix S of p individual observations

$(x_1 \ldots x_p)$ represented by n features:

\[ S = \frac{1}{p - 1} \sum_{k=1}^{p} \tilde{x}_k \tilde{x}_k^{T} \tag{1.6} \]

with $\tilde{x}_k = x_k - \bar{x}$ being the centred data and T denoting the matrix transpose. Eigenvectors a and eigenvalues λ are calculated from S subject to the constraint that $a^{T}a = 1$, resulting in the projection directions in the form of:

Sa = λa (1.7)

The original data points x˜k are then transformed into the new coordinate system by orthogonal projection onto the principal components (Figure 1.6 B):

\[ \tilde{z}_k = \tilde{x}_k A \tag{1.8} \]

where A is the orthogonal matrix with the eigenvectors as columns. The eigenvectors in A are sorted by decreasing eigenvalue, and the first i eigenvectors, beginning with the one with the highest eigenvalue, are chosen to obtain i new dimensions. Any i-th principal component can be expressed as follows:

\[ PC_i = b_{1i} X_1 + b_{2i} X_2 + \ldots + b_{ni} X_n \tag{1.9} \]

where $X_n$ is the n-th feature vector and $b_{ni}$ is the corresponding coefficient of the linear combination (loading). As such, the loadings define the direction in feature space along which the best approximation to the data is achieved (maximal data variance). Since the squared loadings of each PC sum to 1 (cf. the constraint $a^{T}a = 1$), these coefficients also indicate feature importance and how features are correlated.
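Equations 1.6 to 1.9 translate almost directly into a few lines of NumPy. The following is a minimal sketch under the assumption that NumPy is available; the function name, the returned values and the toy data are illustrative and do not reproduce the implementation used in this thesis.

```python
import numpy as np


def pca(X, n_components):
    """PCA by eigendecomposition of the covariance matrix (Equations 1.6-1.8)."""
    X = np.asarray(X, dtype=float)
    X_centred = X - X.mean(axis=0)                       # x~_k = x_k - x_bar
    S = X_centred.T @ X_centred / (X.shape[0] - 1)       # Equation 1.6
    eigenvalues, eigenvectors = np.linalg.eigh(S)        # solves S a = lambda a
    order = np.argsort(eigenvalues)[::-1]                # sort by decreasing eigenvalue
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    A = eigenvectors[:, :n_components]                   # loadings, one column per PC
    scores = X_centred @ A                               # Equation 1.8
    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, A, explained


# Hypothetical 2D dataset with two strongly correlated features
X_toy = [[2.0, 1.9], [0.5, 0.7], [3.1, 3.0], [1.2, 1.0], [4.0, 4.2]]
scores, loadings, explained = pca(X_toy, n_components=2)
```

`np.linalg.eigh` is used because S is symmetric; it returns eigenvectors of unit length, so the squared loadings of each PC sum to 1. For the toy data, the first PC explains more than 99 % of the variance, which is the situation in which dropping the second PC loses very little information.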

Figure 1.6: Principal Component Analysis. A: Geometrical visualisation of the construction of the two principal components (PC) for a 2D dataset. The first PC explains the maximal variance. The second (orthogonal to the first), explains the maximum of the remaining variance. B: The original features are projected onto the PCs. The new coordinates (scores) present the position of the molecules in the latent space.

In chemoinformatics, PCA is applied to visualise the known bioactive or drug-like chemical space.[229] The chemical space navigator ChemGPS employs PCA to extract chemographic map coordinates from predefined molecular descriptors of "core" drug molecules and "satellites" that are intentionally placed outside drug-like space. PCA scores of novel compounds project them onto this pre-established map. The purpose of ChemGPS is to provide a reference system "...for comparing multiple combinatorial libraries, and for keeping track of previously explored regions of the chemical space".[230] Martel's standardised logP dataset from the ZINC database (cf. section 1.1) was compiled by picking the "most descriptive compounds" based on a PCA of Volsurf+ descriptors.[77] Feher and Schmidt utilised PCA to investigate differences between natural products, drug molecules and molecules from combinatorial libraries.[231] Natural and combinatorial compounds differ mostly in that the latter avoid chirality, in the prevalence of aromatic rings and condensed ring systems, and in a lower number and different ratio of heteroatoms, all of which serve to make combinatorial synthesis more efficient. The authors emphasise that their PCA scheme can guide a shift towards mimicking the property distributions of natural products in order to obtain more diverse combinatorial products with greater biological relevance.
PCA is also employed for QSAR and QSPR modelling to reduce feature dimensionality by picking a subset of the created PCs, starting from the first, that explains most of the data variance. This concept is also employed for developing AA descriptors: Larsson et al.
compressed an 80-dimensional descriptor set to three dimensions for the statistical molecular design (SMD) of a peptide library targeting PPIs in Escherichia coli.[232] Schneider and Wrede applied PCA to a set of 143 amino acid property scales to obtain a 19-dimensional representation of each standard AA, scaled to unit variance (PPCA).[233] Later, Koch et al. converted the PPCA descriptor into an n×19 matrix in which the nth column stands for the nth amino acid in a given sequence. Summing the products of matrix row elements within a correlation distance of zero to seven (autocorrelation) for every row and concatenating the 19 sums leads to a 152-dimensional vector that is a length-invariant version of PPCA.[226] Dimensionality reduction is a valuable technique when models are built on limited amounts of data.

k-means Clustering

The objective of k-means is to partition the j data objects into k groups (clusters) such that the within-cluster variance is minimised over all k clusters.[234] Once the number of clusters k (k ≤ j) has been defined by the user, each object (molecule) is randomly assigned to a cluster by the algorithm. For each cluster, the centroid (i.e., the mean value of the descriptors of the molecules belonging to the cluster) is calculated, and molecules are re-assigned to the cluster with the closest centroid. The new centroids are then calculated for the next assignment, and this procedure is repeated iteratively until convergence (graphical visualisation in Figure 1.7). Mathematically, this corresponds to minimising the function

\[ J = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2 \tag{1.10} \]

with data $x_j$ and centroid $\mu_i$ of the cluster $S_i$. Here, $\lVert x_j - \mu_i \rVert^2$ is the squared euclidean distance.
Cluster analyses like the k-means algorithm are tools for data mining, which comprises various techniques to discover inter-data relations and trends.[235] For example, k-means was employed to group genes (k = 15) from Plasmodium falciparum according to the time of expression throughout the life cycle.[236] The assumption was that genes with similar expression profiles have similar functions. In fact, the authors showed that membership in a cluster was non-random and predictive. In the next step, this information supports the function assignment of the yet uncharacterised proteins in each cluster that are encoded by the genome.

Figure 1.7: k-means Clustering of 2D data into three groups. (I) Cluster centers are randomly seeded (coloured full circles). (II) Data are assigned to the nearest center. (III) Cluster centroids are calculated (coloured dashed circles). (IV) Data are re-assigned to the cluster with the closest centroid.
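The four steps shown in Figure 1.7 and the objective J of Equation 1.10 can be sketched as follows. The sketch assumes NumPy and seeds the centres by drawing k data points at random, as in the figure; the function name, the fixed random seed and the toy data are illustrative assumptions.

```python
import numpy as np


def k_means(X, k, n_iter=100, seed=0):
    """Plain k-means: returns cluster labels, centroids and the objective J."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    def assign(centroids):
        # Squared euclidean distance of every object to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    # (I) seed the cluster centres with k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(centroids)                       # (II) / (IV) assignment
        updated = np.array([                             # (III) new centroids
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(updated, centroids):              # converged
            break
        centroids = updated
    labels = assign(centroids)
    J = ((X - centroids[labels]) ** 2).sum()             # Equation 1.10
    return labels, centroids, J


# Two well-separated hypothetical groups of 2D points
X_toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids, J = k_means(X_toy, k=2)
```

k-means only finds a local minimum of J that depends on the seeding, so in practice the procedure is usually repeated with several seeds and the solution with the lowest J is kept. An empty cluster simply keeps its previous centroid in this sketch.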

Roy and Roy employed the k-means algorithm to generate ten training and test set combinations for exploring feature selection strategies for partial least squares regression models.[237] The clustering was based on the molecular representation of a cytoprotection dataset of substituted thiocarbamates.[238] Sampling from each cluster ensured that chemically diverse structures were represented in both data partitions. In this thesis, the concept was applied to the target values y to ensure a similar lipophilicity distribution in the training and external validation partitions (cf. section 4.1).

Supervised Algorithms

Least Absolute Shrinkage and Selection Operator (Lasso)

The concept of regularisation was introduced by Tikhonov in 1943 and has been applied in various contexts of statistics and machine learning ever since.[239] Regularising the regression coefficients of the input features in the canonical ordinary least squares model extends the concept of multivariate regression analysis towards robust modelling. This serves to normalise the range of coefficients, penalising those towards the extremes of that range. For descriptors $x_{ij}$ and output $y_i$ ($i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$), regularising algorithms solve the regression problem of finding $\beta = (\beta_j)$ to minimise

\[ \sum_{i=1}^{n} \Big( y_i - \sum_{j} x_{ij} \beta_j \Big)^2 + \alpha \sum_{j=1}^{p} |\beta_j| \tag{1.11} \]

or

\[ \sum_{i=1}^{n} \Big( y_i - \sum_{j} x_{ij} \beta_j \Big)^2 + \alpha \sum_{j=1}^{p} \beta_j^2 \tag{1.12} \]

subject to $\sum_j |\beta_j| \le s$ or $\sum_j \beta_j^2 \le s$, respectively, where s is a predefined threshold that controls the amount of regularisation. Equation 1.11 considers the sum of absolute coefficient values and is referred to as L1 or Lasso regression.[240, 241] Considering the sum of squared coefficient values (Equation 1.12) is called L2 or Ridge regression.[242] The linear combination of L1 and L2 regularisation is implemented in the Elastic net algorithm.[243] In comparison to Ridge regression, which encourages small coefficients ≠ 0, Lasso provides sparse solutions. By forcing the L1 term to be less than or equal to a fixed threshold s, the coefficients of features with low influence become zero. This results in a simplification of the model, decreasing the tendency to over-fit at the risk of forfeiting some accuracy. The model complexity (i.e. the weight of the regularisation) is controlled by α. If α is large, $\beta_j$ is highly penalised. Low α values result in a relatively slack penalty. The parameter is optimised in QSAR/QSPR analyses, allowing one to find a trade-off between bias and variance.
Lasso regularisation can be employed to select small subsets of initial molecular feature sets that are most informative for modelling a respective target. In the field of machine learning, this approach is referred to as feature selection.[204, 244] Algamal et al. recently employed an adaptive Lasso selection operator to study high-dimensional QSAR for predicting the anticancer activity of imidazo- derivatives.[245] Shen and coworkers hypothesised that cancer phenotype prediction is feasible with a set of orthogonal latent variables revealing tumor subgroups. They employed the Lasso regularisation strategy to identify the genomic features that contribute most to the biological variation and possess significant weights on the latent variables.[246] Their method accurately identified known drivers in several cancer types as well as candidate biomarkers.
Further work employed Lasso not solely for feature selection but also for solving a given regression problem. It was used to build quantitative structure-retention time relationships in gradient reversed-phase and isocratic hydrophilic interaction liquid chromatography (HILIC) systems.[247, 248] Datta et al. predicted the DNA binding affinity K of 31 9-anilinoacridine derivatives as antitumor agents.[249]
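The sparsity induced by the L1 penalty can be made tangible with a small coordinate descent sketch that minimises the objective of Equation 1.11. Coordinate descent with soft-thresholding is one common way to solve the Lasso problem and is used here as an assumption for illustration, not as the solver applied in this thesis. The sketch assumes NumPy, centred data (no intercept) and non-constant feature columns; the function names and toy data are hypothetical.

```python
import numpy as np


def soft_threshold(rho, threshold):
    """Shrink rho towards zero and set it to exactly zero inside the threshold."""
    return np.sign(rho) * max(abs(rho) - threshold, 0.0)


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise sum_i (y_i - sum_j x_ij b_j)^2 + alpha * sum_j |b_j| (Equation 1.11)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # Residual of the model without the contribution of feature j
            residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual
            z = X[:, j] @ X[:, j]
            # Setting the derivative 2*z*b - 2*rho + alpha*sign(b) to zero
            beta[j] = soft_threshold(rho, alpha / 2.0) / z
    return beta


# Hypothetical centred design with two orthogonal features;
# y = 2 * x0 + 0.1 * x1, i.e. the second feature has only a weak influence
X_toy = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y_toy = [2.1, 1.9, -1.9, -2.1]

beta_ols = lasso_coordinate_descent(X_toy, y_toy, alpha=0.0)    # [2.0, 0.1]
beta_lasso = lasso_coordinate_descent(X_toy, y_toy, alpha=1.0)  # [1.875, 0.0]
```

With α = 0 the ordinary least squares coefficients are recovered. With α = 1 the strong coefficient is shrunk slightly, whereas the weak one is set to exactly zero, which is the feature selection behaviour described above.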

Support Vector Machines for Classification and Regression

The support vector machine algorithm was introduced by Cortes and Vapnik for supervised binary classification problems, more specifically for handwritten character recognition.[250] Conceptually, the input vectors are mapped to a high-dimensional feature space in which a linear decision surface can be constructed. This relies on the "kernel trick", a means of representing and projecting input vectors in a manner amenable to learning the optimal linear separation of classes in a computationally inexpensive fashion. The resulting hyperplane separates objects from two different classes. For its construction, the algorithm needs the closest data points (support vectors), which are the most difficult to classify. The goal is to define the hyperplane such that the distance to the support vectors (the margin) is maximal. Let

$$\mathbf{w} \cdot \mathbf{x} - b = 0 \tag{1.13}$$

represent the optimal hyperplane, where w is its normal vector and b is an offset that determines its position.

Each data point that is not on the plane is assigned to one class by w·xi − b ≤ −1 when yi = −1 and to the other class by w·xi − b ≥ 1 when yi = 1. Consequently, the support vectors satisfy |w·x − b| = 1, and their distance to the hyperplane is

$$\frac{|\mathbf{w} \cdot \mathbf{x} - b|}{\lVert \mathbf{w} \rVert} = \frac{1}{\lVert \mathbf{w} \rVert} \tag{1.14}$$

It follows that the distance between both borders of the margin is 2/‖w‖. The model with the largest possible minimum distance, the maximal margin hyperplane, possesses maximal separation power. To achieve that, ‖w‖ must be minimised without letting data points move into the margin. Minimising ‖w‖ is a non-linear optimisation task, which is constrained by the training data positions. The concept provides an intuitive means of performing classification, but real-world datasets are rarely separable in this fashion. The domain of support vector machines (SVM) is to operate in a high-dimensional space derived from the original feature space where a linear separation of data becomes possible. For transformation into higher dimensions, kernel functions are utilised. The function of the decision surface becomes

$$f(\mathbf{x}) = \sum_{i=1}^{l} y_i a_i \, k(\mathbf{x}_i, \mathbf{x}_{i'}) \tag{1.15}$$

where the input vectors x of the support vectors enter in the form of a dot product, the kernel function k(xi, xi'). Computing the dot products in high-dimensional latent space has been paraphrased as the "kernel trick".[251] Commonly, linear, polynomial and Gaussian kernels are employed (Table 1.5).[252]

Generalising the maximal margin hyperplane to the non-separable case is called the support vector classifier.[250] This concept is superior to the maximal margin hyperplane in terms of data over-fitting. Instead of resulting in a highly conservative hyperplane that separates all data points, a larger and "softer" margin leads to more robustness to individual observations. For this, slack variables ζi of each data point are introduced into the optimisation task, essentially indicating its location relative to margin and hyperplane. The non-negative tuning parameter C controls the adaptability of ζi and thus the acceptable extent of margin violation. For C = 0, it follows that ζi = 0, meaning that every data point is on the correct side of the margin. In case of a cleanly separable dataset, the result is identical to the maximal margin classifier. As C increases, the model gets more tolerant of violations of the margin. That leads to a less tight fit to the training data but potentially more robustness. Since the decision boundary is based on the support vectors, it is quite robust to data points far from the hyperplane.
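The margin geometry of equations 1.13 and 1.14 can be checked numerically with a short sketch. The hyperplane parameters and data points below are hypothetical values chosen so that two points lie exactly on the margin borders; they are not taken from any model in this work.

```python
import math


def signed_score(w, b, x):
    """Value of w . x - b for a point x (cf. equation 1.13)."""
    return sum(wj * xj for wj, xj in zip(w, x)) - b


def distance_to_hyperplane(w, b, x):
    """Geometric distance |w . x - b| / ||w|| (cf. equation 1.14)."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return abs(signed_score(w, b, x)) / norm


def margin_width(w):
    """Distance between both margin borders, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


# Hypothetical hyperplane and two points sitting on the margin borders.
w, b = [3.0, 4.0], 1.0
positive_sv = [2.0, -1.0]   # w . x - b = +1
negative_sv = [0.0, 0.0]    # w . x - b = -1

print(signed_score(w, b, positive_sv), signed_score(w, b, negative_sv))
print(distance_to_hyperplane(w, b, positive_sv), margin_width(w))
```

Both support vectors are at distance 1/‖w‖ from the hyperplane, so the full margin is twice that value.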

Table 1.5: Prominent kernel functions.

Name                              Function

Linear                            $k(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}$
Polynomial                        $k(x_i, x_{i'}) = \left(1 + \sum_{j=1}^{p} x_{ij} x_{i'j}\right)^{d}$
Gaussian (radial basis function)  $k(x_i, x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^{2}\right)$
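The kernels of Table 1.5 and the decision function of equation 1.15 can be written down in a few lines. This is a minimal sketch: the Gaussian kernel uses the conventional negative exponent, the support vectors, labels and coefficients are hypothetical rather than fitted, and, following equation 1.15, no additional offset term is included.

```python
import math


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, d=2):
    return (1.0 + linear_kernel(x, z)) ** d


def gaussian_kernel(x, z, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_function(x, support_vectors, labels, coefficients, kernel):
    """Sum of y_i * a_i * k(x_i, x) over all support vectors (cf. equation 1.15)."""
    return sum(
        y * a * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, coefficients)
    )


# Hypothetical, hand-picked support vectors; not the result of training.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
coefficients = [0.5, 0.5]

query = [2.0, 0.0]
score = decision_function(query, support_vectors, labels, coefficients, linear_kernel)
print("class", 1 if score >= 0 else -1)
```

Swapping `linear_kernel` for `gaussian_kernel` changes the shape of the decision surface without touching the rest of the code, which is the practical appeal of the kernel formulation.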

Application of kernel functions leads to efficient handling of datasets with a high ratio of input dimensions to observations, such as gene and protein expression data in cancer research.[253] SVMs have been extensively used in the field of drug discovery. Byvatov et al. compared SVM and artificial neural networks for drug/nondrug classification. While the former showed superior general performance, both methods complement each other, as the sets of under- and overpredicted objects as well as true positives and true negatives were dissimilar.[254] In another study, the same group predicted the inhibitory effects of ligands targeting COX-2 and thrombin. A prospective study demonstrated activity of identified compounds in cellular assays.[255] Schneider and coworkers identified novel ACPs, starting from the template sequence of Decoralin, which was previously found to possess anticancer activity.[256, 257] SVMs have also been employed for multiclass substrate prediction of compounds for seven categories of drug transporters. This model learns only from a small set of physicochemical properties (charge, molecular weight, lipophilicity and plasma unbound fraction) plus three different descriptors for each category that were picked by a greedy algorithm.[258]

The SVM concept can also be utilised to solve regression problems.[259] The process is identical: finding a hyperplane that maximises the margin while reducing the error to an acceptable level. Data point to hyperplane distances shall be minimal.[260] Since support vector machines for regression (SVR) were introduced, they have also been applied in QSAR/QSPR analyses. Horvath et al. used SVM and SVR to predict activity landscapes in structurally similar, bioactive compounds. SVM was used to predict whether or not the compounds sat on a so-called "activity cliff", and SVR to predict the actual variation in activity.[261] Schneider and coworkers tested the prospective applicability of SVM regression in determining the selectivity of D3 receptor ligands.[262]

Figure 1.8: Binary classification by SVM. Left: Solely the support vectors are needed for constructing the decision boundary. In case of the maximum margin classifier, no margin-violation is allowed. Right: By the kernel trick, original features are transformed implicitly into a high-dimensional latent space, facilitating linear separation. Back-transformation to lower dimensional space disrupts the boundary surface.

A binary SVM discriminated between active and inactive candidates, and the SVM regressor predicted selectivity ratios between dopamine receptor subtypes D2 and D3. Six out of 11 tested compounds showed the desired selectivity for subtype D3 and low micromolar binding affinities. SVR also constitutes a valuable machine learning tool for ADME-Tox predictions, such as respiratory toxicity.[263] The regression approach was used to overcome limitations of previously reported models that were developed on respiratory toxicity datasets with a single symptom as the endpoint. As discussed in section 1.1, SVR has also been used for logDpH prediction of small molecules (Table 1.3).

Artificial Neural Networks and Multilayer Perceptrons

The original concept of artificial neural networks (ANN) was inspired by the architecture and functionality of the human brain. A system of virtual "neurons" learns to solve a given task without being provided with any task-related rules. In analogy to the physiological model, a neuron receives external stimuli, here in the form of numerical values. Each incoming signal is weighted, essentially representing the strength of a synapse in biology, and the sum of information is processed. By means of an activation function, the neuron decides whether or not to transmit a signal to additional connected neurons. In feed-forward ANN, information is fed into an initial input layer, processed through additional "hidden layers", and provided to a single neuron at the end of the cascade, which reports the final predicted value. A common technique to train neural networks is error back-propagation, where the weights are iteratively adjusted to optimise the final output.[264] Implementing non-linear activation functions allows ANNs to perform as universal function approximators.[265] This finding prefaced the early wave of hype for applications of ANNs in various research areas in the late 1980s. In drug discovery, ANN-driven methods have been applied for compound classification, QSAR, target identification, feature selection and molecular property prediction.[233, 266] Recently, we are witnessing an emerging interest in "deep" neural network architectures to deal with highly complex data patterns and "big data". Such networks learn input features with multiple levels of abstraction, also producing convincing results in contemporary systems biology and genomic studies.[267, 268]

In order to model dynamic, temporal behaviour for a given time sequence, recurrent neural networks adopt patterns emulating the behaviour of neurons in the neocortex.[269] These neurons give direct feedback to themselves, indirect feedback to neurons in previous layers and lateral feedback to neurons in the same layer. In particular, long short-term memory recurrent neural networks (LSTM), introduced by Hochreiter and Schmidhuber, have been successfully applied for natural language processing.[270] Gupta et al. employed LSTMs for de novo drug design by learning the syntax of molecular representation in terms of the SMILES language.[271] The same group presented the first prospective application of a fine-tuned LSTM model to reveal retinoid X and peroxisome proliferator-activated receptor agonists with nanomolar to low micromolar activity.[272] Later, the concept was adapted to learn the "grammar" of 10,000 presumably α-helical peptides, consisting of naturally occurring AAs. Fine-tuning the model on 26 membranolytic ACPs with low micromolar activities against human breast adenocarcinoma cells revealed ten active out of twelve de novo designed peptides.[124]

There are no known hard limitations to ANN complexity, with the primary determinants in practice being performance and computational complexity. However, for low-dimensional input, simple solutions can be successfully utilised as jury models (Figure 1.9).[273] Wolpert initially termed the approach stacked generalisation. The idea was to deduce the bias of multiple base models by voting to obtain a more accurate ensemble output. Therefore, the jury model is fed with predictions from the previous models and learns to build a relationship between this input and the real binary (classification) or floating (regression) target value. Renner et al. trained two ANN ensembles to identify allosteric metabotropic glutamate receptor 5 (mGluR5) antagonists and to distinguish them from mGluR1 antagonists.[274] Hiss et al. demonstrated the applicability of neural network ensembles to classify octapeptides that are recognised by murine major histocompatibility complex 1 (MHC-1) for vaccine development.[275] Following up on this work, Koch et al. scrutinised functional peptide space for MHC-1 ligands by jury networks considering multilayer perceptrons and SVM as first stage models.[226, 276]

Figure 1.9: Cascaded jury network. Several first stage models learn the same or different molecular representations and predict a given target. Any algorithm that can solve the underlying classification or regression problem can be utilised. Each first stage model then feeds its prediction into one neuron of the jury input layer. The following architecture is flexible in terms of number of hidden layers and neurons per layer. Typically, one output neuron provides the final prediction.
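The information flow of such a jury network can be sketched as a plain forward pass. The example assumes a binary classification setting with sigmoid activations, one hidden layer, and arbitrary untrained weights; all numbers and function names are hypothetical, and training by back-propagation is deliberately left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases):
    """Fully connected layer: weighted sum per neuron followed by a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def jury_forward(first_stage_predictions, hidden_w, hidden_b, out_w, out_b):
    """Feed first stage predictions through one hidden layer to one output neuron."""
    hidden = dense_layer(first_stage_predictions, hidden_w, hidden_b)
    return dense_layer(hidden, [out_w], [out_b])[0]


# Hypothetical class probabilities reported by three first stage models.
first_stage = [0.9, 0.7, 0.4]

# Arbitrary illustrative weights: two hidden neurons, one output neuron.
hidden_w = [[1.0, 0.5, -0.5], [-1.0, 1.0, 0.5]]
hidden_b = [0.0, 0.1]
out_w, out_b = [1.5, -0.5], 0.0

print(jury_forward(first_stage, hidden_w, hidden_b, out_w, out_b))
```

For a regression target, the sigmoid on the output neuron would typically be replaced by a linear activation.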

Model Evaluation

After using supervised learning to solve either a (multi-)classification or regression problem, the created model must be evaluated in terms of predictiveness and robustness. This step is crucial for (i) comparing several machine learning techniques for one task and for (ii) comparing different parameter combinations that can be examined for an individual model, a process known as hyperparameter optimisation.

The first consideration is to avoid training models on all available data, instead withholding a proportion for "external validation" (EV). These EV data will be fed only to the final model, without providing it the experimental label or target value y. The predictions are then compared to y. In order to evaluate intermediate models during development, whilst avoiding performing such operations on the EV set, Geisser introduced the concept of cross-validation (CV).[277] With n repetitions, a fraction of the training data is reserved for validation until each data point has served once for validation and n-1 times for training (Figure 1.10 A). The internal splitting of the training data is conducted randomly. The CV result therefore provides information about the ability of the model to generalise to novel structures contained in the EV partition. Tailored for small sample problems, CV is commonly applied in QSAR and QSPR modelling. However, Isaksson et al. state that by means of CV, the uncertainty in a point estimate is unknown but can be large for small sample classification problems.[278] The authors advocate reporting the performance in the form of a conservative confidence interval until improved methods are developed.

The second consideration is how the performance of different algorithms or model parameters (or parameter combinations) can be quantified. In the case of supervised regression modelling, the error between the target yi and the corresponding prediction ŷi of observation i is calculated. A global performance metric that considers the absolute deviation is the root mean squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2} / n} \tag{1.16}$$

where n is the total number of observations. Since the impact of each error is proportional to the size of the squared error, larger errors contribute disproportionately to RMSE. Thus, it is sensitive to outliers. A second metric for the purpose of in silico logDpH model evaluation is the percentage of predictions that deviate by no more than ± 0.5 log units from the experimental value:

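$$\mathrm{Accuracy}[\%] = \frac{\sum_{i=1}^{n} \left[\, |y_i - \hat{y}_i| \le 0.5 \,\right]}{n} \cdot 100 \tag{1.17}$$

Here, the bracket equals 1 if the condition holds and 0 otherwise. A detailed discussion about this metric and the set threshold of 0.5 log units can be found in section 4.1. Accuracy [%] is based on absolute error deviations and is therefore less sensitive to outliers.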
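Both metrics are straightforward to compute, as the following sketch shows. The experimental and predicted values are hypothetical and chosen so that one prediction is a clear outlier.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error (cf. equation 1.16)."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def accuracy_within(y_true, y_pred, threshold=0.5):
    """Percentage of predictions within +/- threshold log units (cf. equation 1.17)."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if abs(y - p) <= threshold)
    return 100.0 * hits / len(y_true)


# Hypothetical experimental and predicted values; the last one is an outlier.
y_exp = [1.2, 0.4, -0.8, 2.5]
y_hat = [1.0, 0.8, -0.7, 4.5]

print(f"RMSE = {rmse(y_exp, y_hat):.2f}")
print(f"Accuracy = {accuracy_within(y_exp, y_hat):.0f} %")
```

The single outlier dominates the RMSE, whereas it lowers the accuracy by only one count out of four.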

Figure 1.10: Model evaluation strategies. A: N-fold Cross-Validation: In each iteration, one n-th of the data is left out for validation (black rectangle) and the remainder is used for model training (white rectangle). The procedure is repeated n times until all objects have served once for validation. The validation loss is reported as the mean ± std of all n iterations. B: y-Randomisation: The known target values y in the training set are randomly brought into a wrong order before a model learns to build the relationship f(X) = y. The procedure is repeated p times and the performance is reported as the mean ± std of all p iterations.
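The splitting scheme of panel A can be sketched as a small generator. This is an illustrative implementation with hypothetical names; it only produces the index partitions and leaves model training and loss calculation to the caller.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (training, validation) index lists so that every sample
    is used exactly once for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random internal splitting
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


for training, validation in k_fold_indices(n_samples=10, n_folds=5):
    print("train:", sorted(training), "validate:", sorted(validation))
```

The validation loss of each iteration would then be collected and summarised as mean ± std over all folds.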

Topliss et al. pointed out that a few "best" features for a given QSAR (QSPR) scenario, such as those resulting from the feature selection process, carry the risk of fitting the data reasonably well by chance.[279, 280] A possibility to assess the performance retrieved by chance correlation is given by y-Randomisation.[281] Here, the original input X is fed to a model for training but the y data is randomly reordered (Figure 1.10 B). If the scrambled data result in lower performance than the original, "... then one can feel confident about the relevance of the real QSAR (QSPR) model" [282]. Since a random permutation of y can be close to the original arrangement, thus leading to similar outcomes, the procedure is typically conducted multiple times and the mean ± standard deviation is reported. The y-randomisation result can be used as a "benchmark performance" of a non-predictive model. It serves for comparison to the best model in order to explore the benefit of the latter for modelling the target y for the given data.
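A minimal y-randomisation sketch is given below. It assumes a one-dimensional least-squares line as a stand-in for the actual model and uses the RMSE of the fit as the performance measure; the data are synthetic and all names are hypothetical.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares line through (x, y); returns slope and intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_rmse(x, y):
    """RMSE of the fitted line on the data it was trained on."""
    slope, intercept = fit_line(x, y)
    residuals = [(yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y)]
    return math.sqrt(sum(residuals) / len(x))


def y_randomisation(x, y, repetitions=10, seed=0):
    """Mean and standard deviation of the RMSE over scrambled target vectors."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        y_scrambled = list(y)
        rng.shuffle(y_scrambled)                  # break the X-y relationship
        scores.append(fit_rmse(x, y_scrambled))
    return statistics.mean(scores), statistics.stdev(scores)


# Synthetic data with a perfect linear relationship.
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]

original = fit_rmse(x, y)
scrambled_mean, scrambled_std = y_randomisation(x, y)
print(f"original RMSE:  {original:.3f}")
print(f"scrambled RMSE: {scrambled_mean:.3f} +/- {scrambled_std:.3f}")
```

A scrambled performance that is clearly worse than the original one indicates that the original model did not arise from chance correlation.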

Applicability Domain

As discussed in the previous section, QSPR models are tested on unseen molecules with known target values. Regression metrics like RMSE (equation 1.16) and accuracy [%] (equation 1.17) thereby allow the expected error of future predictions to be estimated. However, it must be kept in mind that it is impossible to train the model on all chemotypes occurring in the druggable chemical space due to the lack of universal experimental data (Figure 1.11). This fact constitutes a central pillar in the context of this thesis, because peptide lipophilicity data are rare (cf. section 1.2). It can be assumed further that future objects which are under-represented in the training data have higher prediction errors than future objects that resemble the training data. With a few exceptions, the existing models for logP and logDpH prediction are trained on, and serve for, drug-like small molecules (cf. section 1.1). This motivates the development of bespoke lipophilicity models for peptides and derivatives thereof. In order to quantify the "novelty" of a future prediction, the model's known chemical space or response range must be defined. This approach is referred to as defining the applicability domain (AD).[216, 283, 284] The applied molecular representations are thereby well-suited to define the borders of the model's known chemical space.

Range-based Methods

Range-based methods define the AD based on the range of each feature dimension in the training set, creating a boundary box (Figure 1.11). Novel structures that exceed this box in one or multiple dimensions are detected as outliers. This approach becomes somewhat inefficient with increasing dimensionality of the input features, because the extent of empty chemical space in the boundary box increases as well.[253] Even for a model that considers only a five-dimensional feature set, Sahigara et al. showed that the range criterion did not lead to any outlier detection.[285]
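A range-based AD check reduces to comparing each feature of a query with the minimum and maximum observed in the training set. The sketch below uses hypothetical two-dimensional descriptor values and function names.

```python
def fit_bounding_box(training_set):
    """Per-feature minimum and maximum of the training descriptors."""
    columns = list(zip(*training_set))
    return [min(col) for col in columns], [max(col) for col in columns]


def inside_bounding_box(query, box):
    """True if the query lies within the training range in every dimension."""
    minima, maxima = box
    return all(lo <= value <= hi for value, lo, hi in zip(query, minima, maxima))


# Hypothetical descriptors, e.g. (molecular weight, number of H-bond donors).
training = [[250.0, 1.0], [410.0, 3.0], [330.0, 2.0]]
box = fit_bounding_box(training)

print(inside_bounding_box([300.0, 2.0], box))    # within all ranges
print(inside_bounding_box([1200.0, 2.0], box))   # exceeds the first range
```

Note that a query can lie inside the box while still being far from every training compound, which is the inefficiency mentioned above.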

Distance-based Methods

Distance-based methods determine the distance of an unseen structure to a defined point within the training set. A threshold value is set, which represents the applicability domain border. For example, the distance towards the nearest training set object or the mean distance to the k nearest neighbours can be employed as the decision criterion.[286, 287] Similarity of the query to its training neighbours is indicated by distances that are smaller than this threshold, and the prediction is then considered reliable. According to Sahigara et al., "...approaches based on leverage are quite recommended for defining AD of a QSAR model".[286] The leverage is proportional to the Mahalanobis distance of a query compound towards the centroid of the training set. This concept can be extended for uncorrelated and scaled molecular representations of the chemical space, using distances like the Euclidean (equation 1.4). This leads to a (hyper-)spherical delimitation of the model's AD (Figure 1.11). Typically, a factor (2, 3, 95 percentile, std) of the average distance of training objects to the centroid is employed as the AD threshold. For unseen objects, the distance to the training set centroid is determined. If it is greater than the threshold, the molecule lies beyond the AD.
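The centroid-based variant can be sketched as follows, assuming Euclidean distances on already scaled descriptors and a threshold of twice the average training distance to the centroid. The factor, the data and the function names are hypothetical choices for illustration.

```python
import math


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def fit_distance_domain(training_set, factor=2.0):
    """Centroid and AD threshold = factor * mean training distance to the centroid."""
    centre = centroid(training_set)
    mean_distance = sum(euclidean(p, centre) for p in training_set) / len(training_set)
    return centre, factor * mean_distance


def within_domain(query, centre, threshold):
    return euclidean(query, centre) <= threshold


# Hypothetical scaled two-dimensional descriptors.
training = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
centre, threshold = fit_distance_domain(training)

print(within_domain([1.5, 1.5], centre, threshold))   # close to the centroid
print(within_domain([5.0, 5.0], centre, threshold))   # beyond the threshold
```

The resulting domain is a circle (in general a hypersphere) around the training centroid.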

Figure 1.11: Left: Experimental data (full circles) are only available for parts of the chemical space. LogP and logDpH are mostly covered by small chemical entities and drug-like molecules (black). Lipophilicity data of linear peptides (red) and peptide-mimetics (blue) are rare. Right: Applicability Domain assessment of novel query compounds (crosses) in terms of their positioning in chemical space relative to the training molecules (circles). This figure was adapted and extended from [288] and from http://bigchem.eu/sites/default/files/Online14_Grisoni.pdf [289].

Local Density Methods

The previously discussed concepts of distance-based AD assessment do not allow for variations within the data densities. A way to overcome this problem is to consider the neighbourhood of the training set compounds as well. A relative distance for the query to (k) nearest training set neighbour(s) (NNtrain) and the (k) nearest neighbour(s) of (NNtrain) can be employed. The ratio of the first distance to the second quantifies the "novelty" of the unseen compound.[290] The novelty-range which comprises the AD can then be defined. Several such local outlier scores have been proposed which are based on k-NN distances.[291–293]
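A simple local density score for k = 1 can be sketched as the ratio described above. The training points and function names are hypothetical; published local outlier scores differ in their details, so this is only one possible reading of the concept.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def novelty_ratio(query, training_set):
    """Distance query -> nearest training neighbour, divided by the distance
    of that neighbour to its own nearest training neighbour (k = 1)."""
    nn_index = min(range(len(training_set)),
                   key=lambda i: euclidean(query, training_set[i]))
    nn_train = training_set[nn_index]
    d_query = euclidean(query, nn_train)
    d_neighbour = min(euclidean(nn_train, p)
                      for i, p in enumerate(training_set) if i != nn_index)
    return d_query / d_neighbour


# Hypothetical training set with one dense and one sparse region.
training = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [7.0, 5.0]]

print(novelty_ratio([0.5, 0.0], training))   # far relative to the dense region
print(novelty_ratio([6.0, 5.0], training))   # typical for the sparse region
```

Although the second query is further away from its nearest training compound in absolute terms, it receives the lower novelty score because the surrounding training data are sparse.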

1.4 Evolutionary Algorithms

The design of novel peptides can be fostered by deriving offspring sequences from a template or "parent" sequence by mutating single or multiple AAs. For example, it is well known that many adhesive proteins contain the Arg-Gly-Asp tripeptide as their cell recognition site.[294] Starting from that knowledge, the given sequence can be mutated or decorated with further moieties.[295] The template approach presents a valuable alternative when structural information about the target or the ligand-target complex is missing. However, just by considering the 20 natural AAs, 20^x sequences comprise all possible offspring of a template peptide in which x positions are subject to residue changes. In other words, the sequence space for a given scenario rapidly exceeds what is possible to synthesise and test for activity. Nature-inspired evolutionary algorithms (EA) can address this challenge [296], and have been applied in various peptide design studies.[297–299] Mimicking the evolutionary process by EA directs the creation of in silico offspring libraries. The generated offspring can then be tested for their desired response and the best can serve as the template for next-generation offspring. This concept differs fundamentally from ultra-high throughput platforms, such as the phage display and mRNA display systems.[300] Although the peptides proposed by EA might not present the optimal structures for a given task, the approach allows the sequence space to be explored and interpreted. Also, it is easy to implement in small laboratories without the capacity for, or knowledge about, screening platforms.

One of the first EA applications for rational de novo peptide design was conceived by Schneider and Wrede in 1994.[301] Their simulated molecular evolution process was based on a physicochemical similarity principle for directing the variation of the parent peptide.[299] To define the residue transition probability P(i,j) from AA residue i to residue j, the Euclidean distance di,j between the respective physicochemical descriptors is determined. P(i,j) follows by:

$$P(i,j) = \exp\left(-\frac{d_{ij}^{2}}{2\sigma^{2}}\right) \Big/ \sum_{j} \exp\left(-\frac{d_{ij}^{2}}{2\sigma^{2}}\right) \tag{1.18}$$

In this work, the pairwise distances between the 20 genetically encoded AAs are based on the side chain composition (weight ratio of non-carbon to carbon atoms), polarity and molecular volume, incorporated in a scaled version of the Grantham matrix.[302] The σ parameter controls the chance that a mutation from AA i to j takes place. Larger σ values allow less conservative mutations (larger steps to more dissimilar residues are possible), while smaller values lead to limited sequence diversity within the in silico library. In this way, each AA position is under consideration for a potential mutation at the same time, leading to offspring with the same sequence length as the parent. The pioneering work of Schneider and Wrede provided insight into the design of idealised matrix metallopeptidase cleavage sites. Later, another application of the EA identified novel sequences preventing the positive chronotropic effect of anti-β1-adrenoreceptor autoantibodies from the serum of patients with dilated cardiomyopathy.[303] This algorithm is now called VESPA. Hiss et al. built on this concept to introduce Morphing of Peptides by Evolutionary Design (MoPED). MoPED is equipped with both an initial start template and a target sequence. This allows smooth transition steps between the start and target peptides. Stutz et al. later employed MoPED to morph a membranolytic AMP into a non-membranolytic mitochondrial targeting peptide.[304] The sequence diversity of EA-based in silico peptide libraries can be assessed by equation 1.19:

$$H_i = -\sum_{k=1}^{|A|} p_i(x_k) \cdot \log_2 p_i(x_k) \tag{1.19}$$

where Hi is the Shannon entropy at position i and pi(xk) is the probability of AA k from the set A of |A| = 20 natural AAs at this position.[296] Hi is expressed in bit, since the base of the logarithm is two. For an equal distribution of all 20 AAs at position i, the maximum entropy computes to 4.32 bit. Full conservation of one AA results in Hi = 0. The total entropy of a library containing peptides of length N is computed as:

$$H_{\mathrm{total}} = \sum_{i=1}^{N} H_i \tag{1.20}$$

Htotal presents a quantitative index to compare the diversity of in silico peptide libraries.
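The transition probabilities of equation 1.18 and the entropy measures of equations 1.19 and 1.20 can be sketched in a few lines. The distance values below are hypothetical placeholders rather than entries of the Grantham matrix, the two-peptide library is a toy example, and the positional probabilities are estimated from observed residue frequencies.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def transition_probabilities(distances, sigma):
    """Equation 1.18 for one parent residue i; distances[j] holds d_ij."""
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def positional_entropy(residues):
    """Shannon entropy in bit of the residues observed at one position."""
    n = len(residues)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(residues).values())


def total_entropy(library):
    """Sum of the positional entropies of an equal-length peptide library."""
    return sum(positional_entropy(column) for column in zip(*library))


# Hypothetical distances from a parent residue to four candidate residues.
distances = [0.0, 0.5, 1.0, 2.0]
print(transition_probabilities(distances, sigma=0.3))  # conservative mutations
print(transition_probabilities(distances, sigma=2.0))  # more diverse mutations

print(positional_entropy(list(AMINO_ACIDS)))  # uniform use of all 20 residues
print(total_entropy(["AC", "AD"]))            # one conserved, one variable position
```

A small σ concentrates the probability on residues close to the parent, whereas a large σ spreads it out, which in turn raises the entropy of the resulting library.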

1.5 Protein-Protein Interactions in Drug Discovery

The network of highly specific interactions between proteins is crucial for all living organisms.[130, 305] Herein, protein-protein interactions (PPI) are understood as specific and intentional contacts of fully functional proteins.[306] As such, PPIs should not be confused with protein-DNA, protein-RNA, or protein-cofactor interactions. Further, generic functionalities which build, fold, or degrade proteins are not considered as PPIs. For example, ribosomal proteins share "functional contacts" with each other, but not all of these contacts lead to interactions per se. PPIs are ubiquitous and involved in all levels of cellular action. They contribute to the execution and regulation of the majority of biological processes; for example, in signal transduction, cell metabolism, and membrane transport mechanisms. The complete network of PPIs in the human body, the human interactome, was estimated to comprise approximately 130,000 binary PPIs.[307] In particular, pathological PPIs and PPI networks within the interactome are seen as promising drug targets, offering a plethora of opportunities for drug discovery.[308, 309]

While PPIs bear a huge potential for the discovery of novel therapeutics, their modulation presents a challenging task. First, embedding molecules into buried binding pockets, where a large fraction or even all of the ligand is exposed to the target, is not feasible. In contrast, PPIs are flat interaction surfaces of relatively large extent (approximately 1,500-3,000 Å²). Second, creating selectivity is difficult because PPIs lack distinct pharmacophoric features. Since the endogenous binding partners are proteins or peptides, they do not provide a direct template for the design of small molecules. In consequence, PPIs are seen as targets with low "ligandability", in particular for small molecules.[131, 310] However, the discovery of so-called hot spots deepened the understanding of PPIs and promoted drug discovery. Hot spots are regions at the interaction surfaces that contribute disproportionately to the free binding energy.[311, 312] Recent studies showed that some PPIs are accessible for small molecules when the targeting focuses on these hot spots. In consequence, a total of 36 inhibitors had reached clinical development by 2016.[313]

Peptides and derivatives thereof can overcome the challenges encountered with small molecules. These structures are larger than small molecules, so they cover large parts of the surface of a targeted protein. More importantly, peptides possess the ability to mimic the counterpart of the protein surface due to their conformational flexibility (cf. section 1.2). Given that a large proportion of binding traces back to just a few AA residues at the hot spots, competitive PPI inhibition was envisioned by peptide sequences directly derived from the binding epitopes of one of the participating proteins.[314–316] Such short peptide sequences are highly flexible in their unbound state, resulting in a high entropic penalty when they have to adopt a defined 3D conformation upon binding.[317, 318] Stabilising the bioactive conformation beforehand is a common approach to increase the target affinity.[319] The discussed β-turn mimetics, peptoids and the stapling technique are examples of such constrained peptides (cf. section 1.2).

The Chemokine System

The chemokine system is a complex network of receptor-ligand interactions and con- stitutes a key component of the immune system regulation.[320, 321] To date, 48 of its secreted proteins, the chemokines, which bind to approximately 20 respective GPCRs, are known. The chemokine system is promiscuous in terms that some chemokines can bind multiple receptors and the majority of receptors is attracted by more than one ligand (cf. Figure A.1). Chemokines have the function to mediate cell trafficking of var- ious chemokine-receptor (CR) expressing leukocytes on "cellular highways" in the hu- man body. This subgroup of cell regulating cytokines is further divided into inflamma- tory and homeostatic chemokines.[322] The main role of the former group is to trigger the movement of immune cells to the origin of an inflammation. Contrary, homeostatic chemokines are constitutively expressed in the thymic and lymphoid tissue. The func- tions of this latter subgroup are more diverse; for example, homeostatic chemokines are involved in lymphoid organogenesis, angiogenesis, and general organogenesis by mediating stem cell migration. Furthermore, they are able to exert neuroprotective and reparative functions.[322, 323] The pivotal role of homeostatic chemokines in development accounts for their strong structural conservation (Figure 1.12).[324] Chemokines exhibit an unstructured N- terminus of variable length (5 - 32 AAs), which is followed by a CC, CXC, XC or CX3C motif (C = cys, X = non-cys residue). Next, a N-loop motif passes into the so-called core domain of the chemokine which comprises a short 310- helix and a three stranded, antiparallel β-sheet. The core domain is followed Figure 1.12: Alignment of the NMR struc- by an α-helix and an unstructured C- tures of CCL19 (PDB: 2MP1) and CCL21 terminus. Stabilisation of the conserved (PDB:2L4N). 
The visualisation exemplary de- tertiary structure is achieved by (i) hy- picts the strong conservation of secondary structure motifs and tertiary structure of drophobic interactions of the C-terminal chemokines. α-helix with the β-sheet and (ii) one or two disulfide bridges between the CC

(CXC, XC, CX3C) motif near the N-terminus and cysteines at the β1-β2 turn (30s 1.5. Protein-Protein Interactions in Drug Discovery 43

Loop) and the β3-strand. The arrangement of the two N-terminal cysteines presents the basis for the structural differentiation of chemokines into the respective four sub- families.[320] This thesis focuses on CC-chemokines, specifically on CCL19 and CCl21 and their receptor CCR7. Besides the structural conservation, chemokines also share similar modes of CR recog- nition and activation. Clark-Lewis et al. revealed in their work on CXCL8 the criti- cal role of the chemokine N-terminus for CR activation.[325] Further, they found that CR binding and activation are two dissimilar events. The same authors observed this phenomenon also for other chemokine/receptor interactions and stated that the un- structured N-terminal region of the CR is crucial for the binding.[326] Further studies on CCR1, CCR2, and CCR3 gave rise to the proposal of a two-step, two-site bind- ing and activation model.[327–329] In the first step, the unstructured CR N-terminus recognises the core-domain of the chemokine which leads to binding (site 1) (cf. con- ceptual depiction for the CCR7/CCL19 axis in Figure 1.14). In the second step, the unstructured chemokine N-terminus activates the CR by orientating itself towards the transmembrane bundle of the GPCR (site 2). The binding step is driven by charge- dependent interactions between acidic residues on the CR side and basic residues in the core-domain of the chemokine. Therefore, the N-termini of the homeostatic CC- chemokine receptors (CCR 1, 5, 6, 7, 9, 10) exhibit 10 to 24% acidic AAs which occur in a ratio of ≥ 2:1 vs. basic residues (Figure 1.13). Moreover, the negatively charged AAs are spatially separated from the positively charged ones, which tend to be located near the subsequent transmembrane region. Other patches of charged amino acids have an equal ratio or tend to be more positively charged but are located in the intra- and extracellular loop structures of the CCRs (not shown). 
The negative charge density at the receptor N-terminus is further increased by posttranslational tyrosine sulfation and N-glycosylation (except for CCR10).[330, 331]

Figure 1.13: Sequence alignment of the N-termini of homeostatic CC-chemokine receptors, created with Jalview [332]. Red: Negatively charged AAs. Blue: Positively charged AAs. Yellow: Tyr residues as potential sulfation sites. The alignment conservation annotation is a numerical index reflecting the conservation of the physicochemical properties of the AAs in each column. The index ranges from zero (no conservation) to 11 (marked by *), denoting full conservation. The alignment and its visualisation were conducted by Cyrill Brunner.

Notably, this model was proposed before high-resolution structures were available. However, later NMR studies on the CXCR4/CXCL12 interaction were consistent with the model.[333] The authors found that the small molecule AMD3100 can specifically dislodge the chemokine N-terminus from its binding site. Thus, it prevents CR activation without displacing the bound chemokine core-domain from the CR N-terminus. This antagonism reduces the chemotaxis of CXCR4-positive stem cells towards higher concentrations of CXCL12 and releases them from the bone marrow into the blood.[334] AMD3100 (Plerixafor) is nowadays used in combination with granulocyte colony-stimulating medicines to facilitate stem cell extraction; it is approved for patients with non-Hodgkin lymphoma and multiple myeloma.[335] However, recent studies reveal a more complex behaviour and go beyond the two-step, two-site model in some cases. The underlying observations are that numerous chemokine and CR contacts occur outside the proposed sites, that the CR N-terminus carries posttranslational modifications (tyrosine sulfation, N-/O-glycosylation and polysialylation), and that the stoichiometry is complex (di-/oligomerisation of the chemokine and/or the CR).[336]

Besides the crucial role of the chemokines in assuring the functionality of the immune system, a series of pathological events is associated with the inappropriate activation of this network. These events include cardiovascular disease, allergic inflammatory disease, transplantation, neuroinflammation, cancer, and HIV.[337] Several studies emphasise the application of therapeutic peptides to disrupt undesired PPIs of chemokines or other proteins with CRs.
In 2010, crystal structures of CXCR4 bound to an antagonistic 16-residue cyclic peptide were presented.[338] The peptide fits into a transmembrane ligand cavity of CXCR4, which is known to be a co-receptor for the HIV-1 protein gp120, playing a role in cell fusion and entry. Small cyclic pentapeptides acting as CXCR4 antagonists have also been reported.[339] These peptides mimic turns within the original ligand structure and exhibit low micromolar to nanomolar activity in a CXCL12/CXCR4 binding inhibition assay. Heveker et al. presented a series of 13-residue peptides extracted from the CXCL12 sequence and tested their ability to block HIV-1 infection.[340] The peptide corresponding to the CXCL12 N-terminus was identified as a competitive binder and served as a template for an exhaustive screening of 234 single-substitution analogues. After the binding of gp120 to the cellular receptor CD4, conformational changes lead to a highly conserved binding pocket, also for the N-terminus of CCR5, another HIV-1 co-receptor. An approved small-molecule inhibitor blocks this interaction and is used for the treatment of patients with CCR5-tropic HIV-1.[341] A study by Seitz et al. revealed that peptides also offer therapeutic opportunities for this target. They discovered a cyclic β-hairpin peptide, mimicking the α-helical epitope which the tyrosine-sulfated N-terminal segment of CCR5 adopts upon binding.[342] Earlier work suggested a linear tyrosine-sulfated CCR5-derived peptide which fulfils the same purpose.[343] Furthermore, phage display was used to identify peptidic modulators of the interactions between eotaxins (CCL11, CCL24, CCL26) and CCR3.[344] The same interaction was studied by Zhu et al., who derived four peptides from the CCR3 N-terminus.[345] Variation of the tyrosine-sulfation state led to changes in the selectivity profile of these peptides towards the different eotaxins. These affinity variations probably result only from subtle changes of the chemokine surfaces, as indicated by chemical shift mapping with NMR.

CCR7 and CCL19/CCL21

A particular PPI which has been investigated within this thesis involves the chemokine receptor 7 (CCR7) and the chemokine CC-motif ligand 19 (CCL19). Like all other CRs, CCR7 contains seven transmembrane-spanning domains and its signal is mediated by heterotrimeric G proteins. While the full receptor sequence is known, its structure has not yet been elucidated. CCR7 is expressed on activated lymphocytes, mature dendritic cells and lymphoid tissues. Thus, it is involved in the regulation of the immune system.[346] Pathologically, several cancer cells express CCR7, which links the receptor to lymph-node metastasis.[323] Moreover, a recent study identified CCR7 as a potential target for therapeutics against rheumatoid arthritis.[347] CCR7-positive cells are attracted by the soluble gradient of the two endogenous ligands, the homeostatic chemokines CCL19 and CCL21.

CCL21 is unique because it exhibits an extended C-terminus with two additional cysteine residues. The remaining structure resembles the conserved tertiary structure of chemokines (Figure 1.13). The extended C-terminus is important for binding glycosaminoglycans (GAG) on the surface of endothelial cells, which immobilises the chemokine.[346, 348] Such immobilised chemokine fields are a prerequisite for the efficient adhesion of leucocytes. The truncation of this C-terminus is a natural posttranslational modification, leading to soluble CCL21 gradients that regulate cell trafficking. In 2012, the NMR solution structure of CCL21 was determined (PDB: 2L4N).[349] Chemical shift perturbations upon titration of a CCR7 N-terminal peptide to CCL21 indicated binding to CCL21 residues in the N-loop, the 40s loop and the third β-strand. The recently presented crystallographic structure of truncated CCL21 (1-79) (PDB: 5EKI) showed that the N-loop is tilted externally, in comparison to CCL19 and full-length CCL21, where it tilts internally.[350] It is also hypothesised that CCL21 is auto-inhibited by the extended C-terminal domain but is released upon interaction with polysialic acid that is located at CCR7.[351] These studies support the importance of the N-loop for receptor recognition.

CCL19 gradients are crucial for lymphocyte recirculation and B- and T-cell migration to secondary lymphoid organs, and are stronger inducers of chemotaxis due to the absence of the extended C-terminal domain. However, enhanced CCL19 immobilisation, while simultaneously retaining the chemotactic potency in dendritic cells, can be achieved by transferring the CCL21 tail to CCL19.[352] The NMR solution structure of CCL19 was presented by Veldkamp et al.[353] The group also performed NMR shift perturbation experiments by titrating the N-termini of CCR7 and P-selectin glycoprotein ligand-1 (PSGL-1) to the chemokine. The experiment revealed competitive binding of both N-termini, which supports the assumption that resting T-cells which co-express CCR7 and PSGL-1 experience increased chemotaxis and enhanced recruitment to secondary lymphoid organs.[354] In analogy to CCL21, the strongest perturbations of the chemical shift were observed at the N-loop and the third β-strand.

Figure 1.14: Schematic depiction of the CCR7/CCL19 site 1 interaction. The extracellularly located, unstructured receptor N-terminus plays a pivotal role in chemokine recognition. The binding is hypothesised to be driven by ionic interactions of acidic residues at the N-terminus and basic residues which are clustered at the chemokine core-domain. Structural data of CCR7, alone or in complex with its ligands, are not yet available.

The results of these binding studies give reason to hypothesise that a site 1 recognition of CCL19 (and CCL21) occurs at the CCR7 N-terminus (Figure 1.14) and that binding is driven by ionic interactions of AA residues of opposite charges. Considering solely the N-terminus in these experiments traces back to the mentioned approach of restricting PPIs to the binding epitope or hot spots. One can further think about peptidic modulators derived from the examined sequences of either receptor or chemokine. Such peptides may present valid starting points for novel compounds, enriching the portfolio of therapeutics for immunosuppression and oncology.

2 Aims of this Thesis

The computational prediction of key physicochemical properties such as lipophilicity, solubility, and pKa reduces the need for cost- and resource-intensive experimental determination. Moreover, these predictions facilitate the handling and analysis of large public compound libraries and in-house databases from pharmaceutical companies. The obtained computational results can be used for:

• The exploration of privileged property ranges for certain classes of pharmaceutically active compounds.

• Discriminating compounds with unwanted properties ("negative design").

• Monitoring property-changes in hit-to-lead optimisation campaigns.

• Navigating drug design to structures with satisfactory pharmacokinetic and pharmacodynamic profiles ("positive design").

While immense effort has been undertaken to develop a plethora of lipophilicity models for small molecules, tailored approaches for peptides and peptide-derived mimetics have so far not been reported. On this account, machine learning algorithms trained with peptide data may fill this gap and provide practically useful, bespoke QSPR models of lipophilicity. The methodology-based part of the thesis tests this hypothesis by:

• The construction of baseline logD7.4 models for short, linear peptides, applying techniques for feature selection and dimensionality reduction to account for the data scarcity.

• Establishing the shake-flask method for peptides. Synthesising hexapeptides and experimentally determining logD7.4 for prospective model validation up to a length of six AAs.

• Extending the baseline models considering peptide-mimetics from AstraZeneca, as these structures encompass real-world examples of drug discovery; further, assessing the applicability domain and limitations of these extended models.

• Evaluating the performance of the presented approaches in comparison to commercially available and routinely applied logD models for small molecules.

The final model is then tested for its eligibility to support the design of peptide PPI modulators. This application-based part includes:

• Confirming the interaction between CCL19 and the CCR7 N-terminus; deduction of a peptide-template thereof for simulated molecular evolution.

• The promotion and validation of the logD7.4 models to select peptides from in silico libraries, where these peptides shall differ from the template while retaining target affinity.

3 Materials and Methods

In this chapter, the relevant experimental and computational methods for the presented studies and their implementation are summarised.

3.1 Laboratory Methods

Peptide Synthesis

Throughout the projects, 35 peptides were synthesised in total. Table 3.1 provides a summary of all peptides and shows for which project each peptide was synthesised.

The peptides were synthesised using Fmoc-based SPPS [152] on a Symphony™ synthesizer (Gyros Protein Technologies, Tucson, USA) with DMF (Honeywell Speciality Chemicals, Seelze, Germany) as solvent. For producing C-terminally amidated peptides, 50 µM Rink amide 4-methyl benzhydrylamine (MBHA) resin (0.52 mmol/g) (AAPPTec, USA) was used as solid support. At the beginning, the resin was swollen three times for 10 minutes in DMF. The Fmoc-protected AAs were purchased from AAPPTec (Louisville, USA) and Protein Technologies, Inc (Tucson, USA) and a tenfold excess relative to the resin was applied. Coupling was conducted using 400 mM HCTU (Gyros Protein Technologies, Tucson, USA) as the activator and 800 mM NMM (Fisher Chemical, Pittsburgh, USA) as the mediator reagent in DMF (v/v). The final molar ratio in the reaction vial during coupling was 0.1 resin : 1 AA : 1 HCTU : 2 NMM. Before and after deprotection of the base-labile Fmoc protection groups of the resin and the AAs with a solution of 20% pyrrolidine (Acros Organics, USA) in DMF (v/v), the reaction vial was washed with DMF. After coupling the last AA, the reaction vial was first washed with DMF and then five times with DCM (Sigma-Aldrich, St. Louis, USA). Finally, the acid-labile side chain protection groups and the resin were cleaved off with a solution of 95% TFA (ABCR, Karlsruhe, Germany), 2.5% TIS (Sigma-Aldrich, St. Louis, USA) and 2.5% ddH2O (v/v/v). The crude products were precipitated for at least two hours in ice-cold diisopropyl ether (Merck Millipore, Darmstadt, Germany) at -20 °C, followed by four washing steps, each comprising centrifugation (10 minutes, 3000 rpm, -10 °C), removal of the supernatant and resuspension in ice-cold diisopropyl ether. After work-up, the peptides were left to dry overnight.

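Table 3.1: Summary of peptides synthesised by SPPS. The C-terminus of all peptides is amidated.

ID  Name           Sequence                        Project
1   JF_1_1         ALIWGY                          External Validation of logD7.4 Models
2   JF_1_2         FLGKVW                          External Validation of logD7.4 Models
3   JF_1_3         GAWPFL                          External Validation of logD7.4 Models
4   JF_1_4         IPFWKL                          External Validation of logD7.4 Models
5   JF_1_5         KLVWAF                          External Validation of logD7.4 Models
6   JF_1_6         LPVGWF                          External Validation of logD7.4 Models
7   JF_1_7         LYLGWI                          External Validation of logD7.4 Models
8   JF_1_8         PWGYVA                          External Validation of logD7.4 Models
9   JF_1_10        VPAFII                          External Validation of logD7.4 Models
10  JF_1_11        WPKIYV                          External Validation of logD7.4 Models
11  JF_1_13        VLIWFV                          External Validation of logD7.4 Models
12  JF_1_14        SVYLQP                          External Validation of logD7.4 Models
13  CCR7_C24A      QDEVTDDYIGDNTTVDYTLFESLASKKDVR  Fragmentation of CCR7 N-Terminus
14  CCR7_10.1      QDEVTDDYIG                      Fragmentation of CCR7 N-Terminus
15  CCR7_10.2      DDYIGDNTTV                      Fragmentation of CCR7 N-Terminus
16  CCR7_10.3      DNTTVDYTLF                      Fragmentation of CCR7 N-Terminus
17  CCR7_10.4      DYTLFESLAS                      Fragmentation of CCR7 N-Terminus
18  CCR7_10.5      ESLASKKDVR                      Fragmentation of CCR7 N-Terminus
19  CCR7_6.1       QDEVTD                          Fragmentation of CCR7 N-Terminus
20  CCR7_6.2       DEVTDD                          Fragmentation of CCR7 N-Terminus
21  CCR7_6.3       EVTDDY                          Fragmentation of CCR7 N-Terminus
22  CCR7_6.4       VTDDYI                          Fragmentation of CCR7 N-Terminus
23  CCR7_6.5       TDDYIG                          Fragmentation of CCR7 N-Terminus
24  Offspring 432  PQEDVF                          De Novo Peptide Generation
25  Offspring 211  VTDDWM                          De Novo Peptide Generation
26  Offspring 865  YTDDYI                          De Novo Peptide Generation
27  Offspring 161  LTDNYI                          De Novo Peptide Generation
28  Offspring 191  IGEEFL                          De Novo Peptide Generation
29  Offspring 420  VVEDFL                          De Novo Peptide Generation
30  Offspring 364  AEDDYA                          De Novo Peptide Generation
31  Offspring 876  MSDSVV                          De Novo Peptide Generation
32  Offspring 944  ISDNMA                          De Novo Peptide Generation
33  Offspring 256  VQSDVA                          De Novo Peptide Generation
34  Offspring 82   WQDDFL                          De Novo Peptide Generation
35  Offspring 7    VSNEHM                          De Novo Peptide Generation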

Peptide Analytics and Purification

The crude peptides were dissolved to a concentration of 1 mg/ml in a mixture of ddH2O and ACN with 0.1% FA (5-50% depending on solubility). 10 µl were injected for analysis. We used reversed phase HPLC (Shimadzu, Kyoto, Japan) with a Nucleodur™ C18 HTec column (150 x 3 mm, 5 µm, 110 Å) and conducted gradient runs from 5% to 70% ACN (Fisher Scientific, Loughborough, UK) in ddH2O + 0.1% FA (Sigma-Aldrich, St. Louis, USA) over 25 minutes with a flow rate of 0.5 ml/min. Compounds were detected simultaneously by UV (210, 228, 254, 270, 290, 310 nm) (Shimadzu SPD-M20A DAD) and electrospray mass detection (Shimadzu LCMS-2020) over a mass range of 300 – 1500 or 300 – 2000 in positive ion mode. For purification, the crude peptides were dissolved in the same solvent as for the analysis, but up to a concentration of 10 mg/ml depending on the yield. Reversed phase preparative HPLC (Shimadzu, Kyoto, Japan) was conducted with a Nucleodur™ C18 HTec column (150 x 21 mm, 5 µm, 110 Å) under the same chromatographic conditions as for the analysis, except that the flow rate was adjusted to 24.5 ml/min and the injection volume to 5 ml. We used the same settings for detection as for the analytical system, but UV was only recorded at 190 and 210 nm. The purity of the collected fractions was determined by the analytical method described before. The pure fractions were lyophilised with an Alpha 2-4 LDplus Freeze Dryer (Christ, Osterode am Harz, Germany) at 0.03 mbar and -85 °C. The MS chromatograms and spectra of the principal peaks of all purified peptides can be found in the sections A.2 and A.5.

Shake-Flask Method

LogP of neutral peptides and logD at physiological pH 7.4 of ionised peptides were determined by adapting the shake-flask method [45, 54] at ambient temperature in n-octanol ≥ 99% (Sigma-Aldrich, St. Louis, USA) and 20 mM PB. PB was prepared with K2HPO4 x 3H2O and KH2PO4 (Sigma-Aldrich, St. Louis, USA) and the pH was adjusted with either 1 M NaOH (Sigma-Aldrich, St. Louis, USA) or 1 M HCl (Merck Millipore, Darmstadt, Germany) to pH = 7.40 ± 0.01 with a Lab 860™ pH meter (SI-Analytics, Mainz, Germany). PB was then filtered through a 0.22 µm PES membrane filter (TPP, Trasadingen, Switzerland). N-octanol was degassed prior to use. Both solvents were mutually saturated by shaking overnight and subsequently separated, to avoid transfer of one phase into the other through the formation of microdroplets or microemulsions. Peptide stock solutions with a minimal volume of 5 ml were prepared in either PB (logD7.4 < 0.5, calculated by ACD/Labs™) or n-octanol (logD7.4 ≥ 0.5, calculated by ACD/Labs™) at a concentration of 25-100 µg/ml and sonicated if necessary. The peptides were weighed with a Toledo MX 5 balance (Mettler, Columbus, USA). An uncertainty of the balance of 1% was calculated for a minimum weight of 160 µg based on the repeatability given by the manufacturer (0.0008 mg) and a coverage factor of 2, according to USP 39 NF 32 chapters <41> and <1251>. The shake procedure was conducted on a K 250 mechanical shaker (IKA Labortechnik, Staufen im Breisgau, Germany) for one hour in 50 ml centrifugation vessels (TPP, Trasadingen, Switzerland), where the minimal volume of each phase was 1 ml to ensure proper sampling. The PB/n-octanol ratio was 1:1 (logD7.4 -2 to 2, calculated by ACD/Labs™), 1:10 (logD7.4 < -2, calculated by ACD/Labs™) or 6:1 (logD7.4 > 2, calculated by ACD/Labs™). After phase separation for 15 minutes, the samples were centrifuged at 23 °C and 2300 rpm for 10 minutes. Peptide quantification in one phase before (cstart) and after shaking (cPB or coct) was performed by HPLC-UV (Table 3.2) and external calibration. Each run was conducted in triplicate and the peak area was averaged. The mobile phase was adjusted for each peptide to achieve separation from the solvent in the shortest possible runtime. Before changing conditions for another peptide, the column was flushed according to the procedure recommended by Dolan [355] and equilibrated for at least 30 minutes.

Table 3.2: Chromatographic settings for peptide quantification after the shake procedure.

Module            Settings
HPLC              VWR L-2000 series
Column            Lichrospher 100 RP18 Lichrocart 250 x 4 mm
Mobile Phase      ddH2O + 0.1% TFA (v/v) / ACN + 0.1% TFA (v/v)
Flow rate         1.2 ml/min
Temperature       25 °C
Run-type          Isocratic
Injection Volume  20 µl
Detection         UV 220 nm

The calibration curve was determined from at least five data points, where at least two samples were weighed independently. For this purpose, the analyte was dissolved in either PB or n-octanol, like the stock solution. For the calculation of logD7.4 from concentrations in one phase, the concentration in the respective other phase was back-calculated from the mass balance:

\[ M = c_{\mathrm{start}} \cdot V_{\mathrm{start}} = c_{\mathrm{PB}} \cdot V_{\mathrm{PB}} + c_{\mathrm{oct}} \cdot V_{\mathrm{oct}} \tag{3.1} \]

where M is the mass calculated from the volume and concentration of the stock solution. With:

\[ \log D_{7.4} = \log \frac{c_{\mathrm{oct}}}{c_{\mathrm{PB}}} \tag{3.2} \]

the final equation is:

\[ \log D_{7.4} = \log \left[ \frac{c_{\mathrm{start}} - c_{\mathrm{PB}}}{c_{\mathrm{PB}} \cdot \frac{V_{\mathrm{oct}}}{V_{\mathrm{PB}}}} \right] \tag{3.3} \]

if concentrations were determined in the aqueous phase. It changes to:

\[ \log D_{7.4} = \log \left[ \frac{c_{\mathrm{oct}} \cdot \frac{V_{\mathrm{PB}}}{V_{\mathrm{oct}}}}{c_{\mathrm{start}} - c_{\mathrm{oct}}} \right] \tag{3.4} \]

if concentrations were determined in the n-octanol phase. LogDpH of three independent samples was determined as described. An inter-day comparison was conducted with another three independent samples. The calibration curve was not determined again on the second day; instead, an aliquot of the stock solution from the first day served as a quality control. The final result was reported as the mean ± std of the six samples from both days.

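The back-calculation from the mass balance can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the thesis: the function names are hypothetical, the stock volume is assumed to equal the volume of the quantified phase (which is what Eqs. 3.3 and 3.4 imply), and the sample standard deviation is an assumed choice for "std".

```python
import math
from statistics import mean, stdev


def logd_from_aqueous(c_start, c_pb, v_pb, v_oct):
    """Eq. 3.3: stock prepared in buffer, buffer phase quantified.

    Assumes V_start == V_PB, so the mass that left the buffer
    is (c_start - c_pb) * v_pb and now resides in the octanol phase.
    """
    if not 0 < c_pb < c_start:
        raise ValueError("c_pb must lie strictly between 0 and c_start")
    c_oct = (c_start - c_pb) * v_pb / v_oct
    return math.log10(c_oct / c_pb)


def logd_from_octanol(c_start, c_oct, v_pb, v_oct):
    """Eq. 3.4: stock prepared in n-octanol, octanol phase quantified.

    Assumes V_start == V_oct.
    """
    if not 0 < c_oct < c_start:
        raise ValueError("c_oct must lie strictly between 0 and c_start")
    c_pb = (c_start - c_oct) * v_oct / v_pb
    return math.log10(c_oct / c_pb)


def summarise(logd_values):
    """Mean and sample standard deviation of replicate logD values."""
    return mean(logd_values), stdev(logd_values)


# Example with hypothetical concentrations in µg/ml and volumes in ml
# (two illustrative replicates, not measured data).
replicates = [
    logd_from_aqueous(c_start=100.0, c_pb=50.0, v_pb=1.0, v_oct=1.0),
    logd_from_aqueous(c_start=100.0, c_pb=48.0, v_pb=1.0, v_oct=1.0),
]
print(summarise(replicates))
```

The input checks reflect a practical limit of the single-phase approach: if almost all or almost none of the peptide leaves the quantified phase, the difference c_start − c becomes unreliable, which is one motivation for adapting the phase ratio to the expected logD7.4.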
Microscale Thermophoresis

In order to detect binding of peptides to the chemokine CCL19 and for Kd determination, MST was conducted.[356] His-tagged CCL19 (Creative Biomart, Shirley, USA), dissolved in PBS (Sigma Aldrich, St. Louis, USA) + 0.05% Tween (Sigma Aldrich, St. Louis, USA), was centrifuged at 15000 g for 10 minutes at 4 °C upon thawing. The His-tagged chemokine was further labelled using Red Tris NTA Dye (NanoTemper Technologies GmbH, Munich, Germany) in PBS in a ratio of 2:1 and the solution was incubated for 30 minutes in the dark at room temperature. Labelled CCL19 was diluted to a concentration of 20 nM. Prior to the binding experiments, protein aggregation was tested by preparing five premium coated capillaries series NT.115 (NanoTemper Technologies GmbH, Munich, Germany) with 10 nM labelled CCL19 in PBS. Serial dilutions (1:1 for 16 data points or 1:2 for 8 data points) of potential binders were prepared with a starting concentration of 500 µM in PBS. The 8 (or 16) tubes, each containing 10 µl peptide solution of decreasing concentration, were then mixed with 10 µl of 20 nM labelled CCL19, leading to a final target concentration of 10 nM. The solutions were then incubated at room temperature for 15 minutes. For some peptides, a control assay was performed with the same concentration of labelled His6-peptide (NanoTemper Technologies GmbH, Munich, Germany) instead of CCL19. Thermophoretic movement of the fluorescently labelled binding partner was measured on a Monolith NT.115 instrument (NanoTemper Technologies GmbH, Munich, Germany) at 25 °C, 20% excitation power (i.e. power supplied to the excitation LED) and 40% MST power (i.e. power supplied to the IR laser) for at least 3 technical replicates. Acquisition time was set to 20 seconds and the change in relative fluorescence was taken 5 seconds after applying the temperature gradient. In order to rule out protein adhesion to capillary walls, capillary scans were conducted before and after the measurements. If adhesion occurred, the respective data point was dismissed or the measurement was repeated. Data analysis was performed using MO.Affinity Analysis v.2.2.4 (NanoTemper Technologies GmbH, Munich, Germany). The Kd model was not used for fitting if:

• S/N is < 5

• the signal amplitude is < 3

These peptides were treated as non-binders. Measurements were repeated 3 times with individual peptide samples, dilution series and labelled CCL19. Kd was reported as the mean ± std of the three measurements.
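The dilution scheme and the acceptance criteria for Kd fitting described above can be written down compactly. The following is a minimal sketch with hypothetical helper names; it assumes that a "1:1" dilution corresponds to a two-fold step (the dilution factor is therefore left as a parameter) and it only encodes the two thresholds stated in the text, not the fitting performed by MO.Affinity Analysis.

```python
def dilution_series(start_conc, n_points, dilution_factor=2.0):
    """Ligand concentrations of a serial dilution before mixing.

    dilution_factor=2.0 is an assumption for a '1:1' dilution step.
    """
    return [start_conc / dilution_factor ** i for i in range(n_points)]


def final_concentrations(series):
    """Mixing 10 µl ligand with 10 µl target halves each concentration."""
    return [conc / 2.0 for conc in series]


def use_kd_model(signal_to_noise, amplitude, min_sn=5.0, min_amplitude=3.0):
    """Return True if a Kd fit is attempted, False for a non-binder.

    Thresholds follow the text: no fit if S/N < 5 or amplitude < 3.
    """
    return signal_to_noise >= min_sn and amplitude >= min_amplitude


# 16-point series starting at 500 µM peptide, then mixed 1:1 with target.
ligand = final_concentrations(dilution_series(500.0, 16))
target_nm = 20.0 / 2.0  # 20 nM labelled CCL19 diluted to 10 nM
print(ligand[:3], target_nm)
print(use_kd_model(signal_to_noise=7.2, amplitude=4.1))
```

Note that mixing with the labelled chemokine halves the peptide concentration as well, so the highest final ligand concentration in the capillary is half the stated starting concentration under this reading.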

3.2 Computational Methods

Software

Data in .csv or .xlsx formats were handled and analysed with Microsoft Excel for Mac (v.16.13.1) and the pandas data analysis package (v.0.19.2) [357]. Data in .txt format were accessed with TextWrangler (BareBones Software Inc., North Chelmsford, USA), and converted to .csv or .xlsx files. For data analysis, implementation of machine learning algorithms, de novo peptide generation and plotting graphs, Python (v.2.7, https://www.python.org/) was used.[358] The working environment of Python was managed by the Anaconda Software Distribution tool (v.4.3.21, https://anaconda.org/) and the code was written and executed in Jupyter Notebooks (v.5.1.0, http://jupyter.org/). For logD7.4 calculations and benchmarking the developed model(s) we used ACD/Labs (ACD Percepta 2015 Build 2726, Advanced Chemistry Development Inc., Toronto, Canada), Instant JChem (v.18.5.0, 2018, ChemAxon, Budapest, Hungary) and the physicochemical & biopharmaceutical (PCB) module from ADMET Predictor (v.8.5, Simulations Plus Inc., Lancaster, USA). ALOGP and MLOGP were calculated by DRAGON (v.7, Kode Chemoinformatics, Pisa, Italy). The sequence alignment of chemokine receptors and the depiction of acidic and basic AAs were conducted with Jalview (v.2).[332] Peptide synthesis was planned and programmed with the Symphony™ Software (v.3.3.0), and the peptide UV and MS chromatograms and spectra were analysed with the Open Solution Software (v.1.2 Build 29, Shimadzu Corporation, Kyoto, Japan). UV chromatograms of the quantification after SFM were analysed with Easy Chrom Software (v.3.3.2). Preparation of graphs and scientific figures was conducted with Prism (v.7, GraphPad Software, La Jolla, USA), Adobe Illustrator CS6 (v.16.0.0, Adobe Systems, San Jose, USA) and the matplotlib (v.2.0.0)[359] and seaborn (v.0.7.1) visualisation packages in Python. Parts of this manuscript were prepared with Microsoft Word for Mac (v.16.13.1) and the final thesis was written in LaTeX with the TeXstudio Editor (v.2.12.8).

Molecular Representation and Descriptor Calculation

Chemical structures were expressed and saved either in SMILES or .sdf format. Prior to any work, the structures were washed at pH = 7 with MOE (v.2016.08, Chemical Computing Group Inc., Montreal, Canada). If present, salts that were drawn with covalent notation were disconnected from the molecule. Only the largest fragment was kept. In this way, solvents and disconnected ions were removed. Protonation states were rebalanced at pH = 7 by deprotonating strong acids and/or protonating strong bases. 1D and 2D molecular descriptors were calculated by DRAGON (v.7, Kode Chemoinformatics, Pisa, Italy) and by the same MOE version used for washing. The resulting datasets were saved as .mdb or .sdf and converted to .csv format for analysis and application in the machine learning workflow. The pepCATS descriptor [226] was calculated using the modlamp.descriptor module in the modlAMP Python package (v.3.4.0, http://www.modlamp.org), which was developed by my colleague and former member of the Schneider group Alex Müller.[120] ECFP3 fingerprints [212] were calculated with DRAGON.

Machine Learning

The presented machine learning workflows were implemented in Python. From the SciPy (v.0.18.1) open-source ecosystem for mathematics, science and engineering we used pandas; the data were handled and presented to the algorithms as a pandas.DataFrame or as arrays created with numpy (v.1.11.13).[360] The machine learning algorithms were applied using the scikit-learn package (v.0.18.1).[361] Prior to training of the models, the data were pre-processed: Non-informative descriptors (RSD < 2.5%), missing values and redundant descriptors (R2 > 0.95) were removed. For the latter case, the descriptor with the higher mean R2 to all others was removed. Remaining descriptors were scaled with the function StandardScaler. For model development and evaluation, data were split by train_test_split and CV was performed using KFold. To optimise a single parameter, such as the number of components in PCA or α in Lasso, one model was trained per candidate value; for the hyperparameter optimisation of SVR, ParameterGrid was applied. The performance of the regression models was calculated with the function mean_squared_error, and two-sided t-tests were performed with ttest_ind.
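The descriptor pre-processing rules can be illustrated with a short, dependency-free sketch. It is not the thesis code (which used pandas and scikit-learn): the function names and the toy table are hypothetical, "missing values" is interpreted as dropping any descriptor that contains a missing value, and RSD is taken as the sample standard deviation relative to the absolute mean.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation in percent (0 for constant columns)."""
    spread = stdev(values)
    if spread == 0:
        return 0.0
    centre = mean(values)
    if centre == 0:
        return float("inf")  # varies around zero: treated as informative
    return 100.0 * spread / abs(centre)


def r_squared(x, y):
    """Squared Pearson correlation between two descriptor columns."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def filter_descriptors(table, min_rsd=2.5, max_r2=0.95):
    """Return the names of descriptors that survive all three filters.

    table maps descriptor name -> list of values (None = missing).
    """
    # 1) drop descriptors with missing values or too little variation
    kept = {name: col for name, col in table.items()
            if all(v is not None for v in col) and rsd_percent(col) >= min_rsd}
    names = list(kept)
    r2 = {(a, b): r_squared(kept[a], kept[b])
          for a in names for b in names if a != b}

    def mean_r2(name):
        others = [o for o in names if o != name]
        return mean(r2[(name, o)] for o in others) if others else 0.0

    # 2) for each redundant pair, drop the member with the higher mean R2
    removed = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in removed or b in removed:
                continue
            if r2[(a, b)] > max_r2:
                removed.add(a if mean_r2(a) >= mean_r2(b) else b)
    return [n for n in names if n not in removed]


# Hypothetical toy descriptor table (not thesis data).
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],      # nearly collinear with "a"
    "c": [4.0, 1.0, 3.0, 2.0],
    "const": [5.0, 5.0, 5.0, 5.0],  # non-informative
    "gap": [1.0, None, 2.0, 3.0],   # contains a missing value
}
print(filter_descriptors(toy))
```

The surviving columns would then be standardised (StandardScaler in the text) before model training; fitting the scaler on the training partition only is the usual way to keep the test data untouched.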

Datasets

In total, four datasets were compiled for the development of lipophilicity models and for benchmarking these models against commercial predictors. All datasets incorporate an identifier, the SMILES code of the washed structure and the experimental logDpH. Lipophilicity was exclusively determined by SFM in n-octanol/buffer systems. Table 3.3 gives an overview of the size, MW and logD7.4 of each dataset as well as its purpose in this work.

• LIPOPEP

This dataset was manually compiled from the literature. It incorporates 243 linear di- to pentapeptides. 223 peptides can also be found in [183]. 20 peptides came from additional sources.[174, 176, 177] Each entry was manually checked, and only peptides for which logD was measured at pH 7.0 to 7.4 and for which experimental information (solvents, shake procedure and quantification) was given were considered. All peptides are non-cyclic and consist of natural AAs. The C-termini are either free, amidated or possess a tert-butyl group. The N-termini are either free or acetylated. Further chemical modifications are not present. Under these conditions, we assumed marginal changes in the distribution profile in the pH range between 7.0 and 7.4, except for His residues (pKa ≈ 6.0), which occurred in seven peptides. Some peptides had no ionisable AA, and their C- and N-termini were blocked. For these structures, logD7.4 = logP. LogD7.4 of all peptides in LIPOPEP ranged from -2.83 to 2.30.

• AZ

AstraZeneca kindly provided a dataset that incorporates peptides and peptide-mimetics from former drug discovery projects. LogD7.4 was exclusively determined by AstraZeneca. The original dataset had 1630 entries. Entries were removed if logD7.4 was measured indirectly by HPLC instead of SFM or if logD7.4 was annotated to be "<" or ">" than some value. The remaining data were checked for duplicates and the final dataset consisted of 800 entries. The structures comprise cyclic peptides, chemical modifications on the AA side chain and backbone level, and hydrophobic linkers, and possess a mean MW = 672 ± 289 g/mol. LogD7.4 ranges from -2.50 to 5.80.

• In-House

This dataset comprises 15 linear peptides for which logD7.4 was determined in our laboratories by SFM as described in section 3.1. Three peptides (Gly-Pro-Gly-NH2, Ac-Gln-Trp-Leu-NH2, Tyr-Pro-Trp-Phe-NH2) were purchased with purity > 98% (Bachem AG, Bubendorf, Switzerland). 12 hexapeptides were synthesised and purified, also as described in section 3.1. These peptides are random sequences composed of the 10 most frequent AAs in LIPOPEP and possess no acidic functions (the C-termini are amidated). LogD7.4 ranges from -3.05 to 2.35.

• s-Mol

The s-Mol dataset was extracted from the ChEMBL23 database (https://www.ebi.ac.uk/chembl/).[99] It comprises 4200 logD7.4 entries which were also measured by AstraZeneca as described in [57] and can be found under the assay ID CHEMBL3301363. These small molecules from AstraZeneca cover a logD7.4 range from -1.50 to 4.50.

Table 3.3: logD7.4 datasets which were used within the presented projects.

Dataset   Size  MW [mean ± std]  logD7.4 [mean ± std]  Application
LIPOPEP   243   397 ± 106        -0.94 ± 1.09          Training / Ext. Validation / Benchmarking
AZ        800   672 ± 289        1.65 ± 1.31           Training / Ext. Validation / Benchmarking
In-House  15    678 ± 144        -0.03 ± 1.49          Ext. Validation / Benchmarking
s-Mol     4200  383 ± 107        2.19 ± 1.20           Benchmarking

LIPOPEP and AZ were used to train and validate the lipophilicity models presented in this thesis. The In-House set served for external validation. The s-Mol dataset was not considered for model development. In the benchmark analysis 75% of this dataset was used to train our final model and the remaining 25% to evaluate its applicability for small molecules. In the benchmarking study, the In-House peptides were joined with the external validation partition of LIPOPEP.

De Novo Peptide Design

The generation of novel peptides from a template sequence was conducted by simulated molecular evolution, implemented in the algorithm Vespa.[301] The mutation operator was based on physicochemical AA similarity, as enumerated in the Grantham matrix.[302] Vespa (v.3, rewritten by Gisela Gabernet, Feb. 2017) was originally implemented in Python by Gisbert Schneider. AA sequences of generated peptides were provided in FASTA format and the function MolFromFASTA, implemented in RDKit (Release 2017.09.1, http://www.rdkit.org/), was used to create molecular structures, which were converted to SMILES format by MolToSmiles for pre-processing and descriptor calculation. The NMR structures of CCL19 and CCL21 were visualised in PyMOL (v.1.7, Schrödinger LLC, Portland, USA).
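To make the idea of a similarity-based mutation operator concrete, the sketch below draws replacement residues with a probability that decays with physicochemical distance. It is an illustration under explicit assumptions and not the Vespa implementation: the Gaussian weighting, the parameters sigma and rate, the five-letter alphabet and all distance values are hypothetical placeholders (the distances are not taken from the Grantham matrix).

```python
import math
import random

# Placeholder pairwise distances for a reduced alphabet (illustrative only).
TOY_DISTANCE = {
    ("L", "I"): 5, ("L", "V"): 30, ("I", "V"): 30,
    ("L", "D"): 170, ("I", "D"): 170, ("V", "D"): 150,
    ("L", "E"): 140, ("I", "E"): 135, ("V", "E"): 120,
    ("D", "E"): 45,
}
ALPHABET = "LIVDE"


def distance(a, b):
    """Symmetric lookup in the toy distance table."""
    if a == b:
        return 0.0
    return float(TOY_DISTANCE.get((a, b), TOY_DISTANCE.get((b, a))))


def replacement_weights(residue, sigma):
    """Gaussian weights: similar residues are more likely replacements.

    sigma should be on the scale of the distances, otherwise all
    weights can underflow to zero.
    """
    return {aa: math.exp(-distance(residue, aa) ** 2 / (2.0 * sigma ** 2))
            for aa in ALPHABET if aa != residue}


def mutate(sequence, sigma, rate, rng):
    """Mutate each position with probability `rate`."""
    child = []
    for residue in sequence:
        if rng.random() < rate:
            weights = replacement_weights(residue, sigma)
            child.append(rng.choices(list(weights),
                                     weights=list(weights.values()), k=1)[0])
        else:
            child.append(residue)
    return "".join(child)


rng = random.Random(0)  # fixed seed for reproducibility
offspring = [mutate("VLDDEI", sigma=50.0, rate=0.3, rng=rng) for _ in range(5)]
print(offspring)
```

A small sigma keeps offspring close to the template (conservative exchanges such as Leu to Ile dominate), whereas a large sigma flattens the weights and approaches random substitution; in an evolutionary loop, the generated sequences would then be scored and the best candidates selected as parents for the next generation.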


4 Results and Discussion

This part of the thesis is divided into four sub-chapters, related to the different project sections:

Sub-chapters 4.1 and 4.2 discuss the development of peptide-specific logD7.4 models. This includes the use of machine learning techniques for feature selection and modelling, the analysis of training data and its resulting impact on model performance, the combination of models towards a consensus result and the domain of applicability assessment. Sub-chapter 4.3 describes and discusses how our model performs in comparison to some selected commercially available logD software.

Parts of the sections 4.1, 4.2 and 4.3 were published in: J. A. Fuchs, F. Grisoni, M. Kossenjans, J. A. Hiss, G. Schneider, "Lipophilicity prediction of peptides and peptide derivatives by consensus machine learning", Medicinal Chemistry Communications 2018, 9, 1538-1546.

Sub-chapter 4.4 describes the model application to rationally select CCL19-binding peptides.

4.1 Baseline Models

Introduction

In comparison to the abundance of computational approaches for lipophilicity calculations of small molecules [9, 102], only a few models exist that are specific to peptides or to other structures exceeding the size and complexity of common drug-like molecules. In 1999, a method to calculate partition coefficients for peptides by summing up lipophilic contributions of each AA in a given sequence was proposed.[89] The concept follows the classic fragment approaches for small molecules (cf. section 1.1), but in its implementation it is limited to natural AAs and non-cyclic structures. Visconti et al. used SVR to build logD7.4 models for peptides up to a length of six AAs.[104] The authors emphasised the scarcity of peptide lipophilicity data, which restricts the development of broadly generalisable models. On this account, the following part of this thesis describes our contribution to bespoke logD7.4 predictors for these molecules. First, the focus was set on short, linear peptides (LIPOPEP) and the prospective model evaluation with a set of linear hexapeptides, synthesised and tested in our laboratory facilities.

Feature Selection and Dimensionality Reduction

For the LIPOPEP set, all available 1D and 2D MOE descriptors were calculated (N=219). These descriptors were pre-processed and checked for missing values and redundancy by correlation analysis as described in chapter 3.2. The remaining 120 descriptors were mean-centered and scaled to unit variance. 75% of the data were used for model development (training) and cross-validation (CV) while 25% remained for external validation (EV) of the final models. Splitting was conducted randomly, but the peptides were first clustered into ten groups according to their lipophilicity by the k-means algorithm [234]. This method ensured similar logD7.4 distributions of the two partitions. The training data were then given to Lasso [240] and a series of 50 separate regression models with α values from 10⁻⁵ to 10¹ were trained (Figure 4.1). For α < 0.01, maximum model variance between the training and CV partitions was observed. For α > 0.01, the variance started to shrink but the overall performance decreased (bias). Models trained with α > 0.2 showed nearly identical performance between training and CV, but RMSE > 0.9 indicated too low complexity for the given regression task. The final model was trained with α set to 0.06, leading to a selection of 12 features. The feature "rsynth", describing the synthesisability of a molecule, was regarded as irrelevant in the context of logD prediction and thus removed. The final selection of 11 features is summarised in Table A.1.

Figure 4.1: Feature selection by Lasso. Top: 50 models were trained with different α in the range 10⁻⁵ to 10¹ and the average RMSE of five-fold CV (green: training partition, blue: validation partition) was calculated for each model. The green and blue shades depict the respective std. Increasing α creates simpler models with fewer features (Bottom). The dashed line indicates the chosen model with α=0.06, which relies on 11 features.

Seven of the selected features incorporated information about parts of the molecule's vdW surface: PEOE_VSA-6, PEOE_VSA-5, PEOE_VSA+6 and PEOE_VSA-3 sum up vdW surface areas on the atomic level that fall into defined partial charge categories. The number of acidic atoms (a_acid) and PEOE_RPC- account for charge properties as well. SMR_VSA4, SlogP_VSA5 and PEOE_VSA_FHYD relate to surface polarisability and hydrophobicity. Lasso also selected MOE's global lipophilicity descriptor h_logD. We observed that h_logD provided poor predictions for LIPOPEP (Table 4.1), so it was not treated as a stand-alone logD model but kept within the feature set. Lasso seemed to pick features that incorporate different types of molecular information, as we could not observe any clusters of loadings onto the first two components of a principal component analysis (PCA) [228] (Figure 4.2). With the exception of SMR_VSA4, the direction of either PC1 or PC2 or both was influenced by all features, indicating a homogeneous distribution of the overall data variance. The performance of Lasso on the LIPOPEP set (RMSE (CV) = 0.60 ± 0.09, 75.5 ± 7.4 % accurate predictions, Table 4.1) demonstrated successful feature selection and model building.

Figure 4.2: Loading plot. PCA was completed on the eleven selected features and the first two components are shown.

In the next step, an orthogonal information-extraction strategy was scrutinised. The objective was to evolve from a knowledge- or intuition-driven pre-selection to a data-driven working hypothesis by providing the computer with broad, informative input. To achieve such a broad feature set, all available 1D and 2D DRAGON descriptors were calculated as well and joined with the MOE set. The same pre-processing steps as before were conducted. By this, 1188 dimensions were added to the 120 MOE features, leading to a 1308-dimensional input. The DRAGON features covered a variety of molecular representations: constitutional and connectivity information, E-state indices, drug-like and topological indices, ring descriptors, molecular and pharmacophoric properties. The novel feature set served as the input for a PCA.
The scree plot of the first 120 components was generated (Figure 4.3). One heuristic approach to choose the number of components for a reduced feature set is to look for an "elbow" in the eigenvalues, indicating that further components incorporate negligible data variance in comparison to the components before.[362] We followed this approach and defined components 1-20 as the reduced feature set, which explained 66.2% of the initial data variance. Other heuristic criteria consider retaining 90% of data variance or all components with eigenvalues >1.[362] Here, this led to sets with more than 90 dimensions. Thus, the scree plot criterion presented the preferred option considering that the LIPOPEP training partition incorporates only 179 entries. In contrast to Lasso, PCA does not pick certain features. The objective is to reduce the dimensionality while simultaneously retaining maximal data variance (cf. section 1.3).

Figure 4.3: Scree plot of the first 120 components from a PCA completed on the joined feature set of MOE and DRAGON. The blue line shows that the reduced feature set, defined by components 1-20, explained 66.2% of the initial variance.
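The two numerical criteria mentioned alongside the scree plot (a cumulative variance target and eigenvalues >1) can be compared with a few lines of Python. This is a minimal sketch, not the code used in this work: the function name and the toy eigenvalues are assumptions, the input is presumed to be sorted in descending order, and the "elbow" itself is a visual judgement that is not automated here.

```python
from itertools import accumulate


def components_to_keep(eigenvalues, variance_target=0.90):
    """Count the components retained by two common heuristics.

    `eigenvalues` is a hypothetical list sorted in descending order.
    """
    total = sum(eigenvalues)
    cumulative = [c / total for c in accumulate(eigenvalues)]
    # Smallest number of components whose cumulative share reaches the target.
    n_variance = next(i + 1 for i, c in enumerate(cumulative) if c >= variance_target)
    # Eigenvalue criterion: keep every component with eigenvalue > 1.
    n_eigen = sum(1 for ev in eigenvalues if ev > 1.0)
    return n_variance, n_eigen, cumulative


# Toy eigenvalues, purely illustrative (not taken from the thesis data).
toy = [5.0, 2.0, 1.5, 0.7, 0.4, 0.2, 0.1, 0.1]
n_var, n_eigen, cum = components_to_keep(toy)
```

On real data, the cumulative list can be inspected at the chosen elbow position to read off the share of variance retained by the reduced feature set.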

Results for Modelling with Lasso Features vs. PCA Scores

Lasso, as a variant of penalised multivariate regression, provides both feature selection and logD predictions. In comparison, PCA is an unsupervised learning method and, as such, cannot be applied for predictions. To compare the applicability of both generated feature sets for the given task, we introduced support vector machines for regression (SVR).[259] This algorithm learned the training partition of LIPOPEP, represented either by the selected Lasso features or by the PCA scores. Two parallel SVR-optimisation strategies were followed with C and γ subject to hyper-parameter optimisation, employing a grid-based five-fold CV approach (Figure 4.4). Each optimisation considered either the Lasso-selected features, "SVR(Lasso)", or the PCA scores, "SVR(PCA)", as input. In total, 144 parameter combinations were considered with C ∈ [0.01, 0.05, 0.1, 0.5, 1.0, 3.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0] and γ ∈ [1×10⁻⁸, 1×10⁻⁷, ..., 1×10³]. The respective models showing the lowest RMSE on CV were then automatically selected.

Figure 4.4: Performance heatmaps of SVR(Lasso) and SVR(PCA) models for the 144 scrutinised combinations of C and γ. The blue square marks the parameters of the final models.
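The 12 × 12 parameter grid described above can be enumerated as follows. The sketch only illustrates the bookkeeping of a grid search: `select_best` and the scoring callable are hypothetical names, and the actual SVR training and five-fold CV are assumed to happen inside that callable.

```python
from itertools import product

C_VALUES = [0.01, 0.05, 0.1, 0.5, 1.0, 3.0, 5.0, 10.0, 20.0, 50.0, 100.0, 200.0]
GAMMA_VALUES = [10.0 ** e for e in range(-8, 4)]  # 1e-8, 1e-7, ..., 1e3

# Every (C, gamma) pair that the grid search has to evaluate.
GRID = list(product(C_VALUES, GAMMA_VALUES))


def select_best(grid, cv_rmse):
    """Return the (C, gamma) pair with the lowest cross-validated RMSE.

    `cv_rmse` is a hypothetical callable (C, gamma) -> mean RMSE over the folds.
    """
    return min(grid, key=lambda params: cv_rmse(*params))


# Dummy scorer standing in for a real cross-validation run (illustration only).
best_c, best_gamma = select_best(GRID, lambda c, g: abs(c - 5.0) + abs(g - 0.01))
```

Keeping the scorer as a separate callable means the same selection logic can be reused for both input representations.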

The full training set partition was then provided to SVR(Lasso), SVR(PCA) and the initial Lasso model for learning, and logD7.4 of the left-out data was predicted. The RMSE and percentage of accurate predictions (∆logD7.4 ≤ 0.5 from the experimental value) were computed. All three baseline models generalised to the unseen data, as shown by the small performance differences between CV and EV (Table 4.1). Lasso and SVR(PCA) performed nearly identically (RMSE (CV) 0.60 ± 0.09 and 0.59 ± 0.11). SVR(Lasso) provided the best predictions (RMSE (CV) 0.47 ± 0.13 and 86 - 91% accurate on CV and EV). In comparison, only 10% of the predictions from h_logD were accurate. Apparently, the task was difficult to solve for this algorithm, given LIPOPEP peptides as query compounds. These results obtained for the LIPOPEP set demonstrated the successful selection of features and dimensionality reduction for robust logD7.4 modelling of short, linear peptides.

Table 4.1: Top: Performances of the baseline models + h_logD on the LIPOPEP set (N(Total) = 243). The same five CV-partitions as for the baseline models were used to calculate the performance of h_logD. Bottom: Performances of y-randomised baseline models, calculated for all training data (100 iterations). CV: cross-validation; EV: external validation

              CV (N = 179)                        EV (N = 64)
Model         RMSE ± std     % accuracy ± std     RMSE    % accuracy
Lasso         0.60 ± 0.09    75.5 ± 7.4           0.54    73.4
SVR (Lasso)   0.47 ± 0.13    86.0 ± 3.1           0.39    90.6
SVR (PCA)     0.59 ± 0.11    73.8 ± 4.1           0.41    75.0
h_logD        1.90 ± 0.64    10.1 ± 9.6           1.99    11.0

Y-randomisation (N = 179)
Model         RMSE ± std     % accuracy ± std
Lasso         1.08 ± 0.11    35.3 ± 7.0
SVR (Lasso)   1.13 ± 0.12    35.9 ± 6.8
SVR (PCA)     1.14 ± 0.12    35.8 ± 6.4

In order to test the robustness of these baseline models, we computed the same performance metrics on the CV set, but the models were trained with input X and target values y in a randomised order. This procedure was repeated 100 times. The results are shown in the lower part of Table 4.1. In all cases, we observed an increase in the RMSE and a decrease in the percentage of accurate predictions. Y-randomisation could not be applied to test h_logD, since it is not possible to train the algorithm with one's own data. However, our randomised models still showed lower RMSE than h_logD on the non-scrambled data.
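The y-randomisation test can be sketched in plain Python. This is an illustration under simplifying assumptions rather than the procedure used here: a one-feature least-squares line stands in for Lasso and SVR, the error is computed on the training data instead of on CV folds, and all function names and the toy data are hypothetical.

```python
import random
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_line(x, y):
    """Ordinary least squares for one feature (stand-in for Lasso/SVR)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


def y_randomisation(x, y, n_iter=100, seed=0):
    """Refit on shuffled targets and score against the true targets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_iter):
        y_shuffled = y[:]
        rng.shuffle(y_shuffled)  # break the X-y relationship
        slope, intercept = fit_line(x, y_shuffled)
        preds = [slope * xi + intercept for xi in x]
        scores.append(rmse(y, preds))
    return sum(scores) / len(scores)


# Toy data (illustrative only): y depends linearly on x plus small noise.
noise = random.Random(1)
x = [i / 10 for i in range(50)]
y = [0.8 * xi - 2.0 + noise.gauss(0, 0.1) for xi in x]

slope, intercept = fit_line(x, y)
rmse_real = rmse(y, [slope * xi + intercept for xi in x])
rmse_random = y_randomisation(x, y)
```

A model that has learned a genuine relationship should show a clearly lower error on the original targets than its y-randomised counterparts.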

Predictions from Baseline Models for Peptides up to a Length of Six AA

Further, the applicability of the baseline models towards hexapeptides was tested. We strived for a peptide set that included frequently observed AAs in the LIPOPEP set and covered its lipophilicity range. In total, 12 peptides were generated from Phe, Leu, Val, Ile, Gly, Ala, Tyr, Trp, Pro and Lys (Table 3.1). As such, the generated peptides included the nine most frequent AAs from LIPOPEP as well as Lys, which was chosen to insert additional charges into some peptides to decrease the lipophilicity. Met, which was the tenth most frequent AA, was left out for synthetic reasons. The AA composition of each peptide was arbitrary.

All peptides were synthesised by Fmoc-based SPPS and purified as described in chapter 3.1. LogD7.4 was experimentally determined for the hexapeptides and three model peptides (Gly-Pro-Gly-NH2, Ac-Gln-Trp-Leu-NH2, Tyr-Pro-Trp-Phe-NH2) which were purchased to set up the SFM. The resulting In-House set covered a logD7.4 range from -3.05 to 2.35, which corresponded roughly to the range of LIPOPEP (-2.83 to 2.30).

Ten of the 15 peptides were accurately predicted by SVR(Lasso) and the other predictions deviated less than 1.0 log units from the experimental value (Figure 4.5). The error was lower in the range of -0.38 to 2.35, indicating that the model was more certain for lipophilic peptides. The Lasso model performed similarly, although eight peptides were not accurately predicted, and two of them had an absolute error > 1.0 log units. Again, the predictions were better for peptides having an experimental logD7.4 of -0.38 and higher. Compared to that, the absolute error of SVR(PCA) ranged from 0.01 to 2.25. A correlation between the absolute error and the experimental value was not observed. The experimental uncertainty was expressed by the std of the six independent measurements for each peptide. This value went up to 0.31 at the edge cases of the covered logD7.4 range. Between logD7.4 -1.13 and 1.29 it ranged from 0.01 to 0.03.

Figure 4.5: Absolute prediction errors of the three baseline models on the In-House set. The straight line indicates the threshold for accurate predictions (∆logD7.4 ≤ 0.5), the dashed line for acceptable predictions (∆logD7.4 < 1.0). The std of the six experimental logD7.4 values per peptide is depicted by the light blue shade.
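The two thresholds used throughout this chapter translate directly into a small helper. The sketch below is illustrative: the function names and the label "inaccurate" for the remaining class are assumptions, and the toy values are not measured data.

```python
def classify_error(predicted, experimental):
    """Label a prediction by its absolute error in log units."""
    abs_err = abs(predicted - experimental)
    if abs_err <= 0.5:
        return "accurate"
    if abs_err < 1.0:
        return "acceptable"
    return "inaccurate"  # label chosen for this sketch


def percent_accurate(predicted, experimental):
    """Percentage of predictions within 0.5 log units of the experiment."""
    labels = [classify_error(p, e) for p, e in zip(predicted, experimental)]
    return 100.0 * labels.count("accurate") / len(labels)


# Toy values for illustration only.
toy_pred = [0.3, 0.8, -1.5, 2.0]
toy_exp = [0.0, 0.0, 0.2, 2.1]
toy_accuracy = percent_accurate(toy_pred, toy_exp)
```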

Discussion

The systematic data-driven approach led to three baseline models (Lasso, SVR(Lasso), SVR(PCA)), which were exclusively trained on and optimised for a set of short, linear di- to pentapeptides covering logD7.4 from -2.83 to 2.30. Often, lipophilicity ranges or thresholds rather than a certain value are applied to guide drug discovery towards preferable chemical properties. According to Lipinski et al., for example, poor absorption or permeation is more likely when clogP is > 5.[5] However, it is known that the "rule of five" space covers only a fraction of the drug-like chemical space [363], but revised versions of the concept rely on lipophilicity thresholds as well.[48, 110] Waring proposed the optimum logP/logD range that determines the overall quality of candidate drug molecules to be between ≈ 1 and 3. In another publication, the same author screened a structurally diverse Caco-2 cell permeability dataset from AstraZeneca, and proposed lipophilicity ranges to get 50% probability of high permeability depending on a given MW range.[364] The stated logD ranges were between 0.3 and 1.6 log units wide and incorporated values from <0.5 to >4.5. Considering that these are rough guidelines which do not replace the close investigation of promising drug candidates, it was reasonable to define a prediction to be accurate when it is within ± 0.5 log units from the experimental value. This assumption is supported by the experimentally determinable lipophilicity range. For example, the LIPOPEP set covered 5.1 log units. Thus, an accurate prediction deviates from the experimental value by less than 10% of the entire logD7.4 range.

Model performances were dependent on three aspects: (i) quality (and quantity) of the input data, (ii) a useful molecular representation regarding the given task and (iii) the choice of model parameters. LIPOPEP represented a dataset that was manually compiled and checked but incorporated a limited number of entries. This is reality for many datasets in chemoinformatics and creates a natural limit on generalisability and performance improvement for any predictive model. Creating low-dimensional feature sets is a common way to deal with this situation. Additionally, the Lasso features were meaningful in the physicochemical context of a partition between two immiscible phases: attraction to either an aqueous or an apolar environment depends on the analyte's surface characteristics and protonation state. Contrary to Lasso, PCA precludes feature interpretability. On the other hand, it can discover data patterns and descriptors which we would not have picked intuitively, yet which comprise valuable information for modelling distribution coefficients.

In machine learning, several samples for each feature combination would present perfect prerequisites; however, this is often not practical. An experience-based rule of thumb in the field of pattern recognition proposes at least five training examples for each dimension in the representation.[365] Our selections resulted in 16 (Lasso) and nine (PCA) LIPOPEP samples per input dimension.

Comparing the baseline models with their counterparts trained with y-randomised data revealed that a robust mathematical relationship between the two feature sets and logD7.4 was built. However, the performance decrease to RMSE ≈ 1.1 ± 0.1 for all models was not large. Having a closer look at the individual predictions from the random models, we observed that they did not exceed a range of -2 to 0.5 (depicted by way of example for the Lasso model in Figure 4.6). In other words, the random models were only able to make predictions within a narrow range, but the overall logD7.4 distribution of LIPOPEP (-0.94 ± 1.09, see Figure 4.8 A) prevented them from getting worse than RMSE ≈ 1.1. In comparison, the overall error of h_logD on the original data was much higher (RMSE(CV) = 1.90 ± 0.64), meaning this algorithm must have made some high prediction errors. According to the vendor of MOE, the logP part of h_logD was trained with 1836 small molecules (R2 = 0.84, RMSE = 0.59) but structural information was not provided (https://www.chemcomp.com/MOE-chemoinformatics_and_QSAR.htm, last accessed June 2018). Based on our results, we suspect a systematic error that occurs due to structural dissimilarities of the training set to the query compounds. However, in addition to the ten other features, h_logD does contribute to our selected feature set from Lasso.

Figure 4.6: Lasso predictions after learning y-randomised LIPOPEP data. For performance calculations, the procedure was repeated 100 times. The plot depicts, by way of example, predictions from the last iteration.

Figure 4.5 illustrates that Lasso and SVR(Lasso) are particularly well suited for logD7.4 predictions of hexapeptides. However, only half of the naturally occurring AAs were considered. A universal approach regarding all natural AAs requires more data with a more homogeneous AA distribution than LIPOPEP provides. In fact, the structural diversity plays as much of a role as the amount of data, because a structurally narrow training dataset will also lead to a narrow QSPR model. This fact emphasises our endeavour to move from lipophilicity models for small molecules to tailored in silico approaches.

A potential reason for the heterogeneous error distribution of SVR(PCA), with 6 out of 15 predictions on the In-House set having ∆logD7.4 > 1.0, is that the model over-fits the LIPOPEP data, which was not observable in the chosen performance metrics. Also, there could be feature "dissimilarity" between LIPOPEP and the In-House data in one or more dimensions so that the respective query compound is not covered by the model's known chemical space. Quantification of this "dissimilarity" is part of the applicability domain assessment.

Compared to the computational results, a clear trend of experimental uncertainty at the edge cases of the covered lipophilicity range was observed. This corresponds to the fact that the distribution between PB and n-octanol becomes unbalanced for very hydrophilic or hydrophobic compounds. For example, logD7.4 = -3 means that the analyte concentration in PB is 1000 times higher than in n-octanol. Quantifying the concentration changes before and after shaking becomes increasingly challenging for such cases. For measuring hydrophobic compounds, a large excess of the aqueous phase must be applied, fostering the formation of n-octanol micro-droplets. This hinders accurate quantification as well. To avoid formation of droplets or emulsions, an adaptation of SFM was proposed.[55] The solution is slowly stirred instead of shaken.

Taken together, the findings demonstrate the applicability of machine learning techniques for obtaining bespoke lipophilicity models for short, linear peptides but also promote the need to define the model's limitations.

4.2 Expanded Models

Introduction and Hypothesis

Through machine learning, we were able to produce robust SVR regression models that learned from low-dimensional feature sets. Yet, the chemical space of these baseline models was restricted to short and linear peptides, and proteolytic instability restricts the use of such peptides in therapeutic applications (cf. section 1.2). In order to improve the utility of peptide-derived structures and peptide-mimetics for drug development, the poor pharmacokinetic profiles must be overcome. Typically, this is achieved by installing functionalities that can be found in classic medicinal chemistry derived molecules as well.[366] To account for such real-world examples, the utility of the baseline models was examined, considering peptide lipophilicity data from AstraZeneca (AZ set).

Results for Modelling LIPOPEP vs. AZ

In the first step, the AZ set was split into CV and EV partitions according to the method used for LIPOPEP. By clustering the data in ten groups according to logD and randomly taking 75% of each group for CV and 25% for EV, similar lipophilicity distributions were ensured. The baseline models (trained on the LIPOPEP set) were then tested on AZ EV. The results are depicted in Figure 4.7. RMSE differences of 0.95 to 1.61 between LIPOPEP and AZ EV showed that our models were not capable of generalising to this new chemical domain. We also considered the opposite case, allowing the models to learn all AZ training data and predicting logD7.4 of LIPOPEP EV. Again, the models failed to generalise from one dataset to the other. In this instance, the computed RMSE differences were considerably lower (0.45 to 0.96), likely reflecting the increased density of observations across the chemical space.

Figure 4.7: RMSE of the baseline models trained on LIPOPEP data (light orange) vs. those trained on AZ data (light blue). External validation was conducted with test partitions from LIPOPEP (orange) and AZ (blue). CV: cross-validation; EV: external validation.
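The lipophilicity-stratified 75/25 split applied to both datasets can be sketched as below. Note the swapped-in technique: for brevity, the sketch bins compounds into ten equally populated logD groups instead of clustering them with k-means, so it is an approximation of the procedure rather than a reproduction. The function name and the synthetic logD values are assumptions.

```python
import random


def stratified_split(log_d, n_groups=10, train_fraction=0.75, seed=0):
    """Split indices into training (CV) and external-validation partitions.

    Compounds are sorted by logD and cut into `n_groups` equally populated
    bins (a simple stand-in for k-means clustering); each bin is then split
    randomly so both partitions cover the whole logD range.
    """
    rng = random.Random(seed)
    order = sorted(range(len(log_d)), key=lambda i: log_d[i])
    train, test = [], []
    for g in range(n_groups):
        start = g * len(order) // n_groups
        stop = (g + 1) * len(order) // n_groups
        group = order[start:stop]
        rng.shuffle(group)
        cut = round(train_fraction * len(group))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return sorted(train), sorted(test)


# Synthetic logD values for illustration only.
toy_rng = random.Random(42)
toy_log_d = [toy_rng.gauss(0.0, 1.0) for _ in range(240)]
train_idx, test_idx = stratified_split(toy_log_d)
```

Because every bin contributes to both partitions, the most hydrophilic and the most lipophilic compounds are represented in training and in external validation alike.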

Deeper analysis of the errors on the respective EV sets revealed a relation to the presence (or absence) of compounds having similar logD7.4 values (± 0.5 log units) in the training set. This observation was independent of whether logD7.4 was predicted for LIPOPEP or AZ. When the models learned from the LIPOPEP set, they over-emphasised the hydrophilic character of the data and the predicted logD values for the AZ set were lower than the experimental values (Figure A.14). Vice versa, the short LIPOPEP peptides were predicted as too hydrophobic when the AZ set was utilised for training. Analysing the logD7.4 distributions of both datasets (Figure 4.8 A) supported this observation.

A clear shift towards hydrophobic peptide-mimetics in AZ (logD7.4 = 1.65 ± 1.31) in contrast to hydrophilic structures in LIPOPEP (logD7.4 = -0.94 ± 1.09) was observed (p < 0.01, non-parametric Mann-Whitney U test). These results emphasised that the covered target range in the training set presents a key parameter for defining the applicability domain, meaning that any prediction must be considered to be unreliable for query compounds outside this range. This is a general assumption in QSAR / QSPR modelling, regardless of the algorithms and data-preparation methodologies employed.

The structural differences between LIPOPEP and AZ were investigated by conducting a substructure analysis. This analysis was based on atom-centered fragments, derived from extended-connectivity fingerprints with radius = 3 (ECFP3) [212]. ECFP3 were calculated and the occurrence of fragments was counted. The secondary amide bond was most prevalent in both datasets (Figure 4.8 B). Other prominent substructures in the LIPOPEP set are the alkyl-motif of Val, Leu and Ile, the aromatic-motif of Tyr and Phe, and the free C- and N-termini. In the AZ set, the free –COOH terminus occurred only in 0.03 % of the peptide-mimetics, because it was either blocked, cyclised, or linked to further non-peptidic functional groups. The AZ compounds contain many functional groups originating from synthetic small molecules, introduced to overcome drawbacks such as metabolic instability, poor membrane permeability and peptide aggregation.[366, 367]

Figure 4.8: Comparisons between LIPOPEP (orange) and AZ (blue). A: LogD7.4 distributions. Black curve: Bimodal kernel density estimation for the pooled data. B: Illustration of the most prevalent substructures of LIPOPEP and AZ, derived by ECFP3 fragment counts. C: PCA score plot on component 1 and 2 conducted on the Lasso selected features. D: Score plot of the first two components of the SVR(PCA) feature set.

There were many tertiary amides substituting the secondary amide peptide backbone. Secondary amines were also introduced for this purpose. Cyclohexane derivatives are often present in modified amino acid side-chains, and a variety of condensed ring systems occur, with 1-amino-tetrahydro- being the most frequent representative.

To visualise how the structural dissimilarities of LIPOPEP and AZ affected the chemical space of the SVR(Lasso) model, PCA with the eleven Lasso features was conducted. The resulting scores of each compound onto the first two components were plotted. For investigating the chemical space of SVR(PCA) in the same fashion, we plotted the values of dimensions one and two in the reduced feature set (which are the scores onto the first two PCA components). The results were similar for both models: LIPOPEP and AZ populated different chemical space regions (Figure 4.8 C and D). The AZ compounds spread out largely but some regions were less densely populated in comparison to the LIPOPEP set.

Final Consensus Model based on the Pooled Data

As previously discussed, we noted substantial differences between the LIPOPEP and AZ sets. However, the baseline models, developed with the LIPOPEP training data, were capable of building robust, non-over-fitting relationships also for the AZ data (Figure 4.7). The CV results of SVR(Lasso) and SVR(PCA) (RMSE = 0.81 ± 0.09 and 0.83 ± 0.06) revealed a ratio of ≈ 1:10 between the RMSE and the covered lipophilicity range (8.3 log units). This ratio of RMSE to the target-value range corresponded roughly to the scenario of the baseline models on LIPOPEP (RMSE(CV) between 0.47 ± 0.13 and 0.60 ± 0.09 for a covered logDpH range of 5.1 log units). This led us to conclude that the model architecture and input features were generally suited for predicting logDpH. To investigate whether expanded chemical knowledge leads to a broader applicability of our models, we pooled the LIPOPEP and AZ data.

As both sets were split stratified by logD7.4, the same bimodal lipophilicity distributions resulted for CV and EV of the pooled data (Figure 4.8 A). This procedure ensured that the expanded models learned the entire logD7.4 range. In addition to the lipophilicity range, the models' known feature space expanded as well. Cross-validation metrics were computed, then the complete training partition was learned prior to predicting the left-out data for external validation. As shown in the upper part of Table 4.2, this resulted in robust expanded versions of Lasso, SVR(Lasso) and SVR(PCA) which were less prone to over-fitting. Y-randomisation proved again that the approach performed better than any random model. Yet, RMSE(CV) of the Lasso (0.98 ± 0.03) was significantly larger than that of SVR(Lasso) (0.77 ± 0.05) and SVR(PCA) (0.78 ± 0.04) (p < 0.01, unpaired t-tests). Thus, we decided to proceed without this model.

Despite the nearly identical overall performance of the SVR regressors, single predictions for individual compounds deviated significantly (Figure 4.9 A). This outcome was expected, since both models relied on different molecular representations. However, no superiority of one model over the other on the single observation level was present, independent of whether the compounds originated from the LIPOPEP or AZ set. These results led us to conclude that both models might supplement each other.

A perceptron [368] was implemented to provide a jury decision by learning the predictions from SVR(Lasso) and SVR(PCA). The perceptron was optimised with regard to the batch size ∈ [10, 25, 50], training epochs ∈ [50, 100] and the activation function ∈ ['linear', 'relu', 'sigmoid']. The final model was trained for 100 epochs with batch size = 10 and linear activation function. However, this jury model did not perform better than averaging the output weighted by the CV performance of the first stage models:

\[
\text{Consensus} = \log D_{\text{SVR(Lasso)}} \cdot \left[ \frac{\text{RMSE}_{\text{SVR(Lasso)}}}{\text{RMSE}_{\text{SVR(Lasso)}} + \text{RMSE}_{\text{SVR(PCA)}}} \right] + \log D_{\text{SVR(PCA)}} \cdot \left[ \frac{\text{RMSE}_{\text{SVR(PCA)}}}{\text{RMSE}_{\text{SVR(Lasso)}} + \text{RMSE}_{\text{SVR(PCA)}}} \right] \tag{4.1}
\]

This simple calculation led to the best predictions for the pooled data (Table 4.2), without the need for another model. Consequently, it was implemented for providing the consensus result. Experimental logD7.4 vs. our consensus predictions of the pooled left-out data are depicted in Figure 4.9 B.
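Equation 4.1 can be written as a short function. The sketch follows the weights as printed in Eq. 4.1; the function name and the two example predictions are hypothetical, while the RMSE values are the CV results of the expanded models from Table 4.2.

```python
def consensus_log_d(pred_lasso, pred_pca, rmse_lasso, rmse_pca):
    """Weighted average of two logD predictions, following Eq. 4.1 as printed."""
    total = rmse_lasso + rmse_pca
    return pred_lasso * (rmse_lasso / total) + pred_pca * (rmse_pca / total)


# Hypothetical predictions for one compound; RMSE(CV) values from Table 4.2.
example = consensus_log_d(1.20, 1.60, rmse_lasso=0.77, rmse_pca=0.78)
```

With RMSE(CV) values of 0.77 and 0.78, both weights are close to 0.5, so the consensus is nearly the plain mean of the two predictions.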

Figure 4.9: Consensus results. A: Absolute error (AE) of SVR(Lasso) vs. AE of SVR(PCA) on the pooled EV set. Observations in the upper left triangle have lower errors by SVR(Lasso). SVR(PCA) produces lower errors for data in the lower right triangle. B: Consensus predictions by performance weighted average vs. experimental logD7.4 of the pooled EV set. Accurate predictions are within the straight lines. Acceptable predictions within the dashed lines. LIPOPEP compounds within the pooled EV set were coloured orange, AZ data blue. Red circles: Compounds with AE > 2.0. AE: absolute error; EV: external validation.

The highest errors were observed between logD7.4 -2 and 2, although sufficient data were available. Apparently, some training molecules with logDpH in this region influenced the model less than others. Seven AZ compounds were predicted with absolute error > 2 log units (data points circled in red in Figure 4.9 B). These structures will be discussed in the next but one section. Overall, 64% of the unseen data were predicted accurately (∆logD7.4 ≤ 0.5), 28% acceptably (∆logD7.4 between 0.5 and 1.0) and only 8% with higher errors.

Table 4.2: Top: Performances of the Expanded Models and Consensus logD on the Pooled Data (N(Total) = 1058). Bottom: Performances after y-randomisation, calculated for all training data (100 iterations). CV: cross-validation; EV: external validation; *consensus performance was calculated for the entire training set.

              CV (N = 776)                        EV (N = 282)
Model         RMSE ± std     % accuracy ± std     RMSE    % accuracy
Lasso         0.98 ± 0.03    39.7 ± 3.8           1.06    36.0
SVR (Lasso)   0.77 ± 0.05    65.2 ± 3.1           0.80    61.4
SVR (PCA)     0.78 ± 0.04    58.9 ± 3.3           0.75    62.6
Consensus     0.57*          72.3*                0.72    63.7

Y-Randomisation (N = 776)
Model         RMSE ± std     % accuracy ± std
Lasso         1.68 ± 0.06    20.5 ± 2.8
SVR (Lasso)   1.65 ± 0.02    26.5 ± 1.0
SVR (PCA)     1.54 ± 0.02    34.1 ± 1.1

Domain of Applicability

The results presented and discussed hitherto emphasised multiple times that evalu- ating the prediction quality is as important as the prediction itself. This concept is commonly referred to as applicability domain (AD) assessment. AD defines the region in space which provides a model with sufficient knowledge to make reasonable pre- dictions. Query compounds that lie outside the borders of this domain can be seen as anomalous objects for the underlying model.[216] As reported in section 1.3, many AD definitions exist. We decided to implement a distance-based method, in order to quan- tify the novelty of query compounds in regards to the training domain. This assumes that outliers exhibit a larger distance in feature space to the training set compounds than a certain threshold value.[287] For this purpose the expansion of feature space was first quantified as the average euclidean distance of all training compounds to 74 Chapter 4. Results and Discussion their centroid (d). A cut-off (h) was set manually by multiplying d with factor x. For a novel compound, the euclidean distance towards the training set centroid was com- puted (dQuery) and if dQuery > h, the compound was labelled as beyond the AD. By choosing x = 2, 97.4% (SVR(Lasso)), respectively 94.8% (SVR(PCA)) of the pooled training data and 95.0% (93.6%) of the pooled left-out data fell into AD (Figure 4.10). These results were in accordance to a systematic investigation of cut-off values in distance-based AD approaches.[286] The authors stated that thresholds by twice the value of d or the 95 percentile of the training compound-distances to their centroid times d reflected a reasonable choice of compounds outside AD. This approach pro- vided the best regression performances in a proof of concept study on the CAESAR dataset.[369, 370] Limiting AD by the std of training compound-distances to the model centroid as the multiplier of d led to a maximal number of outliers but did not improve the statistics. 
Williams plots [371] were implemented for visualising the AD on the basis of h and standardised residuals (Figure 4.10 Top). Thresholds for the latter were set to -3 and 3, considering that observations within this range cover three standard deviations. For outliers in the orange areas that exceeded the residual thresholds but lay within h, the larger deviations from the mean error could not be explained by feature dissimilarities. 2.6% of the training compounds exceeded the cut-off (h = 6.0) in SVR(Lasso), and 5.2% the cut-off (h = 50.1) in SVR(PCA). All outliers with respect to h had a prediction error within three standard deviations. This indicated that these molecules carry influential feature characteristics for the models. Consequently, they were not excluded from the training set.

For both models, the "novelty" of a query compound was quantified as dQuery ∗ h^-1 (i.e. dQuery / h). A novelty value > 1 meant that a compound was an outlier, while compounds with a value < 1 were inside the AD. This single value was chosen because it is easy to interpret for the user. A steady RMSE increase from 0.51 to 1.27 was observed for SVR(Lasso) when the pooled left-out observations were grouped into five equally sized groups from novelty = 0.28 ± 0.03 to the most distant group (novelty = 0.96 ± 0.57). SVR(PCA) exhibited the same behaviour (with the exception of novelty = 0.53 ± 0.03), but the total increase was smaller (∆RMSE = 0.35).
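The grouping of left-out observations by novelty can be illustrated as follows. This is a hedged sketch: the function name `rmse_by_novelty` and the toy numbers are assumptions made for illustration, and the handling of a remainder (assigned to the last group) is one possible choice that the text does not specify.

```python
import math

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def rmse_by_novelty(novelty, y_true, y_pred, n_bins=5):
    """Sort observations by novelty, cut them into n_bins groups of equal
    size and return one (mean novelty, RMSE) pair per group."""
    rows = sorted(zip(novelty, y_true, y_pred), key=lambda r: r[0])
    size = len(rows) // n_bins
    result = []
    for i in range(n_bins):
        # the last group takes any remainder (assumption)
        stop = (i + 1) * size if i < n_bins - 1 else len(rows)
        chunk = rows[i * size:stop]
        mean_nov = sum(r[0] for r in chunk) / len(chunk)
        result.append((mean_nov, rmse([r[1] - r[2] for r in chunk])))
    return result

# Toy data in which the error grows with novelty (invented values)
nov = [i / 10 for i in range(1, 11)]
for mean_nov, err in rmse_by_novelty(nov, [0.0] * 10, nov):
    print(round(mean_nov, 2), round(err, 2))
```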

The final .csv file provided to the user after predicting logD7.4 for novel compounds incorporates: (i) the ID, (ii) the predicted value from SVR(Lasso) + novelty of the query, (iii) the predicted value from SVR(PCA) + novelty of the query and (iv) the consensus value (cf. A.16).
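A possible layout of such an output file is sketched below. The column names and the function `write_predictions` are hypothetical; the consensus is taken here as the arithmetic mean of both first-stage predictions, in line with the averaging consensus discussed in the recapitulation.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only
HEADER = ["id", "logD_svr_lasso", "novelty_svr_lasso",
          "logD_svr_pca", "novelty_svr_pca", "logD_consensus"]

def write_predictions(rows, handle):
    """Write one line per compound; the consensus is the mean of both
    SVR predictions."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)
    for cid, p_lasso, n_lasso, p_pca, n_pca in rows:
        consensus = round((p_lasso + p_pca) / 2.0, 2)
        writer.writerow([cid, p_lasso, n_lasso, p_pca, n_pca, consensus])

buffer = io.StringIO()
write_predictions([("pep1", 1.0, 0.4, 2.0, 0.6)], buffer)
print(buffer.getvalue())
```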

Figure 4.10: Top: Williams plots of standardised residuals vs. the distances to the model centroid of compounds in the pooled training set (squares) and EV (circles) for SVR(Lasso) and SVR(PCA). The vertical dashed line depicts the cut-off (h = 2 × average dtrain). The horizontal lines indicate the standardised residuals at -3 and 3. Observations in the green rectangle are within the AD and cover 99.7% of the normally distributed error. Observations in light orange either exceed the distance or the error thresholds. Observations in dark orange exceed both. Red circles: Compounds with absolute error > 2 log units. Bottom: RMSE in relation to "novelty" of the query compounds. The novelty is quantified as the ratio of the distance to the model centroid divided by h. Left-out data were sorted according to their novelty and binned into five equally sized groups. Lines connecting the markers are introduced for visualisation and do not display a function.

Discussion

Outlier Analysis

Although ≈ 95% of the query compounds are within the ADs of both models and 92% of the consensus predictions are acceptable (64% within ± 0.5 log units), high deviations (absolute error > 2 log units) were observed for seven peptide-mimetics. For the compound AZ 323, its coordinates on the Williams plots (red circles) as well as the structural comparison to the nearest neighbours (NN) in the training data (Table 4.3) explain the large error. This molecule showed by far the largest distance to the model centroids and revealed an exceptionally high MW (2710 Da) and number of atoms (333) in comparison to its NN in SVR(Lasso) (MW = 1423, 203 atoms) and SVR(PCA) (MW = 1805, 243 atoms).

AZ 354 is an example where the similarity in the feature space of SVR(Lasso) stands in contrast to the structural dissimilarity: the NN is cyclic while the query compound is linear. In comparison, the NN in SVR(PCA) presents the same structure with insertion of side-chain-free α- and β-AAs.

AZ 359 and AZ 382 are predicted to be hydrophilic molecules (consensus logD7.4 = -0.94 and -0.72), but the experimental values are 2.00 and 1.95. Both are structurally similar to their NNs, which possess nearly identical experimental logD values. In both cases, the obvious difference is the introduction of a hydroxy-methyl group in the query compound. Both models are apparently highly influenced by this function and assign it a strong hydrophilic character.

The opposite was observed for AZ 576, which incorporates a free carboxy function. This function is blocked by a tert-butyl group in the NN of SVR(PCA). The strong impact of the negative charge (4.6 log units lower experimental logD7.4) is not adequately captured by the model.

The situation that small changes in the structure correspond to large shifts in experimental values has been observed and described before in QSAR studies and is referred to as "activity cliffs".[372, 373] In analogy to this concept, we observe a "property cliff" for AZ 770. In both models, the NN in the training data is the same cyclic hexapeptide but with altered stereochemistry of two amino acids. The experimental logD7.4 of the NN is 3.44. Thus, AZ 770 is predicted to be hydrophobic (consensus logD7.4 = 2.57). Yet, the experimental value is 0.24. Of course, unless proven otherwise, the experimental value could be wrong. If this particular molecule is of interest, the experiment should be repeated to confirm the value.

The discussed examples showcase some model limitations. However, the overall performance does not justify the addition of more complexity for the sake of better prediction of special cases.

Table 4.3: Molecular structure and logD7.4 of compounds in pooled EV which were predicted with absolute error > 2 log units and the nearest neighbour (NN) in chemical spaces of SVR(Lasso) and SVR(PCA).

ID       Consensus   Structure                NN SVR(Lasso)            NN SVR(PCA)
         logD7.4     (logD7.4 experimental)   (logD7.4 experimental)   (logD7.4 experimental)

AZ 323    1.72       (-2.10)                  (-0.46)                  (-0.90)
AZ 359   -0.94       (2.00)                   (1.97)                   (1.97)
AZ 354   -0.71       (2.01)                   (-1.20)                  (1.98)
AZ 382   -0.72       (1.95)                   (1.98)                   (1.98)
AZ 576    1.47       (-1.10)                  (0.23)                   (3.50)
AZ 770    2.57       (0.24)                   (3.44)                   (3.44)
AZ 347   -0.03       (2.02)                   (0.76)                   (0.56)

Recapitulation

The inability of both SVR models to generalise from LIPOPEP to AZ and vice versa shows that QSPR models are likely to fail if novel query molecules lie outside the known chemical space and target-value range. These findings demonstrate the importance of clearly defining the domain of applicability, based on the diversity of structural features and experimental values observed.

First, the most prevalent substructures within the LIPOPEP set were the peptide bond, free C- and N-termini and hydrophobic side-chain residues. In contrast, peptide-mimetics exhibited functionalities that were introduced to overcome metabolic instability and to combine peptides with the benefits of small molecules. Such design strategies lead to elevated complexity, which in turn results in an expansion of the QSPR feature space (Figure 4.8 C and D).

Second, both datasets cover different lipophilicity ranges. Naturally, the majority of short peptides in LIPOPEP had undesirably hydrophilic characteristics for drug discovery. Accordingly, we observed a shift towards hydrophobic structures in AZ. Simply merging both sets and randomly creating novel training and test scenarios was inefficient. A single random split can lead to a biased error estimate if, by chance, the models learn only parts of the entire lipophilicity range. Tetko et al. showed for a set of toxicity models that the prediction accuracy was mainly determined by the training data distribution and not by the chosen QSAR approach.[374]

Along those lines, the bimodal logD7.4 distribution was preserved by stratification. This strategy also results in nearly identical distributions of the input features between the training and left-out data (Figure A.15). The pooled data approach led to robustly performing expansions of the initial SVR-based baseline models with similar overall performance (RMSE 0.77 ± 0.05 and 0.78 ± 0.04).

Merging first-stage predictors by applying neural networks as a jury model is a well-known concept in chemoinformatics.[226, 375] Since the input consisted only of the two SVR predictions, a network more complex than a single perceptron was not justified. Although the perceptron improves the RMSE, it is not superior to straightforward averaging. In such a case, the simpler solution is preferred. Here, rejecting the perceptron approach also avoids introducing a further model (and the error it produces). The fact that an improvement of no more than 0.03 - 0.08 RMSE units over the first-stage models (pooled EV) was achieved again supports their quality and robustness for lipophilicity calculations on the given data.

As the presented development of lipophilicity models was driven by systematic inspection of the training data and the resulting known chemical space, the distance-based AD assessment was chosen to detect outliers from this space. This approach assumes that predictions become more accurate when query compounds are in close proximity to compounds in the training set. The RMSE increase in relation to the distance to the model centroid supports this hypothesis (Figure 4.10 Bottom). The implemented AD provides an easy-to-understand quantification of structural novelty. This facilitates assessing the suitability of the training data for a given task. It also prevents over-emphasis of uncertain predictions and may advise the user to return to experimental evaluation in such cases.
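Stratification on a continuous target such as logD7.4 can be sketched by binning the sorted values and sampling the same fraction from every bin. The procedure below is an assumption made for illustration (the exact stratification scheme is not restated in this section); the function name, the number of bins and the toy values are hypothetical.

```python
import random

def stratified_split(values, test_fraction=0.25, n_bins=5, seed=0):
    """Split indices so that every target-value bin contributes the same
    fraction to the left-out set. Returns (train_idx, test_idx)."""
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = len(order) / n_bins
    train_idx, test_idx = [], []
    for b in range(n_bins):
        chunk = order[round(b * size):round((b + 1) * size)]
        rng.shuffle(chunk)  # random choice within the bin
        n_test = round(len(chunk) * test_fraction)
        test_idx.extend(chunk[:n_test])
        train_idx.extend(chunk[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy bimodal target: a hydrophilic and a hydrophobic cluster (invented)
values = [-1 + 0.01 * i for i in range(20)] + [2 + 0.01 * i for i in range(20)]
train, test = stratified_split(values, test_fraction=0.25, n_bins=5, seed=1)
print(len(train), len(test))
```

Because each bin is sampled separately, both modes of a bimodal distribution are represented in the training and the left-out data, which a single random split does not guarantee.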

4.3 Benchmarking the Final Consensus Model

Introduction and Hypothesis

In the previous chapters, intra- and inter-model comparisons guided the development phase. h_logD was also analysed and discussed because the algorithm provided one of the features chosen by the Lasso selection. Yet, a review of state-of-the-art lipophilicity calculations applied to peptides and peptide-mimetics was missing. Thompson et al. made the first step by investigating the prediction accuracy of logP predictors for 141 blocked peptides, 158 unblocked peptides and 41 cyclic peptides. An RMSE of 1.21 for the best performing algorithm, AlogP[91], and unreasonably high errors for the other models revealed that logP was the wrong parameter given the presence of ionisable functions within the peptides.[182]

In this section, the first benchmark analysis of three representative commercial logD calculators (ADMET-Predictor, ACDlabs and Chemaxon (cf. section 3.2)) on peptides is presented. All three commercial models calculate logD as a function of the logP of the query compound in its neutral state and the pKa of its ionisable functions. ACDlabs and Chemaxon rely on the "classic" approach, which regards lipophilicity as an additive property where defined molecular fragments or atoms contribute to logP.[90, 92, 93] ADMET-Predictor calculates logP with an artificial neural network ensemble (ANNE). With the hypothesis in mind that "classic" approaches show limitations when too many logP contributions are summed up, we compared the results to the consensus model presented in chapter 4.2. The applicability to small molecules was also investigated on a comprehensive set of 4200 logD7.4 values, measured in a standardised experiment by AstraZeneca (s-Mol).
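The dependence of logD on logP and pKa can be illustrated with the textbook relation for a compound with a single ionisable group, under the common assumption that only the neutral species partitions into the organic phase. This sketch is not the vendors' implementation, which handles multiple ionisation sites and is not disclosed in detail; the function name and the example values are hypothetical.

```python
import math

def logd_monoprotic(logp, pka, ph=7.4, acid=True):
    """logD of a compound with one ionisable group, assuming that only the
    neutral form partitions:
      acid: logD = logP - log10(1 + 10**(pH - pKa))
      base: logD = logP - log10(1 + 10**(pKa - pH))
    """
    exponent = ph - pka if acid else pka - ph
    return logp - math.log10(1.0 + 10.0 ** exponent)

# Invented example: an acid with pKa 4.4 loses about three log units at pH 7.4
print(round(logd_monoprotic(2.0, 4.4), 2))
# A very weak base (pKa 2.4) is essentially neutral at pH 7.4, so logD ~ logP
print(round(logd_monoprotic(2.0, 2.4, acid=False), 2))
```

Any error in the pKa estimate enters the exponent directly, which is one way to see why the pKa term can add uncertainty for ionisable compounds.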

Methods

LogD7.4 was predicted by ACDlabs, ADMET-Predictor, Chemaxon and our consensus model. The benchmark sets were (i) LIPOPEP + In-House peptides (N = 258), (ii) AZ peptide-mimetics (N = 800), (iii) s-Mol compounds drawn from ChEMBL as described in section 3.2 (N = 4200) and (iv) all data merged together (N = 5258). For all dataset and model combinations, RMSE and % accuracy were calculated.

ACDlabs provides two logP algorithms. Here, a combined prediction from both, ACDlogP [90] and GALAS [unpublished], was the logP contribution in the logD function. The first-stage models SVR(Lasso) and SVR(PCA) were trained as described in section 4.2, except when they served for predictions of s-Mol. In this case, we used 75% of s-Mol (N = 3150) for training. The data for this partition were selected by stratification on logD7.4, as performed when splitting LIPOPEP, AZ or the pooled data. Finally, the consensus results for all compounds were calculated and compared to the commercial software. This introduced a bias, because our models had already seen 75% of the data while the others presumably had not (structures of training molecules were not provided by the vendors). However, performance comparisons between CV and EV (Figure A.17) justified this approach, as no over-estimation of the predictive power of our model was observed.

In analogy to Mannhold's exhaustive benchmark study of logP algorithms, an arithmetic average model (AAM) was introduced to serve as a baseline. It adopted the average experimental logD7.4 of all compounds in a given dataset (-0.89 for LIPOPEP, 1.65 for AZ, 2.19 for s-Mol, and 1.95 for all data merged together) and used this value as the prediction for each compound within the respective set. Thus, it incorporated the most likely logD value as its model rationale. Models with RMSE > RMSE(AAM) can be considered non-predictive.[102]
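The AAM baseline and the two reported metrics can be sketched as follows. The tolerance of ± 0.5 log units used for "% accuracy" is an assumption made for this illustration, and all function names and toy values are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between experimental and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def accuracy(y_true, y_pred, tol=0.5):
    """Percentage of predictions within +/- tol log units (tol is an assumption)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return 100.0 * hits / len(y_true)

def aam_predictions(y_true):
    """Arithmetic average model: the dataset mean is predicted for every compound."""
    mean = sum(y_true) / len(y_true)
    return [mean] * len(y_true)

def is_predictive(y_true, y_pred):
    """A model counts as predictive only if its RMSE is below RMSE(AAM)."""
    return rmse(y_true, y_pred) < rmse(y_true, aam_predictions(y_true))

# Toy experimental values and model predictions (invented)
y_exp = [0.0, 1.0, 2.0, 3.0, 4.0]
y_model = [0.1, 1.2, 1.9, 3.6, 4.0]
print(rmse(y_exp, aam_predictions(y_exp)), accuracy(y_exp, y_model))
print(is_predictive(y_exp, y_model))
```

Note that RMSE(AAM) equals the population standard deviation of the experimental values, so a dataset with a narrow logD7.4 distribution yields a low baseline RMSE by construction.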

Results

The performances achieved by the analysed models on all datasets and combinations thereof are presented in Table 4.4. The same order of increasing RMSE, Consensus < ADMET-Predictor < ACDlabs < Chemaxon, was observed throughout, with the exception of the s-Mol dataset, where ACDlabs outperformed ADMET-Predictor. On the AZ set, the consensus approach showed the largest benefit over the second-best model (∆RMSE = 0.51 to ADMET-Predictor). On the other datasets, ∆RMSE to the second-best model ranged from 0.25 to 0.49. The consensus model and ADMET-Predictor performed better than the AAM on all datasets. So did ACDlabs on the LIPOPEP and s-Mol sets. The RMSE of Chemaxon was > RMSE(AAM) on all datasets. However, the software provided up to 12.5% more accurate predictions than the AAM.

Interestingly, all commercial models had a lower RMSE on LIPOPEP than on s-Mol. In particular, ADMET-Predictor and ACDlabs provided good results for short, linear peptides (RMSE = 0.69 and 0.75). This led to 79.5% (75.2%) accurate predictions in comparison to 83.7% retrieved by the consensus model. Apparently, information about AA side-chain residues is well covered by these models. In fact, only 3% of the peptides within the LIPOPEP set were flagged by ADMET-Predictor as being outside its AD (Figure 4.11 B). Rejection of only 1% of the s-Mol data supported the assumption that drug-like small molecules lie within ADMET-Predictor's AD. s-Mol incorporated 16 times more data than LIPOPEP, and its logD7.4 range was 0.6 log units larger, which explained the model's worse performance (∆RMSE = 0.5 to LIPOPEP).

Table 4.4: Performance of Consensus logD in comparison to commercial logD algorithms on LIPOPEP, AZ, s-Mol and all data merged together. Cells coloured in green highlight smaller RMSE than AAM on the respective dataset. Higher RMSE than AAM is coloured in red. AAM: arithmetic average model.

                  All Data                   LIPOPEP                    AZ                         s-Mol
Method            Rank  RMSE  Accuracy [%]   Rank  RMSE  Accuracy [%]   Rank  RMSE  Accuracy [%]   Rank  RMSE  Accuracy [%]

Consensus         1     0.68  60.1           1     0.44  83.7           1     0.67  64.9           1     0.69  57.7
ADMET-Predictor   2     1.17  49.6           2     0.69  79.5           2     1.18  42.6           3     1.19  49.0
ACDlabs           3     1.24  47.9           3     0.75  75.2           3     2.11  35.4           2     1.02  48.7
Chemaxon          4     1.57  39.6           4     1.21  35.7           4     2.74  29.0           4     1.26  41.8

AAM*                    1.39  27.1                 1.13  33.3                 1.31  28.3                 1.20  29.4

Figure 4.11: Performance comparisons between default and adjusted commercial models. A: RMSE of Chemaxon and ACDlabs before (black bars) and after re-training (grey bars) with CV-partitions of the respective datasets. B: RMSE of ADMET-Predictor for all compounds within the respective dataset (orange) and for compounds inside the model's AD (green). Numbers in brackets show the percentage of rejected molecules in the respective dataset.

ACDlabs and Chemaxon provide the possibility to re-train the algorithms with any library that provides molecular structures and experimental logD values. In accordance with the training of SVR(Lasso) and SVR(PCA), we provided the CV-partition of each dataset to the models and again predicted logD7.4 for all compounds in the respective set. As shown in Figure 4.11 A, this procedure did not result in any improvements.

ADMET-Predictor has a range-based AD assessment implemented. According to the vendor’s documentation, for outliers a prediction “may be correct but it would be unwise to put much faith in it”. Since LIPOPEP and s-Mol comprised structures that fell into regions of the known chemical space in 97% (99%) of the cases, re-calculating the RMSE after rejecting the outliers led to similar results (Figure 4.11 B). In contrast, ADMET-Predictor flagged 19% of the AZ molecules, indicating that peptide-mimetics tend to exceed its chemical space. The RMSE decreased from 1.18 to 0.95 when considering only the 81% of compounds that were within the scope of the method.

The results presented in Table 4.4 highlighted the differences in accuracy between the model and dataset combinations. Aiming to understand which factors determined the accuracy, we partitioned the four datasets according to the following schemes:

• 1.) Ionisable and non-ionisable molecules
In this case, the pKa contribution in the logD function of commercial models was absent for the latter group of compounds. For the first group, pKa contributed to the calculated logD value.

• 2.) Groups of molecules within a given MW range
Compounds in each dataset were binned into five equally sized groups according to MW. MW was chosen as a parameter for molecular complexity. Compounds with larger MW incorporated more structural features (fragments or atoms) which contributed to the overall logP of the "classic" methods (ACDlabs and Chemaxon).

• 3.) Groups of molecules within a given logD7.4 range
Compounds in each dataset were binned into five equally sized groups according to logD7.4. In section 4.2, we observed less accurate predictions for compounds that exhibited lipophilic or hydrophilic characteristics which the training set did not cover. Since no information about the known logD range of commercial models was available, this sub-analysis tested robustness over logD7.4 ranges of the given datasets.

Figure 4.12 A showed that all possible model and dataset combinations resulted in higher RMSE and less accurate predictions for ionisable compounds. The strongest RMSE increase was observed for ACDlabs and Chemaxon on AZ, which incorporated 77% ionisable molecules.

Figure 4.12: Performance of the consensus model in comparison to commercial logD tools on LIPOPEP, AZ, s-Mol and all data merged together. A: Differentiation into ionisable and non-ionisable molecules within the respective datasets. B: Molecules in the respective datasets are sorted and binned in five groups according to their MW. Each group contains one fifth of the data. The x-values display MW (mean ± std) of each group. C: Molecules in the respective datasets are sorted and binned in five groups according to their lipophilicity. Each group contains one fifth of the data. The x-values display logD7.4 (mean ± std) of each group. Lines connecting the markers are introduced to visualise trends for each logD calculator. They do not display a function.

For ionisable peptide-mimetics, ADMET-Predictor was the only commercial model that provided reasonable predictions (RMSE = 1.26, 40% accurate predictions). The consensus model revealed the smallest RMSE differences between ionisable and non-ionisable compounds on all datasets, followed by ADMET-Predictor on LIPOPEP and AZ and ACDlabs on s-Mol.

When investigating the relation between the MW of the query compounds and the predictive performance, a clear trend of increasing RMSE with increasing MW was observed for the commercial models (Figure 4.12 B). Within the subgroups incorporating the heaviest molecules of AZ (MW = 1135 ± 296) and s-Mol (MW = 525 ± 73), all commercial models revealed RMSE > RMSE(AAM). The difference was most substantial on AZ with RMSE = 1.80 (ADMET-Predictor), 4.00 (ACDlabs) and 5.40 (Chemaxon) in comparison to RMSE = 1.40 for the AAM. These results were in accordance with Mannhold’s benchmark analysis of logP predictors for small molecules: in this analysis, the performance decreased linearly with increasing number of non-hydrogen atoms (NHA). For molecules with NHA > 40, all investigated logP models failed to produce better results than the AAM.[102] Since the AAM takes the most likely logD value as its model rationale, no influence of MW was observed. In contrast to the commercial models, the MW of AZ and s-Mol molecules had no impact on the consensus model. On the LIPOPEP set, the consensus model also exhibited an RMSE increase from 0.25 for the group incorporating the lightest peptides (MW = 257 ± 27) to 0.75 for the heaviest group (MW = 609 ± 83). Still, in each MW bin in each dataset, the consensus model performed better than the commercial models and the AAM.

The AAM revealed a parabola-shaped dependency of the RMSE on the lipophilicity of the query compounds. Naturally, the lowest RMSE was observed for logD bins with the smallest difference to the average logD of the respective dataset. On LIPOPEP and s-Mol, the highest ∆RMSE between the five bins according to lipophilicity was < 0.49 for the commercial models and the consensus. On AZ, an RMSE increase from the subgroup incorporating peptide-mimetics with logD7.4 = 1.06 ± 0.23 to the most hydrophilic subgroup (logD7.4 = -0.33 ± 0.79) was observed. The increase was largest for the Chemaxon tool, followed by ACDlabs, ADMET-Predictor, and the consensus model.

Discussion

The results of this study reveal performance discrepancies for the commercial models depending on their application to either drug-like small molecules, short peptides, or complex peptide-mimetics. Our consensus is most beneficial for peptide-mimetics, demonstrating the usefulness of bespoke models for specific compound classes. In general, our results emphasise that robust relationships of the chosen molecular representations to lipophilicity are built by the first-stage models SVR(Lasso) and SVR(PCA). Thus, the method can be successfully transferred to small molecules (s-Mol), leading to ≈ 58% accurate predictions (9% more than ADMET-Predictor and ACDlabs), although its development was peptide-specific. Interestingly, all models performed best on LIPOPEP. One explanation may be the structural similarity of short peptides to the small molecules used for training the commercial algorithms. Secondly, LIPOPEP has a narrow logD7.4 distribution with the majority of compounds ranging between -2 and 0. Thus, the comparably low RMSEs have to be put into perspective, as the AAM also produces a low RMSE (1.21).

The determination of partition coefficients is experimentally more challenging than that of distribution coefficients. No ionised species of the analyte may be present in solution, which has to be ensured by applying proper pH conditions. This makes high-throughput applications of SFM for logP determination difficult. In contrast, in silico logP calculations are less complex. The pKa term, which accounts for charge contributions to lipophilicity in the commercial models (cf. equation 1.3), can be omitted. This prevents additional error. Consequently, ADMET-Predictor, ACDlabs and Chemaxon have lower RMSEs for non-ionisable compounds. The consensus model reveals the smallest RMSE deviations between predictions for the fractions of ionisable compounds (LIPOPEP: 68%, AZ: 77%, s-Mol: 51%) and non-ionisable compounds. These results emphasise that the charge influence is well covered within the molecular representations of SVR(Lasso) and SVR(PCA). We can assume that ionisable compounds are charged at physiological pH, because common functionalities in peptides as well as in small molecules in drug discovery, such as carboxylic acids and amines, have pKa values well above or below 7.4.

The strongest impact of MW on the performance was observed for Chemaxon, followed by ACDlabs. Both models rely on the "classic" additive approach that was first proposed[85] for logP calculations. The steady RMSE increase with increasing compound MW indicates that the prediction uncertainty of each fragmental or atomic contribution adds up. Here, defining a general cut-off is difficult since performances also deviate between the models. Instead, an empirical evaluation of the robustness of additive models with respect to MW or fragment count could serve as AD assessment. The additive models also do not consider the inaccessibility of fragments to the environment, which becomes more likely in larger molecules or cyclic structures. If molecular complexity goes along with non-rigidity, solvent-dependent conformational changes will add further uncertainty.[40] Both facts explain why, in particular, the peptide-mimetics with MW 737 ± 94 and higher as well as s-Mol compounds with MW 525 ± 73 represent challenging sub-classes for the additive models. Also, MW within these subclasses is significantly higher than in the training sets of ACDlogP (MW = 269 ± 119) and GALAS (MW = 252.33 ± 104.74) (p < 0.01, unpaired t-tests) as stated by the vendor.[376] ADMET-Predictor performs worse than the AAM on the same subclasses, but the performance decrease is comparably low. According to Simulations-Plus, the training set for the S+logP algorithm has a similar MW distribution to that of ACDlabs (MW = 246 ± 106).

While this again displays a significant difference to the mentioned subclasses (p < 0.01, unpaired t-test), some large molecules are also incorporated; the maximum MW is 1203.[377]

The consensus does not rely on molecular representations accounting for the overall molecular size, volume or surface area and thus is not affected by MW. The mentioned drawbacks of additive models do not exist in ADMET-Predictor and the consensus because both are descriptor-based machine learning approaches. The superiority of the latter over ADMET-Predictor again corroborates the need for high-quality and diverse training data for the given task.

As seen in Figure 4.12 C, the actual lipophilic or hydrophilic properties of short peptides and small molecules do not influence the performance, indicating good logD7.4 coverage for these compound classes in the respective training sets. As we split the data stratified to logD7.4, consensus performance is also robust for AZ. In contrast, ACDlabs and Chemaxon in particular struggle on hydrophilic peptide-mimetics (logD7.4 = -0.33 ± 0.79), again affirming the limitations of additive models. The prerequisite for reliable calculations is an equal distribution of lipophilic and hydrophilic contributions over the entire molecular structure. This is consistent with experimental experience, as surfactant-like compounds will attach to the solvent interface, making the quantification challenging.

The benchmark study reveals that the commercial models, routinely applied in small molecule drug discovery [378–383], should be used with caution for more complex molecules. For such cases, the consensus approach is most beneficial. The concept of bespoke models is not limited to peptides and derivatives thereof. It can rather be transferred to other classes of complex molecules with pharmaceutical potential, such as macrocycles and other derivatives of natural product templates.[384, 385] This trend can already be observed, as pharmaceutical companies develop novel models or fine-tune existing ones on their own data.[100, 101]

4.4 Focussed De Novo Generated Peptide Libraries for Studying Chemokine-Receptor / Ligand Interactions

Introduction

In this study, we present a novel approach for the rational selection of in silico generated peptides as modulators of the CCR7 / CCL19 interaction. The first part describes the discovery of a six AA short CCL19 binding peptide by systematic fragmentation of the binding epitope of the chemokine-receptor 7 (CCR7). This sequence presents the template for further de novo peptide generation by the simulated molecular evolution algorithm VESPA [301] (Figure 4.13). Directed mutations of this template provide offspring libraries with a relatively narrow sequence space in comparison to all 20^6 possible hexapeptides, considering all 20 natural AAs. It is desirable to constrain the novel peptides to certain properties, e.g. amphipathicity and cationic properties of α-helical antimicrobial peptides [119, 386, 387]. This would facilitate the selection of only a few compounds that are subject to synthesis and testing. However, the absence of ligand or structure information about the investigated PPI does not provide such a rationale to set the mutation-step size parameter σ of VESPA. Therefore, the validity of our consensus model in combination with pharmacophoric similarity (pepCATS distance) as ranking criteria is tested. The computational analysis is accompanied by the selection, synthesis and prospective testing of the binding affinity of twelve de novo generated offspring. The workflow can be considered a small-scale virtual screening campaign.[388] The intent of virtual screening is to exclude biologically uninteresting regions of the vast chemical space [389], often by applying physicochemical properties as filter criteria. Virtual screening presents a valuable starting point, in particular for settings where high-throughput screening (HTS) is not feasible for technical or resource reasons.[390]

Figure 4.13: Overview of the conducted workflow in this study. Consensus logD7.4 was used in combination with pharmacophoric similarity (pepCATS) to rationalise offspring-selection for synthesis and testing after their generation with VESPA. The depiction of residue transition probability P(i–>j) in relation to the distance to the parent peptide is adapted from [296].

Fragmentation of CCR7_C24A

Structural information of CCL19 (PDB: 2MP1) was first presented by Veldkamp et al. in 2015.[353] The authors of this publication conducted NMR shift perturbation experiments, titrating the N-terminus CCR7_C24A to CCL19, revealing a binding affinity of Kd = 12 ± 13 µM. The measurable interaction between CCL19 and the receptor N-terminus indicated the two-step two-site binding mode (cf. section 1.5). Mapping N-termini to the CCL19 amino acid residues further revealed putative chemical shift perturbations in the N-loop, the third β-strand and the α-helix.

The 30 AA long N-terminus CCR7_C24A was synthesised and the interaction with labelled CCL19 was confirmed by microscale thermophoresis (MST) according to the descriptions in Chapter 3. Figure 4.14 (Top) displays one exemplary binding curve resulting in Kd = 2.0 ± 1.0 µM. As discussed in chapter 4.1, our lipophilicity models were validated for a set of linear hexapeptides. In order to apply the consensus model to such peptides, CCR7_C24A was shortened. A systematic reductionist approach was chosen which accounted for all AAs: in the first step, CCR7_C24A was fragmented into five 10mers, consecutively overlapping five AAs. Analysis of their binding affinities revealed a steady decrease from the N-terminus to the C-terminus of CCR7_C24A (Kd = 0.8 ± 0.5 µM for CCR7_10.1, increasing to 24.5 ± 6.8 µM for CCR7_10.5, see Figure 4.15 B). In the next step, five hexapeptides, each shifted by one AA, were created starting from CCR7_10.1 and Kd was again measured.

Figure 4.14: Exemplary MST-Curves of CCR7_C24A, CCR7_10.1 and CCR7_6.4. The error bars consider three technical replicates. The complete collection of binding data can be found in section A.5. The final Kd is calculated as the mean ± std of three individual determinations (cf. A.5).

Two of the hexapeptides showed no binding to CCL19. CCR7_6.1 (QDEVTD-NH2) exhibited a higher Kd than its template (1.3 ± 0.7 µM). In contrast, CCR7_6.4 (VTDDYI-NH2) and CCR7_6.5 (TDDYIG-NH2) revealed improved binding affinity to CCL19 (Kd = 0.3 ± 0.1 µM for both) over their predecessor. CCR7_6.4 was chosen as the template for generating in silico libraries because it differed from CCR7_6.5 by the presence of Val and the absence of Gly. Val is clearly assignable to the hydrophobic AAs, while Gly lacks any side chain and thus distinct pharmacophoric features. All analytical and MST data can be found in section A.5. The control assay of CCR7_10.1 and CCR7_6.4 with a his6-peptide showed lower binding affinities (55.6 µM and 10.7 µM, respectively), indicating that both peptides interact with CCL19 itself and not only with its his-tag (Figure A.42).
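The two fragmentation steps correspond to a sliding window over the parent sequence. The following minimal Python sketch illustrates the second step; the helper `fragment` is a hypothetical name, and the 10mer sequence QDEVTDDYIG is an assumption, reassembled from the reported hexapeptides CCR7_6.1, CCR7_6.4 and CCR7_6.5.

```python
def fragment(sequence, length, step):
    """Cut a sequence into windows of `length`, moving `step` residues at a time."""
    last_start = len(sequence) - length
    return [sequence[i:i + length] for i in range(0, last_start + 1, step)]


# 10mer reassembled from the reported hexapeptides (assumption, not quoted data)
ccr7_10_1 = "QDEVTDDYIG"

# Five hexapeptides, each shifted by one AA relative to its predecessor
hexamers = fragment(ccr7_10_1, length=6, step=1)
for idx, seq in enumerate(hexamers, start=1):
    print(f"CCR7_6.{idx}: {seq}-NH2")
```

Called with `length=10` and `step=5` on a 30 AA sequence, the same helper yields five 10mers whose neighbours overlap by five AAs.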

Figure 4.15: Fragmentation of CCR7_C24A and Kd measurements by MST. A: CCR7_C24A was systematically cut into five 10-mers, each overlapping five AAs. The 10-mer with the lowest Kd to CCL19 was then cut into five 6-mers, each shifted by one AA. B: Kd (mean ± std of three individual experiments) of CCR7_C24A, all fragments thereof and the negative sample GLPVVVKL-NH2. CCR7_6.4 (blue) serves as the template for de novo peptide library design. *The negative sample did not meet the S/N ratio and signal amplitude criteria for fitting (cf. section 3.1). Applying the fit anyway results in Kd > 500 µM.

De Novo Peptide Generation by Simulated Molecular Evolution and Ranking

Starting from the template CCR7_6.4 (VTDDYI-NH2), VESPA was employed to create six in silico offspring libraries with σ ∈ {0.075, 0.1, 0.125, 0.15, 0.2, 0.25}, each containing 1000 peptides. The parameter σ controls the mutational flexibility from the template AA at a certain position within the sequence to another AA at the same position (cf. section 1.4). Larger σ values increase the probability of mutations that are more distant from the template with regard to the underlying Grantham matrix.[302] Lipophilic (logD7.4) and pharmacophoric similarity (pepCATS [226]) as well as the Shannon information content[391] were computed for each library. The latter showed that sequence diversity increased with increasing σ values (Figure 4.16 A). Assuming an equal distribution of all 20 natural AAs, the maximal information content for this alphabet is log2(20) = 4.32 bit at each position, summing up to a maximum of 25.92 bit for a library of hexapeptides. The bit sums of the libraries σ = 0.2 and σ = 0.25 (23.9 and 25.3) were close to the maximum. Considering additional redundancy, since the AAs will most likely not be equally distributed, these values indicated that higher σ values will not increase diversity further. In accordance with the increasing sequence diversity, broader distance distributions in pepCATS feature space were observed, with the median increasing from 1.26 (σ = 0.075) to 2.59 (σ = 0.25). The increased sequence diversity correlated with larger predicted logD7.4 ranges, while the median of this property stayed at ≈ -1.

This value was close to the consensus value of the template VTDDYI-NH2 (logD7.4 = -0.71). The libraries with σ ∈ {0.075, 0.1} incorporated offspring having marginal or no changes from the template (632 duplicates for σ = 0.075 and 227 for σ = 0.1). For further investigation, the library σ = 0.15 was chosen as a trade-off between "conservativeness" and "randomness". A closer look at its information content showed that mutations in positions three and four (both Asp in the template) were less diverse than in the other positions (Figure 4.16 A). In consequence, this acidic patch occurred more frequently in this library than in libraries with higher σ, where the gap to all other AAs can be overcome.

The scatter plot in Figure 4.16 B shows that lipophilic and pharmacophoric similarity do not correlate. To investigate whether similarity to the template in terms of both features leads to CCL19-binding hexapeptides, the subset "Ranking A" (N=68) was defined. This set incorporated peptides for which ∆logD7.4 and d(pepCATS) to the template were < 1st percentile of the entire distribution of the respective property in the σ = 0.15 library. In order to validate logD7.4 as a necessary co-factor, a second subset "Ranking B" (N=49) was also created. Ranking B incorporated peptides with d(pepCATS) < 1st percentile but ∆logD7.4 > 3rd percentile. The differences between both subsets were scrutinised on the sequence level by assigning each AA to polar (P), hydrophobic (H), aromatic (R), basic (B) or acidic (A) properties. The resulting "pattern" was H-P-A-A-R-H for the template. In Ranking A, 38% of the peptides preserved this distinct pattern or deviated in one position, 31% deviated in two and 31% in three or more positions. Although Ranking B comprised the same distribution of pharmacophoric similarity, the logD7.4 deviation resulted in a high fraction of peptides deviating in three or more positions from the pattern (76%). Only 2% deviated at fewer than two positions and no peptide inherited the original pattern. Since the pattern preservation was pronounced in Ranking A, the six randomly chosen peptides for testing deviated in at most one position. Six peptides from Ranking B were likewise randomly chosen, here with no constraint on the pattern. All twelve peptides were synthesised and tested in the same way as the CCR7 fragments.
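The selection of the two subsets can be sketched as a simple threshold filter. The Python example below is an illustration under stated assumptions and not the code used in this study: the function names, the linear-interpolation percentile and the toy values are hypothetical, and the cut-offs `q_low` and `q_high` are deliberately left as parameters (the values passed below are arbitrary placeholders for the toy data).

```python
def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def select_rankings(delta_logd, d_pepcats, q_low, q_high):
    """Return the peptide indices of (ranking_a, ranking_b)."""
    logd_low = percentile(delta_logd, q_low)
    logd_high = percentile(delta_logd, q_high)
    cats_low = percentile(d_pepcats, q_low)
    ranking_a, ranking_b = [], []
    for i, (dl, dc) in enumerate(zip(delta_logd, d_pepcats)):
        if dc < cats_low and dl < logd_low:
            ranking_a.append(i)   # similar pharmacophores and similar logD7.4
        elif dc < cats_low and dl > logd_high:
            ranking_b.append(i)   # similar pharmacophores, dissimilar logD7.4
    return ranking_a, ranking_b


# Hypothetical distances of eight offspring to the template (not thesis data)
delta_logd = [0.05, 0.9, 0.4, 1.2, 0.1, 0.6, 1.5, 0.3]
d_pepcats = [0.5, 0.6, 2.0, 0.4, 2.5, 1.8, 3.0, 1.2]
ranking_a, ranking_b = select_rankings(delta_logd, d_pepcats, q_low=25, q_high=75)
print("Ranking A:", ranking_a, "Ranking B:", ranking_b)
```

Both subsets share the pharmacophoric criterion and differ only in the lipophilicity criterion, which is what allows logD7.4 to be examined as a co-factor.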

Figure 4.16: Properties of de novo generated libraries with varying σ values starting from the template CCR7_6.4 (VTDDYI-NH2) and ranking of library σ = 0.15. A: Six libraries, each containing 1000 offspring peptides, were generated with σ between 0.075 and 0.25. The boxplots visualise the predicted logD7.4 and pepCATS distance distributions from the offspring to the template (Tukey boxplot: median, 25th and 75th percentile, 1.5 interquartile range). Sequence diversity is quantified as the Shannon information content for positions 1-6 of all peptides in each library (maximal information for a set of hexapeptides = 25.92 bit). The library created with σ = 0.15 (blue) was chosen for further analysis. B: The marked regions in the scatter plot show the peptides which lie within the defined similarity ranges. C: Property distributions of “Ranking A” and “Ranking B”.
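For illustration, the positional Shannon information content of a library can be computed as sketched below. This is a minimal Python example and not the original implementation; the function name `positional_information` is hypothetical, and the four-peptide toy library merely reuses the template and three offspring sequences reported in Table 4.5.

```python
import math
from collections import Counter


def positional_information(peptides):
    """Shannon entropy (bit) of the residue distribution at each position."""
    length = len(peptides[0])
    entropies = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        total = sum(counts.values())
        h = sum(-(c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


# Toy library (sequences taken from Table 4.5, far smaller than a real library)
library = ["VTDDYI", "VTDDWM", "YTDDYI", "LTDNYI"]
per_position = positional_information(library)
bit_sum = sum(per_position)

# Upper bound for hexapeptides if all 20 natural AAs were equally frequent
max_bits = 6 * math.log2(20)
print([round(h, 2) for h in per_position], round(bit_sum, 2), round(max_bits, 2))
```

Summing the per-position values gives the bit sum of a library; a conserved position contributes 0 bit, whereas a position with equally distributed AAs contributes log2(20) bit.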

Binding Affinities of Selected Offsprings

The offspring 432 and 211 from Ranking A inherited the original pattern and bind to CCL19 (5.9 ± 2.2 µM and 0.6 ± 0.7 µM) (Table 4.5). Offspring 432 exhibited an ≈ 20 times higher Kd in comparison to the template. This peptide also showed the largest deviation of all binders on the AA level (5 positions). The other four tested offspring deviated from the pattern at one position. Of these, offspring 865 and 161 showed nearly identical affinity to CCL19 as the template. Both peptides for which no binding occurred possessed pattern changes in position two. Offspring 191 mutates from a polar residue to Gly, which lacks any side-chain pharmacophoric feature. Offspring 420 changes to Val (hydrophobic). On the AA level, the acidic patch in the template centre (Asp-Asp) was substituted with one or two Glu within these offspring, extending the side chains by one methylene group.

In Ranking B, five peptides deviated in two or more positions from the template pattern. Consequently, the sum of the edit distances on the pattern level rose (from 4 in Ranking A to 14). Binding to CCL19 was not observed in this set, with the exception of offspring 364. Apparently, the two negative charges of the acidic AAs in the centre followed by an aromatic AA, which were preserved in offspring 364, promote the interaction. For all binders, control assays with the his6 peptide were conducted. No offspring exhibited binding in this setup (Figure A.42).

Discussion

The presented findings demonstrate that systematic fragmentation of the binding epitope of one PPI partner leads to small peptides that incorporate the important binding residues and therefore retain affinity to the target. Further, the exploration of de novo generated libraries resulted in the discovery of five novel binders and in insight into the interaction between CCL19 and the N-terminus of CCR7. We chose this approach over alanine scanning, which can be used to determine changes in the stability and function of peptides and proteins [392], but also to investigate the binding contribution of specific residues.[393–395] The latter is particularly useful if "core positions" in a given sequence are already known, avoiding exhaustive exchange to Ala in each position.[396] Yet, we lacked such knowledge about the CCL19/CCR7 interaction. Another time- and resource-saving option is in silico alanine scanning coupled with quantitative predictions from molecular dynamics free energy simulations.[397, 398] However, this would not consider the desired peptide-length reduction, and the approach is hindered by the lack of structural information. In this case, the effort is out of proportion to simply synthesising the linear peptides in an automated fashion and validating them experimentally.

CCR7_10.1 exhibited the highest fraction of acidic AAs (40%) and showed the lowest Kd (0.8 ± 0.5 µM) in the set of 10mers. This observation supports the hypothesis that the initiation of the CCL19 and CCR7 interaction primarily involves charge-dependent interactions between acidic extracellular domains of the receptor and positively charged basic residues at the chemokine core domain (cf. section 1.5). The fragmentation of CCR7_10.1 provides more detail about the core AAs involved in this interaction: CCR7_6.4 and CCR7_6.5, both containing the sequence Asp-Asp-Tyr-Ile, exhibit the highest binding affinity. Apparently, this pattern is important since CCR7_6.3, missing Ile and having Asp-Asp-Tyr shifted from the centre to the sequence end, does not bind.

Table 4.5: Selected offspring from the focused libraries Ranking A and Ranking B that have been synthesised and tested. Kd is reported as the mean ± std of three individual experiments. Red amino acids indicate deviation from the template’s sequence pattern (H-P-A-A-R-H) at the respective position. Edit Distance [AA] counts one for each AA that does not correspond to the AA in the respective template position; Edit Distance [Pattern] counts one for each AA that does not correspond to the AA property assignment in the respective position. Binding affinity is not reported (n.a) for ligands that did not fulfil the criteria for applying the Kd model (cf. section 3.1).

Library    Peptide        Sequence    Edit Distance [AA]  Edit Distance [Pattern]  d(pepCATS)  ∆logD7.4  Kd [µM]
Template   CCR7_6.4       VTDDYI-NH2  -                   -                        -           -         0.3 ± 0.1
Ranking A  Offspring 432  PQEDFV-NH2  5                   0                        1.11        0.14      5.9 ± 2.2
Ranking A  Offspring 211  VTDDWM-NH2  2                   0                        1.29        0.01      0.6 ± 0.7
Ranking A  Offspring 865  YTDDYI-NH2  1                   1                        1.68        0.15      0.4 ± 0.3
Ranking A  Offspring 161  LTDNYI-NH2  2                   1                        0.86        0.19      0.2 ± 0.1
Ranking A  Offspring 191  IGEEFL-NH2  6                   1                        1.44        0.02      n.a
Ranking A  Offspring 420  VVEDFL-NH2  4                   1                        1.66        0.05      n.a
Ranking A  Σ                          20                  4                        8.04        0.56
Ranking B  Offspring 364  AEDDYA-NH2  3                   3                        0.71        0.67      0.7 ± 0.1
Ranking B  Offspring 876  MSDSVV-NH2  5                   2                        1.23        1.05      n.a
Ranking B  Offspring 944  ISDNMA-NH2  5                   2                        1.19        0.90      n.a
Ranking B  Offspring 256  VQSDVA-NH2  5                   3                        1.32        0.88      n.a
Ranking B  Offspring 82   WQDDFL-NH2  4                   1                        1.45        0.89      n.a
Ranking B  Offspring 7    VSNEHM-NH2  5                   3                        1.68        1.00      n.a
Ranking B  Σ                          27                  14                       7.58        5.39
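Both edit distances in Table 4.5 are position-wise mismatch counts. The sketch below shows this in Python for the template and the four binders of Ranking A. The helper names are hypothetical, and the assignment of individual AAs to the five property classes is an assumption made for illustration; AAs without an assigned class in this sketch, such as Gly, are mapped to '-'.

```python
# Property classes; which AA belongs to which class is an assumption here
PROPERTY_CLASS = {
    "H": "VILMP",  # hydrophobic
    "P": "STNQ",   # polar
    "R": "FYW",    # aromatic
    "B": "KRH",    # basic
    "A": "DE",     # acidic
}
AA_TO_CLASS = {aa: cls for cls, aas in PROPERTY_CLASS.items() for aa in aas}


def pattern(sequence):
    """Translate a sequence into its property pattern ('-' if unassigned)."""
    return "".join(AA_TO_CLASS.get(aa, "-") for aa in sequence)


def edit_distance(a, b):
    """Count the positions at which two equally long strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


template = "VTDDYI"
for offspring in ("PQEDFV", "VTDDWM", "YTDDYI", "LTDNYI"):
    d_aa = edit_distance(template, offspring)
    d_pattern = edit_distance(pattern(template), pattern(offspring))
    print(offspring, pattern(offspring), d_aa, d_pattern)
```

Offspring 432 (PQEDFV) illustrates why both measures are reported: it differs from the template in five AAs but still carries the unchanged pattern.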

The application of an evolutionary algorithm was chosen to elaborate an exploratory design strategy. The creation of peptides similar to the template, for instance by substituting just one AA at a time, can be done manually and is easy to rationalise. However, defining how more dissimilar peptides are taken into account is challenging. VESPA's purpose in this study is to provide the rationale: Diverse peptide sets are created, but all objects originate from only one template. The subsequent choice of library σ = 0.15 is advocated as a trade-off between "conservativeness" (marginal changes, duplicates) and "randomness" (close to maximal Shannon entropy). The downside is that experimental validation cannot cope with the library size (1000 peptides) we chose for generating a diverse set of peptides. As mentioned in the introduction to this study, one needs further criteria to select actual peptides for synthesis and testing.

LogD7.4 is a global parameter. Thus, similar lipophilicity to the template also occurred for peptides with maximal edit distances, indicating that logD7.4 cannot be expected to correlate directly with binding affinity. Still, it can be considered a valuable parameter in binding studies, since the combination with pharmacophoric similarity led to the discovery of the preserved template pattern. The fact that four of the six peptides from Ranking A and just one of the six peptides from Ranking B bind CCL19 emphasises (i) the validity of the approach and (ii) that preserving the template pattern facilitates the creation of novel peptides which are likely to retain target affinity. The presented templates provide a starting point for further investigation. Further research should consider the analysis of the active conformation and the projection of side-chain functionalities in analogy to the peptide secondary structure. This may lead to stabilised peptide-mimetics by the approaches discussed in section 1.2.[316]


5 Conclusions and Outlook

Among the many computationally accessible properties, lipophilicity is of particular interest. Most in silico models appeared at a time when drug discovery was primarily driven by “classic” medicinal chemistry and the development of small molecular pharmaceutical agents. Today, however, larger and structurally more complex molecules are considered for expanding the druggable chemical space. Peptides and derivatives thereof are such molecules, which fill the gap between small molecular agents and biomacromolecules like antibodies, proteins, or DNA/RNA based medicines. Naturally, these compounds exhibit different properties and hence populate regions of chemical space different from those of small molecules. Consequently, it cannot be expected that QSPR models based on knowledge about small molecules automatically work for these intermediates. Therefore, to advance the field of peptide research, I developed bespoke lipophilicity models at physiological pH. The preceding data compilation pointed to a lack of experimental data, which was addressed by adequate feature selection and methods for dimensionality reduction.

Supervised regularised regression (Lasso) and principal component analysis provided low-dimensional feature sets, which facilitated robust modelling for short, linear peptides up to a length of five AAs. The Lasso-derived feature set is interpretable and the selected features were found to be physically linked to the partition phenomenon (Table A.1). The QSPR models developed in this thesis do not consider 3D molecular information at this point, but its inclusion would be a meaningful extension in future studies. The conformation of non-rigid structures can change depending on the solvent polarity, leading to different exposed surface characteristics.[40, 110] Capturing this information may help to describe the affinity of a peptide to the two immiscible phases. An example of employing molecular dynamics-based features for lipophilicity prediction can already be found in the literature.[97] Other recent work employs the quantum chemistry based equilibrium thermodynamics method COSMO-RS, which calculates chemical potentials of each species in solution from surface charge densities.[399] Yet, these approaches have not moved beyond proof-of-concept studies and have thus not provided an applicable model for real-world examples. In this context, peptides present particularly challenging examples because of their flexible nature. Recently, Zhang et al. took a first step in this direction by analysing different force fields to compute solvation free energies in octanol and water. These energies were used to predict logP of neutral AA side chain analogues with an RMSE between 0.4 ± 0.1 and 1.3 ± 0.2, depending on the force field.[400] The choice of 2D features for our purpose was justified by the excellent performance of both SVR(Lasso) and SVR(PCA) on the LIPOPEP dataset. Encouraged by these results, we showcased the prospective model application for a set of hexapeptides.
For that purpose, we used three model peptides to set up the shake-flask method. The quantification of the analyte in one phase before and after the shake procedure by HPLC-UV led to logD7.4 determinations with standard deviations of at most 0.3 log units. In particular, the SVR(Lasso) model revealed a high accuracy in the predictions for the hexapeptides (RMSE of 0.47 and 67% of the predictions having an absolute error < 0.5 log units). Here, it must be kept in mind that logD7.4 was calculated considering the UV detector calibration based on the nominal peptide concentration. However, our group recently demonstrated that pre-analytical pitfalls, such as salt formation, weighing errors and solubility issues, can lead to remarkable differences between the indicated and actual peptide concentrations in solution.[401] Therefore, we advocate that a fraction of the peptide solutions for future shake-flask experiments should be sampled and undergo the proposed hydrolysis procedure with a subsequent quantification of the aromatic AAs.
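The two performance measures quoted above can be written down in a few lines. The following Python sketch assumes that accuracy denotes the percentage of predictions with an absolute error below 0.5 log units; the function names and the example values are hypothetical and are not data from this thesis.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(observed, predicted, tolerance=0.5):
    """Percentage of predictions with an absolute error below `tolerance`."""
    hits = sum(abs(o - p) < tolerance for o, p in zip(observed, predicted))
    return 100.0 * hits / len(observed)


# Hypothetical logD7.4 values (illustration only)
observed = [-1.0, 0.2, 1.5, -0.4]
predicted = [-0.7, 0.9, 1.4, -0.5]
print(f"RMSE = {rmse(observed, predicted):.2f}, "
      f"Accuracy [%] = {accuracy(observed, predicted):.1f}")
```

Reporting both values is informative because the RMSE is dominated by a few large errors, whereas the accuracy reflects how many predictions fall within a practically acceptable error margin.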

When trying to predict logD7.4 for peptide mimetics from AstraZeneca, a low generalisability of the baseline models, which had been trained on the LIPOPEP set, towards these compounds was observed (Figure 4.7). The differently populated regions of chemical space of both datasets and the shift to hydrophobic compounds in the AZ set explain this observation. These findings support our emphasis on the characteristics of the underlying training data in terms of model applicability. However, the failure was not a consequence of features or a model architecture chosen too specifically for LIPOPEP, since the performance was reasonable when the models were trained on AZ data. Subsequent stratified pooling of both datasets resulted in practically applicable lipophilicity models for both peptides and peptide mimetics in the logD7.4 range of -3.05 to 5.08. The utilisation of AZ data with various non-natural chemical structures thereby accounts for pharmaceutically interesting peptide-derived compounds. The practical relevance of our models is further emphasised by the fact that 80% of the approved peptide drugs from 2012 - 2016 contain only two to ten amino acid residues and 29% of peptides in clinical development are shorter than ten AAs.[110, 111] In a subsequent benchmark study, we tested our consensus model against three commercial, broadly applied models. For peptides and their synthetic derivatives, these commercial tools revealed weaknesses, corroborating the need for dedicated logD7.4 predictors for this compound class. The two fragment-based models exhibited the strongest correlation between prediction error and increasing molecular weight. These findings emphasised that the working hypothesis of lipophilicity being an additive property is limited. Yet, the presented analysis does not promote a sharp definition of a reliable weight range of query compounds.
In addition, molecular weight as a representative of size and complexity is not the only limiting factor. The presence of charges also impacted the prediction error, while the consensus model was again less influenced than the others. Although our model was developed for peptides, it provided very accurate results for the left-out fraction of the s-Mol dataset (RMSE = 0.69, Accuracy [%] = 57.7). This outcome again confirmed its broad applicability, provided that similar data are available for training. One downside is the limitation of modelling logD solely at pH 7.4. In comparison, the commercial models calculate logDpH for the entire pH range as a function of logP and pKa. However, many pharmaceutically interesting partition events occur at physiological pH. Thus, the majority of logDpH data is reported between pH 7.0 and 7.4. In the author’s opinion, future QSPR modelling should focus more on solutions which can be reliably applied to compound classes that are currently neglected. This approach includes the assessment of the prediction certainty for such structures. Here, this demand was met by novelty detection, based on input feature distances between query compounds and the training set.
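The dependence of logDpH on logP and pKa mentioned above can be illustrated with the textbook relation for a monoprotic compound. The Python sketch below assumes that only the neutral species partitions into octanol; it is not the algorithm of any of the commercial tools, it does not cover polyprotic or zwitterionic peptides, and the function name and example values are hypothetical.

```python
import math


def logd_monoprotic(logp, pka, ph, acid=True):
    """logD of a monoprotic acid or base; only the neutral form enters octanol."""
    # Acid: ionised fraction grows with pH; base: ionised fraction grows below pKa
    exponent = (ph - pka) if acid else (pka - ph)
    return logp - math.log10(1.0 + 10.0 ** exponent)


# Hypothetical carboxylic acid with logP = 2.0 and pKa = 4.4
for ph in (2.0, 4.4, 7.4):
    print(f"pH {ph}: logD = {logd_monoprotic(2.0, 4.4, ph):.2f}")
```

At pH values far on the neutral side of the pKa, logD approaches logP, while each pH unit on the ionised side lowers logD by roughly one log unit under this simplification.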

The final part of this thesis aimed to deepen the understanding of chemokine interactions with their respective receptors. Prior research in this area provided evidence for an initial attachment of chemokines to the receptor N-terminus. We were able to confirm the interaction between the N-terminus of CCR7 and the endogenous ligand CCL19 (Figure 4.15). Further, this extracellularly located, unstructured part of CCR7 was systematically fragmented. The identification of a six AA long CCL19-binding sequence thereof validated the focus on PPI binding epitopes as starting points for the development of peptidic modulators. This approach is not restricted to chemokines but presents an example that can be adapted for the investigation of other PPIs as well.

Kd was determined using MST, a biophysical binding assay that has, to the best of our knowledge, not been used to study chemokine interactions before. The advantages of MST are the relatively short preparation time and low material consumption. Future work needs to establish an orthogonal binding assay to confirm the MST results. Further, it would be advisable to design a phenotypic assay for scrutinising the potential of peptidic modulators to hinder the chemokine-induced migration of CCR7-expressing cells.

With the discovered hexapeptide as a starting template at hand, we hypothesised that logDpH is a valid parameter to guide further de novo peptide design into a preferred sequence space. In combination with pharmacophoric similarity (pepCATS), logDpH was the rationale to select peptides from in silico libraries created by EA. This approach revealed five unique hexapeptides which bind to CCL19 in the low micromolar range (cf. Table 4.5). Hereby, the impact of the template sequence pattern, in terms of side chain pharmacophores, on retaining target affinity was discovered. The identified CCL19-binders are pharmaceutically interesting because they may be able to saturate the chemokine in proximity to CCR7 over-expressing cancer cells, hindering lymph-node metastasis. However, one must think about a targeted therapy because it is undesirable to systemically interfere with CCL19-promoted chemotaxis. Another conceivable application would be to utilise these peptides as carriers to direct their cargo into compartments with high concentrations of CCL19, such as secondary lymphoid organs. Further development of the proposed sequences should focus on overcoming the disadvantages of short, linear peptides, such as rapid degradation and unfavourable PK properties. A reasonable attempt would be to investigate the bioactive conformation and constrain the peptide to it, for example by olefin metathesis as presented by Glas et al.[144] Introducing rigidity by cyclisation can lead to stabilised compounds with increased binding affinity through minimising the entropic penalty. Our findings also motivate further investigations on CCL21, since this chemokine shares the same receptor as CCL19. One can further think of searching for peptidic modulators by looking at the chemokine binding epitope. Such derived peptides could bind the receptor N-terminus, leading to more specific prevention of the trafficking of CCR7 over-expressing cells. However, violating the conserved tertiary chemokine structure is likely to limit this approach, since the literature gives evidence that the interplay between the N-loop and the third β-strand is crucial for binding.[353]

The main part of this work addressed the need for bespoke lipophilicity models for peptides and peptide-mimetics by employing machine learning methods. Future studies need to evaluate the applicability of logDpH as a valid parameter for PK optimisation and library screening for these compounds. The chemokine study demonstrated the suitability of logDpH as a part of a multi-parametric guide for de novo peptide design. Future research could translate the discovered hexapeptides into pharmaceutically relevant compounds which modulate the interaction between CCR7 and its endogenous ligands. The presented computer-assisted strategies for property prediction and de novo design save time and resources and complement the chemist’s expertise in a data-driven fashion. In times of increasing demand for process excellence and innovation in the pharmaceutical industry, this amalgamation of man and machine is a conceivable approach to efficiently discover future medicines.

6 Acknowledgements

The presented work is the product of three challenging years, both for my professional and personal development. Certainly, it would not exist without the aid and support of so many great colleagues, friends and family members who have crossed and accompanied my path.

In the first place, my deepest thanks and appreciation belong to my supervisor Prof. Dr. Gisbert Schneider. Your dedication and visionary ideas about the future of health science encouraged me to familiarise myself with the fields of chemoinformatics and machine learning and to work hard to reach my goals. Your decision to give me all the freedom to conduct my research in your group at ETH Zurich certainly marked a major event in my professional life. I wish for every young scientist to meet a mentor as inspiring as you are.

Special gratitude goes to Dr. Irmgard Werner, who gave me the opportunity to join ETH Zurich in the first place, and to Dr. Jan Hiss for always providing valuable feedback, in particular on my manuscripts and talks, and for his care of the laboratory equipment. I also want to thank Dr. Christian Steuer for his steady support in analytical questions and the excellent coordination and teamwork in the peptide quantification project. Further, I would like to thank Prof. Dr. Stefanie Krämer and Prof. Dr. Gunnar Jeschke for taking the effort to co-examine this thesis.

A special mention is devoted to Dr. Arndt Finkelmann, Dr. Francesca Grisoni, Dr. Alex Müller and Ryan Byrne for introducing me to Python, setting up the computational environment and fruitful discussions and lessons on concepts in chemoinformatics. Cyrill Brunner is thanked for introducing me to the microscale thermophoresis technique and his valuable collaboration in the chemokine project. I am also grateful to the other PhD fellows and "modlabians" over the past years: Alex Button, Berend Huisman, Erik Gawehn, Dominique Bruns, Gisela Gabernet, "JJ" Zhang, Claudia Neuhaus, Lukas Friedrich, Dr. Christoph Bauer and Dr. Daniel Merk. It was fantastic having the opportunity to work and discuss, celebrate ups and suffer downs with you all. I will certainly miss the matchless environment and spirit of this team.

I would also like to thank Dr. Michael Kossenjans from AstraZeneca, who, together with Gisbert, made the effort to enable me to use their data for my lipophilicity model. My gratitude belongs to Sarah Haller, Ruth Alder, Danielle Lüthi, Dr. Petra Schneider and Chrissula Chatzidis for assistance in technical and organisational matters in and out of the laboratory. Anvita Gupta is thanked for giving me insight into the topic of generative modelling. I am also grateful to many of the named persons for proofreading this thesis.

Apart from work, I owe gratitude to my closest friends: my brothers from other mothers (Andreas, Andreas, Martin, Florian), my cousin Dominik, Marieke and my dear fellows from the good old days in school. I can still count on all your support and friendship in every situation.

I can doubtlessly call myself lucky since I am accompanied by such a strong, supportive family. My special gratitude goes to my grandparents, to Birgit, Peter, Miriam, Ilka, Jana, Stefan and my little niece Pia, who teaches me how to discover the world and to look at the miracle of life through a child's eyes.

The last acknowledgement is reserved for the persons who mean the most to me. Without you, Lisa, Jasmin, Gabi and Wolfgang, and all your love and support, I would not possess the strength and passion to tackle the challenges in my professional and personal life and stand where I am.

Bibliography

[1] L. Di, E. H. Kerns, Drug-Like Properties, Academic Press, Cambridge, 2015.
[2] S. Mignani et al., “Present drug-likeness filters in medicinal chemistry during the hit and lead optimization process: how far can they be simplified?”, Drug Discovery Today 2018, 23, 605–615.
[3] L. Di, E. H. Kerns, G. T. Carter, “Drug-like property concepts in pharmaceutical design”, Current Pharmaceutical Design 2009, 15, 2184–2194.
[4] O. Ursu, A. Rayan, A. Goldblum, T. I. Oprea, “Understanding drug-likeness”, Wiley Interdisciplinary Reviews: Computational Molecular Science 2011, 1, 760–781.
[5] C. A. Lipinski, F. Lombardo, B. W. Dominy, P. J. Feeney, “Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings”, Advanced Drug Delivery Reviews 1997, 23, 3–25.
[6] O. Roche et al., “Development of a Virtual Screening Method for Identification of “Frequent Hitters” in Compound Libraries”, Journal of Medicinal Chemistry 2002, 45, 137–142.
[7] G. Schneider, “Automating drug discovery”, Nature Reviews Drug Discovery 2018, 17, 97–113.
[8] R. J. Young, Physical properties in drug design, Springer-Verlag, Berlin, 2014, pp. 1–68.
[9] M. J. Waring, “Lipophilicity in drug discovery”, Expert Opinion on Drug Discovery 2010, 5, 235–248.
[10] R. J. Young, D. V. S. Green, C. N. Luscombe, A. P. Hill, “Getting physical in drug discovery II: The impact of chromatographic hydrophobicity measurements and aromaticity”, Drug Discovery Today 2011, 16, 822–830.
[11] J. A. Arnott, S. L. Planey, “The influence of lipophilicity in drug discovery and design”, Expert Opinion on Drug Discovery 2012, 7, 863–875.
[12] G. Camenisch, J. Alsenz, H. van de Waterbeemd, G. Folkers, “Estimation of permeability by passive diffusion through Caco-2 cell monolayers using the drugs’ lipophilicity and molecular weight”, European Journal of Pharmaceutical Sciences 1998, 6, 313–319.
[13] K. Hosoya, A. Yamamoto, S.-i. Akanuma, M. Tachikawa, “Lipophilicity and Transporter Influence on Blood-Retinal Barrier Permeability: A Comparison with Blood-Brain Barrier Permeability”, Pharmaceutical Research 2010, 27, 2715–2724.
[14] X. Liu, B. Testa, A. Fahr, “Lipophilicity and Its Relationship with Passive Drug Permeation”, Pharmaceutical Research 2010, 28, 962–977.
[15] K. Valkó, S. Nunhuck, C. Bevan, M. H. Abraham, D. P. Reynolds, “Fast gradient HPLC method to determine compounds binding to human serum albumin. Relationships with octanol/water and immobilized artificial membrane lipophilicity”, Journal of Pharmaceutical Sciences 2003, 92, 2236–2248.
[16] F. Lombardo, R. Scott Obach, M. Y. Shalaeva, F. Gao, “Prediction of Volume of Distribution Values in Humans for Neutral and Basic Drugs Using Physicochemical Measurements and Plasma Protein Binding Data”, Journal of Medicinal Chemistry 2002, 45, 2867–2876.
[17] J. C. Madden, M. Cronin, “Structure-based methods for the prediction of drug metabolism”, Expert Opinion on Drug Metabolism & Toxicology 2006, 2, 545–557.
[18] M. P. Gleeson, “Generation of a Set of Simple, Interpretable ADMET Rules of Thumb”, Journal of Medicinal Chemistry 2008, 51, 817–834.

[19] H. van de Waterbeemd, D. A. Smith, K. Beaumont, D. K. Walker, “Property- Based Design: Optimization of Drug Absorption and Pharmacokinetics”, Jour- nal of Medicinal Chemistry 2001, 44, 1313–1333. [20] M. V. S. Varma et al., “Physicochemical Determinants of Human Renal Clear- ance”, Journal of Medicinal Chemistry 2009, 52, 4844–4852. [21] A. Sarkar, G. E. Kellogg, “Hydrophobicity - Shake Flasks, Protein Folding and Drug Discovery”, Current Topics in Medicinal Chemistry 2010, 10, 67–83. [22] P. W. Snyder et al., “Mechanism of the hydrophobic effect in the biomolecular recognition of arylsulfonamides by carbonic anhydrase.” Proceedings of the Na- tional Academy of Sciences of the United States of America 2011, 108, 17889–17894. [23] A. L. Hopkins, “Network pharmacology: the next paradigm in drug discov- ery”, Nature Chemical Biology 2008, 4, 682–690. [24] J.-U. Peters, P. Schnider, P. Mattei, M. Kansy, “Pharmacological Promiscuity: Dependence on Compound Properties and Target Specificity in a Set of Recent Roche Compounds”, ChemMedChem 2009, 4, 680–686. [25] Á. Tarcsay, G. M. Keser˝u,“Contributions of Molecular Properties to Drug Promis- cuity”, Journal of Medicinal Chemistry 2013, 56, 1789–1795. [26] M. J. Waring, C. Johnstone, “A quantitative assessment of hERG liability as a function of lipophilicity”, Bioorganic & Medicinal Chemistry Letters 2007, 17, 1759–1764. [27] U. M. Hanumegowda, G. Wenke, A. Regueiro-Ren, R. Yordanova, J. P. Corradi, S. P. Adams, “Phospholipidosis as a Function of Basicity, Lipophilicity, and Volume of Distribution of Compounds”, Chemical Research in Toxicology 2010, 23, 749–755. [28] M. Chen, J. Borlak, W. Tong, “High lipophilicity and high daily dose of oral medications are associated with significant risk for drug-induced liver injury”, Hepatology 2013, 58, 388–396. [29] I. Kola, J. Landis, “Can the pharmaceutical industry reduce attrition rates?”, Nature Reviews Drug Discovery 2004, 3, 711–716. [30] H. 
Meyer, “Zur Theorie der Alkoholnarkose”, Archiv für Experimentelle Patholo- gie und Pharmakologie 1899, 42, 109–118. [31] E. Overton, Studien über die Narkose, Fischer, Jena, 1901. [32] T. Fujita, J. Iwasa, C. Hansch, “A new substituent constant, π, derived from partition coefficients”, Journal of the American Chemical Society 1964, 86, 5175– 5180. [33] G. M. Keser˝u,G. M. Makara, “The influence of lead discovery strategies on the properties of drug candidates”, Nature Reviews Drug Discovery 2009, 8, 203–212. [34] Á. Tarcsay, G. M. Keser˝u,“Is there a link between selectivity and binding ther- modynamics profiles?”, Drug Discovery Today 2015, 20, 86–94. [35] M. M. Hann, “Molecular obesity, potency and other addictions in drug discov- ery”, MedChemComm 2011, 2, 349–355. [36] A. L. Hopkins, G. M. Keser˝u,P. D. Leeson, D. C. Rees, C. H. Reynolds, “The role of ligand efficiency metrics in drug discovery”, Nature Reviews Drug Discovery 2014, 13, 105–121. [37] Leeson, Paul D, Springthorpe, Brian, “The influence of drug-like concepts on decision-making in medicinal chemistry”, Nature Reviews Drug Discovery 2007, 6, 881–890. [38] Bayliss, Martin K et al., “Quality guidelines for oral drug candidates: dose, solubility and lipophilicity”, Drug Discovery Today 2016, 21, 1719–1727. [39] T. H. Keller, A. Pichota, Z. Yin, “A practical view of ‘druggability’”, Current Opinion in Chemical Biology 2006, 10, 357–361. Bibliography 105

[40] A. Whitty, M. Zhong, L. Viarengo, D. Beglov, D. R. Hall, S. Vajda, “Quantifying the chameleonic properties of macrocycles and other high-molecular-weight drugs”, Drug Discovery Today 2016, 21, 712–717. [41] A. L. Hopkins, C. R. Groom, “The druggable genome”, Nature Reviews Drug Discovery 2002, 1, 727–730. [42] A. D. McNaught, A. Wilkinson, Compendium of Chemical Terminology, Wiley- Blackwell, 1997. [43] M. Berthelot, E. Jungfleisch, “On the laws that operate for the partition of a substance between two solvents”, Annales de Chemie et de Physique 1872, 26, 396– 407. [44] F. Gobas, D. Mackay, W. Y. Shiu, J. M. Lahittete, G. Garofalo, “A novel method for measuring membrane-water partition coefficients of hydrophobic organic chemicals: Comparison with 1-octanol-water partitioning”, Journal of Pharma- ceutical Sciences 1988, 77, 265–272. [45] C. Hansch, P. P. Maloney, T. Fujita, R. M. Muir, “Correlation of biological ac- tivity of phenoxyacetic acids with Hammett substituent constants and partition coefficients”, Nature 1962, 194, 178–180. [46] R. A. Scherrer, S. M. Howard, “Use of distribution coefficients in quantitative structure-activity relations”, Journal of Medicinal Chemistry 1977, 20, 53–58. [47] M. Kah, C. D. Brown, “Log D: Lipophilicity for ionisable compounds”, Chemo- sphere 2008, 72, 1401–1408. [48] K. B. Sanjivanjit, K. Karim, I. G. Peirson, G. M. Pearl, “The Rule of Five Re- visited: Applying LogD in Place of LogP in Drug-Likeness Filters”, Molecular Pharmaceutics 2007, 4, 556–560. [49] A. Avdeef, D. A. Barrett, P. N. Shaw, R. D. Knaggs, S. S. Davis, “Octanol-, -, and dipelargonat-water partitioning of - 6-glucoronide and other related opiates”, Journal of Medicinal Chemistry 1996, 39, 4377–4381. [50] V. Pliska, B. Testa, H. van de Waterbeemd, Lipophilicity in Drug Action and Toxi- cology, Wiley-VCH, Weinheim, 1996. [51] C. A. M. Hogben, L. S. Schanker, D. J. Tocco, B. B. Brodie, “Absorption of drugs from the stomach. II. 
The human”, Journal of Pharmacology and Experimental Ther- apeutics 1957, 120, 540–545. [52] T. Herdegen, Kurzlehrbuch Pharmakologie und Toxikologie, Georg Thieme Verlag, Stuttgart, 2010. [53] S. K. Poole, C. F. Poole, “Separation methods for estimating octanol-water partition coefficients”, Journal of Chromatography B: Analytical Technologies in the Biomedical and Life Sciences 2003, 797, 3–19. [54] OECD, Test No. 107: Partition Coefficient (n-octanol/water): Shake Flask Method, 1995. [55] J. De Bruijn, F. Busser, W. Seinen, J. Hermens, “Determination of octanol/water partition coefficients for hydrophobic organic chemicals with the “slow-stirring” method”, Environmental Toxicology and Chemistry 1989, 8, 499–512. [56] W. Schräder, J. T. Andersson, “Fast and direct method for measuring 1-octanol water partition coefficients exemplified for six local anesthetics”, Journal of Phar- maceutical Sciences 2001, 90, 1948–1954. [57] M. C. Wenlock, T. Potter, P. Barton, R. P. Austin, “A method for measuring the lipophilicity of compounds in mixtures of 10”, Journal of Biomolecular Screening 2011, 16, 348–355. [58] E. H. Kerns, “High throughput physicochemical profiling for drug discovery”, Journal of Pharmaceutical Sciences 2001, 90, 1838–1858. 106 Bibliography

[59] L. Hitzel, A. P. Watt, K. L. Locker, “An increased throughput method for the determination of partition coefficients”, Pharmaceutical Research 2000, 17, 1389– 1395. [60] H. Cumming, C. Rücker, “Octanol–Water Partition Coefficient Measurement by a Simple 1H NMR Method”, ACS Omega 2017, 2, 6244–6249. [61] J. Sangster, Octanol-Water Partition Coefficients, John Wiley & Sons, New Jersey, 1997. [62] K. Valko, Handbook of Analytical Separations. Chapter 12: Measurements of physical properties for drug design in industry, Elsevier, New York, 2000. [63] F. Lombardo, M. Y. Shalaeva, B. D. Bissett, N. Chistokhodova, Physicochemical and Biological Profiling in Drug Research. ElogD7.4 20,000 Compounds Later: Refine- ments, Observations, and Applications, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, Germany, 2006. [64] C. Liang, H. Lian, “Recent advances in lipophilicity measurement by reversed- phase high-performance liquid chromatography”, TrAC - Trends in Analytical Chemistry 2015, 68, 28–36. [65] C. Giaginis, S. Theocharis, “Octanol/water partitioning simulation by reversed- phase high performance liquid chromatography for structurally diverse acidic drugs: Effect of n-octanol as mobile phase additive”, Journal of Chromatography A 2007, 116–125. [66] C. Giaginis, S. Theocharis, “Octanol/water partitioning simulation by RP-HPLC for structurally diverse acidic drugs: Comparison of three columns in the pres- ence and absence of n-octanol as the mobile phase additive”, Journal of Separa- tion Science 2013, 36, 3830–3836. [67] F. Lombardo, M. Y. Shalaeva, K. A. Tupper, F. Gao, “ElogDoct: A tool for lipophilic- ity determination in drug discovery. 2. Basic and neutral compounds”, Journal of Medicinal Chemistry 2001, 44, 2490–2497. [68] Y. Ishihama, Y. Oda, N. Asakawa, “Evaluation of Solute Hydrophobicity by Microemulsion Electrokinetic Chromatography”, Analytical Chemistry 1995, 67, 1588–1595. [69] S. K. Poole, D. Durham, C. 
Kibbey, “Rapid method for estimating the octanol water partition coefficient (logPow) by microemulsion electrokinetic chromatog- raphy”, Journal of Chromatography B 2000, 745, 117–126. [70] K. Valkó, L. R. Snyder, J. L. Glajch, “Retention in reversed-phase liquid chro- matography as a function of mobile-phase composition”, Journal of Chromatog- raphy A 1993, 656, 501–520. [71] K. Valko, C. Bevan, D. Reynolds, “Chromatographic hydrophobicity index by fast-gradient RP-HPLC: a high-throughput alternative to log P/log D”, Analyt- ical Chemistry 1997, 69, 2022–2029. [72] K. Valkó, “Application of high-performance liquid chromatography based mea- surements of lipophilicity to model biological distribution”, Journal of Chro- matography A 2004, 1037, 299–310. [73] J. De Bruijn, F. Busser, W. Seinen, J. Hermens, “Determination of octanol/water partition coefficients for hydrophobic organic chemicals with the ’slow-stirring’ method”, Environmental Toxicology and Chemistry 1989, 8, 499–512. [74] B. Wagner, H. Fischer, M. Kansy, A. Seelig, F. Assmus, “Carrier Mediated Dis- tribution System (CAMDIS): a new approach for the measurement of octanol / water distribution coefficients.” European Journal of Pharmaceutical Sciences 2015, 68, 68–77. [75] F. Andri´c,D. Bajusz, A. Racz, S. Segan, K. Héberger, “Multivariate assessment of lipophilicity scales-computational and reversed phase thin-layer chromato- graphic indices”, Journal of Pharmaceutical and Biomedical Analysis 2016, 127, 81– 93. Bibliography 107

[76] F. Chen, X. Cao, S. Han, H. Lian, L. Mao, “Relationship between hydrophobic- ity and RPLC retention behavior of amphoteric compounds”, Journal of Liquid Chromatography and Related Technologies 2014, 2711–2724. [77] S. Martel, F. Gillerat, E. Carosati, D. Maiarelli, “Large, chemically diverse dataset of logP measurements for benchmarking studies”, European Journal of Pharma- ceutical Sciences 2013, 48, 21–29. [78] K. Tacaks-Novak, A. Avdeef, “Interlaboratory study of log P determination by shake-flask and potentiometric methods”, Journal of Pharmaceutical and Biomedi- cal Analysis 1996, 14, 1405–1413. [79] J. J. Irwin, B. K. Shoichet, “ZINC - A free database of commercially available compounds for virtual screening”, Journal of Chemical Information and Modeling 2005, 45, 177–182. [80] N. El Tayar, R.-S. Tsai, P. Vallat, C. Altomare, B. Testa, “Measurement of parti- tion coefficients by various centrifugal partition chromatographic techniques”, Journal of Chromatography A 1991, 556, 181–194. [81] S. D. Krämer, J. C. Gautier, P. Saudemon, “Considerations on the potentiometric log P determination”, Pharmaceutical Research 1998, 15, 1310–1313. [82] C. Barzanti et al., “Potentiometric determination of octanol–water and lipo- some–water partition coefficients (log P) of ionizable organic compounds”, Tetra- hedron Letters 2007, 48, 3337–3341. [83] F. H. Clarke, N. M. Cahoon, “Partition coefficients by curve fitting: The use of two different octanol volumes in a dual-phase potentiometric titration”, Journal of Pharmaceutical Sciences 1996, 85, 178–183. [84] A. Avdeef, “pH-Metric log P. Part 1. Difference Plots for Determining Ion-Pair Octanol-Water Partition Coefficients of Multiprotic Substances”, Quantitative Structure-Activity Relationships 1992, 11, 510–517. [85] C. Hansch, T. Fujita, “p-σ-π Analysis. A method for the correlation of biological activity and chemical structure”, Journal of the American Chemical Society 1964. [86] G. G. Nys, R. F. 
Rekker, “Statistical analysis of a series of partition coefficients with special reference to the predictability of folding of drug molecules. The introduction of hydrophobic fragmental constants (f values)”, Chimica Thera- peutica 1973, 8, 521–535. [87] R. Mannhold, R. F. Rekker, “The hydrophobic fragmental constant approach for calculating log P in octanol/water and aliphatic hydrocarbon/water systems”, Perspectives in Drug Discovery and Design 2000, 18, 1–18. [88] A. Leo, P. Y. C. Jow, C. Silipo, C. Hansch, “Calculation of hydrophobic constant (log P) from π and f constants”, Journal of Medicinal Chemistry 1975, 18, 865–868. [89] P. Tao, R. X. Wang, L. H. Lai, “Calculating partition coefficients of peptides by the addition method”, Journal of Molecular Modeling 1999, 5, 189–195. [90] A. A. Petrauskas, E. A. Kolovanov, “ACD/Log P method description”, Perspec- tives in Drug Discovery and Design 2000, 19, 99–116. [91] A. K. Ghose, G. M. Crippen, “Atomic Physicochemical Parameters for 3- Di- mensional Structure Directed Quantitative Structure-Activity Relationships”, Journal of Computational Chemistry 1986, 7, 565–577. [92] V. N. Viswanadhan, A. K. Ghose, J. J. Wendoloski, “Estimating aqueous solva- tion and lipophilicity of small organic molecules: A comparative overview of atom/group contribution methods”, Perspectives in Drug Discovery and Design 2000, 19, 85–98. [93] G. Klopman, J. Y. Li, S. Wang, “Computer automated log P calculations based on an extended group contribution approach”, Journal of Chemical Information and Computer Sciences 1994, 34, 752–781. 108 Bibliography

[94] J. Bajorath, Chemoinformatics and Computational Chemical Biology, Humana Press, New York, 2016. [95] A. R. Katritzky, V. S. Lobanov, “QSPR - the Correlation and Quantitative Pre- diction of Chemical and Physical-Properties From Structure”, Chemical Society Reviews 1995, 24, 279–288. [96] I. Moriguchi, S. Hirono, Q. Liu, I. Nakagome, Y. Matsushita, “Simple method of calculating octanol/water partition coefficient”, Chemical and Pharmaceutical Bulletin 1992, 40, 127–130. [97] S. Riniker, “Molecular Dynamics Fingerprints (MDFP): Machine Learning from MD Data to Predict Free-Energy Differences”, Journal of Chemical Information and Modeling 2017, 57, 726–741. [98] I. Sushko et al., “Online chemical modeling environment (OCHEM): web plat- form for data storage, model development and publishing of chemical informa- tion”, Journal of Computer-Aided Molecular Design 2011, 25, 533–554. [99] A. Gaulton et al., “The ChEMBL database in 2017.” Nucleic Acids Research 2017, 45, D945–D954. [100] P. Bruneau, N. R. McElroy, “LogD7.4 Modeling Using Bayesian Regularized Neural Networks. Assessment and Correction of the Errors of Prediction”, Jour- nal of Chemical Information and Modeling 2006, 46, 1379–1387. [101] T. Schroeter et al., “Machine learning models for lipophilicity and their domain of applicability”, Molecular Pharmaceutics 2007, 4, 524–538. [102] R. Mannhold, G. I. Poda, C. Ostermann, I. V. Tetko, “Calculation of molecular lipophilicity: State-of-the-art and comparison of log P methods on more than 96,000 compounds”, Journal of Pharmaceutical Sciences 2009, 98, 861–893. [103] I. V.Tetko, V.Y. Tanchuk, A. E. P.Villa, “Prediction of n-Octanol/Water Partition Coefficients from PHYSPROP Database Using Artificial Neural Networks and E-State Indices”, Journal of Chemical Information and Computer Sciences 2001, 41, 1407–1421. [104] A. Visconti, G. Ermondi, G. Caron, R. 
Esposito, “Prediction and interpretation of the lipophilicity of small peptides”, Journal of Computer-Aided Molecular De- sign 2016, 29, 361–370. [105] G. Cruciani, P. Crivori, P.-A. Carrupt, B. Testa, “Molecular fields in quantitative structure–permeation relationships: the VolSurf approach”, Journal of Molecular Structure 2000, 503, 17–30. [106] G. Cruciani, M. Pastor, W. Guba, “VolSurf: a new tool for the pharmacokinetic optimization of lead compounds”, European Journal of Pharmaceutical Sciences 2000, 11, S29–S39. [107] T. S. Schroeter et al., “Predicting Lipophilicity of Drug-Discovery Molecules using Gaussian Process Models”, ChemMedChem 2007, 2, 1265–1267. [108] L. N. Ognichenko et al., “QSPR prediction of lipophilicity for organic com- pounds using random forest technique on the basis of simplex representation of molecular structure”, Molecular Informatics 2012, 31, 273–280. [109] J. B. Wang, D. S. Cao, M. F. Zhu, Y. H. Yun, “In silico evaluation of logD7.4 and comparison with other prediction methods”, Journal of Chemometrics 2015, 29, 389–398. [110] G. B. Santos, A. Ganesan, F. S. Emery, “Oral Administration of Peptide-Based Drugs: Beyond Lipinski’s Rule”, ChemMedChem 2016, 11, 2245–2251. [111] J. L. Lau, M. K. Dunn, “Therapeutic peptides: Historical perspectives, current development trends, and future directions”, Bioorganic & Medicinal Chemistry 2018, 26, 2700–2707. [112] K. Fosgerau, T. Hoffmann, “Peptide therapeutics: Current status and future directions”, Drug Discovery Today 2015, 20, 122–128. Bibliography 109

[113] A. A. Kaspar, J. M. Reichert, “Future directions for peptide therapeutics devel- opment”, Drug Discovery Today 2013, 18, 807–817. [114] A. Henninot, J. C. Collins, J. M. Nuss, “The Current State of Peptide Drug Dis- covery: Back to the Future?”, Journal of Medicinal Chemistry 2018, 61, 1382–1414. [115] R. J. Dubos, “Studies on a bactericidal agent extracted from a soil bacillus: I. Preparation of the agent. Its activity in vitro”, The Journal of Experimental Medicine 1939, 70, 1. [116] G. Wang, Antimicrobial Peptides, CABI, Walingford, 2017. [117] H. Jenssen, P. Hamill, R. E. W. Hancock, “Peptide antimicrobial agents”, Clinical Microbiology Reviews 2006, 19, 491–511. [118] M. Zasloff, “Antimicrobial peptides of multicellular organisms”, Nature 2002, 415, 389–395. [119] C. D. Fjell, J. A. Hiss, R. Hancock, “Designing antimicrobial peptides: form follows function”, Nature Reviews Drug Discovery 2012, 11, 37–51. [120] A. T. Müller, G. Gabernet, J. A. Hiss, G. Schneider, “modlAMP: Python for antimicrobial peptides”, Bioinformatics 2017, 33, 2753–2755. [121] M. Pillong et al., “Rational Design of Membrane-Pore-Forming Peptides”, Small 2017, 13, 1701316. [122] A. T. Müller et al., “Sparse Neural Network Models of Antimicrobial Peptide- Activity Relationships”, Molecular Informatics 2016, 35, 606–614. [123] G. Gabernet, A. T. Müller, J. A. Hiss, G. Schneider, “Membranolytic anticancer peptides”, MedChemComm 2016, 7, 2232–2245. [124] F. Grisoni, C. Neuhaus, G. Gabernet, A. Müller, J. Hiss, G. Schneider, “Design- ing anticancer peptides by constructive machine learning.” ChemMedChem 2018, 13, 1300–1302. [125] K. Kurrikoff, M. Gestin, Ü. Langel, “Recent in vivo advances in cell-penetrating peptide-assisted drug delivery”, Expert Opinion on Drug Delivery 2016, 13, 373– 387. [126] M. Pooga, Ü. Langel, Classes of Cell-Penetrating Peptides, Humana Press, New York, 2015, pp. 3–28. [127] N. Svensen, J. G. A. Walton, M. 
Bradley, “Peptides for cell-selective drug deliv- ery”, Trends in Pharmacological Sciences 2012, 33, 186–192. [128] V. Gregorc et al., “Defining the optimal biological dose of NGR-hTNF, a selec- tive vascular targeting agent, in advanced solid tumours”, European Journal of Cancer 2010, 46, 198–206. [129] V. Gregorc et al., “Phase II Study of Asparagine-Glycine-Arginine–Human Tu- mor Necrosis Factor α, a Selective Vascular Targeting Agent, in Previously Treated Patients With Malignant Pleural Mesothelioma”, Journal of Clinical Oncology 2016, 28, 2604–2611. [130] L. Nevola, E. Giralt, “Modulating protein-protein interactions: The potential of peptides”, Chemical Communications 2015, 51, 3302–3315. [131] S. Surade, T. L. Blundell, “Structural Biology and Drug Discovery of Difficult Targets: The Limits of Ligandability”, Chemistry & Biology 2012, 19, 42–50. [132] F. Bernal, A. F. Tyler, S. J. Korsmeyer, L. D. Walensky, G. L. Verdine, “Reactiva- tion of the p53 tumor suppressor pathway by a stapled p53 peptide.” Journal of the American Chemical Society 2007, 129, 2456–2457. [133] A. Muppidi, Z. Wang, X. Li, J. Chen, Q. Lin, “Achieving cell penetration with distance-matching cysteine cross-linkers: a facile route to cell-permeable pep- tide dual inhibitors of Mdm2/Mdmx”, Chemical Communications 2011, 47, 9396– 9398. [134] Y. H. Lau et al., “Investigating peptide sequence variations for ‘double-click’ stapled p53 peptides”, Organic & Biomolecular Chemistry 2014, 12, 4074–4077. 110 Bibliography

[135] T. Hara, S. R. Durell, M. C. Myers, D. H. Appella, “Probing the Structural Requirements of Peptoids That Inhibit HDM2p53 Interactions”, Journal of the American Chemical Society 2006, 128, 1995–2004. [136] E. Miranda et al., “A cyclic peptide inhibitor of HIF-1 heterodimerization that inhibits hypoxia signaling in cancer cells”, Journal of the American Chemical Soci- ety 2013, 135, 10418–10425. [137] M. L. Stewart, E. Fire, A. E. Keating, L. D. Walensky, “The MCL-1 BH3 helix is an exclusive MCL-1 inhibitor and apoptosis sensitizer.” Nature Chemical Biology 2010, 6, 595–601. [138] L. D. Walensky et al., “Activation of Apoptosis in Vivo by a Hydrocarbon- Stapled BH3 Helix”, Science 2004, 305, 1466–1470. [139] A. Muppidi et al., “Rational Design of Proteolytically Stable, Cell-Permeable Peptide-Based Selective Mcl-1 Inhibitors”, Journal of the American Chemical Soci- ety 2012, 134, 14734–14737. [140] T. N. Grossmann, J. T.-H. Yeh, B. R. Bowman, Q. Chu, R. E. Moellering, G. L. Verdine, “Inhibition of oncogenic Wnt signaling through direct targeting of β-catenin.” Proceedings of the National Academy of Sciences of the United States of America 2012, 109, 17942–17947. [141] S. A. Kawamoto, A. Coleska, X. Ran, H. Yi, C.-Y. Yang, S. Wang, “Design of Triazole-Stapled BCL9 α-Helical Peptides to Target the β-Catenin/B-Cell CLL/ lymphoma 9 (BCL9) Protein–Protein Interaction”, Journal of Medicinal Chemistry 2012, 55, 1137–1146. [142] A. Tavassoli, Q. Lu, J. Gam, H. Pan, S. J. Benkovic, S. N. Cohen, “Inhibition of HIV budding by a genetically selected cyclic peptide targeting the Gag-TSG101 interaction.” ACS Chemical Biology 2008, 3, 757–764. [143] W. Lian, P. Upadhyaya, C. A. Rhodes, Y. Liu, D. Pei, “Screening Bicyclic Pep- tide Libraries for Protein–Protein Interaction Inhibitors: Discovery of a Tumor Necrosis Factor-α Antagonist”, Journal of the American Chemical Society 2013, 135, 11990–11995. [144] A. Glas, D. Bier, G. Hahne, C. Rademacher, C. Ottmann, T. N. 
Grossmann, “Constrained Peptides with Target Adapted Cross Links as Inhibitors of a Patho- genic Protein–Protein Interaction”, Angewandte Chemie International Edition 2014, 53, 2489–2493. [145] N. A. Khazanov, H. A. Carlson, “Exploring the Composition of Protein-Ligand Binding Sites on a Large Scale”, PLOS Computational Biology 2013, 9, e1003321. [146] C. T. Dooley, R. A. Houghten, “The use of positional scanning synthetic peptide combinatorial libraries for the rapid determination of opioid receptor ligands”, Life Sciences 1993, 52, 1509–1517. [147] L. Otvos et al., “Design and development of a peptide-based adiponectin re- ceptor agonist for cancer treatment”, BMC Biotechnology 2011, 11, 90. [148] G. Hummel, U. Reineke, U. Reimer, “Translating peptides into small molecules”, Molecular BioSystems 2006, 2, 499–508. [149] L. J. Otvos, J. D. Wade, “Current challenges in peptide-based drug discovery”, Frontiers in Chemistry 2014, 2, 517. [150] R. T. Raines, H. Wennemers, “Peptides on the Rise”, Accounts of Chemical Re- search 2017, 50, 2419–2419. [151] P. Vlieghe, V. Lisowski, J. Martinez, M. Khrestchatisky, “Synthetic therapeutic peptides: science and market”, Drug Discovery Today 2010, 15, 40–56. [152] R. B. Merrifield, “Solid Phase Peptide Synthesis. I. The Synthesis of a Tetrapep- tide”, Journal of the American Chemical Society 1963, 85, 2149–2154. Bibliography 111

[153] L. Gentilucci, R. De Marco, L. Cerisoli, “Chemical Modifications Designed to Improve Peptide Stability: Incorporation of Non-Natural Amino Acids, Pseudo- Peptide Bonds, and Cyclization”, Current Pharmaceutical Design 2010, 16, 3185– 3203. [154] B. Özgönenel, M. Rajpurkar, J. M. Lusher, “How do you treat bleeding disor- ders with desmopressin?”, Postgraduate Medical Journal 2007, 83, 159–163. [155] M. G. Moertl, S. Friedrich, J. Kraschl, C. Wadsack, U. Lang, D. Schlembach, “Haemodynamic effects of carbetocin and oxytocin given as intravenous bo- lus on women undergoing caesarean delivery: a randomised trial”, BJOG: An International Journal of Obstetrics & Gynaecology 2011, 118, 1349–1356. [156] T. Engstrom, T. Barth, P. Melin, H. Vilhardt, “Oxytocin receptor binding and uterotonic activity of carbetocin and its metabolites following enzymatic degra- dation”, European Journal of Pharmacology 1998, 355, 203–210. [157] J. Chatterjee, F. Rechenmacher, H. Kessler, “N-Methylation of Peptides and Pro- teins: An Important Element for Modulating Biological Functions”, Angewandte Chemie International Edition 2012, 52, 254–269. [158] J. G. Beck et al., “Intestinal Permeability of Cyclic Peptides: Common Key Back- bone Motifs Identified”, Journal of the American Chemical Society 2012, 134, 12125– 12133. [159] B. Laufer, J. Chatterjee, A. O. Frank, H. Kessler, “Can N-methylated amino acids serve as substitutes for in conformational design of cyclic pen- tapeptides?”, Journal of Peptide Science 2009, 15, 141–146. [160] Qvit, Nir, Rubin, Samuel J S, Urban, Travis J, Mochly-Rosen, Daria, Gross, Eric R, “Peptidomimetic therapeutics: scientific approaches and opportunities”, Drug Discovery Today 2017, 22, 454–462. [161] H. E. Blackwell, R. H. Grubbs, “Highly efficient synthesis of covalently cross- linked peptide helices by ring-closing metathesis”, Angewandte Chemie Interna- tional Edition 1998, 37, 3281–3284. [162] C. E. Schafmeister, J. Po, G. L. 
Verdine, “An All-Hydrocarbon Cross-Linking System for Enhancing the Helicity and Metabolic Stability of Peptides”, Journal of the American Chemical Society 2000, 122, 5891–5892. [163] N. Tsomaia, “Peptide therapeutics: Targeting the undruggable space”, European Journal of Medicinal Chemistry 2015, 94, 459–470. [164] A. Zorzi, K. Deyle, C. Heinis, “Cyclic peptide therapeutics: past, present and future”, Current Opinion in Chemical Biology 2017, 38, 24–29. [165] Y. Che, B. R. Brooks, G. R. Marshall, “Development of small molecules designed to modulate protein–protein interactions”, Journal of Computer-Aided Molecular Design 2006, 20, 109–130. [166] S. U. Vetterli, K. Moehle, J. A. Robinson, “Synthesis and antimicrobial activ- ity against Pseudomonas aeruginosa of macrocyclic β-hairpin peptidomimetic antibiotics containing N-methylated amino acids”, Bioorganic & Medicinal Chem- istry 2016, 24, 6332–6339. [167] N. Srinivas et al., “Peptidomimetic Antibiotics Target Outer-Membrane Bio- genesis in Pseudomonas aeruginosa”, Science 2010, 327, 1010–1013. [168] R. Fasan et al., “Using a beta-hairpin to mimic an alpha-helix: Cyclic pep- tidomimetic inhibitors of the p53-HDM2 protein protein interaction”, Ange- wandte Chemie International Edition 2004, 43, 2109–2112. [169] K. Zerbe, K. Moehle, J. A. Robinson, “Protein Epitope Mimetics: From New Antibiotics to Supramolecular Synthetic Vaccines”, Accounts of Chemical Research 2017, 50, 1323–1331. [170] L. J. Walport, R. Obexer, H. Suga, “Strategies for transitioning macrocyclic pep- tides to cell-permeable drug leads”, Current Opinion in Biotechnology 2017, 48, 242–250. 112 Bibliography

[171] M. Akamatsu, Y. Yoshida, H. Nakamura, M. Asao, H. Iwamura, T. Fujita, “Hy- drophobicity of Di- and Tripeptides Having Unionizable Side Chains and Cor- relation with Substituent and Structural Parameters”, Quantitative Structure- Activity Relationships 1989, 8, 195–203. [172] M. Akamatsu, T. Fujita, “Quantitative analyses of hydrophobicity of di-to pen- tapeptides having un-ionizable side chains with substituent and structural pa- rameters”, Journal of Pharmaceutical Sciences 1992, 81, 164–174. [173] M. Akamatsu, T. Fujita, “Hydrophobicities of di-to pentapeptides having union- izable side chains and correlation with substituent and structural parameters”, Pharmacochemistry Library 1995, 23, 185–214. [174] R. A. Conradi, A. R. Hilgers, N. Ho, P. S. Burton, “The Influence of Peptide Structure on Transport Across Caco-2 Cells”, Pharmaceutical Research 1991, 8, 1453–1460. [175] R. A. Conradi, A. R. Hilgers, N. F. H. Ho, P. S. Burton, “The Influence of Pep- tide Structure on Transport Across Caco-2 Cells. II. Peptide Bond Modification Which Results in Improved Permeability”, Pharmaceutical Research 1992, 9, 435– 439. [176] G. T. Knipp, D. G. Vander Velde, T. J. Siahaan, R. T. Borchardt, “The Effect of β-Turn Structure on the Passive Diffusion of Peptides Across Caco-2 Cell Monolayers”, Pharmaceutical Research 1997, 14, 1332–1340. [177] E. B. Hunter, S. P. Powers, L. J. Kost, D. I. Pinon, L. J. Miller, N. F. LaRusso, “Physicochemical determinants in hepatic extraction of small peptides”, Hepa- tology 1990, 12, 76–82. [178] A. C. M. Paiva, V. L. A. Nouailhetas, T. B. Paiva, “Synthesis of octanoyl[8- leucyl]angiotensin II, a lipophilic angiotensin antagonist”, Journal of Medicinal Chemistry 1977, 20, 898–901. [179] N. El Tayar, A. E. Mark, P. Vallat, R. M. Brunne, B. Testa, W. F. 
van Gunsteren, “Solvent Dependent Conformation and Hydrogen Bonding Capacity of Cy- closporin A: Evidence from Partition Coefficients and Molecular Dynamics Sim- ulations”, Journal of Medicinal Chemistry 1993, 36, 3757–3764. [180] H. Bundgaard, J. Møss, “Prodrugs of Peptides. 6. Bioreversible Derivatives of Thyrotropin-Releasing Hormone (TRH) with Increased Lipophilicity and Resis- tance to Cleavage by the TRH-Specific Serum Enzyme”, Pharmaceutical Research 1990, 7, 885–892. [181] W. A. Banks, A. J. Kastin, “Peptides and the blood-brain barrier: Lipophilicity as a predictor of permeability”, Brain Research Bulletin 1985, 15, 287–292. [182] S. J. Thompson, C. K. Hattotuwagama, J. D. Holliday, “On the hydrophobicity of peptides: Comparing empirical predictions of peptide logP values”, 2006, 1, 237–241. [183] M. N. Davies, D. R. Flower, “A Benchmark Dataset Comprising Partition and Distribution Coefficients of Linear Peptides”, Dataset Papers in Science 2013. [184] C. T. Mant, T. W. L. Burke, J. A. Black, R. S. Hodges, “Effect of peptide chain length on peptide retention behaviour in reversed-phase chromatogrphy”, Jour- nal of Chromatography A 1988, 458, 193–205. [185] M. C. J. Wilce, M. I. Aguilar, M. T. W. Hearn, “High-performance liquid chro- matography of amino acids, peptides and proteins : CVII. A Analysis of group retention contributions for peptides separated with a range of mobile and sta- tionary phases by reversed-phase high-performance liquid chromatography”, Journal of Chromatography A 1991, 536, 165–183. [186] J. Paladino et al., “Estimation of blood levels of endothelin and neurokinin receptor antagonists at the rat portal and jugular veins after oral administration as a tool in peptide drug design.” Drug Design and Discovery 1994, 12, 121–128. Bibliography 113

[187] A. Lamiable, P. Thévenet, J. Rey, M. Vavrusa, P. Derreumaux, P. Tuffery, “PEP- FOLD3: faster de novo structure prediction for linear peptides in solution and in complex”, Nucleic Acids Research 2016, 44, W449–W454. [188] J. C. Gertrudes, V. G. Maltarollo, R. A. Silva, P. R. Oliveira, K. M. Honorio, A. B. F. da Silva, “Machine Learning Techniques and Drug Design”, Current Medicinal Chemistry 2012, 19, 4289–4297. [189] A. Lavecchia, “Machine-learning approaches in drug discovery: methods and applications”, Drug Discovery Today 2015, 20, 318–331. [190] V. G. Maltarollo, J. C. Gertrudes, P. R. Oliveira, K. M. Honorio, “Applying ma- chine learning techniques for ADME-Tox prediction: a review”, Expert Opinion on Drug Metabolism & Toxicology 2015, 11, 259–271. [191] M. Tareq Hassan Khan, “Predictions of the ADMET Properties of Candidate Drug Molecules Utilizing Different QSAR/QSPR Modelling Approaches”, Cur- rent Drug Metabolism 2010, 11, 285–295. [192] P. Duchowicz, E. Castro, “QSPR Studies on Aqueous Solubilities of Drug-Like Compounds”, International Journal of Molecular Sciences 2009, 10, 2558–2577. [193] P. S. Kharkar, “Two-Dimensional (2D) In Silico Models for Absorption, Dis- tribution, Metabolism, Excretion and Toxicity (ADME/T) in Drug Discovery”, Current Topics in Medicinal Chemistry 2010, 10, 116–126. [194] M. Á. Cabrera-Pérez, H. Pham-The, “Computational modeling of human oral bioavailability: what will be next?”, Expert Opinion on Drug Discovery 2018, 13, 509–521. [195] T. Vallianatou, G. Lambrinidis, A. Tsantili-Kakoulidou, “In silicoprediction of human serum albumin binding for drug leads”, Expert Opinion on Drug Discov- ery 2013, 8, 583–595. [196] A. L. Samuel, “Some studies in machine learning using the game of checkers”, IBM Journal of Research and Development 1959, 3, 221–229. [197] T. M. Mitchell, Machine learning, 1997. [198] C. M. Bishop, Pattern Recognition and Machine Learning, Springer Verlag, New York, 2006. [199] T. Hastie, R. 
Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer Verlag, New York, 2009. [200] D. J. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, Cam- bridge, 2001. [201] S. Suthaharan, Machine Learning Models and Algorithms for Big Data Classification, Springer, Boston, MA, 2015. [202] S. Sra, S. Nowozin, S. J. Wright, Optimization for Machine Learning, MIT Press, Cambridge, 2012. [203] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, John Wiley & Sons, Weinheim, Germany, 2008. [204] I. Guyon, A. Elisseeff, “An Introduction to Variable and Feature Selection”, Journal of Machine Learning Research 2003, 3, 1157–1182. [205] Danishuddin, A. U. Khan, “Descriptors and their selection methods in QSAR analysis: paradigm for drug design”, Drug Discovery Today 2016, 21, 1291–1302. [206] H. Wiener, “Structural Determination of Paraffin Boiling Points”, Journal of the American Chemical Society 1947, 69, 17–20. [207] I. Gutman, K. C. Das, “The first Zagreb index 30 years after”, Match Communi- cations in Mathematical and in Computer Chemistry 2004, 83–92. [208] A. T. Balaban, “Highly discriminating distance-based topological index”, Chem- ical Physics Letters 1982, 89, 399–404. [209] L. B. Kier, “Indexes of molecular shape from chemical graphs”, Medicinal Re- search Reviews 1987, 7, 417–440. 114 Bibliography

[210] G. Moreau, P. Broto, “The auto-correlation of a topological-structure-a new Molecular Descriptor”, Nouveau Journal de Chimie 1980, 4, 359–360. [211] M. Reutlinger et al., “Chemically Advanced Template Search (CATS) for Scaffold- Hopping and Prospective Target Prediction for ’Orphan’ Molecules.” Molecular Informatics 2013, 32, 133–138. [212] D. Rogers, M. Hahn, “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 2010, 50, 742–754. [213] D. Rognan, “Chemogenomic approaches to rational drug design”, British Jour- nal of Pharmacology 2007, 152, 38–52. [214] M. R. Reutlinger, “Adaptive combinatorial de novo design of multi-target mod- ulating compounds”, Diss ETH No. 21732 2014. [215] M. A. Johnson, G. M. Maggiora, Concepts and applications of molecular similarity, Wiley-Interscience, 1990. [216] M. Mathea, W. Klingspohn, K. Baumann, “Chemoinformatic Classification Meth- ods and their Applicability Domain”, Molecular Informatics 2016, 35, 160–180. [217] G. Schneider, P. Schneider, S. Renner, “Scaffold-Hopping: How Far Can You Jump?”, QSAR & Combinatorial Science 2006, 25, 1162–1171. [218] D. E. Clark, “What has virtual screening ever done for drug discovery?”, Expert Opinion on Drug Discovery 2008, 3, 841–851. [219] M. Lill in In Silico Models for Drug Discovery, Humana Press, Totowa, NJ, 2013, pp. 1–12. [220] C. C. Aggarwal, A. Hinneburg, D. A. Keim in Database Theory — ICDT 2001, Springer, Berlin, Heidelberg, 2001, pp. 420–434. [221] P. C. Mahalanobis, “On the generalised distance in statistics”, Proceedings of the National Institute of Sciences of India 1936, 2, 49–55. [222] P. Willett, “Similarity-based virtual screening using 2D fingerprints”, Drug Dis- covery Today 2006, 11, 1046–1053. [223] H. Eckert, J. Bajorath, “Molecular similarity analysis in virtual screening: foun- dations, limitations and novel approaches”, Drug Discovery Today 2007, 12, 225– 233. [224] J. M. Zimmerman, N. Eliezer, R. 
Simha, “The characterization of amino acid sequences in proteins by statistical methods”, Journal of Theoretical Biology 1968, 21, 170–201. [225] A. T. Müller, “De novo design of antimicrobial peptides: from sequence tem- plates to artificial intelligence”, Diss ETH No. 24920 2018. [226] C. P. Koch et al., “Scrutinizing MHC-I Binding Peptides and Their Limits of Variation”, PLOS Computational Biology 2013, 9, e1003088. [227] K. Pearson, “On lines and planes of closest fit to systems of points in space”, The London Edinburgh and Dublin Philosophical Magazine and Journal of Science 1901, 2, 559–572. [228] I. T. Jolliffe in Principal Component Analysis, Springer, New York, 1986, pp. 115– 128. [229] M. Reutlinger, G. Schneider, “Nonlinear dimensionality reduction and map- ping of compound libraries for drug discovery.” Journal of Molecular Graphics & Modelling 2012, 34, 108–117. [230] T. I. Oprea, J. Gottfries, “Chemography: The Art of Navigating in Chemical Space”, Journal of Combinatorial Chemistry 2001, 3, 157–166. [231] M. Feher, J. M. Schmidt, “Property Distributions: Differences between Drugs, Natural Products, and Molecules from Combinatorial Chemistry”, Journal of Chemical Information and Computer Sciences 2003, 43, 218–227. Bibliography 115

[232] A. Larsson et al., “Multivariate Design, Synthesis, and Biological Evaluation of Peptide Inhibitors of FimC/FimH ProteinProtein Interactions in Uropathogenic Escherichia coli”, Journal of Medicinal Chemistry 2005, 48, 935–945. [233] G. Schneider, P. Wrede, “Artificial neural networks for computer-based molec- ular design.” Progress in Biophysics and Molecular Biology 1998, 70, 175–222. [234] L. M. Le Cam, J. Neyman, Proceedings of the Fifth Berkeley Symposium on Mathe- matical Statistics and Probability, Univ of California Press, 1967. [235] J. Cleve, U. Lämmel, Data Mining, Oldenbourg Wissenschaftsverlag, München, 2014. [236] K. G. Le Roch et al., “Discovery of gene function by expression profiling of the malaria parasite life cycle”, Science 2003, 301, 1503–1508. [237] P. P. Roy, K. Roy, “On Some Aspects of Variable Selection for Partial Least Squares Regression Models”, QSAR & Combinatorial Science 2008, 27, 302–313. [238] A. Ranise et al., “Structure-Based Design, Parallel Synthesis, Structure-Activity Relationship, and Molecular Modeling Studies of Thiocarbamates, New Potent Non-Nucleoside HIV-1 Reverse Transcriptase Inhibitor Isosteres of Phenethylth- iazolylthiourea Derivatives”, Journal of Medicinal Chemistry 2005, 48, 3858–3873. [239] A. N. Tikhonov, V. I. Arsenin, Solutions of ill-posed problems, Vh Winston, 1977. [240] R. Tibshirani, “Regression shrinkage and selection via the Lasso”, Journal of the Royal Statistical Society Series B-Methodological 1996, 58, 267–288. [241] R. Tibshirani, “Regression shrinkage and selection via the lasso: a retrospec- tive”, Journal of the Royal Statistical Society Series B-Statistical Methodology 2011, 73, 273–282. [242] A. E. Hoerl, R. W. Kennard, “Ridge Regression - Biased Estimation for Nonorthog- onal Problems”, Technometrics 1970, 12, 55–67. [243] H. Zou, T. 
Hastie, “Regularization and variable selection via the elastic net”, Journal of the Royal Statistical Society Series B-Statistical Methodology, 67, 301–320. [244] Y. Saeys, I. Inza, P. Larranaga, “A review of feature selection techniques in bioinformatics”, Bioinformatics 2007, 23, 2507–2517. [245] Z. Y. Algamal, M. H. Lee, A. M. Al Fakih, M. Aziz, “High-dimensional QSAR prediction of anticancer potency of imidazo[4,5-b]pyridine derivatives using adjusted adaptive LASSO”, Journal of Chemometrics 2015, 29, 547–556. [246] Q. Mo et al., “Pattern discovery and cancer gene identification in integrated cancer genomic data”, Proceedings of the National Academy of Sciences 2013, 110, 201208949–4250. [247] Ł. Kubik, P. Wiczling, “Quantitative structure-(chromatographic) retention re- lationship models for dissociating compounds”, Journal of Pharmaceutical and Biomedical Analysis 2016, 127, 176–183. [248] E. Daghir-Wojtkowiak et al., “Least absolute shrinkage and selection opera- tor and dimensionality reduction techniques in quantitative structure retention relationship modeling of retention in hydrophilic interaction liquid chromatog- raphy”, Journal of Chromatography A 2015, 1403, 54–62. [249] S. Datta, V. A. Dev, M. R. Eden, “Developing QSPR for Predicting DNA Drug Binding Affinity of 9-Anilinoacridine Derivatives Using Correlation-Based Adap- tive LASSO Algorithm”, Computer Aided Chemical Engineering 2017, 40, 2767– 2772. [250] C. Cortes, V. Vapnik, “Support-vector networks”, Machine Learning 1995, 20, 273–297. [251] B. Schölkopf in Advances in Neural Information Processing Systems, Microsoft Re- search Cambridge, Cambridge, United Kingdom, 2001. [252] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, “An introduction to kernel-based learning algorithms”, IEEE Transactions on Neural Networks 2001, 12, 181–201. 116 Bibliography

[253] R. Clarke et al., “The properties of high-dimensional data spaces: implications for exploring gene and protein expression data”, Nature Reviews Cancer 2008, 8, 37–49. [254] E. Byvatov, U. Fechner, J. Sadowski, G. Schneider, “Comparison of support vector machine and artificial neural network systems for drug/nondrug clas- sification.” Journal of Chemical Information and Computer Sciences 2003, 43, 1882– 1889. [255] L. Franke, E. Byvatov, O. Werz, D. Steinhilber, P. Schneider, G. Schneider, “Ex- traction and Visualization of Potential Pharmacophore Points Using Support Vector Machines: Application to Ligand-Based Virtual Screening for COX-2 In- hibitors”, Journal of Medicinal Chemistry 2005, 48, 6997–7004. [256] Y.-C. Lin et al., “Multidimensional Design of Anticancer Peptides.” Angewandte Chemie International Edition 2015, 54, 10370–10374. [257] K. Konno et al., “Decoralin, a novel linear cationic α-helical peptide from the venom of the solitary eumenine wasp Oreumenes decoratus”, Peptides 2007, 28, 2320–2327. [258] A. Ose et al., “Development of a Support Vector Machine-Based System to Pre- dict Whether a Compound Is a Substrate of a Given Drug Transporter Using Its Chemical Structure”, Journal of Pharmaceutical Sciences 2016, 105, 2222–2230. [259] H. Drucker, C. J. C. Surges, L. Kaufman, A. Smola, V. Vapnik in Advances in Neural Information Processing Systems, Monmouth University, West Long Branch, United States, 1997, pp. 155–161. [260] A. J. Smola, B. Schölkopf, “A tutorial on support vector regression”, Statistics and Computing 2004, 14, 199–222. [261] D. Horvath, G. Marcou, A. Varnek, S. Kayastha, A. de la Vega de León, J. Bajo- rath, “Prediction of Activity Cliffs Using Condensed Graphs of Reaction Repre- sentations, Descriptor Recombination, Support Vector Machine Classification, and Support Vector Regression”, Journal of Chemical Information and Modeling 2016, 56, 1631–1640. [262] E. Byvatov, B. C. Sasse, H. Stark, G. 
Schneider, “From virtual to real screening for D3 dopamine receptor ligands.” ChemBioChem 2005, 6, 997–999. [263] T. Lei et al., “ADMET Evaluation in Drug Discovery. Part 17: Development of Quantitative and Qualitative Prediction Models for Chemical-Induced Respira- tory Toxicity”, Molecular Pharmaceutics 2017, 14, 2407–2421. [264] D. E. Rumelhart, G. E. Hinton, R. J. Williams, “Learning representations by back-propagating errors”, Nature 1986, 323, 533–536. [265] K. Hornik, M. Stinchcombe, H. White, “Multilayer feedforward networks are universal approximators”, Neural Networks 1989, 2, 359–366. [266] G. Schneider, “Neural networks are useful tools for drug design”, 2000, 13, 15– 16. [267] Y. LeCun, Y. Bengio, G. Hinton, “Deep learning”, Nature 2015, 521, 436–444. [268] E. Gawehn, J. A. Hiss, G. Schneider, “Deep Learning in Drug Discovery”, Molec- ular Informatics 2016, 35, 3–14. [269] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities”, Proceedings of the National Academy of Sciences 1982, 79, 2554–2558. [270] S. Hochreiter, J. Schmidhuber, “Long short-term memory.” Neural Computation 1997, 9, 1735–1780. [271] A. Gupta, A. T. Müller, B. J. H. Huisman, J. A. Fuchs, P. Schneider, G. Schnei- der, “Generative Recurrent Networks for De Novo Drug Design”, Molecular Informatics 2017, 37, 1700111. Bibliography 117

[272] D. Merk, L. Friedrich, F. Grisoni, G. Schneider, “De Novo Design of Bioac- tive Small Molecules by Artificial Intelligence.” Molecular Informatics 2018, 37, 1700153. [273] D. H. Wolpert, “Stacked generalization”, Neural Networks 1992, 5, 241–259. [274] S. Renner et al., “Searching for Drug Scaffolds with 3D Pharmacophores and Neural Network Ensembles”, Angewandte Chemie International Edition 2007, 46, 5336–5339. [275] J. A. Hiss, A. Bredenbeck, F. O. Losch, P. Wrede, P. Walden, G. Schneider, “De- sign of MHC I stabilizing peptides by agent-based exploration of sequence space.” Protein Engineering Design & Selection 2007, 20, 99–108. [276] C. P. Koch et al., “Exhaustive proteome mining for functional MHC-I ligands.” ACS Chemical Biology 2013, 8, 1876–1881. [277] S. Geisser, “The Predictive Sample Reuse Method with Applications”, Journal of the American Statistical Association 1975, 70, 320–328. [278] A. Isaksson, M. Wallman, H. Göransson, M. G. Gustafsson, “Cross-validation and bootstrapping are unreliable in small sample classification”, Pattern Recog- nition Letters 2008, 29, 1960–1965. [279] J. G. Topliss, R. J. Costello, “Chance Correlations in Structure-Activity Studies Using Multiple Regression-Analysis”, Journal of Medicinal Chemistry 1972, 15, 1066–1068. [280] J. G. Topliss, R. P. Edwards, “Chance factors in studies of quantitative structure- activity relationships.” Journal of Medicinal Chemistry 1979, 22, 1238–1244. [281] C. Rücker, G. Rücker, M. Meringer, “y-Randomization and its variants in QSPR / QSAR.” Journal of Chemical Information and Modeling 2007, 47, 2345–2357. [282] R. G. Karki, V. M. Kulkarni, “Three-dimensional quantitative structure–activity relationship (3D-QSAR) of 3-aryloxazolidin-2-one antibacterials”, Bioorganic & Medicinal Chemistry 2001, 9, 3153–3160. [283] I. Sushko et al., “Applicability domain for in silico models to achieve accuracy of experimental measurements”, Journal of Chemometrics 2010, 24, 202–208. [284] L. 
Eriksson, J. Jaworska, A. P. Worth, M. T. D. Cronin, R. M. McDowell, P. Gra- matica, “Methods for reliability and uncertainty assessment and for applicabil- ity evaluations of classification- and regression-based QSARs.” Environmental Health Perspectives 2003, 111, 1361. [285] F. Sahigara, D. Ballabio, R. Todeschini, V. Consonni, “Defining a novel k-nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions”, Journal of Cheminformatics 2013, 5, 27. [286] F. Sahigara, K. Mansouri, D. Ballabio, A. Mauri, “Comparison of different ap- proaches to define the applicability domain of QSAR models.” Molecules 2012, 17, 4791–4810. [287] R. P. Sheridan, B. P. Feuston, V. N. Maiorov, S. K. Kearsley, “Similarity to mole- cules in the training set is a good discriminator for prediction accuracy in QSAR.” Journal of Chemical Information and Computer Sciences 2004, 44, 1912–1928. [288] I. V. Tetko, G. I. Poda, C. Ostermann, R. Mannhold, “Accurate in silico logP predictions: One can’t embrace the unembraceable”, QSAR and Combinatorial Science 2009, 845–849. [289] F. Grisoni, “Machine learning for chemoinformatics: an introduction”, BigChem training online course 2017. [290] D. M. J. Tax, R. P. W. Duin in Advances in Pattern Recognition, Springer, Berlin, Heidelberg, Berlin, Heidelberg, 1998, pp. 593–601. [291] M. M. Breunig et al., LOF: identifying density-based local outliers, ACM, New York, 2000. [292] H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek, LoOP: local outlier probabilities, ACM, New York, 2009. 118 Bibliography

[293] W. Jin, A. K. H. Tung, J. Han, W. Wang in Advances in Knowledge Discovery and Data Mining, Springer, Berlin, Heidelberg, 2006, pp. 577–593. [294] E. Ruoslahti, M. D. Pierschbacher, “New perspectives in cell adhesion: RGD and integrins”, Science 1987, 238, 491–497. [295] P.Vanhee, A. M. van der Sloot, E. Verschueren, L. Serrano, F. Rousseau, J. Schym- kowitz, “Computational design of peptide ligands”, Trends in Biotechnology 2011, 29, 231–239. [296] G. Schneider, De novo Molecular Design, John Wiley & Sons, New York, 2013. [297] I. Belda et al., “ENPDA: an evolutionary structure-based de novo peptide de- sign algorithm”, Journal of Computer-Aided Molecular Design 2005, 19, 585–601. [298] V. Brusic, G. Rudy, G. Honeyman, J. Hammer, L. Harrison, “Prediction of MHC class II-binding peptides using an evolutionary algorithm and artificial neural network.” Bioinformatics 1998, 14, 121–130. [299] G. Schneider, J. Schuchhardt, P. Wrede, “Peptide design in machina: develop- ment of artificial mitochondrial protein precursor cleavage sites by simulated molecular evolution”, Biophysical Journal 1995, 68, 434–447. [300] R. Obexer, L. J. Walport, H. Suga, “Exploring sequence space: harnessing chem- ical and biological diversity towards new peptide leads”, Current Opinion in Chemical Biology 2017, 38, 52–61. [301] G. Schneider, P. Wrede, “The rational design of amino acid sequences by artifi- cial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site”, Biophysical Journal 1994, 66, 335–344. [302] R. Grantham, “Amino Acid Difference Formula to Help Explain Protein Evo- lution”, Science 1974, 185, 862–864. [303] G. Schneider et al., “Peptide design by artificial neural networks and computer- based evolutionary search.” Proceedings of the National Academy of Sciences 1998, 95, 12179–12184. [304] K. Stutz et al., “Peptide–Membrane Interaction between Targeting and Lysis”, ACS Chemical Biology 2017, 12, 2254–2259. 
[305] M. Shatnawi in Emerging Trends in Computational Biology, Bioinformatics, and Sys- tems Biology: Algorithms and Software Tools, Elsevier, 2015, pp. 99–121. [306] J. De Las Rivas, C. Fontanillo, “Protein–Protein Interactions Essentials: Key Concepts to Building and Analyzing Interactome Networks”, PLOS Computa- tional Biology 2010, 6, e1000807. [307] K. Venkatesan et al., “An empirical framework for binary interactome map- ping”, Nature Methods 2009, 6, 83–90. [308] J. A. Wells, C. L. McClendon, “Reaching for high-hanging fruit in drug discov- ery at protein-protein interfaces”, Nature 2007, 450, 1001–1009. [309] S. Jaeger, P. Aloy, “From protein interaction networks to novel therapeutic strategies”, Iubmb Life 2012, 64, 529–537. [310] L. Laraia, G. McKenzie, D. R. Spring, A. R. Venkitaraman, D. J. Huggins, “Over- coming Chemical, Biological, and Computational Challenges in the Develop- ment of Inhibitors Targeting Protein-Protein Interactions”, Chemistry & Biology 2015, 22, 689–703. [311] T. Clackson, J. A. Wells, “A hot spot of binding energy in a hormone-receptor interface”, Science 1995, 267, 383–386. [312] I. S. Moreira, P. A. Fernandes, M. J. Ramos, “Hot spots A review of the protein- protein interface determinant amino-acid residues”, Proteins: Structure Function and Bioinformatics 2007, 68, 803–812. [313] D. E. Scott, A. R. Bayly, C. Abell, J. Skidmore, “Small molecules, big targets: drug discovery faces the protein–protein interaction challenge”, Nature Reviews Drug Discovery 2016, 15, 533–550. Bibliography 119

[314] A. Luther, K. Moehle, E. Chevalier, G. Dale, D. Obrecht, “Protein epitope mimetic macrocycles as biopharmaceuticals”, Current Opinion in Chemical Biology 2017, 38, 45–51. [315] V. Azzarito, K. Long, N. S. Murphy, A. J. Wilson, “Inhibition of alpha-helix- mediated protein-protein interactions using designed molecules”, Nature Chem- istry 2013, 5, 161–173. [316] M. P. Gimeno, A. Glas, O. Koch, T. N. Grossmann, “Structure-Based Design of Inhibitors of Protein–Protein Interactions: Mimicking Peptide Binding Epi- topes”, Angewandte Chemie International Edition 2015, 54, 8896–8927. [317] E. Petsalaki, R. B. Russell, “Peptide-mediated interactions in biological systems: new discoveries and applications”, Current Opinion in Biotechnology 2008, 19, 344–350. [318] T. Pawson, P. Nash, “Assembly of Cell Regulatory Systems Through Protein Interaction Domains”, Science 2003, 300, 445–452. [319] L.-G. Milroy, T. N. Grossmann, S. Hennig, L. Brunsveld, C. Ottmann, “Modu- lators of Protein-Protein Interactions”, Chemical Reviews 2014, 114, 4695–4748. [320] A. Zlotnik, O. Yoshie, “Chemokines: a new classification system and their role in immunity”, Immunity 2000, 12, 121–127. [321] C. L. Sokol, A. D. Luster, “The chemokine system in innate immunity.” Cold Spring Harbor Perspectives in Biology 2015, 7, a016303. [322] J. L. Williams, D. W. Holman, R. S. Klein, “Chemokines in the balance: main- tenance of homeostasis and protection at CNS barriers”, Frontiers in Cellular Neuroscience 2014, 8, 1–12. [323] A. Zlotnik, A. M. Burkhardt, B. Homey, “Homeostatic chemokine receptors and organ-specific metastasis”, Nature Reviews Immunology 2011, 11, 597–606. [324] A. Zlotnik, O. Yoshie, H. Nomiyama, “The chemokine and chemokine receptor superfamilies and their molecular evolution”, Genome Biology 2006, 7. [325] I. Clark-Lewis, C. Schumacher, M. Baggiolini, B. 
Moser, “Structure-Activity Re- lationships of Interleukin 8 Determined Using Chemically Synthesized Analogs - Critical Role of Nh2-Terminal Residues and Evidence for Uncoupling of Neu- trophil Chemotaxis, Exocytosis, and Receptor-Binding Activities”, Journal of Bi- ological Chemistry 1991, 266, 23128–23134. [326] I. C. Lewis et al., “Structure-activity relationships of chemokines”, Journal of Leukocyte Biology 1995, 57, 703–711. [327] F. S. Monteclaro, I. F. Charo, “The amino-terminal extracellular domain of the MCP-1 receptor, but not the RANTES/MIP-1alpha receptor, confers chemokine selectivity. Evidence for a two-step mechanism for MCP-1 receptor activation.” Journal of Biological Chemistry 1996, 271, 19084–19092. [328] F. S. Monteclaro, I. F. Charo, “The Amino-terminal Domain of CCR2 Is Both Necessary and Sufficient for High Affinity Binding of Monocyte Chemoattrac- tant Protein 1 RECEPTOR ACTIVATION BY A PSEUDO-TETHERED LIGAND”, Journal of Biological Chemistry 1997, 272, 23186–23190. [329] J. E. Pease, J. Wang, P. D. Ponath, P. M. Murphy, “The N-terminal Extracellular Segments of the Chemokine Receptors CCR1 and CCR3 Are Determinants for MIP-1α and Eotaxin Binding, Respectively, but a Second Domain Is Essential for Efficient Receptor Activation”, Journal of Biological Chemistry 1998, 273, 19972– 19976. [330] J. Liu, S. Louie, W. Hsu, K. M. Yu, H. B. J. Nicholas, G. L. Rosenquist, “Ty- rosine sulfation is prevalent in human chemokine receptors important in lung disease”, American Journal of Respiratory Cell and Molecular Biology 2008, 38, 738– 743. [331] M. A. Hauser et al., “Distinct CCR7 glycosylation pattern shapes receptor sig- naling and endocytosis to modulate chemotactic responses.” Journal of Leukocyte Biology 2016, 99, 993–1007. 120 Bibliography

[332] A. M. Waterhouse, J. B. Procter, D. M. A. Martin, M. Clamp, G. J. Barton, “Jalview Version 2–a multiple sequence alignment editor and analysis workbench”, Bioin- formatics 2009, 25, 1189–1191. [333] Y. Kofuku et al., “Structural basis of the interaction between chemokine stro- mal cell-derived factor-1/CXCL12 and its G-protein-coupled receptor CXCR4”, Journal of Biological Chemistry 2009, 284, 35240–35250. [334] G. L. Uy, M. P. Rettig, A. F. Cashen, “, a CXCR4 antagonist for the mobilization of hematopoietic stem cells”, Expert Opinion on Biological Therapy 2008, 8, 1797–1804. [335] A. J. Wagstaff, “Plerixafor”, Drugs 2009, 69, 319–326. [336] A. B. Kleist et al., “New paradigms in chemokine receptor signal transduction: Moving beyond the two-site model”, Biochemical Pharmacology 2016, 114, 53–68. [337] C. Gerard, B. J. Rollins, “Chemokines and disease”, Nature Immunology 2001, 2, 108–115. [338] B. Wu et al., “Structures of the CXCR4 Chemokine GPCR with Small-Molecule and Cyclic Peptide Antagonists”, Science 2010, 330, 1066–1071. [339] N. Fujii et al., “Molecular-Size Reduction of a Potent CXCR4-Chemokine An- tagonist Using Orthogonal Combination of Conformation- and Sequence-Based Libraries”, Angewandte Chemie International Edition 2003, 42, 3251–3253. [340] N. Heveker et al., “Dissociation of the signalling and antiviral properties of SDF-1-derived small peptides”, Current Biology 1998, 8, 369–376. [341] S. S. Lieberman-Blum, H. B. Fung, J. C. Bandres, “Maraviroc: A CCR5-receptor antagonist for the treatment of HIV-1 infection”, Clinical Therapeutics 2008, 30, 1228–1250. [342] M. Seitz, P. Rusert, K. Moehle, A. Trkola, J. A. Robinson, “Peptidomimetic inhibitors targeting the CCR5- binding site on the human immunodeficiency virus type-1 gp120 glycoprotein complexed to CD4”, Chemical Communications 2010, 46, 7754–7756. [343] M. 
Farzan et al., “A Tyrosine-sulfated Peptide Based on the N Terminus of CCR5 Interacts with a CD4-enhanced Epitope of the HIV-1 gp120 Envelope Glycoprotein and Inhibits HIV-1 Entry”, Journal of Biological Chemistry 2000, 275, 33516–33521. [344] M. Houimel, P. Loetscher, M. Baggiolini, L. Mazzucchelli, “Functional inhibi- tion of CCR3-dependent responses by peptides derived from phage libraries”, European Journal of Immunology 2001, 31, 3535–3545. [345] J. Z. Zhu et al., “Tyrosine Sulfation Influences the Chemokine Binding Selec- tivity of Peptides Derived from Chemokine Receptor CCR3”, Biochemistry 2011, 50, 1524–1534. [346] R. Förster, A. C. Davalos-Misslitz, A. Rot, “CCR7 and its ligands: balancing immunity and tolerance”, Nature Reviews Immunology 2008, 8, 362–371. [347] G. L. Moschovakis et al., “The chemokine receptor CCR7 is a promising target for rheumatoid arthritis therapy”, Cellular & Molecular Immunology 2018, 388, 1. [348] K. Schumann et al., “Immobilized chemokine fields and soluble chemokine gradients cooperatively shape migration patterns of dendritic cells.” Immunity 2010, 32, 703–713. [349] M. Love et al., “Solution Structure of CCL21 and Identification of a Putative CCR7 Binding Site”, Biochemistry 2012, 51, 733–735. [350] “Crystallographic Structure of Truncated CCL21 and the Putative Sulfotyrosine- Binding Site”, Biochemistry 2016, 55, 5746–5753. [351] C. Moussion et al., “Polysialylation controls dendritic cell trafficking by regu- lating chemokine recognition”, Science 2016, 351, 186–190. Bibliography 121

[352] A. S. Jørgensen et al., “CCL19 with CCL21-tail displays enhanced glycosamino- glycan binding with retained chemotactic potency in dendritic cells”, Journal of Leukocyte Biology 2018, 104, 401–411. [353] Veldkamp, Christopher T et al., “Solution Structure of CCL19 and Identification of Overlapping CCR7 and PSGL-1 Binding Sites”, Biochemistry 2015, 54, 4163– 4166. [354] K. M. Veerman et al., “Interaction of the selectin ligand PSGL-1 with chemokines CCL21 and CCL19 facilitates efficient homing of T cells to secondary lymphoid organs”, Nature Immunology 2007, 8, 532–539. [355] J. W. Dolan, “Detective work, part 3: Strong retention and chemical problems with the column”, LC-GC Europe 2016, 29, 28–30. [356] C. J. Wienken, P. Baaske, U. Rothbauer, D. Braun, S. Duhr, “Protein-binding assays in biological liquids using microscale thermophoresis”, Nature Commu- nications 2010, 1, 100. [357] W. McKinney, “pandas: a foundational Python library for data analysis and statistics”, PyHPC 2011. [358] T. E. Oliphant, “Python for scientific computing”, Computing in Science & Engi- neering 2007, 9, 10–20. [359] J. D. Hunter, “Matplotlib: A 2D Graphics Environment”, Computing in Science & Engineering 2007, 9, 90–95. [360] E. Travis, Oliphant. A guide to NumPy, 2006. [361] F. Pedregosa et al., “Scikit-learn: Machine Learning in Python”, Journal of Ma- chine Learning Research 2011, 12, 2825–2830. [362] D. A. Jackson, “Stopping rules in principal components analysis: A comparison of heuristical and statistical approaches”, Ecology 1993, 74, 2204–2214. [363] M.-Q. Zhang, B. Wilkinson, “Drug discovery beyond the ‘rule-of-five’”, Current Opinion in Biotechnology 2007, 18, 478–488. [364] M. J. Waring, “Defining optimum lipophilicity and molecular weight ranges for drug candidates-Molecular weight dependent lower log D limits based on permeability”, Bioorganic & Medicinal Chemistry Letters 2009, 19, 2844–2851. [365] S. Theodoridis, K. 
Koutroumbas in Pattern Recognition, Academic Press, Boston, 2009, pp. 261–322. [366] D. J. Craik, D. P. Fairlie, S. Liras, D. Price, “The Future of Peptide-based Drugs”, Chemical Biology & Drug Design 2012, 81, 136–147. [367] L. Di, “Strategic Approaches to Optimizing Peptide ADME Properties”, The AAPS Journal 2014, 17, 134–143. [368] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain”, Psychological Review 1958, 65, 386–408. [369] C. Zhao, E. Boriani, A. Chana, A. Roncaglioni, E. Benfenati, “A new hybrid system of QSAR models for predicting bioconcentration factors (BCF)”, Chemo- sphere 2008, 73, 1701–1707. [370] A. Lombardo, A. Roncaglioni, E. Boriani, C. Milan, E. Benfenati, “Assessment and validation of the CAESAR predictive model for bioconcentration factor (BCF) in fish”, Chemistry Central Journal 2010, 4, S1. [371] T. I. Netzeva et al., “Current status of methods for defining the applicability do- main of (quantitative) structure-activity relationships - The report and recom- mendations of ECVAM Workshop 52”, Atla-Alternatives to Laboratory Animals 2005, 33, 155–173. [372] A. M. Wassermann, M. Wawer, J. Bajorath, “Activity landscape representations for structure-activity relationship analysis.” Journal of Medicinal Chemistry 2010, 53, 8209–8223. 122 Bibliography

[373] G. M. Maggiora, “On Outliers and Activity CliffsWhy QSAR Often Disap- points”, Journal of Chemical Information and Modeling 2006, 46, 1535–1535. [374] I. V. Tetko et al., “Critical Assessment of QSAR Models of Environmental Tox- icity against Tetrahymena pyriformis: Focusing on Applicability Domain and Overfitting by Variable Selection”, Journal of Chemical Information and Modeling 2008, 48, 1733–1746. [375] J. A. Hiss et al., “Combinatorial chemistry by ant colony optimization”, Future Medicinal Chemistry 2014, 6, 267–280. [376] A. Sazonovas, “Personal Communication”, 2018. [377] M. Lawless, “Personal Communication”, 2018. [378] D. F. Ortwine, I. Aliagas, “Physicochemical and DMPK in silico models: Facili- tating their use by medicinal chemists”, Molecular Pharmaceutics 2013, 10, 1153– 1161. [379] T. W. Johnson, K. R. Dress, M. Edwards, “Using the Golden Triangle to optimize clearance and oral absorption”, Bioorganic & Medicinal Chemistry Letters 2009, 19, 5560–5564. [380] K. R. Manchester, P. D. Maskell, L. Waters, “Experimental versus theoretical log D7.4, pKaand plasma protein binding values for benzodiazepines appearing as new psychoactive substances”, Drug Testing and Analysis 2018, 10, 37. [381] M. C. Wenlock, R. P. Austin, P. Barton, A. M. Davis, “A comparison of physio- chemical property profiles of development and marketed oral drugs”, Journal of Medicinal Chemistry 2003, 46, 1250–1256. [382] T. J. Ritchie, C. N. Luscombe, S. J. F. Macdonald, “Analysis of the calculated physicochemical properties of respiratory drugs: Can we design for inhaled drugs yet?”, Journal of Chemical Information and Modeling 2009, 49, 1025–1032. [383] T. Hou, J. Wang, W. Zhang, X. Xu, “ADME evaluation in drug discovery. 6. Can oral biavailability in humans be effectively predicted by simple molecular property-based rules?”, Journal of Chemical Information and Modeling 2007, 47, 460–463. [384] F. Giordanetto, J. 
Kihlberg, “Macrocyclic drugs and clinical candidates: What can medicinal chemists learn from their properties?”, Journal of Medicinal Chem- istry 2014, 57, 278–295. [385] A. Ganesan, “The impact of natural products upon modern drug discovery”, Current Opinion in Chemical Biology 2008, 12, 306–317. [386] M. R. Yeaman, N. Y. Yount, “Mechanisms of antimicrobial peptide action and resistance.” Pharmacological Reviews 2003, 55, 27–55. [387] Y. Huang, J. Huang, Y. Chen, “Alpha-helical cationic antimicrobial peptides: relationships of structure and function.” Protein & Cell 2010, 1, 143–152. [388] W. P. Walters, M. T. Stahl, M. A. Murcko, “Virtual screening—an overview”, Drug Discovery Today 1998, 3, 160–178. [389] B. K. Shoichet, “Virtual screening of chemical libraries”, Nature 2004, 432, 862– 865. [390] G. Schneider, “Virtual screening: an endless staircase?”, Nature Reviews Drug Discovery 2010, 9, 273–276. [391] C. E. Shannon, “A Mathematical Theory of Communication”, Bell System Tech- nical Journal 1948, 27, 379–423. [392] K. L. Morrison, G. A. Weiss, “Combinatorial alanine-scanning”, Current Opinion in Chemical Biology 2001, 5, 302–307. [393] G. Corzo et al., “Solution structure and alanine scan of a spider toxin that af- fects the activation of mammalian voltage-gated sodium channels.” Journal of Biological Chemistry 2007, 282, 4643–4652. [394] A. G. Beck-Sickinger, H. A. Weland, H. Wittneben, K.-D. Willim, K. Rudolf, G. Jung, “Complete L-Alanine Scan of Reveals Ligands Binding Bibliography 123

to Y1 and Y2 Receptors with Distinguished Conformations”, European Journal of Biochemistry 1994, 225, 947–958. [395] B. Cunningham, J. Wells, “High-resolution epitope mapping of hGH-receptor interactions by alanine-scanning mutagenesis”, Science 1989, 244, 1081–1085. [396] J. Lee et al., “Computationally Designed Peptide Inhibitors of the Ubiquitin E3 Ligase SCFFbx4”, ChemBioChem 2013, 14, 445–451. [397] L. Boukharta, H. Gutiérrez-de-Terán, J. Åqvist, “Computational Prediction of Alanine Scanning and Ligand Binding Energetics in G-Protein Coupled Recep- tors”, PLOS Computational Biology 2014, 10, e1003585. [398] Y. Yan, M. Yang, C. G. Ji, J. Z. H. Zhang, “Interaction Entropy for Computational Alanine Scanning”, Journal of Chemical Information and Modeling 2017, 57, 1112– 1122. [399] S. Tshepelevitsh, K. Hernits, I. Leito, “Prediction of partition and distribu- tion coefficients in various solvent pairs with COSMO-RS”, Journal of Computer- Aided Molecular Design 2018, 43, 1–12. [400] H. Zhang, Y. Jiang, Z. Cui, C. Yin, “Force Field Benchmark of Amino Acids. 2. Partition Coefficients between Water and Organic Solvents”, Journal of Chemical Information and Modeling 2018, acs.jcim.8b00493. [401] M. D. Allenspach, J. A. Fuchs, N. Doriot, J. A. Hiss, G. Schneider, C. Steuer, “Quantification of hydrolyzed peptides and proteins by amino acid fluores- cence”, Journal of Peptide Science 2018, e3113.

125

A Supplementary Information

A.1 Supplementary Information to Chapter 1

Figure A.1: Depiction of the CC-chemokine system. CC-chemokine Receptors 1 to 10 (blue) and the three atypical receptors ACKR1, ACKR2 and ACKR4 (yellow) have CC-chemokines as ligands. The chemokines (CCL) in black have agonistic effects on the receptor, chemokines in red have antagonistic effects.

A.2 Supplementary Information to Chapter 4.1

HPLC-MS data of the synthesised and purified peptides for the prospective validation of the benchmark models. The analytical method is described in section 3.1.

HPLC-MS data for JF_1_1

Name Sequence MW [g/mol] Positive Charges

JF_1_1 ALIWGY-NH2 720.86 1

[Mass chromatogram of JF_1_1: intensity (TIC) versus time (0 to 25 min); labelled peak at 10.28 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 721.3, 722.5 and 723.35.]

Figure A.2: LC-MS (positive ion mode) data of peptide JF_1_1. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_2

Name Sequence MW [g/mol] Positive Charges

JF_1_2 FLGKVW-NH2 747.93 2

[Mass chromatogram of JF_1_2: intensity (TIC) versus time (0 to 25 min); labelled peak at 8.7 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 373.95, 374.7, 403.15, 747.55 and 748.35.]

Figure A.3: LC-MS (positive ion mode) data of peptide JF_1_2. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_3

Name Sequence MW [g/mol] Positive Charges

JF_1_3 GAWPFL-NH2 688.81 1

[Mass chromatogram of JF_1_3: intensity (TIC) versus time (0 to 25 min); labelled peak at 11.1 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 689.3, 690.4 and 691.35.]

Figure A.4: LC-MS (positive ion mode) data of peptide JF_1_3. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_4

Name Sequence MW [g/mol] Positive Charges

JF_1_4 IPFWKL-NH2 802.02 2

[Mass chromatogram of JF_1_4: intensity (TIC) versus time (0 to 25 min); labelled peak at 9.39 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 401.3, 401.6, 422.15 and 802.5.]

Figure A.5: LC-MS (positive ion mode) data of peptide JF_1_4. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_5

Name Sequence MW [g/mol] Positive Charges

JF_1_5 KLVWAF-NH2 761.95 2

[Mass chromatogram of JF_1_5: intensity (TIC) versus time (0 to 25 min); labelled peak at 9.4 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 381.75, 402.3, 762.35 and 763.5.]

Figure A.6: LC-MS (positive ion mode) data of peptide JF_1_5. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_6

Name Sequence MW [g/mol] Positive Charges

JF_1_6 LPVGWF-NH2 716.87 1

[Mass chromatogram of JF_1_6: intensity (TIC) versus time (0 to 25 min); labelled peak at 10.95 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 717.25, 718.6 and 775.65.]

Figure A.7: LC-MS (positive ion mode) data of peptide JF_1_6. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_7

Name Sequence MW [g/mol] Positive Charges

JF_1_7 LYLGWI-NH2 762.94 1

[Mass chromatogram of JF_1_7: intensity (TIC) versus time (0 to 25 min); labelled peak at 11.69 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 763.35 and 764.45.]

Figure A.8: LC-MS (positive ion mode) data of peptide JF_1_7. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_8

Name Sequence MW [g/mol] Positive Charges

JF_1_8 PWGYVA-NH2 690.79 1

[Mass chromatogram of JF_1_8: intensity (TIC) versus time (0 to 25 min); labelled peak at 9.21 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 690.55, 691.25, 692.4, 693.35 and 1381.55.]

Figure A.9: LC-MS (positive ion mode) data of peptide JF_1_8. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_10

Name Sequence MW [g/mol] Positive Charges

JF_1_10 VPAFII-NH2 657.84 1

[Mass chromatogram of JF_1_10: intensity (TIC) versus time (0 to 25 min); labelled peak at 10.57 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 657.55, 658.3 and 659.5.]

Figure A.10: LC-MS (positive ion mode) data of peptide JF_1_10. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_11

Name Sequence MW [g/mol] Positive Charges

JF_1_11 WPKIYV-NH2 803.99 2

[Mass chromatogram of JF_1_11: intensity (TIC) versus time (0 to 25 min); labelled peak at 8.87 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 402.35, 402.6 and 804.45.]

Figure A.11: LC-MS (positive ion mode) data of peptide JF_1_11. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_13

Name Sequence MW [g/mol] Positive Charges

JF_1_13 VLIWFV-NH2 774.99 1

[Mass chromatogram of JF_1_13: intensity (TIC) versus time (0 to 25 min); labelled peak at 12.08 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 775.35, 776.55 and 777.9.]

Figure A.12: LC-MS (positive ion mode) data of peptide JF_1_13. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

HPLC-MS data for JF_1_14

Name Sequence MW [g/mol] Positive Charges

JF_1_14 SVYLQP-NH2 704.81 1

[Mass chromatogram of JF_1_14: intensity (TIC) versus time (0 to 25 min); labelled peak at 7.71 min.]

[Mass spectrum: intensity versus m/z; labelled peaks at m/z 704.55, 705.3 and 706.4.]

Figure A.13: LC-MS (positive ion mode) data of peptide JF_1_14. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.
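The m/z labels in Figures A.2 to A.13 can be cross-checked against the peptide sequences. The following is a minimal sketch, not the analytical method of section 3.1: it assumes free N-termini and C-terminal amides (as the "-NH2" suffix in the sequence column suggests), uses standard monoisotopic residue masses, and treats protonation as the only ionisation route. The function names are hypothetical. Note that the MW values tabulated above appear to be average masses, whereas the labelled peaks are closer to the monoisotopic values computed here.

```python
# Standard monoisotopic residue masses in Da for the residues that occur
# in peptides JF_1_1 to JF_1_14 (assumption: no side-chain modifications).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "I": 113.08406, "Q": 128.05858,
    "K": 128.09496, "F": 147.06841, "Y": 163.06333, "W": 186.07931,
}
HYDROGEN = 1.007825   # monoisotopic mass of a hydrogen atom
NITROGEN = 14.003074  # monoisotopic mass of a nitrogen atom
PROTON = 1.007276     # mass of a proton (charge carrier in positive ion mode)


def amide_monoisotopic_mass(sequence: str) -> float:
    """Neutral monoisotopic mass of H-[sequence]-NH2 (hypothetical helper)."""
    # One H on the N-terminus plus NH2 on the C-terminus adds N + 3 H in total.
    terminal_groups = NITROGEN + 3 * HYDROGEN
    return sum(RESIDUE_MASS[residue] for residue in sequence) + terminal_groups


def expected_mz(sequence: str, charge: int) -> float:
    """m/z of the [M + charge*H]^(charge+) ion."""
    mass = amide_monoisotopic_mass(sequence)
    return (mass + charge * PROTON) / charge


if __name__ == "__main__":
    # JF_1_1 (ALIWGY-NH2), singly protonated; Figure A.2 labels 721.3.
    print(f"{expected_mz('ALIWGY', 1):.2f}")  # 721.40
    # JF_1_2 (FLGKVW-NH2), doubly protonated; Figure A.3 labels 374.7.
    print(f"{expected_mz('FLGKVW', 2):.2f}")  # 374.73
```

The peptides listed with two positive charges (for example JF_1_2) show a labelled peak near the doubly protonated value, which is why both charge states are worth computing.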

Figure A.14: Scatter plots of LIPOPEP EV (orange) and AZ EV (blue) prediction errors in relation to the number of training set compounds whose experimental logD7.4 deviates by ≤ 0.5 log units from that of the respective query compound. Plots 1, 3 and 5 depict the results when the models were trained on the LIPOPEP training set, plots 2, 4 and 6 when the models were trained on the AZ training set. Predictions within the area of the straight lines are accurate, those within the dashed lines acceptable. EV: external validation.
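The quantity on the horizontal axis of Figure A.14 can be illustrated with a short sketch. It is an assumption-laden illustration rather than the code used for the figure: the function name is hypothetical, the query value of -0.5 is invented for the example, and the eight "training" values are simply the first eight experimental logD entries of the LIPOPEP table in this appendix.

```python
from typing import Iterable


def count_close_training_compounds(
    query_logd: float,
    training_logd: Iterable[float],
    tolerance: float = 0.5,
) -> int:
    """Count training compounds whose experimental logD lies within
    `tolerance` log units of the query compound's experimental logD."""
    return sum(1 for value in training_logd if abs(value - query_logd) <= tolerance)


if __name__ == "__main__":
    # Experimental logD values of LIPOPEP entries 1 to 8 (AGA ... AEA).
    training = [-0.6, -0.51, 1.01, 1.25, -0.39, -0.48, -0.74, -0.67]
    # Hypothetical query compound with an experimental logD of -0.5.
    print(count_close_training_compounds(-0.5, training))  # 6
```

A query whose count is low sits in a sparsely populated region of the training logD range, which is the situation the scatter plots relate to the prediction error.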

Summary of the Peptides in the LIPOPEP and In-House Datasets.

ID  Sequence  SMILES  C-Terminus  N-Terminus  Ionisable  logD (exp)  pH (aq)

LIPOPEP
1  AGA  [H]N(C(C)=O)[C@H](C(NCC(N[C@H](C(NC(C)(C)C)=O)C)=O)=O)C  CONHt-Butyl  NHCOCH3  No  -0.6  7.2
2  AAA  [H]N(C(C)=O)[C@H](C(NC(C)C(N[C@H](C(NC(C)(C)C)=O)C)=O)=O)C  CONHt-Butyl  NHCOCH3  No  -0.51  7.2
3  AFA  [H]N(C(C)=O)[C@H](C(NC(CC1=CC=CC=C1)C(N[C@H](C(NC(C)(C)C)=O)C)=O)=O)C  CONHt-Butyl  NHCOCH3  No  1.01  7.2
4  AWA  [H]N(C(C)=O)[C@H](C(N[C@H](C(N[C@H](C(NC(C)(C)C)=O)C)=O)CC1=CNC2=C1C=CC=C2)=O)C  CONHt-Butyl  NHCOCH3  No  1.25  7.2
5  APA  [H]N(C(C)=O)[C@H](C(N1CCC[C@H]1C(N[C@H](C(NC(C)(C)C)=O)C)=O)=O)C  CONHt-Butyl  NHCOCH3  No  -0.39  7.2
6  AHA  O=C(N[C@H](C(=O)N[C@@H](Cc1[nH]cnc1)C(=O)N[C@H](C(=O)NC(C)(C)C)C)C)C  CONHt-Butyl  NHCOCH3  Yes  -0.48  7.2
7  ADA  [H]N(C(C)=O)[C@H](C(N[C@H](C(N[C@H](C(NC(C)(C)C)=O)C)=O)CC(O)=O)=O)C  CONHt-Butyl  NHCOCH3  No  -0.74  7.2
8  AEA  [H]N(C(C)=O)[C@H](C(N[C@@H](CCC(O)=O)C(N[C@H](C(NC(C)(C)C)=O)C)=O)=O)C  CONHt-Butyl  NHCOCH3  No  -0.67  7.2
9  (D-)FG  O=C([O-])CNC(=O)[C@H]([NH3+])Cc1ccccc1  COOH  NH2  Yes  -2.16  7.2
10  (D-)F(D-)FG  O=C(N[C@H](Cc1ccccc1)C(=O)N[C@H](Cc1ccccc1)C(=O)[O-])C[NH3+]  COOH  NH2  Yes  -1.46  7.2
11  (D-)F(D-)F(D-)FG  O=C([O-])CNC(=O)[C@H](NC(=O)[C@H](NC(=O)[C@H]([NH3+])Cc1ccccc1)Cc1ccccc1)Cc1ccccc1  COOH  NH2  Yes  -0.66  7.2
12  (D-)F(D-)F  [H]N(C(C)=O)[C@@H](C(N[C@@H](C(N)=O)CC1=CC=CC=C1)=O)CC2=CC=CC=C2  CONH2  NHCOCH3  No  1.19  7.2
13  (D-)F(D-)F(D-)F  [H]N(C(C)=O)[C@@H](C(N[C@@H](C(N[C@@H](C(N)=O)CC1=CC=CC=C1)=O)CC2=CC=CC=C2)=O)CC3=CC=CC=C3  CONH2  NHCOCH3  No  2.3  7.2
14  FL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H]([NH3+])Cc1ccccc1  COOH  NH2  Yes  -1.17  7.0
15  LF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])[C@@H]([NH3+])CC(C)C  COOH  NH2  Yes  -1.15  7.0
16  FF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])[C@@H]([NH3+])Cc1ccccc1  COOH  NH2  Yes  -0.85  7.0
17  LL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H]([NH3+])CC(C)C  COOH  NH2  Yes  -1.46  7.0
18  LV  O=C(N[C@@H](C(C)C)C(=O)[O-])[C@@H]([NH3+])CC(C)C  COOH  NH2  Yes  -2.05  7.0
19  VL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H]([NH3+])C(C)C  COOH  NH2  Yes  -2.07  7.0
20  AI  O=C(N[C@@H]([C@H](CC)C)C(=O)[O-])[C@@H]([NH3+])C  COOH  NH2  Yes  -2.60  7.0
21  II  O=C(N[C@@H]([C@H](CC)C)C(=O)[O-])[C@@H]([NH3+])[C@H](CC)C  COOH  NH2  Yes  -1.82  7.0
22  LI  O=C(N[C@@H]([C@H](CC)C)C(=O)[O-])[C@@H]([NH3+])CC(C)C  COOH  NH2  Yes  -1.64  7.0
23  VV  O=C(N[C@@H](C(C)C)C(=O)[O-])[C@@H]([NH3+])C(C)C  COOH  NH2  Yes  -2.82  7.0
24  WW  O=C(N[C@@H](Cc1c2c([nH]c1)cccc2)C(=O)[O-])[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2  COOH  NH2  Yes  -0.27  7.0
25  WF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2  COOH  NH2  Yes  -0.47  7.0
26  WA  O=C(N[C@H](C(=O)[O-])C)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2  COOH  NH2  Yes  -1.98  7.0
27  WL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2  COOH  NH2  Yes  -0.73  7.0
28  WY  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2)C(=O)[O-]  COOH  NH2  Yes  -1.13  7.0
29  LY  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H]([NH3+])CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -1.94  7.0
30  YL  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)N[C@@H](CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -1.75  7.0
31  VY  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H]([NH3+])C(C)C)C(=O)[O-]  COOH  NH2  Yes  -2.52  7.0
32  FY  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H]([NH3+])Cc1ccccc1)C(=O)[O-]  COOH  NH2  Yes  -1.68  7.0
33  YY  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H]([NH3+])Cc1ccc(O)cc1)C(=O)[O-]  COOH  NH2  Yes  -1.87  7.0
34  LM  S(CC[C@H](NC(=O)[C@@H]([NH3+])CC(C)C)C(=O)[O-])C  COOH  NH2  Yes  -1.87  7.0
35  ML  S(CC[C@H]([NH3+])C(=O)N[C@@H](CC(C)C)C(=O)[O-])C  COOH  NH2  Yes  -1.84  7.0
36  MV  S(CC[C@H]([NH3+])C(=O)N[C@@H](C(C)C)C(=O)[O-])C  COOH  NH2  Yes  -2.53  7.0

37  FM  S(CC[C@H](NC(=O)[C@@H]([NH3+])Cc1ccccc1)C(=O)[O-])C  COOH  NH2  Yes  -1.59  7.0
38  SL  OC[C@H]([NH3+])C(=O)N[C@@H](CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -2.49  7.0
39  PF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])[C@H]1[NH2+]CCC1  COOH  NH2  Yes  -2.07  7.0
40  PL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@H]1[NH2+]CCC1  COOH  NH2  Yes  -2.41  7.0
41  PI  O=C(N[C@@H]([C@H](CC)C)C(=O)[O-])[C@H]1[NH2+]CCC1  COOH  NH2  Yes  -2.56  7.0
42  FP  O=C([O-])[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])Cc1ccccc1  COOH  NH2  Yes  -1.36  7.0
43  LP  O=C([O-])[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])CC(C)C  COOH  NH2  Yes  -1.76  7.0
44  IP  O=C([O-])[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])[C@H](CC)C  COOH  NH2  Yes  -1.79  7.0
45  FFF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])Cc1ccccc1)Cc1ccccc1  COOH  NH2  Yes  -0.02  7.0
46  GFF  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])C[NH3+]  COOH  NH2  Yes  -1.33  7.0
47  FVG  O=C([O-])CNC(=O)[C@@H](NC(=O)[C@@H]([NH3+])Cc1ccccc1)C(C)C  COOH  NH2  Yes  -2.33  7.0
48  FVF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])Cc1ccccc1)C(C)C  COOH  NH2  Yes  -0.76  7.0
49  FVA  O=C(N[C@H](C(=O)[O-])C)[C@@H](NC(=O)[C@@H]([NH3+])Cc1ccccc1)C(C)C  COOH  NH2  Yes  -2.19  7.0
50  LVV  O=C(N[C@@H](C(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)C(C)C  COOH  NH2  Yes  -2.10  7.0
51  LII  O=C(N[C@@H]([C@H](CC)C)C(=O)N[C@@H]([C@H](CC)C)C(=O)[O-])[C@@H]([NH3+])CC(C)C  COOH  NH2  Yes  -1.11  7.0
52  LVL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)C(C)C  COOH  NH2  Yes  -1.57  7.0
53  LAL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)C  COOH  NH2  Yes  -2.03  7.0
54  LLL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)CC(C)C  COOH  NH2  Yes  -0.94  7.0
55  WGG  O=C(NCC(=O)[O-])CNC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2  COOH  NH2  Yes  -2.72  7.0
56  WFA  O=C(N[C@H](C(=O)[O-])C)[C@@H](NC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2)Cc1ccccc1  COOH  NH2  Yes  -1.00  7.0
57  WWL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2)Cc1c2c([nH]c1)cccc2  COOH  NH2  Yes  0.36  7.0
58  LLY  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -1.34  7.0
59  VFY  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])C(C)C)Cc1ccccc1)C(=O)[O-]  COOH  NH2  Yes  -1.50  7.0
60  GFY  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H](NC(=O)C[NH3+])Cc1ccccc1)C(=O)[O-]  COOH  NH2  Yes  -1.96  7.0
61  YLV  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](C(C)C)C(=O)[O-]  COOH  NH2  Yes  -1.45  7.0
62  YVF  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-]  COOH  NH2  Yes  -1.37  7.0
63  YGF  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)NCC(=O)N[C@@H](C(C)C)C(=O)[O-]  COOH  NH2  Yes  -1.86  7.0
64  YYL  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H]([NH3+])Cc1ccc(O)cc1)C(=O)N[C@@H](CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -1.38  7.0
65  AYI  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H]([NH3+])C)C(=O)N[C@@H]([C@H](CC)C)C(=O)[O-]  COOH  NH2  Yes  -2.04  7.0
66  IYV  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)C(=O)N[C@@H](C(C)C)C(=O)[O-]  COOH  NH2  Yes  -1.77  7.0
67  MLF  S(CC[C@H]([NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])C  COOH  NH2  Yes  -1.03  7.0
68  LSL  OC[C@H](NC(=O)[C@@H]([NH3+])CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -2.35  7.0
69  ISL  OC[C@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -2.28  7.0
70  ISI  OC[C@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)C(=O)N[C@@H]([C@H](CC)C)C(=O)[O-]  COOH  NH2  Yes  -2.64  7.0
71  SLI  OC[C@H]([NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H]([C@H](CC)C)C(=O)[O-]  COOH  NH2  Yes  -1.99  7.0
72  SLL  OC[C@H]([NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -2.03  7.0
73  FIT  O=C(N[C@@H]([C@H](CC)C)C(=O)N[C@@H]([C@H](O)C)C(=O)[O-])[C@@H]([NH3+])Cc1ccccc1  COOH  NH2  Yes  -1.95  7.0
74  LIT  O=C(N[C@@H]([C@H](CC)C)C(=O)N[C@@H]([C@H](O)C)C(=O)[O-])[C@@H]([NH3+])CC(C)C  COOH  NH2  Yes  -2.14  7.0

75  IIT  O=C(N[C@@H]([C@H](O)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)[C@H](CC)C  COOH  NH2  Yes  -2.23  7.0
76  LTI  O=C(N[C@@H]([C@H](O)C)C(=O)N[C@@H]([C@H](CC)C)C(=O)[O-])[C@@H]([NH3+])CC(C)C  COOH  NH2  Yes  -2.30  7.0
77  TLI  O=C(N[C@@H]([C@H](CC)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])[C@H](O)C)CC(C)C  COOH  NH2  Yes  -1.66  7.0
78  TVL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])[C@H](O)C)C(C)C  COOH  NH2  Yes  -1.97  7.0
79  PLL  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-])[C@H]1[NH2+]CCC1  COOH  NH2  Yes  -1.64  7.0
80  LPL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])CC(C)C  COOH  NH2  Yes  -1.56  7.0
81  LLP  O=C([O-])[C@H]1N(CCC1)C(=O)[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)CC(C)C  COOH  NH2  Yes  -1.58  7.0
82  IPI  O=C(N[C@@H]([C@H](CC)C)C(=O)[O-])[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])[C@H](CC)C  COOH  NH2  Yes  -1.65  7.0
83  FGGF  O=C(NCC(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H]([NH3+])Cc1ccccc1  COOH  NH2  Yes  -1.51  7.0
84  VAAF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])C(C)C)C)C  COOH  NH2  Yes  -1.91  7.0
85  LLVF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)CC(C)C)C(C)C  COOH  NH2  Yes  -0.25  7.0
86  LLLV  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](C(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)CC(C)C  COOH  NH2  Yes  -0.51  7.0
87  VGFF  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H]([NH3+])C(C)C  COOH  NH2  Yes  -0.51  7.0
88  AVLL  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])C)C(C)C  COOH  NH2  Yes  -1.74  7.0
89  IAGF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)C  COOH  NH2  Yes  -1.78  7.0
90  FFFF  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])Cc1ccccc1)Cc1ccccc1  COOH  NH2  Yes  1.63  7.0
91  LLGF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)CC(C)C  COOH  NH2  Yes  -0.42  7.0
92  LLAF  O=C(N[C@H](C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])C)[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)CC(C)C  COOH  NH2  Yes  -1.00  7.0
93  LLLF  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)CC(C)C  COOH  NH2  Yes  0.24  7.0
94  IIVV  O=C(N[C@@H](C(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)[C@H](CC)C)C(C)C  COOH  NH2  Yes  -1.41  7.0
95  IIGF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)[C@H](CC)C  COOH  NH2  Yes  -0.99  7.0
96  IAAI  O=C(N[C@H](C(=O)N[C@@H]([C@H](CC)C)C(=O)[O-])C)[C@@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)C  COOH  NH2  Yes  -2.82  7.0
97  FFGF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H](NC(=O)[C@@H]([NH3+])Cc1ccccc1)Cc1ccccc1  COOH  NH2  Yes  0.17  7.0
98  VLVL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])C(C)C)CC(C)C)C(C)C  COOH  NH2  Yes  -1.23  7.0
99  WLLV  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](C(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2)CC(C)C  COOH  NH2  Yes  0.23  7.0
100  WGLL  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-])CNC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2  COOH  NH2  Yes  0.06  7.0
101  YILG  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CC(C)C)C(=O)NCC(=O)[O-]  COOH  NH2  Yes  -1.49  7.0
102  FVYF  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])Cc1ccccc1)C(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-]  COOH  NH2  Yes  -0.32  7.0
103  IYIV  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](C(C)C)C(=O)[O-]  COOH  NH2  Yes  -1.09  7.0
104  VFLT  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H]([C@H](O)C)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])C(C)C)Cc1ccccc1  COOH  NH2  Yes  -1.32  7.0
105  MILI  S(CC[C@H]([NH3+])C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H]([C@H](CC)C)C(=O)[O-])C  COOH  NH2  Yes  -0.49  7.0

106  VMFI  S(CC[C@H](NC(=O)[C@@H]([NH3+])C(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H]([C@H](CC)C)C(=O)[O-])C  COOH  NH2  Yes  -0.63  7.0
107  PLLL  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-])[C@H]1[NH2+]CCC1  COOH  NH2  Yes  -1.06  7.0
108  LPLL  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-])[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])CC(C)C  COOH  NH2  Yes  -0.92  7.0
109  LLPL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@H]1N(CCC1)C(=O)[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)CC(C)C  COOH  NH2  Yes  -1.00  7.0
110  LLLP  O=C([O-])[C@H]1N(CCC1)C(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)CC(C)C)CC(C)C  COOH  NH2  Yes  -1.18  7.0
111  IPGI  O=C(NCC(=O)N[C@@H]([C@H](CC)C)C(=O)[O-])[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])[C@H](CC)C  COOH  NH2  Yes  -1.69  7.0
112  VPVL  O=C(N[C@@H](C(C)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-])[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])C(C)C  COOH  NH2  Yes  -1.91  7.0
113  VPGV  O=C(NCC(=O)N[C@@H](C(C)C)C(=O)[O-])[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])C(C)C  COOH  NH2  Yes  -2.83  7.0
114  YPGW  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)N1CCC[C@H]1C(=O)NCC(=O)N[C@@H](Cc1c2c([nH]c1)cccc2)C(=O)[O-]  COOH  NH2  Yes  -1.25  7.0
115  YPGI  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)N1CCC[C@H]1C(=O)NCC(=O)N[C@@H]([C@H](CC)C)C(=O)[O-]  COOH  NH2  Yes  -1.65  7.0
116  GGFVF  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)C[NH3+]  COOH  NH2  Yes  -1.40  7.0
117  VFVGL  O=C(N[C@@H](CC(C)C)C(=O)[O-])CNC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])C(C)C)Cc1ccccc1)C(C)C  COOH  NH2  Yes  -0.97  7.0
118  VGFVF  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H]([NH3+])C(C)C  COOH  NH2  Yes  -0.50  7.0
119  GAALL  O=C(N[C@H](C(=O)N[C@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-])C)C)C[NH3+]  COOH  NH2  Yes  -2.55  7.0
120  AFGVF  O=C(N[C@@H](C(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H](NC(=O)[C@@H]([NH3+])C)Cc1ccccc1  COOH  NH2  Yes  -0.59  7.0
121  AGFVF  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H]([NH3+])C  COOH  NH2  Yes  -1.10  7.0
122  LIIGA  O=C(N[C@H](C(=O)[O-])C)CNC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)[C@H](CC)C)[C@H](CC)C  COOH  NH2  Yes  -1.65  7.0
123  GLLGF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H](NC(=O)[C@@H](NC(=O)C[NH3+])CC(C)C)CC(C)C  COOH  NH2  Yes  -0.18  7.0
124  ALLGF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])C)CC(C)C)CC(C)C  COOH  NH2  Yes  -0.63  7.0
125  IIIIG  O=C([O-])CNC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)[C@H](CC)C)[C@H](CC)C)[C@H](CC)C  COOH  NH2  Yes  -0.97  7.0
126  IVVVI  O=C(N[C@@H](C(C)C)C(=O)N[C@@H]([C@H](CC)C)C(=O)[O-])[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)C(C)C)C(C)C  COOH  NH2  Yes  -0.89  7.0
127  FGAGI  O=C(N[C@H](C(=O)NCC(=O)N[C@@H]([C@H](CC)C)C(=O)[O-])C)CNC(=O)[C@@H]([NH3+])Cc1ccccc1  COOH  NH2  Yes  -1.87  7.0
128  FAAAL  O=C(N[C@@H](CC(C)C)C(=O)[O-])[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])Cc1ccccc1)C)C)C  COOH  NH2  Yes  -2.23  7.0
129  WGGFV  O=C(NCC(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](C(C)C)C(=O)[O-])CNC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2  COOH  NH2  Yes  -0.44  7.0
130  WLFAA  O=C(N[C@H](C(=O)N[C@H](C(=O)[O-])C)C)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2)CC(C)C)Cc1ccccc1  COOH  NH2  Yes  -0.32  7.0
131  IAYWG  Oc1ccc(cc1)C[C@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])[C@H](CC)C)C)C(=O)N[C@@H](Cc1c2c([nH]c1)cccc2)C(=O)NCC(=O)[O-]  COOH  NH2  Yes  -1.47  7.0
132  GLSVL  OC[C@H](NC(=O)[C@@H](NC(=O)C[NH3+])CC(C)C)C(=O)N[C@@H](C(C)C)C(=O)N[C@@H](CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -1.64  7.0
133  SLAIV  OC[C@H]([NH3+])C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](C(C)C)C(=O)[O-])C  COOH  NH2  Yes  -1.94  7.0
134  YTGFL  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)N[C@@H]([C@H](O)C)C(=O)NCC(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -1.18  7.0

135  LVGTF  O=C(N[C@@H]([C@H](O)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])CNC(=O)[C@@H](NC(=O)[C@@H]([NH3+])CC(C)C)C(C)C  COOH  NH2  Yes  -1.18  7.0
136  YGGFL  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)NCC(=O)NCC(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(C)C)C(=O)[O-]  COOH  NH2  Yes  -0.80  7.0
137  YGGFM  S(CC[C@H](NC(=O)[C@@H](NC(=O)CNC(=O)CNC(=O)[C@@H]([NH3+])Cc1ccc(O)cc1)Cc1ccccc1)C(=O)[O-])C  COOH  NH2  Yes  -1.39  7.0
138  FF  O=C(N[C@@H](Cc1ccccc1)C(=O)[O-])[C@@H]([NH3+])Cc1ccccc1  COOH  NH2  Yes  -0.94  7.4
139  WW  O=C(N[C@@H](Cc1c2c([nH]c1)cccc2)C(=O)[O-])[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2  COOH  NH2  Yes  -0.35  7.4
140  WWW  O=C(N[C@@H](Cc1c2c([nH]c1)cccc2)C(=O)[O-])[C@@H](NC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2)Cc1c2c([nH]c1)cccc2  COOH  NH2  Yes  0.51  7.4
141  WMDF  S(CC[C@H](NC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  COOH  NH3  Yes  1.60  7.4
142  WMRF  S(CC[C@H](NC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2)C(=O)N[C@@H](CCCNC(=[NH2+])N)C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  COOH  NH4  Yes  1.90  7.4
143  WDMF  S(CC[C@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cccc2)CC(=O)[O-])C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  COOH  NH5  Yes  1.70  7.4
144  SQDG  OC[C@H]([NH3+])C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(=O)[O-])C(=O)NCC(=O)N  COOH  NH6  Yes  -2.40  7.4
145  SQRG  OC[C@H]([NH3+])C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CCCNC(=[NH2+])N)C(=O)NCC(=O)N  COOH  NH7  Yes  -2.40  7.4
146  GV  [H]N(C(C)=O)CC(N[C@@H](C(C)C)C(N)=O)=O  CONH2  NHCOCH3  No  -1.33  7.0
147  AV  [H]N(C(C)=O)[C@H](C(N[C@@H](C(C)C)C(N)=O)=O)C  CONH2  NHCOCH3  No  -1.13  7.0
148  LV  [H]N(C(C)=O)[C@@H](CC(C)C)C(N[C@@H](C(C)C)C(N)=O)=O  CONH2  NHCOCH3  No  0.26  7.0
149  GF  [H]N(C(C)=O)CC(N[C@H](C(N)=O)CC1=CC=CC=C1)=O  CONH2  NHCOCH3  No  -0.56  7.0
150  IV  [H]N(C(C)=O)[C@]([H])(C(N[C@H](C(N)=O)C(C)C)=O)[C@@H](C)CC  CONH2  NHCOCH3  No  0.16  7.0
151  VV  [H]N(C(C)=O)[C@@H](C(C)C)C(N[C@@H](C(C)C)C(N)=O)=O  CONH2  NHCOCH3  No  -0.32  7.0
152  FV  [H]N(C(C)=O)[C@H](C(N[C@@H](C(C)C)C(N)=O)=O)CC1=CC=CC=C1  CONH2  NHCOCH3  No  0.43  7.0
153  AL  [H]N(C(C)=O)[C@H](C(N[C@@H](CC(C)C)C(N)=O)=O)C  CONH2  NHCOCH3  No  -0.54  7.0
154  AA  [H]N(C(C)=O)[C@H](C(N[C@H](C(N)=O)C)=O)C  CONH2  NHCOCH3  No  -2.01  7.0
155  GL  [H]N(C(C)=O)CC(N[C@@H](CC(C)C)C(N)=O)=O  CONH2  NHCOCH3  No  -0.78  7.0
156  LI  [H]N(C(C)=O)[C@@H](CC(C)C)C(N[C@]([C@@H](C)CC)([H])C(N)=O)=O  CONH2  NHCOCH3  No  0.68  7.0
157  FG  [H]N(C(C)=O)[C@H](C(NCC(N)=O)=O)CC1=CC=CC=C1  CONH2  NHCOCH3  No  -0.5  7.0
158  VA  [H]N(C(C)=O)[C@@H](C(C)C)C(N[C@H](C(N)=O)C)=O  CONH2  NHCOCH3  No  -1.14  7.0
159  YV  [H]N(C(C)=O)[C@H](C(N[C@@H](C(C)C)C(N)=O)=O)CC1=CC=C(O)C=C1  CONH2  NHCOCH3  No  -0.2  7.0
160  YL  [H]N(C(C)=O)[C@H](C(N[C@@H](CC(C)C)C(N)=O)=O)CC1=CC=C(O)C=C1  CONH2  NHCOCH3  No  0.32  7.0
161  YF  [H]N(C(C)=O)[C@H](C(N[C@H](C(N)=O)CC1=CC=CC=C1)=O)CC2=CC=C(O)C=C2  CONH2  NHCOCH3  No  0.54  7.0
162  WV  [H]N(C(C)=O)[C@H](C(N[C@@H](C(C)C)C(N)=O)=O)CC1=CNC2=C1C=CC=C2  CONH2  NHCOCH3  No  0.73  7.0
163  MV  [H]N(C(C)=O)[C@H](C(N[C@@H](C(C)C)C(N)=O)=O)CCSC  CONH2  NHCOCH3  No  -0.28  7.0
164  MF  [H]N(C(C)=O)[C@H](C(N[C@H](C(N)=O)CC1=CC=CC=C1)=O)CCSC  CONH2  NHCOCH3  No  0.42  7.0
165  SV  [H]N(C(C)=O)[C@@H](CO)C(N[C@@H](C(C)C)C(N)=O)=O  CONH2  NHCOCH3  No  -1.53  7.0
166  SF  [H]N(C(C)=O)[C@@H](CO)C(N[C@H](C(N)=O)CC1=CC=CC=C1)=O  CONH2  NHCOCH3  No  -0.79  7.0
167  TV  [H]N(C(C)=O)[C@]([C@@H](C)O)([H])C(N[C@@H](C(C)C)C(N)=O)=O  CONH2  NHCOCH3  No  -1.25  7.0
168  TI  [H]N(C(C)=O)[C@]([C@@H](C)O)([H])C(N[C@]([C@@H](C)CC)([H])C(N)=O)=O  CONH2  NHCOCH3  No  -0.86  7.0
169  NV  [H]N(C(C)=O)[C@@H](CC(N)=O)C(N[C@@H](C(C)C)C(N)=O)=O  CONH2  NHCOCH3  No  -1.85  7.0
170  NI  [H]N(C(C)=O)[C@@H](CC(N)=O)C(N[C@]([C@@H](C)CC)([H])C(N)=O)=O  CONH2  NHCOCH3  No  -1.43  7.0
171  NF  [H]N(C(C)=O)[C@@H](CC(N)=O)C(N[C@H](C(N)=O)CC1=CC=CC=C1)=O  CONH2  NHCOCH3  No  -1.14  7.0

172  LN  [H]N(C(C)=O)[C@@H](CC(C)C)C(N[C@@H](CC(N)=O)C(N)=O)=O  CONH2  NHCOCH3  No  -1.3  7.0
173  IN  [H]N(C(C)=O)[C@]([C@H](CC)C)([H])C(N[C@@H](CC(N)=O)C(N)=O)=O  CONH2  NHCOCH3  No  -1.41  7.0
174  QV  [H]N(C(C)=O)[C@@H](CCC(N)=O)C(N[C@@H](C(C)C)C(N)=O)=O  CONH2  NHCOCH3  No  -1.85  7.0
175  QL  [H]N(C(C)=O)[C@@H](CCC(N)=O)C(N[C@@H](CC(C)C)C(N)=O)=O  CONH2  NHCOCH3  No  -1.32  7.0
176  QF  [H]N(C(C)=O)[C@@H](CCC(N)=O)C(N[C@H](C(N)=O)CC1=CC=CC=C1)=O  CONH2  NHCOCH3  No  -1.14  7.0
177  FQ  [H]N(C(C)=O)[C@H](C(N[C@@H](CCC(N)=O)C(N)=O)=O)CC1=CC=CC=C1  CONH2  NHCOCH3  No  -1.03  7.0
178  VQ  [H]N(C(C)=O)[C@@H](C(C)C)C(N[C@@H](CCC(N)=O)C(N)=O)=O  CONH2  NHCOCH3  No  -1.82  7.0
179  KF  O=C(N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  -2.43  7.0
180  FK  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCC[NH3+])C(=O)N)C  CONH2  NHCOCH3  Yes  -2.23  7.0
181  OrnF  O=C(N[C@@H](CCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  -2.23  7.0
182  VAA  [H]N(C(C)=O)[C@@H](C(C)C)C(N[C@H](C(N[C@H](C(N)=O)C)=O)C)=O  CONH2  NHCOCH3  No  -1.4  7.0
183  VAV  [H]N(C(C)=O)[C@@H](C(C)C)C(N[C@H](C(N[C@@H](C(C)C)C(N)=O)=O)C)=O  CONH2  NHCOCH3  No  -0.67  7.0
184  VIG  [H]N(C(C)=O)[C@@H](C(C)C)C(N[C@]([C@@H](C)CC)([H])C(NCC(N)=O)=O)=O  CONH2  NHCOCH3  No  -0.45  7.0
185  ALV  [H]N(C(C)=O)[C@H](C(N[C@@H](CC(C)C)C(N[C@@H](C(C)C)C(N)=O)=O)=O)C  CONH2  NHCOCH3  No  -0.14  7.0
186  VFA  [H]N(C(C)=O)[C@@H](C(C)C)C(N[C@H](C(N[C@H](C(N)=O)C)=O)CC1=CC=CC=C1)=O  CONH2  NHCOCH3  No  0.06  7.0
187  AVI  [H]N(C(C)=O)[C@H](C(N[C@@H](C(C)C)C(N[C@]([C@@H](C)CC)([H])C(N)=O)=O)=O)C  CONH2  NHCOCH3  No  -0.2  7.0
188  IFA  [H]N(C(C)=O)[C@]([C@H](CC)C)([H])C(N[C@H](C(N[C@H](C(N)=O)C)=O)CC1=CC=CC=C1)=O  CONH2  NHCOCH3  No  0.52  7.0
189  GAV  [H]N(C(C)=O)CC(N[C@H](C(N[C@@H](C(C)C)C(N)=O)=O)C)=O  CONH2  NHCOCH3  No  -1.56  7.0
190  AGF  [H]N(C(C)=O)[C@H](C(NCC(N[C@H](C(N)=O)CC1=CC=CC=C1)=O)=O)C  CONH2  NHCOCH3  No  -0.71  7.0
191  IAV  [H]N(C(C)=O)[C@]([C@H](CC)C)([H])C(N[C@H](C(N[C@@H](C(C)C)C(N)=O)=O)C)=O  CONH2  NHCOCH3  No  -0.21  7.0
192  FGL  [H]N(C(C)=O)[C@H](C(NCC(N[C@@H](CC(C)C)C(N)=O)=O)=O)CC1=CC=CC=C1  CONH2  NHCOCH3  No  0.6  7.0
193  FIG  [H]N(C(C)=O)[C@H](C(N[C@]([C@@H](C)CC)([H])C(NCC(N)=O)=O)=O)CC1=CC=CC=C1  CONH2  NHCOCH3  No  0.34  7.0
194  VVI  [H]N(C(C)=O)[C@@H](C(C)C)C(N[C@@H](C(C)C)C(N[C@]([C@@H](C)CC)([H])C(N)=O)=O)=O  CONH2  NHCOCH3  No  0.49  7.0
195  GLG  [H]N(C(C)=O)CC(N[C@@H](CC(C)C)C(NCC(N)=O)=O)=O  CONH2  NHCOCH3  No  -1.23  7.0
196  AYL  [H]N(C(C)=O)[C@H](C(N[C@H](C(N[C@@H](CC(C)C)C(N)=O)=O)CC(C=C1)=CC=C1O)=O)C  CONH2  NHCOCH3  No  -0.04  7.0
197  AYF  [H]N(C(C)=O)[C@H](C(N[C@H](C(N[C@H](C(N)=O)CC1=CC=CC=C1)=O)CC(C=C2)=CC=C2O)=O)C  CONH2  NHCOCH3  No  0.26  7.0
198  WAA  [H]N(C(C)=O)[C@H](C(N[C@H](C(N[C@H](C(N)=O)C)=O)C)=O)CC1=CNC2=C1C=CC=C2  CONH2  NHCOCH3  No  -0.38  7.0
199  WIG  [H]N(C(C)=O)[C@H](C(N[C@]([C@@H](C)CC)([H])C(NCC(N)=O)=O)=O)CC1=CNC2=C1C=CC=C2  CONH2  NHCOCH3  No  0.62  7.0
200  WGF  [H]N(C(C)=O)[C@H](C(NCC(N[C@H](C(N)=O)CC1=CC=CC=C1)=O)=O)CC2=CNC3=C2C=CC=C3  CONH2  NHCOCH3  No  0.99  7.0
201  WAV  [H]N(C(C)=O)[C@H](C(N[C@H](C(N[C@@H](C(C)C)C(N)=O)=O)C)=O)CC1=CNC2=C1C=CC=C2  CONH2  NHCOCH3  No  0.36  7.0
202  AMV  [H]N(C(C)=O)[C@H](C(N[C@H](C(N[C@@H](C(C)C)C(N)=O)=O)CCSC)=O)C  CONH2  NHCOCH3  No  -0.63  7.0
203  IMF  [H]N(C(C)=O)[C@]([C@H](CC)C)([H])C(N[C@H](C(N[C@H](C(N)=O)CC1=CC=CC=C1)=O)CCSC)=O  CONH2  NHCOCH3  No  1.28  7.0
204  LSF  [H]N(C(C)=O)[C@@H](CC(C)C)C(N[C@@H](CO)C(N[C@H](C(N)=O)CC1=CC=CC=C1)=O)=O  CONH2  NHCOCH3  No  0.23  7.0
205  LTL  [H]N(C(C)=O)[C@@H](CC(C)C)C(N[C@]([C@H](O)C)([H])C(N[C@@H](CC(C)C)C(N)=O)=O)=O  CONH2  NHCOCH3  No  0.24  7.0
206  KFV  O=C(N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](C(C)C)C(=O)N)C  CONH2  NHCOCH3  Yes  -2.13  7.0
207  KIF  O=C(N[C@@H](CCCC[NH3+])C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.46  7.0
208  KFL  O=C(N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(C)C)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.51  7.0

209  LKF  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.41  7.0
210  OrnFL  O=C(N[C@@H](CCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(C)C)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.37  7.0
211  LOrnF  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](CCC[NH3+])C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.38  7.0
212  RIF  O=C(N[C@@H](CCCNC(=[NH2+])N)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  -0.90  7.0
213  RFL  O=C(N[C@@H](CCCNC(=[NH2+])N)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(C)C)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.04  7.0
214  LRF  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](CCCNC(=[NH2+])N)C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  -0.76  7.0
215  LFR  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCNC(=[NH2+])N)C(=O)N)C  CONH2  NHCOCH3  Yes  -0.93  7.0
216  IFR  O=C(N[C@@H]([C@H](CC)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCCNC(=[NH2+])N)C(=O)N)C  CONH2  NHCOCH3  Yes  -0.93  7.0
217  HIF  O=C(N[C@@H](Cc1[nH]cnc1)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  0.36  7.0
218  FHL  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](Cc1[nH]cnc1)C(=O)N[C@@H](CC(C)C)C(=O)N)C  CONH2  NHCOCH3  Yes  0.46  7.0
219  IHV  O=C(N[C@@H]([C@H](CC)C)C(=O)N[C@@H](Cc1[nH]cnc1)C(=O)N[C@@H](C(C)C)C(=O)N)C  CONH2  NHCOCH3  Yes  -0.33  7.0
220  GFH  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](Cc1[nH]cnc1)C(=O)N)CNC(=O)C  CONH2  NHCOCH3  Yes  -1.09  7.0
221  WHV  O=C(N[C@@H](Cc1[nH]cnc1)C(=O)N[C@@H](C(C)C)C(=O)N)[C@@H](NC(C)=C)Cc1c2c([nH]c1)cccc2  CONH2  NHCOCH3  Yes  0.16  7.0
222  FWH  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](Cc1c2c([nH]c1)cccc2)C(=O)N[C@@H](Cc1[nH]cnc1)C(=O)N)C  CONH2  NHCOCH3  Yes  0.89  7.0
223  DFL  O=C(N[C@@H](CC(=O)[O-])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(C)C)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.39  7.0
224  FDL  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CC(C)C)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.19  7.0
225  LDL  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](CC(C)C)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.55  7.0
226  ILD  O=C(N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CC(=O)[O-])C(=O)N)C  CONH2  NHCOCH3  Yes  -1.90  7.0
227  EFL  O=C(N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(C)C)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.52  7.0
228  EIF  O=C(N[C@@H](CCC(=O)[O-])C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.57  7.0
229  FEF  O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.08  7.0
230  LEF  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C  CONH2  NHCOCH3  Yes  -1.25  7.0
231  LIE  O=C(N[C@@H](CC(C)C)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CCC(=O)[O-])C(=O)N)C  CONH2  NHCOCH3  Yes  -1.87  7.0
232  (D-)F(D-)F(D-)F  [H]N(C(C)=O)[C@@H](C(N[C@@H](C(N[C@@H](C(N)=O)CC1=CC=CC=C1)=O)CC2=CC=CC=C2)=O)CC3=CC=CC=C3  CONH2  NHCOCH3  No  2.3  7.2
233  YPWF  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)N1CCC[C@H]1C(=O)N[C@@H](Cc1c2c([nH]c1)cccc2)C(=O)N[C@@H](Cc1ccccc1)C(=O)N  CONH2  NH2  Yes  1.10  7.4
234  Y(D-)AWF  Oc1ccc(cc1)C[C@H]([NH3+])C(=O)N[C@@H](C(=O)N[C@@H](Cc1c2c([nH]c1)cccc2)C(=O)N[C@@H](Cc1ccccc1)C(=O)[O-])C  CONH2  NH2  Yes  1.29  7.4
235  YPIDV  Oc1ccc(cc1)C[C@H](NC(=O)C)C(=O)N1CCC[C@H]1C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CC(=O)[O-])C(=O)N[C@@H](C(C)C)C(=O)N  CONH2  NHCOCH3  Yes  -1.85  7.2
236  YPINV  NC([C@H](C(C)C)NC([C@H](CC(N)=O)NC([C@@]([C@@H](C)CC)([H])NC([C@@H]1CCCN1C([C@@H](NC(C)=O)CC2=CC=C(O)C=C2)=O)=O)=O)=O)=O  CONH2  NHCOCH3  No  -0.42  7.2
237  YPGNV  NC([C@H](C(C)C)NC([C@H](CC(N)=O)NC(CNC([C@@H]1CCCN1C([C@@H](NC(C)=O)CC2=CC=C(O)C=C2)=O)=O)=O)=O)=O  CONH2  NHCOCH3  No  -2.06  7.2
238  YPIIV  NC([C@H](C(C)C)NC([C@@]([C@@H](C)CC)([H])NC([C@@]([C@@H](C)CC)([H])NC([C@@H]1CCCN1C([C@@H](NC(C)=O)CC2=CC=C(O)C=C2)=O)=O)=O)=O)=O  CONH2  NHCOCH3  No  1.13  7.2
239  YPGIV  NC([C@H](C(C)C)NC([C@@]([C@@H](C)CC)([H])NC(CNC([C@@H]1CCCN1C([C@@H](NC(C)=O)CC2=CC=C(O)C=C2)=O)=O)=O)=O)=O  CONH2  NHCOCH3  No  -0.2  7.2

240 FPIIV [H]N(C(C)=O)[C@H](C(N1CCC[C@H]1C(N[C@]([C@@H](C) CONH2 NHCOCH3 No 1.61 7.2 CC)([H])C(N[C@]([C@@H](C)CC)([H])C(N[C@@H](C(C)C)C( N)=O)=O)=O)=O)=O)CC2=CC=CC=C2 241 FPGIV [H]N(C(C)=O)[C@H](C(N1CCC[C@H]1C(NCC(N[C@]([C@@H CONH2 NHCOCH3 No 1.96 7.2 ](C)CC)([H])C(N[C@@H](C(C)C)C(N)=O)=O)=O)=O)=O)CC2= CC=CC=C2 242 FPII [H]N(C(C)=O)[C@H](C(N1CCC[C@H]1C(N[C@]([C@@H](C) CONH2 NHCOCH3 No 1.17 7.2 CC)([H])C(N[C@]([C@@H](C)CC)([H])C(N)=O)=O)=O)=O)CC 2=CC=CC=C2 243 FPGI [H]N(C(C)=O)[C@H](C(N1CCC[C@H]1C(NCC(N[C@]([C@@H CONH2 NHCOCH3 No 2 7.2 ](C)CC)([H])C(N)=O)=O)=O)=O)CC2=CC=CC=C2

In-House 1 GPG O=C(NCC(=O)N)[C@H]1N(CCC1)C(=O)C[NH3+] CONH2 NH2 Yes -3.05 7.4 2 YPWF Oc1ccc(cc1)C[C@H]([NH3+])C(=O)N1CCC[C@H]1C(=O)N[C CONH2 NH2 Yes 1.29 7.4 @@H](Cc1c2c([nH]c1)cccc2)C(=O)N[C@@H](Cc1ccccc1)C(= O)N 3 QWL NC([C@H](CC(C)C)NC([C@@H](NC([C@H](CCC(N)=O)N(C(C CONH2 NHCOCH3 No 0.07 7.4 )=O)[H])=O)CC1=CNC2=C1C=CC=C2)=O)=O 4 FLGKVW O=C(N[C@@H](CCCC[NH3+])C(=O)N[C@@H](C(C)C)C(=O)N CONH2 NH2 Yes -0.99 7.4 [C@@H](Cc1c2c([nH]c1)cccc2)C(=O)N)CNC(=O)[C@@H](N C(=O)[C@@H]([NH3+])Cc1ccccc1)CC(C)C 5 KLVWAF O=C(N[C@H](C(=O)N[C@@H](Cc1ccccc1)C(=O)N)C)[C@@H CONH2 NH2 Yes -1.13 7.4 ](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3+] )CCCC[NH3+])CC(C)C)C(C)C)Cc1c2c([nH]c1)cccc2 6 LPVGWF O=C(N[C@@H](C(C)C)C(=O)NCC(=O)N[C@@H](Cc1c2c([nH CONH2 NH2 Yes 1.12 7.4 ]c1)cccc2)C(=O)N[C@@H](Cc1ccccc1)C(=O)N)[C@H]1N(CC C1)C(=O)[C@@H]([NH3+])CC(C)C 7 LYLGWI Oc1ccc(cc1)C[C@H](NC(=O)[C@@H]([NH3+])CC(C)C)C(=O) CONH2 NH2 Yes 2.16 7.4 N[C@@H](CC(C)C)C(=O)NCC(=O)N[C@@H](Cc1c2c([nH]c1) cccc2)C(=O)N[C@@H]([C@H](CC)C)C(=O)N 8 PWGYVA Oc1ccc(cc1)C[C@H](NC(=O)CNC(=O)[C@@H](NC(=O)[C@H] CONH2 NH2 Yes -0.38 7.4 1[NH2+]CCC1)Cc1c2c([nH]c1)cccc2)C(=O)N[C@@H](C(C)C) C(=O)N[C@H](C(=O)N)C 9 VPAFII O=C(N[C@H](C(=O)N[C@@H](Cc1ccccc1)C(=O)N[C@@H]([ CONH2 NH2 Yes 0.66 7.4 C@H](CC)C)C(=O)N[C@@H]([C@H](CC)C)C(=O)N)C)[C@H] 1N(CCC1)C(=O)[C@@H]([NH3+])C(C)C 10 ALIWGY Oc1ccc(C[C@@H](C(N)=O)NC(CNC([C@H](Cc2c[nH]c3ccccc CONH2 NH2 Yes 0.92 7.4 23)NC([C@H]([C@@H](C)CC)NC([C@H](CC(C)C)NC([C@H]( C)[NH3+])=O)=O)=O)=O)=O)cc1 11 GAWPFL O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](CC(C)C)C(=O)N) CONH2 NH2 Yes 0.80 7.4 [C@H]1N(CCC1)C(=O)[C@@H](NC(=O)[C@@H](NC(=O)C[N H3+])C)Cc1c2c([nH]c1)cccc2 12 VLIWFV O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](C(C)C)C(=O)N)[ CONH2 NH2 Yes 2.35 7.4 C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H]([ NH3+])C(C)C)CC(C)C)[C@H](CC)C)Cc1c2c([nH]c1)cccc2 13 SVYLQP Oc1ccc(cc1)C[C@H](NC(=O)[C@@H](NC(=O)[C@@H]([NH3 CONH2 NH2 Yes -1.59 7.4 +])CO)C(C)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCC( =O)N)C(=O)N1CCC[C@H]1C(=O)N 14 IPFWKL 
O=C(N[C@@H](Cc1ccccc1)C(=O)N[C@@H](Cc1c2c([nH]c1)c CONH2 NH2 Yes -0.85 7.4 ccc2)C(=O)N[C@@H](CCCC[NH3+])C(=O)N[C@@H](CC(C)C) C(=O)N)[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])[C@H](CC) C 15 WPKIYV Oc1ccc(cc1)C[C@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(= CONH2 NH2 Yes -1.82 7.4 O)[C@H]1N(CCC1)C(=O)[C@@H]([NH3+])Cc1c2c([nH]c1)cc cc2)CCCC[NH3+])[C@H](CC)C)C(=O)N[C@@H](C(C)C)C(=O) N


Table A.1: Selected features by Lasso (α=0.06). Descriptions adapted in parts from file:///Applications/moe2016/html/index.htm (last accessed June 2018)

Feature Description

PEOE_VSA-6 Partial Equalisation of Orbital Electronegativities (PEOE) (Gasteiger 1980), a method that calculates atomic partial charges only from the elements, formal charges and connectivity information of the molecule. Here: sum of the vdW surface area of atom i (vi) where the partial charge of atom i (qi) is < -0.30. Most negative partial charge category. vi is calculated by a connection table approximation.

SMR_VSA4 Subdivided surface area: sum of vi such that the atomic contribution to molar refractivity is in range [0.39, 0.44]. Representation of the polarisability of a molecule.

SlogP_VSA5 Sum of vi such that the atomic contribution to logP is in range [0.15, 0.20]. Intermediate hydrophobic contribution.

a_acid Number of acidic atoms.

h_logD Octanol/water distribution coefficient at pH 7, calculated as the state average from h_logP using an eight-parameter model based on Hueckel Theory [unpublished] with R2 = 0.84, RMSE = 0.59 on 1,836 small molecules. Peptides or peptidic structures were not part of the training set.

PEOE_RPC- Relative negative partial charge: the smallest negative qi divided by the sum of the negative qi.

PEOE_VSA-5 Sum of vi where qi is in range [-0.25, -0.3]. Second highest negative partial charge category.

PEOE_VSA_FHYD Fractional hydrophobic vdW surface area, also calculated from vi and qi.

PEOE_VSA+6 Sum of vi where qi is > 0.3. Most positive partial charge category.

ast_violation_ext Astex fragment-like violation count (Rule of three: hbd < 3, hba < 3, clogP < 3).

PEOE_VSA-3 Sum of vi where qi is in range [-0.15, -0.2]. Intermediate negative partial charge category.

A.3 Supplementary Information to Chapter 4.2

Figure A.15: Feature distributions distinguished by the training and left-out partitions of the pooled data. Top: The 11 Lasso-selected features considered by the model SVR(Lasso). Bottom: The 20 PCs considered by the model SVR(PCA).

Figure A.16: Illustration of the final model implementation. First, the respective scaled feature sets of the query structures are passed to the models. The user can supply training data on which the models are trained before predicting logD7.4 of the novel structure; currently, the LIPOPEP set, the AZ set or the pooled data can be chosen. Then, leverage values for the AD assessment are calculated for each combination of model and training data. Finally, the performance-weighted consensus value is calculated. The output of the Python notebook reports on each of these steps. At the end, the script creates a .csv file containing (i) the predictions of both SVRs, (ii) the distance of the query compound to the model centroid divided by the leverage value, for both SVRs, and (iii) the consensus result.
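The leverage and consensus steps named in Figure A.16 can be illustrated with a minimal sketch. It is not the notebook code of this work: the classic leverage definition h = x^T (X^T X)^-1 x, the warning threshold h* = 3(p + 1)/n, and the inverse-error weighting of the two model predictions are common textbook choices that are assumed here, and all function names are hypothetical.

```python
import numpy as np


def leverage(X_train, x_query):
    """Leverage h = x^T (X^T X)^-1 x of one query vector (assumed definition)."""
    X = np.asarray(X_train, dtype=float)
    x = np.asarray(x_query, dtype=float)
    # The pseudo-inverse tolerates collinear (rank-deficient) feature sets.
    xtx_inv = np.linalg.pinv(X.T @ X)
    return float(x @ xtx_inv @ x)


def warning_leverage(X_train):
    """Common threshold h* = 3 (p + 1) / n; an assumption, not from the thesis."""
    n, p = np.asarray(X_train).shape
    return 3.0 * (p + 1) / n


def weighted_consensus(predictions, errors):
    """Average predictions with weights inversely proportional to model error."""
    weights = 1.0 / np.asarray(errors, dtype=float)
    preds = np.asarray(predictions, dtype=float)
    return float(np.sum(weights * preds) / np.sum(weights))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))      # toy scaled training features
    query = np.array([0.2, -0.1, 0.4])
    h = leverage(X, query)
    print("inside AD:", h <= warning_leverage(X))
    # Two hypothetical SVR predictions with hypothetical validation errors.
    print("consensus logD:", weighted_consensus([0.8, 1.2], [0.6, 0.9]))
```

A query compound whose leverage exceeds the threshold would be flagged as lying outside the applicability domain of the respective model and training set combination.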

A.4 Supplementary Information to Chapter 4.3

Figure A.17: Performance comparison of CV vs. EV partitions for all models in the benchmark study. CV: cross-validation; EV: external validation; AAM: arithmetic average model.

A.5 Supplementary Information to Chapter 4.4

HPLC-MS data of the synthesised and purified peptides discussed in Section 4.4. The analytical method is described in section 3.1.
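As a reading aid for the spectra that follow, the expected m/z of a protonated, C-terminally amidated peptide can be estimated from standard monoisotopic residue masses. The sketch below is an illustration and not part of the analytical method of this work; the function names are hypothetical. Note that the MW column in the tables lists average molecular weights, which differ slightly from monoisotopic masses.

```python
# Standard monoisotopic residue masses (Da) of the proteinogenic amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565        # H2O added for the free termini
AMIDE_SHIFT = -0.984016  # C-terminal -OH replaced by -NH2
PROTON = 1.007276


def monoisotopic_mass(sequence, c_term_amide=True):
    """Neutral monoisotopic mass of a linear peptide (one-letter code)."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    if c_term_amide:
        mass += AMIDE_SHIFT
    return mass


def mz(sequence, charge, c_term_amide=True):
    """m/z of the [M + zH]z+ ion in positive ion mode."""
    return (monoisotopic_mass(sequence, c_term_amide) + charge * PROTON) / charge


if __name__ == "__main__":
    for name, seq, z in [("CCR7_6.1", "QDEVTD", 1),
                         ("CCR7_10.5", "ESLASKKDVR", 2),
                         ("CCR7_10.5", "ESLASKKDVR", 3)]:
        print(f"{name} [M+{z}H]{z}+: m/z = {mz(seq, z):.2f}")
```

For CCR7_6.1 this gives about 705.30 for the singly charged ion, close to the labelled peak at 705.1 in Figure A.24; for CCR7_10.5 it gives about 566.33 (z = 2) and 377.89 (z = 3), close to the labelled peaks at 566.4 and 377.95 in Figure A.23.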

Name Sequence MW [g/mol] Positive Charges

CCR7_C24A QDEVTDDYIGDNTTVDYTLFESLASKKDVR-NH2 3437.69 4

Figure A.18: LC-MS (positive ion mode) data of peptide CCR7_C24A. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

CCR7_10.1 QDEVTDDYIG-NH2 1153.16 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 8.63 min. MS spectrum (time = 8.63 min; intensity vs. m/z): labelled peaks at m/z 577.35, 1153.3, 1154.3, 1155.25, 1538.4, 1730.3, 1730.6 and 1922.55.

Figure A.19: LC-MS (positive ion mode) data of peptide CCR7_10.1. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

CCR7_10.2 DDYIGDNTTV-NH2 1111.12 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 8.72 min. MS spectrum (time = 8.72 min; intensity vs. m/z): labelled peaks at m/z 556.3, 583.1, 1111.25 and 1112.2.

Figure A.20: LC-MS (positive ion mode) data of peptide CCR7_10.2. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

CCR7_10.3 DNTTVDYTLF-NH2 1187.26 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 11.47 min. MS spectrum (time = 11.47 min; intensity vs. m/z): labelled peaks at m/z 594.4, 1187.35, 1188.25 and 1781.25.

Figure A.21: LC-MS (positive ion mode) data of peptide CCR7_10.3. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

CCR7_10.4 DYTLFESLAS-NH2 1144.24 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 12.22 min. MS spectrum (time = 12.22 min; intensity vs. m/z): labelled peaks at m/z 572.9, 1144.3 and 1145.2.

Figure A.22: LC-MS (positive ion mode) data of peptide CCR7_10.4. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

CCR7_10.5 ESLASKKDVR-NH2 1131.29 4

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 1.14 min. MS spectrum (time = 1.14 min; intensity vs. m/z): labelled peaks at m/z 377.95, 391.55 and 566.4.

Figure A.23: LC-MS (positive ion mode) data of peptide CCR7_10.5. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

CCR7_6.1 QDEVTD-NH2 704.68 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 2.1 min. MS spectrum (time = 2.1 min; intensity vs. m/z): labelled peaks at m/z 705.1, 706.05 and 707.05.

Figure A.24: LC-MS (positive ion mode) data of peptide CCR7_6.1. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

CCR7_6.2 DEVTDD-NH2 691.63 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 3.56 min. MS spectrum (time = 3.56 min; intensity vs. m/z): labelled peaks at m/z 692.05 and 693.05.

Figure A.25: LC-MS (positive ion mode) data of peptide CCR7_6.2. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

CCR7_6.3 EVTDDY-NH2 739.72 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 6.71 min. MS spectrum (time = 6.71 min; intensity vs. m/z): labelled peaks at m/z 740.05, 741.05, 742.05 and 1479.95.

Figure A.26: LC-MS (positive ion mode) data of peptide CCR7_6.3. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

CCR7_6.4 VTDDYI-NH2 723.77 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 7.81 min. MS spectrum (time = 7.81 min; intensity vs. m/z): labelled peaks at m/z 724.1, 725.15, 726.15 and 1448.3.

Figure A.27: LC-MS (positive ion mode) data of peptide CCR7_6.4. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

CCR7_6.5 TDDYIG-NH2 681.69 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 7.4 min. MS spectrum (time = 7.4 min; intensity vs. m/z): labelled peaks at m/z 682.05, 683.05, 684.05 and 1364.15.

Figure A.28: LC-MS (positive ion mode) data of peptide CCR7_6.5. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 432 PQEDFV-NH2 732.79 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 7.75 min. MS spectrum (time = 7.75 min; intensity vs. m/z): labelled peaks at m/z 733.15, 734.15, 735.15 and 1466.35.

Figure A.29: LC-MS (positive ion mode) data of peptide Offspring 432. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 211 VTDDWM-NH2 764.86 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 9.04 min. MS spectrum (time = 9.04 min; intensity vs. m/z): labelled peaks at m/z 765.15, 766.15, 767.15 and 1529.35.

Figure A.30: LC-MS (positive ion mode) data of peptide Offspring 211. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 865 YTDDYI-NH2 787.82 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 8.11 min. MS spectrum (time = 8.11 min; intensity vs. m/z): labelled peaks at m/z 788.15 and 789.15.

Figure A.31: LC-MS (positive ion mode) data of peptide Offspring 865. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 161 LTDNYI-NH2 736.82 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 7.97 min. MS spectrum (time = 7.97 min; intensity vs. m/z): labelled peaks at m/z 737.2, 738.2, 739.2 and 1473.5.

Figure A.32: LC-MS (positive ion mode) data of peptide Offspring 161. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 191 IGEEFL-NH2 705.81 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 9.59 min. MS spectrum (time = 9.59 min; intensity vs. m/z): labelled peaks at m/z 706.2, 707.2, 708.25 and 1411.45.

Figure A.33: LC-MS (positive ion mode) data of peptide Offspring 191. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 420 VVEDFL-NH2 719.84 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 9.69 min. MS spectrum (time = 9.69 min; intensity vs. m/z): labelled peaks at m/z 720.2, 721.2, 722.2 and 1439.5.

Figure A.34: LC-MS (positive ion mode) data of peptide Offspring 420. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 364 AEDDYA-NH2 681.66 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 3.99 min. MS spectrum (time = 3.99 min; intensity vs. m/z): labelled peaks at m/z 682.1 and 683.0.

Figure A.35: LC-MS (positive ion mode) data of peptide Offspring 364. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 876 MSDSVV-NH2 635.74 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 6.89 min. MS spectrum (time = 6.89 min; intensity vs. m/z): labelled peaks at m/z 636.1 and 637.15.

Figure A.36: LC-MS (positive ion mode) data of peptide Offspring 876. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 944 ISDNMA-NH2 648.74 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 6.15 min. MS spectrum (time = 6.15 min; intensity vs. m/z): labelled peaks at m/z 649.1, 650.15 and 651.15.

Figure A.37: LC-MS (positive ion mode) data of peptide Offspring 944. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 256 VQSDVA-NH2 616.67 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 2.33 min. MS spectrum (time = 2.33 min; intensity vs. m/z): labelled peaks at m/z 617.15, 618.2 and 619.15.

Figure A.38: LC-MS (positive ion mode) data of peptide Offspring 256. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 82 WQDDFL-NH2 821.89 1

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 10.17 min. MS spectrum (time = 10.17 min; intensity vs. m/z): labelled peaks at m/z 822.15, 823.2, 824.15 and 1644.1.

Figure A.39: LC-MS (positive ion mode) data of peptide Offspring 82. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Offspring 7 VSNEHM-NH2 714.80 2

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 1.23 min. MS spectrum (time = 1.23 min; intensity vs. m/z): labelled peaks at m/z 358.15, 358.7, 715.1 and 716.1.

Figure A.40: LC-MS (positive ion mode) data of peptide Offspring 7. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Name Sequence MW [g/mol] Positive Charges

Negative GLPVVVKL-NH2 823.09 2

MS chromatogram (intensity [TIC] vs. time [min]): principal peak at 8.71 min. MS spectrum (time = 8.71 min; intensity vs. m/z): labelled peaks at m/z 411.55, 412.4, 823.4 and 824.8.

Figure A.41: LC-MS (positive ion mode) data of peptide Negative. Top: MS chromatogram. Bottom: MS spectrum for the principal peak.

Binding curves of the peptides discussed in Section 4.4, determined by microscale thermophoresis (MST). The method is described in section 3.1.
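For orientation, the shape of such binding curves can be sketched with the standard 1:1 binding model derived from the law of mass action. Whether exactly this model was fitted is described in section 3.1 and is not restated here; the function name and the target concentration of 10 nM used in the example are hypothetical.

```python
import math


def fraction_bound(ligand_conc, kd, target_conc):
    """Fraction of target bound for 1:1 binding; all concentrations in mol/l."""
    s = ligand_conc + target_conc + kd
    # Quadratic solution of the mass-action equilibrium for the complex.
    complex_conc = (s - math.sqrt(s * s - 4.0 * ligand_conc * target_conc)) / 2.0
    return complex_conc / target_conc


if __name__ == "__main__":
    kd = 2.0e-6        # example Kd of 2.0 uM
    target = 1.0e-8    # hypothetical labelled target concentration (10 nM)
    for exponent in range(-8, -2):
        ligand = 10.0 ** exponent
        print(f"[L] = 1e{exponent} mol/l -> fraction bound = "
              f"{fraction_bound(ligand, kd, target):.3f}")
```

When the target concentration is far below Kd, the fraction bound reaches about 0.5 at a ligand concentration equal to Kd, which is how the Kd can be read off the sigmoidal curves on the logarithmic concentration axis.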

Table A.2: Kd determinations of CCR7_C24A and CCR7_10.1 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

CCR7_C24A 2.0 ± 1.0 CCR7_10.1 0.8 ± 0.5
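The reported "average ± std" values can be reproduced from the three replicate KD values annotated in the binding curve panels of Table A.2. The short sketch below assumes that the sample standard deviation (n - 1 in the denominator) was used; this assumption is consistent with the tabulated values but is not stated explicitly in the caption.

```python
from statistics import mean, stdev

# Replicate KD values [uM] as annotated in the panels of Table A.2.
replicates = {
    "CCR7_C24A": [3.0, 1.1, 1.9],
    "CCR7_10.1": [0.7, 1.3, 0.4],
}


def summarise(kd_values):
    """Return (mean, sample standard deviation), rounded to one decimal."""
    return round(mean(kd_values), 1), round(stdev(kd_values), 1)


if __name__ == "__main__":
    for name, values in replicates.items():
        avg, sd = summarise(values)
        print(f"{name}: Kd = {avg} ± {sd} uM")
```

With these inputs the sketch yields 2.0 ± 1.0 µM for CCR7_C24A and 0.8 ± 0.5 µM for CCR7_10.1, matching the table.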

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide:

CCR7_C24A: KD = 3.0 µM (replicate 1), 1.1 µM (replicate 2), 1.9 µM (replicate 3).

CCR7_10.1: KD = 0.7 µM (replicate 1), 1.3 µM (replicate 2), 0.4 µM (replicate 3).

Table A.3: Kd determinations of CCR7_10.2 and CCR7_10.3 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

CCR7_10.2 2.0 ± 0.7 CCR7_10.3 5.6 ± 1.7

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide:

CCR7_10.2: KD = 2.0 µM (replicate 1), 2.7 µM (replicate 2), 1.4 µM (replicate 3).

CCR7_10.3: KD = 4.0 µM (replicate 1), 7.3 µM (replicate 2), 5.4 µM (replicate 3).

Table A.4: Kd determinations of CCR7_10.4 and CCR7_10.5 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

CCR7_10.4 10.6 ± 2.1 CCR7_10.5 24.5 ± 6.8

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide:

CCR7_10.4: KD = 10.8 µM (replicate 1), 8.4 µM (replicate 2), 12.6 µM (replicate 3).

CCR7_10.5: KD = 20.3 µM (replicate 1), 32.3 µM (replicate 2), 20.8 µM (replicate 3).

Table A.5: Kd determinations of CCR7_6.1 and CCR7_6.2 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

CCR7_6.1 1.3 ± 0.7 CCR7_6.2 -

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide:

CCR7_6.1: KD = 2.1 µM (replicate 1), 1.0 µM (replicate 2), 0.7 µM (replicate 3).

CCR7_6.2: S/N ratio 1.8, signal amplitude 1.1 (replicate 1); S/N ratio 1.1, signal amplitude 0.5 (replicate 2); S/N ratio 0.8, signal amplitude - (replicate 3).

Table A.6: Kd determinations of CCR7_6.3 and CCR7_6.4 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

CCR7_6.3 - CCR7_6.4 0.3 ± 0.1

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide:

CCR7_6.3: S/N ratio 1.5, signal amplitude 1.2 (replicate 1); S/N ratio 2.1, signal amplitude 0.6 (replicate 2); S/N ratio 0.8, signal amplitude 0.8 (replicate 3).

CCR7_6.4: KD = 0.2 µM (replicate 1), 0.4 µM (replicate 2), 0.2 µM (replicate 3).

Table A.7: Kd determinations of CCR7_6.5 and Offspring 432 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

CCR7_6.5 0.3 ± 0.1 Offspring 432 5.9 ± 2.2

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide:

CCR7_6.5: KD = 0.4 µM (replicate 1), 0.2 µM (replicate 2), 0.5 µM (replicate 3).

Offspring 432: KD = 4.0 µM (replicate 1), 8.4 µM (replicate 2), 5.4 µM (replicate 3).

Table A.8: Kd determinations of Offspring 211 and Offspring 865 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

Offspring 211 0.6 ± 0.7 Offspring 865 0.4 ± 0.3

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide:

Offspring 211: KD = 1.4 µM (replicate 1), 0.1 µM (replicate 2), 0.2 µM (replicate 3).

Offspring 865: KD = 0.3 µM (replicate 1), 0.8 µM (replicate 2), 0.2 µM (replicate 3).

Table A.9: Kd determinations of Offspring 161 and Offspring 191 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

Offspring 161 0.2 ± 0.1 Offspring 191 -

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide:

Offspring 161: KD = 0.3 µM (replicate 1), 0.2 µM (replicate 2), 0.1 µM (replicate 3).

Offspring 191: S/N ratio 3.9, signal amplitude 1.6 (replicate 1); signal amplitude 1.8 (replicate 2); signal amplitude 1.1 (replicate 3).

Table A.10: Kd determinations of Offspring 420 and Offspring 364 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

Offspring 420 - Offspring 364 0.7 ± 0.1

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide:

Offspring 420: signal amplitude 1.5 (replicate 1), 1.7 (replicate 2), 2.3 (replicate 3).

Offspring 364: KD = 0.7 µM (replicate 1), 0.7 µM (replicate 2), 0.7 µM (replicate 3).

Table A.11: Kd determinations of Offspring 876 and Offspring 944 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

Offspring 876 - Offspring 944 -

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide (876_1 to 876_3 and 944_1 to 944_3):

Replicate 1: Offspring 876: S/N ratio -, signal amplitude -; Offspring 944: S/N ratio 2.0, signal amplitude 2.3.

Replicate 2: Offspring 876: S/N ratio 2.3, signal amplitude 1.4; Offspring 944: S/N ratio 2.1, signal amplitude 2.2.

Replicate 3: signal amplitudes 2.6 and 1.6 (in panel order 876_3, 944_3); a single S/N ratio of 4.9 is annotated.

Table A.12: Kd determinations of Offspring 256 and Offspring 82 by MST. The resulting value is reported as the average ± std of three individual experiments.

Name Kd [µM] Name Kd [µM]

Offspring 256 - Offspring 82 -

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide (256_1 to 256_3 and 82_1 to 82_3):

Replicate 1: Offspring 256: S/N ratio 1.4, signal amplitude 1.1; Offspring 82: S/N ratio 3.3, signal amplitude 1.4.

Replicate 2: signal amplitudes 2.3 and 1.8 (in panel order 256_2, 82_2); a single S/N ratio of 3.0 is annotated.

Replicate 3: signal amplitudes 1.7 and 1.0 (in panel order 256_3, 82_3); a single S/N ratio of 1.6 is annotated.

Table A.13: Kd determinations of Offspring 7 and Negative by MST. The resulting value is reported as the average ± std of three individual experiments. *Negative did not meet the S/N ratio and signal amplitude criteria required for fitting (cf. section 3.1). Applying the model nonetheless would result in Kd > 500 µM.

Name Kd [µM] Name Kd [µM]

Offspring 7 - Negative > 500*

Binding curves (fraction bound vs. ligand concentration [mol*l-1]), three replicates per peptide:

Offspring 7: S/N ratio 3.8, signal amplitude 1.8 (replicate 1); S/N ratio 2.4, signal amplitude 2.3 (replicate 2); S/N ratio -, signal amplitude - (replicate 3).

Negative (panels labelled GGG_35.1_1 to GGG_35.1_3): KD = n.D in all three replicates.

Figure A.42: His6 control assays (N=1). A: CCR7_10.1 binding to His6-peptide. B: CCR7_6.4 binding to His6-peptide. C: Control assays of the offspring peptides for which binding was observed (cf. Table 4.5).