© 2018
JONATHAN JUN FENG CHEN
ALL RIGHTS RESERVED

DATA MINING/MACHINE LEARNING TECHNIQUES FOR DRUG DISCOVERY: COMPUTATIONAL AND EXPERIMENTAL PIPELINE DEVELOPMENT

A Dissertation Presented to The Graduate Faculty of The University of Akron

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Integrated Biosciences Doctoral Program

Jonathan Jun Feng Chen

May, 2018

DATA MINING/MACHINE LEARNING TECHNIQUES FOR DRUG DISCOVERY: COMPUTATIONAL AND EXPERIMENTAL PIPELINE DEVELOPMENT

Jonathan Jun Feng Chen Dissertation

Approved:

Advisor: Dr. Donald P. Visco Jr.
Co-Advisor/Committee Member: Dr. Zhong-Hui Duan
Committee Member: Dr. Nic Leipzig
Committee Member: Dr. Jie Zheng
Committee Member: Dr. Richard Londraville

Accepted:

Program Director: Dr. Hazel Barton
Dean of the College: Dr. John Green
Dean of the Graduate School: Dr. Chand Midha
Date:

ABSTRACT

Medicine is a precious commodity that saves, prolongs, or increases the quality of life. However, discovering medicinal active ingredients is challenging and is one of the major bottlenecks in developing new pharmaceuticals. The progressive development of new therapeutic targets and compounds exacerbates the problem as the scale of the drug discovery endeavor increases to an unmanageable size. For example, the National Institutes of Health houses the National Library of Medicine (NLM), which contains an ever-growing archive of , proteins, and therapeutic targets as well as candidate compounds. Manual inspection of all compounds and biological targets cannot match the rate at which new information is created and deposited. New methods of data processing and drug candidate consideration are needed.

The work presented used and processed data from the NLM to identify new candidates for consideration. The drug discovery pipeline central to this work created models from existing compound-target interaction data that correlated structure to activity. The models were used to identify the next candidates to test. Compound structural information was captured using the Signature molecular descriptor, while models were created using principal component analysis, genetic algorithm, and support vector machines. The models identified new candidates for activity validation experiments in a virtual high-throughput screen of the 72 million compounds in the PubChem Compound database of the National Library of Medicine. The models were retrained to determine whether improvement was possible and what might affect the improvement resulting from retraining. After activity validation experiments, the activity and structure of candidates and compounds from the training set were compared to identify structure-activity relationships for additional avenues of inquiry.

Seven different case studies were conducted to test the robustness of the pipeline in response to changing dataset size and active fraction: Cathepsin L, Factor XIIa, Factor XIa, C1s, SENP8, and PK-M2 with two different datasets. Taken together, the seven case studies showed that model retraining was beneficial and that the pipeline was more effective at low active fractions. Recommendations for future use are to retrain models when possible, to extrapolate incrementally, and to apply the pipeline to datasets with small active fractions while avoiding large datasets with high active fractions, in order to maximize pipeline effectiveness and utility.

DEDICATION

I dedicate this dissertation to my grandparents: Bao Huang Chen, Shu Xian Lai, and Wang Yu Wan. They took care of me when I was young. But around the time I decided to attend graduate school, I took care of them when I was at home from college. I saw the pain, deteriorating quality of life, and shame they went through as they aged. I was motivated to help in some way, to reduce their suffering in some way. While my work had no impact on their lives, I hope this work can help someone else someday in the future.

ACKNOWLEDGEMENT

First, I would like to acknowledge my parents, brother, extended family, and friends. Their support through everything, good and bad, was the solid grounding I needed to keep moving forward.

Second, I would like to acknowledge Dr. Jeffry Kantor, Dr. Jeremiah Zartman, and Dr. Miranda Burnette. Dr. Kantor was the first person who encouraged me to consider graduate school as a path forward. Dr. Burnette taught me almost all of my experimental skills while I worked in Dr. Zartman’s lab.

Third, I would like to acknowledge my advisor, Dr. Donald P. Visco. Dr. Visco mentored me, motivated me, and gave me the opportunity to do the work presented. He gave me direction and the freedom to explore in my research. Thus, I expanded my horizons and added to my skillset while making progress. What I enjoyed most, and will miss most, were our weekly meetings, where he never neglected to ask how I was and listened to my responses. I learned how to be a scientist, a professional, and a mentor from him.

Fourth, I would like to acknowledge Ms. Lyndsey N. Schmucker. She is the undergraduate researcher who conducted most of the experimental work presented. She is a friend and professional who showed me how to squeeze as much productivity as possible into the time available.

Finally, I would like to acknowledge the Choose Ohio First scholarship for its continued support throughout my graduate career. This support significantly reduced the financial burden of attending graduate school. The bioinformatics scholarship also helped motivate my work in the field and exposed me to many different lines of inquiry and available opportunities.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER I AN OVERVIEW OF DRUG DISCOVERY
1.1 Historical Drugs and Drug Discovery
1.2 Computational Methods in Drug Discovery
1.2.1 Computational Methods in Drug Discovery: Structure-Based Methodologies
1.2.2 Computational Methods in Drug Discovery: Ligand-Based Methodologies
1.2.3 Computational Methods in Drug Discovery: Hybrid-Based Methodologies
1.3 PCA-GA-SVM vHTS Drug Discovery Pipeline
1.3.1 PCA-GA-SVM vHTS Drug Discovery Pipeline: Guiding Considerations Before Construction
1.3.2 PCA-GA-SVM vHTS Drug Discovery Pipeline: The Signature Molecular Descriptor
1.3.3 PCA-GA-SVM vHTS Drug Discovery Pipeline: The Algorithms Chosen
1.3.4 PCA-GA-SVM vHTS Drug Discovery Pipeline: The Rest Of The Pipeline
1.3.5 PCA-GA-SVM vHTS Drug Discovery Pipeline: Determining Effectiveness

CHAPTER II PCA-GA-SVM CASE STUDIES
2.1 Sample
2.1.1 Sample: vHTS and Experimental Validation Results
2.1.2 Sample: Model Analysis
2.1.3 Sample: Structural Feature Analysis
2.2 Cathepsin L
2.2.1 Cathepsin L: vHTS and Experimental Validation Results
2.2.2 Cathepsin L: Model Analysis
2.2.3 Cathepsin L: Structural Features Analysis
2.3 Factor XIIa
2.3.1 Factor XIIa: vHTS and Experimental Validation Results
2.3.2 Factor XIIa: Model Analysis
2.3.3 Factor XIIa: Structural Feature Analysis
2.4 Factor XIa
2.4.1 Factor XIa: vHTS and Experimental Validation Results
2.4.2 Factor XIa: Model Analysis
2.4.3 Factor XIa: Structural Feature Analysis
2.5 Complement Factor C1s
2.5.1 Complement Factor C1s: vHTS and Experimental Validation Results
2.5.2 Complement Factor C1s: Model Evaluation
2.5.3 Complement Factor C1s: Structural Feature Analysis
2.6 SENP8
2.6.1 SENP8: vHTS and Experimental Validation Results
2.6.2 SENP8: Model Analysis
2.6.3 SENP8: Structural Feature Analysis
2.7 PK-M2 Pt. 1
2.7.1 PK-M2 Pt. 1: vHTS and Experimental Validation Results
2.7.2 PK-M2 Pt. 1: Model Analysis
2.7.3 PK-M2 Pt. 1: Structural Feature Analysis
2.8 PK-M2 Pt. 2
2.8.1 PK-M2 Pt. 2: vHTS and Experimental Validation Results
2.8.2 PK-M2 Pt. 2: Model Evaluation
2.8.3 PK-M2 Pt. 2: Structural Feature Analysis

CHAPTER III CONCLUSION: EVALUATION OF PCA-GA-SVM AND RECOMMENDATIONS REGARDING FUTURE USE

CHAPTER IV INTEGRATION: BIOLOGY, CHEMISTRY, CHEMICAL ENGINEERING, AND COMPUTER SCIENCE

APPENDICES

APPENDIX A ALGORITHM SPECIFICATIONS AND IMPLEMENTATION SCRIPTS
1.1 ZINC-15 PAINS Filter
1.2 Data Scaling and Normalization
1.3 Principal Component Analysis
1.4 GA/SVM-C Models Creation
1.5 GA/SVM-R Models Creation
1.6 Create ROC Curves
1.7 vHTS with GA/SVM-C Models
1.8 vHTS with GA/SVM-R Models
1.9 Set Theoretic Tanimoto Coefficient calculation

APPENDIX B EXPERIMENTAL VALIDATION METHODOLOGIES AND PROTOCOLS
2.1 Cathepsin L
2.2 Factor XIIa
2.3 Factor XIa
2.4 Complement Factor C1s
2.5 SENP8
2.6 Human PK-M2 Pt. 1
2.7 Human PK-M2 Pt. 2

LIST OF TABLES

1.1 Dataset Assignments to AIM 1 and AIM 2
2.1 Cathepsin L: The First Round vHTS Modeling Summary
2.2 Cathepsin L: The First Round Activity Validation Summary
2.3 Cathepsin L: The Second Round vHTS Modeling Summary
2.4 Cathepsin L: The Second Round Activity Validation Summary
2.5 Cathepsin L: Candidate Similarity
2.6 Factor XIIa: The First Round vHTS Modeling Summary
2.7 Factor XIIa: The First Round Activity Validation Summary
2.8 Factor XIIa: The Second Round vHTS Modeling Summary
2.9 Factor XIIa: The Second Round Activity Validation Summary
2.10 Factor XIIa: The PAINS Activity Summary
2.11 Factor XIa: The First Round vHTS Modeling Summary
2.12 Factor XIa: The First Round Activity Validation Summary
2.13 Factor XIa: The Second Round vHTS Modeling Summary
2.14 Factor XIa: The Second Round Activity Validation Summary
2.15 Factor XIa: The PAINS Activity Summary
2.16 C1s: The First Round vHTS Modeling Summary
2.17 C1s: The First Round Activity Validation Summary
2.18 C1s: The Second Round vHTS Modeling Summary
2.19 C1s: The Second Round Activity Validation Summary
2.20 SENP8: The First Round vHTS Modeling Summary
2.21 SENP8: The First Round Activity Validation Summary
2.22 SENP8: The Second Round vHTS Modeling Summary
2.23 SENP8: The Second Round Activity Validation Summary
2.24 PK-M2 Pt. 1: The First Round vHTS Modeling Summary (AID 2533)
2.25 PK-M2 Pt. 1: The First Round Activity Validation Summary (AID 2533)
2.26 PK-M2 Pt. 1: The Second Round vHTS Modeling Summary (AID 2533)
2.27 PK-M2 Pt. 1: The Second Round Activity Validation Summary (AID 2533)
2.28 PK-M2 Pt. 1: The Second Round Candidate Overlap Scores
2.29 PK-M2 Pt. 2: The First Round vHTS Modeling Summary (AID 1540)
2.30 PK-M2 Pt. 2: The First Round Activity Validation Summary (AID 1540)
2.31 PK-M2 Pt. 2: The Second Round vHTS Modeling Summary (AID 1540)
2.32 PK-M2 Pt. 2: The Second Round Activity Validation Summary (AID 1540)
3.1 Summary Of Case Study Hit Rates

LIST OF FIGURES

1.1 Molecular Graph Of Acetic Acid
1.2 Signature Fragmentation Example
1.3 Principal Component Analysis (PCA)
1.4 PCA Usage In Pipeline
1.5 Genetic Algorithm (GA)
1.6 Support Vector Machines (SVM)
1.7 Pipeline Process Flowchart
2.1 Cathepsin L: The First Scaffold
2.2 Cathepsin L: The Second Scaffold
2.3 Cathepsin L: The Third Scaffold
2.4 Cathepsin L: The Fourth Scaffold
2.5 Factor XIIa: The First Round vHTS ROC
2.6 Factor XIIa: The Second Round vHTS ROC
2.7 Factor XIIa: The First Scaffold
2.8 Factor XIIa: The Second Scaffold
2.9 Factor XIIa: The Third Scaffold
2.10 Factor XIa: The First Round vHTS ROC
2.11 Factor XIa: The Second Round vHTS ROC
2.12 Factor XIa: The First Scaffold
2.13 Factor XIa: The Second Scaffold
2.14 C1s: The First Round vHTS ROC
2.15 C1s: The Second Round vHTS ROC
2.16 C1s: The First Scaffold
2.17 C1s: The Second Scaffold
2.18 SENP8: The First Round vHTS ROC
2.19 SENP8: The Second Round vHTS ROC
2.20 SENP8: The First Scaffold
2.21 SENP8: The Second Scaffold
2.22 PK-M2 Pt. 1: The First Round vHTS ROC (AID 2533)
2.23 PK-M2 Pt. 1: Second Round vHTS ROC (AID 2533)
2.24 PK-M2 Pt. 1: The First Scaffold (AID 2533)
2.25 PK-M2 Pt. 1: The Second Scaffold (AID 2533)
2.26 PK-M2 Pt. 1: The Third Scaffold (AID 2533)
2.27 PK-M2 Pt. 1: The Fourth Scaffold (AID 2533)
2.28 PK-M2 Pt. 1: The New Atomic Signatures (AID 2533)
2.29 PK-M2 Pt. 2: The First Round vHTS ROC (AID 1540)
2.30 PK-M2 Pt. 2: The Second Round vHTS ROC
2.31 PK-M2 Pt. 2: The New Atomic Signatures

CHAPTER I AN OVERVIEW OF DRUG DISCOVERY

1.1 Historical Drugs and Drug Discovery

Drugs and medicine have played significant roles in civilizations since their beginnings (Ng, 2008). Animals (parts or whole) and herbs played substantial roles in spiritual/religious ceremonies and healing (Sigerist, 1987). Each culture had its own versions of home remedies specific to the time, location, and availability of local flora and fauna, and each had its own understanding of how these medicines worked (Sigerist, 1987). While traditional explanations of how a specific remedy functions may be dubious, the medicinal effect is not. Therefore, these traditional medicines served as the starting point of modern drug discovery (Fabricant and Farnsworth, 2001).

J. N. Langley’s theory of receptor “switches” that turn biological signals on or off is the foundation of the modern conception of medicine (Langley, 1905). Receptors for different signaling pathways were found in all tissue types and provided therapeutic targets to effect a desired physiological change (Drews, 2000). When examined in this framework, it is apparent that traditional remedies contain one or more active ingredients that act upon a specific receptor. Upon careful examination of the particular herbal components in traditional medicine, active therapeutic agents were determined and can be used directly for treatments (Fabricant and Farnsworth, 2001). However, traditional medicines as a source of pharmaceuticals also have their associated challenges. Plants contain many other compounds, and isolating the active therapeutic agent can be challenging, while obtaining enough of it at the requisite purity further increases the difficulty (Balunas and Kinghorn, 2005). Cell-based assays or animal studies can mitigate these challenges, but were less compatible with, or more difficult to use than, the technology that was developed concurrently (Koehn and Carter, 2005). These limitations reduced the interest in natural products as a source of pharmaceuticals, though interest rebounded as the limitations were addressed (Harvey, 2008; Li and Vederas, 2009; Henrich and Beutler, 2013).

An alternative source of therapeutics derived from natural sources is engineered biomolecules, where organisms and cell cultures are injected with modified genes for expression and production (Matasci et al., 2008). The recombinant proteins produced can be used to complement or replace host production in the case of insufficient or completely absent host production, respectively (Pavlou and Reichert, 2004), or to induce a response from the host in the case of vaccines (Modjtahedi et al., 2012; Van Damme, 2016). While proteins and peptides are effective as therapies and in other applications (Khan et al., 2016), small molecule drug compounds (< 500 Daltons (Lipinski et al., 2001)) are still preferred, as their small size enables transport and permeation throughout the body (Yang and Hinner, 2015).

While not developed explicitly to find small molecule drug compounds, two techniques were coincidentally suited for that purpose: high-throughput screening (HTS) and combinatorial chemistry. HTS was a technique designed to maximize efficiency by shrinking the reaction volume and the amount of material needed for each experiment (Pereira and Williams, 2007). HTS was also intended to test multiple compounds in parallel (Pereira and Williams, 2007).
The system was flexible and able either to evaluate numerous compounds at a given concentration or to evaluate a few compounds over a range of concentrations (Pereira and Williams, 2007; Drews, 2000), but either case would require less material and fewer resources to test a given number of compounds. For HTS to have maximum effect, a large number of untested compounds is needed.

Combinatorial chemistry is a technique of combining different precursors and allowing them to react randomly (Czarnik and Mei, 2007) to create sets of compounds (Thompson, 2001). Together, HTS and combinatorial chemistry supplied researchers with many small molecule compounds to test for biological activity (Houghten, 2000; Schreiber, 2000).

While combinatorial chemistry did generate more compounds to test in HTS, it was successful to the point where there were too many compounds, and the bottleneck was once again how to reasonably examine the chemical space mapped by the new compounds with the finite resources available (Pereira and Williams, 2007). In response, screening compound libraries were created with HTS limitations in mind, containing all possible iterations of a given scaffold (focused HTS) (Geysen et al., 2003). A competing method of creating screening compound libraries was to test several iterations of many different compound scaffolds (diversity HTS) (Valler and Green, 2000). Diversity HTS became the preferred method of creating screening compound libraries since any candidates identified can be optimized for desired biological activities and properties (Valler and Green, 2000). Additionally, diversity HTS explores the chemical space of possible drug candidates more efficiently (Valler and Green, 2000).

To better understand why diversity HTS is the preferred option to explore the chemical space, it is essential to understand the size of the chemical space considered. Depending on the factors in consideration, estimates of the chemical space occupied by small molecules can differ by many orders of magnitude. The number of possible molecules, both theoretical and synthesized, grows exponentially with the addition of different dimensions (e.g., number of atoms, type of atoms,

number of bonds, types of bonds, type of branching, type of ring structures, etc.). Additionally, the estimate grows with increasing molecular weight, as more atoms and atom combinations are permissible. For example, an estimate of the subset of molecules consisting of 30 atoms, using only C, N, O, and S atoms, is 10^60 (Bohacek et al., 1996), which is large enough to be taken as the estimate for all synthesized small molecules, now and in the future. Even a more conservative estimate of 10^30 or 10^14 (Geysen et al., 2003) is many orders of magnitude larger than can feasibly be tested.

In an attempt to make the chemical space more manageable, expert knowledge can be used to “focus” compound libraries for HTS by removing compounds whose biological activity is unlikely (Dobson, 2004). However, library focusing via this method showed marginal improvement (Dobson, 2004), and experts cannot be expected to visually inspect the number of compounds considered. For reference, the PubChem Compound database used in this work contained roughly 72 million compounds (circa May 2014) and had grown to nearly 95 million compounds (by February 2018). Thus, researchers and corporations arrived at the same conclusion: computational methods for identifying the specific compound libraries, if not individual compounds, on which to focus are necessary and are currently the only way to reliably consider compounds at the scale proposed (Geysen et al., 2003).
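The exponential growth described above can be made concrete with a deliberately crude toy count. The sketch below is not the method behind any of the cited estimates: it only counts ordered sequences of element symbols drawn from four elements (C, N, O, S), ignoring valence, bond types, branching, rings, and symmetry. The function names are hypothetical, and the 72 million figure is the PubChem Compound size quoted in the text.

```python
def linear_chain_count(n_atoms: int, n_elements: int = 4) -> int:
    """Toy count: ordered sequences of n_atoms symbols from n_elements elements."""
    return n_elements ** n_atoms


def first_length_exceeding(library_size: int, n_elements: int = 4) -> int:
    """Smallest chain length whose toy count is larger than library_size."""
    n = 1
    while linear_chain_count(n, n_elements) <= library_size:
        n += 1
    return n


PUBCHEM_2014 = 72_000_000  # approximate size quoted in the text (circa May 2014)

for n in (10, 20, 30):
    print(n, linear_chain_count(n))

# Chain length at which even this crude count outgrows the whole database.
print(first_length_exceeding(PUBCHEM_2014))
```

Even under these simplifications, the count outgrows a 72-million-compound library at a chain length of 14 atoms, which illustrates why exhaustive enumeration or testing quickly becomes impractical.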

1.2 Computational Methods in Drug Discovery

In the 1950s and 1960s, computers were limited to “... HMO (Hückel molecular orbital) and PPP (Pariser-Parr-Pople) calculations on aromatic compounds...” in medicinal chemistry (Bultinck et al., 2003). Since then, the advancement of computational power and the availability of computers have made them an integral tool in the discipline, and experts predicted their use to identify compound classes on which to focus (Geysen et al., 2003). The growth of computational power and sophistication has fulfilled that prediction (Van Drie, 2007).

The applications of computational methods are collectively known as computer-aided drug design. Each of the three applications fulfills a different purpose and motivation: (1) filtering large compound libraries for “hits” to create focused compound libraries, (2) optimizing known hits to improve metabolic and pharmacokinetic properties (i.e., absorption, distribution, metabolism, excretion, and toxicity), and (3) creating novel compounds from functional groups or other types of molecular fragments (Sliwoski et al., 2014).

The first application, filtering large compound databases, usually occurs in a virtual HTS (vHTS). Possible ligands are digitized, resulting in a 3-D model or a string of characters capturing structural information. The digitized ligands are evaluated according to different metrics such as similarity to known ligands (Willett, 2006), activity predicted by a quantitative structure-activity relationship (QSAR) (Seifert et al., 2003), or the change in Gibbs free energy from docking onto a target (Enyedy and Egan, 2008). vHTS is used at the start of a drug discovery campaign or project to identify hits for future in vitro testing. It should be noted that nearly all compounds in a library are not hits (Dobson, 2004). If HTS were used to test all available compounds, nearly all resources would be spent synthesizing (if the compound is not commercially available) or testing inactive compounds. Thus, vHTS increases not only the hit rate but also the economic feasibility of drug discovery.

vHTS hits are usually not used as active drug compounds. Instead, the hits are starting points for the second application, optimization of metabolic and pharmacokinetic properties. Through multiple iterations of synthesis and vHTS/HTS validation, the vHTS hits are optimized for efficacy and for metabolic and pharmacokinetic properties (Jorgensen, 2004) before the various in vivo efficacy and toxicity trials.

Pharmacokinetic properties are particularly complex to screen and optimize. Under the umbrella of drug-likeness (Lipinski et al., 2001), exact prescriptions of acceptable property parameters do not exist. The best available alternatives are heuristics, the most well known being the “Rule of 5” by Lipinski et al. (2001), which predicts poor absorption or permeation if there are > 5 H-bond donors, > 10 H-bond acceptors, molecular weight > 500 daltons, calculated Log P > 5, or Moriguchi logP > 4.15. Other heuristics include, but are not limited to, whether the optimized molecule can be synthesized, whether the compound “looks” like a drug, and whether the compound contains essential structural features (Walters et al., 1998). Depending on the application, drug-likeness can be part of the vHTS, though historically industry has been disinclined to include it because doing so limits the available options and drug-likeness can be adjusted or optimized later (Lipinski, 2000).

The third application of computational methods is designing active drug ingredients. One variation links together multiple binding fragments placed in the target binding pocket to form a molecule (Schneider and Fechner, 2005). Another variation links multiple, possibly non-binding, fragments to a rooted binding fragment in the binding pocket to grow a molecule (Schneider and Fechner, 2005). Leads are then tested and optimized following the same path as the leads from vHTS.
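The “Rule of 5” thresholds quoted above can be expressed as a simple filter. The following is a minimal sketch, assuming the property values have already been computed by some other tool; the `CompoundProperties` container, the function name, and the example values are hypothetical and are not part of the pipeline described in this work. The function only reports which thresholds are exceeded and leaves the decision of how many violations to tolerate to the caller.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompoundProperties:
    """Hypothetical container for precomputed properties of one compound."""
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float          # daltons
    clogp: float                     # calculated log P
    mlogp: Optional[float] = None    # Moriguchi log P, if available


def rule_of_5_violations(props: CompoundProperties) -> List[str]:
    """Return the thresholds quoted in the text that this compound exceeds."""
    violations = []
    if props.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if props.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    if props.molecular_weight > 500:
        violations.append("molecular weight > 500 Da")
    if props.clogp > 5:
        violations.append("calculated log P > 5")
    if props.mlogp is not None and props.mlogp > 4.15:
        violations.append("Moriguchi log P > 4.15")
    return violations


# Illustrative, made-up property values (not measurements of real compounds).
small_polar = CompoundProperties(2, 4, 320.0, 2.5)
large_greasy = CompoundProperties(6, 11, 650.0, 5.5, mlogp=4.5)

print(rule_of_5_violations(small_polar))
print(rule_of_5_violations(large_greasy))
```

Returning the list of violated criteria, rather than a single pass/fail flag, reflects the point made above that drug-likeness is a heuristic that can be adjusted or optimized later.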

1.2.1 Computational Methods in Drug Discovery: Structure-Based Methodologies

The three applications of computational methods are usually considered using two different methodologies: structure-based methodologies and ligand-based methodologies. Structure-based, or molecular dynamics simulation, methodologies are rooted in mathematics, physics, and chemistry fundamentals. Programs like AutoDock, DOCK, Flex, AMBER, GROMACS, and CHARMM calculate how molecules behave in a given environment and attempt to find the most energetically favorable solution (Durrant and McCammon, 2011; Wong and McCammon, 2003; Durrant and McCammon, 2010; Douguet et al., 2005; Kalyaanamoorthy and Chen, 2011; Cheng et al., 2012; Cornell et al., 1995; Wang et al., 2004; Zeng and Wu, 2015; Sinko et al., 2013; Lill and Danielson, 2011; da Silva et al., 2010). The only experimental data required is the 3-D structure of the target. In exchange, structure-based approaches are computationally expensive since prediction confidence depends on the convergence of multiple simulation runs.

Before using structure-based methodologies, knowledge of the 3-D structure of the target protein, especially the binding pocket, is required. Ideally, both pieces of information are available via a ligand-bound X-ray crystal structure of the binding pocket. If that is not available, both pieces of information can be estimated by matching the protein’s amino acid sequence to the sequence and structure of a similar protein (Misura et al., 2006; Chivian and Baker, 2006). If the X-ray crystal structure does not include a bound ligand, probing the protein surface for concave spaces or locations with favorable energetics can identify the binding pocket (Laurie et al., 2006).

Once the structure and the binding pocket are resolved, compound docking is simulated. Solutions are validated by repeated runs with different initial conditions according to the algorithm used, such as Monte Carlo (Sousa et al., 2006), genetic algorithm (Jones et al., 1997), or some other scheme to introduce perturbation. Docking simulations are scored based on energy or energy-related metrics (DeWitte and Shakhnovich, 1996; Mitchell et al., 1999; Huey et al., 2007). Depending on calculation complexity, 100,000 molecules can be screened per day (Agarwal and Fishwick, 2010).

1.2.2 Computational Methods in Drug Discovery: Ligand-Based Methodologies

The other methodology used to perform vHTS, optimize vHTS hits, and build ligands de novo is ligand-based methodologies. Entirely dependent on experimental data to make predictions (Zeng and Wu, 2015; Alvarsson et al., 2014a,b; Bender et al., 2004), ligand-based methodologies use data of known ligands to find other ligands, usually compounds with similar structure or functional groups, based on the similar property principle (similar structures should have similar properties) (Johnson and Maggiora, 1990). The requirements are precisely the inverse of structure-based approaches: computational requirements are minimal, but a prerequisite amount of experimental data is needed to make accurate predictions.

For application in ligand-based methodologies, chemical structure and property information in a quantitative form is more amenable to use and manipulation. Known as molecular descriptors (Todeschini and Consonni, 2008), the quantitative form of structure and property information is more easily applied in different applications while also enabling quantitative differentiation of one compound from another. Many different molecular descriptors exist and can be categorized in multiple ways. Commonly, molecular descriptors are grouped according to dimensionality: zero-dimensional descriptors (0D), one-dimensional descriptors (1D), two-dimensional descriptors (2D), and three-dimensional descriptors (3D) (Faulon and Bender, 2010; Todeschini and Consonni, 2009). 0D descriptors contain no structural information and describe properties that depend solely on chemical formula (e.g., molecular weight). 1D descriptors contain substructure information and describe properties that require some, but not full, structural information (e.g., dipole moments, partition coefficients, hydrogen-bond acceptors and donors). 2D molecular descriptors are the most numerous and contain full 2-D structural information. 3D molecular descriptors also include spatial features and thus contain more information than 2D descriptors. Since ligand-based methodologies correlate chemical structure to chemical property, the way structural information is presented and accepted has to be uniform. Therefore, the same molecular descriptor must be used to transform the information in both the dataset and the compound library.

Ligand-based methodologies are usually further grouped into similarity and QSAR prediction methodologies. The choice of molecular descriptor depends on the application, as some are more suited for specific applications than others. For example, the commonly used SMILES and IUPAC/InChI keys are strings that describe molecular structure and are well suited for similarity searches but less so for QSAR predictions. Physicochemical, particularly pharmacokinetic, properties and functional groups are other commonly used molecular descriptors that are well suited for QSAR predictions but not for similarity.

Similarity, or the search for molecules structurally related to a known ligand or compound of interest, can be done by matching and comparing molecular descriptors. Similarity searches are mostly done using 2D molecular descriptors since they contain structural information for the entire molecule and are computationally cheaper to use than 3D molecular descriptors (Hutter, 2011). Similarity searches have also been used to identify any unintended proteins with which a given compound might interact (Keiser et al., 2007, 2009).

In comparison to similarity, QSARs are more involved and require more attention and decisions to perform optimally. A QSAR is a mathematical correlation of structural features to a given property (e.g., activity) created from structural and experimental data. The QSAR correlation can be linear (e.g., multiple linear regression) or non-linear (e.g., neural networks) and is used to predict data, which may be discrete (e.g., classification data) or continuous (e.g., physicochemical properties,

9 pharmacokinetic properties, activity, etc.) Correlation methods existed before the advent of computers, but the growth of computational power increased the number and diversity of correlation methods and algorithms available for use in making QSARs. The trend line in Excel, which is a linear correlation model, takes seconds to minutes to calculate manually depending on the amount of data to be fitted. Meanwhile, Excel accomplishes the same task in seconds. As computational difficulty increases (e.g., polynomial or exponential trend line) the difference becomes even starker. Machine learning (Samuel, 1959), a discipline of computer science where algorithms improve task performance over different iterations, allows for the optimization of QSAR correlations, a difficult task to achieve and impossible to match with manual calculations. Common non-machine learning correlation models to create QSAR correlations include multiple linear regression, principal component analysis, and partial least squares regression (Acharya et al., 2011). Stepwise multiple linear regression adds (forward selection) or removes (backward elimination) feature variables to arrive at an optimal correlation model (Hocking, 1976). Principal component analysis (PCA) transforms data such that feature variance is maximized along the new orthogonal axes (Jolliffe, 2014) and can yield a linear correlation using the transformed variables. Partial least squares regression transforms data further than principal component analysis as both the feature variables and dependent variable are transformed before a linear regression is conducted (Wold et al., 2001). As for machine learning models, neural networks and support vector machines (SVM) are commonly used. Neural networks attempt to model brain function for a specific application (Agatonovic-Kustrin and Beresford, 2000). Many different pro- cess elements take the data inputs and perform different operations. 
The outputs of each process element are weighted according to prior data/training, but all are aggregated to yield a single overall prediction. The delegation of different process elements to handle data in different manners and the interconnection of various process elements allow for data to be considered in a non-linear fashion. However, the process must be finely tuned as neural networks are prone to overfitting, which is the overspecialization of models to the training data such that they have diminished predictive power (Hawkins, 2004).

SVM (Cortes and Vapnik, 1995) models data using all available feature variables and creates as many dimensions/axes as necessary. A hyperplane is constructed to describe the data: a division hyperplane for discrete data or a valuation hyperplane for continuous data. The hyperplane is oriented and optimized to maximize correct predictions of the training data, with a cost parameter that assigns penalties to incorrect predictions. Different kinds of data are better described with planes based on different types of functions. Thus, SVM includes a kernel function parameter to select the specific type of hyperplane function (e.g., linear, polynomial, Laplacian, Gaussian, etc.) to use. Additionally, SVM uses the “kernel trick” to map data and work in higher dimensions without data transformation to reduce the complexity and calculations involved (Schölkopf, 2001).

The correlation methods presented are only some of the most commonly used in drug discovery but do illustrate some of the diversity of choice. Unfortunately, there is no standard or guideline as to which one is best suited for any given application or dataset. The best method may vary in the same application with different datasets and vice versa. Thus, the best method for any given situation can only be determined through trial and error.

As data are essential for ligand-based methodologies, the two aspects of data crucial for QSAR are the amount and quality of data available.
QSARs require a minimum amount of data to create accurate correlations; without the minimum amount, QSAR predictions will drop in accuracy. The quality of data refers to the accuracy of experimental and structural data. Inaccuracies in either may yield correlations that do not exist or fail to recognize the importance of one or more structural features.

An aspect that is not readily controllable but is of constant concern is overfitting. As briefly mentioned previously, overfitting occurs when models are too specific to the data used when creating/training and lose predictive power (Hawkins, 2004). Overfitting typically occurs when a non-linear model is used unnecessarily or when more variables are used than necessary (Hawkins, 2004). While the user chooses the model, the choice of variables need not be left to the user. Filter and wrapper methods can both limit the inclusion of extraneous variables in different ways: filter methods select variables before model construction, while wrapper methods work in conjunction with model training to decide which variables to include and exclude (Guyon and Elisseeff, 2003). Accordingly, to limit overfitting and obtain the most potent correlative models, the correct modeling method and appropriate variables need to be selected.

1.2.3 Computational Methods in Drug Discovery: Hybrid-Based Methodologies

There are also cases of hybrid methodologies where structure-based and ligand-based methodologies are used in conjunction (Huang et al., 2015). In these cases, the strengths of each methodology complement the other. However, the individual deficiencies of each methodology are still present, though mitigated to a certain degree.

1.3 PCA-GA-SVM vHTS Drug Discovery Pipeline

Given the small molecule chemical space and the current state of healthcare (both the rising number of diseases that need new therapies and the associated economic costs), a new method of identifying novel therapies is desired. Ideally, the approach will identify new pharmaceutical candidates quickly and efficiently once a therapeutic target is designated. The approach should also be scalable such that computational resources can be focused on a single target or distributed to many different targets according to desire or need. Finally, the approach should be able to examine all molecules and explore as much of the small molecule chemical space as possible. As new chemical reactions are developed and new molecules produced, the rules and heuristics that govern what are or are not drug molecules will change. Thus, an approach that examines as many molecules as possible for leads is preferable to one that preemptively limits what molecules are considered due to present-day rules and heuristics.

Per these stated goals, work was undertaken to develop a virtual drug discovery pipeline, ultimately resulting in the presented principal component analysis-genetic algorithm-support vector machine (PCA-GA-SVM) pipeline. Resource and work-flow considerations that shaped and guided how the PCA-GA-SVM pipeline was developed and applied are presented. Then the PCA-GA-SVM pipeline itself is presented and discussed in detail.

1.3.1 PCA-GA-SVM vHTS Drug Discovery Pipeline: Guiding Considerations Before Construction

One of the first considerations was the scope of molecules considered. With limited resources available, the scope needs to be one that is tractable with the available resources and able to address as many of the stated goals as possible. Defining the scope requires some compromises as two of the stated goals, identifying leads as quickly and efficiently as possible and examining as many molecules as possible to explore the small molecule chemical space, are two conflicting objectives. The quickest way to identify leads is to use curated HTS libraries as all compounds in the library should be biologically active against at least one target, while the examination of all molecules requires consideration of both synthesized and theoretical molecules. Using

curated HTS libraries limited what molecules were considered while an investigation of all molecules was limited by available resources and lack of synthesis expertise. A compromise between the two stated objectives is to limit the scope of molecules considered to all currently synthesized molecules, as that will expand the range of molecules considered and prevent wasting resources on possibilities that may or may not be realized. Eventually, this resulted in the selection of the PubChem Compound database as the screening library.

The second consideration was whether to use structure-based or ligand-based methodologies in the pipeline. As previously stated, structure-based methodologies are less limited by data but are more computationally expensive, while ligand-based methodologies are the inverse. One of the resources available was the experimental data resulting from previous HTS experiments and now available as a result of “big data” (Yan et al., 2006), removing the data constraint from ligand-based methodologies. Additionally, any experimental validation of hits requires an experimental protocol, and it is preferable to spend resources testing different pipeline applications than to develop new experimental protocols. The experimental protocol was one of the pieces of information included with the HTS data, making it much more convenient to conduct experiments that are based on HTS data. Furthermore, speed was one of the stated goals/motivations, and there is a difference in speed between the two methodologies. Structure-based methodologies had a reported ligand processing rate of 100,000 molecules per day (Agarwal and Fishwick, 2010), while ligand-based methodologies processed the PubChem Compound database (containing roughly 74 million compounds circa 2014) in less than a day in preliminary tests. Finally, efficiency was another stated goal, and hits from ligand-based screens are typically more active than those from structure-based screens (Stumpfe et al., 2012). Based on all of this information, ligand-based methodologies were chosen as the basis for the PCA-GA-SVM pipeline.

A third consideration was how to use all tools available to construct the pipeline efficiently. Previous approaches suffered from different constraints that limited throughput and efficiency (Weis et al., 2008; Kayello et al., 2014). For example, a screening of the PubChem Compound database using an approach developed for previous technology took weeks to complete. Meanwhile, a preliminary test with an earlier version of the pipeline screened the PubChem Compound database in less than a day. The strengths and concepts of previous work should be used while optimizing it to fully utilize the technology currently available.

One optimization is to utilize specialized software for intended, specific purposes only. Screening models created with specialized mathematical software written by experts and robustly tested by different users are much more reliable and are trained much more quickly than ones created with personally written software that is untested, unproven, and buggy. Since the purpose of this work is to develop predictive models for other applications, the focus was on innovating and utilizing what is available in improved ways rather than creating and building code from scratch.

A second optimization is to utilize high-performance computing techniques. With the advancement of technology, parallel computation is now much easier and more available than ever before, and by splitting tasks into many smaller ones, the work can be distributed and finished more efficiently. While this can also be done in the specialized software environment, doing so will subject the screen to data input/output, resource management and usage, compatibility, and work-flow (e.g., how the data is split and assigned to different threads, the ordering of subtasks, the compilation of results from different threads, etc.) constraints.
Instead, the work is divided and allocated within the system environment: data input/output can be managed in whatever way is most sensible and efficient for the user, and individual task processes can be declared as desired by the user for maximum control over what and how tasks are parallelized. In this way, many different instances of modeling can occur at the same time to increase the chances of finding an optimal model, many different datasets can be worked on in parallel, and a database can be split into subsets that are screened in parallel.

The fourth and last consideration was to use the Signature molecular descriptor. As one of the original developers of Signature, my advisor Dr. Donald P. Visco Jr. has both unique insights into the molecular descriptor and experience in its application. The Signature molecular descriptor was chosen to maximize available resource advantages to increase the likelihood of success in the following work and will be described in detail.
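The idea of splitting a database into subsets that are screened independently and then merged can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the fragment-count dictionaries, and the linear scoring model are hypothetical, and a thread pool is used only to keep the example self-contained. For a CPU-bound screen, the subsets would more plausibly be handed to separate processes or jobs at the system level, as described above.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous parts of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def screen_chunk(chunk, weights, bias):
    """Score each compound with a linear model; keep those scoring above zero."""
    hits = []
    for compound_id, counts in chunk:
        score = bias + sum(weights.get(frag, 0.0) * n for frag, n in counts.items())
        if score > 0:
            hits.append((compound_id, score))
    return hits


def screen_library(library, weights, bias, n_workers=4):
    """Screen subsets of the library concurrently and merge the hits in order."""
    chunks = split_into_chunks(library, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda c: screen_chunk(c, weights, bias), chunks))
    return [hit for part in parts for hit in part]


# Toy library: (compound id, {fragment: occurrence count}); values are made up.
library = [
    ("a", {"O(C,H)": 1}),
    ("b", {"H(C)": 3}),
    ("c", {"O(C,H)": 2, "H(C)": 1}),
]
weights = {"O(C,H)": 1.0, "H(C)": -0.5}  # hypothetical model weights
hits = screen_library(library, weights, bias=-0.25, n_workers=2)
```

Because each subset is scored independently, the number of workers and the way results are collected can be changed without touching the scoring function, which is the kind of control over task division the text argues for.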

1.3.2 PCA-GA-SVM vHTS Drug Discovery Pipeline: The Signature Molecular Descriptor

2D molecular descriptors capture the 2D structure of a molecule with graphs, which are mathematical constructs that document how groups of objects are related to each other in a pairwise manner. A molecular graph considers atoms as nodes and bonds as edges of a network that captures all pairwise relationships (bonds) between nodes (atoms) (Todeschini and Consonni, 2008), as seen in Figure 1.1. Fragments can be created when edges or sections of the molecular graph are deleted. If only edges are erased, the resulting fragments of the molecular graph are also another representation of the 2D molecular structure. These fragments could also be interpreted as structural feature variables for use in different applications, such as the independent variables for creating QSARs with the occurrence count as the value for the variable (Duchowicz et al., 2008; Huang et al., 2013; Gharagheizi et al., 2013; Mattei et al., 2013). Topological indices are a class of molecular descriptors based on molecular graphs

that also record the topological environment in the graph (Todeschini and Consonni, 2008).

[Figure 1.1 image: molecular graph of acetic acid with eight numbered nodes (1 to 8).]

Figure 1.1: Molecular Graph Of Acetic Acid. The atoms are nodes (numbered circles) and the bonds are edges. By erasing the edges, the molecular graph can be fragmented.

For example, the molecular connectivity topological index, χ, documents the branching and molecular connectivity of carbon (Randic, 1975), and eventually all non-hydrogen atoms (Kier and Hall, 1976, 1977, 1981, 1990), in a molecule. One of the many 2D fragment-based topological indices (Devillers and Balaban, 2000) that systematically records the 2D structural formula of a molecule in a molecular descriptor is called Signature (Visco et al., 2002; Faulon et al., 2003b, 2004).

With a molecular graph, Signature deconstructs the molecule into fragments of a pre-determined size (Signature’s height) by designating an atom as the start (root atom) and following the bonds for a certain number of bonds (Signature’s height), without backtracking, while documenting all atoms and bonds encountered. The process is then repeated for all atoms in the molecule. A solitary fragment is an atomic Signature while the collection of fragments for the molecule is a molecular Signature. The number of times an atomic Signature appears in a molecule is the occurrence number of an atomic Signature. Since Signature changes as height changes,

heights for both atomic and molecular Signatures need to be declared when discussing the Signature of a molecule. To illustrate molecular structure fragmentation by Signature, acetic acid is fragmented into atomic Signatures in Figure 1.2.

Functionally, Signature reads chemical table files (extension .mol or .sdf), which contain the relative 3D location of atoms and their pairwise relationships with other atoms, to create molecular graph fragments. The fragments’ information is recorded in three files. The first two are used as a pair, one a matrix of occurrence numbers and the second a list of corresponding atomic Signatures, while the third is a dictionary of atomic Signatures and occurrence numbers for individual molecules. Depending on the application, either the first two or the third may be more useful.

Signature enjoys many advantages rooted in its origins as a computer-aided structural elucidation tool (Faulon, 1994). Signature was used to create the structural fragments and bonds detailed in the data, construct 2D structures containing the identified fragments and bonds, and predict physicochemical properties among other uses (Faulon, 1994) to efficiently elucidate molecular structure from analytical data.

The first advantage to using Signature is its ability to map molecular topography completely. Atomic Signatures are constructed using all atoms in the molecule as the root atom (Visco et al., 2002; Faulon et al., 2003a,b, 2004). Therefore, all bonds and atoms are captured, unlike fragmentation methods that ignore the bond between functional groups and the rest of the molecule (Joback, 1984; Joback and Reid, 1987), and can be used to create QSARs.

The second advantage to using Signature is that it is a canonical representation of molecules. All other 2D molecular descriptors can be derived from Signatures (Visco et al., 2002; Faulon et al., 2003b). Therefore, any 2D molecular descriptor beneficial to QSAR performance can be included as well, since all of them could be expressed in terms of Signature.
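The relationship between the per-molecule dictionary and the paired matrix/list outputs described above can be sketched in a few lines. This is only an illustration and not the file format written by the Signature software: the function name is hypothetical, the acetic acid entry copies the height-1 molecular Signature from Figure 1.2, and the methanol entry is hand-written for the example.

```python
def build_occurrence_matrix(molecular_signatures):
    """Return (columns, matrix) from {molecule: {atomic_signature: count}}.

    columns is the sorted list of all atomic Signatures seen; each matrix row
    holds one molecule's occurrence numbers in that column order.
    """
    columns = sorted({sig for counts in molecular_signatures.values() for sig in counts})
    matrix = [
        [counts.get(sig, 0) for sig in columns]
        for counts in molecular_signatures.values()
    ]
    return columns, matrix


molecules = {
    # Height-1 molecular Signature of acetic acid, as listed in Figure 1.2.
    "acetic acid": {"C(C,H,H,H)": 1, "C(C,=O,O)": 1, "O(=C)": 1,
                    "O(C,H)": 1, "H(C)": 3, "H(O)": 1},
    # Hand-written height-1 fragments for methanol (illustrative only).
    "methanol": {"C(O,H,H,H)": 1, "O(C,H)": 1, "H(C)": 3, "H(O)": 1},
}

columns, matrix = build_occurrence_matrix(molecules)
```

Atomic Signatures absent from a molecule receive an occurrence number of zero, so every molecule is described by the same set of variables, which is what a QSAR model needs as input.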

Height=1:
Atomic Signature for C*: C(C,=O,O)
Molecular Signature: 1 C(C,H,H,H) + 1 C(C,=O,O) + 1 O(=C) + 1 O(C,H) + 3 H(C) + 1 H(O)

Height=2:
Atomic Signature for C*: C(C(H,H,H),=O,O(H))
Molecular Signature: 1 C(C(=O,O),H,H,H) + 1 C(C(H,H,H),=O,O(H)) + 1 O(=C(C,O)) + 1 O(C(=O,C),H) + 3 H(C(C,H,H)) + 1 H(O(C))

Figure 1.2: Signature Fragmentation Example. Beginning at the root atom (starred carbon), document atoms and bonds until a pre-determined distance (height) without backtracking. Height=0 atomic Signature contains only the root atom; height=1 atomic Signatures (solid arrows) include the primary atomic neighbors and bonds; height=2 atomic Signatures (dashed arrow) include secondary atomic neighbors and bonds. Documentation of a single root atom is the atomic Signature; documentation for the molecule is the molecular Signature. Reproduced from Chen, J and Visco, D.P. "Identifying novel factor XIIa inhibitors with PCA-GA-SVM developed vHTS models." European Journal of Medicinal Chemistry; 140:31-41. Copyright © 2017 Elsevier Masson SAS. All rights reserved.
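The fragmentation in Figure 1.2 can be reproduced with a short recursive walk over the molecular graph. The sketch below is a simplified illustration rather than the Signature software itself: the atom numbering, the function names, and in particular the alphabetical ordering of neighbors are assumptions, so the strings it produces contain the same atoms and bonds as those in the figure but may list neighbors in a different order than the canonical Signature does.

```python
from collections import Counter

# Acetic acid, CH3-COOH. Atom indices are arbitrary labels for this sketch.
ATOMS = {1: "C", 2: "C", 3: "O", 4: "O", 5: "H", 6: "H", 7: "H", 8: "H"}
# Bonds as {(atom, atom): bond order}; atom 2 is the carbonyl carbon.
BONDS = {(1, 2): 1, (2, 3): 2, (2, 4): 1, (1, 5): 1, (1, 6): 1, (1, 7): 1, (4, 8): 1}


def neighbors(atom):
    """Yield (neighbor, bond order) for every bond that touches atom."""
    for (a, b), order in BONDS.items():
        if a == atom:
            yield b, order
        elif b == atom:
            yield a, order


def atomic_signature(root, height, parent=None, order=1):
    """Describe the environment of root out to height bonds, never stepping back."""
    label = ("=" if order == 2 else "") + ATOMS[root]
    if height == 0:
        return label
    children = sorted(
        atomic_signature(n, height - 1, parent=root, order=o)
        for n, o in neighbors(root)
        if n != parent  # no backtracking along the bond just followed
    )
    return label + ("(" + ",".join(children) + ")" if children else "")


def molecular_signature(height):
    """Occurrence number of each atomic Signature over all root atoms."""
    return Counter(atomic_signature(atom, height) for atom in ATOMS)


height1 = molecular_signature(1)
height2 = molecular_signature(2)
```

With this ordering the carbonyl carbon is written as `C(=O,C,O)` at height 1, the same fragment that Figure 1.2 writes as C(C,=O,O). At either height the occurrence numbers sum to eight, one atomic Signature per atom.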

The third advantage to using Signature is its tunability. Height, the parameter that controls fragment size, also controls the degeneracy of molecular Signatures: as height increases, atomic Signatures increase in specificity (Faulon et al., 2003a, 2005). The ability to control atomic Signature fragment size was used in the pipeline to maximize data utility and control the level of extrapolation. Some information is best described by lower height atomic Signatures while other information is best described by higher height atomic Signatures. When constructing models, the most descriptive variables should be used when possible. Therefore, the molecules used in this approach, from the training data and the PubChem Compound database, are fragmented at heights 0, 1, and 2 so that models can test different sized fragments containing the same structural feature and determine if one is more descriptive than the others and can yield better models and better predictions.

Atomic Signature height can also be used to tune extrapolation. Height controls specificity by controlling the size of the atomic Signature fragments. When multiple atomic Signature heights are used, extrapolation can be finely tuned and limited to affect only the higher level heights, as any extrapolation at the lower height atomic Signatures would necessarily propagate to the higher height atomic Signatures. Therefore, by choosing an appropriate level of similarity, extrapolation can be limited to only the higher level atomic Signatures or free to impact atomic Signatures across multiple heights.

These three advantages of Signature have allowed it to perform well in QSAR applications. Signature-based QSARs are as accurate as those created with other molecular descriptors while also reducing correlations between structural feature variables (Visco et al., 2002; Faulon et al., 2003b). Previous industrial applications include the molecular design and property optimization of foam blowing agents (Weis et al., 2005), solvents (Weis and Visco, 2010), and reaction feeds (Chemmangattuvalappil et al., 2010; Chemmangattuvalappil and Eden, 2013; Dev et al., 2014). Biological applications include identifying small molecule ligands (Faulon et al., 2003b; Weis et al., 2008; Li et al., 2014a; Chen and Visco Jr, 2016, 2017), peptide ligands (Churchwell et al., 2004), and modeling protein-protein interactions (Martin et al., 2005).

1.3.3 PCA-GA-SVM vHTS Drug Discovery Pipeline: The Algorithms Chosen

The four considerations and limitations made thus far were (1) screen all molecules in PubChem Compound, (2) use a ligand-based methodology, (3) create and train models in a specialized mathematics environment but conduct the screen in the scripting environment, and (4) use the Signature molecular descriptor. These considerations left many options for how to build a pipeline. However, the pipeline required two main components: a variable/feature selection algorithm and a modeling algorithm. Eventually, we arrived at PCA and GA for feature selection as a filter and wrapper method, respectively, and SVM for model creation. The reason why each of these three algorithms was selected will be discussed along with how their interactions create and train models.

Principal Component Analysis (PCA)

PCA is a data transformation technique to create observations with linearly uncorrelated variables, called principal components, from possibly correlated ones (Jolliffe, 2014). In data with variance, the axis of largest variance, second largest variance, etc. may not match the axis corresponding to the original variables used. To better describe the variance and to remove correlation between variables, the data can be linearly transformed such that the data is in terms of the axis of largest variance, second largest variance, etc. In this transformation, shown in Figure 1.3, the axes of variance are the principal components, and the transformed data are now in terms of the uncorrelated principal components.

[Figure 1.3 image: plot with axes Variable 1 and Variable 2 and overlaid axes Principal Component 1 and Principal Component 2.]

Figure 1.3: Principal Component Analysis (PCA). It is a transformation technique so that the data can be represented with linearly independent variables called principal components. However, physical significance may not be directly translatable/transferable.

The transformation is done mathematically with eigenvectors and eigenvalues. Each element in an eigenvector corresponds to the contribution of a specific variable to that eigenvector. The eigenvalue of a particular eigenvector divided by the sum of all eigenvalues is the relative variance captured by the corresponding eigenvector.

Traditionally, PCA has been used for variable (dimensionality) reduction. The first principal component captures most of the variance, and each following principal component captures increasingly smaller amounts of variance. When the variance captured reaches negligible amounts, only the preceding principal components are necessary for explaining the variance in the original data.

While the data variables are now linearly independent, the transformation distorts the physical meaning of different variables. Data is usually collected in terms of a physical variable with physical significance. When the variables are combined to create the principal components, the physical significance of the variables used is usually not readily transferable to the principal component.

In this pipeline, the physical significance of atomic Signatures is important and must be retained. However, the reduction in variables by PCA is also very desirable as a filter method. To benefit from the removal of extraneous variables and circumvent the distortion of the variables' physical significance, the pipeline conducts PCA but does not use the principal components directly. Instead, the pipeline examines how atomic Signatures contribute to the construction of each principal component weighted by the eigenvalue, or the relative amount of variance captured by that principal component. The contributions are then summed up, and the atomic Signatures that contribute the most to capturing variance are selected for further use. In this way, PCA was still used to reduce the number of atomic Signatures without distorting the physical significance of any variables. The way the pipeline uses PCA to analyze atomic Signatures is shown in Figure 1.4.
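The eigenvalue-weighted ranking described above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the function name is hypothetical, the eigen-decomposition is taken on the covariance matrix, and absolute values of the eigenvector elements are used so that the arbitrary sign of an eigenvector does not cancel contributions, a detail the text does not specify.

```python
import numpy as np


def rank_signatures_by_pca(X, n_keep):
    """Rank the columns of X by eigenvalue-weighted contribution to the PCs.

    X is an (observations x variables) occurrence matrix. Returns the indices
    of the n_keep highest-scoring columns and the score of every column.
    """
    cov = np.cov(X, rowvar=False)            # np.cov centers each column itself
    eigvals, eigvecs = np.linalg.eigh(cov)   # column j of eigvecs pairs with eigvals[j]
    weights = eigvals / eigvals.sum()        # relative variance captured per component
    scores = np.abs(eigvecs) @ weights       # sum over j of |A_ij| * v_j for variable i
    ranked = np.argsort(scores)[::-1]        # highest score first
    return ranked[:n_keep], scores


# Synthetic stand-in data: three variables with very different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.01])
keep, scores = rank_signatures_by_pca(X, n_keep=2)
```

The columns that are kept are original variables rather than principal components, so each retained column still corresponds to a single atomic Signature with its physical meaning intact.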

Genetic Algorithm (GA)

After selecting for the atomic Signatures that contribute the most to variance capture, the selected atomic Signatures are used by the GA, an optimization algorithm based on natural selection (Whitley, 1994). Each variable is treated as a gene that may or may not be present in a chromosome, which is a random combination of genes. GA creates a population of chromosomes that are then evaluated with an objective function to be optimized. The best chromosomes are selected and retained

to serve as the basis to create the next generation of chromosomes by way of genetic operations such as reproduction, crossover, and mutation.

\[
\underbrace{\begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
A_{n,1} & A_{n,2} & \cdots & A_{n,n}
\end{bmatrix}}_{\text{Eigenvectors/Principal Components}}
\underbrace{\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}}_{\text{Eigenvalues}}
=
\underbrace{\begin{bmatrix}
\sum_{j=1}^{n} A_{1,j} v_j \\
\sum_{j=1}^{n} A_{2,j} v_j \\
\vdots \\
\sum_{j=1}^{n} A_{n,j} v_j
\end{bmatrix}}_{\text{Weighted Contributions of atomic Signatures}}
\]

Figure 1.4: PCA Usage In Pipeline. All atomic Signatures contribute to constructing the principal components, but some atomic Signatures contribute more than others. The weighted contributions of each atomic Signature in all principal components were calculated to identify which atomic Signatures contributed the most to capturing variance and should be used in models to describe variance.

The process is repeated until the best model emerges (i.e., the objective function is perfectly optimized or reaches the desired value) or improvement has plateaued (i.e., no improvement over many generations). The entire process is shown in Figure 1.5.

GA was selected as a wrapper method because it can work with the models to test many different atomic Signature combinations and under varying circumstances: the

different atomic Signature combinations are the chromosomes created in each generation while the varying circumstances arise from the retention of the best chromosomes.

[Figure 1.5 image: ten-bit binary chromosomes (e.g., 1 0 0 1 1 0 0 0 0 1) shown with scores in panels labeled Chromosome, Score, Evaluation, Selection, Reproduction, Cross-Over, and Mutation.]

Figure 1.5: Genetic Algorithm (GA). The retained atomic Signatures are randomly combined to form chromosomes that are evaluated based on an objective function. Ones in the chromosome indicate the inclusion of the corresponding atomic Signature while the zeros indicate the exclusion of corresponding atomic Signatures. The best chromosomes serve as the basis for the next generation following genetic operations such as reproduction, crossover, and mutation.

Aside from serving as the genesis of the new chromosomes of each generation, the best chromosomes are also retained and passed on unchanged into the next generation for further testing. If the modeling algorithm contains aspects of randomness, the best chromosomes can be tested multiple times under varying circumstances and attempt to produce more predictive models. In the case of this pipeline, the models are generated with SVM with some randomized parameters. Therefore, the best chromosomes

will be tested several times under perturbation conditions to ensure model robustness and prediction ability.
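The generational loop described above, including the retention of the best chromosomes, can be sketched in plain Python. This is a toy illustration under assumptions: the function name, population size, mutation rate, and single-point crossover are all hypothetical choices, and the objective here is a stand-in that counts matches to an arbitrary target, whereas the pipeline scores each chromosome with an SVM model.

```python
import random


def run_ga(score, n_genes, pop_size=20, n_elite=2, generations=30, p_mut=0.05, seed=0):
    """Evolve binary chromosomes toward a higher score; return the best one found."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)          # evaluation
        elite = [ch[:] for ch in pop[:n_elite]]    # best chromosomes pass on unchanged
        parents = pop[: pop_size // 2]             # selection: keep the better half
        children = []
        while len(children) < pop_size - n_elite:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)        # single-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ 1 if rng.random() < p_mut else g for g in child]  # mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=score)


# Toy objective: number of genes that match an arbitrary target chromosome.
target = [1, 0, 0, 1, 1, 0, 0, 0, 0, 1]


def toy_score(chromosome):
    return sum(g == t for g, t in zip(chromosome, target))


best = run_ga(toy_score, n_genes=len(target))
```

Because the elite chromosomes are copied forward unchanged, the best score in the population can never decrease from one generation to the next. With a noisy objective, the same mechanism lets a good chromosome be re-evaluated in later generations.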

Support Vector Machines (SVM)

To evaluate the atomic Signature combinations encoded by each chromosome, an SVM model was created with the atomic Signatures specified in the chromosome. SVM is a supervised machine learning algorithm that builds classification or regression models depending on the kind of data presented during training (Cortes and Vapnik, 1995). While functionally the same regardless of what information it is given, with the separating hyperplane dividing or characterizing the data, SVM is best understood from the perspective of a classifier. SVM classifiers plot data in space and try to maximize the separation region, called the margin, between the two classes, illustrated in Figure 1.6. SVM uses as many dimensions as there are variables to obtain the best separation possible between the two classes and uses the “kernel trick” to operate in that higher dimensional space without having to transform the data as the dot product of the specific functions used is the same in low and higher dimensions (Schölkopf, 2001). The outside edges of the margins are defined by support vectors that are anchored by the data points at the periphery of the margins. The separating hyperplane is in the center of the margin. The shape of the hyperplane and the margins on either side of it depend on the basis function used (e.g., linear, Gaussian, polynomial, hyperbolic, etc.).

While attempting to maximize the margin, SVM also has to maximize classification accuracy, incurring penalties when misclassification occurs, as determined by the cost parameter. Depending on the cost associated with misclassification, the same data can have many different separating hyperplanes and margins. If the cost for misclassification is high, then the margins and separating hyperplane are narrower

and oriented to avoid misclassification. If the cost for misclassification is low, then the margins and separating hyperplane are oriented such that data encroachment into the margins is allowed in order to have a wider margin.

[Figure 1.6 image: two classes of data with labeled Support Vectors, Separating Hyperplane, and Margin.]

Figure 1.6: Support Vector Machines (SVM). It is a modeling approach that tries to maximize the margin between two classes. The margins are support vectors that are defined by the data at the periphery of the margin. In the middle of the margin is the separating hyperplane. Depending on the cost associated with misclassification, the separating hyperplane may assume different orientations for the same data to maximize the margin and minimize the penalties associated with misclassification or data intrusion in the margins.

SVM was chosen to create vHTS models because it is a machine learning algorithm that can create predictive models while avoiding overfitting and has the flexibility to

create nonlinear models. The ability to iteratively train better performing models is central to creating the most predictive models possible. However, overfitting is a significant concern because when it occurs, predictive power decreases. Supervised machine learning methods are prone to overfitting, sometimes because a non-linear method is used when a linear one is acceptable (Hawkins, 2004). SVM can avoid this by selecting a linear kernel, as was done in the pipeline, to result in a linear model. Furthermore, SVM is inherently less prone to overfitting (Burges, 1998) and has a history of creating predictive models. Finally, if the linear models perform poorly, a different kernel can always be chosen to create a non-linear SVM model, and overfitting due to the unnecessary use of a non-linear model should not occur.

SVM models created from the atomic Signatures specified in the chromosomes are evaluated with cross-validation accuracy or error. Cross-validation is a method of handling data where data is segregated into training-only data and evaluation-only data. As the data used for evaluation is not used in training, cross-validation accuracy can be interpreted as a measure of the predictive power of a model when applied to new data. As predictive power is the desired, optimized attribute, it is used as the evaluation metric that is returned to the GA as the function to be optimized. In this way, a feedback loop is created between GA and SVM that allows GA to function as a wrapper method for SVM model creation.

To summarize the modeling process, 2D molecular structure information of a dataset is converted into atomic Signature fragments using the Signature molecular descriptor. The atomic Signatures are then filtered for the ability to capture variance by examining how they are used to create principal components in PCA. The filtered atomic Signatures are passed to the GA to stochastically and iteratively create different atomic Signature combinations in the form of chromosomes. Each chromosome from GA indicates which atomic Signatures to include in the SVM model creation, and the score, characterizing the predictive ability of the resulting SVM model created from the chromosome, is returned to the GA.

A previously considered approach was to create all possible model variants to systematically test and identify the globally optimum model(s). This meant creating models with all atomic Signature combinations for all cost values. However, this was later rejected due to the significant expenditure of resources for relatively little gain. To illustrate this, consider a situation where 100 different atomic Signatures were identified by PCA and 100 different cost values were considered in creating the SVM model. The systematic construction and evaluation of all possible models results in approximately 10^32 models for the proposed scenario (2^100 atomic Signature combinations × 100 cost values ≈ 1.3 × 10^32), which is much more than can be examined in a reasonable amount of time. Moreover, analysis of model performance versus cost values indicates most computational resources are spent on poor models stemming from a less optimal cost value. Therefore, it is much more efficient to stochastically create and train locally optimum models rather than spending most of the available resources trying to identify the globally optimum model.
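The scoring step in the GA-SVM feedback loop, where a chromosome selects columns and the cross-validation accuracy of the resulting linear SVM is returned as its score, can be sketched with scikit-learn. This is an assumption-laden illustration: the text does not name the software used for model training, so scikit-learn is a stand-in, the function name and the five-fold setting are hypothetical, and the data are synthetic rather than assay results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


def score_chromosome(chromosome, X, y, cost=1.0, folds=5):
    """Mean cross-validation accuracy of a linear SVM on the selected columns."""
    selected = np.flatnonzero(chromosome)   # indices of genes set to 1
    if selected.size == 0:
        return 0.0                          # an empty chromosome cannot build a model
    model = SVC(kernel="linear", C=cost)    # linear kernel; C is the cost parameter
    return float(cross_val_score(model, X[:, selected], y, cv=folds).mean())


# Synthetic stand-in for an occurrence matrix and activity labels (not real data).
X, y = make_classification(n_samples=120, n_features=10, n_informative=4, random_state=0)
chromosome = [1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
accuracy = score_chromosome(chromosome, X, y)
```

A function of this shape could be passed to a GA as its objective. Since each held-out fold is never used to fit the model that predicts it, the returned value estimates performance on new data rather than fit to the training data.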

1.3.4 PCA-GA-SVM vHTS Drug Discovery Pipeline: The Rest Of The Pipeline

The modeling process is just one part of the pipeline, which is depicted in Figure 1.7. The experimental data used to create and train the models come from previous HTS and characterization experiments. While such data are collected primarily for other purposes, such as identifying hits or the chemical identity of new substances, their utility can be extended when they are combined with a “big data” database (Yan et al., 2006) like PubChem (Wang et al., 2014; Kim et al., 2015), ChEMBL (Gaulton et al., 2012), DrugBank (Wishart et al., 2006), and ZINC (Sterling and Irwin, 2015).

[Flowchart for Figure 1.7. Step 1: identify dataset; remove PAINS. Step 2: assemble the SVM-R training set (known IC50 only) and the SVM-C training set (all compounds); PCA: create a signature pool for each; GA-SVM-R and GA-SVM-C model creation. Step 3: screen compound database (PubChem); overlap: select compounds most similar to training set; remove PAINS in results; select compounds with highest confidence (SVM-C) and activity (SVM-R). Step 4: test identified compounds; improve models?]

Figure 1.7: Pipeline Process Flowchart. The pipeline contains four main steps: (1) identify a target dataset and remove PAINS and structurally similar compounds, (2) train classification and activity prediction models using the PAINS-free dataset, (3) screen a compound library with the models in a vHTS (NCBI’s PubChem Compound Database in this work), and (4) experimentally validate predicted hits. Molecules from the dataset and libraries were transformed into Signatures before input into the pipeline. Adapted from Chen, J. and Visco, D.P. “Identifying novel factor XIIa inhibitors with PCA-GA-SVM developed vHTS models.” European Journal of Medicinal Chemistry; 140:31-41. Copyright © 2017 Elsevier Masson SAS. All rights reserved.

Pan Assay Interference Compounds (PAINS)

Before creating and training models with the data, it is essential to discuss the existence of pan-assay interference compounds (PAINS) (Baell and Holloway, 2010) and other related compounds. These compounds interfere with the reading of assay results (e.g., compounds that are fluorescent, luminescent, absorbent, etc.), interact with receptors or each other in an irregular manner unrelated to binding (e.g., aggregation, redox, etc.), or have non-specific (i.e., promiscuous) interactions, and thus present confounding factors to the data analysis. Though there are experimental protocols to discern whether observed activity is due to interaction with the target protein or to the irregular interactions that define PAINS, PAINS and similar structures were removed from both the training data and the candidates considered for experimental validation, out of an abundance of caution, to prevent confounding factors from interfering with later analysis. PAINS and other similarly structured compounds were identified using the “Patterns defined in PAINS” filter from ZINC15 (Sterling and Irwin, 2015).

Evaluation Metrics

The modeling process uses the PAINS-free data to create and train models for vHTS. Evaluation of the models occurs throughout the process for different purposes. To identify the best performing model during the training and to evaluate model performance after the experimental validation of hits, a priori and a posteriori metrics are used. A priori metrics are used to evaluate models and identify which ones should be implemented in vHTS. This pipeline used cross-validation and training accuracy to identify the best vHTS model, with cross-validation accuracy of primary importance and training accuracy secondary. Both accuracies are defined as (Weis et al., 2008):

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

where “TP” and “TN” mean true positive and true negative, respectively, while “FP” and “FN” mean false positive and false negative, respectively. After experimental validation of hits, the precision, or hit rate, of the vHTS was determined to evaluate the performance of the models used. Precision is defined as (Weis et al., 2008):

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

where “TP” and “FP” mean true positive and false positive, respectively.

Another critical metric used, structural similarity, does not evaluate the models themselves but measures the amount of extrapolation required to go from compounds in the training set to the candidate. The predictive power of a model decreases as the amount of extrapolation increases (Weis et al., 2008). Of the many different similarity metrics available (Chen and Reynolds, 2002), the overlap metric ($\Omega$) was selected and is defined as (Weis et al., 2008):

\[ \Omega = \frac{x_{[\min,\max]}}{x_{\mathrm{total}}} \]

where $x_{[\min,\max]}$ is the number of unique atomic Signatures in the compound whose occurrences fall within the minimum and maximum occurrences observed in the training set. The overlap metric is based on the set-theoretic definition of the Tanimoto Coefficient (Chen and Reynolds, 2002).
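The three metrics defined above can be sketched in a few lines. This is a minimal illustration, not the pipeline's implementation. It assumes that $x_{\mathrm{total}}$ is the number of unique atomic Signatures in the candidate and that a Signature absent from the training set counts as out of range. The function names, the dictionary layout, and the Signature labels are all hypothetical.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted actives that are truly active (the hit rate)."""
    return tp / (tp + fp)


def overlap(candidate_counts: dict, training_ranges: dict) -> float:
    """Overlap metric: share of the candidate's unique Signatures whose
    occurrence count lies within the [min, max] range seen in training.

    candidate_counts: {signature: occurrences in the candidate}
    training_ranges:  {signature: (min occurrences, max occurrences)}
    """
    if not candidate_counts:
        raise ValueError("candidate has no atomic Signatures")
    in_range = 0
    for signature, count in candidate_counts.items():
        # Assumption: a Signature never seen in training is out of range.
        if signature in training_ranges:
            low, high = training_ranges[signature]
            if low <= count <= high:
                in_range += 1
    return in_range / len(candidate_counts)


if __name__ == "__main__":
    ranges = {"sig_a": (0, 3), "sig_b": (1, 2), "sig_c": (0, 1)}
    candidate = {"sig_a": 2, "sig_b": 4, "sig_d": 1}
    print(accuracy(tp=40, tn=45, fp=5, fn=10))
    print(precision(tp=3, fp=13))
    print(overlap(candidate, ranges))
```

Under these assumptions, an overlap of 1 means that every atomic Signature in the candidate occurs within the range seen in the training set, so no extrapolation to unseen fragments is required.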

Safeguards Against Overfitting

It should be noted that many different pipeline features were implemented to combat overfitting. The pipeline used SVM as a modeling method, which is less prone to overfitting (Burges, 1998), and a linear kernel, which prevents overfitting due to the implementation of a nonlinear model when a linear model is sufficient (Hawkins, 2004). Additionally, PCA and GA are filter and wrapper methods, respectively, that removed atomic Signatures with little or no contribution to variance capture. Furthermore, GA vigorously attempted to identify better-performing atomic Signature combinations and tested the best atomic Signature combinations for robustness. Finally, the separation of data into training-only and evaluation-only sets for cross-validation means any model overfitted to the training data should have poor cross-validation accuracy. In summation, the inclusion of all these pipeline features should minimize the chance of overfitting.
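The separation into training-only and evaluation-only data can be illustrated with a k-fold split. The text does not state which cross-validation scheme the pipeline used, so the k-fold scheme, the fold count, and the function name below are assumptions made only for illustration.

```python
import random


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (training, evaluation) index lists for k-fold cross-validation.

    Every sample is used for evaluation exactly once and never appears in
    the training list of the split that evaluates it.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        evaluation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, evaluation


if __name__ == "__main__":
    # 98 compounds, as in PubChem Bioassay AID 825; 5 folds is an arbitrary choice.
    for training, evaluation in k_fold_indices(98, 5):
        print(len(training), len(evaluation))
```

Because each evaluation fold is withheld from training, a model that merely memorizes its training compounds gains nothing on the withheld fold, which is why overfitting should appear as poor cross-validation accuracy.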

1.3.5 PCA-GA-SVM vHTS Drug Discovery Pipeline: Determining Effectiveness

After the pipeline was developed, the proof-of-concept was the identification of Cathepsin L inhibitors (Chen and Visco Jr, 2016), which will be discussed in the first case study presented. After the proof-of-concept demonstrated the efficacy of the pipeline, a series of datasets and protein systems were identified to test the robustness of the pipeline and its ability to accommodate datasets with different properties. While some datasets were also chosen to test specific applications (e.g., different datasets of the same target, similar targets, activators in contrast to inhibitors, etc.), the datasets were chosen for two AIMs: (1) determine the effect of varying amounts of data and (2) determine the effect of the distribution of data (i.e., actives vs. inactives). A third AIM was to determine the effect of retraining (training the models again with a dataset updated with newly available data from experiments or other sources) on vHTS model performance and whether other dataset particularities could affect retraining.

Only datasets examining protein/ligand systems were chosen to avoid time, availability, expertise, and facilities constraints. The title, compound statistics, and description were extracted from the XML file of each PubChem Bioassay dataset to identify assays for further consideration. In general, any datasets whose titles contained specific keywords (e.g., cells, culturing, AIDS) were removed from contention. The remaining datasets were examined for protocol compatibility and for whether they could test an aspect of the pipeline. The datasets were chosen such that only one dataset variable, size or distribution, was changed while the other was held within a specified range: an active fraction between 0.4 and 0.6 for AIM 1 and a size of 100 to 1000 compounds for AIM 2. This was to isolate the effect of each variable as much as possible to facilitate overall trend analysis. Additionally, the datasets were chosen before consideration of PAINS, so a dataset that initially met a criterion may not ultimately have met it after PAINS removal. Furthermore, there were many difficulties deciphering protocols, which were often missing details and required trial and error to elucidate. Eventually, the use of many datasets was discontinued due to the inability to resolve the testing protocol. The datasets ultimately chosen are detailed in Table 1.1.
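The two holding ranges described above can be expressed as a simple check. The sketch below is illustrative only: the function name is hypothetical, and treating the range boundaries as inclusive is an assumption. The example uses the compound counts reported for PubChem Bioassay AID 825 in Section 2.2 (98 compounds, 48 actives).

```python
def aims_addressed(n_compounds: int, n_actives: int):
    """Return the active fraction and the AIMs whose holding range is met."""
    fraction = n_actives / n_compounds
    aims = []
    # AIM 1 varies size while holding the active fraction near balance.
    if 0.4 <= fraction <= 0.6:
        aims.append("AIM 1")
    # AIM 2 varies the active fraction while holding size between 100 and 1000.
    if 100 <= n_compounds <= 1000:
        aims.append("AIM 2")
    return fraction, aims


if __name__ == "__main__":
    fraction, aims = aims_addressed(n_compounds=98, n_actives=48)
    print(round(fraction, 3), aims)
```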

AIM 1: Effect of Varying Dataset Size

The datasets chosen were limited to those whose active fraction was within the range of 0.4 and 0.6 to isolate the effect of size. All datasets with fewer than 10 compounds used cell culture or animal models for study and were excluded. Other datasets of 100 compounds or more either did not have the correct active fraction or had an unresolvable protocol. Ultimately, only three datasets were available to evaluate AIM 1.

AIM 2: Effect of Varying Dataset Active:Inactive Distribution

The datasets chosen were limited to those whose overall size was between 100 and 1000 compounds to isolate the effect of active to inactive distribution. The datasets chosen spanned the entire active fraction range from 0-0.2 to 0.8-1, with one dataset in each of the five active fraction ranges.

Table 1.1: Dataset Assignments to AIM 1 and AIM 2. Datasets were chosen to test a specific pipeline aspect. Note that AID 2533 was a dataset that fulfilled the criteria for both AIM 1 and AIM 2.

AIM 1: Varied Size                          AIM 2: Varied Distribution
(Active Fraction Between 0.4 and 0.6)       (Size Between 100 and 1000 Compounds)
Size        PubChem Bioassay                Active Fraction   PubChem Bioassay
10-100      AID 728                         0-0.2             AID 787
10-100      AID 825                         0.2-0.4           AID 846
100-1000    AID 2533                        0.4-0.6           AID 2533
                                            0.6-0.8           AID 624322
                                            0.8-1             AID 1540

Additionally, it should be noted that supervised machine learning methods, like SVM, tend to predict the majority class in cases where one class significantly outnumbers the other (Barandela et al., 2003). This reduces training error, since the training data indicate an expectation of more members in the more numerous class. However, it also leads to more misclassifications of the minority class (false positives or false negatives, depending on which class is the minority) and decreases the efficacy of the pipeline. Therefore, the trend observed across the AIM 2 datasets will indicate the type of behavior to be expected under different circumstances and specify the conditions under which the pipeline might fail.

AIM 3: Effect of Retraining

All case studies underwent two rounds of vHTS and experimental validation of identified candidates to determine the effect of retraining. An increase in efficacy was observed (Chen and Visco Jr, 2016) in the proof-of-concept case study with Cathepsin L, and testing the effect of retraining in varied circumstances and systems should determine whether the increase in performance is a feature of the pipeline or an artifact to be disregarded. Additionally, to determine whether dataset particularities may affect retraining, various dataset and system conditions will be considered, such as similarity, the amount of data available, the active/inactive classification distribution, and the number of compounds tested.

CHAPTER II PCA-GA-SVM CASE STUDIES

The efforts to test the pipeline’s ability to respond to datasets of varying attributes resulted in the seven different case studies presented in this chapter. The procedural information (e.g., treatment of the dataset selected for each study, how the pipeline works, how data is used to generate models, etc.) is very similar between case studies and will be presented in the first instance only unless changes were made later. The case studies will focus on the presentation of new findings, the results and corresponding data analysis, and the biological impact of the receptor that is the focus of each dataset. A short sample case study is presented before the seven actual case studies to introduce the format and the information that can be expected in each part of a case study.

2.1 Sample

When the case study is introduced, the dataset selected is presented first. Various aspects regarding the size, active fraction, and other notable features of a dataset, as well as which AIMs the dataset is suited to address, are described and discussed to clarify why the specific dataset was chosen. Then, the background information about the biological target of the dataset is presented. Physiological, biological, chemical, and pharmaceutical information is reviewed and summarized to provide an overview of the work done studying the subject of the dataset and how the work presented in the case study will fit into the broader context.

2.1.1 Sample: vHTS and Experimental Validation Results

Following the introduction, more details of the dataset used for the model creation process are summarized. Specific features of the dataset are presented along with the results of classification and activity prediction model creation. The following information is presented: the number of active compounds, the number of inactive compounds, the number of atomic Signatures, the number of atomic Signatures identified by PCA for use in model creation, the number of best classification models, the number of activity prediction models created and trained, the training errors of the best models, and the cross-validation errors of the best models.

The data is processed before use in model training. For classification, the labels in the dataset were used directly. For activity prediction, the activity data spans multiple orders of magnitude and was transformed logarithmically to better suit modeling. Before the second round vHTS for cathepsin L inhibitors in the cathepsin L case study, it was found that the SVM regression algorithm scales data such that the mean is zero and the standard deviation is one, even if the scaling parameter is set to “false”. To avoid confusion and increase control over the modeling process, from the second round vHTS for cathepsin L inhibitors forward the activity data is first logarithmically transformed, then scaled so the mean is zero and the standard deviation is one.

To confirm the best models were selected, aside from minimizing training and cross-validation errors, receiver operating characteristic (ROC) curves were created. When the SVM-C classification models are used to predict the classes of the compounds in the training data, the predictions can be in the form of an objective function value. The ROC curve is constructed from the compounds’ objective function values and true classes by plotting how the true and false positive rates vary as different threshold values are used to sort compounds into the positive and negative classes. A model that perfectly predicts the dataset compounds’ classes has a ROC curve with a point at the (1,1) location. When the false positive rate is equal to the true positive rate (i.e., when y=x), any prediction from the model is only as good as a random class assignment. Therefore, the farther a curve is from the y=x line, the more predictive a model likely is. The area under the curve (AUC) is a quantitative measure of the ability of the models to correctly predict that a candidate is active and can similarly be interpreted as how far the curve is from the y=x line and how close the curve is to the (1,1) point.

After model evaluations are presented, candidates for activity validation experiments are selected from the vHTS results. Due to resource constraints, only a limited number of candidates can be selected. The criteria used to identify which candidates to test experimentally focus on the confidence in classification (magnitude of objective function value), predicted activity, similarity to the dataset as a whole, economic constraints, and commercial availability. The first three (classification confidence, predicted activity, and similarity) are subject to the dataset used and the corresponding models, while the last two are subject to fluctuations in market conditions. The predictions must be verified by testing, and, due to the lack of resources and synthesis expertise, candidates can only be tested if they are commercially available. Of the commercially available candidates, it is preferable to test more candidates for a better sense of the efficacy of the pipeline. Therefore, candidates that were cheaper at the requisite amount were prioritized over more expensive alternatives.
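The ROC construction described earlier in this section (sweeping a threshold over the objective function values and recording the true and false positive rates) can be sketched as follows. This is a minimal illustration with hypothetical scores and labels, not the code used in the pipeline. It plots the false positive rate on the horizontal axis, in which case a perfect model passes through (0, 1). Tools that plot specificity on the horizontal axis place the same point at (1, 1).

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the decision threshold across the distinct scores.

    scores: objective function values (higher = more confidently active)
    labels: true classes, 1 for active and 0 for inactive
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    scores = [2.5, 1.2, 0.3, -0.4, -1.8]  # hypothetical SVM-C values
    labels = [1, 1, 0, 1, 0]              # hypothetical true classes
    print(auc(roc_points(scores, labels)))
```

An AUC of 0.5 corresponds to the y=x line (random assignment), and an AUC of 1 corresponds to perfect separation of the two classes.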
Molport, a third-party distributor that catalogs inventories and prices for many different manufacturers, was used both to identify candidates that were available and to sort them according to price: given a SMILES (Simplified Molecular-Input Line-Entry System: a string notation that describes the chemical structure of a molecule) list of all candidates that met the modeling and similarity criteria, Molport was able to determine whether there were manufacturers with the candidate in inventory as well as the cost for each manufacturer. This list was first sorted by price amongst manufacturers of the same candidate, then by the price for each candidate. The least expensive candidates were then identified and purchased for activity validation.

The activity validation results are tabulated by presenting various details of the candidates chosen: candidate structure, PubChem CID number, predicted activity, and experimental activity. The specific experimental protocol used for each case study can be found in Appendix A.

The modeling and activity validation process occurs in two rounds to address AIM 3. The first round uses only the data found in the dataset to train the models. The second round retrains the models with the first round experimental data added to the original dataset, with the aim of increasing model performance.
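The two-level price sort described in this section (cheapest manufacturer per candidate, then candidates ordered by that price) can be sketched as below. The candidate labels, supplier names, and prices are hypothetical placeholders, and the function does not represent any Molport interface. It only illustrates the ordering logic applied to a list of offers.

```python
def cheapest_offers(offers):
    """Keep the cheapest offer for each candidate, then sort candidates by price.

    offers: iterable of (candidate, supplier, price) tuples
    returns: list of (candidate, supplier, price), cheapest candidate first
    """
    best = {}
    for candidate, supplier, price in offers:
        # First level: cheapest manufacturer for the same candidate.
        if candidate not in best or price < best[candidate][1]:
            best[candidate] = (supplier, price)
    # Second level: order the candidates by their cheapest price.
    rows = [(cand, sup, price) for cand, (sup, price) in best.items()]
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    offers = [
        ("candidate_A", "supplier_1", 120.0),
        ("candidate_A", "supplier_2", 85.0),
        ("candidate_B", "supplier_1", 40.0),
        ("candidate_C", "supplier_3", 300.0),
    ]
    for row in cheapest_offers(offers):
        print(row)
```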

2.1.2 Sample: Model Analysis

After all modeling and activity validation results are presented, the models used in vHTS are examined. The model analysis contains any noteworthy insights gleaned from the models in the case study. It also includes considerations for how the modeling results should be interpreted and how they can be improved in the future.

2.1.3 Sample: Structural Feature Analysis

The purpose of the pipeline is to predict biological activity from molecular structure. Structural feature analysis returns to that purpose by examining the molecular structures of the compounds in the datasets and of the candidates for structural similarities and differences and how they correlate with observed biological activity. Comparing and contrasting similarly structured compounds or those that contain the same scaffolds can yield insights into additional avenues of study.

It is noteworthy how the pipeline identifies scaffolds. When the atomic Signature chromosomes are first created, the combinations of atomic Signatures are random. After the resulting models are created and evaluated, the best performing chromosome(s) are kept and used to create the next generation. Through many different trials and iterations, the most predictive combinations of atomic Signatures are identified. Recall that atomic Signatures are structural features and fragments that can combine like puzzle pieces. If specific features are identified and can be combined, they can map out and specify large parts of the molecule. When the parts specified have grown to a sufficient degree, the resulting section of the molecule is a scaffold where the periphery can be modified but the central, mapped out parts cannot.

2.2 Cathepsin L

PubChem Bioassay 825 (Diamond, 2008b) contains data on 98 compounds: 48 actives and 50 inactives. It has an active fraction of 0.490. Both of these details make this dataset one that can address AIM 1: to determine how the pipeline performs with a dataset between ten and one hundred compounds in size. This dataset will also address AIM 3, as model retraining had never been done with this pipeline before. The results from retraining will inform future expectations of model behavior stemming from retraining. This case study was originally published by Chen and Visco Jr (2016).

Cathepsin L (EC 3.4.22.15) is a lysosomal endopeptidase, or a membrane-bound enzyme that cleaves peptide bonds within proteins, and is responsible for the decomposition of proteins within a cell (Bohley et al., 1984; Mayer and Doherty, 1986).

Cathepsin L is coded by the CATL gene on chromosome 9 (Chauhan et al., 1993) and is produced as a 38 kDa proenzyme that matures into the active 29 kDa form under slightly acidic conditions (Smith and Gottesman, 1989). The cathepsin family contains many lysosomal proteases that function through many different mechanisms and may be localized to specific tissues (Barrett et al., 1998). Like other cysteine proteases, reactive thiols in the active site of cathepsin L facilitate catabolism (Turk et al., 2000) after proenzyme activation in acidic conditions.

The cathepsin L protein processes many different proteins, including the ECM proteins fibronectin, collagen, elastin, and laminin (Gal and Gottesman, 1986; Mason et al., 1986), serum proteins (Johnson et al., 1986), histone H3 (Duncan et al., 2008; Adams-Cioaba et al., 2011), and transcription factors (Goulet et al., 2004). Due to its role in the regulation of many different proteins, Cathepsin L is also implicated in many diseases. Cathepsin L deficiency can lead to heart disease (Stypmann et al., 2002; Petermann et al., 2006; Spira et al., 2007) due to decreased programmed cell turnover (Dennemärker et al., 2010). Cathepsin L is also utilized by viruses, such as severe acute respiratory syndrome coronavirus (SARS-CoV), for mediated entry into cells (Simmons et al., 2005; Grove and Marsh, 2011). Activity on ECM proteins by cathepsin L also indicates a role in cancer progression, metastasis, and prognosis (Chauhan et al., 1991; Krueger et al., 2001; Yang and Cox, 2007; Sullivan et al., 2009). Finally, it has been found to affect adipogenesis (Yang et al., 2007; Lafarge et al., 2011).

The regulatory role of Cathepsin L necessitates its own regulation. Inflammatory cytokines, such as interleukin-6 and interferon-γ, raise Cathepsin L expression (Gerber et al., 2001; Laha et al., 1995).
While Cathepsin L is activated in slightly acidic conditions, more acidic conditions (Turk et al., 1993) or neutral/basic conditions (Mason et al., 1985) cause it to denature. The denatured protein is then catabolized by cathepsin D (Turk et al., 1999), another member of the cathepsin family. Identification of Cathepsin L inhibitors is an area of active research, with known inhibitors cataloged in EMBL-EBI’s MEROPS database (Rawlings et al., 2015). Here, the PCA-GA-SVM pipeline is used to identify new cathepsin L inhibitor leads (Chen and Visco Jr, 2016).

2.2.1 Cathepsin L: vHTS and Experimental Validation Results

AID 825 contained data on 98 compounds: 48 actives, with known IC50 values, and 50 inactives. The maximum compound concentration tested was 50µM, which will also be the maximum tested concentration in activity validation experiments. Signature fragmentation of AID 825 resulted in 905 atomic Signatures of heights 0, 1, and 2. The dataset was converted into classes to train the classification model (SVM-C), while the average IC50 values were used to train the activity prediction QSAR model (SVM-R). The results of modeling are detailed in Table 2.1.

Table 2.1: Cathepsin L: The First Round vHTS Modeling Summary. Cross-Validation = cross-validation error. Adapted from Chen and Visco Jr (2016).

                           SVM-C                           SVM-R
Training set composition   48 actives, 50 inactives;       48 actives only;
                           98 total                        known IC50
Training set Signatures    8 h=0, 127 h=1, 770 h=2;        8 h=0, 97 h=1, 428 h=2;
                           905 total                       533 total
PCA Results                283 of 905 atomic Signatures    369 of 533 atomic Signatures
Models Created             1                               1
Training Error             0.020                           0.061
Cross-Validation           0.019                           0.158

The PubChem Compound database was screened for possible hits with the classification and activity prediction QSAR models in a vHTS. The results were filtered and focused using the following criteria:

1. Overlap = 1

2. SVM-C score > 2

3. SVM-R predicted activity ≤ 1µM
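The three criteria above amount to a simple filter over the vHTS predictions. The sketch below is illustrative only: the record layout, field names, and example values are hypothetical, and the thresholds are exposed as parameters so that the same function could express a relaxed activity cutoff.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    cid: str                  # candidate identifier (placeholder values below)
    overlap: float            # overlap metric with the training set
    svm_c_score: float        # classification objective function value
    predicted_ic50_um: float  # SVM-R predicted IC50 in micromolar


def passes_criteria(p: Prediction, min_score: float = 2.0,
                    max_ic50_um: float = 1.0) -> bool:
    """Apply the overlap, classification confidence, and activity criteria."""
    return (p.overlap == 1.0
            and p.svm_c_score > min_score
            and p.predicted_ic50_um <= max_ic50_um)


if __name__ == "__main__":
    predictions = [
        Prediction("candidate_A", 1.0, 2.4, 0.46),
        Prediction("candidate_B", 0.9, 3.1, 0.20),  # fails overlap
        Prediction("candidate_C", 1.0, 1.5, 0.10),  # fails confidence
        Prediction("candidate_D", 1.0, 2.8, 4.00),  # fails activity at 1 uM
    ]
    print([p.cid for p in predictions if passes_criteria(p)])
```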

The first criterion was to prevent extrapolation of predictions to candidates with atomic Signatures not found in the training data. The second criterion was to increase confidence in class predictions, as the sign of the SVM-C score indicates the class while the magnitude indicates confidence in the predicted class. The final criterion was to focus efforts on candidates with the highest predicted activity. Of the 72 million compounds in PubChem, circa May 2014, 39 candidates were identified. Commercial availability considerations meant only 16 of the 39 candidates were tested in activity validation experiments. The candidates were tested up to 200µM, higher than specified in the testing protocol of PubChem Bioassay AID 825 (Diamond, 2008b), because only weak activity was observed. Three of the sixteen candidates tested were active, for an experimental hit rate of 18.75%. Full experimental results can be found in Table 2.2.

Before retraining the models to address AIM 3, two adjustments were made. First, a compound was removed from the training data. One previously tested compound that was purchased for quality assurance, CID 426874, was reported inactive but was active in our tests. Additionally, the compound was difficult to dissolve and formed what appeared to be another phase in the solvent. The observations from stock preparation and the experimental result created doubt about the veracity of the reported data for CID 426874. Therefore, it was removed from the training data in the second round. Second, the data was now normalized such that the mean was zero and the standard deviation was one before model training. It was discovered that the R function used normalized the data used in regressions even when the normalization parameter was set to false. For clarity in every step of the modeling process, training data was normalized before training models from this point forward.
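The explicit preprocessing described here and in Section 2.1.1 (logarithmic transformation followed by scaling to zero mean and unit standard deviation) can be sketched as follows. The base-10 logarithm and the sample standard deviation are assumptions, since the text does not specify either, and the function names are hypothetical. The example values are the three experimental IC50 values of the first-round actives in Table 2.2.

```python
import math
import statistics


def log_standardize(ic50_values_um):
    """Log-transform IC50 values, then scale to zero mean and unit std. dev.

    Returns the scaled values together with the mean and standard deviation
    of the log-transformed data, which are needed to undo the scaling.
    """
    logs = [math.log10(v) for v in ic50_values_um]  # assumption: base 10
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)                     # assumption: sample std. dev.
    scaled = [(x - mean) / sd for x in logs]
    return scaled, mean, sd


def inverse_transform(scaled_value, mean, sd):
    """Convert a scaled prediction back to an IC50 in micromolar."""
    return 10 ** (scaled_value * sd + mean)


if __name__ == "__main__":
    scaled, mean, sd = log_standardize([20.8, 61.42, 63.64])
    print(scaled)
    print([inverse_transform(s, mean, sd) for s in scaled])
```

Keeping the mean and standard deviation alongside the scaled data makes the transformation explicit and reversible, so it does not depend on any scaling a library may apply internally.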

Table 2.2: Cathepsin L: The First Round Activity Validation Summary. Candidates selected are commercially available, economically viable, and passed the specified criteria of predicted activity < 1µM, SVM-C scores > 2, and overlap=1. Candidates were tested in triplicate and the mean IC50 value of candidates active across the triplicate was reported. CID = PubChem Compound ID. Adapted from Chen and Visco Jr (2016).

CID         Predicted IC50 [µM]   Experimental IC50 [µM]
17644424    0.461                 20.8
17640928    0.409                 61.42
2531484     0.425                 63.64
17641858    0.162                 > 200
17643506    0.434                 > 200
17611690    0.434                 > 200
17574247    0.194                 > 200
17574368    0.494                 > 200
3808779     0.494                 > 200
1841293     0.084                 > 200
3832483     0.084                 > 200
17643413    0.612                 > 200
47023001    0.836                 > 200
17616073    0.274                 > 200
4570118     0.507                 > 200
3709571     0.194                 > 200

The models were retrained with the addition of the sixteen candidates from the first round, resulting in the training data containing 113 compounds: 51 actives and 62 inactives. The same procedure for training vHTS models in the first round was used to train the second round vHTS models. The results are described in Table 2.3.

Table 2.3: Cathepsin L: The Second Round vHTS Modeling Summary. Cross-Validation = cross-validation error. Adapted from Chen and Visco Jr (2016).

                                  SVM-C                           SVM-R
Training set composition          51 actives, 62 inactives;       51 actives only;
                                  113 total                       known IC50
Training set atomic Signatures    8 h=0, 127 h=1, 765 h=2;        8 h=0, 97 h=1, 428 h=2;
                                  900 total                       533 total
PCA Results                       266 of 905 atomic Signatures    359 of 533 atomic Signatures
Models Created                    2                               1
Training Error                    0.060                           0.062
Cross-Validation                  0.204                           0.299

The pipeline resulted in two classification QSAR models. A consensus approach was used, in which a candidate must be classified as active by both models to be considered further. When the previous criteria were used to select candidates for experimental validation, only five candidates were selected. The predicted activity criterion was therefore relaxed to 10µM to identify and test more candidates, and the new criteria were:

1. Overlap = 1

2. SVM-C score > 2

3. SVM-R predicted activity ≤ 10µM
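The consensus requirement for the two classification models can be sketched as a unanimous vote over the models' objective function values. The sketch assumes that a positive SVM-C value denotes the active class, which the text does not state explicitly. The function name and the example scores are hypothetical.

```python
def consensus_active(scores, cutoff: float = 0.0) -> bool:
    """Return True only if every classification model scores above the cutoff.

    scores: one objective function value per classification model
    cutoff: 0.0 tests the predicted class only (assumption: positive = active);
            a larger value also demands a minimum confidence from each model
    """
    return all(score > cutoff for score in scores)


if __name__ == "__main__":
    print(consensus_active([2.6, 3.1]))   # both models predict active
    print(consensus_active([2.6, -0.4]))  # the models disagree
```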

Twenty-three candidates met the modified criteria, but only thirteen also met commercial availability considerations. One candidate fluoresced at the emission wavelength of the assay indicator; therefore, only twelve candidates were tested. Activity validation determined that nine of the twelve candidates were active, for an experimental hit rate of 75%. The results are detailed in Table 2.4.

Table 2.4: Cathepsin L: The Second Round Activity Validation Summary. Candidates selected were commercially available, economically viable, and met the criteria of predicted activity < 10µM, SVM-C scores > 2, and overlap=1. Candidates were tested in triplicate and the reported IC50 value was the mean of those that were active across the triplicate. CID = PubChem Compound ID. *Showed activity but IC50 above maximum testing concentration. Adapted from Chen and Visco Jr (2016).

CID         Predicted IC50 [µM]   Experimental IC50 [µM]
6400935     8.543                 6.49
6402773     8.543                 2.88
1872573     0.843                 0.97
1942062     1.093                 1.61
6603233     8.165                 0.49
20968843    5.127                 36.47
3223264     7.37                  41.91
322479      2.111                 0.08
9551759     0.308                 0.62
6403170     3.722                 > 50*
3674399     5.127                 > 50
4107783     5.127                 > 50

2.2.2 Cathepsin L: Model Analysis

Examination of the models for the first and second rounds yielded an interesting trend and a result for AIM 3: both the error and the hit rate increased from the first to the second round. One explanation is the change between the data used to train the models in the first round and the data used to retrain them in the second round. PubChem Bioassay 825 originally had a heterogeneous mixture of structures and preferences for certain structural feature combinations, resulting in the scaffolds found in the first round. The addition of the first-round activity validation results skewed that mixture and those preferences for structural feature combinations, resulting in the scaffolds found in the second round. Therefore, the addition of experimental data changes the amount and content of the information used to train the models, resulting in models that are less specific to the training set (i.e., increased training and cross-validation error) but more predictive (i.e., increased experimental hit rates).

2.2.3 Cathepsin L: Structural Features Analysis

Examination of the candidates’ structures showed large diversity even though structural features were limited to only those found in the training data. To examine the candidates’ structural diversity, the structural similarity metric called the Tanimoto coefficient was calculated. The set-theoretic form of the Tanimoto Coefficient between a candidate and a compound in the training data is defined (Chen and Reynolds, 2002) as:

\[ TC_{A,B} = \frac{\sum_{i=1}^{m} n_{A,i}\, n_{B,i}}{\sum_{i=1}^{m} n_{A,i} + \sum_{i=1}^{m} n_{B,i} - \sum_{i=1}^{m} n_{A,i}\, n_{B,i}} \]

where $TC_{A,B}$ is the Tanimoto Coefficient between candidate A and compound B, $n_{A,i}$ is the number of atomic Signature fragments $i$ in candidate A, and $n_{B,i}$ is the number of fragments $i$ in compound B. The candidates’ Tanimoto Coefficients with each compound in the training set, summarized in Table 2.5, spanned a range of similarities, from the most structurally similar (CID 6603233) to the most dissimilar (CID 17644424), with the median slightly under 50%. This, combined with the hit rates observed in the activity validation experiments, demonstrated that the pipeline is capable of predicting classes and activities for diverse candidates and of producing hit rates much higher than nominal traditional HTS hit rates.

Examination of the candidates’ structures also showed many candidates were iterations of a given scaffold. By examining candidates containing the same scaffold, the role of functional groups or the scaffold can be deduced. It should be cautioned that the conjectures produced are not a determination of function but correlations that suggest additional avenues of investigation.

The first scaffold and related candidates are shown in Figure 2.1. Each candidate contains a ring in the R position, but only CID 17644424 and CID 17640928 were active (Figure 2.1b and 2.1c, respectively). While no discernible pattern emerges from an examination of the rings, the identical structure of the rest of the molecule indicates the inhibitor functionality is likely dependent on the identity of the R group.
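The set-theoretic Tanimoto Coefficient defined earlier in this section can be computed directly from per-compound Signature tallies. The sketch below follows the expression as printed and uses presence/absence indicators (0 or 1) in the example, for which it reduces to the number of shared Signatures divided by the number of distinct Signatures in either compound. With occurrence counts larger than one, the printed expression is not guaranteed to stay between 0 and 1. The variant commonly used for count vectors squares the counts in the first two denominator terms. The function name and Signature labels are hypothetical.

```python
def tanimoto(n_a: dict, n_b: dict) -> float:
    """Tanimoto Coefficient between two compounds.

    n_a, n_b: {signature: value}, where the value is 1 if the atomic
              Signature is present (absent Signatures are simply omitted)
    """
    signatures = set(n_a) | set(n_b)
    shared = sum(n_a.get(s, 0) * n_b.get(s, 0) for s in signatures)
    total_a = sum(n_a.values())
    total_b = sum(n_b.values())
    denominator = total_a + total_b - shared
    if denominator == 0:
        raise ValueError("both compounds have no atomic Signatures")
    return shared / denominator


if __name__ == "__main__":
    candidate = {"sig_1": 1, "sig_2": 1, "sig_3": 1}
    training_compound = {"sig_2": 1, "sig_3": 1, "sig_4": 1}
    print(tanimoto(candidate, training_compound))
```

Repeating the calculation against every compound in the training set and taking the minimum, maximum, and median gives a per-candidate summary of the kind reported in Table 2.5.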

Table 2.5: Cathepsin L: Candidate Similarity. *Active in validation experiments. Adapted from Chen and Visco Jr (2016).

Candidates From The First Round           Candidates From The Second Round
Set-Theoretic Tanimoto Coefficient        Set-Theoretic Tanimoto Coefficient

CID          Min    Max    Median         CID          Min    Max    Median
17574247     0.14   0.79   0.39           6403170      0.17   0.65   0.38
1841293      0.21   0.71   0.35           3223264*     0.17   0.65   0.43
17574368     0.14   0.8    0.38           6400935*     0.18   0.71   0.39
2531484*     0.2    0.55   0.35           6402773*     0.17   0.67   0.4
17640928*    0.16   0.61   0.41           322479*      0.19   0.65   0.29
17641858     0.14   0.65   0.42           1872573*     0.17   0.64   0.37
17643413     0.17   0.57   0.4            1942062*     0.17   0.65   0.38
17643506     0.18   0.58   0.4            3674399      0.13   0.77   0.44
17644424*    0.14   0.71   0.4            4107783      0.13   0.85   0.44
17616073     0.17   0.55   0.35           20968843*    0.12   0.87   0.4
4570118      0.13   0.72   0.44           9551759*     0.16   0.86   0.41
17611690     0.15   0.66   0.4            6603233*     0.17   0.97   0.37
47023001     0.21   0.41   0.32           3808779      0.14   0.8    0.38
                                          3832483      0.21   0.71   0.35
                                          3709571      0.14   0.79   0.39
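The Min, Max, and Median columns of Table 2.5 can be illustrated with a minimal sketch, assuming each compound is stored as a mapping from atomic Signature fragment to its count. The function names, the fragment labels, and the example compounds are hypothetical and are not taken from the original pipeline. The sketch applies the set-theoretic formula as written above; with 0/1 occupancies, as in the example, it reduces to the number of shared fragments divided by the number of distinct fragments in either compound.

```python
from statistics import median


def tanimoto(n_a, n_b):
    """Set-theoretic Tanimoto Coefficient between two fragment-count maps.

    n_a and n_b map a fragment identifier to its count in each compound.
    """
    shared = sum(count * n_b.get(frag, 0) for frag, count in n_a.items())
    denom = sum(n_a.values()) + sum(n_b.values()) - shared
    # Guard against the degenerate case of an empty or undefined denominator.
    return shared / denom if denom else 0.0


def similarity_summary(candidate, training_set):
    """Return (min, max, median) similarity of one candidate to a training set."""
    scores = [tanimoto(candidate, compound) for compound in training_set]
    return min(scores), max(scores), median(scores)


# Hypothetical 0/1 fragment occupancies; fragment names are placeholders.
candidate = {"f1": 1, "f2": 1, "f3": 1}
training = [
    {"f1": 1, "f2": 1, "f3": 1},  # identical fragment set
    {"f1": 1, "f4": 1},           # one shared fragment
    {"f5": 1},                    # nothing shared
]
low, high, mid = similarity_summary(candidate, training)
print(low, high, mid)
```

Reporting the minimum, maximum, and median per candidate, rather than a single average, shows both the closest training compound and how far the candidate sits from the bulk of the training data.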

(a) Scaffold 1. (b) CID 17644424 (IC50 = 20.8µM). (c) CID 17640928 (IC50 = 61.42µM). (d) CID 17641858 (IC50 > 200µM). (e) CID 17643506 (IC50 > 200µM). (f) CID 17611690 (IC50 > 200µM).

Figure 2.1: Cathepsin L: The First Scaffold. The first scaffold is found in five candidates from the first round of vHTS.

A second scaffold emerged from the second round candidates and is shown in Figure 2.2. In this case, the two factors that may explain the difference in activity are the position and the size of the functional groups. CID 6403170 (Figure 2.2b) is inactive while the other two candidates are active. Activity is likely due to functional group position, as the active candidates had functional groups in the R1 position instead of the R2 position, or size, as CID 6403170 has a functional group much bigger and bulkier than the methyl and ethyl groups found in the active candidates. Again, the rest of the molecule is identical in all three candidates; therefore, the determination of activity is likely due to either the identity or the size of the functional groups at the R1 and R2 positions.

The last candidates to which this kind of analysis can be applied in this cathepsin L case study are shown in Figure 2.3. The two most similar candidates, CID 4107783 and CID 20968843 (Figure 2.3c and 2.3d, respectively), are identical except for the group at the R1 position, indicating the identity of R1 as the likely determinant of activity. As for the R2 and R3 positions, the effects of groups at those locations cannot be distinguished from those of the groups at the R1 position. Thus, it is unknown what effect groups at the R2 and R3 positions have.

(a) Scaffold 2. (b) CID 6403170 (IC50 > 50µM). (c) CID 6400935 (IC50 = 6.49µM). (d) CID 6402773 (IC50 = 2.88µM).

Figure 2.2: Cathepsin L: The Second Scaffold. Scaffold 2 is found in three candidates from the second round of vHTS.

(a) Scaffold 3. (b) CID 3674399 (IC50 > 50µM). (c) CID 4107783 (IC50 > 50µM). (d) CID 20968843 (IC50 = 36.47µM).

Figure 2.3: Cathepsin L: The Third Scaffold. The third scaffold was found in three candidates from the second round of vHTS.

It should be noted that an additional group of candidates containing the same scaffold, shown in Figure 2.4 with the scaffold in question, was found, but all candidates were of the active class and were comparable in activity; therefore, the effects of different functional groups cannot be determined through the compare/contrast analysis conducted earlier. As the candidates were not identified as PAINS (Baell and Holloway, 2010) or as structurally similar to known PAINS, this represents an ideal opportunity for hit-to-lead development. Because all iterations of the scaffold are active, constraints in terms of functional groups and structural variability are unknown. Thus the scaffold can be interpreted as an unconstrained hit that can be iteratively modified to tune for pharmacokinetic and metabolism properties.

(a) Scaffold 4. (b) CID 1872573 (IC50 = 0.97µM). (c) CID 1942062 (IC50 = 1.61µM).

Figure 2.4: Cathepsin L: The Fourth Scaffold. The fourth scaffold is found in two candidates from the second round of vHTS.
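The compare/contrast reasoning used for the four cathepsin L scaffolds can be sketched as a simple grouping step: a scaffold only supports conjectures about functional groups when it has both active and inactive members. The scaffold labels, CIDs, and activity calls below are taken from Figures 2.1 to 2.4 (a reported IC50 value is treated as active, an IC50 above the tested limit as inactive); the function name and data layout are illustrative assumptions, not part of the original pipeline.

```python
from collections import defaultdict

# (scaffold label, CID, active?) as read from Figures 2.1 to 2.4.
RESULTS = [
    ("Scaffold 1", 17644424, True),
    ("Scaffold 1", 17640928, True),
    ("Scaffold 1", 17641858, False),
    ("Scaffold 1", 17643506, False),
    ("Scaffold 1", 17611690, False),
    ("Scaffold 2", 6403170, False),
    ("Scaffold 2", 6400935, True),
    ("Scaffold 2", 6402773, True),
    ("Scaffold 3", 3674399, False),
    ("Scaffold 3", 4107783, False),
    ("Scaffold 3", 20968843, True),
    ("Scaffold 4", 1872573, True),
    ("Scaffold 4", 1942062, True),
]


def informative_scaffolds(results):
    """Map each scaffold to True when it has both active and inactive members."""
    outcomes = defaultdict(set)
    for scaffold, _cid, active in results:
        outcomes[scaffold].add(active)
    return {scaffold: len(seen) == 2 for scaffold, seen in outcomes.items()}


summary = informative_scaffolds(RESULTS)
for scaffold, informative in sorted(summary.items()):
    print(scaffold, "compare/contrast possible" if informative else "all one class")
```

Under these inputs only the fourth scaffold is flagged as having a single activity class, which matches the observation that its functional group effects cannot be determined by comparison.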

2.3 Factor XIIa

Another dataset used to address AIM 1 was PubChem Bioassay AID 728 (Diamond, 2011). AID 728 contains a total of 82 compounds: 48 actives and 34 inactives. The active fraction of AID 728 is 0.585. These dataset characteristics allow it to address AIM 1 and test how the pipeline responds to datasets containing between ten and one hundred compounds. This work also addresses AIM 3, as this dataset is one of the few with an active fraction > 0.5 and provides a chance to observe the effect of retraining when there is more active compound data than inactive compound data.

This case study was presented initially by Chen and Visco Jr (2017). The focus of AID 728 is Factor XIIa (EC 3.4.21.38), the activated form of Factor XII, which is also known as the Hageman Factor (Ratnoff and Colopy, 1955). Encoded by the 128 kbp Factor XII gene (Cool and MacGillivray, 1987) found on chromosome 5 (Citarella et al., 1988), Factor XII is an 80 kDa glycoprotein (Dunn et al., 1982) that circulates in the body as an inactive zymogen. Conformational change and activation of the zymogen occur upon (1) adsorption on or contact with a negatively-charged surface (Chan et al., 1978), (2) digestion by the kallikrein protease (Cochrane et al., 1973), or (3) digestion by another Factor XIIa protein (Silverberg et al., 1980).

Clotting occurs in two different varieties: hemostasis, clotting to stop blood loss, and thrombosis, clotting that stops blood circulation. Factor XIIa impacts both kinds of clotting. Its absence was first discovered in patients with normal platelet count, clot retraction, and bleeding time but abnormal clotting time (Ratnoff and Colopy, 1955). Later revealed to participate indirectly in hemostasis as a non-essential element (Ratnoff and Colopy, 1955; Proctor and Rapaport, 1961), Factor XII does activate and amplify signals and factors that participate directly in hemostasis, including Factor XI (Davie and Ratnoff, 1964; Ghebrehiwet et al., 1981).

The role of Factor XII in thrombosis is an area of current study. First reports suggested Factor XII deficiency as a factor in thrombophilia (Kuhli et al., 2004; Goodnough et al., 1983), but later studies found mitigating circumstances that minimized the role and contribution of Factor XII (Girolami et al., 2004). Now, Factor XII deficiency is thought to inhibit thrombosis (Kleinschnitz et al., 2006; Renné et al., 2005). Thrombosis studies in mice showed Factor XII deficiency resulted in inhibited clot formation and reduced clot stability (Kleinschnitz et al., 2006; Renné et al., 2005). The mechanisms of clot formation inhibition and destabilization are unclear and are currently an area of active study (Kleinschnitz et al., 2006; Renné et al., 2005).

Due to its ability to affect thrombosis and not hemostasis, Factor XII may be an ideal therapeutic target for thrombosis treatment, but little is known about its regulation. The most apparent symptom of Factor XII deficiency, abnormal clotting time (Ratnoff and Colopy, 1955), is of low priority compared with other life-altering or life-threatening symptoms or diseases. Therefore, no historical records of treatment exist. Estrogen has been observed to affect Factor XII levels (Farsetti et al., 1995, 1998; Inoue et al., 2006). However, estrogen is an important regulatory hormone that affects many different systems and is ruled out as a viable route because precise targeting is desired. With the lack of viable options, one method left is to screen compounds against Factor XIIa to identify biologically active hits, as is presented in this case study (Chen and Visco Jr, 2017).

2.3.1 Factor XIIa: vHTS and Experimental Validation Results

AID 728 contains data on 82 compounds: 48 actives of known IC50 and 34 inactives. The highest candidate concentration tested was 50µM, and the same limit is used in this work. Signature fragmentation of the compounds in AID 728 resulted in 698 atomic Signatures of heights 0, 1, and 2. PCA created principal components from the 698 atomic Signatures, and the contributions of the atomic Signatures to those components were used to identify the atomic Signatures contributing the most to variance capture. These were then passed to the GA-SVM to create and train classification (SVM-C) and activity prediction QSAR (SVM-R) models. Details of the resulting models are presented in Table 2.6.

Table 2.6: Factor XIIa: The First Round vHTS Modeling Summary. Cross-validation = cross validation accuracy. Adapted from Chen and Visco Jr (2017).

                                    SVM-C                                  SVM-R
Training set composition            48 active, 34 inactive; 82 total       48 actives only; known IC50
Training data atomic Signatures     10 h=0, 123 h=1, 565 h=2; 698 total    10 h=0, 83 h=1, 255 h=2; 348 total
PCA results                         198 of 698 atomic Signatures           155 of 348 atomic Signatures
Models created                      308                                    1
Training error                      0                                      0.25
Cross-validation                    0                                      0

The models were evaluated using ROC curves, shown in Figure 2.5. The ROC curves for all 308 SVM-C models had a square shape and an AUC = 1, indicating a sharp divide between classes in the data, which is supported by the zero training and cross-validation errors in Table 2.6. The SVM-R ROC curve lacks the square shape and has an AUC = 0.562. This is due to the way the ROC curve was calculated. The SVM-R model was applied to both the active and the inactive compounds to generate the ROC curve. Active compounds contained only half of all atomic Signatures in the training data (348 of 698), with the other half found only in the inactive compounds. Extrapolation to such a degree would inevitably result in poor performance. When extrapolation is limited by the candidate selection criteria, the SVM-R model should perform better. The ROC curves for all models were above the y=x line. Thus, any accurate identification of active candidates is due to model performance and not random chance.

(a) ROC curves for the 308 SVM-C models (AUC = 1). (b) ROC curve for the SVM-R model (AUC = 0.562). Axes: true positive rate versus false positive rate.

Figure 2.5: Factor XIIa: The First Round vHTS ROC. The ROC curves were above the y=x line. Thus, accurate identification of active candidates is due to model performance and not random chance. Adapted from Chen and Visco Jr (2017).

Since all 308 classification models had the same error and the same AUC, zero and one, respectively, all 308 models were used for vHTS. Besides the lack of a non-arbitrary method to select a representative model, using all 308 models makes maximum use of the data to increase the chances of vHTS success. A consensus approach was used, in which all models must agree that a candidate is active for it to receive an active classification. Following vHTS, candidates were chosen for activity validation experiments according to the following criteria:

1. Overlap = 1

2. SVM-C score > 2 for all 308 SVM-C models

3. SVM-R predicted activity < 50µM
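The three criteria above, combined with the consensus requirement across SVM-C models, can be sketched as a single filter. This is a minimal illustration, not the original implementation: the `Candidate` class, its field names, and the example values are hypothetical, and overlap is assumed here to be the share of a candidate's atomic Signatures that also occur in the training data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    cid: int                   # hypothetical identifier, not a real PubChem CID
    overlap: float             # assumed: share of atomic Signatures seen in training
    svm_c_scores: List[float]  # one decision score per SVM-C model
    predicted_ic50_um: float   # SVM-R predicted IC50 in micromolar


def passes_criteria(c: Candidate, score_cutoff: float = 2.0,
                    ic50_cutoff_um: float = 50.0) -> bool:
    """Apply the three selection criteria; every SVM-C model must agree."""
    consensus = bool(c.svm_c_scores) and all(s > score_cutoff for s in c.svm_c_scores)
    return c.overlap == 1 and consensus and c.predicted_ic50_um < ic50_cutoff_um


# Made-up screening output for three candidates scored by three models.
screened = [
    Candidate(1, 1.0, [2.4, 2.1, 3.0], 1.9),   # meets all three criteria
    Candidate(2, 1.0, [2.4, 1.8, 3.0], 1.9),   # one model disagrees
    Candidate(3, 0.9, [2.4, 2.1, 3.0], 60.0),  # unseen fragments, weak prediction
]
selected = [c.cid for c in screened if passes_criteria(c)]
print(selected)
```

Requiring agreement from every model is the strictest possible consensus rule; it trades the number of candidates returned for confidence in each one.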

Of 72 million compounds in PubChem (circa May 2014), 123 candidates met the selection criteria. After commercial availability and economic viability considerations, fourteen candidates were chosen. Ultimately, six were found active in activity validation experiments for an experimental hit rate of 42.9%. Data for all fourteen candidates are reported in Table 2.7.

Table 2.7: Factor XIIa: The First Round Activity Validation Summary. Candidates selected were commercially available, economically viable, and met the criteria of predicted activity < 50µM, SVM-C scores > 2 and overlap=1. Candidates were tested in triplicate and the mean IC50 values of active candidates across the triplicates were reported. CID = PubChem Compound ID. Adapted from Chen and Visco Jr (2017).

CID          Predicted IC50 [µM]    Experimental IC50 [µM]
898930       1.88                   7.21
710644       2.05                   9.39
743712       0.35                   25.4
743706       1.39                   26.7
746939       3.65                   31.7
792914       10.50                  34.1
5711085      1.34                   > 50
5463326      0.64                   > 50
710710       1.54                   > 50
5289928      0.86                   > 50
7340998      0.88                   > 50
5769514      1.24                   > 50
5769082      0.62                   > 50
17020029     1.75                   > 50

Recall that the hit rate in the cathepsin L case study increased from the first round to the second round. To address AIM 3 and to determine whether the increase in hit rate is a feature or an artifact, the models were retrained. Experimental data on the fourteen candidates tested in the first round were added to the training data and the vHTS models were retrained. The new training data contain 96 compounds: 54 actives and 42 inactives. The results of retraining are reported in Table 2.8. The ROC curve of each model, shown in Figure 2.6, was created for a priori model evaluation. The ROC curve for the new classification model approached a step function with AUC = 0.952, indicating the model found a relatively sharp distinction between classes. The ROC curve of the SVM-R model was further away from the y=x line, with AUC = 0.716, which could be interpreted as increased activity prediction ability.
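The link between ROC shape, AUC, and class separation can be illustrated with a short sketch. It uses the rank interpretation of AUC: the probability that a randomly chosen active receives a higher model score than a randomly chosen inactive, with ties counted as one half. The function name and the scores are hypothetical and are not output from the models in this work.

```python
def roc_auc(scores_active, scores_inactive):
    """AUC as the fraction of (active, inactive) pairs ranked correctly."""
    wins = 0.0
    for a in scores_active:
        for b in scores_inactive:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(scores_active) * len(scores_inactive))


# Perfectly separated classes: every active outscores every inactive.
auc_separated = roc_auc([2.5, 3.1, 2.2], [-1.0, -2.4])

# Partially overlapping classes: one active scores below one inactive.
auc_overlap = roc_auc([0.9, 0.4], [0.5, 0.1])

print(auc_separated, auc_overlap)
```

An AUC of 1 corresponds to the square ROC shape, an AUC of 0.5 to the y=x line, and intermediate values to curves between the two.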

Table 2.8: Factor XIIa: The Second Round vHTS Modeling Summary. Cross-validation = cross validation accuracy. Adapted from Chen and Visco Jr (2017).

                                    SVM-C                                  SVM-R
Training set composition            54 active, 42 inactive; 96 total       54 actives only; known IC50
Training data atomic Signatures     10 h=0, 123 h=1, 565 h=2; 698 total    10 h=0, 83 h=1, 255 h=2; 348 total
PCA results                         177 of 698 atomic Signatures           153 of 348 atomic Signatures
Models created                      1                                      1
Training error                      0.031                                  0.420
Cross-validation                    0.030                                  0.458

(a) ROC curve for the SVM-C model (AUC = 0.952). (b) ROC curve for the SVM-R model (AUC = 0.716). Axes: true positive rate versus false positive rate.

Figure 2.6: Factor XIIa: The Second Round vHTS ROC. Curves above the y=x line indicate models accurately identify actives. Adapted from Chen and Visco Jr (2017).

Virtual HTS with the retrained models summarized in Table 2.8 resulted in 66 identified candidates. Of the 66 candidates, eleven were determined to be commercially available and economically viable. Activity validation experiments found all eleven candidates to be active for an experimental hit rate of 100%. The results are detailed in Table 2.9.

Table 2.9: Factor XIIa: The Second Round Activity Validation Summary. Selected candidates were commercially available, economically viable, and met the criteria of predicted activity < 50µM, SVM-C scores > 2 and overlap=1. Candidates were tested in triplicate and the mean IC50 values of active candidates across the triplicates were reported. CID = PubChem Compound ID. Adapted from Chen and Visco Jr (2017).

CID          Predicted IC50 [µM]    Experimental IC50 [µM]
971668       1.00                   0.68
969825       3.07                   0.32
3613559      0.87                   0.21
5000438      2.16                   0.49
971729       1.84                   0.40
786123       14.11                  43.59
871570       20.83                  1.96
693001       13.20                  0.60
970336       1.57                   0.16
8072667      1.59                   8.37
974948       1.45                   13.48

This work was performed before the consideration of PAINS (Baell and Holloway, 2010), so all compounds in the training data were used. After the fact, results were scrutinized for PAINS. Six second-round candidates were identified as, or as structurally similar to, known aggregation PAINS (Babaoglu et al., 2008) using the “PAINS” filter from ZINC15 (Sterling and Irwin, 2015).

To determine if the observed activity of the six candidates was due to aggregation, the candidates were retested at various surfactant concentrations. The testing protocol is based on a previously presented approach (Irwin et al., 2015). While the assay protocol already included a surfactant at the highest suggested testing concentration (Irwin et al., 2015), the candidates were tested anyway to determine how aggregation activity varied with surfactant concentration and whether the concentration used in the protocol eliminated aggregation activity. The results in Table 2.10 suggest the surfactant concentration used in the assay protocol was enough to eliminate aggregation. Thus the results shown in Table 2.7 and Table 2.9 are due to candidate activity, not aggregation.

2.3.2 Factor XIIa: Model Analysis

The increase in errors and hit rates observed from the first round to the second round in the cathepsin L study was also observed for the Factor XIIa study. Though it might be a coincidence, the hit rate increase was observed twice, and the same argument and supporting evidence can be made here. The candidates selected for validation in the first round skewed towards specific combinations of atomic Signatures, resulting in the two scaffolds found in the first round. The influx of new experimental data and structural information changed how certain combinations of atomic Signatures were considered, resulting in a shift in focus of atomic Signature combinations and, ultimately, in the scaffolds found in the second round. The models traded specificity to the training data, seen as increased error, for predictive power, seen as increased hit rates.

Table 2.10: Factor XIIa: The PAINS Activity Summary. Candidates similar to known aggregators were retested at various surfactant concentrations to determine aggregation as a function of surfactant concentration. The surfactant concentration used in the assay was the highest tested (0.02%). Adapted from Chen and Visco Jr (2017).

             IC50 [µM]        IC50 [µM]           IC50 [µM]
CID          0% surfactant    0.01% surfactant    0.02% surfactant
971668       1.85             2.63                2.43
969825       0.43             0.83                0.86
3613559      0.60             0.75                0.81
5000438      4.97             5.27                5.83
971729       0.05             1.23                1.19
970336       0.19             0.97                1.08
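One way to read Table 2.10 is sketched below, using the IC50 values exactly as tabulated. The two checks are illustrative choices and not the analysis reported in the original study: whether each candidate remains below the 50µM activity limit at the assay surfactant concentration, and how much the IC50 changes between the two highest surfactant concentrations. The function and variable names are hypothetical.

```python
# IC50 values in micromolar from Table 2.10: (0%, 0.01%, 0.02% surfactant).
IC50_BY_SURFACTANT = {
    971668: (1.85, 2.63, 2.43),
    969825: (0.43, 0.83, 0.86),
    3613559: (0.60, 0.75, 0.81),
    5000438: (4.97, 5.27, 5.83),
    971729: (0.05, 1.23, 1.19),
    970336: (0.19, 0.97, 1.08),
}
ACTIVITY_LIMIT_UM = 50.0  # highest concentration tested in the assay


def still_active(ic50s, limit=ACTIVITY_LIMIT_UM):
    """True when the IC50 at the assay surfactant level is below the limit."""
    return ic50s[-1] < limit


def plateau_change(ic50s):
    """Relative IC50 change between 0.01% and 0.02% surfactant."""
    _, mid_level, assay_level = ic50s
    return (assay_level - mid_level) / mid_level


all_active = all(still_active(v) for v in IC50_BY_SURFACTANT.values())
largest_change = max(abs(plateau_change(v)) for v in IC50_BY_SURFACTANT.values())
print(all_active, round(largest_change, 3))
```

A small change between the two highest surfactant concentrations is consistent with, though not proof of, the surfactant effect having leveled off at the assay concentration.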

2.3.3 Factor XIIa: Structural Feature Analysis

Through both rounds of vHTS and activity validation, seventeen candidates were identified that could be used to study Factor XII activity. When all candidates were considered, three different scaffolds appeared in multiple candidates. Two were exclusive to one round while another was found in both rounds. The first scaffold for consideration was found in candidates of both rounds and is shown in Figure 2.7. Every candidate containing the scaffold was active except for CID 710710, shown in Figure 2.7b. Notably, CID 710710 was the only candidate that contained a nitro group at the R2 position. The active candidates contained a variety of groups (e.g., ether, methyl, halide, and amide) in a variety of positions but not a nitro group. Therefore, the nitro group, its placement at the R2 position, or both are likely the cause of inactivity.

(a) Scaffold 1. (b) CID 710710 (IC50 > 50µM).

Figure 2.7: Factor XIIa: The First Scaffold. All candidates with scaffold 1 were active except for CID 710710. CID 710710 is the only candidate with a nitro group at the R2 position, and either the nitro group, the group placement, or both are likely the cause of inactivity. Adapted from Chen and Visco Jr (2017).

Apart from the scaffolds found among the active candidates, a scaffold was found in the inactive candidates of the first round, shown in Figure 2.8. The scaffold was found in seven inactive candidates but also in one active compound from the training data, CID 5340140, shown in Figure 2.8c. When compared with the most similar candidate, CID 7340998, shown in Figure 2.8b, the differences between the two were the pyrimidine at the R1 position and the nitro group at the R2 position; one or both groups are likely the reason why CID 5340140 is active.

(a) Scaffold 2. (b) CID 7340998 (IC50 > 50µM). (c) CID 5340140 (IC50 = 1.02µM).

Figure 2.8: Factor XIIa: The Second Scaffold. Candidates containing the second scaffold were inactive but CID 5340140 from the training set was active. When compared with the most similar candidate, CID 7340998, the notable differences were the pyrimidine and nitro group at the R1 and R2 positions, respectively. One or both groups are likely the reason why CID 5340140 is active.

The last scaffold, shown in Figure 2.9, was found in eight candidates from the second round and nine compounds in the training data. The seventeen compounds could be further categorized by the identity of the R4 group: pyridine, furan, or benzene ring. All compounds containing the scaffold were active; thus analysis is limited. However, some compounds were less active than others. The compounds with a pyridine ring had IC50 < 1µM except one, CID 8072667. The compounds with a benzene ring had IC50 < 10µM except one, CID 974948. The difference between these two compounds and the rest of their respective groups was also a feature shared by both compounds: the nitro group at the R2 position. Either the position or the nitro group itself has a negative impact on activity. Beyond the role and impact of the R2 nitro group, the scaffold is a relatively unexplored hit that can be iterated and optimized for pharmacokinetic and metabolism properties. Other iterations of this scaffold contained a mix of methoxy, methyl, hydrogen, and chlorine groups but showed little relative difference in activity. Ways to greatly affect activity are unknown for this scaffold and are another avenue for research.

(a) Scaffold 3. (b) CID 8072667 (IC50 = 8.37µM). (c) CID 974948 (IC50 = 13.48µM).

Figure 2.9: Factor XIIa: The Third Scaffold. Both compounds shown, CID 8072667 and CID 974948, have lower activity than their cohorts, likely due to the presence of the nitro group.

The first two scaffolds discussed were found in small numbers in the dataset, with only four of the eighty-two compounds containing the first scaffold and one of the eighty-two compounds containing the second scaffold. If each ring iteration of the third scaffold (pyridine, furan, or benzene) were treated as an individual scaffold, the number of compounds containing each ring iteration would also be comparably small. This is evidence that the pipeline can identify key structural features from relatively few instances and use them to identify potential candidates on which to focus efforts and resources.

2.4 Factor XIa

The dataset PubChem Bioassay AID 846 (Diamond, 2008c) was selected to address AIMs 2 and 3. AID 846 contains 289 compounds, 91 actives and 198 inactives, and has an active fraction of 0.315. Use of this dataset allows an investigation of whether the pipeline can produce predictive models in cases of active fractions between 0.2 and 0.4. This dataset also addresses AIM 3, as retraining had not previously been attempted on datasets with an active fraction between 0.2 and 0.4, and it will inform what expectations to have when retraining models for datasets with smaller active fractions.

AID 846 focuses on clotting Factor XI, which was discovered when patients with mild to moderate hemophilia were found to be missing a blood protein (Rosenthal et al., 1953, 1955). Factor XI is encoded by the 23 kbp long FXI gene, containing 15 exons and 14 introns (Asakai et al., 1987), located on the fourth chromosome (Kato et al., 1989). Factor XI is produced primarily by the liver as a 607 amino acid (Fujikawa et al., 1986), 160 kDa homodimer zymogen, comprised of two 80 kDa monomers (Fujikawa et al., 1986). The zymogen circulates in the body at a concentration of about 4 µg/mL (Bouma and Griffin, 1977).

According to classical models of hemostasis, Factor XIIa binds to Factor XI, changing the Factor XI conformation to the activated form, Factor XIa (EC 3.4.21.27) (Davie and Ratnoff, 1964). Factor XIa then activates other signals to amplify the clotting cascade. The classical model was contradicted by case studies where patients with deficient precursor levels did not exhibit hemophilia symptoms while patients with deficient Factor XI levels did (Asakai et al., 1987). Thrombin, a downstream product, has emerged as an alternative activator, along with auto-activation by Factor XIa molecules and activation on negatively charged surfaces (Naito and Fujikawa, 1991; Gailani and Broze Jr, 1991), minimizing the role of precursors. Simultaneously, the understood role of Factor XI in the clotting cascade has expanded, as it now also includes amplification of signals to convert more prothrombin to thrombin and accelerate fibrinogen to fibrin conversion (Gailani and Broze Jr, 1991).

While no strict correlation exists between Factor XI deficiency and hemophilia, which necessitates case-by-case treatment (Seligsohn et al., 2005; Bolton-Maggs, 2000), patients with elevated Factor XIa levels are at risk of excessive clotting and are prone to develop thrombosis (Meijers et al., 2000; Eichinger et al., 2004; Doggen et al., 2006; Yang et al., 2006). Factor XI inhibition was proposed as a treatment for thrombosis because of the relationship between Factor XI, its precursors, and clotting. If precursor inhibition resulted in reduced clot formation and stability, inhibition of a direct actor like Factor XI should have a larger impact on clotting (Renné et al., 2005; Kleinschnitz et al., 2006). Experiments with Factor XI deficient mice showed decreased clot stability (Renné et al., 2005; Kleinschnitz et al., 2006), and Factor XI has become a therapeutic target for anti-thrombosis treatment.

Factor XI inhibitors have been studied since the 1970s, starting with four protease inhibitors (α1-protease inhibitor (Heck and Kaplan, 1974), antithrombin III (Damus et al., 1973), C1-inhibitor (Forbes et al., 1970), and α2-plasmin inhibitor (Saito et al., 1979)) and derivatives of the four protease inhibitors (Scott et al., 1986). Other, non-protease inhibitors have also been developed (Quan et al., 2014; Al-Horani et al., 2013; Meijers et al., 1988; Weis et al., 2008; Li et al., 2014a; Pinto et al., 2017; Fjellström et al., 2015), with additional ones identified in the case study presented.

2.4.1 Factor XIa: vHTS and Experimental Validation Results

AID 846 contains 289 compounds: 91 actives of known IC50 values and 198 inactives (Diamond, 2008c). The compounds in the training data were fragmented to yield 1798 atomic Signatures of heights 0, 1, and 2. Examination of how PCA used atomic Signatures to create principal components identified the atomic Signatures that contributed the most to variance capture, which were used by the GA-SVM to make classification (SVM-C) and activity prediction QSAR (SVM-R) models, the results of which are detailed in Table 2.11.

Table 2.11: Factor XIa: The First Round vHTS Modeling Summary. Cross-validation = cross validation error.

                                    SVM-C                                     SVM-R
Training set composition            91 active, 198 inactive; 289 total        91 actives only; known IC50
Training set atomic Signatures      13 h=0, 200 h=1, 1585 h=2; 1798 total     10 h=0, 117 h=1, 552 h=2; 679 total
PCA results                         168 of 1798 atomic Signatures             262 of 679 atomic Signatures
Models created                      1                                         1
Training error                      0.06                                      0.50
Cross-validation                    0.06                                      0.55

Model evaluation with ROC curves is shown in Figure 2.10. The SVM-C curve approaches the upper-left corner (0,1), indicating a threshold value exists where nearly all training data compounds are correctly classified. It is noted again that the poor shape of the ROC curve of the SVM-R model is due to the application of the model to atomic Signatures not present during training, as those atomic Signatures are only found in the inactive compounds. When extrapolation is limited by the candidate selection criteria for the activity validation experiments, the SVM-R model should perform better than indicated by the ROC curve. Following vHTS, the results were filtered for candidates according to the following criteria:

1. Overlap = 1

2. SVM-C score > 2

3. SVM-R predicted activity < 50µM

(a) ROC curve for the SVM-C model (AUC = 0.929). (b) ROC curve for the SVM-R model (AUC = 0.690). Axes: true positive rate versus false positive rate.

Figure 2.10: Factor XIa: The First Round vHTS ROC. The ROC curves above and away from the y=x line indicate both models contribute to accurate prediction of active candidates and identification of actives is the result of model ability instead of random chance.

Of the 72 million compounds screened, 110 candidates met the three criteria for further consideration. Economic and commercial availability considerations identified eleven candidates for activity validation experiments. Ultimately, three of the eleven were found active for an experimental hit rate of 27.3%. The results of the activity validation experiments are detailed in Table 2.12.

Table 2.12: Factor XIa: The First Round Activity Validation Summary. Selected candidates were economically viable, commercially available, and met the criteria of predicted activity < 50µM, SVM-C scores > 2 and overlap=1. Candidates were tested in triplicate and the IC50 value reported is the mean across the triplicate for active candidates. CID = PubChem Compound ID.

CID          Predicted IC50 [µM]    Experimental IC50 [µM]
834536       1.05                   19.4
764214       0.91                   43.7
17178134     2.67                   37.2
4026876      2.13                   > 50
6042862      6.14                   > 50
682556       2.71                   > 50
812347       2.72                   > 50
802465       2.71                   > 50
733816       2.72                   > 50
17178138     4.77                   > 50
17131127     5.48                   > 50

The 27.3% hit rate is one of the lower first-round hit rates observed thus far. The pipeline can produce better models and may do so when retrained. Thus, to raise vHTS hit rates and to address AIM 3 of observing how retraining affects hit rates, the models were retrained with the addition of the activity validation data to the original training data. The augmented training data now contained information on 300 compounds: 94 actives with known IC50 and 206 inactives. The details and results of retraining are reported in Table 2.13.
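The bookkeeping behind the augmented training sets can be checked with a few lines. The counts are those reported in the text for the two case studies; the function name and return layout are illustrative assumptions.

```python
def augment(actives, inactives, new_actives, new_inactives):
    """Return (actives, inactives, total, active fraction) after retraining data are added."""
    total_actives = actives + new_actives
    total_inactives = inactives + new_inactives
    total = total_actives + total_inactives
    return total_actives, total_inactives, total, total_actives / total


# Factor XIa (AID 846): 91 actives and 198 inactives, plus 3 active and 8 inactive candidates.
fxia = augment(91, 198, 3, 8)

# Factor XIIa (AID 728): 48 actives and 34 inactives, plus 6 active and 8 inactive candidates.
fxiia = augment(48, 34, 6, 8)

print(fxia[:3], round(fxia[3], 3))
print(fxiia[:3], round(fxiia[3], 3))
```

In both cases the active fraction of the training data shifts only slightly after one round of validation, so any change in model behavior is unlikely to come from class balance alone.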

Table 2.13: Factor XIa: The Second Round vHTS Modeling Summary. Cross-validation = cross validation error.

                                    SVM-C                                     SVM-R
Training set composition            94 active, 206 inactive; 300 total        94 actives only; known IC50
Training data atomic Signatures     13 h=0, 200 h=1, 1585 h=2; 1798 total     10 h=0, 117 h=1, 552 h=2; 679 total
PCA results                         172 of 1798 atomic Signatures             232 of 679 atomic Signatures
Models created                      1                                         1
Training error                      0.08                                      0.61
Cross-validation                    0.09                                      0.54

ROC curves, shown in Figure 2.11, were used to evaluate the classification and activity prediction models before vHTS. Interestingly, both the classification and activity prediction models performed worse than in the first round, with AUC slightly lower for the SVM-C curve (0.908 versus 0.929) and substantially lower for the SVM-R curve (0.584 versus 0.690). One explanation is that retraining generalizes the models such that specificity to training data decreases in exchange for increased prediction power and hit rate. This conjecture is supported by the hit rates of the second round activity validation experiments, details of which are presented in Table 2.14.

(a) ROC curve for the SVM-C model (AUC = 0.908). (b) ROC curve for the SVM-R model (AUC = 0.584). Axes: true positive rate versus false positive rate.

Figure 2.11: Factor XIa: The Second Round vHTS ROC. The ROC curves for the models from the second round were closer to the y=x line than the ROC curves for the models from the first round, likely indicating decreased specificity to the training data in exchange for increased prediction power and hit rate.

When the first round criteria were applied to identify the candidates to test in the second round of activity validation, 36 candidates were identified, but none were economically viable or commercially available. To increase the number of economically viable candidates, the SVM-C criterion was relaxed from 2 to 1.5; confidence in predictions is desired, but economic viability is paramount to activity validation experiments. The modified criteria were:

1. Overlap = 1

2. SVM-C score > 1.5

3. SVM-R predicted activity < 50µM

With the relaxed SVM-C criterion, 116 candidates met the specified criteria. Eleven of the 116 were economically viable and commercially available for activity validation experiments. Ultimately, seven of the eleven candidates were found active for an experimental hit rate of 63.6%. Additionally, two of the eleven candidates identified in the second round were also Factor XIIa inhibitor candidates. The full details of the experiments are reported in Table 2.14.
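The experimental hit rates quoted for the two clotting factor case studies follow directly from the counts of active and tested candidates. The short sketch below recomputes them and the change between rounds; the dictionary layout and function name are illustrative assumptions, while the counts are those reported in the text.

```python
# (actives found, candidates tested) per target and vHTS round, as reported in the text.
VALIDATION = {
    ("Factor XIIa", 1): (6, 14),
    ("Factor XIIa", 2): (11, 11),
    ("Factor XIa", 1): (3, 11),
    ("Factor XIa", 2): (7, 11),
}


def hit_rate_percent(actives, tested):
    """Experimental hit rate as a percentage, rounded to one decimal place."""
    return round(100.0 * actives / tested, 1)


hit_rates = {key: hit_rate_percent(*counts) for key, counts in VALIDATION.items()}
gains = {
    target: hit_rates[(target, 2)] - hit_rates[(target, 1)]
    for target in ("Factor XIIa", "Factor XIa")
}
for key in sorted(hit_rates):
    print(key, hit_rates[key])
```

With so few candidates per round, each additional active changes the hit rate by several percentage points, so the between-round gains are best read as trends rather than precise estimates.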

Table 2.14: Factor XIa: The Second Round Activity Validation Summary. Candidates selected were commercially available, economically viable, and met the criteria of predicted activity < 50µM, SVM-C scores > 1.5 and overlap=1. Candidates were tested in triplicate and the mean IC50 value was reported for candidates active across the triplicate. CID = PubChem Compound ID. *Factor XIIa inhibitor candidate as well (Chen and Visco Jr, 2017).

CID         Predicted IC50 [µM]    Experimental IC50 [µM]
5440853     7.93                   29.1
785793      4.13                   30.3
5146207     4.06                   32.4
743608      2.25                   15.0
4957387     4.67                   19.7
710644*     2.05                   0.58
746939*     2.25                   > 50
5166982     1.55                   0.73
5101792     2.66                   > 50
4992436     1.47                   > 50
4993295     2.71                   > 50
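The hit rate quoted for this round can be recomputed directly from Table 2.14. The sketch below is illustrative only; encoding "> 50" as `None` is a convention chosen here, not something taken from the source.

```python
# Experimental IC50 values [µM] transcribed from Table 2.14.
# None marks "> 50", i.e. no measurable IC50 up to the highest concentration tested.
experimental_ic50_um = {
    5440853: 29.1, 785793: 30.3, 5146207: 32.4, 743608: 15.0,
    4957387: 19.7, 710644: 0.58, 746939: None, 5166982: 0.73,
    5101792: None, 4992436: None, 4993295: None,
}


def hit_rate(results):
    """Return (number active, number tested, percent active)."""
    actives = sum(1 for ic50 in results.values() if ic50 is not None)
    tested = len(results)
    return actives, tested, 100.0 * actives / tested


actives, tested, percent = hit_rate(experimental_ic50_um)
print(f"{actives} of {tested} active ({percent:.1f}%)")  # 7 of 11 active (63.6%)
```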

Modeling and experimental work were conducted for this case study without considering PAINS (Baell and Holloway, 2010). Examination with the “PAINS” filter from ZINC15 (Sterling and Irwin, 2015) identified one candidate as similar to a known aggregator (Ferreira et al., 2010). To check if the activity observed was due to aggregation, the candidate was retested with varying concentrations of the surfactant according to a previously described approach (Irwin et al., 2015). Results of the retest in Table 2.15 suggest that the surfactant concentration specified by the protocol (Diamond, 2008c) was high enough to inhibit any aggregation activity that may occur, and that the activity observed in the activity validation experiments is the activity of the candidate and not the result of aggregation.

Table 2.15: Factor XIa: The PAINS Activity Summary. CID 5440853, a compound similar to a known PAINS, was retested at varying surfactant concentrations to determine if activity observed in Table 2.14 is the result of aggregation. The concentration used in the assay was the highest concentration tested, 0.02% surfactant concentration.

CID         0% surfactant    0.01% surfactant    0.02% surfactant
            IC50 [µM]        IC50 [µM]           IC50 [µM]
5440853     23.2             28.8                26.0

2.4.2 Factor XIa: Model Analysis

The previously observed trend of increased error and hit rates persists in this case study as well. Not only did the errors increase, but the ROC curves also worsened. This reinforces the conjecture that models become less specific to the training data and more predictive through retraining. The inclusion of the new activity validation experiment data seemed to refocus the vHTS on structural features that are more predictive. The hit rates for this case study were not as impressive as the hit rates for cathepsin L and Factor XIIa. However, a range of hit rates has started to emerge, which can be used later to understand what hit rates can be expected under what circumstances.

2.4.3 Factor XIa: Structural Feature Analysis

In total, twenty-two candidates were identified and ten were found active experimentally. Further examination of the candidates and the training data may yield some insight into which structural features translate into biological activity and which do not.

Many of the twenty-two candidates contain one of two scaffolds. The first scaffold, shown in Figure 2.12a, was found in eight candidates (seven active and one inactive) and eight compounds (six active and two inactive) from the training data. Comparisons between the different compounds revealed several correlations of structure and activity. First, CID 570059 from the training data is the scaffold with hydrogens for all R groups and is inactive. Therefore, any observed activity is likely due to the functional groups, and optimization is likely more limited because it will be constrained by the groups required for activity before pharmacokinetic and metabolic properties can be optimized. Second, bromine or the lack of a methoxy group at the R2 position inhibits activity. When CID 833492, an inactive compound from the training data shown in Figure 2.12c, is compared with the most similar compound in the training data and candidates, active compound CID 743706 shown in Figure 2.12d, the difference between the two structures is the methoxy/bromine group substitution at the R2 position. Therefore, either the presence of the bromine or the absence of the methoxy group is likely inhibiting activity. Third, the placement of functional groups was important. CID 743608 and 710644, shown in Table 2.14, have the same structure aside from the placement of chlorine at the R2 and R3 positions, respectively, but only CID 743608 was active. Therefore, the location of the chlorine group on the scaffold likely determined biological activity. Finally, Factor XIa may have a binding pocket that can accommodate larger functional groups. Two of the active candidates have large/bulky functional groups: CID 658532 and CID 2201551, shown in Figures 2.12e and 2.12f, respectively. Both groups are five backbone units long and steric inhibition might have been expected but was not observed. Therefore, the binding pocket may be able to accommodate larger R groups or encourage favorable orientation to prevent steric strain.

One candidate, CID 5166982 shown in Table 2.14, contains what seemed to be a derivative of the first scaffold. No other compounds or candidates contained the scaffold. Therefore, this may be a new avenue of investigation. This also suggests the pipeline may be able to identify new scaffolds that are combinations or derivatives of existing ones.

Figure 2.12: Factor XIa: The First Scaffold. (a) Scaffold 1. (b) CID 570059 (IC50 > 50µM). (c) CID 833492 (IC50 > 50µM). (d) CID 743706 (IC50 = 0.47µM). (e) CID 658532 (IC50 = 3.18µM). (f) CID 2201551 (IC50 = 0.25µM). Candidates only had functional groups in the R2 and R3 positions while training data compounds had functional groups in all positions.

The other scaffold for consideration, shown in Figure 2.13, was found wholly in two inactive candidates and partially in two active compounds from the training data. All four contain the pyrazole carbaldehyde part of the scaffold. Therefore, it is likely the pyrazole carbaldehyde part of the scaffold does not actively contribute to inhibition. Thus, that portion of the scaffold may either affect orientation/positioning or have no effect.

Figure 2.13: Factor XIa: The Second Scaffold. (a) Scaffold 2. (b) CID 841465. (c) CID 2220615. The compounds in Table 2.14 containing the scaffold were inactive but both here were active. The pyrazole carbaldehyde part of the scaffold may not actively contribute to inhibition.

It should be mentioned that Weis et al. (2008) also developed models to screen the PubChem Compound database for Factor XIa inhibitors and Li et al. (2014a) conducted the experimental confirmation. Though the same bioassay (AID 846) was used, treatment differed. Weis et al. (2008) did not use all of the compounds in the bioassay, but removed dozens and moved the classification threshold to have equivalent numbers of actives and inactives. For experimental validation (Li et al., 2014a), twenty-one candidates were selected and seven of the tested candidates were active for an experimental hit rate of 33.3%. While a direct comparison cannot be made, this is yet another contributive effort to identifying Factor XIa inhibitors.

2.5 Complement Factor C1s

To further address AIMs two and three, PubChem Bioassay dataset AID 787 (Diamond, 2008a) was selected. AID 787 contains data on 183 compounds, 23 actives and 160 inactives, which was reduced to 136 compounds, 16 actives and 120 inactives, when PAINS and similar structures were removed. The PAINS-free dataset also has a small active fraction of 11.8%. These two features make this dataset an excellent one to probe whether the pipeline can create and train predictive models in circumstances where there is a dearth of data on active compounds, which is the majority of situations in drug discovery. This dataset also addresses AIM 3 in a circumstance common to drug discovery. If retraining can improve models when datasets have low active fractions, retraining could have a large and important impact.

AID 787 focuses on complement factor C1s, a component in the immune system. The immune system is comprised of two overarching systems, the innate system and the adaptive system, to adequately respond to threats against the host from pathogens and toxins. A particular subsystem of the innate portion of the immune system is the complement system (Ziccardi, 1983). The initiator of the system, complement factor C1, is a pentamer formed from the combination of a pattern recognition subunit (C1q), two initiating subunits (C1r, EC 3.4.21.41), and two cascading subunits (C1s, EC 3.4.21.42) (Ziccardi, 1983). Coded by the complement factor 1 gene on the fourth chromosome (Goldberger et al., 1987), production and assembly of the 750 kDa glycoprotein occur mainly in monocytes and macrophages (Morris et al., 1978; Müller et al., 1978) and secondarily in cells of other tissues throughout the body (Gulati et al., 1993), and the glycoprotein is found at a relative serum concentration of 0.17µM (Ziccardi, 1983).

C1q recognizes many different patterns (Wallis et al., 2010) and upon recognition and attachment to a target, C1q cleaves and activates C1r, which itself cleaves and activates C1s (Gaboriaud et al., 2004). C1s, in turn, cleaves and activates complement factors 2 and 4. Parts of complement factors 2 and 4 combine to form C3 convertase that activates other immune system components, including increased local production of pro-inflammatory molecules (Bokisch et al., 1969; Hugli, 1986), recruitment of macrophages (Bokisch et al., 1969; Hugli, 1986), and membrane-attack complex formation (Müller-Eberhard, 1985).
The activation and subsequent amplification of immune system responses that are triggered by C1 make C1 a therapeutic target of focus. Normal regulation occurs via endogenous production of C1 inhibitor (Ratnoff and Lepow, 1957), which binds irreversibly to C1r and C1s (Sim et al., 1979; Ziccardi and Cooper, 1979). Because the C1 inhibitor is the only endogenous regulator (Ziccardi, 1981, 1985), mutations in the C1 inhibitor or physiological cues that excessively activate C1s may lead to deficient control over the complement system.

One disease resulting from mutations in the C1 inhibitor is hereditary angioedema: insufficient or deficient C1 inhibitor activity results in inflammation, edema, and other symptoms (Gompels et al., 2005). As for physiological cues in complement system implicated diseases, incomplete removal of extracellular components can result in diseases like age-related macular degeneration (extracellular debris) (Anderson et al., 2010) or Alzheimer’s (amyloid fibrils) (Alexander et al., 2008): the complement system and other immune responses activate to remove extracellular components marked by C1q, with incomplete removal creating a chronic positive feedback loop that escalates damage to surrounding tissue. It should be noted that dysregulation can also stem from other elements in the complement system and that C1 is only one of the many possible therapeutic targets (Ricklin et al., 2010).

Currently available therapeutic treatments include FDA-approved supplementary C1 inhibitors derived from donor plasma: Berinert® (Kawalec et al., 2013; Craig et al., 2009) and Cinryze™ (Lunn et al., 2010). Supplementary recombinant C1 inhibitor from transgenic rabbit mammary glands, Ruconest® (Longhurst, 2008; Kawalec et al., 2013; Cruz, 2015), is another FDA-approved treatment. However, all three supplementary C1 inhibitor treatments pose a major financial burden (Lunn et al., 2010; Kawalec et al., 2013; Cruz, 2015), which is already a reduction from the cost of emergency treatment (Wilson et al., 2010; Petraroli et al., 2015).

Alternative treatments studied and still under review include inhibiting the activity of the individual subcomponents of C1, where C1s has emerged as the preferred target (Buerke et al., 2001). A treatment using antibodies to target C1s was patented and is currently in Phase 1 trials (Shi et al., 2013; Van Vlasselaer et al., 2017). As for small molecule inhibitors, some have been found (Szalai et al., 2000; Buerke et al., 2001; Subasinghe et al., 2004) and optimized by different modifications (Subasinghe et al., 2006; Travins et al., 2008), for example PEGylation (Subasinghe et al., 2012), to circumvent issues. Computational methods were used to model small molecule inhibitors in docking simulations (Subasinghe et al., 2004, 2006; Travins et al., 2008; Subasinghe et al., 2012) but not to identify leads. vHTS identification of leads targeting another complement factor (Vulpetti et al., 2017) suggests vHTS can identify small molecule C1s inhibitors as well.

2.5.1 Complement Factor C1s: vHTS and Experimental Validation Results

PubChem Bioassay AID 787 contains 183 compounds: 23 actives of known IC50 values and 160 compounds of unknown IC50 values (Diamond, 2008a). Compounds were tested at concentrations up to 50µM for the dataset, which was also the highest concentration tested in the activity validation experiments for this case study. After PAINS and PAINS-like compounds were removed from AID 787, the PAINS-free dataset contained 136 compounds: 16 actives and 120 inactives. Fragmentation with Signature yielded 1072 atomic Signatures of heights 0, 1, and 2. The role of different atomic Signatures in the construction of the principal components by PCA was examined to identify and filter for those contributing the most to capturing variance. The filtered atomic Signatures were then used for GA-SVM model construction and training, the details of which are tabulated in Table 2.16.

Table 2.16: C1s: The First Round vHTS Modeling Summary. Cross-Validation = cross-validation error.

                            SVM-C                              SVM-R
Training set (NO PAINS)     16 actives, 120 inactives;         16 actives only;
                            136 total                          known IC50
Training data               11 h=0, 136 h=1, 925 h=2;          8 h=0, 57 h=1, 165 h=2;
atomic Signatures           1072 total                         230 total
PCA Results                 159 of 1072 atomic Signatures      230 of 230 atomic Signatures
Models Created              115                                1
Training Error              0                                  0
Cross-Validation            0.007                              0.090

Curiously, all atomic Signatures found in the active compounds were used to train the activity prediction QSAR model. One explanation may be that the amount of active compound data was not enough to distinguish between which atomic Signatures can be used to predict activity and which ones cannot. Another reason could be that the structural diversity of the active compounds, combined with the limited number of actives, created a dearth of information and so could not provide the resolution needed to identify predictive atomic Signatures. In any event, all atomic Signatures of the active compounds were used to create/train the activity prediction QSAR model.

A priori evaluation with ROC curves in Figure 2.14 concurred with the statistics presented in Table 2.16. The ROC curve reached the top-left corner, (0,1), for all 115 classification models, indicating a sharp division between active and inactive classes, which is supported by the small training and cross-validation errors observed. The ROC curve for the activity prediction QSAR model was farther away from (0,1) and did not resemble a step function. This may be due to the presence of atomic Signatures not used to train the QSAR. Only 230 atomic Signatures from the 16 active compounds were used in training the activity prediction QSAR model while the whole data set, including 120 inactives, yielded 1072 atomic Signatures. With the degree of extrapolation required for the SVM-R model to predict the activity of inactive compounds, it is reasonable to expect the ROC curve observed in Figure 2.14b. Once the activity validation selection criteria limit extrapolation, the activity prediction power of the QSAR model should be higher than the corresponding ROC curve suggests.

Figure 2.14: C1s: The First Round vHTS ROC. (a) ROC Curves For 115 SVM-C Models (AUC = 1). (b) ROC Curve For The SVM-R Model (AUC = 0.667). Axes: false positive rate (x) versus true positive rate (y). All ROC curves are above the y=x line, indicating that all models contribute to identifying hits and that correct identification of actives is due to model performance, not chance.
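To make the AUC values in the figure concrete, the sketch below computes AUC through its standard pairwise interpretation: the probability that a randomly chosen active is scored above a randomly chosen inactive, with ties counted as one half. This is an illustration on invented scores, not the procedure used to produce the figure. For an activity prediction model, one plausible choice (an assumption here) is to use the negated predicted IC50 as the score so that more potent predictions rank higher.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = active, 0 = inactive) and invented model scores.
labels = [1, 1, 1, 0, 0, 0]
separated = roc_auc([2.1, 1.8, 1.2, -0.5, -1.0, -2.2], labels)  # every active outranks every inactive
mixed = roc_auc([2.1, -0.7, 1.2, 1.5, -1.0, -2.2], labels)      # 7 of 9 pairs ranked correctly
```

An AUC of 1 corresponds to the step-function curve described for the classification models, while values approaching 0.5 correspond to curves near the y=x line.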

All 115 classification models and one activity prediction model were used in a vHTS screen of the 72 million compounds in the PubChem Compound database. As all 115 classification models performed equally well, all 115 models were used in a consensus approach (i.e., all models must agree to assign a compound a class) to maximize data utility. The criteria used to identify candidates from the vHTS results were:

1. Overlap = 1

2. SVM-C score > 2 for all 115 SVM-C models

3. SVM-R predicted activity < 50µM
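The consensus requirement in the second criterion can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple layout, the function names, and the toy scores (three models rather than 115) are hypothetical, and only the thresholds come from the criteria above.

```python
def consensus_active(svm_c_scores, threshold=2.0):
    """True only if every classification model scores the compound above the threshold."""
    return len(svm_c_scores) > 0 and all(score > threshold for score in svm_c_scores)


def select(candidates, threshold=2.0, ic50_max_um=50.0):
    """candidates: iterable of (cid, overlap, svm_c_scores, predicted_ic50_um) tuples."""
    return [
        cid
        for cid, overlap, scores, ic50 in candidates
        if overlap == 1.0 and consensus_active(scores, threshold) and ic50 < ic50_max_um
    ]


# Toy data with three models instead of 115; all values are invented.
toy = [
    (101, 1.0, [2.4, 2.2, 3.1], 8.0),   # unanimous, passes all criteria
    (102, 1.0, [2.4, 1.9, 3.1], 8.0),   # one model falls below the threshold
    (103, 1.0, [2.6, 2.5, 2.7], 60.0),  # predicted activity too weak
]
selected = select(toy)
```

Requiring unanimity means a single dissenting model removes a compound, so the consensus rule trades candidate count for classification confidence.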

Notice the second criterion not only used all classification models but applied the same standard to predictions by all models. This is to maximize the utility of the data available and to increase the ability to identify hits by the pipeline. Additionally, the focus on economical candidates carries more significance in this assay as the current FDA-approved therapies are a significant financial burden on those afflicted with C1 dysregulation (Lunn et al., 2010; Kawalec et al., 2013; Cruz, 2015; Wilson et al., 2010; Petraroli et al., 2015). Seven candidates met the three presented criteria, four of which were active in activity validation experiments for an experimental validation hit-rate of 57%. The data for all seven candidates are shown in Table 2.17.

Table 2.17: C1s: The First Round Activity Validation Summary. Candidates selected are commercially available, economically viable, and met the criteria of predicted activity < 50µM, SVM-C scores > 2 for all 115 SVM-C models, and overlap=1. Candidates were tested in triplicate and the mean IC50 values are reported for candidates active across the triplicate. CID = PubChem Compound ID. *Candidate showed weak activity at 50µM but is inactive under stated criteria.

CID          Predicted IC50 [µM]    Experimental IC50 [µM]
17178137     21.9                   11.0
4951143      6.77                   19.1
2986934      1.18                   0.34
710644       4.36                   1.09
5146207      10.9                   > 50*
807111       8.88                   > 50*
1107361      11.2                   > 50

The experimental hit rate of 57% is higher than traditional HTS (Dobson, 2004), but to address AIM 3 and to determine if the models can be improved by retraining, as observed in the previous case studies (Chen and Visco Jr, 2016, 2017), the models were retrained. With the addition of new experimental data, the training set now contains information for 143 compounds: 20 actives and 123 inactives. The results of retraining are detailed in Table 2.18.
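The bookkeeping behind the enlarged training set can be sketched as below. The function name, the placeholder training identifiers, and the use of `None` for "> 50" are assumptions made for illustration; the counts and the IC50 values are those reported in the text and in Table 2.17.

```python
def add_validation_results(training, validated, cutoff_um=50.0):
    """Return a new training set extended with experimentally tested candidates.

    training:  dict mapping compound ID -> True (active) / False (inactive)
    validated: dict mapping CID -> experimental IC50 in µM, or None if inactive up to the cutoff
    """
    updated = dict(training)
    for cid, ic50 in validated.items():
        updated[cid] = ic50 is not None and ic50 < cutoff_um
    return updated


# Stand-in for the PAINS-free AID 787 set: 16 actives and 120 inactives with placeholder IDs.
training = {f"train-{i}": i < 16 for i in range(136)}

# Experimental IC50 values [µM] from Table 2.17; None marks "> 50".
round_one = {
    17178137: 11.0, 4951143: 19.1, 2986934: 0.34, 710644: 1.09,
    5146207: None, 807111: None, 1107361: None,
}

retraining = add_validation_results(training, round_one)
n_active = sum(retraining.values())
n_inactive = len(retraining) - n_active
```

With these inputs the retraining set holds 20 actives and 123 inactives, matching the 143 compounds stated above.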

Table 2.18: C1s: The Second Round vHTS Modeling Summary. Cross-Validation = cross-validation error.

                            SVM-C                              SVM-R
Training set (NO PAINS)     20 active, 123 inactive;           20 actives only;
                            143 compounds                      known IC50
Training data               11 h=0, 136 h=1, 925 h=2;          8 h=0, 57 h=1, 165 h=2;
atomic Signatures           1072 total                         230 total
PCA Results                 164 of 1072 atomic Signatures      186 of 230 atomic Signatures
Models Created              1224                               1
Training Error              0.021                              0.108
Cross-Validation            0.020                              0.162

The retraining resulted in 1224 classification models and 1 activity prediction model. A priori evaluation of all models was conducted with the ROC curves shown in Figure 2.15. The classification curves approach the top-left corner, (0,1), and resemble a step function, which once again indicates a sharp distinction between classes. The curves of all models were above the y=x line; thus, accurate identification of active candidates is due to the models’ predictive power rather than random chance. The ROC curve for the activity prediction QSAR is closer to the y=x line; however, the same argument still holds: the model is extrapolating to inactive compounds containing atomic Signatures found only in the inactive compounds, and the activity prediction model should perform better when extrapolation is limited by the activity validation selection criteria.

Figure 2.15: C1s: The Second Round vHTS ROC. (a) ROC Curve For 1224 SVM-C Models (AUC between [0.972, 0.998]). (b) ROC Curve For The SVM-R Model (AUC = 0.601). Axes: false positive rate (x) versus true positive rate (y). The ROC curves for all 1224 classification models and one activity prediction model are above the y=x line, indicating correct identification of active candidates is a function of model performance, not chance.

All 1224 classification models had identical errors and very similar AUC values. To maximize data utility, all classification models were used in concert for a consensus approach as was done in the first round. A limited number of candidates passed the selection criteria in the first round and even fewer candidates passed the same criteria in the second round. To increase the number of candidates, the first criterion (overlap=1) and the second criterion (SVM-C > 2) were relaxed. The new candidates considered due to relaxing the criteria carry lower confidence in correct classification and require more extrapolation. However, these candidates also present an opportunity to determine if the pipeline can create models that can accurately make predictions while extrapolating. The modified criteria were:

1. Overlap ≥ 0.9, i.e., at least 90% of atomic Signatures in the candidate molecule are found in the training set.

2. SVM-C score > 0 for all 1224 SVM-C models. The models must unanimously agree a molecule should be in the active class.

3. SVM-R predicted activity < 50µM
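The overlap measure in the first criterion can be sketched as a set computation. Whether the pipeline counts distinct atomic Signatures or weights them by occurrence is not stated in this section; the version below assumes distinct Signatures, and the placeholder strings stand in for real atomic Signatures.

```python
def overlap(candidate_signatures, training_signatures):
    """Fraction of a candidate's distinct atomic Signatures that also occur in the training set."""
    candidate = set(candidate_signatures)
    if not candidate:
        raise ValueError("candidate has no atomic Signatures")
    return len(candidate & set(training_signatures)) / len(candidate)


# Placeholder strings stand in for real atomic Signatures.
training = {f"sig{i}" for i in range(20)}
examples = {
    "inside": [f"sig{i}" for i in range(10)],                     # all 10 seen in training
    "mostly": [f"sig{i}" for i in range(9)] + ["new1"],           # 9 of 10 seen in training
    "sparse": [f"sig{i}" for i in range(8)] + ["new1", "new2"],   # 8 of 10 seen in training
}

overlaps = {name: overlap(sigs, training) for name, sigs in examples.items()}
eligible = [name for name, value in overlaps.items() if value >= 0.9]
```

Under the original criterion (overlap = 1) only the first toy molecule would qualify; the relaxed threshold of 0.9 also admits the second, which requires the models to extrapolate to one unseen Signature.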

Fifty-two candidates met the modified criteria and economic and commercial availability considerations. Ultimately ten of the fifty-two candidates were purchased for experimental validation and five of the ten were active for an activity validation hit rate of 50%. Details on all ten candidates tested experimentally can be found in Table 2.19.

Table 2.19: C1s: The Second Round Activity Validation Summary. Candidates selected are commercially available, economically viable, and met the criteria of predicted activity < 50µM, SVM-C scores > 0 and overlap ≥ 0.9. Candidates were tested in triplicate and the mean IC50 values are reported for candidates active across the triplicate. CID = PubChem Compound ID. *Candidate showed weak activity at 50µM but is inactive under stated criteria.

CID          Predicted IC50 [µM]    Experimental IC50 [µM]
827004       0.30                   3.04
4957387      4.27                   32.9
898930       26.01                  5.54
17178134     7.21                   23.1
17178138     33.44                  42.6
17131127     23.71                  > 50*
834536       1.66                   > 50*
693001       0.43                   > 50*
792914       2.05                   > 50*
570059       4.20                   > 50*

2.5.2 Complement Factor C1s: Model Evaluation

In most case studies presented, the error and hit rates increase between the first and second iterations. This was thought to be the result of increased information available for training, which resulted in models that are less specific to the training data but more predictive in nature. In this work, only the error rate increased while the hit rate fell slightly. The previously provided explanation is most likely still in effect, but with an additional layer of complexity: the relaxation of the overlap criterion. When the overlap criterion was relaxed, extrapolation was increased in exchange for allowing more candidates to be considered. As there is an inverse relationship between prediction accuracy and extrapolation, the adverse effects of criterion relaxation counterbalanced the gain from additional information, resulting in an equivalent or slightly worse performance in the second round than in the first. Interestingly, the hit-rates of this study were comparable to ones in the previous case studies, though the active fraction for this data set is much lower (11.8% here vs. an average of 50% in other case studies). This indicates the robustness of the pipeline and the ability of produced models to provide accurate predictions even in circumstances where there is less data for the active class, which is the case for the majority of data sets available for drug discovery and other applications.
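The figures compared in this evaluation follow from simple ratios, sketched below from the counts reported for this case study. The final ratio of hit rate to training-set active fraction is an illustrative quantity introduced here, not one computed in the source, and it should not be read as an enrichment over random screening of PubChem.

```python
def fraction(part, whole):
    """Plain ratio, kept as a helper so each quantity is computed the same way."""
    return part / whole


active_fraction = fraction(16, 136)    # PAINS-free AID 787: 16 actives of 136 compounds
round_one_hit_rate = fraction(4, 7)    # Table 2.17: 4 of 7 candidates active
round_two_hit_rate = fraction(5, 10)   # Table 2.19: 5 of 10 candidates active

# Illustrative only: how many times larger each hit rate is than the training-set active fraction.
hit_rate_to_active_fraction = {
    "round 1": round_one_hit_rate / active_fraction,
    "round 2": round_two_hit_rate / active_fraction,
}

for name, value in [("active fraction", active_fraction),
                    ("round 1 hit rate", round_one_hit_rate),
                    ("round 2 hit rate", round_two_hit_rate)]:
    print(f"{name}: {100 * value:.1f}%")
```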

2.5.3 Complement Factor C1s: Structural Feature Analysis

A total of seventeen candidates have been identified by vHTS for activity verification. Nine of the seventeen candidates were experimentally determined to be active and can be used to further research efforts in the study of C1s. While the pipeline cannot definitively elucidate the mechanism of action, structural comparisons can identify other avenues of research and study.

Examination of the seventeen candidates identified two different scaffolds for further consideration. Examination of the first scaffold and five related candidates, shown in Figure 2.16, yielded several different observations about functional groups and activity. First, the presence of bromine in the R3 position likely has a small impact on activity. Candidates in Figures 2.16c and 2.16d are structurally identical but for the presence of the bromine and have similar IC50 values as well (19.1µM vs 23.1µM, respectively).

Second, whether the large functional group is at the R1 or R2 position greatly impacted activity. Candidates in Figures 2.16d and 2.16f are structurally identical except for the location of the ester group at the R1 or R2 position, and that difference determined whether the candidate was active or inactive. A similar analysis for candidates in Figures 2.16b and 2.16e yielded similar information. If the first observation regarding the bromine is true, then the placement of the ester group at the R1 or R2 position likely accounts for the activity discrepancy. From this, perhaps the R1 para position is more favorable to activity than the R2 meta position.

Figure 2.16: C1s: The First Scaffold. (a) Scaffold 1. (b) CID 17178137 (IC50 = 11.0µM). (c) CID 4951143 (IC50 = 19.1µM). (d) CID 17178134 (IC50 = 23.1µM). (e) CID 17178138 (IC50 = 42.6µM). (f) CID 17131127 (IC50 > 50µM). Five candidates contained scaffold 1. The scaffold was found in the candidates only.

Nine candidates contained the second scaffold, shown in Figure 2.17a. Six of the nine candidates containing the second scaffold were analyzed in the same manner as the candidates with the first scaffold and yielded a similar pattern of preference for the R3 para position. Candidates in Figures 2.17b, 2.17c, and 2.17d have the same structure with the ether group at the R3, R2, and R1 positions, respectively. The candidate with the ether group at the R3 position was most active, followed by the candidate with the ether at the R1 position and, finally, the candidate with the ether at the R2 position, which was least active. Candidates in Figures 2.17e and 2.17f depict a similar pattern. For reference, Figure 2.17g is the second scaffold with hydrogens at the R1, R2, and R3 positions and is inactive. Therefore, activity is likely predicated on the presence and location of the functional group.

Figure 2.17: C1s: The Second Scaffold. (a) Scaffold 2. (b) CID 710644 (IC50 = 1.09µM). (c) CID 5146207 (IC50 > 50µM). (d) CID 898930 (IC50 = 5.54µM). (e) CID 827004 (IC50 = 3.04µM). (f) CID 834536 (IC50 > 50µM). (g) CID 570059 (IC50 > 50µM). (h) SID 7977382 (IC50 = 0.85µM). The scaffold was found in five of nine candidates. This scaffold was also found in only one compound in the original data set, SID 7977382.

The second scaffold was also found in the training data within SID 7977382, shown in Figure 2.17h. The methoxy groups observed in several candidates at the R2 and R3 positions are fused into a dioxol ring. The resulting compound is active and suggests a synergistic effect on activity due to ring fusion as opposed to only methoxy groups at the R2 and R3 positions (Figures 2.17b and 2.17c, respectively).

2.6 SENP8

To examine how the pipeline performs when the active fraction of a dataset is between 0.6 and 0.8 for AIM 2, PubChem Bioassay 624322 (Salvesen, 2012) was selected. Composed initially of 290 compounds, 217 actives and 73 inactives, the dataset contains 253 compounds, 187 actives and 66 inactives, once PAINS and similar structures were identified and removed. It has an active fraction of 0.739, allowing it to address the 0.6 to 0.8 active fraction range for AIM 2 and provide a performance expectation of the pipeline for a dataset with similar attributes. Additionally, the models were retrained for use in a second vHTS to address AIM 3 and provide information about how retraining is affected when the active fraction is this large.

The focus of AID 624322 is Sentrin Peptidase Family Member 8 (SENP8) (Mukhopadhyay and Dasso, 2007). Also known as DEN1 (Gan-Erdene et al., 2003; Wu et al., 2003), NEDP1 (Mendoza et al., 2003), and PRSC2 (National Library of Medicine (US), 2017), SENP8 is the modulator of a protein modification process with the ubiquitin-like protein called Neural precursor cell Expressed Developmentally Downregulated 8 (NEDD8). The SENP8 gene on (National Library of Medicine (US), 2017) codes for a 212 residue protease (Wu et al., 2003) that matures the inactive NEDD8 precursor by cleaving it after the C-terminal of the diglycine motif (Mendoza et al., 2003). The exposed diglycine motif is used to conjugate NEDD8 to other proteins, like . SENP8 also cleaves excess NEDD8 from hyper-NEDDylated proteins (Gan-Erdene et al., 2003; Wu et al., 2003; Mendoza et al., 2003). NEDDylation affects protein stability and protein interactions (Broemer et al., 2010). As a result, NEDDylation is a highly conserved process with homologous sequences found in other organisms (Del Pozo et al., 1998; Lammer et al., 1998; Mendoza et al., 2003; Zhou and Watts, 2005; Chan et al., 2008).

NEDD8 modification occurs in three steps with three different kinds of enzymes: activation, conjugation, and ligation (Glickman and Ciechanover, 2002). NEDD8 is activated by the activating enzyme NAE, composed of NAE1 (APP-BP1) and Uba3 (Osaka et al., 1998; Liakopoulos et al., 1998). Conjugation occurs with the conjugation enzyme Ubc12 (Osaka et al., 1998; Liakopoulos et al., 1998). Ligation occurs with many different ligase enzymes specific for different target proteins, including IAPs (Broemer et al., 2010), DCNL3 (Meyer-Schaller et al., 2009), DCUN1D1 (Huang et al., 2011), MDM2 (Embade et al., 2012; Xirodimas et al., 2004; Watson et al., 2006), and c-CBL (Zuo et al., 2013). NEDDylation affects key proteins in many different systems, including MDM2 (Watson et al., 2006, 2010), IAPs (Broemer et al., 2010; Nagano et al., 2012), E2F1 (Loftus et al., 2012; Aoki et al., 2013), HuR (Embade et al., 2012), BCA3 (Gao et al., 2006), p53 (Xirodimas et al., 2004; Watson et al., 2010; Abida et al., 2007), L11 (Mahata et al., 2012), s14 (Zhang et al., 2014), and caspase 7 (Broemer et al., 2010; Nagano et al., 2012). These proteins control systems including cell cycles and proliferation (Watson et al., 2010; Wang et al., 2015; Tateishi et al., 2001), viability (Osaka et al., 2000; Chan et al., 2008), tumorigenesis (Xirodimas et al., 2004; Gao et al., 2006), and inflammation (Ehrentraut et al., 2013).
Dysregulation of these systems is implicated in many diseases, including cancer, and NEDD8 has been identified as a therapeutic target (Soucy et al., 2009). Bortezomib (Velcade™) (Kane et al., 2003; Field-Smith et al., 2006; Kane et al., 2007) is an FDA-approved drug that targets ubiquitin, which is structurally similar to NEDD8 and is another modifier protein, suggesting that it is possible to identify small molecule inhibitors of NEDD8. A few small-molecule indirect inhibitors were identified in the search, some by vHTS (Leung et al., 2011; Zhong et al., 2012).

MLN4924 (Pevonedistat) is a small molecule indirect inhibitor that binds selectively and irreversibly to NEDD8 activating enzyme NAE (Soucy et al., 2009) and displays in vitro and in vivo anti-cancer cell functionality (Soucy et al., 2009; Lin et al., 2010; Luo et al., 2012; Lan et al., 2016; Czuczman et al., 2016; Tong et al., 2017). MLN4924 is currently undergoing Phase 1 trials (Swords et al., 2015, 2017). While targeting NAE does stop the initiation of NEDDylation, activated NEDD8 is still being produced and accumulated. Another approach is to target SENP8, which is responsible for activating NEDD8 and for NEDD8 removal from hyper-NEDDylated proteins. If SENP8 is inhibited, activated NEDD8 is eventually completely consumed and NEDDylation will halt. Therefore, SENP8 may be a viable alternative target to inhibit NEDD8 and will be the focus of this case study.

2.6.1 SENP8: vHTS and Experimental Validation Results

Originally, AID 624322 contained data on 290 compounds: 217 actives with IC50 values and 73 inactives. Removal of PAINS (Baell and Holloway, 2010) and compounds with similar structures reduced the training set to 253 compounds: 187 actives and 66 inactives. Signature fragmentation yielded 1417 atomic Signatures of heights 0, 1, and 2. The principal components constructed by PCA were examined to identify which of the 1417 atomic Signatures contributed the most to variance capture. The highest-contributing atomic Signatures were used for SVM-C and SVM-R model creation and training, the results of which are detailed in Table 2.20.

ROC curves were created to analyze the performance of both the classification and activity prediction models a priori. Both curves are above the y=x line, with AUC > 0.5, so correct active predictions are attributable to model performance rather than random chance. The ROC curve for the SVM-R model is close to the y=x line mainly because the model was used to predict the activity of inactive compounds to create the ROC curve. Recall that only the atomic Signatures in active compounds were used to train the SVM-R models. Application to inactive compounds requires extrapolation and reduces model performance. When extrapolation is limited by the activity validation candidate selection criteria, the activity prediction QSAR model should perform better than the ROC curve suggests.

Table 2.20: SENP8: The First Round vHTS Modeling Summary. Cross-validation = cross-validation error.

                                 SVM-C                                    SVM-R
Training set (no PAINS)          187 active, 66 inactive (253 total)      187 actives only (known IC50)
Training set atomic Signatures   16 h=0, 181 h=1, 1220 h=2 (1417 total)   15 h=0, 164 h=1, 995 h=2 (1174 total)
PCA results                      231 of 1417 atomic Signatures            291 of 1174 atomic Signatures
Models created                   1                                        1
Training error                   0.111                                    0.447
Cross-validation                 0.106                                    0.529

After application of both classification and activity prediction models in a vHTS, candidates for experimental testing were selected with the following criteria:

1. Overlap = 1

2. SVM-C score > 2.

3. SVM-R predicted activity < 50µM
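The three criteria above can be sketched as a simple filter over screening output. This is a minimal illustration, not the pipeline's actual implementation: the `Candidate` record, its field names, the function `passes_first_round`, and all values in the toy list are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    cid: int                  # placeholder identifier, not a real PubChem CID
    overlap: float            # overlap metric against the training data
    svm_c_score: float        # SVM-C classification score
    predicted_ic50_um: float  # SVM-R predicted IC50 in µM


def passes_first_round(c: Candidate,
                       min_svm_c: float = 2.0,
                       max_ic50_um: float = 50.0) -> bool:
    """Apply the three criteria: overlap = 1, SVM-C > 2, predicted IC50 < 50 µM."""
    return (c.overlap == 1.0
            and c.svm_c_score > min_svm_c
            and c.predicted_ic50_um < max_ic50_um)


# Made-up screening output, for illustration only.
screen = [
    Candidate(1, 1.0, 2.4, 3.1),    # meets all three criteria
    Candidate(2, 1.0, 1.7, 2.0),    # fails: SVM-C score not above 2
    Candidate(3, 0.95, 3.0, 1.2),   # fails: overlap below 1
    Candidate(4, 1.0, 2.1, 80.0),   # fails: predicted IC50 too high
]
selected = [c.cid for c in screen if passes_first_round(c)]
print(selected)
```

Keeping the thresholds as parameters makes it easy to see how tightening or relaxing a single criterion changes the size of the candidate pool.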

In other case studies, the predicted IC50 criterion was set to the maximum concentration at which candidates were tested. This maximized the number of candidates under consideration without extrapolating and focused efforts on the most active candidates. Here, the criterion was set to < 50µM, half of the maximum concentration of 100µM tested in the dataset, to narrow down the number of candidates to test. Many candidates met the first and second criteria, and even with the more rigorous third criterion, 2519 candidates met all criteria and were available for experimental validation. Aside from commercial availability, economics dictates a preference for cheaper alternatives when quality and performance are at most slightly worse. Without further information regarding the quality and performance of the candidates, economical candidates were prioritized. After economic viability and commercial availability considerations, ten candidates were selected for activity validation experiments. Four of the ten candidates were found active, for an experimental hit rate of 40%.

(a) ROC Curve For The SVM-C Model (AUC = 0.884). (b) ROC Curve For The SVM-R Model (AUC = 0.563). Axes: false positive rate (x), true positive rate (y).

Figure 2.18: SENP8: The First Round vHTS ROC. The curves indicate that both models contribute to the identification of active leads: both curves are above the y=x line, indicating that accurate identification of active candidates is due to the models and not chance.

Table 2.21: SENP8: The First Round Activity Validation Summary. Candidates selected are commercially available, economically viable, and passed the criteria of predicted activity < 50µM, SVM-C scores > 2, and overlap = 1. Candidates were tested in triplicate and the mean IC50 reported for candidates active across the triplicate. CID = PubChem Compound ID.

CID        Predicted IC50 [µM]   Experimental IC50 [µM]
17300927   2.91                  1.28
2957665    2.91                  1.43
2955496    3.96                  5.35
17299262   2.37                  36.4
1298377    2.82                  > 50
724460     4.07                  > 50
4012508    17.9                  > 50
532206     6.04                  > 50
12450088   6.83                  > 50
19655523   6.83                  > 50

While 40% is a relatively high first round hit rate among the case studies and is much higher than the hit rate of traditional HTS (Dobson, 2004), experience with the pipeline to this point suggests hit rates can be improved if models are retrained. To achieve a higher hit rate, to address AIM 3, and to determine whether a higher active fraction affects the gain in hit rate from retraining, the models were retrained with the addition of the new activity validation experiment data. The results of retraining are detailed in Table 2.22.

Table 2.22: SENP8: The Second Round vHTS Modeling Summary. Cross-validation = cross-validation error.

                                 SVM-C                                    SVM-R
Training set (no PAINS)          191 active, 72 inactive (263 total)      191 compounds (known IC50)
Training set atomic Signatures   16 h=0, 181 h=1, 1220 h=2 (1417 total)   15 h=0, 164 h=1, 995 h=2 (1174 total)
PCA results                      228 of 1417 atomic Signatures            289 of 1174 atomic Signatures
Models created                   1                                        1
Training error                   0.137                                    0.416
Cross-validation                 0.126                                    0.533

Both retrained models were evaluated with ROC curves a priori. The new ROC curves are similar to the ones from the first round models, though interestingly the ROC curve for the activity prediction model has a slightly increased AUC: 0.563 in the first round vs. 0.571 in the second round.

The second vHTS, conducted with the retrained classification and activity prediction models, identified 940 candidates with the first round candidate selection criteria. Originally, twelve of the 940 candidates were selected for activity verification experiments. However, seven candidates were fluorescent at the emission wavelength of the assay indicator and were excluded from further consideration. Of the five remaining candidates, one was active, for an experimental hit rate of 20%. Details on the five candidates are reported in Table 2.23.
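As a reminder of what the AUC values compared above measure, the AUC equals the probability that a randomly chosen active is scored above a randomly chosen inactive, with ties counted as one half. The sketch below computes it directly from that definition. The function name `roc_auc` and the toy scores are hypothetical, and it is assumed that for an activity prediction model a larger score means "more likely active" (for example, the negative of the predicted IC50).

```python
def roc_auc(active_scores, inactive_scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties contribute 0.5, so a model that scores everything equally gets 0.5,
    which corresponds to the y=x line on a ROC plot.
    """
    if not active_scores or not inactive_scores:
        raise ValueError("both classes must be non-empty")
    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))


# Toy scores, for illustration only.
auc = roc_auc([0.9, 0.4], [0.5, 0.1])
print(auc)
```

Under this reading, an AUC of 0.563 or 0.571 means an active is ranked above an inactive only slightly more often than a coin flip would manage.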

(a) ROC Curve For The SVM-C Model (AUC = 0.861). (b) ROC Curve For The SVM-R Model (AUC = 0.571). Axes: false positive rate (x), true positive rate (y).

Figure 2.19: SENP8: The Second Round vHTS ROC. The curves are above the y=x line. Thus, the models identify active candidates better than chance.

Table 2.23: SENP8: The Second Round Activity Validation Summary. Five non-fluorescent candidates were commercially available, economically viable, and had predicted activity < 50µM, SVM-C scores > 2, and overlap = 1. Candidates were tested in triplicate. The mean IC50 values were reported for candidates active across the triplicate. CID = PubChem Compound ID.

CID       Predicted IC50 [µM]   Experimental IC50 [µM]
1109711   2.96                  27.7
1254030   3.08                  > 50
1202874   4.29                  > 50
3904065   5.22                  > 50
729994    8.74                  > 50

2.6.2 SENP8: Model Analysis

In the other case studies presented, fluorescent candidates were occasionally observed and excluded. However, this was the first time that the majority of candidates were fluorescent and had to be excluded. Because so few candidates remained, it is unclear whether the retrained models improved or regressed. This result forced a reconsideration of the results thus far: are they random chance, or do they indicate the merit and real hit rate of the models created by the pipeline? The number of candidates tested in each activity validation study is fewer than twenty. Ultimately, this reflects the drug discovery process, where resource constraints limit which experiments can be done and which candidates are tested. As long as the hit rates are reported in context, there is no reason to discount them. An additional caveat is the active fraction of the training data. When more data from one class are presented, supervised training methods are inclined to assign the majority class (Barandela et al., 2003). In the case of AID 624322, the high active fraction translates to a higher likelihood of classifying a candidate as active and more difficulty correctly identifying inactives to exclude, resulting in more false positives and lower hit rates because more candidates need to be tested to identify the true actives.
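To put the small sample sizes in context, a confidence interval can be attached to each hit rate. The sketch below uses the Wilson score interval, a standard choice for small binomial samples. It is an illustration added here and not part of the original analysis, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(hits, tested, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if tested <= 0 or not 0 <= hits <= tested:
        raise ValueError("need tested > 0 and 0 <= hits <= tested")
    p = hits / tested
    denom = 1.0 + z * z / tested
    center = (p + z * z / (2.0 * tested)) / denom
    half = z * math.sqrt(p * (1.0 - p) / tested
                         + z * z / (4.0 * tested * tested)) / denom
    return center - half, center + half


# Hit counts reported for the two SENP8 rounds: 4 of 10, then 1 of 5.
first_round = wilson_interval(4, 10)
second_round = wilson_interval(1, 5)
print(first_round, second_round)
```

With so few candidates per round, the two intervals overlap substantially, which is consistent with the statement that it is unclear whether the retrained models improved or regressed.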

2.6.3 SENP8: Structural Feature Analysis

Fifteen candidates were identified and tested in this work, five of which are active and can be used to help study SENP8 inhibition as an alternative to direct NEDD8 inhibition. Besides future use as probes, the five active candidates and the ten inactive candidates can also be used to elucidate additional avenues of inquiry. Examination yielded a complete scaffold and a partial scaffold of interest.

The complete scaffold was found in five different candidates: CID 2957665, CID 17300927, CID 2955496, CID 17299262, and CID 1298377, shown with the scaffold in Figure 2.20. The methoxy groups at the R4 and R5 positions were left as methoxy groups in CID 2957665, Figure 2.20c, and fused into five- or six-membered rings in all other candidates. Activities span a large range whether the R4-R5 methoxy configurations were identical or different, indicating a likely diminished role for R4 and R5 in affecting activity. Candidates with a ring at the R2 and R3 positions, shown in Figures 2.20b and 2.20c, were consistently more active than those without, suggesting a preference for a ring structure at those positions. Finally, a wide variety of functional groups were observed at the R1 position. Comparison of the similar structures found in CID 2955496, CID 17299262, and CID 1298377, shown in Figures 2.20d, 2.20e, and 2.20f respectively, indicated functional group identity at the R1 position as the likely main differentiator of activity.

(a) Scaffold 1. (b) CID 17300927 (IC50 = 1.28µM). (c) CID 2957665 (IC50 = 1.43µM). (d) CID 2955496 (IC50 = 5.35µM). (e) CID 17299262 (IC50 = 36.4µM). (f) CID 1298377 (IC50 > 50µM).

Figure 2.20: SENP8: The First Scaffold. No compound containing the scaffold was present in the training set. Analysis suggests the R1 position is key to activity while R4 and R5 are not.

The second group of candidates for further consideration were candidates and training set compounds that share a common partial scaffold, shown in Figure 2.21. An assortment of functional groups at the R1 position suggests that it does not participate directly in binding. The remaining R2 position indicated a difference between the inactive candidates and the active compounds from the training data: in all of the actives, the ester is replaced by an ether that is also attached to a five-membered ring bearing a nitro group. If the nitro group is not present, then the candidate is not active, like CID 724460 in Figure 2.21b. If the ester remains an ester, then the candidate remains inactive, as was observed in CID 729994 and CID 3904065, shown in Figure 2.21c and Figure 2.21d, respectively. It should be remembered these are observations and avenues for additional study and consideration.

(a) The shared backbone. (b) CID 724460 (IC50 > 50µM). (c) CID 729994 (IC50 > 50µM). (d) CID 3904065 (IC50 > 50µM). (e) SID 22406761 (IC50 = 6.55µM). (f) SID 85268502 (IC50 = 5.11µM). (g) SID 85269512 (IC50 = 7.54µM). (h) SID 85270951 (IC50 = 7.03µM).

Figure 2.21: SENP8: The Second Scaffold. The backbone was found in three inactive candidates and four training data compounds.

2.7 PK-M2 Pt. 1

When searching for datasets that could address AIMs 1 or 2, one dataset was identified that could address both simultaneously. If models were retrained for a second round vHTS, all three AIMs could be addressed by the same dataset: PubChem Bioassay AID 2533. The dataset contained data for 202 compounds, 86 actives and 116 inactives. It has an active fraction of 0.464. Because the dataset attributes fell within the range for constants of both AIMs, active fractions between 0.4 and 0.6 for AIM 1 and dataset size between 100 and 1000 compounds for AIM 2, it is at the intersection of the two AIMs and can address both. It will also be interesting to investigate what the effect of retraining will be when the dataset has “average” attributes for AIM 3. It should also be noted that this is the first dataset for which the desired biologically active molecules are activators instead of inhibitors.

Pyruvate kinase (PK; EC 2.7.1.40) is the enzyme that transfers phosphate groups from phosphoenolpyruvate (PEP) to adenosine diphosphate (ADP) in glycolysis, resulting in pyruvate and adenosine triphosphate (ATP). Two different genes each encode two different versions of PK (four total) depending on how the gene is spliced. The PKL gene codes for PK in the liver (PK-L) and erythrocytes (PK-R) (Noguchi et al., 1987), while the PKM gene codes for PK-M1 and PK-M2 depending on whether exon 9 or 10 is spliced, respectively (Noguchi et al., 1986).

PK-M1 glycolysis occurs in aerobic conditions, yielding more energy (net production of 36 ATP per glucose molecule), and PK-M1 is present in differentiated mature muscle and brain cells (Nelson and Cox, 2012; Takenaka et al., 1991). PK-M2 has two different forms: an active, tetrameric form allosterically activated by fructose-1,6-bisphosphate (Ashizawa et al., 1991) and a nearly inactive dimeric form (Eigenbrodt et al., 1992). PK-M2 glycolysis occurs in anaerobic conditions, yielding less energy (net production of 2 ATP per glucose molecule), and PK-M2 is present primarily in dividing and embryonic/fetal tissue (Nelson and Cox, 2012; Takenaka et al., 1991; Eigenbrodt et al., 1992). Normally, PK-M2 is highly expressed in an organism initially and is replaced over time by the other three isoenzymes as cells and tissue mature (Dombrauckas et al., 2005; Ao et al., 2017).
In tumors and oncogenic conditions, however, PK-M2 expression returns to developmental levels (Elbers et al., 1991; Hacker et al., 1998) in what is known as the Warburg effect (Warburg, 1926). In the Warburg effect, pyruvate is “fermented” so that PK-M2 production of ATP is favored even in aerobic conditions (Warburg, 1956). Furthermore, hypoxic and acidic conditions in tumors promote conversion of tetrameric PK-M2 into dimeric PK-M2 (Kumar et al., 2010). The nearly inactive PK-M2 slows pyruvate processing, resulting in the accumulation of precursors and byproducts that can be redirected to other critical cell proliferation and developmental processes, including nucleic acid, protein, and lipid production (Mazurek et al., 1997, 2001; Vander Heiden et al., 2009; Li et al., 2014b). Additionally, PK-M2 has been implicated in the drug resistance of cancer cells (Yoo et al., 2004; Martinez-Balibrea et al., 2009). Replacement (Trialists et al., 2005; Wang et al., 2012) or knockout (Ao et al., 2017) of PK-M2 results in inhibited growth or apoptosis (Ao et al., 2017; Wang et al., 2012), probably because glycolysis intermediates cannot be accumulated and reallocated to critical cell proliferation and development processes (Trialists et al., 2005; Wang et al., 2012). PK-M2 is found in gastrointestinal (Schneider and Schulze, 2003; Hardt et al., 2000; Oremek et al., 1997), kidney (Brinck et al., 1994), ovarian (Chao et al., 2017), lung (Schneider et al., 2003), breast (Ibsen et al., 1982), and other cancers, and its role and presence have attracted therapeutic targeting interest. Therapy options explored include small-molecule activators of PK-M2 (Auld et al., 2010; Boxer et al., 2011; Kung et al., 2012), small-molecule inhibitors of PK-M2 (Spoden et al., 2008; Auld et al., 2010; Vander Heiden et al., 2010; Chen et al., 2011; Anastasiou et al., 2012; Yacovan et al., 2012; Guo et al., 2013; Parnell et al., 2013; Xu et al., 2014), down-regulation of PK-M2 (Pandita et al., 2014; Wong et al., 2014; Wang et al., 2012), and replacement of PK-M2 expression with PK-M1 expression (Liu et al., 2014).

Two common ways to identify compounds for these therapy options were HTS and computational analysis. HTS often identified compounds (Auld et al., 2010; Boxer et al., 2009; Kung et al., 2012; Xu et al., 2014; Guo et al., 2013; Vander Heiden et al., 2010; Matsui et al., 2017), and computational analysis probed and modeled how compounds interact with PK-M2 (Kalyaanamoorthy and Chen, 2011; Anastasiou et al., 2012; Guo et al., 2013; Kalaiarasan et al., 2015). Eventually, computational analysis was also used to guide HTS work (Yacovan et al., 2012): vHTS created a small compound library for HTS and interrogated compound-protein interactions afterward. This case study is the successor to that work.

2.7.1 PK-M2 Pt. 1: vHTS and Experimental Validation Results

AID 2533 contains 202 compounds: 86 actives that have known activity and 116 inactive compounds (Vander Heiden, 2012). The highest compound concentration tested in the protocol was 50µM (Vander Heiden, 2012), which is also the highest concentration tested in this work. After removal of PAINS and similar structures, 183 compounds remained: 85 actives and 98 inactives. Fragmentation of the 183 compounds yielded 521 atomic Signatures of heights 0, 1, and 2. The principal components constructed by PCA were examined for the atomic Signatures contributing the most to variance capture, which identified the atomic Signatures used for GA-SVM model creation and training. Details of the resulting models are reported in Table 2.24.

The classification and activity prediction models were evaluated with ROC curves, shown in Figure 2.22. The pipeline created and trained two classification models of equal performance, which was supported by the SVM-C ROC curves, whose AUC values differed by only 0.0005 (0.9809 vs. 0.9804). Since the models performed equally well, both were used for vHTS in a consensus approach. Additionally, using both models maximizes data utility and the chances of identifying active candidates. The shape of the SVM-C ROC curve also indicates a clear separation between classes in the training data. The SVM-R model was notable since this was the first time in our work that any ROC curve of an SVM-R model had an AUC of 0.9. The ROC curve also indicated a relatively sharp distinction between classes, as shown by its square shape. One explanation may be the fraction of atomic Signatures that were found in the active compounds. Of the 521 atomic Signatures found in the training data, 317 or 60.8% were found in the active compounds.

Table 2.24: PK-M2 Pt. 1: The First Round vHTS Modeling Summary (AID 2533). Cross-validation = cross-validation error.

                                 SVM-C                                 SVM-R
Training set (no PAINS)          85 active, 98 inactive (183 total)    85 actives (known AC50)
Training set atomic Signatures   10 h=0, 86 h=1, 425 h=2 (521 total)   10 h=0, 68 h=1, 239 h=2 (317 total)
PCA results                      130 of 521 atomic Signatures          133 of 317 atomic Signatures
Models created                   2                                     1
Training error                   0.02                                  0.24
Cross-validation                 0.02                                  0.29

The two classification models and one activity prediction model were used to screen the 72 million compounds of the PubChem Compound database. The results were filtered for activity validation experiment candidates based on the following criteria:

1. Overlap = 1

2. SVM-C > 2 for both SVM-C models.

3. SVM-R predicted activity < 50µM.

(a) ROC Curve For 2 SVM-C Models (AUC = 0.9809 & 0.9804). (b) ROC Curve For The SVM-R Model (AUC = 0.90). Axes: false positive rate (x), true positive rate (y).

Figure 2.22: PK-M2 Pt. 1: The First Round vHTS ROC (AID 2533). The curves indicate both kinds of models are actively contributing to the identification of candidates to focus resources on: both curves are above and away from the y=x line, indicating accurate predictions are due to model ability and not chance.

Forty-three candidates met the proposed criteria. Economic and commercial availability considerations identified twelve of the forty-three candidates to undergo experimentation. Only one of the twelve candidates was found active, for an experimental hit rate of 8.3%. Data for all twelve candidates are presented in Table 2.25.

The first round experimental hit rate of 8.3% is comparable to traditional HTS hit rates (Dobson, 2004) but lower than expectations based on previous cases. To increase hit rates and determine whether a dataset with “average” attributes affects retraining, the activity validation data were used to augment the training data, resulting in training data containing 195 compounds: 86 active and 109 inactive. Details of the retrained models are presented in Table 2.26.

Table 2.25: PK-M2 Pt. 1: The First Round Activity Validation Summary (AID 2533). Candidates selected were commercially available, economically viable, and met the criteria of predicted activity < 50µM, SVM-C scores > 2 for both SVM-C models, and overlap = 1. Candidates were tested in triplicate and mean AC50 values were reported for candidates active across the triplicate. CID = PubChem Compound ID.

CID       Predicted AC50 [µM]   Experimental AC50 [µM]
7610465   17.3                  17.7
7940834   6.60                  N/A
4930982   17.3                  N/A
854824    27.5                  N/A
727175    27.5                  N/A
804549    34.0                  N/A
808228    3.36                  N/A
883705    4.15                  N/A
295211    6.60                  N/A
675980    23.7                  N/A
883730    10.9                  N/A
1122557   8.81                  N/A

Table 2.26: PK-M2 Pt. 1: The Second Round vHTS Modeling Summary (AID 2533). Cross-validation = cross-validation error.

                                 SVM-C                                 SVM-R
Training set (no PAINS)          86 active, 109 inactive (195 total)   86 actives (known AC50)
Training set atomic Signatures   10 h=0, 86 h=1, 425 h=2 (521 total)   10 h=0, 68 h=1, 239 h=2 (317 total)
PCA results                      131 of 521 atomic Signatures          134 of 317 atomic Signatures
Models created                   49                                    1
Training error                   0.02                                  0.22
Cross-validation                 0.02                                  0.29

Instead of two classification models, retraining resulted in forty-nine. Evaluation by ROC curves indicated that although the errors are still identical, the range of AUC values widened (min = 0.978, max = 0.983). To maximize data utility and performance, all forty-nine models were used for consensus classification. As for the activity prediction model, the shape of the ROC curve again indicates a sharp division between classes in the training data. The AUC of the SVM-R ROC curve dropped slightly, from 0.90 to 0.893.

(a) ROC Curve For 49 SVM-C Models (AUC in [0.978, 0.983]). (b) ROC Curve For The SVM-R Model (AUC = 0.89). Axes: false positive rate (x), true positive rate (y).

Figure 2.23: PK-M2 Pt. 1: Second Round vHTS ROC (AID 2533). Both curves are above and away from the y=x line, indicating accurate predictions are due to model performance and not chance.

When the activity validation candidate selection criteria from the first round were used, too few candidates met the criteria before economic and commercial availability considerations were made. To increase the number of candidates considered, the overlap and SVM-C criteria were relaxed to 0.9 and 0, respectively. The new criteria were:

1. Overlap ≥ 0.9

2. SVM-C > 0 for all 49 SVM-C models

3. SVM-R predicted activity < 50µM
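The effect of relaxing the criteria under consensus classification can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: the function `passes_consensus`, the candidate labels, and every score are hypothetical, only three model scores per candidate are shown instead of forty-nine, and the overlap threshold is treated as inclusive (`>=`).

```python
def passes_consensus(overlap, svm_c_scores, predicted_ac50_um,
                     min_overlap, min_svm_c, max_ac50_um=50.0):
    """A candidate passes only if every SVM-C model scores it above the cutoff."""
    return (overlap >= min_overlap  # inclusive threshold is an assumption
            and all(score > min_svm_c for score in svm_c_scores)
            and predicted_ac50_um < max_ac50_um)


# Made-up values: (overlap, scores from each SVM-C model, predicted AC50 in µM).
toy = {
    "A": (1.00, [2.5, 2.2, 2.9], 12.0),
    "B": (0.92, [0.4, 1.1, 0.8], 3.0),
    "C": (0.95, [0.6, -0.1, 0.9], 2.0),   # one model votes inactive
    "D": (0.85, [1.5, 1.2, 1.9], 5.0),    # overlap below the relaxed cutoff
}

strict = [k for k, v in toy.items()
          if passes_consensus(*v, min_overlap=1.0, min_svm_c=2.0)]
relaxed = [k for k, v in toy.items()
           if passes_consensus(*v, min_overlap=0.9, min_svm_c=0.0)]
print(strict, relaxed)
```

Requiring agreement from all models keeps the consensus conservative even after the per-model cutoff is lowered: a single dissenting model is enough to exclude a candidate.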

Nine hundred ten candidates met the new criteria for further consideration. After economic and commercial availability considerations, eleven candidates were identified for activity validation experiments, and six were ultimately found active, yielding an experimental hit rate of 54.5%. Details for all eleven candidates are shown in Table 2.27.

Table 2.27: PK-M2 Pt. 1: The Second Round Activity Validation Summary (AID 2533). Candidates selected were commercially available, economically viable, and met the criteria of predicted activity < 50µM, SVM-C scores > 0, and overlap ≥ 0.9. Candidates were tested in triplicate and mean AC50 values were reported for candidates active across the triplicate. CID = PubChem Compound ID.

CID        Predicted AC50 [µM]   Experimental AC50 [µM]
600667     1.86                  32.7
961623     2.00                  44.2
17604395   2.48                  31.5
834562     3.41                  29.5
49926919   3.41                  43.1
39860591   1.15                  39.5
39860582   0.36                  N/A
39860573   0.45                  N/A
39860569   1.15                  N/A
39860571   1.15                  N/A
768936     2.27                  N/A

2.7.2 PK-M2 Pt. 1: Model Analysis

The trend in other case studies has been an increase in both error and hit rates after retraining, the explanation being that the models became less specific to the training data and more predictive. That was also observed here, though an additional factor was involved: the increased extrapolation used to identify more candidates for consideration. Extrapolation and predictive power have an inverse relationship: as one increases, the other decreases. Increasing extrapolation should have lowered hit rates, as was seen previously in the complement factor C1s case study, but the hit rates were not lowered. One explanation could be that the C1s case study had a higher first round hit rate and less potential for improvement through retraining, while this case study had a much lower first round hit rate and much more potential to improve through retraining. Another explanation could be that the effect of retraining was much larger than the effect of extrapolation, resulting in a net increase in hit rates for this case study.

2.7.3 PK-M2 Pt. 1: Structural Feature Analysis

Twenty-three candidates were identified, selected, and tested for biological activity. Ultimately, seven were found active in activity validation experiments. While the seven candidates can be used to dissect various aspects of PK-M2 biological activity for further study, examination of all twenty-three candidates and the 183 compounds contained in the training data revealed several patterns correlating biological activity to structure.

The candidates of the second round can be grouped around three different scaffolds. CID 600667 and CID 768936, shown in Figure 2.24 along with the first scaffold of the second round, differ only in the R group present: a methyl group or a chlorine group. Yet only the candidate containing the methyl group was active. Therefore, it is likely that either the methyl group makes the candidate active or the chlorine group makes the candidate inactive.

(a) The first scaffold. (b) CID 600667 (AC50 = 32.7µM). (c) CID 768936 (AC50 > 50µM).

Figure 2.24: PK-M2 Pt. 1: The First Scaffold (AID 2533). The difference was the methyl vs. Cl group at the R position, which determined whether the candidate was an activator or not. The scaffold was not found in the original data set.

The second scaffold of the second round was found in CID 834562, CID 17604395, and CID 961623, all shown in Figure 2.25. CID 834562 and CID 961623 are identical except that CID 961623 has a chlorine group at the R4 position instead of a hydrogen. CID 961623 also has an AC50 about 15µM higher. Therefore, it is likely the chlorine group is the cause of the AC50 shift. Furthermore, CID 17604395 is identical to CID 961623 except for the replacement of the chlorine and fluorine groups at the R1 and R2 positions, respectively, with methyl groups. CID 17604395 also has an AC50 about 13µM lower. Therefore, it is likely methyl groups are preferable to halogens at the R1 and R2 positions.

(a) The second scaffold. (b) CID 834562 (AC50 = 29.5µM).

(c) CID 17604395 (AC50 = 31.5µM). (d) CID 961623 (AC50 = 44.2µM).

Figure 2.25: PK-M2 Pt. 1: The Second Scaffold (AID 2533). Five candidates contained the scaffold. It was present in the candidates only.

The third scaffold of the second round was found in five different candidates. CID 39860591, shown in Figure 2.26b, was the only one of the five that was active. The difference between CID 39860591 and CID 39860569, shown in Figure 2.26e, is the chlorine group at the R3 position. Since one candidate was active and the other was not, it is likely the chlorine group is why CID 39860591 is active.

(a) The third scaffold. (b) CID 39860591 (AC50 = 39.5µM). (c) CID 39860582 (AC50 > 50µM). (d) CID 39860573 (AC50 > 50µM). (e) CID 39860569 (AC50 > 50µM). (f) CID 39860571 (AC50 > 50µM).

Figure 2.26: PK-M2 Pt. 1: The Third Scaffold (AID 2533). The scaffold was found in five candidates, one of which was active.

All twelve candidates in the first round were iterations of the last scaffold, shown in Figure 2.27a. CID 7610465, shown in Figure 2.27b, is the only active candidate of the round, likely because of the fluorine groups at the R4 and R5 positions. Other candidates contained similar or identical groups at the R1, R2, R3, and R6 positions, with the fluorine groups at the R4 and R5 positions as the only differences. This opens additional avenues of investigation, including whether the observed activity is specific to fluorine groups or to electronegative groups in general. The structures used to discount the groups at the R1 and R2 positions are shown in Figure 2.27, while the structures of the other seven candidates in the first round can be found in Table 2.25.

(a) Scaffold 4. (b) CID 7610465 (AC50 = 17.7µM). (c) CID 4930982 (AC50 > 50µM). (d) CID 808228 (AC50 > 50µM). (e) CID 1122557 (AC50 > 50µM). (f) CID 883705 (AC50 > 50µM).

Figure 2.27: PK-M2 Pt. 1: The Fourth Scaffold (AID 2533). The scaffold and five iterations of the scaffold used to interrogate the role of groups at the R1 and R2 positions. This scaffold was found in the candidates only.

When the selection criteria were relaxed, especially when the overlap criterion was reduced to 0.9, many more candidates could be considered. For an overview of the effects of relaxing the overlap criterion, the overlap metric was calculated and is detailed in Table 2.28. Analysis indicated that none of the second round candidates had overlap = 1, so none would have met the criteria used in the first round. Additionally, the candidates can be grouped into those with no new atomic Signatures and those with new atomic Signatures. This is because of how overlap is defined. Only atomic Signatures that were found in the training data and whose occurrence number fell within the range observed in the training data were considered a match. Therefore, relaxing the overlap criterion admitted candidates with atomic Signature occurrence numbers outside what was observed in the training data and candidates with new atomic Signatures.
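The matching rule just described can be sketched in a few lines. The sketch assumes that the overlap score is the fraction of a candidate's distinct atomic Signatures that count as a match. The matching rule itself follows the text, but that normalization, the function name `overlap_score`, and the toy Signature labels are assumptions made for illustration.

```python
def overlap_score(candidate_counts, training_ranges):
    """Fraction of a candidate's atomic Signatures that match the training data.

    candidate_counts: {signature: occurrence number in the candidate}
    training_ranges:  {signature: (min, max) occurrence number in training data}
    A Signature matches only if it was seen in training AND its occurrence
    number lies inside the observed range.
    """
    if not candidate_counts:
        raise ValueError("candidate has no atomic Signatures")
    matched = 0
    for signature, count in candidate_counts.items():
        observed = training_ranges.get(signature)
        if observed is not None and observed[0] <= count <= observed[1]:
            matched += 1
    return matched / len(candidate_counts)


# Toy Signature labels and ranges, for illustration only.
training = {"sigA": (1, 3), "sigB": (0, 2)}
inside = overlap_score({"sigA": 2, "sigB": 1}, training)          # all match
out_of_range = overlap_score({"sigA": 5, "sigB": 1}, training)    # sigA count too high
new_signature = overlap_score({"sigA": 2, "sigC": 1}, training)   # sigC never seen
print(inside, out_of_range, new_signature)
```

The last two cases correspond to the two groups of candidates distinguished above: both fall below an overlap of 1, but for different reasons.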

Table 2.28: PK-M2 Pt. 1: The Second Round Candidate Overlap Scores. Relaxing the overlap criterion allowed candidates with new atomic Signatures and atomic Signatures with occurrence numbers outside the observed range in the training data to be considered. * = active in activity validation experiments.

Candidates with atomic Signatures     Candidates with new
outside training data range           atomic Signatures

CID         Overlap                   CID         Overlap
768936      0.923                     39860582    0.915
600667*     0.923                     39860573    0.947
961623*     0.941                     39860569    0.949
17604395*   0.973                     39860571    0.923
834562*     0.939                     39860591*   0.902
                                      49926919*   0.9

It should be noted that fewer actives were found among the candidates containing new atomic Signatures than among the candidates with occurrence numbers outside the training data range. This is likely due to the amount of extrapolation involved. In candidates with occurrence numbers outside the training data range, the only extrapolation was in the occurrence numbers. In candidates with new atomic Signatures, the extrapolations include new atomic Signatures of unknown impact on activity and the occurrence numbers of those new atomic Signatures. In light of this, perhaps the candidates with occurrence numbers outside the training data range should be considered before candidates containing new atomic Signatures to limit or control the degree of extrapolation.

Candidates containing new atomic Signatures were the ones containing scaffold 3 and CID 49926919, shown in Figure 2.28. The new atomic Signatures were height 2 atomic Signatures because the overlap criterion was still high. New higher-height (i.e., height 2) atomic Signatures would affect only the higher-height atomic Signatures and result in a higher overlap score. In contrast, new lower-height atomic Signatures would propagate changes to the higher-height atomic Signatures because the higher-height atomic Signatures are based on lower-height ones. Thus, new atomic Signatures at the height 0 or height 1 level would require significant relaxation of the overlap criterion.
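The propagation argument above can be made concrete with a simplified stand-in for atomic Signatures. The sketch below is not the Signature descriptor itself: it ignores hydrogens, bond orders, and canonical ordering, and simply writes out the heavy-atom neighborhood of each atom to a given height. The function names and the toy molecules are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Molecule = Tuple[Dict[int, List[int]], Dict[int, str]]  # (adjacency, atom labels)


def environment(graph: Dict[int, List[int]], labels: Dict[int, str],
                atom: int, height: int, parent: Optional[int] = None) -> str:
    """Write the neighborhood of `atom` out to `height` bonds as a string."""
    if height == 0:
        return labels[atom]
    branches = sorted(
        environment(graph, labels, nbr, height - 1, parent=atom)
        for nbr in graph[atom] if nbr != parent
    )
    if not branches:
        return labels[atom]
    return labels[atom] + "(" + " ".join(branches) + ")"


def signatures(mol: Molecule, max_height: int = 2) -> Set[Tuple[int, str]]:
    """All (height, environment) pairs for every atom of a molecule."""
    graph, labels = mol
    return {(h, environment(graph, labels, a, h))
            for h in range(max_height + 1) for a in graph}


def new_by_height(candidate: Molecule, training: List[Molecule]) -> Dict[int, int]:
    """Count the candidate's environments, per height, absent from training."""
    seen: Set[Tuple[int, str]] = set()
    for mol in training:
        seen |= signatures(mol)
    counts: Dict[int, int] = {}
    for height, _ in signatures(candidate) - seen:
        counts[height] = counts.get(height, 0) + 1
    return counts


chain3 = {0: [1], 1: [0, 2], 2: [1]}
chain4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
training = [(chain3, {0: "C", 1: "C", 2: "O"}),   # C-C-O
            (chain3, {0: "C", 1: "C", 2: "N"})]   # C-C-N

# Same atoms and same nearest neighbors as training, new arrangement.
rearranged = (chain4, {0: "N", 1: "C", 2: "C", 3: "O"})   # N-C-C-O
# An atom type never seen in training.
new_atom = (chain3, {0: "C", 1: "C", 2: "S"})             # C-C-S

print(new_by_height(rearranged, training))
print(new_by_height(new_atom, training))
```

In this toy case the rearranged candidate introduces new environments at height 2 only, whereas the candidate with an unseen atom type introduces new environments at heights 0, 1, and 2, mirroring the reasoning that a new lower-height feature propagates upward.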

(a) Scaffold 4. (b) CID 49926919.

Figure 2.28: PK-M2 Pt. 1: The New Atomic Signatures (AID 2533). Candidates containing scaffold 4 and CID 49926919 contain atomic Signatures not found in the training set due to relaxing the overlap criterion. The atoms indicated by an arrow are the ones for which there are new atomic Signatures.

2.8 PK-M2 Pt. 2

Another dataset focused on PK-M2 that could also address AIM 2 is PubChem Bioassay AID 1540. This PubChem Bioassay contains data for 131 compounds: 123 actives and 8 inactives. AID 1540 also has a very large active fraction of 0.882. These dataset attributes allow AID 1540 to address AIM 2 at the 0.8 to 1 range. Retraining the models and conducting a second round of vHTS also addresses AIM 3 in a very unusual circumstance where there is plentiful data regarding the class of interest (i.e., the active class) and a dearth of data for the other class. It will be interesting to see whether the lack of data on inactives translates to many false positives and an inability to increase hit rates even with the addition of the first round activity validation data. As the same protein is the subject of the dataset, the background information is omitted and the results are presented immediately.

2.8.1 PK-M2 Pt. 2: vHTS and Experimental Validation Results

AID 1540 contains 131 compounds: 123 actives of known activity and 8 inactives (Vander Heiden, 2010). PAINS (Baell and Holloway, 2010) and similar structures were removed, resulting in a dataset containing 130 compounds: 122 actives and 8 inactives. Fragmentation by Signature resulted in 418 atomic Signatures of heights 0, 1, and 2. PCA-constructed principal components were analyzed to identify the atomic Signatures significant in capturing variance, which GA-SVM then used to create and train models; the results are detailed in Table 2.29. ROC curves, shown in Figure 2.29, were created to evaluate the models before vHTS. The pipeline created and trained eighteen classification models with identical errors but different ROC curves, with AUC ranging from 0.888 to 0.946. The curves indicate there are a few problematic compounds that none of the models could classify correctly without also misclassifying other compounds. Aside from those, there was a common division between active and inactive compounds. The ROC curves of both the SVM-C and SVM-R models were away from the y=x line; thus, correct identification of actives is due to model ability and not random chance.
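As a reminder of what the AUC values measure, the sketch below computes AUC as the probability that a randomly chosen active receives a higher model score than a randomly chosen inactive, counting ties as one half. This is the standard rank-based formulation and not necessarily the routine used in this work; the function name and the toy scores are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    labels: 1 for active, 0 for inactive.
    scores: model output, where larger means more likely active.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("AUC needs at least one compound of each class")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0      # pair ranked correctly
            elif a == i:
                wins += 0.5      # tie counts as half
    return wins / (len(actives) * len(inactives))


# Toy scores: one active (0.4) is ranked below one inactive (0.5).
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
# A model that gives every compound the same score sits on the y=x line.
chance_auc = roc_auc([1, 1, 0, 0], [0.3, 0.3, 0.3, 0.3])
```

In the toy case three of the four active-inactive pairs are ordered correctly, giving an AUC of 0.75, while the constant-score model gives 0.5.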

Table 2.29: PK-M2 Pt. 2: The First Round vHTS Modeling Summary (AID 1540). Cross-validation = cross validation error.

                          SVM-C                           SVM-R

Training Set              122 active, 8 inactive          85 actives
(NO PAINS)                130 total                       known AC50

Training Set              12 h=0, 86 h=1                  12 h=0, 82 h=1
atomic Signatures         320 h=2; 418 total              292 h=2; 386 total

PCA results               110 of 418 atomic Signatures    114 of 386 atomic Signatures

Models Created            18                              1

Training Error            0.015                           0.46

Cross-Validation          0.015                           0.51

[Figure: two ROC plots, false positive rate (x-axis) versus true positive rate (y-axis).]

(a) ROC Curve For 18 SVM-C Models (AUC = [0.888, 0.946]). (b) ROC Curve For The SVM-R Model (AUC = 0.624).

Figure 2.29: PK-M2 Pt. 2: The First Round vHTS ROC (AID 1540). Curves above the y=x line indicate accurate predictions are due to model ability and not chance.

The eighteen classification models were used in a consensus approach to maximize data utility. The eighteen classification models and one activity prediction model were used in a vHTS. The screening results were filtered for candidates with the following criteria:

1. Overlap = 1

2. SVM-C > 2 for both SVM-C models

3. SVM-R predicted activity < 50µM.

Sixteen candidates met the specified criteria and seven met economic and commercial availability considerations. The seven candidates were tested in activity validation experiments, though no candidates were active. Details on the candidates are reported in Table 2.30.
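The screening filter described by the three criteria can be sketched as follows. The sketch assumes that the consensus requirement means every SVM-C model must score a candidate above the threshold, and the record layout, function names, and example values are all hypothetical; the CIDs used are placeholders and not compounds from this study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    cid: int                   # placeholder identifier
    overlap: float             # overlap with the training data, 0 to 1
    svmc_scores: List[float]   # one decision value per SVM-C model
    predicted_ac50_um: float   # SVM-R predicted AC50 in micromolar


def passes(c: Candidate, min_overlap: float, min_svmc: float,
           max_ac50_um: float = 50.0) -> bool:
    """Apply the overlap, consensus SVM-C, and predicted-activity criteria."""
    return (c.overlap >= min_overlap
            and all(score > min_svmc for score in c.svmc_scores)
            and c.predicted_ac50_um < max_ac50_um)


def screen(candidates: List[Candidate], min_overlap: float,
           min_svmc: float) -> List[int]:
    """Return the identifiers of the candidates that meet all criteria."""
    return [c.cid for c in candidates if passes(c, min_overlap, min_svmc)]


# Invented screening results for illustration only.
results = [
    Candidate(cid=1, overlap=1.0, svmc_scores=[2.4, 2.1, 3.0], predicted_ac50_um=0.2),
    Candidate(cid=2, overlap=1.0, svmc_scores=[2.4, 1.9, 3.0], predicted_ac50_um=0.2),
    Candidate(cid=3, overlap=0.95, svmc_scores=[2.4, 2.1, 3.0], predicted_ac50_um=0.2),
    Candidate(cid=4, overlap=1.0, svmc_scores=[2.4, 2.1, 3.0], predicted_ac50_um=75.0),
]

# Strict settings (overlap = 1, SVM-C > 2) versus relaxed settings.
strict = screen(results, min_overlap=1.0, min_svmc=2.0)
relaxed = screen(results, min_overlap=0.9, min_svmc=0.0)
```

Keeping the thresholds as parameters makes it straightforward to relax the overlap and SVM-C cutoffs when too few candidates survive the strict settings.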

Table 2.30: PK-M2 Pt. 2: The First Round Activity Validation Summary (AID 1540). Candidates selected were commercially available, economically viable, and met the criteria of predicted activity < 50µM, SVM-C scores > 2 for both SVM-C models, and overlap = 1. Candidates were tested in triplicate and the mean AC50 value is reported for candidates active across the triplicate. CID = PubChem Compound ID.

CID        Predicted AC50 [µM]    Experimental AC50 [µM]

2283520    0.162                  N/A
2284339    0.162                  N/A
2283109    0.162                  N/A
2284622    0.066                  N/A
2899500    0.162                  N/A
2283890    0.145                  N/A
2252038    0.059                  N/A

This is the first null hit rate of all work thus far. While it is unclear how the null hit rate should be interpreted, past experience with the pipeline suggests that retraining the models with new activity validation data can provide the critical information needed to make correct predictions in the second round. To see whether this remains true for this case study, and to see what effect retraining has on prediction accuracy, the models were retrained with the training data augmented with data from the activity validation experiments. The results of retraining are shown in Table 2.31.

Table 2.31: PK-M2 Pt. 2: The Second Round vHTS Modeling Summary (AID 1540). Cross-validation = cross validation error.

                          SVM-C                           SVM-R

Training Set              112 active, 15 inactive         112 actives
(NO PAINS)                127 total                       known AC50

Training Set              12 h=0, 86 h=1                  12 h=0, 82 h=1
atomic Signatures         320 h=2; 418 total              292 h=2; 386 total

PCA results               113 of 521 atomic Signatures    114 of 317 atomic Signatures

Models Created            46                              1

Training Error            0.007                           0.457

Cross-Validation          0.007                           0.505

ROC curves were used to evaluate the models before vHTS. Retraining resulted in 46 classification models. Assessment by the ROC curves indicates the 46 models should have good predictive power: the shape of the curves indicates the models are better able to distinguish between the active and inactive classes, and the AUC, between 0.967 and 0.978, indicates correct predictions of active candidates are due to model performance, not chance. The ROC curve for the activity prediction model also showed improvement, with a slightly better curve shape and an AUC value of 0.664.

[Figure: two ROC plots, false positive rate (x-axis) versus true positive rate (y-axis).]

(a) ROC Curve For 46 SVM-C Models (AUC between [0.967, 0.978]). (b) ROC Curve For The SVM-R Model (AUC = 0.664).

Figure 2.30: PK-M2 Pt. 2: The Second Round vHTS ROC. Curves are above and away from the y=x line, indicating accurate predictions are due to the models' performance and not chance.

The forty-six classification models were used in a consensus approach to maximize data utility. After vHTS, the selection criteria to identify candidates for experimental testing resulted in too few candidates (39) even before economic and commercial availability consideration. To identify more candidates for consideration, the criteria were relaxed:

1. Overlap > 0.9

2. SVM-C > 0 for all 46 SVM-C models

3. SVM-R predicted activity < 50µM

Eight hundred fifty-two candidates met the new criteria. Economic and commercial availability considerations identified nine for activity validation experiments. However, none of the nine candidates were active in this round either. Details are reported in Table 2.32.

Table 2.32: PK-M2 Pt. 2: The Second Round Activity Validation Summary (AID 1540). The candidates selected were commercially available, economically viable, and met the criteria of predicted activity < 50µM, SVM-C scores > 0, and overlap > 0.9. Candidates were tested in triplicate and the mean AC50 value is reported for candidates active across the triplicate. CID = PubChem Compound ID.

CID        Predicted AC50 [µM]    Experimental AC50 [µM]

74469      0.020                  N/A
67116      0.016                  N/A
95750      0.020                  N/A
7139       0.020                  N/A
77756      0.010                  N/A
81993      0.013                  N/A
1714447    0.032                  N/A
3014343    0.024                  N/A
300018     0.008                  N/A

2.8.2 PK-M2 Pt. 2: Model Evaluation

This was the first time none of the candidates selected, in either round, were active in activity validation experiments. However, the training data were unusual and may affect vHTS in two different ways. First, there may not be enough information to correctly identify inactive candidates, as there is a dearth of inactive compounds with which to train the classification models. Second, supervised learning methods are inclined to assign candidates to the majority class: the training data suggest that the majority class is more common, so assigning an uncertain case to it is more likely to be correct. The sum effect is increased misclassification of inactive candidates and reduced hit rates, resulting in the situation observed in the case study.

The increase in false positives also makes it difficult to isolate the effect of retraining to address AIM 3. It is unclear what effect retraining had beyond the shift in focus away from the scaffold found in all candidates of the first round, shown in Table 2.30. Regardless, this is the first time the pipeline was applied to a dataset with such a large active fraction, and it provides information as to what pipeline performance can be expected and what effect retraining has in similar contexts.
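The pull toward the majority class can be made tangible with a small worked calculation. The sketch below computes the training error of a trivial rule that assigns every compound to the majority class, using the SVM-C training set sizes reported in Tables 2.29 and 2.31. The function name is hypothetical, and the trivial rule is only a reference point, not one of the models built in this work.

```python
from typing import Tuple


def majority_baseline(n_active: int, n_inactive: int) -> Tuple[str, float]:
    """Error of a rule that labels every compound as the majority class."""
    total = n_active + n_inactive
    label = "active" if n_active >= n_inactive else "inactive"
    # Every compound of the minority class is misclassified.
    error = min(n_active, n_inactive) / total
    return label, error


first_round = majority_baseline(122, 8)     # Table 2.29 training set
second_round = majority_baseline(112, 15)   # Table 2.31 training set

print(first_round[0], round(first_round[1], 3))
print(second_round[0], round(second_round[1], 3))
```

With so few inactives, labeling everything active already yields a training error of roughly 0.06 in the first round while misclassifying every inactive, so a low overall error on such data says little about how well inactives are recognized.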

2.8.3 PK-M2 Pt. 2: Structural Feature Analysis

In this instance, there was no structure-activity analysis to conduct. While the candidates were iterations of several different scaffolds, no different classes were found in the training data. Without a difference in activity, no conjecture can be made regarding why one compound is active and another is not.

The only thing to note is the appearance of new atomic Signatures in the second round candidates when the overlap metric was relaxed. Found in two scaffolds and displayed in Figure 2.31, this is further evidence of the effect and role overlap has in controlling which candidates, and subsequently which structural features, are considered.

(a) First Scaffold With New Atomic Signature. (b) Second Scaffold With New Atomic Signature.

Figure 2.31: PK-M2 Pt. 2: The New Atomic Signatures. Found in two scaffolds, the new atomic Signatures are indicated.

CHAPTER III CONCLUSION: EVALUATION OF PCA-GA-SVM AND RECOMMENDATIONS REGARDING FUTURE USE

After the examination of seven case studies, more is known about the effectiveness of the PCA-GA-SVM pipeline. While only a limited number of candidates were tested for each case study, repeated success is due either to fortuitous selection of activity validation candidates or to the pipeline functioning as intended and successfully identifying candidates for consideration based on a small number of instances in the datasets. Additionally, while the amount of data available is limited, trends could be seen. The hit rates for all seven case studies are detailed in Table 3.1.

Table 3.1: Summary Of Case Study Hit Rates. Active = active fraction, 1st hit % = first round hit rate, 2nd hit % = second round hit rate.

AIM 1: Varied Size
(Active Fraction Between 0.4 and 0.6)

Size        AID       1st Hit %    2nd Hit %

10-100      728       42.9         100
10-100      825       18.8         75.0
100-1000    2533      8.3          54.5

AIM 2: Varied Distribution
(Size Between 100 and 1000 Compounds)

Active      AID       1st Hit %    2nd Hit %

0-0.2       787       57.0         50.0
0.2-0.4     846       27.3         63.6
0.4-0.6     2533      8.3          54.5
0.6-0.8     624322    40.0         20.0
0.8-1       1540      0            0

For AIM 1, there does seem to be a trend where the pipeline performs better on smaller datasets than on larger ones. One possible explanation for this is how the datasets were created. Most of these datasets are confirmatory screens, retesting the hits from a primary screen to verify activity. The smaller datasets may be less diverse, making it easier to identify the key predictive structural features that should be included in the models.

For AIM 2, a trend suggests the pipeline performs better in cases of less data on active compounds. A possible explanation is the bias found in supervised learning methods. In situations where there is more data for one of the classes, supervised learning methods are inclined to assign data to that class (Barandela et al., 2003). In cases of less data on active compounds, the inclination is to classify data as inactive, resulting in more false negatives. Conversely, when there are more data on active compounds, the inclination is to classify data as active, resulting in more false positives. Since the pipeline is trying to find actives and only predicted actives are tested, false positives are much more of a concern: many of the inactives predicted as actives are tested, resulting in poor hit rates.

For AIM 3, it can be said confidently that there is an overall benefit to retraining. The perception that certain atomic Signatures were beneficial to predicting activity in atomic Signature combinations changes with the addition of new experimental data, resulting in the selection of new atomic Signature combinations in the second round vHTS models. This resulted in hit rates increasing to the 50% level or higher. There are outliers where the hit rate did not increase, but extenuating circumstances could explain them. In general, retraining improves on the initial hit rate.

From the trends observed, some recommendations regarding pipeline utilization can be made.
The first recommendation of pipeline application is to retrain models whenever possible. The extra experimental data helps to refine the models to make them more predictive, often resulting in the second round hit rate doubling the first round hit rate.

The second recommendation is to avoid high active fraction datasets. The trend in AIM 2 is the result of the bias, in supervised learning methods like SVM, towards the class with more information in cases where an information inequality exists (Barandela et al., 2003). This results in increased false negatives in cases of a low active fraction and increased false positives in cases of a large active fraction. While it would be more efficient if the false negatives were also identified as actives in low active fraction datasets, the result is still the identification of active candidates to test with a net increase in hit rate, albeit with fewer active candidates identified. As for false positives, the misclassification of inactive candidates negatively impacts hit rates since the inactive candidates dilute the pool of active candidates and more tests are necessary to identify the active candidates. Therefore, the pipeline is usable and beneficial in small active fraction situations but is less effective in large active fraction situations.

The third recommendation is to test candidates in increasing degree of extrapolation. One additional outcome of this work is to identify new scaffolds for further study. This is more easily realized through extrapolation, though efficiency suffers. As seen in the PK-M2 case study with AID 2533, overlap controls extrapolation in two different ways: expanding the occurrence range of atomic Signatures found in the dataset and including new atomic Signatures, which also have new occurrence ranges. Expansion of atomic Signature occurrence ranges is one degree of extrapolation, while the addition of new atomic Signatures is two degrees of extrapolation: the new atomic Signatures and the associated occurrence ranges.
In Table 2.28, four out of five candidates with expanded occurrence ranges were active whereas only two of six candidates with new atomic Signatures were active. Therefore, testing the candidates with expanded occurrence ranges before candidates with new atomic Signatures may be more efficient.
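The comparison drawn from Table 2.28 can be reproduced with a short calculation. The CIDs and activity outcomes below are taken from Table 2.28; the dictionary layout and the function name are illustrative choices only.

```python
from typing import Dict


def hit_rate(outcomes: Dict[int, bool]) -> float:
    """Fraction of tested candidates that were active."""
    return sum(outcomes.values()) / len(outcomes)


# CID -> active in activity validation experiments (Table 2.28).
expanded_ranges = {
    768936: False,
    600667: True,
    961623: True,
    17604395: True,
    834562: True,
}
new_signatures = {
    39860582: False,
    39860573: False,
    39860569: False,
    39860571: False,
    39860591: True,
    49926919: True,
}

print(f"expanded occurrence ranges: {hit_rate(expanded_ranges):.1%}")
print(f"new atomic Signatures:      {hit_rate(new_signatures):.1%}")
```

The two groups give hit rates of 80.0% and 33.3%, respectively, although with five and six candidates the difference should be read as suggestive rather than conclusive.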

CHAPTER IV INTEGRATION: BIOLOGY, CHEMISTRY, CHEMICAL ENGINEERING, AND COMPUTER SCIENCE

In the Integrated Biosciences doctoral program at The University of Akron, people from very different backgrounds and specialties frequently interact, socialize, and collaborate in research. The importance of integration was made apparent by the diversity of invited speakers who give seminars and lectures about their work. Invited individuals range from those with a purely biological background to people who started in art or mathematics but have collaborated with or are directly involved in biological research. The invited speakers are role models of how researchers from different specialties and backgrounds can collaborate to attack different problems.

The “Problem Solving in Integrated Biosciences” course exemplified the importance of integration. People from different backgrounds within our cohort applied their expertise to answer a scientific question. The results of the experimental inquiry were presented upon completion. Through this course, the students collaborate and integrate knowledge of different fields into their work early and often.

The capstone of the integration is reflected in each student’s Ph.D. work. In the work presented here, knowledge from several different fields was integrated. First, knowledge of chemical engineering and computer science was utilized to create the pipeline. Second, knowledge of biology was necessary to understand the relationship between a biological target and observed physiology. Finally, knowledge of chemistry was used to infer how identified therapies interact with the biological target.

Chemical engineering is the mathematical description of chemical systems. Chemical engineering has a rich history that was drawn upon to create the vHTS models. Similar models are frequently created to describe the physical and thermodynamic properties of different materials, and it was theorized these models could be repurposed for a medicinal and biological context.

While the vHTS models were created based on chemical engineering knowledge, model training and optimization were based on computer science. Computers can perform calculations quickly and repeatedly. Algorithms can codify the calculations to achieve specific goals such as the creation of many different models or the changing of parameters to create an optimal model. Combining chemical engineering and computer science knowledge allows the creation and optimization of vHTS models.

The ultimate purpose of the pipeline and the vHTS models is to find new medicinal active ingredients. By their nature, medicinal therapies try to effect a physiological change via biochemical means. Knowledge of biology provides the understanding of how different systems are interrelated and how change can be mediated directly or indirectly by targeting a biological receptor. This knowledge is essential to understanding which entities are targeted and why it is worthwhile to target them.

Finally, after a biological target and active compounds are found, chemistry knowledge provides a framework to interrogate the interaction. Chemistry offers context to understand how different compounds or molecules interact with a targeted biological receptor. Knowledge of functional group properties and interactions can be used to develop conjectures about why compounds are active or inactive and how these compounds can be optimized. Similarly, this knowledge can be used to identify the next compounds to test or which compounds to avoid testing.
The benefits of integration were evident early on when invited speakers presented their work and when students were working together on their project for the “Problem Solving in Integrated Biosciences” course. After all, different perspectives and backgrounds inform unique approaches and methods of attacking any given problem. Since the purpose of science is solving problems and seeking truths of the world, the utility of integrating different kinds of information cannot be overstated.

In this work, the interdisciplinary combination of biology, chemistry, chemical engineering, and computer science was necessary to create and understand the context of pipeline development and success. Furthermore, the interdisciplinary approach identified potential pitfalls that would not be readily apparent if the project were based on only one discipline. Finally, the interdisciplinary approach identified further opportunities for development, as knowing more about the environment and system in which the pipeline is applied revealed different avenues of use with slight modifications. Therefore, it was invaluable that knowledge from all four disciplines was incorporated and utilized.

It is immediately apparent that integration is not as simple as a synergistic or additive effect. The different deficiencies in each knowledge base were supplemented, and the strengths were enhanced. Beyond that, there is a multiplicative effect of integrating different streams of information: techniques that are akin to blunt instruments when applied without guidance become precise when the appropriate perspective is provided. For example, the experimental validation here is similar to HTS experiments done to identify hits. Random selection of compounds to test yields a nominal hit rate of a few percent (Dobson, 2004). However, the inclusion of chemical engineering models and computer science algorithms increased the hit rate many times over. Retraining the models magnified this benefit even further. From this, it is apparent how important integration is.
Knowledge from different sources supplements deficiencies, augments strengths, and creates new opportunities.

The multiplicative benefit of integration becomes even more critical in the current scientific environment. As costs continue to rise while resources remain finite, people in science need to use everything available to increase the utility of spent resources.

While this viewpoint sounds excellent on paper, there are many obstacles to integration that need to be resolved, primarily the tribalism in science. Working with individuals from different backgrounds is initially like working with people from different cultures and nationalities. Individuals from the various disciplines have become accustomed to discipline-specific terminology or definitions of words, customs, perspectives, and ideas. Furthermore, specialization often turns into isolation and resistance to integration. Awareness of these barriers to integration has to increase so that they can be addressed early and often.

The “Problem Solving in Integrated Biosciences” course was a microcosm of the integration difficulties at large. The group members, from different disciplines, first had to develop a shared understanding of language. There were some difficulties in establishing a common language. Knowledge of colloquialisms, and of concepts that mean different things in other contexts, was taken for granted. Therefore, the shared understanding of language was repeatedly revised to account for the updated understanding of words and concepts.

When ideas, cultures, and perspectives from the different disciplines were exchanged in formulating the project for the course, there were often differences in opinion. While there should be a healthy amount of disagreement in science, there is a difference between genuine skepticism and obstinacy or ego. As experts and advocates for ideas, it is easy to reject ideas because they are new and foreign without honestly considering their merits. In time, we learned to forestall the initial impulse to reject an idea and to honestly consider its merits and how it could be adapted to enhance other visions.

Finally, there was specialization that can transform into isolation and resistance to integration. Ph.D. students quickly become experts by focusing their energies on their respective fields. Specialization is essential for development, but it can transform into indifference toward unrelated subject material and prevent integration. A certain level of disdain can be observed from those attending presentations from invited speakers on subject matter that was not immediately relevant. This disdain extends to conferences, where students may attend a talk but leave immediately upon realizing the presentation is different from expectations, sometimes while the presentation is still ongoing. Disdain is also exhibited when students forgo attending different presentation sessions, the implicit purpose of going to conferences, in favor of pleasure and relaxation.

Conferences are a collection of ideas and experts that can be used to improve ideas and give them different perspectives. As such, they should be valued as chances to improve and realize ideas that are hard to encounter otherwise. For proof, look no further than the invited speakers and professors at The University of Akron. When discussing their success, a familiar refrain is the introduction of a new or different idea by a colleague or peer at a presentation or conference. Integration of that idea or concept resulted in a new line of inquiry and a new project, which is why people are listening to them present their work.

In summation, integration is a valuable resource for providing much-needed insight and perspective to quickly and efficiently advance research. There is a multiplicative benefit to integrating any sources of knowledge and ideas, and integration should occur whenever possible. While integration provides many benefits, it is not easily done in practice due to barriers that exist culturally and mentally. Proper attention to these barriers will help resolve them and make integration easier and more widespread.

BIBLIOGRAPHY

Abida, W. M., Nikolaev, A., Zhao, W., Zhang, W., and Gu, W. (2007). Fbxo11 promotes the neddylation of p53 and inhibits its transcriptional activity. Journal Of Biological Chemistry, 282(3):1797–1804.

Acharya, C., Coop, A., Polli, J. E., and MacKerell, A. D. (2011). Recent advances in ligand-based drug design: relevance and utility of the conformationally sampled pharmacophore approach. Current Computer-aided Drug Design, 7(1):10–22.

Adams-Cioaba, M. A., Krupa, J. C., Xu, C., Mort, J. S., and Min, J. (2011). Structural basis for the recognition and cleavage of histone h3 by cathepsin l. Nature Communications, 2:197.

Agarwal, A. K. and Fishwick, C. W. (2010). Structure-based design of anti-infectives. Annals Of The New York Academy Of Sciences, 1213(1):20–45.

Agatonovic-Kustrin, S. and Beresford, R. (2000). Basic concepts of artificial neural network (ann) modeling and its application in pharmaceutical research. Journal Of Pharmaceutical And Biomedical Analysis, 22(5):717–727.

Al-Horani, R. A., Ponnusamy, P., Mehta, A. Y., Gailani, D., and Desai, U. R. (2013). Sulfated pentagalloylglucoside is a potent, allosteric, and selective inhibitor of factor xia. Journal Of Medicinal Chemistry, 56(3):867–878.

Alexander, J. J., Anderson, A. J., Barnum, S. R., Stevens, B., and Tenner, A. J. (2008). The complement cascade: Yin–yang in neuroinflammation–neuro-protection and -degeneration. Journal Of Neurochemistry, 107(5):1169–1187.

Alvarsson, J., Eklund, M., Andersson, C., Carlsson, L., Spjuth, O., and Wikberg, J. E. S. (2014a). Benchmarking study of parameter variation when using signature fingerprints together with support vector machines. Journal Of Chemical Information And Modeling, 54(11):3211–3217.

Alvarsson, J., Eklund, M., Engkvist, O., Spjuth, O., Carlsson, L., Wikberg, J. E. S., and Noeske, T. (2014b). Ligand-based target prediction with signature fingerprints. Journal Of Chemical Information And Modeling, 54(10):2647–2653.

Anastasiou, D., Yu, Y., Israelsen, W. J., Jiang, J.-K., Boxer, M. B., Hong, B. S., Tempel, W., Dimov, S., Shen, M., Jha, A., et al. (2012). Pyruvate kinase m2 activators promote tetramer formation and suppress tumorigenesis. Nature Chemical Biology, 8(10):839.

Anderson, D. H., Radeke, M. J., Gallo, N. B., Chapin, E. A., Johnson, P. T., Curletti, C. R., Hancox, L. S., Hu, J., Ebright, J. N., Malek, G., et al. (2010). The pivotal role of the complement system in aging and age-related macular degeneration: hypothesis re-visited. Progress In Retinal And Eye Research, 29(2):95–112.

Ao, R., Guan, L., Wang, Y., and Wang, J.-N. (2017). Effects of pkm2 gene silencing on the proliferation and apoptosis of colorectal cancer ls-147t and sw620 cells. Cellular Physiology And Biochemistry, 42(5):1769–1778.

Aoki, I., Higuchi, M., and Gotoh, Y. (2013). Neddylation controls the target specificity of e2f1 and apoptosis induction. Oncogene, 32(34):3954–3964.

Asakai, R., Davie, E. W., and Chung, D. W. (1987). Organization of the gene for human factor xi. Biochemistry, 26(23):7221–7228.

Ashizawa, K., Willingham, M., Liang, C., and Cheng, S.-y. (1991). In vivo regulation of monomer-tetramer conversion of pyruvate kinase subtype m2 by glucose is mediated via fructose 1,6-bisphosphate. Journal Of Biological Chemistry, 266(25):16842–16846.

Auld, D., Shen, M., Skoumbourdis, A. P., Jiang, J.-k., Boxer, M., Southall, N., Inglese, J., and Thomas, C. (2010). Identification of activators for the m2 isoform of human pyruvate kinase.

Babaoglu, K., Simeonov, A., Irwin, J. J., Nelson, M. E., Feng, B., Thomas, C. J., Cancian, L., Costi, M. P., Maltby, D. A., Jadhav, A., et al. (2008). Comprehensive mechanistic analysis of hits from high-throughput and docking screens against β-lactamase. Journal Of Medicinal Chemistry, 51(8):2502.

Baell, J. B. and Holloway, G. A. (2010). New substructure filters for removal of pan assay interference compounds (pains) from screening libraries and for their exclusion in bioassays. Journal Of Medicinal Chemistry, 53(7):2719–2740.

Balunas, M. J. and Kinghorn, A. D. (2005). Drug discovery from medicinal plants. Life Sciences, 78(5):431–441.

Barandela, R., Sánchez, J., García, V., and Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3):849–851.

Barrett, A., Rawlings, N., and Woessner, J. (1998). Handbook of Proteolytic Enzymes. Academic Press.

Bender, A., Mussa, H. Y., Glen, R. C., and Reiling, S. (2004). Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): Evaluation of performance. J. Chem. Inf. Comput. Sci., 44(5):1708–1718.

Bohacek, R. S., McMartin, C., and Guida, W. C. (1996). The art and practice of structure-based drug design: A molecular modeling perspective. Med Res Rev, 16(1):3–50.

Bohley, P., Kirschke, H., Schaper, S., and Wiederanders, B. (1984). Principles of the regulation of intracellular proteolysis. In Symp. Biologica Hungarica, volume 25, pages 101–115.

Bokisch, V. A., Müller-Eberhard, H. J., and Cochrane, C. G. (1969). Isolation of a fragment (c3a) of the third component of human complement containing anaphylatoxin and chemotactic activity and description of an anaphylatoxin inactivator of human serum. Journal Of Experimental Medicine, 129(5):1109–1130.

Bolton-Maggs, P. H. (2000). Factor xi deficiency and its management. Haemophilia, 6:100–109.

Bouma, B. and Griffin, J. H. (1977). Human blood coagulation factor xi. J Biol Chem, 252(18):6432–6437.

Boxer, M. B., Jiang, J.-k., Vander Heiden, M. G., Shen, M., Skoumbourdis, A. P., Southall, N., Veith, H., Leister, W., Austin, C. P., Park, H. W., et al. (2009). Evaluation of substituted n, n-diarylsulfonamides as activators of the tumor cell specific m2 isoform of pyruvate kinase. Journal Of Medicinal Chemistry, 53(3):1048–1055.

Boxer, M. B., Jiang, J.-k., Vander Heiden, M. G., Shen, M., Veith, H., Cantley, L. C., and Thomas, C. J. (2011). Identification of activators for the m2 isoform of human pyruvate kinase version 3.

Brinck, U., Fischer, G., Eigenbrodt, E., Oehmke, M., and Mazurek, S. (1994). L- and m2-pyruvate kinase expression in renal cell carcinomas and their metastases. Virchows Archiv, 424(2):177–185.

Broemer, M., Tenev, T., Rigbolt, K. T., Hempel, S., Blagoev, B., Silke, J., Ditzel, M., and Meier, P. (2010). Systematic in vivo rnai analysis identifies iaps as nedd8-e3 ligases. Molecular Cell, 40(5):810–822.

Buerke, M., Schwertz, H., Seitz, W., Meyer, J., and Darius, H. (2001). Novel small molecule inhibitor of c1s exerts cardioprotective effects in ischemia-reperfusion injury in rabbits. The Journal Of Immunology, 167(9):5375–5380.

Bultinck, P., De Winter, H., Langenaeker, W., and Tollenare, J. P. (2003). Computational Medicinal Chemistry For Drug Discovery. CRC Press.

Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data Mining And Knowledge Discovery, 2(2):121–167.

Chan, J., Burrowes, C., and Movat, H. (1978). Surface activation of factor xii (hageman factor) - critical role of high molecular weight kininogen and another potentiator. Agents And Actions, 8(1-2):65–72.

Chan, Y., Yoon, J., Wu, J.-T., Kim, H.-J., Pan, K.-T., Yim, J., and Chien, C.-T. (2008). Den1 deneddylates non-cullin proteins in vivo. Journal Of Cell Science, 121(19):3218–3223.

Chao, T.-K., Huang, T.-S., Liao, Y.-P., Huang, R.-L., Su, P.-H., Shen, H.-Y., Lai, H.-C., and Wang, Y.-C. (2017). Pyruvate kinase m2 is a poor prognostic marker of and a therapeutic target in ovarian cancer. Plos One, 12(7):e0182166.

Chauhan, S. S., Goldstein, L. J., and Gottesman, M. M. (1991). Expression of cathepsin l in human tumors. Cancer Research, 51(5):1478–1481.

Chauhan, S. S., Popescu, N., Ray, D., Fleischmann, R., Gottesman, M., and Troen, B. (1993). Cloning, genomic organization, and chromosomal localization of human cathepsin l. Journal Of Biological Chemistry, 268(2):1039–1045.

Chemmangattuvalappil, N. G. and Eden, M. R. (2013). A novel methodology for property-based molecular design using multiple topological indices. Industrial & Engineering Chemistry Research, 52(22):7090–7103.

Chemmangattuvalappil, N. G., Solvason, C. C., Bommareddy, S., and Eden, M. R. (2010). Reverse problem formulation approach to molecular design using property operators based on signature descriptors. Computers & Chemical Engineering, 34(12):2062–2071.

Chen, J., Xie, J., Jiang, Z., Wang, B., Wang, Y., and Hu, X. (2011). Shikonin and its analogs inhibit cancer cell glycolysis by targeting tumor pyruvate kinase-m2. Oncogene, 30(42):4297.

Chen, J. J. F. and Visco Jr, D. P. (2016). Developing an in silico pipeline for faster drug candidate discovery: Virtual high throughput screening with the signature molecular descriptor using support vector machine models. Chemical Engineering Science.

Chen, J. J. F. and Visco Jr, D. P. (2017). Identifying novel factor xiia inhibitors with pca-ga-svm developed vhts models. European Journal Of Medicinal Chemistry, 140:31–41.

Chen, X. and Reynolds, C. H. (2002). Performance of similarity measures in 2d fragment-based similarity searching: comparison of structural descriptors and similarity coefficients. J Chem Inf Comput Sci, 42(6):1407–14.

Cheng, T., Li, Q., Zhou, Z., Wang, Y., and Bryant, S. H. (2012). Structure-based virtual screening for drug discovery: a problem-centric review. Aaps J, 14(1):133–41.

Chivian, D. and Baker, D. (2006). Homology modeling using parametric alignment ensemble generation with consensus and energy-based model selection. Nucleic Acids Research, 34(17):e112–e112.

Churchwell, C. J., Rintoul, M. D., Martin, S., Visco, D. P., Jr., Kotu, A., Larson, R. S., Sillerud, L. O., Brown, D. C., and Faulon, J. L. (2004). The signature molecular descriptor. 3. inverse-quantitative structure-activity relationship of icam-1 inhibitory peptides. J Mol Graph Model, 22(4):263–73.

Citarella, F., Tripodi, M., Fantoni, A., Bernardi, F., Romeo, G., and Rocchi, M. (1988). Assignment of human coagulation factor xii (fxii) to chromosome 5 by cdna hybridization to dna from somatic cell hybrids. Human Genetics, 80(4):397–398.

Cochrane, C., Revak, S., and Wuepper, K. (1973). Activation of hageman factor in solid and fluid phases a critical role of kallikrein. The Journal Of Experimental Medicine, 138(6):1564–1583.

Cool, D. E. and MacGillivray, R. (1987). Characterization of the human blood coagulation factor xii gene. intron/exon gene organization and analysis of the 5’-flanking region. Journal Of Biological Chemistry, 262(28):13662–13673.

Cornell, W. D., Cieplak, P., Bayly, C. I., Gould, I. R., Merz, K. M., Ferguson, D. M., Spellmeyer, D. C., Fox, T., Caldwell, J. W., and Kollman, P. A. (1995). A 2nd generation force-field for the simulation of proteins, nucleic-acids, and organic-molecules. Journal Of The American Chemical Society, 117(19):5179–5197.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

Craig, T. J., Levy, R. J., Wasserman, R. L., Bewtra, A. K., Hurewitz, D., Obtułowicz, K., Reshef, A., Ritchie, B., Moldovan, D., Shirov, T., et al. (2009). Efficacy of human c1 esterase inhibitor concentrate compared with placebo in acute hereditary angioedema attacks. Journal Of Allergy And Clinical Immunology, 124(4):801–808.

Cruz, M. P. (2015). Conestat alfa (ruconest): first recombinant c1 esterase inhibitor for the treatment of acute attacks in patients with hereditary angioedema. Pharmacy And Therapeutics, 40(2):109.

Czarnik, A. and Mei, H.-Y. (2007). 2.12 - how and why to apply the latest technology. In Taylor, J. B. and Triggle, D. J., editors, Comprehensive Medicinal Chemistry II, pages 289–557. Elsevier, Oxford.

Czuczman, N. M., Barth, M. J., Gu, J., Neppalli, V., Mavis, C., Frys, S. E., Hu, Q., Liu, S., Klener, P., Vockova, P., et al. (2016). Pevonedistat, a nedd8-activating enzyme inhibitor, is active in mantle cell lymphoma and enhances rituximab activity in vivo. Blood, 127(9):1128–1137.

da Silva, C. H., da Silva, V. B., Resende, J., Rodrigues, P. F., Bononi, F. C., Benevenuto, C. G., and Taft, C. A. (2010). Computer-aided drug design and admet predictions for identification and evaluation of novel potential farnesyltransferase inhibitors in cancer therapy. J Mol Graph Model, 28(6):513–23.

Damus, P. S., Hicks, M., and Rosenberg, R. D. (1973). Anticoagulant action of heparin. Nature, 246(5432):355–357.

Davie, E. W. and Ratnoff, O. D. (1964). Waterfall sequence for intrinsic blood clotting. Science, 145(3638):1310–1312.

Del Pozo, J., Timpte, C., Tan, S., Callis, J., and Estelle, M. (1998). The ubiquitin-related protein rub1 and auxin response in arabidopsis. Science, 280(5370):1760–1763.

Dennemärker, J., Lohmüller, T., Müller, S., Aguilar, S. V., Tobin, D. J., Peters, C., and Reinheckel, T. (2010). Impaired turnover of autophagolysosomes in cathepsin l deficiency. Biological Chemistry, 391(8):913–922.

Dev, V. A., Chemmangattuvalappil, N. G., and Eden, M. R. (2014). Structure generation of candidate reactants using signature descriptors. 24th European Symposium On Computer Aided Process Engineering, Pts A And B, 33:151–156.

Devillers, J. and Balaban, A. T. (2000). Topological indices and related descriptors in QSAR and QSPR. CRC Press.

DeWitte, R. S. and Shakhnovich, E. I. (1996). Smog: de novo design method based on simple, fast, and accurate free energy estimates. 1. methodology and supporting evidence. Journal Of The American Chemical Society, 118(47):11733–11744.

Diamond, S. L. (2008a). Aid 787 - complement factor c1s ic150 from mixture screen. Website.

Diamond, S. L. (2008b). Aid 825: Cathepsin l dose-response confirmation. Website.

Diamond, S. L. (2008c). Factor xia 1536 hts dose response confirmation. Website.

Diamond, S. L. (2011). Aid 728 - factor xiia dose response confirmation from single well hts. Website.

Dobson, C. M. (2004). Chemical space and biology. Nature, 432(7019):824–8.

Doggen, C. J., Rosendaal, F. R., and Meijers, J. C. (2006). Levels of intrinsic coagulation factors and the risk of myocardial infarction among men: Opposite and synergistic effects of factors xi and xii. Blood, 108(13):4045–4051.

Dombrauckas, J. D., Santarsiero, B. D., and Mesecar, A. D. (2005). Structural basis for tumor pyruvate kinase m2 allosteric regulation and catalysis. Biochemistry, 44(27):9417–9429.

Douguet, D., Munier-Lehmann, H., Labesse, G., and Pochet, S. (2005). Lea3d: a computer-aided ligand design for structure-based drug design. J Med Chem, 48(7):2457–68.

Drews, J. (2000). Drug discovery: A historical perspective. Science, 287(5460):1960–1964.

Duchowicz, P. R., Talevi, A., Bruno-Blanch, L. E., and Castro, E. A. (2008). New qspr study for the prediction of aqueous solubility of drug-like compounds. Bioorganic & Medicinal Chemistry, 16(17):7944–7955.

Duncan, E. M., Muratore-Schroeder, T. L., Cook, R. G., Garcia, B. A., Shabanowitz, J., Hunt, D. F., and Allis, C. D. (2008). Cathepsin l proteolytically processes histone h3 during mouse embryonic stem cell differentiation. Cell, 135(2):284–294.

Dunn, J., Silverberg, M., and Kaplan, A. (1982). The cleavage and formation of activated human hageman factor by autodigestion and by kallikrein. Journal Of Biological Chemistry, 257(4):1779–1784.

Durrant, J. D. and McCammon, J. A. (2010). Computer-aided drug-discovery techniques that account for receptor flexibility. Curr Opin Pharmacol, 10(6):770–4.

Durrant, J. D. and McCammon, J. A. (2011). Molecular dynamics simulations and drug discovery. Bmc Biol, 9:71.

Ehrentraut, S. F., Kominsky, D. J., Glover, L. E., Campbell, E. L., Kelly, C. J., Bowers, B. E., Bayless, A. J., and Colgan, S. P. (2013). Central role for endothelial human deneddylase-1/senp8 in fine-tuning the vascular inflammatory response. The Journal Of Immunology, 190(1):392–400.

Eichinger, S., Schönauer, V., Weltermann, A., Minar, E., Bialonczyk, C., Hirschl, M., Schneider, B., Quehenberger, P., and Kyrle, P. A. (2004). Thrombin-activatable fibrinolysis inhibitor and the risk for recurrent venous thromboembolism. Blood, 103(10):3773–3776.

Eigenbrodt, E., Reinacher, M., Scheefers-Borchel, U., Scheefers, H., and Friis, R. (1992). Double role for pyruvate kinase type m2 in the expansion of phosphometabolite pools found in tumor cells. Critical Reviews In Oncogenesis, 3(1-2):91–115.

Elbers, J., Van Unnik, J., Rijksen, G., Van Oirschot, B., Staal, G., Roholl, P., and Oosting, J. (1991). Pyruvate kinase activity and isozyme composition in normal fibrous tissue and fibroblastic proliferations. Cancer, 67(10):2552–2559.

Embade, N., Fernández-Ramos, D., Varela-Rey, M., Beraza, N., Sini, M., de Juan, V. G., Woodhoo, A., Martínez-López, N., Rodríguez-Iruretagoyena, B., Bustamante, F. J., et al. (2012). Murine double minute 2 regulates hu antigen r stability in human liver and colon cancer through neddylation. Hepatology, 55(4):1237–1248.

Enyedy, I. J. and Egan, W. J. (2008). Can we use docking and scoring for hit-to-lead optimization? Journal Of Computer-aided Molecular Design, 22(3-4):161–168.

Fabricant, D. S. and Farnsworth, N. R. (2001). The value of plants used in traditional medicine for drug discovery. Environmental Health Perspectives, 109(Suppl 1):69.

Farsetti, A., Misiti, S., Citarella, F., Felici, A., Andreoli, M., Fantoni, A., Sacchi, A., and Pontecorvi, A. (1995). Molecular basis of estrogen regulation of hageman factor xii gene expression. Endocrinology, 136(11):5076–5083.

Farsetti, A., Moretti, F., Narducci, M., Misiti, S., Nanni, S., Andreoli, M., Sacchi, A., and Pontecorvi, A. (1998). Orphan receptor hepatocyte nuclear factor-4 antagonizes estrogen receptor α-mediated induction of human coagulation factor xii gene. Endocrinology, 139(11):4581–4589.

Faulon, J.-L. (1994). Stochastic generator of chemical structure. 1. application to the structure elucidation of large molecules. Journal Of Chemical Information And Computer Sciences, 34(5):1204–1218.

Faulon, J.-L. and Bender, A. (2010). Handbook Of Chemoinformatics Algorithms. CRC press.

Faulon, J.-L., Brown, W. M., and Martin, S. (2005). Reverse engineering chemical structures from molecular descriptors: how many solutions? Journal Of Computer-aided Molecular Design, 19(9-10):637–650.

Faulon, J.-L., Churchwell, C. J., and Visco, D. P., Jr. (2003a). The signature molecular descriptor. 2. enumerating molecules from their extended valence sequences. J Chem Inf Comput Sci, 43(3):721–34.

Faulon, J.-L., Collins, M. J., and Carr, R. D. (2004). The signature molecular descriptor. 4. canonizing molecules using extended valence sequences. J Chem Inf Comput Sci, 44(2):427–36.

Faulon, J.-L., Visco, D. P., Jr., and Pophale, R. S. (2003b). The signature molecular descriptor. 1. using extended valence sequences in qsar and qspr studies. J Chem Inf Comput Sci, 43(3):707–20.

Ferreira, R. S., Simeonov, A., Jadhav, A., Eidam, O., Mott, B. T., Keiser, M. J., McKerrow, J. H., Maloney, D. J., Irwin, J. J., and Shoichet, B. K. (2010). Complementarity between a docking and a high-throughput screen in discovering new cruzain inhibitors. Journal Of Medicinal Chemistry, 53(13):4891–4905.

Field-Smith, A., Morgan, G. J., and Davies, F. E. (2006). Bortezomib (velcade) in the treatment of multiple myeloma. Therapeutics And Clinical Risk Management, 2(3):271.

Fjellström, O., Akkaya, S., Beisel, H.-G., Eriksson, P.-O., Erixon, K., Gustafsson, D., Jurva, U., Kang, D., Karis, D., Knecht, W., et al. (2015). Creating novel activated factor xi inhibitors through fragment based lead generation and structure aided drug design. Plos One, 10(1):e0113705.

Forbes, C. D., Pensky, J., and Ratnoff, O. D. (1970). Inhibition of activated hageman factor and activated plasma thromboplastin antecedent by purified serum c1 inactivator. J Lab Clin Med, 76(5):809–815.

Fujikawa, K., Chung, D. W., Hendrickson, L. E., and Davie, E. W. (1986). Amino acid sequence of human factor xi, a blood coagulation factor with four tandem repeats that are highly homologous with plasma prekallikrein. Biochemistry, 25(9):2417–2424.

Gaboriaud, C., Thielens, N. M., Gregory, L. A., Rossi, V., Fontecilla-Camps, J. C., and Arlaud, G. J. (2004). Structure and activation of the c1 complex of complement: unraveling the puzzle. Trends In Immunology, 25(7):368–373.

Gailani, D. and Broze Jr, G. J. (1991). Factor xi activation in a revised model of blood coagulation. Science, 253(5022):909–913.

Gal, S. and Gottesman, M. M. (1986). The major excreted protein of transformed fibroblasts is an activable acid-protease. Journal Of Biological Chemistry, 261(4):1760–1765.

Gan-Erdene, T., Nagamalleswari, K., Yin, L., Wu, K., Pan, Z.-Q., and Wilkinson, K. D. (2003). Identification and characterization of den1, a deneddylase of the ulp family. Journal Of Biological Chemistry, 278(31):28892–28900.

Gao, F., Cheng, J., Shi, T., and Yeh, E. T. (2006). Neddylation of a breast cancer-associated protein recruits a class iii histone deacetylase that represses nfκb-dependent transcription. Nature Cell Biology, 8(10):1171–1177.

Gaulton, A., Bellis, L. J., Bento, A. P., Chambers, J., Davies, M., Hersey, A., Light, Y., McGlinchey, S., Michalovich, D., Al-Lazikani, B., and Overington, J. P. (2012). Chembl: a large-scale bioactivity database for drug discovery. Nucleic Acids Res, 40(Database issue):D1100–7.

Gerber, A., Wille, A., Welte, T., Ansorge, S., and Bühling, F. (2001). Interleukin-6 and transforming growth factor-β1 control expression of cathepsins b and l in human lung epithelial cells. Journal Of Interferon & Cytokine Research, 21(1):11–19.

Geysen, H. M., Schoenen, F., Wagner, D., and Wagner, R. (2003). A guide to drug discovery: Combinatorial compound libraries for drug discovery: an ongoing challenge. Nature Reviews Drug Discovery, 2(3):222.

Gharagheizi, F., Mirkhani, S. A., Keshavarz, M. H., Farahani, N., and Tumba, K. (2013). A molecular-based model for prediction of liquid viscosity of pure organic compounds: A quantitative structure property relationship (qspr) approach. Journal Of The Taiwan Institute Of Chemical Engineers, 44(3):359–364.

Ghebrehiwet, B., Silverberg, M., and Kaplan, A. P. (1981). Activation of the classical pathway of complement by hageman factor fragment. The Journal Of Experimental Medicine, 153(3):665–676.

Girolami, A., Randi, M., Gavasso, S., Lombardi, A., and Spiezia, F. (2004). The occasional venous thromboses seen in patients with severe (homozygous) fxii deficiency are probably due to associated risk factors: a study of prevalence in 21 patients and review of the literature. Journal Of Thrombosis And Thrombolysis, 17(2):139–143.

Glickman, M. H. and Ciechanover, A. (2002). The ubiquitin-proteasome proteolytic pathway: destruction for the sake of construction. Physiological Reviews, 82(2):373–428.

Goldberger, G., Bruns, G., Rits, M., Edge, M., and Kwiatkowski, D. (1987). Human complement factor i: analysis of cdna-derived primary structure and assignment of its gene to chromosome 4. Journal Of Biological Chemistry, 262(21):10065–10071.

Gompels, M., Lock, R., Abinun, M., Bethune, C., Davies, G., Grattan, C., Fay, A., Longhurst, H., Morrison, L., Price, A., et al. (2005). C1 inhibitor deficiency: consensus document. Clinical & Experimental Immunology, 139(3):379–394.

Goodnough, L. T., Saito, H., and Ratnoff, O. D. (1983). Thrombosis or myocardial infarction in congenital clotting factor abnormalities and chronic thrombocytopenias: a report of 21 patients and a review of 50 previously reported cases. Medicine, 62(4):248–255.

Goulet, B., Baruch, A., Moon, N.-S., Poirier, M., Sansregret, L. L., Erickson, A., Bogyo, M., and Nepveu, A. (2004). A cathepsin l isoform that is devoid of a signal peptide localizes to the nucleus in s phase and processes the cdp/cux transcription factor. Molecular Cell, 14(2):207–219.

Grove, J. and Marsh, M. (2011). The cell biology of receptor-mediated virus entry. J Cell Biol, 195(7):1071–82.

Gulati, P., Lemercier, C., Guc, D., Lappin, D., and Whaley, K. (1993). Regulation of the synthesis of c1 subcomponents and c1-inhibitor. Behring Institute Mitteilungen, (93):196–203.

Guo, C., Linton, A., Jalaie, M., Kephart, S., Ornelas, M., Pairish, M., Greasley, S., Richardson, P., Maegley, K., Hickey, M., et al. (2013). Discovery of 2-((1h-benzo [d] imidazol-1-yl) methyl)-4h-pyrido [1, 2-a] pyrimidin-4-ones as novel pkm2 activators. Bioorganic & Medicinal Chemistry Letters, 23(11):3358–3363.

Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal Of Machine Learning Research, 3(Mar):1157–1182.

Hacker, H.-J., Steinberg, P., and Bannasch, P. (1998). Pyruvate kinase isoenzyme shift from l-type to m2-type is a late event in hepatocarcinogenesis induced in rats by a choline-deficient/dl-ethionine-supplemented diet. Carcinogenesis, 19(1):99–107.

Hardt, P., Ngoumou, B., Rupp, J., Schnell-Kretschmer, H., and Kloer, H. (2000). Tumor m2-pyruvate kinase: a promising tumor marker in the diagnosis of gastro-intestinal cancer. Anticancer Research, 20(6D):4965–4968.

Harvey, A. L. (2008). Natural products in drug discovery. Drug Discovery Today, 13(19-20):894–901.

Hawkins, D. M. (2004). The problem of overfitting. J Chem Inf Comput Sci, 44(1):1–12.

Heck, L. W. and Kaplan, A. P. (1974). Substrates of hageman factor: I. isolation and characterization of human factor xi (pta) and inhibition of the activated enzyme by α1-antitrypsin. The Journal Of Experimental Medicine, 140(6):1615.

Henrich, C. J. and Beutler, J. A. (2013). Matching the power of high throughput screening to the chemical diversity of natural products. Natural Product Reports, 30(10):1284–1298.

Hocking, R. R. (1976). A biometrics invited paper. the analysis and selection of variables in linear regression. Biometrics, 32(1):1–49.

Houghten, R. A. (2000). Parallel array and mixture-based synthetic combinatorial chemistry: tools for the next millennium. Annual Review Of Pharmacology And Toxicology, 40(1):273–282.

Huang, G., Kaufman, A. J., Ramanathan, Y., and Singh, B. (2011). Sccro (dcun1d1) promotes nuclear translocation and assembly of the neddylation e3 complex. Journal Of Biological Chemistry, 286(12):10297–10304.

Huang, S. Y., Li, M., Wang, J., and Pan, Y. (2015). Hybriddock: A hybrid protein-ligand docking protocol integrating protein- and ligand-based approaches. J Chem Inf Model.

Huang, Y., Dong, H., Zhang, X., Li, C., and Zhang, S. (2013). A new fragment contribution-corresponding states method for physicochemical properties prediction of ionic liquids. Aiche Journal, 59(4):1348–1359.

Huey, R., Morris, G. M., Olson, A. J., and Goodsell, D. S. (2007). A semiempirical free energy force field with charge-based desolvation. Journal Of Computational Chemistry, 28(6):1145–1152.

Hugli, T. E. (1986). Biochemistry and biology of anaphylatoxins. Complement, 3:111–127.

Hutter, M. C. (2011). Graph-based similarity concepts in virtual screening. Future Medicinal Chemistry, 3(4):485–501.

Ibsen, K. H., Orlando, R. A., Garratt, K. N., Hernandez, A. M., Giorlando, S., and Nungaray, G. (1982). Expression of multimolecular forms of pyruvate kinase in normal, benign, and malignant human breast tissue. Cancer Research, 42(3):888–892.

Inoue, Y., Peters, L. L., Yim, S. H., Inoue, J., and Gonzalez, F. J. (2006). Role of hepatocyte nuclear factor 4α in control of blood coagulation factor gene expression. Journal Of Molecular Medicine, 84(4):334–344.

Irwin, J. J., Duan, D., Torosyan, H., Doak, A. K., Ziebart, K. T., Sterling, T., Tumanian, G., and Shoichet, B. K. (2015). An aggregation advisor for ligand discovery. Journal Of Medicinal Chemistry, 58(17):7076–7087.

Joback, K. G. (1984). A Unified Approach to Physical Property Estimation Using Multivariate Statistical Techniques. Massachusetts Institute of Technology, Department of Chemical Engineering.

Joback, K. G. and Reid, R. C. (1987). Estimation of pure-component properties from group-contributions. Chemical Engineering Communications, 57(1-6):233–243.

Johnson, D. A., Barrett, A., and Mason, R. (1986). Cathepsin l inactivates alpha 1-proteinase inhibitor by cleavage in the reactive site region. Journal Of Biological Chemistry, 261(31):14748–14751.

Johnson, M. A. and Maggiora, G. M. (1990). Concepts And Applications Of Molecular Similarity. Wiley.

Jolliffe, I. (2014). Principal component analysis. In Wiley Statsref: Statistics Reference Online. John Wiley & Sons, Ltd.

Jones, G., Willett, P., Glen, R. C., Leach, A. R., and Taylor, R. (1997). Development and validation of a genetic algorithm for flexible docking. Journal Of Molecular Biology, 267(3):727–748.

Jorgensen, W. L. (2004). The many roles of computation in drug discovery. Science, 303(5665):1813–1818.

Kalaiarasan, P., Kumar, B., Chopra, R., Gupta, V., Subbarao, N., and Bamezai, R. N. (2015). In silico screening, genotyping, molecular dynamics simulation and activity studies of snps in pyruvate kinase m2. Plos One, 10(3):e0120469.

Kalyaanamoorthy, S. and Chen, Y. P. (2011). Structure-based drug design to augment hit discovery. Drug Discov Today, 16(17-18):831–9.

Kane, R. C., Bross, P. F., Farrell, A. T., and Pazdur, R. (2003). Velcade®: Us fda approval for the treatment of multiple myeloma progressing on prior therapy. The Oncologist, 8(6):508–513.

Kane, R. C., Dagher, R., Farrell, A., Ko, C.-W., Sridhara, R., Justice, R., and Pazdur, R. (2007). Bortezomib for the treatment of mantle cell lymphoma. Clinical Cancer Research, 13(18):5291–5294.

Karatzoglou, A., Smola, A., and Hornik, K. (2004). kernlab - an s4 package for kernel methods in r. Journal Of Statistical Software, 11(9):1–22.

Kato, A., Asakai, R., Davie, E., and Aoki, N. (1989). Factor xi gene (f11) is located on the distal end of the long arm of human chromosome 4. Cytogenetic And Genome Research, 52(1-2):77–78.

Kawalec, P., Holko, P., and Paszulewicz, A. (2013). Cost-utility analysis of ruconest® (conestat alfa) compared to berinert® p (human c1 esterase inhibitor) in the treatment of acute, life-threatening angioedema attacks in patients with hereditary angioedema. Advances In Dermatology And Allergology/Postępy Dermatologii I Alergologii, 30(3):152.

Kayello, H. M., Tadisina, N. K. R., Shlonimskaya, N., Biernacki, J. J., and Visco, D. P. (2014). An application of computer-aided molecular design (camd) using the signature molecular descriptor - part 1. identification of surface tension reducing agents and the search for shrinkage reducing admixtures. Journal Of The American Ceramic Society, 97(2):365–377.

Keiser, M. J., Roth, B. L., Armbruster, B. N., Ernsberger, P., Irwin, J. J., and Shoichet, B. K. (2007). Relating protein pharmacology by ligand chemistry. Nature Biotechnology, 25(2):197.

Keiser, M. J., Setola, V., Irwin, J. J., Laggner, C., Abbas, A. I., Hufeisen, S. J., Jensen, N. H., Kuijer, M. B., Matos, R. C., Tran, T. B., et al. (2009). Predicting new molecular targets for known drugs. Nature, 462(7270):175.

Khan, S., Ullah, M. W., Siddique, R., Nabi, G., Manan, S., Yousaf, M., and Hou, H. (2016). Role of recombinant dna technology to improve life. International Journal Of Genomics, 2016.

Kier, L. B. and Hall, L. H. (1976). Molecular connectivity vii: specific treatment of heteroatoms. Journal Of Pharmaceutical Sciences, 65(12):1806–1809.

Kier, L. B. and Hall, L. H. (1977). Structure-activity studies on hallucinogenic amphetamines using molecular connectivity. Journal Of Medicinal Chemistry, 20(12):1631–1636.

Kier, L. B. and Hall, L. H. (1981). Derivation and significance of valence molecular connectivity. Journal Of Pharmaceutical Sciences, 70(6):583–589.

Kier, L. B. and Hall, L. H. (1990). An electrotopological-state index for atoms in molecules. Pharmaceutical Research, 7(8):801–807.

Kim, S., Thiessen, P. A., Bolton, E. E., Chen, J., Fu, G., Gindulyte, A., Han, L., He, J., He, S., Shoemaker, B. A., Wang, J., Yu, B., Zhang, J., and Bryant, S. H. (2015). Pubchem substance and compound databases. Nucleic Acids Res.

Kleinschnitz, C., Stoll, G., Bendszus, M., Schuh, K., Pauer, H.-U., Burfeind, P., Renné, C., Gailani, D., Nieswandt, B., and Renné, T. (2006). Targeting coagulation factor xii provides protection from pathological thrombosis in cerebral ischemia without interfering with hemostasis. The Journal Of Experimental Medicine, 203(3):513–518.

Koehn, F. E. and Carter, G. T. (2005). The evolving role of natural products in drug discovery. Nature Reviews Drug Discovery, 4(3):206.

Krueger, S., Kellner, U., Buehling, F., and Roessner, A. (2001). Cathepsin l antisense oligonucleotides in a human osteosarcoma cell line: effects on the invasive phenotype. Cancer Gene Therapy, 8(7):522.

Kuhli, C., Scharrer, I., Koch, F., Ohrloff, C., and Hattenbach, L.-O. (2004). Factor xii deficiency: a thrombophilic risk factor for retinal vein occlusion. American Journal Of Ophthalmology, 137(3):459–464.

Kumar, Y., Mazurek, S., Yang, S., Failing, K., Winslet, M., Fuller, B., and Davidson, B. R. (2010). In vivo factors influencing tumour m2-pyruvate kinase level in human pancreatic cancer cell lines. Tumor Biology, 31(2):69–77.

Kung, C., Hixon, J., Choe, S., Marks, K., Gross, S., Murphy, E., DeLaBarre, B., Cianchetta, G., Sethumadhavan, S., Wang, X., et al. (2012). Small molecule activation of pkm2 in cancer cells induces serine auxotrophy. Chemistry & Biology, 19(9):1187–1198.

Lafarge, J.-C., Clément, K., and Guerre-Millo, M. (2011). Cathepsins s, l, and k and their pathophysiological relevance in obesity. Clinical Reviews In Bone And Mineral Metabolism, 9(2):133–137.

Laha, T. T., Hawley, M., Rock, K. L., and Goldberg, A. L. (1995). Gamma-interferon causes a selective induction of the lysosomal proteases, cathepsins b and l, in macrophages. Febs Letters, 363(1-2):85–89.

Lammer, D., Mathias, N., Laplaza, J. M., Jiang, W., Liu, Y., Callis, J., Goebl, M., and Estelle, M. (1998). Modification of yeast cdc53p by the ubiquitin-related protein rub1p affects function of the scfcdc4 complex. Genes & Development, 12(7):914–926.

Lan, H., Tang, Z., Jin, H., and Sun, Y. (2016). Neddylation inhibitor mln4924 suppresses growth and migration of human gastric cancer cells. Scientific Reports, 6.

Langley, J. N. (1905). On the reaction of cells and of nerve-endings to certain poisons, chiefly as regards the reaction of striated muscle to nicotine and to curari. The Journal Of Physiology, 33(4-5):374–413.

Laurie, R., Alasdair, T., and Jackson, R. M. (2006). Methods for the prediction of protein-ligand binding sites for structure-based drug design and virtual ligand screening. Current Protein And Peptide Science, 7(5):395–406.

Leung, C.-H., Chan, D. S.-H., Yang, H., Abagyan, R., Lee, S. M.-Y., Zhu, G.-Y., Fong, W.-F., and Ma, D.-L. (2011). A natural product-like inhibitor of nedd8-activating enzyme. Chemical Communications, 47(9):2511–2513.

Li, H., Visco, D. P., and Leipzig, N. D. (2014a). Confirmation of predicted activity for factor xia inhibitors from a virtual screening approach. Aiche Journal, 60(8):2741–2746.

Li, J. W.-H. and Vederas, J. C. (2009). Drug discovery and natural products: end of an era or an endless frontier? Science, 325(5937):161–165.

Li, Z., Yang, P., and Li, Z. (2014b). The multifaceted regulation and functions of pkm2 in tumor progression. Biochimica Et Biophysica Acta (bba)-reviews On Cancer, 1846(2):285–296.

Liakopoulos, D., Doenges, G., Matuschewski, K., and Jentsch, S. (1998). A novel protein modification pathway related to the ubiquitin system. The Embo Journal, 17(8):2208–2214.

Lill, M. A. and Danielson, M. L. (2011). Computer-aided drug design platform using pymol. J Comput Aided Mol Des, 25(1):13–9.

Lin, J. J., Milhollen, M. A., Smith, P. G., Narayanan, U., and Dutta, A. (2010). Nedd8-targeting drug mln4924 elicits dna rereplication by stabilizing cdt1 in s phase, triggering checkpoint activation, apoptosis, and senescence in cancer cells. Cancer Research, 70(24):10310–10320.

Lipinski, C. A. (2000). Drug-like properties and the causes of poor solubility and poor permeability. Journal Of Pharmacological And Toxicological Methods, 44(1):235–249.

Lipinski, C. A., Lombardo, F., Dominy, B. W., and Feeney, P. J. (2001). Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews, 46(1-3):3–26.

Liu, J., Wu, N., Ma, L., Liu, M., Liu, G., Zhang, Y., and Lin, X. (2014). Oleanolic acid suppresses aerobic glycolysis in cancer cells by switching pyruvate kinase type m isoforms. Plos One, 9(3):e91606.

Loftus, S. J., Liu, G., Carr, S. M., Munro, S., and La Thangue, N. B. (2012). Neddylation regulates e2f-1-dependent transcription. Embo Reports, 13(9):811–818.

Longhurst, H. (2008). Rhucin, a recombinant c1 inhibitor for the treatment of hereditary angioedema and cerebral ischemia. Current Opinion In Investigational Drugs (london, England: 2000), 9(3):310–323.

Lunn, M., Santos, C., and Craig, T. (2010). Cinryze as the first approved c1 inhibitor in the usa for the treatment of hereditary angioedema: approval, efficacy and safety. Journal Of Blood Medicine, 1:163.

Luo, Z., Yu, G., Lee, H. W., Li, L., Wang, L., Yang, D., Pan, Y., Ding, C., Qian, J., Wu, L., et al. (2012). The nedd8-activating enzyme inhibitor mln4924 induces autophagy and apoptosis to suppress liver cancer cell growth. Cancer Research, 72(13):3360–3371.

Mahata, B., Sundqvist, A., and Xirodimas, D. (2012). Recruitment of rpl11 at promoter sites of p53-regulated genes upon nucleolar stress through nedd8 and in an mdm2-dependent manner. Oncogene, 31(25):3060–3071.

Martin, S., Roe, D., and Faulon, J. L. (2005). Predicting protein-protein interactions using signature products. Bioinformatics, 21(2):218–26.

Martinez-Balibrea, E., Plasencia, C., Ginés, A., Martinez-Cardús, A., Musulén, E., Aguilera, R., Manzano, J. L., Neamati, N., and Abad, A. (2009). A proteomic approach links decreased pyruvate kinase m2 expression to oxaliplatin resistance in patients with colorectal cancer and in human cell lines. Molecular Cancer Therapeutics, 8(4):771–778.

Mason, R. W., Green, G., and Barrett, A. J. (1985). Human liver cathepsin l. Biochemical Journal, 226(1):233.

Mason, R. W., Johnson, D. A., Barrett, A. J., and Chapman, H. A. (1986). Elastinolytic activity of human cathepsin l. Biochemical Journal, 233(3):925.

Matasci, M., Hacker, D. L., Baldi, L., and Wurm, F. M. (2008). Recombinant therapeutic protein production in cultivated mammalian cells: current status and future prospects. Drug Discovery Today: Technologies, 5(2-3):e37–e42.

Matsui, Y., Yasumatsu, I., Asahi, T., Kitamura, T., Kanai, K., Ubukata, O., Hayasaka, H., Takaishi, S., Hanzawa, H., and Katakura, S. (2017). Discovery and structure-guided fragment-linking of 4-(2, 3-dichlorobenzoyl)-1-methyl-pyrrole-2-carboxamide as a pyruvate kinase m2 activator. Bioorganic & Medicinal Chemistry, 25(13):3540–3546.

Mattei, M., Kontogeorgis, G. M., and Gani, R. (2013). Modeling of the critical micelle concentration (cmc) of nonionic surfactants with an extended group-contribution method. Industrial & Engineering Chemistry Research, 52(34):12236–12246.

Mayer, R. J. and Doherty, F. (1986). Intracellular protein catabolism: state of the art. Febs Letters, 198(2):181–193.

Mazurek, S., Boschek, C., and Eigenbrodt, E. (1997). The role of phosphometabolites in cell proliferation, energy metabolism, and tumor therapy. Journal Of Bioenergetics And Biomembranes, 29(4):315–330.

Mazurek, S., Zwerschke, W., Jansen-Durr, P., and Eigenbrodt, E. (2001). Metabolic cooperation between different oncogenes during cell transformation: interaction between activated ras and hpv-16 e7. Oncogene, 20(47):6891–6898.

Meijers, J. C., Kanters, D. H., Vlooswijk, R. A., Van Erp, H. E., Hessing, M., and Bouma, B. N. (1988). Inactivation of human plasma kallikrein and factor xia by protein c inhibitor. Biochemistry, 27(12):4231–4237.

Meijers, J. C., Tekelenburg, W. L., Bouma, B. N., Bertina, R. M., and Rosendaal, F. R. (2000). High levels of coagulation factor xi as a risk factor for venous thrombosis. New England Journal Of Medicine, 342(10):696–701.

Mendoza, H. M., Shen, L.-n., Botting, C., Lewis, A., Chen, J., Ink, B., and Hay, R. T. (2003). Nedp1, a highly conserved cysteine protease that deneddylates cullins. Journal Of Biological Chemistry, 278(28):25637–25643.

Meyer-Schaller, N., Chou, Y.-C., Sumara, I., Martin, D. D., Kurz, T., Katheder, N., Hofmann, K., Berthiaume, L. G., Sicheri, F., and Peter, M. (2009). The human dcn1-like protein dcnl3 promotes cul3 neddylation at membranes. Proceedings Of The National Academy Of Sciences, 106(30):12365–12370.

Misura, K. M., Chivian, D., Rohl, C. A., Kim, D. E., and Baker, D. (2006). Physically realistic homology models built with rosetta can be more accurate than their templates. Proceedings Of The National Academy Of Sciences, 103(14):5361–5366.

Mitchell, J. B., Laskowski, R. A., Alex, A., Forster, M. J., and Thornton, J. M. (1999). Bleep–potential of mean force describing protein–ligand interactions: Ii. calculation of binding energies and comparison with experimental data. Journal Of Computational Chemistry, 20(11):1177–1185.

Modjtahedi, H., Ali, S., and Essapen, S. (2012). Therapeutic application of monoclonal antibodies in cancer: advances and challenges. British Medical Bulletin, 104(1).

Morris, K., Bing, D., Andrews, J., Silverstein, L., Shohet, R., Attisano, C., and Goran, P. (1978). Biosynthesis of the subcomponents of c1 by twelve human established cell lines. The Journal Of Immunology, 120(5):1786–1787.

Mukhopadhyay, D. and Dasso, M. (2007). Modification in reverse: the sumo proteases. Trends In Biochemical Sciences, 32(6):286–295.

Müller, W., Hanauske-Abel, H., and Loos, M. (1978). Biosynthesis of the first component of complement by human and guinea pig peritoneal macrophages: evidence for an independent production of the c1 subunits. The Journal Of Immunology, 121(4):1578–1584.

Müller-Eberhard, H. J. (1985). The killer molecule of complement. Journal Of Investigative Dermatology, 85(1):S47–S52.

Nagano, T., Hashimoto, T., Nakashima, A., Kikkawa, U., and Kamada, S. (2012). X-linked inhibitor of apoptosis protein mediates neddylation by itself but does not function as a nedd8–e3 ligase for caspase-7. Febs Letters, 586(11):1612–1616.

Naito, K. and Fujikawa, K. (1991). Activation of human blood coagulation factor xi independent of factor xii. factor xi is activated by thrombin and factor xia in the presence of negatively charged surfaces. Journal Of Biological Chemistry, 266(12):7353–7358.

National Library of Medicine (US), National Center for Biotechnology Information (2017). Senp8 sumo/sentrin peptidase family member, nedd8 specific [homo sapiens]. Internet.

Nelson, D. L. and Cox, M. M. (2012). Lehninger Principles of Biochemistry. WORTH PUBL INC.

Ng, R. (2008). Appendix 1: History of drug discovery and development. Drugs: From Discovery To Approval, Second Edition, pages 391–397.

Noguchi, T., Inoue, H., and Tanaka, T. (1986). The m1-and m2-type isozymes of rat pyruvate kinase are produced from the same gene by alternative rna splicing. Journal Of Biological Chemistry, 261(29):13807–13812.

Noguchi, T., Yamada, K., Inoue, H., Matsuda, T., and Tanaka, T. (1987). The l-and r-type isozymes of rat pyruvate kinase are produced from a single gene by use of different promoters. Journal Of Biological Chemistry, 262(29):14366–14371.

Oremek, G., Eigenbrodt, E., R¨adle,J., Zeuzem, S., and Seiffert, U. (1997). Value of the serum levels of the tumor marker tum2-pk in pancreatic cancer. Anticancer Research, 17(4B):3031–3033.

Osaka, F., Kawasaki, H., Aida, N., Saeki, M., Chiba, T., Kawashima, S., Tanaka, K., and Kato, S. (1998). A new nedd8-ligating system for cullin-4a. Genes & Development, 12(15):2263–2268.

Osaka, F., Saeki, M., Katayama, S., Aida, N., Toh-e, A., Kominami, K.-i., Toda, T., Suzuki, T., Chiba, T., Tanaka, K., et al. (2000). Covalent modifier nedd8 is essential for scf ubiquitin-ligase in fission yeast. The Embo Journal, 19(13):3475–3484.

Pandita, A., Kumar, B., Manvati, S., Vaishnavi, S., Singh, S. K., and Bamezai, R. N. (2014). Synergistic combination of gemcitabine and dietary molecule induces apoptosis in pancreatic cancer cells and down regulates pkm2 expression. Plos One, 9(9):e107154.

Parnell, K. M., Foulks, J. M., Nix, R. N., Clifford, A., Bullough, J., Luo, B., Senina, A., Vollmer, D., Liu, J., McCarthy, V., et al. (2013). Pharmacologic activation of pkm2 slows lung tumor xenograft growth. Molecular Cancer Therapeutics, 12(8):1453–1460.

Pavlou, A. K. and Reichert, J. M. (2004). Recombinant protein therapeutics–success rates, market trends and values to 2010. Nature Biotechnology, 22(12):1513.

Pereira, D. A. and Williams, J. A. (2007). Origin and evolution of high throughput screening. Br J Pharmacol, 152(1):53–61.

Petermann, I., Mayer, C., Stypmann, J., Biniossek, M. L., Tobin, D. J., Engelen, M. A., Dandekar, T., Grune, T., Schild, L., Peters, C., et al. (2006). Lysosomal, cytoskeletal, and metabolic alterations in cardiomyopathy of cathepsin l knockout mice. The Faseb Journal, 20(8):1266–1268.

Petraroli, A., Squeglia, V., Di Paola, N., Barbarino, A., Bova, M., Spanò, R., Marone, G., and Triggiani, M. (2015). Home therapy with plasma-derived c1 inhibitor: a strategy to improve clinical outcomes and costs in hereditary angioedema. International Archives Of Allergy And Immunology, 166(4):259–266.

Pinto, D. J., Orwat, M. J., Smith II, L. M., Quan, M. L., Lam, P. Y., Rossi, K. A., Apedo, A., Bozarth, J. M., Wu, Y., Zheng, J. J., et al. (2017). Discovery of a parenteral small molecule coagulation factor xia inhibitor clinical candidate (bms-962212). Journal Of Medicinal Chemistry.

Proctor, R. R. and Rapaport, S. I. (1961). The partial thromboplastin time with kaolin: a simple screening test for first stage plasma clotting factor deficiencies. American Journal Of Clinical Pathology, 36(3):212–219.

Quan, M. L., Wong, P. C., Wang, C., Woerner, F., Smallheer, J. M., Barbera, F. A., Bozarth, J. M., Brown, R. L., Harpel, M. R., Luettgen, J. M., et al. (2014). Tetrahydroquinoline derivatives as potent and selective factor xia inhibitors. Journal Of Medicinal Chemistry, 57(3):955–969.

Randic, M. (1975). Characterization of molecular branching. Journal Of The American Chemical Society, 97(23):6609–6615.

Ratnoff, O. D. and Colopy, J. E. (1955). A familial hemorrhagic trait associated with a deficiency of a clot-promoting fraction of plasma 1. Journal Of Clinical Investigation, 34(4):602–613.

Ratnoff, O. D. and Lepow, I. H. (1957). Some properties of an esterase derived from preparations of the first component of complement. Journal Of Experimental Medicine, 106(2):327–343.

Rawlings, N. D., Barrett, A. J., and Finn, R. (2015). Twenty years of the merops database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Research, 44(D1):D343–D350.

Renné, T., Pozgajová, M., Grüner, S., Schuh, K., Pauer, H.-U., Burfeind, P., Gailani, D., and Nieswandt, B. (2005). Defective thrombus formation in mice lacking coagulation factor xii. The Journal Of Experimental Medicine, 202(2):271–281.

Ricklin, D., Hajishengallis, G., Yang, K., and Lambris, J. D. (2010). Complement: a key system for immune surveillance and homeostasis. Nature Immunology, 11(9):785–797.

Rosenthal, R. L., Dreskin, O. H., and Rosenthal, N. (1953). New hemophilia-like disease caused by deficiency of a third plasma thromboplastin factor. Proceedings Of The Society For Experimental Biology And Medicine, 82(1):171–174.

Rosenthal, R. L., Dreskin, O. H., and Rosenthal, N. (1955). Plasma thromboplastin antecedent (pta) deficiency: clinical, coagulation, therapeutic and hereditary aspects of a new hemophilia-like disease. Blood, 10(2):120–131.

Saito, H., Goldsmith, G. H., Moroi, M., and Aoki, N. (1979). Inhibitory spectrum of alpha 2-plasmin inhibitor. Proceedings Of The National Academy Of Sciences, 76(4):2013–2017.

Salvesen, G. (2012). Dose response confirmation of uhts inhibitor hits of sentrin-specific protease 8 using nedd8 protein substrate.

Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. Ibm Journal Of Research And Development, 3(3):210–229.

Schneider, G. and Fechner, U. (2005). Computer-based de novo design of drug-like molecules. Nature Reviews Drug Discovery, 4(8):649.

Schneider, J., Neu, K., Velcovsky, H.-G., Morr, H., and Eigenbrodt, E. (2003). Tumor m2-pyruvate kinase in the follow-up of inoperable lung cancer patients: a pilot study. Cancer Letters, 193(1):91–98.

Schneider, J. and Schulze, G. (2003). Comparison of tumor m2-pyruvate kinase (tumor m2-pk), carcinoembryonic antigen (cea), carbohydrate antigens ca 19-9 and ca 72-4 in the diagnosis of gastrointestinal cancer. Anticancer Research, 23(6D):5089–5093.

Schölkopf, B. (2001). The kernel trick for distances. In Advances In Neural Information Processing Systems, pages 301–307.

Schreiber, S. L. (2000). Target-oriented and diversity-oriented organic synthesis in drug discovery. Science, 287(5460):1964–1969.

Scott, C. F., Carrell, R. W., Glaser, C. B., Kueppers, F., Lewis, J. H., and Colman, R. W. (1986). Alpha-1-antitrypsin-pittsburgh. a potent inhibitor of human plasma factor xia, kallikrein, and factor xiif. Journal Of Clinical Investigation, 77(2):631.

Scrucca, L. (2013). Ga: A package for genetic algorithms in r. Journal Of Statistical Software, 53(4):1–37.

Seifert, M. H., Wolf, K., and Vitt, D. (2003). Virtual high-throughput in silico screening. Biosilico, 1(4):143–149.

Seligsohn, U., Bolton-Maggs, P., Lee, C., and Berntorp, E. (2005). Textbook of haemophilia. Factor Xi Deficiency, pages 321–7.

Shi, J., Rose, E., Hussain, S., Tom, S., Strober, W., Sloan, S. R., Parry, G., and Stagliano, N. (2013). Tnt009, a classical complement pathway specific inhibitor, prevents complement dependent hemolysis induced by cold agglutinin disease patient autoantibodies. Blood, 122(21):42–42.

Sigerist, H. E. (1987). A history of medicine: Early Greek, Hindu, and Persian medicine. Number 27. New York: Oxford University Press.

Silverberg, M., Dunn, J., Garen, L., and Kaplan, A. P. (1980). Autoactivation of human hageman factor. demonstration utilizing a synthetic substrate. Journal Of Biological Chemistry, 255(15):7281–7286.

Sim, R. B., Arlaud, G. J., and Colomb, M. G. (1979). C1 inhibitor-dependent dissociation of human complement component c1 bound to immune complexes. Biochemical Journal, 179(3):449–457.

Simmons, G., Gosalia, D. N., Rennekamp, A. J., Reeves, J. D., Diamond, S. L., and Bates, P. (2005). Inhibitors of cathepsin l prevent severe acute respiratory syndrome coronavirus entry. Proceedings Of The National Academy Of Sciences Of The United States Of America, 102(33):11876–11881.

Sinko, W., Lindert, S., and McCammon, J. A. (2013). Accounting for receptor flexibility and enhanced sampling methods in computer-aided drug design. Chem Biol Drug Des, 81(1):41–9.

Sliwoski, G., Kothiwale, S., Meiler, J., and Lowe, E. W. (2014). Computational methods in drug discovery. Pharmacological Reviews, 66(1):334–395.

Smith, S. M. and Gottesman, M. (1989). Activity and deletion analysis of recombinant human cathepsin l expressed in escherichia coli. Journal Of Biological Chemistry, 264(34):20487–20495.

Soucy, T. A., Smith, P. G., Milhollen, M. A., Berger, A. J., Gavin, J. M., Adhikari, S., Brownell, J. E., Burke, K. E., Cardin, D. P., Critchley, S., et al. (2009). An inhibitor of nedd8-activating enzyme as a new approach to treat cancer. Nature, 458(7239):732–736.

Sousa, S. F., Fernandes, P. A., and Ramos, M. J. (2006). Protein–ligand docking: current status and future challenges. Proteins: Structure, Function, And Bioinformatics, 65(1):15–26.

Spira, D., Stypmann, J., Tobin, D. J., Petermann, I., Mayer, C., Hagemann, S., Vasiljeva, O., Günther, T., Schüle, R., Peters, C., et al. (2007). Cell type-specific functions of the lysosomal protease cathepsin l in the heart. Journal Of Biological Chemistry, 282(51):37045–37052.

Spoden, G. A., Mazurek, S., Morandell, D., Bacher, N., Ausserlechner, M. J., Jansen-Dürr, P., Eigenbrodt, E., and Zwerschke, W. (2008). Isotype-specific inhibitors of the glycolytic key regulator pyruvate kinase subtype m2 moderately decelerate tumor cell proliferation. International Journal Of Cancer, 123(2):312–321.

Sterling, T. and Irwin, J. J. (2015). Zinc 15–ligand discovery for everyone. Journal Of Chemical Information And Modeling, 55(11):2324.

Stumpfe, D., Ripphausen, P., and Bajorath, J. (2012). Virtual compound screening in drug discovery. Future Medicinal Chemistry, 4(5):593–602.

Stypmann, J., Gläser, K., Roth, W., Tobin, D. J., Petermann, I., Matthias, R., Mönnig, G., Haverkamp, W., Breithardt, G., Schmahl, W., et al. (2002). Dilated cardiomyopathy in mice deficient for the lysosomal cysteine peptidase cathepsin l. Proceedings Of The National Academy Of Sciences, 99(9):6234–6239.

Subasinghe, N. L., Ali, F., Illig, C. R., Rudolph, M. J., Klein, S., Khalil, E., Soll, R. M., Bone, R. F., Spurlino, J. C., DesJarlais, R. L., et al. (2004). A novel series of potent and selective small molecule inhibitors of the complement component c1s. Bioorganic & Medicinal Chemistry Letters, 14(12):3043–3047.

Subasinghe, N. L., Khalil, E., Travins, J. M., Ali, F., Ballentine, S. K., Hufnagel, H. R., Pan, W., Leonard, K., Bone, R. F., Soll, R. M., et al. (2012). Design and synthesis of polyethylene glycol-modified biphenylsulfonyl-thiophene-carboxamidine inhibitors of the complement component c1s. Bioorganic & Medicinal Chemistry Letters, 22(16):5303–5307.

Subasinghe, N. L., Travins, J. M., Ali, F., Huang, H., Ballentine, S. K., Marugán, J. J., Khalil, E., Hufnagel, H. R., Bone, R. F., DesJarlais, R. L., et al. (2006). A novel series of arylsulfonylthiophene-2-carboxamidine inhibitors of the complement component c1s. Bioorganic & Medicinal Chemistry Letters, 16(8):2200–2204.

Sullivan, S., Tosetto, M., Kevans, D., Coss, A., Wang, L., O’donoghue, D., Hyland, J., Sheahan, K., Mulcahy, H., and O’sullivan, J. (2009). Localization of nuclear cathepsin l and its association with disease progression and poor outcome in colorectal cancer. International Journal Of Cancer, 125(1):54–61.

Swords, R., Watts, J., Erba, H., Altman, J., Maris, M., Anwer, F., Hua, Z., Stein, H., Faessel, H., Sedarati, F., et al. (2017). Expanded safety analysis of pevonedistat, a first-in-class nedd8-activating enzyme inhibitor, in patients with acute myeloid leukemia and myelodysplastic syndromes. Blood Cancer Journal, 7(2):e520.

Swords, R. T., Erba, H. P., DeAngelo, D. J., Bixby, D. L., Altman, J. K., Maris, M., Hua, Z., Blakemore, S. J., Faessel, H., Sedarati, F., et al. (2015). Pevonedistat (mln4924), a first-in-class nedd8-activating enzyme inhibitor, in patients with acute myeloid leukaemia and myelodysplastic syndromes: a phase 1 study. British Journal Of Haematology, 169(4):534–543.

Szalai, A. J., Digerness, S. B., Agrawal, A., Kearney, J. F., Bucy, R. P., Niwas, S., Kilpatrick, J. M., Babu, Y. S., and Volanakis, J. E. (2000). The arthus reaction in rodents: species-specific requirement of complement. The Journal Of Immunology, 164(1):463–468.

Takenaka, M., Noguchi, T., Sadahiro, S., Hirai, H., Yamada, K., Matsuda, T., Imai, E., and Tanaka, T. (1991). Isolation and characterization of the human pyruvate kinase m gene. The Febs Journal, 198(1):101–106.

Tateishi, K., Omata, M., Tanaka, K., and Chiba, T. (2001). The nedd8 system is essential for cell cycle progression and morphogenetic pathway in mice. The Journal Of Cell Biology, 155(4):571–580.

Thompson, D. (2001). Solid-phase Synthesis and Combinatorial Technologies, volume 18.

Todeschini, R. and Consonni, V. (2008). Handbook of molecular descriptors, volume 11. John Wiley & Sons.

Todeschini, R. and Consonni, V. (2009). Molecular Descriptors for Chemoinformatics, Volume 41 (2 Volume Set), volume 41. John Wiley & Sons.

Tong, S., Si, Y., Yu, H., Zhang, L., Xie, P., and Jiang, W. (2017). Mln4924 (pevonedistat), a protein neddylation inhibitor, suppresses proliferation and migration of human clear cell renal cell carcinoma. Scientific Reports, 7(1):5599.

Travins, J. M., Ali, F., Huang, H., Ballentine, S. K., Khalil, E., Hufnagel, H. R., Pan, W., Gushue, J., Leonard, K., Bone, R. F., et al. (2008). Biphenylsulfonyl-thiophene-carboxamidine inhibitors of the complement component c1s. Bioorganic & Medicinal Chemistry Letters, 18(5):1603–1606.

Trialists, C. T. et al. (2005). Efficacy and safety of cholesterol-lowering treatment: prospective meta-analysis of data from 90 056 participants in 14 randomised trials of statins. The Lancet, 366(9493):1267–1278.

Turk, B., Dolenc, I., Lenarčič, B., Križaj, I., Turk, V., Bieth, J. G., and Björk, I. (1999). Acidic ph as a physiological regulator of human cathepsin l activity. The Febs Journal, 259(3):926–932.

Turk, B., Dolenc, I., Turk, V., and Bieth, J. G. (1993). Kinetics of the ph-induced inactivation of human cathepsin l. Biochemistry, 32(1):375–380.

Turk, B., Turk, D., and Turk, V. (2000). Lysosomal cysteine proteases: more than scavengers. Biochimica Et Biophysica Acta (bba)-protein Structure And Molecular Enzymology, 1477(1-2):98–111.

Valler, M. J. and Green, D. (2000). Diversity screening versus focussed screening in drug discovery. Drug Discovery Today, 5(7):286–293.

Van Damme, P. (2016). Long-term protection after hepatitis b vaccine.

Van Drie, J. H. (2007). Computer-aided drug design: the next 20 years. J Comput Aided Mol Des, 21(10-11):591–601.

Van Vlasselaer, P., Parry, G., Stagliano, N., and Panicker, S. (2017). Anti-complement c1s antibodies and methods of inhibiting complement c1s activity.

Vander Heiden, M. G. (2010). Secondary assay for activators of human pyruvate kinase m2 isoform.

Vander Heiden, M. G. (2012). Confirmation concentration-response assay for activators of human muscle isoform 2 pyruvate kinase: for probe sar. Electronically.

Vander Heiden, M. G., Cantley, L. C., and Thompson, C. B. (2009). Understanding the warburg effect: the metabolic requirements of cell proliferation. Science, 324(5930):1029–1033.

Vander Heiden, M. G., Christofk, H. R., Schuman, E., Subtelny, A. O., Sharfi, H., Harlow, E. E., Xian, J., and Cantley, L. C. (2010). Identification of small molecule inhibitors of pyruvate kinase m2. Biochemical Pharmacology, 79(8):1118–1124.

Visco, D. P., Jr., Pophale, R. S., Rintoul, M. D., and Faulon, J. L. (2002). Developing a methodology for an inverse quantitative structure-activity relationship using the signature molecular descriptor. J Mol Graph Model, 20(6):429–38.

Vulpetti, A., Randl, S., Rudisser, S., Ostermann, N., Erbel, P., Mac Sweeney, A., Zoller, T., Salem, B., Gerhartz, B., Cumin, F., et al. (2017). Structure-based library design and fragment screening for the identification of reversible complement factor d protease inhibitors. Journal Of Medicinal Chemistry, 60(5):1946–1958.

Wallis, R., Mitchell, D. A., Schmid, R., Schwaeble, W. J., and Keeble, A. H. (2010). Paths reunited: Initiation of the classical and lectin pathways of complement activation. Immunobiology, 215(1):1–11.

Walters, W. P., Stahl, M. T., and Murcko, M. A. (1998). Virtual screening–an overview. Drug Discovery Today, 3(4):160–178.

Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A., and Case, D. A. (2004). Development and testing of a general amber force field. J Comput Chem, 25(9):1157–74.

Wang, Y., Luo, Z., Pan, Y., Wang, W., Zhou, X., Jeong, L. S., Chu, Y., Liu, J., and Jia, L. (2015). Targeting protein neddylation with an nedd8-activating enzyme inhibitor mln4924 induced apoptosis or senescence in human lymphoma cells. Cancer Biology & Therapy, 16(3):420–429.

Wang, Y., Suzek, T., Zhang, J., Wang, J., He, S., Cheng, T., Shoemaker, B. A., Gindulyte, A., and Bryant, S. H. (2014). Pubchem bioassay: 2014 update. Nucleic Acids Res, 42(Database issue):D1075–82.

Wang, Z., Jeon, H. Y., Rigo, F., Bennett, C. F., and Krainer, A. R. (2012). Manipulation of pk-m mutually exclusive alternative splicing by antisense oligonucleotides. Open Biology, 2(10):120133.

Warburg, O. (1956). On the origin of cancer. Science, 123(3191):309–314.

Warburg, O. H. (1926). Über den Stoffwechsel der Tumoren. J. Springer.

Watson, I., Li, B., Roche, O., Blanch, A., Ohh, M., and Irwin, M. (2010). Chemotherapy induces nedp1-mediated destabilization of mdm2. Oncogene, 29(2):297–304.

Watson, I. R., Blanch, A., Lin, D. C., Ohh, M., and Irwin, M. S. (2006). Mdm2-mediated nedd8 modification of tap73 regulates its transactivation function. Journal Of Biological Chemistry, 281(45):34096–34103.

Weis, D. C., Faulon, J. L., LeBorne, R. C., and Visco, D. P. (2005). The signature molecular descriptor. 5. the design of hydrofluoroether foam blowing agents using inverse-qsar. Industrial & Engineering Chemistry Research, 44(23):8883–8891.

Weis, D. C. and Visco, D. P., Jr. (2010). Computer-aided molecular design using the signature molecular descriptor: Application to solvent selection. Computers & Chemical Engineering, 34(7):1018–1029.

Weis, D. C., Visco, D. P., Jr., and Faulon, J. L. (2008). Data mining pubchem using a support vector machine with the signature molecular descriptor: classification of factor xia inhibitors. J Mol Graph Model, 27(4):466–75.

Whitley, D. (1994). A genetic algorithm tutorial. Statistics And Computing, 4(2):65–85.

Willett, P. (2006). Similarity-based virtual screening using 2d fingerprints. Drug Discovery Today, 11(23-24):1046–1053.

Wilson, D. A., Bork, K., Shea, E. P., Rentz, A. M., Blaustein, M. B., and Pullman, W. E. (2010). Economic costs associated with acute attacks and long-term management of hereditary angioedema. Annals Of Allergy, Asthma & Immunology, 104(4):314–320.

Wishart, D. S., Knox, C., Guo, A. C., Shrivastava, S., Hassanali, M., Stothard, P., Chang, Z., and Woolsey, J. (2006). Drugbank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res, 34(Database issue):D668–72.

Wold, S., Sjöström, M., and Eriksson, L. (2001). Pls-regression: a basic tool of chemometrics. Chemometrics And Intelligent Laboratory Systems, 58(2):109–130.

Wong, C. C.-L., Au, S. L.-K., Tse, A. P.-W., Xu, I. M.-J., Lai, R. K.-H., Chiu, D. K.-C., Wei, L. L., Fan, D. N.-Y., Tsang, F. H.-C., Lo, R. C.-L., et al. (2014). Switching of pyruvate kinase isoform l to m2 promotes metabolic reprogramming in hepatocarcinogenesis. Plos One, 9(12):e115036.

Wong, C. F. and McCammon, J. A. (2003). Protein flexibility and computer-aided drug design. Annu Rev Pharmacol Toxicol, 43:31–45.

Wu, K., Yamoah, K., Dolios, G., Gan-Erdene, T., Tan, P., Chen, A., Lee, C.-g., Wei, N., Wilkinson, K. D., Wang, R., et al. (2003). Den1 is a dual function protease capable of processing the c terminus of nedd8 and deconjugating hyper-neddylated cul1. Journal Of Biological Chemistry, 278(31):28882–28891.

Xirodimas, D. P., Saville, M. K., Bourdon, J.-C., Hay, R. T., and Lane, D. P. (2004). Mdm2-mediated nedd8 conjugation of p53 inhibits its transcriptional activity. Cell, 118(1):83–97.

Xu, Y., Liu, X.-H., Saunders, M., Pearce, S., Foulks, J. M., Parnell, K. M., Clifford, A., Nix, R. N., Bullough, J., Hendrickson, T. F., et al. (2014). Discovery of 3-(trifluoromethyl)-1h-pyrazole-5-carboxamide activators of the m2 isoform of pyruvate kinase (pkm2). Bioorganic & Medicinal Chemistry Letters, 24(2):515–519.

Yacovan, A., Ozeri, R., Kehat, T., Mirilashvili, S., Sherman, D., Aizikovich, A., Shitrit, A., Ben-Zeev, E., Schutz, N., Bohana-Kashtan, O., et al. (2012). 1-(sulfonyl)-5-(arylsulfonyl) indoline as activators of the tumor cell specific m2 isoform of pyruvate kinase. Bioorganic & Medicinal Chemistry Letters, 22(20):6460–6468.

Yan, S. F., King, F. J., He, Y., Caldwell, J. S., and Zhou, Y. (2006). Learning from the data: mining of large high-throughput screening databases. J Chem Inf Model, 46(6):2381–95.

Yang, D. T., Flanders, M. M., Kim, H., and Rodgers, G. M. (2006). Elevated factor xi activity levels are associated with an increased odds ratio for cerebrovascular events. American Journal Of Clinical Pathology, 126(3):411–415.

Yang, M., Zhang, Y., Pan, J., Sun, J., Liu, J., Libby, P., Sukhova, G. K., Doria, A., Katunuma, N., Peroni, O. D., et al. (2007). Cathepsin l activity controls adipogenesis and glucose tolerance. Nature Cell Biology, 9(8):970.

Yang, N. J. and Hinner, M. J. (2015). Getting across the cell membrane: an overview for small molecules, peptides, and proteins, pages 29–53. Springer.

Yang, Z. and Cox, J. L. (2007). Cathepsin l increases invasion and migration of b16 melanoma. Cancer Cell International, 7(1):8.

Yoo, B. C., Ku, J.-L., Hong, S.-H., Shin, Y.-K., Park, S. Y., Kim, H. K., and Park, J.-G. (2004). Decreased pyruvate kinase m2 activity linked to cisplatin resistance in human gastric carcinoma cell lines. International Journal Of Cancer, 108(4):532–539.

Zeng, H. and Wu, X. (2015). Alzheimers disease drug development based on computer-aided drug design. European Journal Of Medicinal Chemistry.

Zhang, J., Bai, D., Ma, X., Guan, J., and Zheng, X. (2014). hcinap is a novel regulator of ribosomal protein-hdm2-p53 pathway by controlling neddylation of ribosomal protein s14. Oncogene, 33(2):246–254.

Zhong, H.-J., Ma, V. P.-Y., Cheng, Z., Chan, D. S.-H., He, H.-Z., Leung, K.-H., Ma, D.-L., and Leung, C.-H. (2012). Discovery of a natural product inhibitor targeting protein neddylation by structure-based virtual screening. Biochimie, 94(11):2457–2460.

Zhou, L. and Watts, F. Z. (2005). Nep1, a schizosaccharomyces pombe deneddylating enzyme. Biochemical Journal, 389(2):307–314.

Ziccardi, R. (1981). Activation of the early components of the classical complement pathway under physiologic conditions. The Journal Of Immunology, 126(5):1769–1773.

Ziccardi, R. (1985). Demonstration of the interaction of native c1 with monomeric immunoglobulins and c1 inhibitor. The Journal Of Immunology, 134(4):2559–2563.

Ziccardi, R. J. (1983). The first component of human complement (c1): activation and control. In Springer Seminars In Immunopathology, volume 6, pages 213–230. Springer.

Ziccardi, R. J. and Cooper, N. R. (1979). Active disassembly of the first complement component, c1, by c1 inactivator. The Journal Of Immunology, 123(2):788–792.

Zuo, W., Huang, F., Chiang, Y. J., Li, M., Du, J., Ding, Y., Zhang, T., Lee, H. W., Jeong, L. S., Chen, Y., et al. (2013). c-cbl-mediated neddylation antagonizes ubiquitination and degradation of the tgf-β type ii receptor. Molecular Cell, 49(3):499–510.

APPENDICES

APPENDIX A ALGORITHM SPECIFICATIONS AND IMPLEMENTATION SCRIPTS

After the best vHTS models were identified during training, each model’s linear equation was extracted and implemented in a Linux environment to maximize flexibility in data input/output and use of the available computational resources. NCBI’s PubChem Compound Database (circa May 2014, containing about 72 million compounds) was screened in parallel with the same models on different nodes to distribute the computational work and increase efficiency.

All screening and modeling work was conducted on computers with dual Intel Xeon processors, either E5-2697W (3.10GHz, 48 independent threads) or E5-2687W (3.10GHz, 32 independent threads). Each independent thread was used to run iterations of model generation with different initial conditions. Additionally, vHTS screening was conducted by splitting the PubChem Compound database (72 million compounds) into 32 subsets, with each subset screened on its own thread.

PCA-GA-SVM models were created and trained with the R Statistical Software: PCA used the “eigen” function, GA used the “ga” function in the “GA” package (Scrucca, 2013), and SVM used the “ksvm” function in the “kernlab” package (Karatzoglou et al., 2004). Parameters for GA were as follows: elitism rate=0.7, crossover rate=0.8, mutation rate=0.1, population size=1000, max iterations=1000, stop after 100 iterations of no improvement. Parameters for SVM were as follows: cost ranges from 0.01 to 1 with step size 0.01, 10 fold cross validation, ν=0.2, linear kernel.
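As a rough illustration of the partitioning and cost grid described above, the following R sketch splits a stand-in list of compound identifiers into 32 near-equal subsets and enumerates the candidate cost values. The identifiers, the variable names, and the contiguous chunking rule are assumptions made for illustration only; the text does not state how the 32 subsets were formed.

```r
# Hypothetical sketch: names and chunking rule are illustrative assumptions.
n.threads <- 32                  # one subset per thread, as stated in the text
compound.ids <- seq_len(1000)    # stand-in for the ~72 million PubChem CIDs

# Assign each compound to one of n.threads contiguous, near-equal chunks
chunk.id <- ceiling(seq_along(compound.ids) * n.threads / length(compound.ids))
subsets <- split(compound.ids, chunk.id)

# Candidate SVM cost values: 0.01 to 1 with step size 0.01
cost.grid <- seq(0.01, 1, by = 0.01)

cat("subsets:", length(subsets), " cost values:", length(cost.grid), "\n")
```

Each element of `subsets` could then be handed to a separate process. Contiguous chunks keep the bookkeeping simple, although any rule that yields disjoint subsets covering the whole database would serve the same purpose.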

1.1 ZINC-15 PAINS Filter

#!/usr/bin/env perl
use strict;
use warnings;

my $linecount;
my @wc;
my @datasets=`ls *.smiles`;
chomp(@datasets);
foreach my $dataset(@datasets){
    @wc=split(/\s+/,`wc $dataset`);
    if ($wc[1] > 1000){
        system("split -l 1000 $dataset $dataset.");
        my @subsets=`ls $dataset.*`;
        chomp(@subsets);
        foreach my $subset(@subsets){
            # print "curl http://zinc15.docking.org/patterns/apps/checker/ -F upload=\@$subset -F pains=y -F aggregators=y -F output_format=csv | tee $subset.result";
            system("curl http://zinc15.docking.org/patterns/apps/checker/ -F upload=\@$subset -F pains=y -F aggregators=y -F output_format=csv | tee $subset.result");
        }
    } else {
        # print "curl http://zinc15.docking.org/patterns/apps/checker/ -F upload=\@$dataset -F pains=y -F aggregators=y -F output_format=csv | tee $dataset.result";
        system("curl http://zinc15.docking.org/patterns/apps/checker/ -F upload=\@$dataset -F pains=y -F aggregators=y -F output_format=csv | tee $dataset.result");
        # print "@wc";
        # print "$linecount\n";
    }
}

1.2 Data Scaling and Normalization

#!/usr/bin/env perl
use strict;
use warnings;

# Last modified: 07/22/2015
# Description: Gets mean and st. dev. used to scale response variables in R.

@ARGV>=1||die "\n\n\nUSAGE: perl scale.pl <file>\n\n\n";

open(DATA,"<$ARGV[0]")||die "\n\n\nCould not open file $ARGV[0].\n\n\n";
open(MEAN,">${ARGV[0]}_mean");
open(SDEV,">${ARGV[0]}_sdev");

my $value;
my $mean_sum=0;
my $sdev_sum=0;
my $mean=0;
my $sdev=0;
my $length;

my @data = <DATA>;
chomp(@data);
foreach $value (@data){
    $mean_sum = $mean_sum + $value;
}
$length=@data;
$mean = $mean_sum/$length;
foreach $value (@data){
    $sdev_sum = $sdev_sum + ($value - $mean)**2;
}
$sdev = ($sdev_sum / ($length - 1)) ** 0.5;
print SDEV "$sdev";
print MEAN "$mean";
close(SDEV);
close(MEAN);
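The script above writes a sample mean and a sample standard deviation (with an n − 1 denominator) to two files. How those values were consumed in R is not shown in this section, so the following is only a minimal sketch that assumes a conventional z-score transformation. The response values and variable names are invented for illustration.

```r
# Illustrative values only; not data from the dissertation
response <- c(2.0, 4.0, 4.0, 5.0, 7.0, 8.0)

resp.mean <- mean(response)  # same quantity the Perl script writes to the mean file
resp.sdev <- sd(response)    # sample standard deviation (n - 1 denominator)

# Standardize to zero mean and unit standard deviation
response.scaled <- (response - resp.mean) / resp.sdev

# Invert the scaling to return values to their original units
response.restored <- response.scaled * resp.sdev + resp.mean
```

Keeping the mean and standard deviation on disk, as the Perl script does, would allow predictions made on the scaled response to be mapped back to activity units with the last line.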

1.3 Principal Component Analysis

#!/usr/bin/env Rscript

# SVM portion
desmatrix0 <- as.matrix(read.table("trainingset/HCdesmatrix-int0"))
desmatrix1 <- as.matrix(read.table("trainingset/HCdesmatrix-int1"))
desmatrix2 <- as.matrix(read.table("trainingset/HCdesmatrix-int2"))
desmatrix <- cbind(desmatrix0,desmatrix1,desmatrix2)
colnames(desmatrix) <- seq(1,ncol(desmatrix))
#response <- as.matrix(read.table("compound_class"))

# Find eigenvalues and vectors for PCA with covariance matrix
eigen_results <- eigen(cov(desmatrix))
eigenvalues <- eigen_results$values
eigenvalues.identified <- which(eigenvalues>1)
eigenvalues.filtered <- eigenvalues[eigenvalues.identified]
loadings.filtered <- abs(eigen_results$vectors[,eigenvalues.identified]) %*% eigenvalues.filtered

# Save most relevant SVM features
svm.pca.features <- which(loadings.filtered>1)
save(svm.pca.features, file="svm.pca.features")

# Clear workspace to reset everything
rm(list=ls())

# SVR portion
desmatrix0 <- as.matrix(read.table("activity/HCdesmatrix-int0"))
desmatrix1 <- as.matrix(read.table("activity/HCdesmatrix-int1"))
desmatrix2 <- as.matrix(read.table("activity/HCdesmatrix-int2"))
desmatrix <- cbind(desmatrix0,desmatrix1,desmatrix2)
colnames(desmatrix) <- seq(1,ncol(desmatrix))
#response <- log(as.matrix(read.table("activity.value")),base=10)

# Find eigenvalues and vectors for PCA with covariance matrix
eigen_results <- eigen(cov(desmatrix))
eigenvalues <- eigen_results$values
eigenvalues.identified <- which(eigenvalues>1)
eigenvalues.filtered <- eigenvalues[eigenvalues.identified]
loadings.filtered <- abs(eigen_results$vectors[,eigenvalues.identified]) %*% eigenvalues.filtered

# Save most relevant SVM features
svr.pca.features <- which(loadings.filtered>1)
save(svr.pca.features, file="svr.pca.features")

1.4 GA/SVM-C Models Creation

#!/usr/bin/env Rscript

# Time script
time.start <- proc.time()
library(GA)
library(kernlab)

# Read in data
height0 <- read.table("/path/to/HCdesmatrix-int0")
height1 <- read.table("/path/to/HCdesmatrix-int1")
height2 <- read.table("/path/to/HCdesmatrix-int2")
desmatrix.raw <- cbind(height0,height1,height2)
colnames(desmatrix.raw) <- seq(ncol(desmatrix.raw))
desmatrix <- as.matrix(desmatrix.raw)
load("/path/to/svm.pca.features")
desmatrix <- desmatrix[,svm.pca.features]
response.raw <- read.table("/path/to/compound_class")
response <- as.matrix(response.raw)

# Variables to control GA
cost.min <- 0.01
cost.max <- 1
cost.step <- 0.01
elitism.rate <- 0.7
crossover.chance <- 0.8
mutation.chance <- 0.1
features.total <- ncol(desmatrix)
population.size <- 1000
iter.max <- 10000
run <- 100
set.seed(0)

# Create initial population and other things to input into GA
initpop <- matrix(as.double(NA),nrow=population.size,ncol=features.total)
for (count in 1:population.size){
    pop <- rep(0,features.total)
    ones <- sample(features.total,sample(features.total,1))
    pop[ones] <- 1
    initpop[count,] <- pop
}
save(initpop,file="/path/to/svm_initpop.RData")
load("/path/to/svm_initpop.RData")
print("Initial Population Created!")
cost <- seq(cost.min,cost.max,cost.step)

# Variables to keep track of
svm.min <- list()
cross.min <- 100000
error.min <- 100000
count <- 1
svm.cost <- vector()
svm.rng <- list()

# Create model
selection <- function(string,cost){
    test.matrix <- desmatrix[,which(string==1)]
    leave.one.out <- nrow(desmatrix)
    random.cost <- sample(cost,1)
    current.rng.state <- get(".Random.seed", .GlobalEnv)
    svm.model <- ksvm(test.matrix,scaled=F,response,type="C-svc",kernel="vanilladot",C=random.cost,nu=0.2,cross=10)
    if(svm.model@cross == cross.min){
        if(svm.model@error == error.min){
            count <<- count+1
            svm.cost[count] <<- random.cost
            svm.min[[count]] <<- svm.model
            svm.rng[[count]] <<- current.rng.state
            save(svm.min, file="/path/to/svm.min")
            save(svm.rng, file="/path/to/svm.rng")
        }
        if(svm.model@error < error.min){
            count <<- 1
            svm.min <<- list()
            svm.rng <<- list()
            svm.cost <<- vector()
            svm.cost[count] <<- random.cost
            svm.min[[count]] <<- svm.model
            svm.rng[[count]] <<- current.rng.state
            error.min <<- svm.model@error
            save(svm.min, file="/path/to/svm.min")
            save(svm.rng, file="/path/to/svm.rng")
        }
    }
    if(svm.model@cross < cross.min){
        count <<- 1
        svm.min <<- list()
        svm.rng <<- list()
        cross.min <<- svm.model@cross
        svm.cost <<- vector()
        svm.cost[count] <<- random.cost
        svm.min[[count]] <<- svm.model
        svm.rng[[count]] <<- current.rng.state
        error.min <<- svm.model@error
        save(svm.min, file="/path/to/svm.min")
        save(svm.rng, file="/path/to/svm.rng")
    }
    return(-svm.model@cross)
}

# Genetic Algorithm Implementation
GAmodel <- ga(type="binary",cost=cost,fitness=selection,nBits=features.total,
    popSize=population.size,suggestions=initpop,pcrossover=crossover.chance,
    pmutation=mutation.chance,elitism=base::max(1,round(population.size*elitism.rate)),
    maxiter=iter.max, run=run, keepBest=F, parallel=F,seed=0)
time.stop <- proc.time()
print(time.stop - time.start)
print("minimum x validation error")
print(cross.min)
print("minimum training error")
print(error.min)
print(summary(GAmodel))
for (i in 1:length(svm.min)){
    print(svm.min[[i]])
    print("Cost")
    print(svm.cost[i])
    print("RNG state")
    print(svm.rng[[i]])
    print("Support Vectors")
    print(svm.min[[i]]@SVindex)
    print("Features Used")
    print(as.numeric(gsub("X","",colnames(svm.min[[i]]@xmatrix[[1]]))))
    print("alpha.x")
    print(as.numeric(svm.min[[i]]@coef[[1]]%*%svm.min[[i]]@xmatrix[[1]]))
    print("intercept/b")
    print(svm.min[[i]]@b)
}
save(svm.min, file="/path/to/svm.min")
save(svm.rng, file="/path/to/svm.rng")
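The initial-population step of the listing above can be illustrated with a small Python sketch. It is only an approximation of the R code under stated assumptions: the function name `initial_population` is hypothetical, Python's `random` module replaces R's random number generator (so the individual chromosomes will differ from those produced with `set.seed(0)`), and each chromosome is a list of 0/1 flags, one per PCA-filtered feature.

```python
import random


def initial_population(population_size, n_features, seed=0):
    """Build binary chromosomes that each switch on a random number of features.

    For every individual, a count between 1 and n_features is drawn, then that
    many distinct feature positions are set to 1 (the rest stay 0).
    """
    rng = random.Random(seed)
    population = []
    for _ in range(population_size):
        n_ones = rng.randint(1, n_features)           # how many features to keep
        ones = rng.sample(range(n_features), n_ones)  # which features to keep
        chromosome = [0] * n_features
        for idx in ones:
            chromosome[idx] = 1
        population.append(chromosome)
    return population


if __name__ == "__main__":
    pop = initial_population(population_size=5, n_features=8)
    for chromosome in pop:
        print(chromosome, sum(chromosome))
```

Drawing the number of active features uniformly, rather than flipping each bit with probability 0.5, spreads the starting population over small and large feature subsets alike.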

1.5 GA/SVM-R Models Creation

#!/usr/bin/env Rscript

# Time script
time.start <- proc.time()
library(GA)
library(kernlab)

# Read in data
height0 <- read.table("/path/to/HCdesmatrix-int0")
height1 <- read.table("/path/to/HCdesmatrix-int1")
height2 <- read.table("/path/to/HCdesmatrix-int2")
desmatrix.raw <- cbind(height0,height1,height2)
colnames(desmatrix.raw) <- seq(ncol(desmatrix.raw))
desmatrix <- as.matrix(desmatrix.raw)
load("/path/to/svr.pca.features")
desmatrix <- desmatrix[,svr.pca.features]
response.raw <- read.table("/path/to/activity.log")
response <- as.matrix(scale(response.raw)[,])

# Variables to control GA
cost.min <- 0.01
cost.max <- 1
cost.step <- 0.01
elitism.rate <- 0.7
crossover.chance <- 0.8
mutation.chance <- 0.1
features.total <- ncol(desmatrix)
population.size <- 1000
iter.max <- 10000
run <- 100
set.seed(0)

# Create initial population and other things to input into GA
initpop <- matrix(as.double(NA),nrow=population.size,ncol=features.total)
for (count in 1:population.size){
    pop <- rep(0,features.total)
    ones <- sample(features.total,sample(features.total,1))
    pop[ones] <- 1
    initpop[count,] <- pop
}
save(initpop,file="/path/to/svr_initpop.RData")
load("/path/to/svr_initpop.RData")
print("Initial Population Created!")
cost <- seq(cost.min,cost.max,cost.step)

# Variables to keep track of
svr.min <- list()
cross.min <- 100000
error.min <- 100000
count <- 1
svr.cost <- vector()
svr.rng <- list()

# Create model
selection <- function(string,cost){
    test.matrix <- desmatrix[,which(string==1)]
    leave.one.out <- nrow(desmatrix)
    random.cost <- sample(cost,1)
    current.rng.state <- get(".Random.seed", .GlobalEnv)
    svr.model <- ksvm(test.matrix,scaled=F,response,type="nu-svr",kernel="vanilladot",C=random.cost,nu=0.2,cross=10)
    if(svr.model@cross == cross.min){
        if(svr.model@error == error.min){
            count <<- count+1
            svr.cost[count] <<- random.cost
            svr.min[[count]] <<- svr.model
            svr.rng[[count]] <<- current.rng.state
            save(svr.min, file="/path/to/jjc0/svr.min")
            save(svr.rng, file="/path/to/svr.rng")
        }
        if(svr.model@error < error.min){
            count <<- 1
            svr.min <<- list()
            svr.rng <<- list()
            svr.cost <<- vector()
            svr.cost[count] <<- random.cost
            svr.min[[count]] <<- svr.model
            svr.rng[[count]] <<- current.rng.state
            error.min <<- svr.model@error
            save(svr.min, file="/path/to/svr.min")
            save(svr.rng, file="/path/to/svr.rng")
        }
    }
    if(svr.model@cross < cross.min){
        count <<- 1
        svr.min <<- list()
        svr.rng <<- list()
        cross.min <<- svr.model@cross
        svr.cost <<- vector()
        svr.cost[count] <<- random.cost
        svr.min[[count]] <<- svr.model
        svr.rng[[count]] <<- current.rng.state
        error.min <<- svr.model@error
        save(svr.min, file="/path/to/svr.min")
        save(svr.rng, file="/path/to/svr.rng")
    }
    return(-svr.model@cross)
}

# Genetic Algorithm Implementation
GAmodel <- ga(type="binary",cost=cost,fitness=selection,nBits=features.total,
    popSize=population.size,suggestions=initpop,pcrossover=crossover.chance,
    pmutation=mutation.chance,elitism=base::max(1,round(population.size*elitism.rate)),
    maxiter=iter.max, run=run, keepBest=F, parallel=F,seed=0)
time.stop <- proc.time()
print(time.stop - time.start)
print("minimum x validation error")
print(cross.min)
print("minimum training error")
print(error.min)
print(summary(GAmodel))
for (i in 1:length(svr.min)){
    print(svr.min[[i]])
    print("Cost")
    print(svr.cost[i])
    print("RNG state")
    print(svr.rng[[i]])
    print("Support Vectors")
    print(svr.min[[i]]@SVindex)
    print("Features Used")
    print(as.numeric(gsub("X","",colnames(svr.min[[i]]@xmatrix))))
    print("alpha.x")
    print(as.numeric(svr.min[[i]]@coef%*%svr.min[[i]]@xmatrix))
    print("intercept/b")
    print(svr.min[[i]]@b)
}
save(svr.min, file="/path/to/svr.min")
save(svr.rng, file="/path/to/svr.rng")
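The bookkeeping inside the `selection` fitness function (keep every model that ties for the lowest cross-validation error and, among those, the lowest training error) can be summarized with the following Python sketch. The class name `BestModelTracker` and its attributes are hypothetical; the sketch assumes models are opaque objects and uses infinity as the starting error where the R code uses 100000.

```python
class BestModelTracker:
    """Keep all models tied for the best (cross-validation error, training error)."""

    def __init__(self):
        self.cross_min = float("inf")
        self.error_min = float("inf")
        self.best = []  # list of (model, cost) pairs that are currently tied

    def update(self, model, cost, cross_error, train_error):
        """Record a candidate and return the GA fitness (negated CV error)."""
        candidate = (cross_error, train_error)
        current = (self.cross_min, self.error_min)
        if candidate < current:
            # Strictly better: lower CV error, or equal CV error and lower
            # training error. Discard everything kept so far.
            self.cross_min, self.error_min = candidate
            self.best = [(model, cost)]
        elif candidate == current:
            # Exact tie on both errors: keep this model alongside the others.
            self.best.append((model, cost))
        return -cross_error


if __name__ == "__main__":
    tracker = BestModelTracker()
    tracker.update("model A", 0.10, cross_error=0.30, train_error=0.20)
    tracker.update("model B", 0.35, cross_error=0.30, train_error=0.20)
    print(tracker.best)
```

Comparing the two errors as a tuple reproduces the nested `if` logic of the R function: the cross-validation error decides first, and the training error only breaks ties. The fitness is negated because the genetic algorithm maximizes its fitness value.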

1.6 Create ROC Curves

#!/usr/bin/env Rscript

library(GA)
library(kernlab)
library(ROCR)

# Read in data
height0 <- read.table("/path/to/HCdesmatrix-int0")
height1 <- read.table("/path/to/HCdesmatrix-int1")
height2 <- read.table("/path/to/HCdesmatrix-int2")
desmatrix.raw <- cbind(height0,height1,height2)
colnames(desmatrix.raw) <- seq(ncol(desmatrix.raw))
desmatrix <- as.matrix(desmatrix.raw)
response.raw <- read.table("/path/to/compound_class")
response <- as.matrix(response.raw)
svm.all <- list()
pdf("svm_ROC_PAINFREE_AID_1540_v2.pdf")
load("10fold/jjc19/svm.min")
svm.all <- append(svm.all,svm.min)
load("10fold/jjc43/svm.min")
svm.all <- append(svm.all,svm.min)
testmatrix <- desmatrix[,as.numeric(gsub("X","",colnames(svm.all[[1]]@xmatrix[[1]])))]
response.pred <- predict(svm.all[[1]],testmatrix,type="decision")
svm.ROC.pred <- list()
svm.ROC.pred[[1]] <- prediction(response.pred,response)
svm.ROC.perf <- list()
svm.ROC.perf[[1]] <- performance(svm.ROC.pred[[1]],measure="tpr",x.measure="fpr")
plot(svm.ROC.perf[[1]])
if (length(svm.all)>1){
    for (i in 2:length(svm.all)){
        testmatrix <- desmatrix[,as.numeric(gsub("X","",colnames(svm.all[[i]]@xmatrix[[1]])))]
        response.pred=predict(svm.all[[i]],testmatrix,type="decision")
        svm.ROC.pred[[i]] <- prediction(response.pred,response)
        svm.ROC.perf[[i]] <- performance(svm.ROC.pred[[i]],measure="tpr",x.measure="fpr")
        plot(svm.ROC.perf[[i]],add=T)
    }
}
save(svm.ROC.pred,file="svm.ROC.pred")
save(svm.ROC.perf,file="svm.ROC.perf")
dev.off()
desmatrix <- desmatrix[,scan(text="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 49 50 51 52 53 54 55 56 57 58
59 60 61 62 63 64 65 66 67 68 69 70 72 73 74 75 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 134 135 136 137 138 139 140
141 143 144 145 146 147 148 149 150 151 152 153 154 155 156 158 159 160 161 162 163 164 165
166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 182 183 184 185 186 188 189 190
191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213
214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 237
238 239 240 241 243 244 245 246 247 248 249 250 251 252 254 255 256 258 259 261 262 263 264
265 266 267 268 269 270 271 272 274 275 276 277 278 279 280 281 282 283 284 285 287 288 289
290 292 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315
316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338
340 341 342 344 345 346 347 348 351 354 357 358 359 360 361 363 364 365 366 367 368 369 370
371 372 373 374 375 376 377 378 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394
395 396 397 398 399 400 401 402 403 404 405 406 407 408 411 413 414 415 416 417 418")]
pdf("svr_ROC_PAINFREE_AID_1540_v2.pdf")
svr.all <- list()
load("10fold/jjc40/svr.min")
svr.all <- append(svr.all,svr.min)
testmatrix <- desmatrix[,as.numeric(gsub("X","",colnames(svr.all[[1]]@xmatrix)))]
response.pred <- predict(svr.all[[1]],testmatrix,type="response")
svr.ROC.pred <- list()
svr.ROC.pred[[1]] <- prediction(10^(response.pred*0.63629070836945+-0.0914133757060018),response)
svr.ROC.perf <- list()
svr.ROC.perf[[1]] <- performance(svr.ROC.pred[[1]],measure="tpr",x.measure="fpr")
plot(svr.ROC.perf[[1]])
if (length(svr.all)>1){
    for (i in 2:length(svr.all)){
        testmatrix <- desmatrix[,as.numeric(gsub("X","",colnames(svr.all[[i]]@xmatrix)))]
        response.pred=predict(svr.all[[i]],testmatrix,type="response")
        svr.ROC.pred[[i]] <- prediction(10^(response.pred*0.63629070836945+-0.0914133757060018),response)
        svr.ROC.perf[[i]] <- performance(svr.ROC.pred[[i]],measure="tpr",x.measure="fpr")
        plot(svr.ROC.perf[[i]],add=T)
    }
}
save(svr.ROC.pred,file="svr.ROC.pred")
save(svr.ROC.perf,file="svr.ROC.perf")
dev.off()
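For readers unfamiliar with how a ROC curve is built from model scores, the sketch below computes the (false positive rate, true positive rate) points and the area under the curve in plain Python. It illustrates the general technique only and does not reproduce ROCR's internals; the function names `roc_points` and `auc` are hypothetical, and labels are assumed to be coded 1 for active and 0 for inactive.

```python
def roc_points(scores, labels):
    """Return ROC points (fpr, tpr) obtained by lowering the score threshold.

    Compounds with identical scores are added together so that ties yield a
    single point.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    pts = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
    print(pts, auc(pts))
```

For the classification models the score would be the SVM decision value; for the regression models it would be the back-transformed prediction, as in the R listing.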

1.7 vHTS with GA/SVM-C Models

#!/usr/bin/env perl
use strict;
use warnings;

#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

# Author: Jonathan Jun Feng Chen
# Description: Goes from scan files to calculate the overlap metric, Tanimoto
#              coefficient and the IC50 from a created model.

# Change Log:
# 08/10/2015: Copied changes from svr files for adjusting overlap metric. Now
#             it's the occurrence number in the set instead of the occurrence
#             number observed in the molecule.
# 03/10/2015: Removed TC output and spun it off to pc_tc.pl. TC of entire
#             PubChem taking too much space.
# 11/22/2014: Split overlap output into overlap and tc output.
# 11/11/2014: Rename output files to correspond to which output it is. Added
#             lines, progressively use models.
# 11/05/2014: Delete commented TC and model for original usage (multivariate
#             with output from Minitab and Fortran). Reintroduced TC for later
#             usage.
# 08/20/2014: Commenting for clarity, commented out TC (not needed at time)
# 07/29/2014: Initial modification for stricter overlap coefficient and model
#             fitting R output
# 04/18/2014: Modification from Derick's script to become this one

# Based on:
# Author: Derick C. Weis
# Last updated: 10/22/07
# pc_desmatrix.pl
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

@ARGV == 1 or die "usage: ./scriptname scan bin#";

# Root path and final folder path
#my $root_path="/path/to/folder";
my $root_path=`pwd`;
chomp($root_path);
my $cross_val="10fold";
my $results_number="compiled";
my $folder_name="screening_${cross_val}";
print "$folder_name\n";

# Directory containing all/the sdscan files.
my $scan = "/path/to/database/PubChem_Compound/scan0-2_files/bin$ARGV[0]/";
#my $scan = "/path/to/trainingset/";

# Signature height
my $start_height=0;
my $end_height=2;
my $height="$start_height-$end_height";

# Output file: Filtered results of the sdfscan files
my $stats_output="$root_path/$folder_name/${cross_val}_svm_stats_$ARGV[0]";
my $prediction_output="$root_path/$folder_name/${cross_val}_svm_prediction_$ARGV[0]";
my $overlap_output="$root_path/$folder_name/overlap_svm_$ARGV[0]";

# Training path for PubChem to compare against
my $training_path="$root_path/trainingset";

# Model alpha.x/feature path/ for prediction/classification
my $alpha_path="$root_path/$cross_val/svm_alpha.$results_number";
my $feature_path="$root_path/$cross_val/svm_features.$results_number";
my $intercept_path="$root_path/$cross_val/svm_intercept.$results_number";

# Overlap metric minimum
my $overlap_metric_1=0;
my $overlap_metric_2=0.6;
my $overlap_metric_3=0.7;
my $overlap_metric_4=0.8;
my $overlap_metric_5=0.9;
my $overlap_metric_6=1;

#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
my $count1=0;
my $count_train=0;
my $count_pred=0;
my $train_sig_counter=0;
my @x=0;
my %database=();
my %min_database=();
my %max_database=();
my @files;
my %compound=();
my $cid=0;
#my $i=0;
#my $j=0;
#my $k=0;
my $name;
my $name_pred;
my $name_train;
my @files_pred;
my $overlapcount=0;
my $totalcount=0;
my $overlap_metric;
my %f=();
my $IC50;
my $filecount=0; #filecount of trainingset
my @filehandle;
my $prevhandle="kjbkjvjvmbjkfkjvj";
my @overlap_array;
my $model_counter;
my @feature;
my @alpha;
my $feature_length;
my $alpha_length;
my @intercept;
my $a=0;
my $decision;
my $model_length;
my @training_scan;
my @HCdatabase_compiled;
my %occurance_count;

# Training Set information/data
for my $i ($start_height..$end_height){
    push(@training_scan,"$training_path/*.scan$i");
    push(@HCdatabase_compiled,"$training_path/HCdatabase$i");
}

# copy PubChem signature database to compare against:
system("cat @HCdatabase_compiled > HCdatabase_compiled$ARGV[0]");

# compile scan file names into one file to read from.
system("ls @training_scan > training_scan$ARGV[0]");

# read in training database
open(DATABASE,"HCdatabase_compiled$ARGV[0]") || die "\nCould not open HCdatabase_compiled$ARGV[0] for reading\n";
while(my $line=<DATABASE>){
    chomp($line);
    substr($line,0,1,"") if (index($line,"1")==0);
    next if ($line eq "");
    $train_sig_counter++;
    my @x=split(/\s/,$line);
    $occurance_count{$x[0]}=0;
    $database{$x[0]}=$train_sig_counter; # hash/key of database-assigns a number to a signature
    $f{$train_sig_counter}=$x[0]; #hash/key of database-assigns a signature to a number
    $min_database{$x[0]}=1000000000000;
    $max_database{$x[0]}=0;
}

# count for overlap calculations later
open(SCANLIST,"<training_scan$ARGV[0]");
while(my $filename=<SCANLIST>){
    my @filehandle = split(/\./,$filename);
    if ($filehandle[0] eq $prevhandle){
        open(CHEMBL,"<$filename")||die "could not open $filename";
        while (my $line=<CHEMBL>){
            chomp($line);
            last if ($line eq "0.0 ");
            @x = split(/\s/,$line);
            $min_database{$x[1]}=$x[0] if ($x[0] < $min_database{$x[1]});
            $max_database{$x[1]}=$x[0] if ($x[0] > $max_database{$x[1]});
            $occurance_count{$x[1]}++;
        }
        close(CHEMBL);
    }
    else{
        $a=0; #signature count for molecule
        $filecount++; #counting files read in training set
        open(CHEMBL,"<$filename")||die "could not open $filename";
        while (my $line=<CHEMBL>){
            chomp($line);
            last if ($line eq "0.0 ");
            @x = split(/\s/,$line);
            $min_database{$x[1]}=$x[0] if ($x[0] < $min_database{$x[1]});
            $max_database{$x[1]}=$x[0] if ($x[0] > $max_database{$x[1]});
            $occurance_count{$x[1]}++;
        }
        close(CHEMBL);
        $prevhandle=$filehandle[0];
    }
}
my @sortedkeys = keys(%max_database); #collect keys of signatures
@sortedkeys = sort(@sortedkeys); #sort keys of signatures
foreach my $key (@sortedkeys){
    # a signature missing from at least one training compound has a minimum occurrence of 0
    $min_database{$key}=0 if ($occurance_count{$key}!=$filecount);
    # print "$key $min_database{$key} $max_database{$key}\n";
}
#die;

# Open output file
open(STATS_OUTPUT,">$stats_output");
open(PREDICTION_OUTPUT,">$prediction_output");
open(OVERLAP_OUTPUT,">$overlap_output");

# Get features list and alpha.x list
open(FEATURE,"<$feature_path")||die "Couldn't open $feature_path\n";
@feature=<FEATURE>;
print "$feature_path";
close(FEATURE);
chomp(@feature);
open(ALPHA,"<$alpha_path")||die "Couldn't open $alpha_path\n";
@alpha=<ALPHA>;
close(ALPHA);
chomp(@alpha);
my $model_count=@alpha;
open(INTERCEPT,"<$intercept_path")||die "Couldn't open $intercept_path\n";
@intercept=<INTERCEPT>;
close(INTERCEPT);
chomp(@intercept);

# Initialize overlap array
for my $i(0..23){
    for my $j(0..$model_count){
        $overlap_array[$i][$j]=0;
    }
}

# Find the unique signatures occurrences in the sdfscan files.
while ($name=<${scan}*${height}>) {
    open(FILE,"$name") ||die "\nCould not open $name for reading.\n";
    SIGNATURE: while(<FILE>){
        chomp($_);

        #****************************
        if($_ eq "" or $_ eq "\$\$\$\$"){
            next SIGNATURE;
        #****************************
        } elsif($_ eq "0.0 "){
            my @sortedkeys = keys(%compound); #collect keys of signatures
            @sortedkeys = sort(@sortedkeys); #sort keys of signatures
            # first filter: overlap metric
            foreach my $key (@sortedkeys){
                $totalcount++; #total signature count for overlap metric
                if($database{$key}){
                    if($compound{$key}<=$max_database{$key} && $compound{$key}>=$min_database{$key}){
                        $overlapcount++; #overlapped signature count for overlap metric
                    }
                }
            }
            $overlap_metric=$overlapcount/$totalcount;
            if($overlap_metric>=$overlap_metric_1){
                print OVERLAP_OUTPUT "#$cid";
                print PREDICTION_OUTPUT "#$cid";
                printf OVERLAP_OUTPUT " %.3f\n", $overlap_metric;

                # calculate predicted IC50
                for my $j(0..$model_count){
                    # next if ($i==$model_count);
                    print PREDICTION_OUTPUT "\n" if ($j==$model_count);
                    next if (!$feature[$j]);
                    my @model_feature=split(/\s+/,$feature[$j]);
                    shift(@model_feature);
                    $feature_length=@model_feature;
                    my @model_alpha=split(/\s+/,$alpha[$j]);
                    shift(@model_alpha);
                    $alpha_length=@model_alpha;
                    die "feature length n.e. alpha length" if ($feature_length!=$alpha_length);
                    my @g=(0) x $feature_length; # initialization of values for IC50 calculation
                    $model_counter=0;
                    foreach my $feat (@model_feature){
                        $g[$model_counter]=$compound{$f{$feat}} if ($compound{$f{$feat}});
                        $model_counter++;
                    }
                    my $g_length=@g;
                    die "g length n.e to alpha length" if ($g_length!=$alpha_length);
                    my @products = map {$g[$_]*$model_alpha[$_]} 0..$alpha_length-1;
                    my $sum=0;
                    foreach my $num(@products){
                        $sum = $sum + $num;
                    }
                    $decision= $sum - $intercept[$j];
                    print PREDICTION_OUTPUT " $decision";
                    # print PREDICTION_OUTPUT "\n" if ($i==$model_count);
                    $overlap_array[0][$j]++ if ($overlap_metric>$overlap_metric_1);
                    $overlap_array[1][$j]++ if ($overlap_metric>$overlap_metric_2);
                    $overlap_array[2][$j]++ if ($overlap_metric>$overlap_metric_3);
                    $overlap_array[3][$j]++ if ($overlap_metric>$overlap_metric_4);
                    $overlap_array[4][$j]++ if ($overlap_metric>$overlap_metric_5);
                    $overlap_array[5][$j]++ if ($overlap_metric==$overlap_metric_6);
                    $overlap_array[6][$j]++ if ($overlap_metric>$overlap_metric_1 && $decision>0);
                    $overlap_array[7][$j]++ if ($overlap_metric>$overlap_metric_2 && $decision>0);
                    $overlap_array[8][$j]++ if ($overlap_metric>$overlap_metric_3 && $decision>0);
                    $overlap_array[9][$j]++ if ($overlap_metric>$overlap_metric_4 && $decision>0);
                    $overlap_array[10][$j]++ if ($overlap_metric>$overlap_metric_5 && $decision>0);
                    $overlap_array[11][$j]++ if ($overlap_metric==$overlap_metric_6 && $decision>0);
                    $overlap_array[12][$j]++ if ($overlap_metric>$overlap_metric_1 && $decision>1);
                    $overlap_array[13][$j]++ if ($overlap_metric>$overlap_metric_2 && $decision>1);
                    $overlap_array[14][$j]++ if ($overlap_metric>$overlap_metric_3 && $decision>1);
                    $overlap_array[15][$j]++ if ($overlap_metric>$overlap_metric_4 && $decision>1);
                    $overlap_array[16][$j]++ if ($overlap_metric>$overlap_metric_5 && $decision>1);
                    $overlap_array[17][$j]++ if ($overlap_metric==$overlap_metric_6 && $decision>1);
                    $overlap_array[18][$j]++ if ($overlap_metric>$overlap_metric_1 && $decision>2);
                    $overlap_array[19][$j]++ if ($overlap_metric>$overlap_metric_2 && $decision>2);
                    $overlap_array[20][$j]++ if ($overlap_metric>$overlap_metric_3 && $decision>2);
                    $overlap_array[21][$j]++ if ($overlap_metric>$overlap_metric_4 && $decision>2);
                    $overlap_array[22][$j]++ if ($overlap_metric>$overlap_metric_5 && $decision>2);
                    $overlap_array[23][$j]++ if ($overlap_metric==$overlap_metric_6 && $decision>2);
                }
            }
            %compound = (); #after done, reset hash/key for next compound in scan file.
            $cid = 0;
            $overlapcount=0;
            $totalcount=0;
        #****************************
        } else { #actual harvesting of scan/signature data starts here
            @x = split(/\s/,$_);

            if($x[0] eq "#"){
                $cid=$x[1];
                next SIGNATURE;
            }

            # Skip duplicates.
            if($compound{$x[1]}){
                if($compound{$x[1]} ne $x[0]){
                    next SIGNATURE;
                }
            }

            $compound{$x[1]}=$x[0]; #assignment and creation of hash/keys here.
        }

    }
    close(FILE);
}

# print overlap array for stats analysis
for my $i(0..23){
    print STATS_OUTPUT "criterion$i";
    for my $j(0..$model_count){
        print STATS_OUTPUT "\n" if ($j==$model_count);
        next if (!$feature[$j]);
        print STATS_OUTPUT " $overlap_array[$i][$j]";
    }
}
close(STATS_OUTPUT);
close(OVERLAP_OUTPUT);
close(PREDICTION_OUTPUT);
print "\nOutput File: $stats_output $prediction_output $overlap_output\n\n";
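The two quantities the screening script computes for each PubChem compound, the overlap metric and the linear SVM decision value, can be sketched compactly in Python. This is an illustrative simplification rather than a port: the function names are hypothetical, signatures are used directly as dictionary keys (the Perl script maps feature numbers to signatures through the `%f` hash), and `alpha_x` is assumed to hold the per-feature weights printed as "alpha.x" by the model-creation script.

```python
def overlap_metric(compound, min_count, max_count):
    """Fraction of a compound's signatures whose occurrence count lies within
    the [min, max] range observed for that signature in the training set.

    compound:  dict mapping signature -> occurrence count in the compound
    min_count: dict mapping signature -> minimum count in the training set
    max_count: dict mapping signature -> maximum count in the training set
    """
    if not compound:
        return 0.0
    overlapped = 0
    for signature, count in compound.items():
        if signature in max_count and min_count[signature] <= count <= max_count[signature]:
            overlapped += 1
    return overlapped / len(compound)


def decision_value(compound, features, alpha_x, intercept):
    """Linear-kernel SVM decision: sum of weight * count, minus the intercept."""
    if len(features) != len(alpha_x):
        raise ValueError("feature and weight lists must have the same length")
    total = sum(w * compound.get(sig, 0) for sig, w in zip(features, alpha_x))
    return total - intercept


if __name__ == "__main__":
    compound = {"a": 2, "b": 1, "c": 5}
    print(overlap_metric(compound, {"a": 0, "b": 1}, {"a": 3, "b": 1}))
    print(decision_value(compound, ["a", "b", "z"], [0.5, -1.0, 2.0], 0.25))
```

A signature absent from the training set never counts toward the overlap, so a compound built mostly from unseen signatures receives a low overlap metric regardless of its predicted decision value.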

1.8 vHTS with GA/SVM-R Models

#!/usr/bin/env perl
use strict;
use warnings;

#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

# Author: Jonathan Jun Feng Chen
# Description: Goes from scan files to calculate the overlap metric, Tanimoto
#              coefficient and the IC50 from a created model.

# Change Log:
# 07/28/2015: Added "foreach" loop to make min occurrence=0 if feature appears
#             less than # of compounds there are. Makes overlap in line with
#             Derick now.
# 07/24/2015: Added line to remove HCdatabase_compiled and training_scan after
#             run.
# 07/22/2015: Added reading mean and st. dev to undo scaling.
# 03/10/2015: Removed TC output and spun it off to pc_tc.pl. TC of entire
#             PubChem taking too much space.
# 11/22/2014: Split overlap output into overlap and tc output.
# 11/11/2014: Rename output files to correspond to which output it is. Added
#             lines, progressively use models.
# 11/05/2014: Delete commented TC and model for original usage (multivariate
#             with output from Minitab and Fortran). Reintroduced TC for later
#             usage.
# 08/20/2014: Commenting for clarity, commented out TC (not needed at time)
# 07/29/2014: Initial modification for stricter overlap coefficient and model
#             fitting R output
# 04/18/2014: Modification from Derick's script to become this one

# Based on:
# Author: Derick C. Weis
# Last updated: 10/22/07
# pc_desmatrix.pl
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

@ARGV == 1 or die "usage: ./scriptname scan bin#";

# Root path and final folder path
#my $root_path="/path/to/folder";
my $root_path=`pwd`;
chomp($root_path);
my $cross_val="10fold";
my $results_number="compiled";
my $folder_name="screening_${cross_val}";
#my $folder_name="debug";
print "$folder_name\n";
#die;

# St. dev and mean locations
my $datafile_name="activity.log";
my $mean_path="${root_path}/${datafile_name}_mean";
my $sdev_path="${root_path}/${datafile_name}_sdev";

# Directory containing all/the sdscan files.
my $scan = "/path/to/database/PubChem_Compound/scan0-2_files/bin$ARGV[0]/";
#my $scan = "/path/to/debug/";

# Signature height
my $start_height=0;
my $end_height=2;
my $height="$start_height-$end_height";

# Output file: Filtered results of the sdfscan files
my $stats_output="$root_path/$folder_name/${cross_val}_svr_stats_$ARGV[0]";
my $prediction_output="$root_path/$folder_name/${cross_val}_svr_prediction_$ARGV[0]";
my $overlap_output="$root_path/$folder_name/overlap_svr_$ARGV[0]";

# Training path for PubChem to compare against
my $training_path="$root_path/activity";

# Model alpha.x/feature path/ for prediction/classification
my $alpha_path="$root_path/$cross_val/svr_alpha.$results_number";
my $feature_path="$root_path/$cross_val/svr_features.$results_number";
my $intercept_path="$root_path/$cross_val/svr_intercept.$results_number";

# Overlap metric minimum
my $overlap_metric_1=0;
my $overlap_metric_2=0.6;
my $overlap_metric_3=0.7;
my $overlap_metric_4=0.8;
my $overlap_metric_5=0.9;
my $overlap_metric_6=1;

#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
my $count1=0;
my $count_train=0;
my $count_pred=0;
my $train_sig_counter=0;
my @x=0;
my %database=();
my %min_database=();
my %max_database=();
my @files;
my %compound=();
my $cid=0;
#my $i=0;
#my $j=0;
#my $k=0;
my $name;
my $name_pred;
my $name_train;
my @files_pred;
my $overlapcount=0;
my $totalcount=0;
my $overlap_metric;
my %f=();
my $IC50;
my $filecount=0; #filecount of trainingset
my @filehandle;
my $prevhandle="kjbkjvjvmbjkfkjvj";
my @overlap_array;
my $model_counter;
my @feature;
my @alpha;
my $feature_length;
my $alpha_length;
my @intercept;
my $a=0;
my $decision;
my $model_length;
my $sdev;
my $mean;
my @training_scan;
my @HCdatabase_compiled;
my %occurance_count;

# Training Set information/data
for my $i ($start_height..$end_height){
    push(@training_scan,"$training_path/*.scan$i");
    push(@HCdatabase_compiled,"$training_path/HCdatabase$i");
}

# copy PubChem signature database to compare against:
system("cat @HCdatabase_compiled > HCdatabase_compiled$ARGV[0]");

# compile scan file names into one file to read from.
system("ls @training_scan > training_scan$ARGV[0]");

# read in training database
open(DATABASE,"HCdatabase_compiled$ARGV[0]") || die "\nCould not open HCdatabase_compiled$ARGV[0] for reading\n";
while(my $line=<DATABASE>){
    chomp($line);
    substr($line,0,1,"") if (index($line,"1")==0);
    next if ($line eq "");
    $train_sig_counter++;
    my @x=split(/\s/,$line);
    $occurance_count{$x[0]}=0;
    $database{$x[0]}=$train_sig_counter; # hash/key of database-assigns a number to a signature
    $f{$train_sig_counter}=$x[0]; #hash/key of database-assigns a signature to a number
    $min_database{$x[0]}=1000000000000;
    $max_database{$x[0]}=0;
}

# count for overlap calculations later
open(SCANLIST,"<training_scan$ARGV[0]");
while(my $filename=<SCANLIST>){
    my @filehandle = split(/\./,$filename);
    if ($filehandle[0] eq $prevhandle){
        open(CHEMBL,"<$filename")||die "could not open $filename";
        while (my $line=<CHEMBL>){
            chomp($line);
            last if ($line eq "0.0 ");
            @x = split(/\s/,$line);
            $occurance_count{$x[1]}++;
            $min_database{$x[1]}=$x[0] if ($x[0] < $min_database{$x[1]});
            $max_database{$x[1]}=$x[0] if ($x[0] > $max_database{$x[1]});
        }
        close(CHEMBL);
    }
    else{
        $a=0; #signature count for molecule
        $filecount++; #counting files read in training set
        open(CHEMBL,"<$filename")||die "could not open $filename";
        while (my $line=<CHEMBL>){
            chomp($line);
            last if ($line eq "0.0 ");
            @x = split(/\s/,$line);
            $occurance_count{$x[1]}++;
            $min_database{$x[1]}=$x[0] if ($x[0] < $min_database{$x[1]});
            $max_database{$x[1]}=$x[0] if ($x[0] > $max_database{$x[1]});
        }
        close(CHEMBL);
        $prevhandle=$filehandle[0];
    }
}
#my @sortedkeys = keys(%max_database); #collect keys of signatures
#@sortedkeys = sort(@sortedkeys); #sort keys of signatures
#foreach my $key (@sortedkeys){
#    $min_database{$key}=0 if ($occurance_count{$key}!=$filecount);
#    print "$key $min_database{$key} $max_database{$key}\n";
#}
#die;

# Open output file
open(STATS_OUTPUT,">$stats_output");
open(PREDICTION_OUTPUT,">$prediction_output");
open(OVERLAP_OUTPUT,">$overlap_output");

# Get features list and alpha.x list
open(FEATURE,"<$feature_path")||die "Couldn't open $feature_path\n";
@feature=<FEATURE>;
close(FEATURE);
chomp(@feature);
open(ALPHA,"<$alpha_path")||die "Couldn't open $alpha_path\n";
@alpha=<ALPHA>;
close(ALPHA);
chomp(@alpha);
my $model_count=@alpha;
open(INTERCEPT,"<$intercept_path")||die "Couldn't open $intercept_path\n";
@intercept=<INTERCEPT>;
close(INTERCEPT);
chomp(@intercept);

# Initialize overlap array
for my $i(0..23){
    for my $j(0..$model_count){
        $overlap_array[$i][$j]=0;
    }
}

# Get data mean and st. dev
open(MEAN,"<$mean_path")||die "\n\nCould not open MEAN\n\n";
open(SDEV,"<$sdev_path")||die "\n\nCould not open SDEV\n\n";
$sdev=<SDEV>;
$mean=<MEAN>;
close(MEAN);
close(SDEV);

# Find the unique signatures occurrences in the sdfscan files. while ($name=<${scan}∗${height}>) { open(FILE,”$name”) ||die ”\nCould not open $name for reading.\n”; SIGNATURE: while(){ chomp($ );

#∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ if($ eq ”” or $ eq ”\$\$\$\$”){ next SIGNATURE; #∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ } elsif($ eq ”0.0 ”){ my @sortedkeys = keys(%compound); #collect keys of signatures @sortedkeys = sort(@sortedkeys); #sort keys of signatures # first filter: overlap metric foreach my $key (@sortedkeys){ $totalcount++; #total signature count for overlap metric if($database{$key}){ if($compound{$key}<=$max database{ $key} && $compound{$key}>= $min database{$key}){

227 $overlapcount++; #overlapped signature count for overlap metric } } } $overlap metric=$overlapcount/$totalcount; if($overlap metric>=$overlap metric 1){ print OVERLAP OUTPUT ”#$cid”; print PREDICTION OUTPUT ”#$cid”; printf OVERLAP OUTPUT ” %.3f\n”, $overlap metric;

# calculate predicted IC50\ for my $j(0..$model count){ # next if ($i==$model count); print PREDICTION OUTPUT ”\n” if ($j ==$model count); next if (!$feature[$j]); my @model feature=split(/\s+/,$feature[ $j]); shift(@model feature); $feature length=@model feature; my @model alpha=split(/\s+/,$alpha[$j]); shift(@model alpha); $alpha length=@model alpha;

228 die ”feature length n.e. alpha length” if ( $feature length!=$alpha length); my @g=(0) x $feature length; # initialization of values for IC50 calculation $model counter=0; foreach my $feat (@model feature){ $g[$model counter]=$compound{ $f{$feat}} if ($compound{$f{ $feat}}); $model counter++; } my $g length=@g; die ”g length n.e to alpha length” if ( $g length!=$alpha length); my @products = map {$g[$ ]∗ $model alpha[$ ]} 0..$alpha length−1; my $sum=0; foreach my $num(@products){ $sum = $sum + $num; } # $decision= $sum− $intercept[$j]; $decision= ($sum− $intercept[$j])∗$sdev+ $mean; print PREDICTION OUTPUT ” $decision”;

                    # print PREDICTION_OUTPUT "\n" if ($i==$model_count);
                    $overlap_array[0][$j]++ if ($overlap_metric>$overlap_metric_1);
                    $overlap_array[1][$j]++ if ($overlap_metric>$overlap_metric_2);
                    $overlap_array[2][$j]++ if ($overlap_metric>$overlap_metric_3);
                    $overlap_array[3][$j]++ if ($overlap_metric>$overlap_metric_4);
                    $overlap_array[4][$j]++ if ($overlap_metric>$overlap_metric_5);
                    $overlap_array[5][$j]++ if ($overlap_metric==$overlap_metric_6);
                    $overlap_array[6][$j]++ if ($overlap_metric>$overlap_metric_1 && $decision>0);
                    $overlap_array[7][$j]++ if ($overlap_metric>$overlap_metric_2 && $decision>0);
                    $overlap_array[8][$j]++ if ($overlap_metric>$overlap_metric_3 && $decision>0);
                    $overlap_array[9][$j]++ if ($overlap_metric>$overlap_metric_4 && $decision>0);
                    $overlap_array[10][$j]++ if ($overlap_metric>$overlap_metric_5 && $decision>0);
                    $overlap_array[11][$j]++ if ($overlap_metric==$overlap_metric_6 && $decision>0);
                    $overlap_array[12][$j]++ if ($overlap_metric>$overlap_metric_1 && $decision>1);
                    $overlap_array[13][$j]++ if ($overlap_metric>$overlap_metric_2 && $decision>1);
                    $overlap_array[14][$j]++ if ($overlap_metric>$overlap_metric_3 && $decision>1);
                    $overlap_array[15][$j]++ if ($overlap_metric>$overlap_metric_4 && $decision>1);
                    $overlap_array[16][$j]++ if ($overlap_metric>$overlap_metric_5 && $decision>1);
                    $overlap_array[17][$j]++ if ($overlap_metric==$overlap_metric_6 && $decision>1);
                    $overlap_array[18][$j]++ if ($overlap_metric>$overlap_metric_1 && $decision>2);
                    $overlap_array[19][$j]++ if ($overlap_metric>$overlap_metric_2 && $decision>2);
                    $overlap_array[20][$j]++ if ($overlap_metric>$overlap_metric_3 && $decision>2);
                    $overlap_array[21][$j]++ if ($overlap_metric>$overlap_metric_4 && $decision>2);
                    $overlap_array[22][$j]++ if ($overlap_metric>$overlap_metric_5 && $decision>2);
                    $overlap_array[23][$j]++ if ($overlap_metric==$overlap_metric_6 && $decision>2);
                }
            }
            %compound = (); #after done, reset hash/key for next compound in scan file.
            $cid = 0;
            $overlapcount=0;
            $totalcount=0;
        #****************************
        } else { #actual harvesting of scan/signature data starts here
            @x = split(/\s/,$_);

            if($x[0] eq "#"){
                $cid=$x[1];
                next SIGNATURE;
            }

            # Skip duplicates.
            if($compound{$x[1]}){
                if($compound{$x[1]} ne $x[0]){
                    next SIGNATURE;
                }
            }

            $compound{$x[1]}=$x[0]; #assignment and creation of hash/keys here.
        }

    }
    close(FILE);
}
# print overlap array for stats analysis
for my $i(0..23){
    print STATS_OUTPUT "criterion$i";
    for my $j(0..$model_count){
        print STATS_OUTPUT "\n" if ($j==$model_count);
        next if (!$feature[$j]);
        print STATS_OUTPUT " $overlap_array[$i][$j]";
    }
}
close(STATS_OUTPUT);
close(OVERLAP_OUTPUT);
close(PREDICTION_OUTPUT);
print "\nOutput File: $stats_output $prediction_output $overlap_output\n\n";

# remove compiled database and training scan files
system("rm HCdatabase_compiled$ARGV[0] training_scan$ARGV[0]")

1.9 Set Theoretic Tanimoto Coefficient calculation

#!/usr/bin/env perl
use strict;
use warnings;

#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

# Author: Jonathan Jun Feng Chen
# Description: Goes from scan files to calculate the overlap metric and the IC50 from a created model.

# Change Log:
# 07/24/2015: Added line to remove training scan after run.
# 03/10/2015: Removed tc component from model script because tc of all PubChem took too much space and spun it out to this.
# 11/22/2014: Split overlap output into overlap and tc output.
# 11/11/2014: Rename output files to correspond to which output it is. Added lines, progressively use models.
# 11/05/2014: Delete commented TC and model for original usage (multivariate with output from Minitab and Fortran). Reintroduced TC for later usage.
# 08/20/2014: Commenting for clarity, commented out TC (not needed at time)
# 07/29/2014: Initial modification for stricter overlap coefficient and model fitting R output
# 04/18/2014: Modification from Derick's script to become this one

# Based on :
# Author: Derick C. Weis
# Last updated: 10/22/07
# pc_desmatrix.pl
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

# Root path and final folder path
#my $root_path="/path/to/folder";
#my $folder_name="15fold";
my $root_path=`pwd`;
chomp($root_path);
my $folder_name="activeTC";

# Directory containing all/the sdscan files.
my $scan = "$root_path/$folder_name/";

# Signature height
my $start_height=0;
my $end_height=2;
my $height="$start_height-$end_height";

# Output file: Filtered results of the sdfscan files
my $tc_output="$root_path/$folder_name/filtered_svm_tc";

# Training path for PubChem to compare against
my $training_path="$root_path/trainingset";

#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

my @a=0; # array of signature count of each molecule in training set for TC calculation
my @b=0; # array of signature count of each molecule in PubChem set for TC calculation
my @c=0; # array of signature that are common of molecules in both sets for TC calculation
my @x=0;
my %compound=();
my $cid=0;
my $name;
my $filecount=0; #filecount of trainingset
my @filehandle;
my $prevhandle="kjbkjvjvmbjkfkjvj";
my %matrix=(); #training database matrix
my $a=0;
my @training_scan;

# Training Set information/data
for my $i ($start_height..$end_height){
    push(@training_scan,"$training_path/*.scan$i");
}

# compile scan file names into one file to read from.

system("ls @training_scan >training_scan");

# count for TC calculations later
open(SCANLIST,"<training_scan");
while (my $filename=<SCANLIST>){
    my @filehandle = split(/\./,$filename);
    if ($filehandle[0] eq $prevhandle){
        open(CHEMBL,"<$filename")||die "could not open $filename";
        while (my $line=<CHEMBL>){
            chomp($line);
            last if ($line eq "0.0 ");
            @x = split(/\s/,$line);
            next if ($matrix{$filecount}{$x[1]});
            $matrix{$filecount}{$x[1]}=$x[0]; #signature key/hash for each compound in training set
            $a=$a+$x[0];
        }
        close(CHEMBL);
        $a[$filecount-1]=$a; #saving signature count to array a
    }
    else{
        $a=0; #signature count for molecule
        $filecount++; #counting files read in training set
        open(CHEMBL,"<$filename")||die "could not open $filename";
        while (my $line=<CHEMBL>){
            chomp($line);
            last if ($line eq "0.0 ");
            @x = split(/\s/,$line);
            next if ($matrix{$filecount}{$x[1]});
            $matrix{$filecount}{$x[1]}=$x[0]; #signature key/hash for each compound in training set
            $a=$a+$x[0];
            $a[$filecount-1]=$a; #saving signature count to array a
        }
        close(CHEMBL);
        $prevhandle=$filehandle[0];

    }
}

# Open output file
open(TC_OUTPUT,">$tc_output");

# Find the unique signatures occurrences in the sdfscan files.
while ($name=<${scan}*${height}>) {
    open(FILE,"$name") ||die "\nCould not open $name for reading.\n";
    SIGNATURE: while(<FILE>){
        chomp($_);

        #****************************
        if($_ eq "" or $_ eq "\$\$\$\$"){
            next SIGNATURE;

        #****************************
        } elsif($_ eq "0.0 "){
            my @sortedkeys = keys(%compound); #collect keys of signatures
            @sortedkeys = sort(@sortedkeys); #sort keys of signatures
            print TC_OUTPUT "#$cid";
            # calculate Tanimoto coefficient
            for my $j(1..$filecount){ #calculate TC metric against each compound in training set
                my $b=0; #to compare to a for TC metric
                my $c=0; #to compare to a for TC metric
                my $a=$a[$j-1]; #corresponding compound currently TC calculation
                foreach my $key(@sortedkeys){
                    $b=$b+$compound{$key}; #same as a above
                    if($matrix{$j}{$key}){ #if signature in common, take lowest count
                        # print "$matrix{$j}{$key} $key \n";
                        if ($matrix{$j}{$key}<=$compound{$key}){
                            $c=$c+$matrix{$j}{$key};
                        } else {
                            $c=$c+$compound{$key};
                        }
                    }
                    # print "a=$a b=$b c=$c\n";
                }
                my $TC=$c/($a+$b-$c);
                printf TC_OUTPUT " %.3f", $TC;
                print TC_OUTPUT "\n" if ($j==$filecount);

            }
            %compound = (); #after done, reset hash/key for next compound in scan file.
            $cid = 0;
        #****************************
        } else { #actual harvesting of scan/signature data starts here
            @x = split(/\s/,$_);

            if($x[0] eq "#"){
                $cid=$x[1];
                next SIGNATURE;
            }

            # Skip duplicates.
            if($compound{$x[1]}){
                if($compound{$x[1]} ne $x[0]){
                    next SIGNATURE;
                }
            }

            $compound{$x[1]}=$x[0]; #assignment and creation of hash/keys here.
        }

    }
    close(FILE);
}
print "\nOutput File: $tc_output\n\n";
system("rm training_scan")
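
The set-theoretic Tanimoto coefficient computed by the script above is TC = c/(a+b-c), where a and b are the total signature counts of the two molecules and c sums, over the shared signatures, the smaller of the two counts. The following is a minimal sketch on two invented signature-count hashes. The sub name tanimoto, the example hashes, and the zero-denominator guard are illustrative assumptions and are not part of the original script.

```perl
use strict;
use warnings;
use List::Util qw(sum min);

# Hypothetical helper: set-theoretic Tanimoto coefficient between two
# hashes that map a signature string to its occurrence count.
sub tanimoto {
    my ($x, $y) = @_;
    my $count_x = sum(0, values %$x);    # "a" in the script
    my $count_y = sum(0, values %$y);    # "b" in the script
    # "c": for each shared signature, take the lower of the two counts
    my $common = sum(0, map { min($x->{$_}, $y->{$_}) }
                        grep { exists $y->{$_} } keys %$x);
    my $denom = $count_x + $count_y - $common;
    return $denom ? $common / $denom : 0;    # guard added for empty inputs
}

# Invented example data: signatures A-D with arbitrary counts.
my %training = (A => 2, B => 1, C => 1);    # a = 4
my %query    = (A => 1, B => 3, D => 2);    # b = 6
my $tc = tanimoto(\%training, \%query);     # c = 1 + 1 = 2
printf "TC = %.3f\n", $tc;
```

With these counts the coefficient is 2/(4+6-2) = 0.25; identical molecules give 1 and molecules with no shared signatures give 0.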

APPENDIX B EXPERIMENTAL VALIDATION METHODOLOGIES AND PROTOCOLS

The specific assay buffer compositions, compounds, protein-target concentrations, substrate-indicator concentrations, and readings differ among the activity validation experiments, but the experiment format and general protocol are the same. Therefore, as in the case studies, the general protocol and experiment format are presented first, followed by the instructions specific to each assay.

1. Serially dilute compounds at 50x concentration in DMSO (7 four-fold dilutions from 2.5mM to 152.5nM).

2. Fill each well of a non-treated Corning 96-well, flat-bottom plate, except those in column 10, with 50µL of the substrate-indicator.

3. Add 2µL of compound from step 1 in triplicate to the plate (compound 1: columns 1-3, compound 2: columns 4-6, compound 3: columns 7-9) in row order of descending concentration.

4. Add 50µL of the protein-target.

5. Incubate for the instructed duration under the instructed conditions.

6. Read the signal with a Tecan M200 microplate reader.
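
The dilution scheme in step 1 can be tabulated with a short script. This is a sketch under stated assumptions: the sub name dilution_series is hypothetical, the mapping of the eight concentrations to plate rows A-H is inferred from "row order of descending concentration", and the nominal final concentration is taken as the stock concentration divided by 50.

```perl
use strict;
use warnings;

# Hypothetical helper: stock concentrations (in nM) of a serial dilution,
# starting at $top_nM and diluting $fold-fold $n_dilutions times.
sub dilution_series {
    my ($top_nM, $fold, $n_dilutions) = @_;
    my @series = ($top_nM);
    push @series, $series[-1] / $fold for 1 .. $n_dilutions;
    return @series;
}

# 2.5 mM top stock and seven four-fold dilutions give eight concentrations.
my @stock_nM = dilution_series(2.5e6, 4, 7);

# Stocks are prepared at 50x (assumption: final = stock / 50).
my @final_nM = map { $_ / 50 } @stock_nM;

# Assumed layout: one concentration per plate row, highest in row A.
my @rows = ('A' .. 'H');
for my $i (0 .. $#stock_nM) {
    printf "row %s: stock %12.1f nM, final %10.2f nM\n",
        $rows[$i], $stock_nM[$i], $final_nM[$i];
}
```

The lowest stock computed this way is 2.5 mM divided by 4 to the seventh power, about 152.6 nM, in line with the range quoted in step 1.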

Percent inhibition was calculated as follows:

\[ \%\,\mathrm{inhibition} = \left(1 - \frac{\mathrm{signal} - \mathrm{blank}}{\mathrm{control} - \mathrm{blank}}\right) \times 100. \]

The two concentrations immediately above and below 50% activity were used to linearly interpolate the concentration at 50% activity.
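
The percent-inhibition formula and the interpolation step can be sketched as follows. The sub names, the example readings, and the choice to interpolate on a linear (not logarithmic) concentration axis are assumptions made for illustration; 50% activity is treated as 50% inhibition.

```perl
use strict;
use warnings;

# Percent inhibition from raw plate-reader values, per the formula above.
sub percent_inhibition {
    my ($signal, $blank, $control) = @_;
    die "control equals blank\n" if $control == $blank;
    return (1 - ($signal - $blank) / ($control - $blank)) * 100;
}

# Hypothetical helper: linearly interpolate the concentration at 50%
# inhibition from the first adjacent pair of points that brackets 50%.
# $conc and $inhib are array refs of equal length, in plate-row order.
# Returns undef (in scalar context) when no pair brackets 50%.
sub interpolate_ic50 {
    my ($conc, $inhib) = @_;
    for my $i (0 .. $#$conc - 1) {
        my ($x1, $y1) = ($conc->[$i],     $inhib->[$i]);
        my ($x2, $y2) = ($conc->[$i + 1], $inhib->[$i + 1]);
        next unless ($y1 - 50) * ($y2 - 50) <= 0 && $y1 != $y2;
        return $x1 + (50 - $y1) * ($x2 - $x1) / ($y2 - $y1);
    }
    return;
}

# Invented readings for illustration only (concentrations in uM).
my @conc_uM = (50, 12.5, 3.125);
my @inhib   = (80, 60, 20);
my $ic50 = interpolate_ic50(\@conc_uM, \@inhib);
printf "interpolated IC50: %.3f uM\n", $ic50 if defined $ic50;
```

In this invented data set the 50% point lies between 12.5 and 3.125, so only that pair contributes to the interpolated value.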

2.1 Cathepsin L

Based on the assay protocol for AID 825, the buffer was 20mM sodium acetate, 1mM EDTA, and 5mM DTT in water adjusted to pH 5.5. The assay was performed on a black plate. The final concentration of the substrate-indicator, Z-Phe-Arg-AMC, was 2µM. The final concentration of the protein-target, human liver cathepsin L, was 17.4ng/mL. Incubation was for 1 hour at room temperature before fluorescence was read with an excitation wavelength of 360 nm and an emission wavelength of 460 nm.

2.2 Factor XIIa

Based on the assay protocol for AID 728, the assay buffer was 50mM Tris, 150mM NaCl, and 0.02% Tween 20 in water adjusted to pH 7.4. The assay was performed on a black plate. The final concentration of the protein-target, Factor XIIa, was 3.5µg/mL. The final concentration of the substrate-indicator, Boc-Gly-Gln-Arg-AMC, was 15µM. Incubation was for 1 hour at room temperature before fluorescence was read with an excitation wavelength of 355 nm and an emission wavelength of 460 nm.

2.3 Factor XIa

Based on the assay protocol for AID 825, the assay buffer was 50mM Trizma Base, 150mM NaCl, and 0.02% Tween 20 in water adjusted to pH 7.4. The assay was performed on a black plate. The final concentration of the protein-target, Factor XIa, was 0.23µg/mL. The final concentration of the substrate-indicator, Boc-Glu-Ala-Arg-AMC, was 15µM. Incubation was for 1 hour at room temperature before fluorescence was read with an excitation wavelength of 355 nm and an emission wavelength of 460 nm.

2.4 Complement Factor C1s

Based on the assay protocol for AID 787, the assay buffer was 50mM HEPES, 200mM NaCl, and 0.2% polyethylene glycol (PEG) in water adjusted to pH 7.5. The assay was performed on a black plate. The final concentration of the protein-target, activated human complement factor C1s, was 0.02 mg/mL. The final concentration of the substrate-indicator, Boc-Leu-Gly-Arg-AMC, was 15µM. Incubation was for 2.5 hours at room temperature before fluorescence was read with an excitation wavelength of 355 nm and an emission wavelength of 460 nm.

2.5 SENP8

Based on the assay protocol for AID 624322, the assay buffer was 75mM HEPES, 2mM DTT, 0.5mM EDTA, 0.1% BSA, and 0.005% Tween 20 in water adjusted to pH 7.8. The assay was performed in black plates. The protein-target and substrate-indicator concentrations were adjusted due to economic constraints. The final concentration of the protein-target, SENP8, was 3pM. The final concentration of the substrate-indicator, NEDD8-AMC, was 30nM. Incubation was for 45 minutes before fluorescence was read with an excitation wavelength of 350 nm and an emission wavelength of 450 nm.

2.6 Human PK-M2 Pt. 1

Based on the assay protocol for AID 2533, the assay buffer was 50mM Imidazole, 50mM KCl, 7mM MgCl2, 0.01% Tween 20, 0.05% BSA in water adjusted to pH 7.2. The assay was performed in white plates. The final concentration of the protein-target was 0.1nM hPK-M2. The final concentration of the substrate mixture was 0.1mM ADP and 0.5mM PEP. Incubation was for 1 hour prior to the addition of 50µL of the luminescence indicator system Kinase-Glo from Promega. The assay was further incubated for 4 minutes while protected from light before luminescence was read.

2.7 Human PK-M2 Pt. 2

Based on the assay protocol for AID 1540, the assay buffer was 50mM Trizma Base, 200mM KCl, and 15mM MgCl2 in water adjusted to pH 8.0. The assay was performed in black plates. The final concentration of the protein-target mixture was 10nM human PK-M2 and 1µM LDH. The final concentration of the substrate-indicator mixture was 0.1mM PEP, 4.0mM ADP, and 0.2mM NADH. The plate was read for fluorescence immediately and again every 75 seconds for 30 minutes with an excitation wavelength of 340 nm and an emission wavelength of 460 nm.