ABSTRACT

LI, XINHAO. Development of Novel Machine Learning Approaches and Exploration of Their Applications in Cheminformatics. (Under the direction of Dr. Christopher Gorman).

Cheminformatics is the scientific field that develops and applies a combination of mathematics, informatics, machine learning, and other computational technologies to solve chemical problems. It aims to help chemists investigate and understand complex chemical and biological systems and to guide experimental design and decision making. With the rapid growth of chemical data (e.g., from high-throughput screening, metabolomics, etc.), machine learning has become an important tool for exploring chemical space and mining chemical information. In this dissertation, we present five studies on developing novel machine learning approaches and exploring their applications in cheminformatics.

One of the primary tasks of cheminformatics is to predict the physical, chemical, and biological properties of a given compound. Quantitative Structure-Activity Relationship (QSAR) modeling relies on machine learning techniques to establish quantified links between molecular structures and their experimental properties/activities. In Chapter 2, we developed a dual-layer hierarchical modeling method to fully integrate regression and classification QSAR models for assessing rat acute oral systemic toxicity with respect to regulatory classifications of concern. The first layer of independent regression, binary, and multiclass models (base models) was built solely using computed chemical descriptors/fingerprints. Then, a second layer of models (hierarchical models) was built by stacking all the cross-validated out-of-fold predictions from the base models. All models were validated using an external test set, and we found that the hierarchical models outperformed the base models for all three endpoints. The H-QSAR modeling method represents a promising approach for chemical toxicity prediction and, more generally, for stacking and blending individual QSAR models into more predictive ensemble models. In Chapter 3, we proposed the Molecular Prediction Model Fine-Tuning (MolPMoFiT) approach, an effective transfer learning method based on self-supervised pre-training followed by task-specific fine-tuning for QSPR/QSAR modeling. It enables knowledge learned from large chemical datasets to be transferred to smaller datasets, thereby improving model performance and generalization. A large-scale molecular structure prediction model is pre-trained using one million unlabeled molecules from ChEMBL in a self-supervised learning manner and can then be fine-tuned on various QSPR/QSAR tasks for smaller chemical datasets with specific endpoints. The benchmark results show that this transfer learning method achieves strong performance compared with other state-of-the-art machine learning modeling techniques.

SMILES-based deep learning models are slowly emerging as an important research topic in cheminformatics. In Chapter 4, we introduced SMILES Pair Encoding (SPE), a data-driven tokenization algorithm. SPE learns a vocabulary of high-frequency SMILES substrings from ChEMBL and then tokenizes new SMILES into a sequence of tokens for deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance for both molecular generation (by boosting the validity and novelty of generated SMILES) and QSAR prediction tasks. In particular, we evaluated the performance of SPE-based QSAR prediction models using 24 benchmark datasets, where SPE consistently matched or outperformed atom-level tokenization. SPE could therefore represent a better tokenization method for the development of future deep learning applications in cheminformatics.

Lead optimization endpoints usually contain a small set of congeneric molecules, which limits the application of deep learning. Transfer learning enables training a deep learning model with less data. In Chapter 5, we explored our transfer learning approach (see Chapter 3) for lead optimization endpoints. A set of lead-optimization-like benchmark datasets was created from 8 datasets containing pIC50 values against specific protein targets. Two language models, an LSTM model and a RoBERTa model, were trained on 10M SMILES from ChEMBL and then used as the knowledge source for the downstream QSAR tasks. The results show that transfer learning indeed provides performance gains for lead optimization endpoints.

In Chapter 6, we present CryptoChem, a new method and associated software to securely store and transfer information using chemicals. Relying on big chemical data, molecular descriptors, and machine learning techniques, CryptoChem offers a highly complex and robust system with multiple layers of security for transmitting confidential information. The algorithm directly uses chemical structures and their properties as the central element of the secured storage. QSDR (Quantitative Structure-Data Relationship) models are used as private keys to encode and decode the data. The software is validated with a series of five datasets consisting of numerical and textual information of increasing size and complexity.

© Copyright 2020 by Xinhao Li

All Rights Reserved

Development of Novel Machine Learning Approaches and Exploration of Their Applications in Cheminformatics

by Xinhao Li

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Chemistry

Raleigh, North Carolina 2020

APPROVED BY:

_______________________________
Christopher Gorman
Committee Chair

_______________________________
Gavin Williams

_______________________________
Caroline Proulx

_______________________________
Brian Reich

_______________________________
Denis Fourches
External Member

DEDICATION

This dissertation is dedicated to Anqi Tu and my family.


BIOGRAPHY

Xinhao Li grew up in Dalian, China. After a very positive experience in his high school chemistry classes, he decided to study chemistry at Beijing University of Chemical Technology, where he earned his bachelor's and master's degrees. During his master's studies, his research focused on organic synthesis. In 2017, he joined Dr. Denis Fourches' lab at North Carolina State University and began pursuing a Ph.D. with a focus on cheminformatics and machine learning.


ACKNOWLEDGMENTS

I would like to thank Dr. Denis Fourches for all the guidance, support, and encouragement he has given me over the past three years. His knowledge, vision, and suggestions have been a valuable source of inspiration for my research. I would like to thank Dr. Christopher Gorman, Dr. Gavin Williams, Dr. Caroline Proulx, and Dr. Brian Reich for serving as my committee members. I would like to thank Dr. Jamel Meslamani for his mentoring during my internship at GSK. I would also like to thank the North Carolina State University Chemistry Department and DARPA for their financial support. Finally, I would like to thank all my family and friends for their continued support over the years.

I would like to especially thank Anqi Tu for the constant love and support along the way.



TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1. Introduction
    1.1. Machine Learning
    1.2. QSAR Modeling
    References
Chapter 2. Hierarchical H-QSAR Modeling Approach for Integrating Binary/Multi Classification and Regression Models of Acute Oral Systemic Toxicity
    Abstract
    2.1. Introduction
    2.2. Materials and Methods
    2.3. Results
        2.3.1. Chemical space of the full dataset
        2.3.2. Models development, validation, and comparison
        2.3.3. Applicability Domain for Hierarchical Models
    2.4. Discussion
        2.4.1. Comparison with other methodologies recently reported in the literature
    2.5. Conclusion
    References
Chapter 3. Inductive Transfer Learning for Molecular Activity Prediction
    Abstract
    3.1. Introduction
    3.2. Method
        3.2.1. ULMFiT
        3.2.2. MolPMoFiT
        3.2.3. Dataset preparation
        3.2.4. Molecular Representation
        3.2.5. Data Augmentation
        3.2.6. Baselines and Comparison Models
        3.2.7. Hyperparameters and Training Procedure
    3.3. Results and Discussion
        3.3.1. Benchmark
        3.3.2. Analysis
    3.4. Conclusion
    References
Chapter 4. SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning
    Abstract
    4.1. Introduction
    4.2. Method
        4.2.1. SMILES Pair Encoding
        4.2.2. Dataset Preparation
        4.2.3. Machine Learning
        4.2.4. Evaluation Metrics
        4.2.5. Experiments
        4.2.6. Implementation
    4.3. Results and Discussion
        4.3.1. SMILES Pair Encoding on ChEMBL
        4.3.2. Molecular Generation Case Study
        4.3.3. Molecular Property Prediction Case Study
    4.4. Conclusion
    References
Chapter 5. Benchmarking Transfer Learning Approaches on Small Chemical Datasets
    Abstract
    5.1. Introduction
    5.2. Method
        5.2.1. Datasets
        5.2.2. Transfer Learning
        5.2.3. Machine Learning
        5.2.4. Implementation
    5.3. Results and Discussion
    5.4. Conclusion and Future Direction
    References
Chapter 6. CryptoChem: Encoding and Storing Information Using Chemicals
    Abstract
    6.1. Introduction
    6.2. Method
        6.2.1. Overview of CryptoChem
        6.2.2. ASCII Encoder
        6.2.3. Label Swapper
        6.2.4. Molecular Encoder
        6.2.5. Analogue Retriever
        6.2.6. Molecular Data Preparation
        6.2.7. Model Development
        6.2.8. MOLWRITE Software
        6.2.9. MOLREAD Software
    6.3. Results
        6.3.1. Assessment of CryptoChem with five datasets
    6.4. Discussion
    6.5. Conclusion
    References
APPENDICES
    Appendix A
    Appendix B
    Appendix C


LIST OF TABLES

Table 2.1. Summary of curated training set and external test set.
Table 2.2. External test set performances of base and hierarchical models' consensus.
Table 2.3. Comparison of our hierarchical QSAR models with the results of Alberga's models.
Table 3.1. Description of QSAR/QSPR datasets.
Table 3.2. Hyperparameters for QSPR/QSAR Model Fine-tuning.
Table 3.3. Impact of SMILES Augmentation on HIV Dataset.
Table 3.4. Impact of SMILES Augmentation on BBBP Dataset.
Table 4.1. Summary of QSAR benchmark datasets.
Table 4.2. Example of Tokenized SMILES.
Table 4.3. Metrics for Molecular Generation.
Table 5.1. Lead Optimization-like Datasets.
Table 6.1. Description of the five datasets to be encoded in this project.
Table 6.2. Results of encoding and decoding the five datasets using MOLWRITE and MOLREAD programs (with programs performing in single CPU mode).
Table 6.3. Comparison with Contemporary Encryption Methods.


LIST OF FIGURES

Figure 1.1. Traditional Programming vs. Machine Learning.
Figure 1.2. Types of Machine Learning.
Figure 1.3. Supervised Learning Workflow.
Figure 1.4. Featurization of 2D Molecular Structure.
Figure 1.5. Machine Learning Methods.
Figure 2.1. Overall workflow for building hierarchical QSAR models. Base regression, binary and multiclass models (60 models in total) are built with diverse combinations of machine learning algorithms and chemical descriptors/fingerprints. Out-of-fold predictions of base models are generated through 10-fold cross-validation. The out-of-fold predictions are concatenated together and used as input (meta features) for building hierarchical regression, binary and multiclass models.
Figure 2.2. Clustering of the full dataset (11,056 molecules) by t-SNE with bit-based ECFP6 (2048 bits).
Figure 2.3. Performances of Regression Models on External Test Set.
Figure 2.4. Performances of Binary Models on External Test Set.
Figure 2.5. Performances of Multiclass Models on External Test Set.
Figure 2.6. Scatter plot of the predicted versus experimentally measured logLD50 (mmol/kg) values. (a) Base Consensus Model; (b) Hierarchical Consensus Model; (c) Hierarchical Model (kNN); (d) Hierarchical Model (SVM); (e) Hierarchical Model (RF); (f) Hierarchical Model (XGBoost).
Figure 2.7. Scatter plot of consensus model predictions. The x-axis is the predicted logLD50 from the consensus regression model; the y-axis is the predicted probability of the toxic class from the consensus binary model. The scatter plots are colored by the predicted EPA category from the consensus multiclass model. (a) Base Consensus Models; (b) Hierarchical Consensus Models.
Figure 2.8. Two selected chemicals in the test set and their respective predictions using the base and hierarchical models.
Figure 2.9. Model Prediction Zones and Applicability Domain.
Figure 2.10. Performances of Hierarchical Regression Consensus Model based on Prediction Zones.
Figure 2.11. Performances of Hierarchical Binary Consensus Model based on Prediction Zones.
Figure 2.12. Performances of Hierarchical Multiclass Consensus Model based on Prediction Zones.
Figure 3.1. Scheme illustrating the MolPMoFiT architecture: during fine-tuning, learned weights are transferred between models. Vocab size corresponds to the number of unique characters tokenized (see Section 3.2.4) from SMILES in a data set. The stage of task-specific molecular structure prediction model fine-tuning is optional.
Figure 3.2. SMILES and Data Augmentation.
Figure 3.3. Comparison of MolPMoFiT to reported results from Yang's [14] on Lipophilicity. (a) Random split; (b) Scaffold split. MolPMoFiT: Molecular Prediction Model Fine-Tuning; D-MPNN: Directed Message Passing Neural Network; RF: Random Forest; FFN: Feed-Forward Network.
Figure 3.4. Comparison of MolPMoFiT to reported results from Yang's [14] on FreeSolv. (a) Random split; (b) Scaffold split.
Figure 3.5. Comparison of MolPMoFiT to reported results from Yang's [14] on BBBP. (a) Random split; (b) Scaffold split.
Figure 3.6. Comparison of MolPMoFiT to reported results from Yang's [14] on HIV. (a) Random split; (b) Scaffold split.
Figure 3.7. Performances of models on different sizes of the training set. (a) Lipophilicity; (b) FreeSolv; (c) BBBP and (d) HIV.
Figure 3.8. Performances of Lipophilicity models on different numbers of augmented SMILES per compound and Gaussian noise (σnoise) added to the original experimental values. TTA: test-time augmentation.
Figure 4.1. Distribution of length of SMILES Pair Encoding substrings trained on ChEMBL.
Figure 4.2. Representative SPE fragments.
Figure 4.3. Distribution of length of tokenized SMILES of ChEMBL. Blue: SMILES Pair Encoding tokenization; Orange: Atom-level tokenization.
Figure 4.4. Randomly sampled examples of generated molecules. (a) Examples from the model trained with SMILES Pair Encoding tokenization; (b) examples from the model trained with atom-level tokenization.
Figure 4.5. Results of QSAR benchmark. (a) Test set RMSE; (b) the effect size (Cohen's d value) of the difference between models trained with SPE tokenization and atom-level tokenization. A positive d value means atom-level tokenization performs better than SPE tokenization; a negative d value means SPE tokenization performs better than atom-level tokenization. An effect size with |d| (absolute value of d) less than 0.2 indicates no difference; between 0.2 and 0.5, a minor difference; between 0.5 and 0.8, a medium difference; greater than 0.8, a large difference.
Figure 5.1. Property Distribution of Lead Optimization-like Datasets.
Figure 5.2. Training and Test Set Property Distribution Following the Scaffold Split of Lead Optimization-like Datasets (Ratio = 80:20).
Figure 5.3. Classic language model (a) vs. Masked language model (b).
Figure 5.4. Pooled Molecular Representation.
Figure 5.5. Representative Molecules from HERG.
Figure 5.6. The average rank on the test sets across the 8 QSAR benchmark datasets based on (a) Pearson R and (b) RMSE.
Figure 6.1. Simplified workflow of the CryptoChem Algorithm.
Figure 6.2. Graphical Illustration of Label Swapper.
Figure 6.3. Analogue Retriever scheme for replacing a reference molecule with an analogue from the Analogue Set.
Figure 6.4. (a) The 4th dataset, and (b) the encoded version using MOLWRITE.
Figure 6.5. Summary of security layers in CryptoChem.


Chapter 1. Introduction

Cheminformatics [1] is the scientific field that develops and applies computational methods to solve chemical problems. It aims to help chemists investigate and understand complex chemical and biological systems and to guide experimental design and decision making. It has been applied to a wide range of case studies, such as discovering novel molecules/materials with desired activities/properties [2–6], virtual screening of large chemical databases [7–11], and predicting the outcomes of chemical reactions [12–15].

One of the primary tasks of cheminformatics is to predict the physical, chemical, and biological properties of a given compound. Quantitative Structure-Property/Activity Relationship (QSPR/QSAR) modeling [16–20] relies on machine learning techniques to establish quantified links between molecular structures and their experimental properties/activities.

Machine learning is a subfield of artificial intelligence that enables computers to learn directly from data and make decisions without explicitly programmed rules.

The main focus of my Ph.D. research is to develop novel QSAR modeling methodologies for challenging chemical/biological endpoints. This thesis is organized as follows. We first briefly explain the core concepts of machine learning and how to apply them to molecules for QSAR modeling (Chapter 1). We then go through my research projects: hierarchical QSAR modeling (Chapter 2) and the Molecular Prediction Model Fine-Tuning (MolPMoFiT) approach for molecular property/activity prediction (Chapter 3); SMILES pair encoding (Chapter 4); benchmarking transfer learning approaches on small chemical datasets (Chapter 5); and CryptoChem (Chapter 6).


1.1. Machine Learning

Traditional programming uses manually programmed rules to produce the output from the input data. Machine learning, on the other hand, automatically learns the rules from experience data (data for which the output is already known) and then produces outputs for new data [21] (Figure 1.1).

Figure 1.1. Traditional Programming vs. Machine Learning.

The most significant difference between traditional programming and machine learning is the ability to solve complex problems. For complex problems, traditional programming would require implementing, if at all possible, a very long list of rules. Applying machine learning techniques to explore and discover the patterns/rules from large amounts of data is a far better choice.

There are different types of machine learning systems (Figure 1.2). In supervised learning, the training data has labels, and the machine learning algorithm learns to predict the labels given the input features. Typical supervised learning tasks are regression and classification. Regression models predict a numerical value, such as the price of a house or the kinase inhibition potency of a given compound. Classification models predict a category, such as whether an email is spam or whether a given compound is considered toxic. In unsupervised learning, the training data is unlabeled, and the machine learning algorithms attempt to discover the intrinsic patterns of the data. Two major tasks of unsupervised learning are clustering and dimensionality reduction. Clustering groups the data in such a way that data points in the same cluster are more similar to each other than to those in other clusters. Dimensionality reduction tries to simplify the data without losing too much information and is usually used for visualization. Other important types of machine learning systems not covered in this document include semi-supervised learning and reinforcement learning.

Figure 1.2. Types of Machine Learning.
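As a brief illustration of these two unsupervised tasks (added here for illustration only; not part of the original studies), the following minimal scikit-learn sketch clusters a random placeholder feature matrix with k-means and compresses it to two dimensions with PCA. With real data, the matrix would hold molecular descriptors or fingerprints rather than random numbers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Placeholder "descriptor" matrix: 100 samples, 20 features (random values used only as an example)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))

# Clustering: group samples so that members of a cluster are more similar to each other
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])          # cluster assignment of the first 10 samples

# Dimensionality reduction: compress 20 features into 2 components for visualization
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                   # (100, 2)
```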

1.2. QSAR Modeling

The idea of QSAR modeling was initially developed by Corwin Hansch [22] in the 1960s. For decades, this technique has grown and evolved from using "simple" linear regression models for very small series of congeneric molecules to utilizing the most sophisticated machine learning techniques to model very large, structurally diverse chemical datasets [16].

A typical QSAR modeling (supervised learning) workflow (Figure 1.3) includes (1) data collection and curation, (2) data splitting, (3) featurization (or descriptor calculation), and (4) model training and evaluation. Data curation is essential for ensuring the correctness of the modeling set [23, 24] (for both the chemical structures and the endpoint to be modeled). The model-ready data are then split into training and test sets. The training data are used to build the models, and the test data are used to evaluate the final model. Machine learning algorithms can only take numerical input; the process of converting the (chemical) data into appropriate numerical representations is called featurization. During model training, the hyperparameters of the model are tuned via either cross-validation or a single train-validation split to improve model performance. Hyperparameters are configuration settings of a model that cannot be estimated from the data and need to be set manually beforehand. After training, the final model is challenged with the test data to evaluate how well it will work on unseen data.

Figure 1.3. Supervised Learning Workflow.
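The sketch below, added for illustration only, walks through steps (2) and (4) of this workflow with scikit-learn. The random matrix X stands in for a computed descriptor/fingerprint block and y for a measured endpoint; both are hypothetical placeholders, not data from this thesis.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical placeholders: X would be computed descriptors/fingerprints, y a measured endpoint
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))
y = rng.normal(size=500)

# Step (2): split the model-ready data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step (4): tune hyperparameters by cross-validation on the training set only
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# Challenge the final model with the held-out test set
rmse = np.sqrt(mean_squared_error(y_test, search.predict(X_test)))
print(search.best_params_, rmse)
```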

The molecular structures need to be encoded in a numerical format in order to be processed by computers. Molecular structures can be represented in two dimensions, three dimensions, or at even higher levels that consider the time-dependent dynamics of molecular conformations [25]. QSAR models assume that similar molecules tend to have similar properties/activities. However, the similarity between molecules can change dramatically when different molecular representations are used. Hence, molecular representations play a vital role in QSAR modeling. In this thesis, we focus on molecular representations derived from the 2D structures of molecules for QSAR modeling. The 2D molecular representations can be rapidly computed without specific molecular structure optimization and usually result in good model performance [16]. The 2D representations of a molecule usually encode the topological information of its molecular graph. A molecular graph is an undirected graph whose nodes correspond to the atoms of the molecule and whose edges correspond to chemical bonds. Classic machine learning algorithms require the input data to be formatted as a fixed-size vector (Figure 1.4). Hundreds of molecular descriptors and fingerprints have been developed for this purpose [26]. Molecular descriptors are numerical values associated with molecular structures, such as molecular weight, logP, etc. Molecular fingerprints are bit vectors that represent the presence/absence of substructures in a molecule.

Figure 1.4. Featurization of 2D Molecular Structure.
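As an illustrative sketch (not taken from the original studies), 2D featurization of a single molecule might look as follows with RDKit; aspirin is used purely as an example input, and the radius/bit-length settings are one common choice.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin, used only as an example input
mol = Chem.MolFromSmiles(smiles)

# Molecular descriptors: single numerical values computed from the 2D structure
print(Descriptors.MolWt(mol), Descriptors.MolLogP(mol))

# Morgan (circular) fingerprint with radius 3 (ECFP6-like): fixed-size bit vector
# encoding the presence/absence of substructures
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 3, nBits=2048)
print(fp.GetNumBits(), fp.GetNumOnBits())
```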

Various machine learning algorithms have been used for QSAR modeling [16]. The major algorithms used in this document are (1) tree-based models, e.g., decision tree (DT), random forest (RF), gradient boosting machine, etc.; (2) k-nearest neighbors (kNN); (3) kernel methods, e.g., support vector machine (SVM); and (4) deep learning models, e.g., fully connected neural network (FCNN), recurrent neural network (RNN), etc.

A Decision Tree model (Figure 1.5a) has a flowchart-like structure that uses a sequence of yes/no questions to classify the input data into the outputs. Random Forests and Gradient Boosting Machines are more robust and powerful tree-based models that ensemble a large number (usually hundreds to thousands) of different decision trees (Figure 1.5b). The final prediction is made by averaging the predictions from all the decision trees.

k-Nearest Neighbors (Figure 1.5c) finds the k data points most similar to the target (unknown) data point in the feature space and assigns a label to it based on the weighted/unweighted voting of the labels of the chosen neighbors.

Figure 1.5. Machine Learning Methods.

Support Vector Machine (Figure 1.5d) is a type of kernel method that aims to find the best decision boundary to separate data points belonging to different classes. The decision boundary in the high-dimensional feature space (or another representation if kernels are used) is formed by maximizing the distance between the decision boundary and the closest data points (also called support vectors) from each class. For a new data point, the class is determined by checking which side of the decision boundary it falls on.


Neural Networks (Figure 1.5e) consist of multiple layers of neurons. There are three types of layers in a neural network: the input layer, the hidden layers, and the output layer. Neural networks with more than two hidden layers are called deep neural networks (also referred to as deep learning). Modern deep learning models usually involve tens of layers. Each hidden layer takes the input from the previous layer, transforms the data (nonlinearly), and then passes the output to the next layer. By transforming the input data nonlinearly multiple times through the hidden layers, the model learns meaningful intermediate representations that map the input data to the expected output. Deep learning has achieved transformative impacts on many intelligence tasks such as language translation, image/speech recognition, and gaming. Recently, there has been increasing interest in applying deep learning techniques in chemistry [4, 6, 17, 27–30]. In addition to QSAR modeling tasks, deep learning is also used for solving complex problems such as molecular design [2–6] and synthesis planning [12–15].
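For illustration, a small fully connected network for a regression endpoint could be set up as sketched below with scikit-learn's MLPRegressor. The two hidden layers, their sizes, and the random placeholder data are arbitrary choices made for this example, not settings used in the thesis.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical placeholder descriptor matrix and endpoint values
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 64))
y = rng.normal(size=300)

# Two hidden layers (128 and 64 neurons); each layer applies a nonlinear (ReLU) transformation
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(128, 64), activation="relu", max_iter=500, random_state=1),
)
model.fit(X, y)
print(model.predict(X[:3]))
```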


References

1. Engel T (2006) Basic overview of chemoinformatics. J Chem Inf Model 46:2267–2277
2. Elton DC, Boukouvalas Z, Fuge MD, Chung PW (2019) Deep learning for molecular design - A review of the state of the art. Mol Syst Des Eng 4:828–849. https://doi.org/10.1039/c9me00039a
3. Ståhl N, Falkman G, Karlsson A, et al. Deep Reinforcement Learning for Multiparameter Optimization in de novo Drug Design. https://s3-eu-west-1.amazonaws.com/itempdf74155353254prod/7990910/Deep_Reinforcement_Learning_for_Multiparameter_Optimization_in_de_novo_Drug_Design_v1.pdf
4. Chen H, Engkvist O, Wang Y, et al (2018) The rise of deep learning in drug discovery. Drug Discov Today 23:1241–1250. https://doi.org/10.1016/j.drudis.2018.01.039
5. Lo Y-C, Rensi SE, Torng W, Altman RB (2018) Machine learning in chemoinformatics and drug discovery. Drug Discov Today 23:1538–1546. https://doi.org/10.1016/j.drudis.2018.05.010
6. Lavecchia A (2019) Deep learning in drug discovery: opportunities, challenges and future prospects. Drug Discov Today 24:2017–2032. https://doi.org/10.1016/j.drudis.2019.07.006
7. St John PC, Phillips C, Kemper TW, et al (2019) Message-passing neural networks for high-throughput polymer screening. J Chem Phys 150:241722. https://doi.org/10.1063/1.5099132
8. Jensen JH (2019) A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space. Chem Sci 10:3567–3572. https://doi.org/10.1039/c8sc05372c
9. Cortés-Ciriano I, Firth NC, Bender A, Watson O (2018) Discovering Highly Potent Molecules from an Initial Set of Inactives Using Iterative Screening. J Chem Inf Model 58:45. https://doi.org/10.1021/acs.jcim.8b00376
10. Nocedo-Mena D, Cornelio C, Camacho-Corona MDR, et al (2019) Modeling Antibacterial Activity with Machine Learning and Fusion of Chemical Structure Information with Microorganism Metabolic Networks. J Chem Inf Model 59:1109–1120. https://doi.org/10.1021/acs.jcim.9b00034
11. Hoffmann T, Gastreich M (2019) The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discov Today. https://www.sciencedirect.com/science/article/pii/S1359644618304471?dgcid=rss_sd_all
12. Coley CW, Green WH, Jensen KF (2018) Machine Learning in Computer-Aided Synthesis Planning. Acc Chem Res 51:1281–1289. https://doi.org/10.1021/acs.accounts.8b00087
13. Zhou Z, Li X, Zare RN (2017) Optimizing Chemical Reactions with Deep Reinforcement Learning. ACS Cent Sci 3:1337–1344. https://doi.org/10.1021/acscentsci.7b00492
14. Schwaller P, Laino T, Gaudin T, et al (2018) Molecular Transformer - A Model for Uncertainty-Calibrated Chemical Reaction Prediction. https://arxiv.org/pdf/1811.02633.pdf
15. Schreck JS, Coley CW, Bishop KJM. Learning Retrosynthetic Planning through Simulated Experience. https://pubs.acs.org/doi/10.1021/acscentsci.9b00055
16. Cherkasov A, Muratov EN, Fourches D, et al (2014) QSAR modeling: Where have you been? Where are you going to? J Med Chem 57:4977–5010. https://doi.org/10.1021/jm4004285
17. Mater AC, Coote ML (2019) Deep Learning in Chemistry. J Chem Inf Model 59:2545–2559. https://doi.org/10.1021/acs.jcim.9b00266
18. Tropsha A (2010) Best Practices for QSAR Model Development, Validation, and Exploitation. Mol Inform 29:476–488. https://doi.org/10.1002/minf.201000061
19. Ma J, Sheridan RP, Liaw A, et al (2015) Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships. J Chem Inf Model 55:263–274. https://doi.org/10.1021/ci500747n
20. Fourches D, Williams AJ, Patlewicz G, et al (2018) Computational Tools for ADMET Profiling. In: Computational Toxicology. pp 211–244
21. Chollet F (2017) Deep Learning with Python, 1st edition. Manning Publications
22. Hansch C, Maloney PP, Fujita T, Muir RM (1962) Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients. Nature 194:178–180. https://doi.org/10.1038/194178b0
23. Fourches D, Muratov E, Tropsha A (2010) Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J Chem Inf Model 50:1189–1204. https://doi.org/10.1021/ci100176x
24. Fourches D, Muratov E, Tropsha A (2016) Trust, but Verify II: A Practical Guide to Chemogenomics Data Curation. J Chem Inf Model 56:1243–1252. https://doi.org/10.1021/acs.jcim.6b00129
25. Fourches D, Ash J (2019) 4D-quantitative structure–activity relationship modeling: making a comeback. Expert Opin Drug Discov 1–9. https://doi.org/10.1080/17460441.2019.1664467
26. Todeschini R, Consonni V (2009) Molecular Descriptors for Chemoinformatics. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, Germany
27. Goh GB, Hodas NO, Vishnu A (2017) Deep learning for computational chemistry. J Comput Chem 38:1291–1307
28. Unterthiner T, Mayr A, Klambauer G, et al. Deep Learning as an Opportunity in Virtual Screening
29. Greene CS, Ching T, Himmelstein DS, et al (2018) Opportunities and obstacles for deep learning in biology and medicine. http://rsif.royalsocietypublishing.org/
30. Popova M, Isayev O, Tropsha A (2018) Deep reinforcement learning for de novo drug design. Sci Adv 4:eaap7885. https://doi.org/10.1126/sciadv.aap7885


Chapter 2. Hierarchical H-QSAR Modeling Approach for Integrating Binary/Multi Classification and Regression Models of Acute Oral Systemic Toxicity

Xinhao Li1, Nicole C. Kleinstreuer2,3, and Denis Fourches1,*

1 Department of Chemistry, Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27695, United States.
2 Division of Intramural Research/Biostatistics and Computational Biology Branch, NIEHS, RTP, North Carolina 27709, United States.
3 National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, NIEHS, RTP, North Carolina 27709, United States.

* To whom correspondence should be addressed. Email: [email protected]

Published in Chemical Research in Toxicology (Chem. Res. Toxicol. 2020, 33, 2, 353–366)


Abstract

Reliable in silico approaches to replace animal testing for the evaluation of potential acute toxic effects are highly demanded by regulatory agencies. In particular, quantitative structure-activity relationship (QSAR) models have been used to rapidly assess chemical-induced toxicity using either continuous (regression) or discrete (classification) predictions. However, it is often unclear how those different types of models can complement and potentially help each other to afford the best prediction accuracy for a given chemical. This paper presents a novel, dual-layer hierarchical modeling method to fully integrate regression and classification QSAR models for assessing rat acute oral systemic toxicity with respect to regulatory classifications of concern. The first layer of independent regression, binary, and multiclass models (base models) was built solely using computed chemical descriptors/fingerprints. Then, a second layer of models (hierarchical models) was built by stacking all the cross-validated out-of-fold predictions from the base models. All models were validated using an external test set, and we found that the hierarchical models outperformed the base models for all three endpoints. The H-QSAR modeling method represents a promising approach for chemical toxicity prediction and, more generally, for stacking and blending individual QSAR models into more predictive ensemble models.


2.1. Introduction

Thorough toxicity evaluation is an important step to ensure the environmental safety of chemicals. For decades, experimental protocols relying on animal testing have been routinely used for determining the potential toxic effects of chemicals on critical human health endpoints [1].

However, these in vivo procedures are not only costly and time-consuming, but the use of animals raises ethical issues, has questionable relevance to human biology, and is less and less tolerated in modern societies [2]. Due to the constantly increasing number of chemicals requiring toxicological evaluations, there is a high demand for alternatives to replace, reduce, and refine (3Rs) the use of animal testing [3, 4]. In fact, the development and implementation of alternative methods to animal testing for the evaluation of potential acute toxic effects are highly needed by regulatory agencies.

Recent announcements by the US Environmental Protection Agency indicate a progressive ban of animal testing in the US in the near future, with this ban building upon the comparable robustness and reliability of in silico models augmented by targeted in vitro tests. In other words, in silico methods are slowly becoming an essential component of chemical screening and prioritization, fully complementary to in vitro assays in integrated approaches to testing and assessment [5].

With the objective of developing alternative test methods to replace animal use for acute toxicity tests, the NTP Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM), in collaboration with the U.S. EPA's National Center for Computational Toxicology (NCCT), collected a large body of rat acute oral lethality (LD50) data from a number of publicly accessible resources and organized a community-based project to develop in silico models of acute oral systemic toxicity, a toxicity test used by a variety of regulatory bodies for diverse decision criteria [6]. The dataset covered five modeling endpoints needed by regulatory agencies: (1) LD50 point estimates (mg/kg); (2) very toxic compounds (<50 mg/kg vs. all others); (3) nontoxic compounds (>2,000 mg/kg vs. all others); (4) hazard categories under the U.S. Environmental Protection Agency (EPA) classification system (n = 4 classes); and (5) hazard categories under the United Nations Globally Harmonized System of Classification and Labelling (GHS) (n = 5 classes). The LD50, the amount of a chemical resulting in the death of 50% of a group of test animals, is a commonly used measurement to compare the acute toxic potential of different chemicals: the smaller the value, the more toxic the chemical. The LD50 value predictions required a continuous regression modeling approach, while the endpoints of 'very toxic' and 'nontoxic' chemicals were binary classifications, and the last two endpoints of EPA or GHS hazard categories required multiclass models.

Quantitative structure-activity relationship (QSAR) modeling [7] is a major in silico approach relying on machine learning techniques and sets of molecular descriptors directly computed from chemical structures. QSAR models are used in drug discovery and chemical toxicity prediction to rapidly estimate continuous (regression) or discrete (classification) endpoints. The regression models quantitatively predict numerical values (e.g., LD50) of chemicals based on their molecular structures. Meanwhile, classification models with good prediction accuracy provide valuable insight, even though they are less informative and sensitive when assessing, for instance, subtle continuous changes in toxicity within a chemical class (predictors of LD50 values are needed in those cases). Obviously, there are both advantages and challenges to developing continuous versus classification models for a given chemical set.

Building a binary classification model can be entirely appropriate for a given endpoint: for instance, QSAR models for Ames mutagenicity [8, 9] make perfect sense, as the assay is traditionally interpreted as binary (mutagens versus non-mutagens) and has an estimated experimental reproducibility of 85%. However, other endpoints may require the development of multi-class models in order to be relevant and actually useful for regulators: for example, QSAR models for skin sensitization [10] are now typically developed and applied as 4-class models (strong/severe sensitizers, moderate sensitizers, weak sensitizers, and non-sensitizers) and achieve levels of accuracy similar to experiment [10]. Meanwhile, for most toxicity endpoints, developing robust and highly reliable continuous models (R2 > 0.6) is difficult due to the broad chemical diversity of training sets, reflecting the underlying disparity of mechanisms of action among the different compounds.

Herein, we posit that developing new strategies to better synergize regression and classification models could ultimately boost the performance, interpretability, and applicability domain of in silico models for chemical toxicity prediction. Several regression [11–13] and classification [14–16] QSAR models have been developed to predict the oral acute toxicity of chemicals. However, there is still no clear workflow or approach in which regression and classification models complement one another in order to provide the best prediction accuracy for a given chemical. Recently, Xu et al. [17] used a multitask neural network that was simultaneously trained on regression and multi-class classification tasks. The results showed that the multitask model improved the consistency of the regression and classification models but yielded no significant improvement in overall performance. Ensemble methods [18], also known as consensus modeling, have been successfully developed and applied to a wide variety of QSAR modeling problems to boost predictive performance by combining the predictions of multiple individual models [19–23]. Stacking is an efficient ensemble method in which the predictions generated by first-level models are used as input for a second-level model [24, 25]. However, the two levels of models utilized in stacking protocols usually belong to the same model type for the same endpoint.


In this proof-of-concept study, we developed a novel, dual-layer hierarchical modeling method to build and stack QSAR models for predicting categorical (binary toxic/nontoxic and four EPA-defined categories) and continuous (LD50) endpoints for rat acute oral toxicity. The concept of hierarchical QSAR was previously introduced by Basak [26–29], in which the term 'hierarchical' refers to a hierarchy of molecular descriptors. In Basak's approach, more complex and computationally intensive descriptors are only used when they can provide significant improvement to the predictions. On the other hand, the approach we propose here builds an actual hierarchy of QSAR models. The first-layer base models (regression, binary, and multiclass) were built with computed molecular descriptors and fingerprints. Then, a second layer of regression, binary, and multiclass models (hierarchical models) was built by stacking the outputs from all the base models. The idea is that different types of base models can learn the training data from different perspectives (via different machine learning techniques and molecular descriptors) and then complement each other in a stacking-like workflow, resulting in a boost of prediction performance and applicability domain for the hierarchical (H-)QSAR models.

2.2. Materials and Methods

Dataset. The rat acute oral toxicity data used in this study were collected by the National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) and the U.S. EPA National Center for Computational Toxicology (NCCT) from a number of publicly available datasets and resources. The full description and the actual dataset are available at https://ntp.niehs.nih.gov/go/tox-models. The whole dataset, comprising 11,992 compounds, was semi-randomly split by the organizers of the project into a training set (75%) and an external test set (25%) with equivalent coverage with respect to the LD50 distribution. The LD50 distributions of the curated training and test sets (see Data Preparation section) are shown in Figure A1 in Appendix A. The binary endpoint used in this study was defined based on a well-defined LD50 threshold (nontoxic compounds having LD50 greater than or equal to 2,000 mg/kg). The multi-categorization of toxicity hazard using the EPA classification scheme (classes I, II, III, IV) was also established on the basis of LD50 values: chemicals with LD50 < 50 mg/kg were categorized as EPA I (danger/poison), 50 mg/kg < LD50 < 500 mg/kg as EPA II (warning), 500 mg/kg < LD50 < 5,000 mg/kg as EPA III (caution), and chemicals with LD50 > 5,000 mg/kg as EPA IV (caution).

Data preparation. The training and external test sets were curated according to standard curation protocols developed earlier [8, 30, 31]: (1) remove counterions in salts and neutralize; (2) remove mixtures; (3) filter molecules comprising the following elements only: H, C, N, O, F, Br, I, Cl, P, S; (4) normalize the structures of specific chemotypes, such as aromatic and nitro groups; (5) identify, analyze, and remove structural duplicates and, if necessary, conflicting labels. All chemicals were curated using the KNIME software [32]. After curation, the training and external test sets contained 8,211 and 2,843 molecules, respectively. The number of molecules for each endpoint is summarized in Table 2.1.

Table 2.1. Summary of curated training set and external test set.

Target            Model        Training set   Test set
LD50              Regression   6,089          2,144
Toxic/Nontoxic    Binary       8,209          2,820
EPA category      Multiclass   8,126          2,842
Overall           –            8,211          2,843
Mutual            –            6,089          2,144
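The curation itself was performed in KNIME; as a rough, hypothetical RDKit analogue of parts of the protocol above (salt stripping, mixture removal, element filtering, and duplicate removal; neutralization and chemotype normalization are omitted), a sketch could look like the following. The element list is taken from the protocol; everything else is illustrative.

```python
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

ALLOWED_ELEMENTS = {"H", "C", "N", "O", "F", "Br", "I", "Cl", "P", "S"}
remover = SaltRemover()

def curate(smiles_list):
    """Toy curation: strip salts, drop mixtures, filter elements, remove duplicates."""
    seen, curated = set(), []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue                                      # unparsable structure
        mol = remover.StripMol(mol)                       # (1) remove counterions from salts
        if mol.GetNumAtoms() == 0:
            continue
        canonical = Chem.MolToSmiles(mol)                 # canonicalization normalizes aromaticity
        if "." in canonical:
            continue                                      # (2) remaining multi-component mixture
        if not all(a.GetSymbol() in ALLOWED_ELEMENTS for a in mol.GetAtoms()):
            continue                                      # (3) keep only the allowed elements
        if canonical not in seen:                         # (5) drop structural duplicates
            seen.add(canonical)
            curated.append(canonical)
    return curated

print(curate(["CC(=O)[O-].[Na+]", "CCO", "CCO", "C[Si](C)(C)C"]))
```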

Molecular Descriptors. Molecular descriptors were computed based on 2D molecular structures. Bit-based and count-based Morgan fingerprints with a radius of 3 (ECFP6 [33]), MACCS key fingerprints, and RDKit descriptors were computed using the RDKit [34] package in Python. Mordred descriptors were computed using the Mordred [35] package in Python. The zero/near-zero variance descriptors and highly correlated pairs of descriptors in the training set were identified and filtered out. If the absolute correlation (|r|) between any pair of descriptors was larger than 0.9, the descriptor with the largest mean |r| was removed. After descriptor selection, 2,048 ECFP6_bits, 2,046 ECFP6_counts, 145 MACCS key, 159 RDKit, and 459 Mordred descriptors remained. The same descriptors were also computed for the compounds in the external test set.
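A minimal sketch of this variance/correlation filter, written here for illustration with pandas (the 0.9 threshold is from the text; the implementation details are assumed, not the actual code used), is shown below.

```python
import pandas as pd

def filter_descriptors(df, var_threshold=1e-8, corr_threshold=0.9):
    """Drop zero/near-zero-variance columns, then one member of each highly correlated pair."""
    df = df.loc[:, df.var() > var_threshold]
    corr = df.corr().abs()
    mean_corr = corr.mean()
    to_drop = set()
    cols = list(corr.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > corr_threshold:
                # drop the descriptor of the pair with the larger mean absolute correlation
                to_drop.add(cols[i] if mean_corr[cols[i]] >= mean_corr[cols[j]] else cols[j])
    return df.drop(columns=sorted(to_drop))

# Toy descriptor table: "a" and "b" carry the same information, so one of them is dropped
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, 0, 1, 0]})
print(filter_descriptors(toy).columns.tolist())
```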

QSAR Modeling. For the regression models, the target endpoint is the experimental logLD50 (mmol/kg) value of each molecule. For the binary models, the training endpoint is a defined label (toxic/nontoxic), and for the multiclass models, the training target is the toxicity category (Category I, II, III, IV) defined by the U.S. EPA criteria for acute oral systemic toxicity. The first-layer models trained with computed chemical descriptors/fingerprints are called base models, and the second-layer models trained by stacking the base models' out-of-fold predictions are called hierarchical models.

The algorithm for building hierarchical models is summarized in Figure 2.1. The first-layer base regression, binary, and multiclass models are built with various combinations of machine learning algorithms and chemical descriptors/fingerprints. Out-of-fold predictions of the base models are generated through 10-fold cross-validation. The out-of-fold predictions of all base models are concatenated together and used as input (meta features) for building the second-layer hierarchical regression, binary, and multiclass models.

Building a diverse set of first-layer models is crucial for building powerful stacking (second-layer) models. In order to enhance model diversity, base models were trained using various combinations of machine learning algorithms and chemical descriptors/fingerprints. We chose the combinations of four machine learning algorithms (RF, SVM, kNN, and XGBoost) and five chemical descriptors/fingerprints (ECFP6_Bits, ECFP6_Counts, MACCS keys, RDKit descriptors, and Mordred descriptors) for building the base regression, binary, and multiclass models. For each endpoint, a diverse set of 20 models was built at this stage (60 base models in total). The hyperparameters of all base models were optimized through grid search using 5-fold cross-validation.

The regression models output real values, and the classification models can output either the predicted labels or probabilities. Herein, the predicted logLD50 values from the base regression models, the predicted class probabilities (Pnontoxic) from the base binary models, and the predicted probabilities for each of the four EPA categories from the base multiclass models are used as input for the hierarchical models. We only used the predicted probabilities of the first three classes (PI, PII, and PIII) from the base multiclass models, since the predicted probability of the fourth class can be deduced from the predicted probabilities of the first three classes.


Figure 2.1. Overall workflow for building hierarchical QSAR models. Base regression, binary and multiclass models (60 models in total) are built with diverse combinations of machine learning algorithms and chemical descriptors/fingerprints. Out-of-Fold Predictions of base models are generated through 10-fold cross-validation. The out-of-fold predictions are concatenated together and used as input (Meta Features) for building hierarchical regression, binary and multiclass models.

In a stacking procedure, if the first-layer models are fit to the same training data that was used to prepare the meta features (the input for the hierarchical models), the second-layer models would overfit, since the target responses would have been used twice during the modeling (information leakage). Consequently, the resulting models would generalize poorly to new data (e.g., new compounds to be assessed by the model). Thus, to obtain the meta features from a base model for one particular data point, the model must have been fit on a training set which does not include that data point. Therefore, we prepared the meta features according to a 10-fold cross-validation [24] protocol: the modeling set with computed chemical descriptors/fingerprints was randomly divided into 10 subsets of approximately equal size; one subset is put aside as a prediction set while the rest form the training set. The base models were built solely using the training set and applied to the prediction set to obtain cross-validated predictions. These are also known as out-of-fold (OOF) predictions and can be safely used as the meta features. This procedure was repeated ten times, allowing each of the ten subsets to be used as a prediction set once. It is important to emphasize that the prediction set molecules were not used to build the models that yielded predictions for those specific chemicals, thus avoiding the problem of information leakage.
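For illustration only, out-of-fold meta features of this kind can be generated with scikit-learn's cross_val_predict, as in the following sketch. The placeholder arrays and the three example base models are assumptions made for this example; they are not the actual data or the 60 base models of this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR

# Hypothetical placeholders standing in for one descriptor block and the endpoints
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 64))
y_reg = rng.normal(size=400)                 # continuous endpoint (logLD50-like)
y_bin = (y_reg > 0).astype(int)              # binary endpoint (toxic/nontoxic-like)

# Out-of-fold predictions: each compound is predicted by a model that never saw it during fitting
oof_rf = cross_val_predict(RandomForestRegressor(random_state=0), X, y_reg, cv=10)
oof_svr = cross_val_predict(SVR(), X, y_reg, cv=10)
oof_prob = cross_val_predict(RandomForestClassifier(random_state=0), X, y_bin,
                             cv=10, method="predict_proba")[:, 1]

# Concatenated out-of-fold predictions become the meta features of the second-layer model
meta_features = np.column_stack([oof_rf, oof_svr, oof_prob])
print(meta_features.shape)                   # (400, 3); the real workflow stacks 60 base models
```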

As shown in Table 2.1, the labels of some molecules were incomplete in the training set: the whole training set contains 8,211 molecules, but only 6,089 molecules (those with quantitative

LD50 values) actually had all three experimental labels available. The other molecules had at least one missing label, due largely to the presence of limit tests (e.g., LD50>5,000 mg/kg/day) or acute toxic class protocols where exact LD50 values were not calculated and chemicals were assigned a toxicity class based on a range. As a result, we were not able to get the meta features for all the molecules in the training set via cross-validation. For those molecules with incomplete labels, meta features of the missing labels were collected from the predictions of the corresponding base model trained on the whole labeled training data. The unlabeled molecules were not seen by the corresponding base model, so again there was no risk of information leakage. The concatenation of predictions from all the base models (60 models in total) are used as input for building hierarchical models.

The hierarchical regression, multiclass, and binary models were fit with the predictions from a total of 60 base models and tuned with four machine learning algorithms: kNN, SVM, RF, and XGBoost. The hyperparameters of the models were optimized through grid search using 5-fold cross-validation.
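A hypothetical sketch of this second-layer tuning step is shown below using scikit-learn's GridSearchCV with an XGBoost regressor; the parameter grid is illustrative only (the grids actually explored are listed in the supporting tables), and the placeholder arrays stand in for the real meta features.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Hypothetical meta-feature matrix (one column per base-model prediction) and regression target
rng = np.random.default_rng(0)
meta_features = rng.normal(size=(400, 60))
y_reg = rng.normal(size=400)

# Illustrative grid only; not the search space used in the study
param_grid = {"n_estimators": [200, 500], "max_depth": [3, 6], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(XGBRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_root_mean_squared_error")
search.fit(meta_features, y_reg)
print(search.best_params_)
```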

Model Evaluation. All models were evaluated with 10-fold cross-validation and the external test set.

Evaluation metrics. The root-mean-square-error (RMSE, eq 1) and the coefficient of determination (R2, eq 2) were used as evaluation metrics for the regression models.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{1}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{2}$$

The classification models were evaluated with the metrics computed from the confusion matrix. A confusion matrix reports the number of false positives (FP), false negatives (FN), true positives (TP) and true negatives (TN). Accuracy (eq 3), Sensitivity (eq 4), Specificity (eq 5),

Balanced Accuracy (BA, eq 6), Recall (eq 7), Precision (eq 8), F1-score (eq 9) and Matthews correlation coefficient [36] (MCC, eq 10) were computed to evaluate the classification models.

Sensitivity, also referred as true positive rate, measures the ability of a model to correctly detect a positive sample as positive. Specificity, also referred as true negative rate, measures the ability of a model to find all negative samples. The balanced accuracy is the average of sensitivity and specificity. Recall measures the ability of a model to find all positive samples. Precision, also referred to as positive predictive value (PPV), measures the ability of a model not to label a positive sample as negative. The f1-score is the harmonic mean of precision and recall. MCC is a balanced measure which takes into account true and false positives and negatives and particularly suited for imbalanced data sets. The MCC is a correlation coefficient value between –1 and +1. A coefficient of +1 indicates a prefect prediction, 0 indicates no better than random prediction, and –1 indicates total disagreement between prediction and observation.


$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{3}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{4}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{5}$$

$$\mathrm{Balanced\ Accuracy\ (BA)} = \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2} \tag{6}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$

$$\mathrm{F1\text{-}score} = \frac{2 \times (\mathrm{Precision} \times \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{10}$$

The overall accuracy, balanced accuracy, F1-score, and MCC of the classification models (binary and multiclass) were reported. In addition, the area under the receiver operating characteristic curve (AUROC) is also reported for the binary models. For the multiclass models, the reported F1-score is the weighted average over the four classes: the metric was first calculated for each class and then averaged, weighted by the frequency of each class. The multiclass MCC [37] can be defined in terms of a K × K (K classes) confusion matrix C (eq 11).

$$\mathrm{Multiclass\ MCC} = \frac{c \times s - \sum_{k}^{K} p_k \times t_k}{\sqrt{\left(s^2 - \sum_{k}^{K} p_k^2\right)\left(s^2 - \sum_{k}^{K} t_k^2\right)}} \tag{11}$$

The following intermediate variables were considered to simplify the definition of the multiclass MCC: $t_k = \sum_{i}^{K} C_{ik}$ is the number of times class k truly occurred, $p_k = \sum_{i}^{K} C_{ki}$ is the number of times class k was predicted, $c = \sum_{k}^{K} C_{kk}$ is the total number of samples correctly predicted, and $s = \sum_{i}^{K} \sum_{j}^{K} C_{ij}$ is the total number of samples. The value of the multiclass MCC no longer ranges from –1 to +1; the minimum value lies somewhere between –1 and 0, depending on the number and distribution of the ground-truth labels. The maximum value is still +1.
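All of these metrics are available in scikit-learn; the following sketch, added here for illustration with toy predictions (not values from this study), shows how eqs 1–2, 6, and 9–11 map onto library calls.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, f1_score, matthews_corrcoef,
                             mean_squared_error, r2_score, roc_auc_score)

# Toy values standing in for model outputs on the external test set
y_true_reg = np.array([0.2, -1.1, 0.5, 1.3])
y_pred_reg = np.array([0.1, -0.9, 0.7, 1.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))          # eq 1
r2 = r2_score(y_true_reg, y_pred_reg)                               # eq 2

y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.3, 0.1])
y_pred = (y_prob >= 0.5).astype(int)
ba = balanced_accuracy_score(y_true, y_pred)                        # eq 6
f1 = f1_score(y_true, y_pred)                                       # eq 9
mcc = matthews_corrcoef(y_true, y_pred)                             # eq 10 (eq 11 for multiclass labels)
auroc = roc_auc_score(y_true, y_prob)
print(rmse, r2, ba, f1, mcc, auroc)
```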

Implementation. kNN, SVM, and RF models were built using the Scikit-learn [38] package, and XGBoost models were built using the XGBoost [39] package in Python. The code for this study is available at: https://github.com/XinhaoLi74/Hierarchical-QSAR-Modeling.

2.3. Results

2.3.1. Chemical space of the full dataset

To investigate the chemical space of the full dataset (11,056 molecules), a t-Distributed Stochastic Neighbor Embedding [40] (t-SNE) was calculated using the bit-based ECFP6 (2,048 bits) (Figure 2.2). Similar molecules tend to have similar acute oral toxicity. Representative molecules from some toxic clusters are shown in Figure 2.2. About 24% of the molecules have no quantified LD50 values (colored in aqua). The analysis showed that the molecules with quantified LD50 values occupied a smaller chemical space compared to the full data. A standalone interactive visualization of the chemical space is available at https://pubs.acs.org/doi/10.1021/acs.chemrestox.9b00259 (as one HTML file).


Figure 2.2. Clustering of the full dataset (11,056 molecules) by t-SNE with bit-based ECFP6 (2048 bits).
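For illustration, a comparable t-SNE projection of ECFP6 bit vectors can be computed with RDKit and scikit-learn as sketched below on a tiny toy set of SMILES; the toy molecules, the perplexity, and the other settings are arbitrary examples, not the ones used for Figure 2.2.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.manifold import TSNE

# Tiny toy set of SMILES; the actual analysis used the full 11,056-molecule dataset
smiles = ["CCO", "CCN", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCCCCC", "CCOC(=O)C"]

def ecfp6_bits(smi, n_bits=2048):
    """Bit-based Morgan fingerprint of radius 3 as a numpy array."""
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 3, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.array([ecfp6_bits(s) for s in smiles])

# Project the 2048-bit fingerprints into two dimensions for plotting the chemical space
coords = TSNE(n_components=2, perplexity=3, random_state=42, init="pca").fit_transform(X)
print(coords.shape)   # (6, 2)
```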

2.3.2. Models development, validation, and comparison

Herein, the diversity of base models plays an important role in building a powerful ensemble model. Base binary, multiclass, and regression models were developed using five sets of molecular descriptors/fingerprints (bit-based ECFP6, count-based ECFP6, MACCS keys, RDKit descriptors, and Mordred descriptors) with four machine learning algorithms (kNN, SVM, RF, and XGBoost) (see the Methods section). Then, a set of hierarchical binary, multiclass, and regression models was developed based on the stacking method described in the Methods section with four machine learning algorithms: kNN, SVM, RF, and XGBoost. The hyperparameters were optimized through grid search with 5-fold cross-validation (see Tables A1-A4 in the Supporting Information). All the models were validated with 10-fold cross-validation, and the performances are summarized in Tables A5-A7.

Base and hierarchical models were further assessed using the external test sets of 2,144 compounds, 2,820 compounds and 2,842 compounds for regression, binary and multiclass models, respectively (Figures 2.3-2.5). The performances on the external test sets can provide unbiased evaluations of fitted models since the data in the external test set was never used at the training stage. Models were evaluated by RMSE, AUROC and MCC for regression, binary and multiclass models, respectively. The results of other evaluation metrics can be found in Tables A8-A10. For all three endpoints, hierarchical models did outperform base models. This is consistent with the

10-fold cross-validation result from the training data (Tables A5-A7). This significant boost of performances shows that the hierarchical models actually benefitted from the stacking procedure.

Hierarchical models trained with different machine learning algorithms have similar performances on all three endpoints. Base and hierarchical consensus models were also built by unweighted averaging of the predictions of the base and hierarchical models, respectively. The base consensus models did outperform the individual base models for all three endpoints. Hierarchical consensus models showed similar or slightly better results compared to the individual hierarchical models. Again, the hierarchical models did outperform the base consensus models for all three endpoints. This indicates that model stacking is a more efficient and powerful ensemble method than model consensus.
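For completeness, the consensus used here is plain unweighted averaging, sketched below with placeholder numbers (regression predictions are averaged directly; classification models average their predicted class probabilities).

```python
# Minimal sketch of unweighted consensus predictions; values are placeholders.
import numpy as np

# One row per individual model, one column per test compound.
regression_preds = np.array([[-0.4, -1.2], [-0.6, -1.0], [-0.5, -1.3]])
consensus_logLD50 = regression_preds.mean(axis=0)

binary_probs = np.array([[0.8, 0.3], [0.7, 0.4]])
consensus_prob_toxic = binary_probs.mean(axis=0)
print(consensus_logLD50, consensus_prob_toxic)
```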

Table 2.2 compares the base and hierarchical consensus models on various metrics. Hierarchical consensus models outperform base consensus models according to all metrics.

Figure 2.3 shows the RMSE obtained for the regression models. The RMSE of base models ranges from 0.56 to 0.63. Among base models, the model trained with RF and RDKit descriptors achieved the best performance. The base consensus model led to a lower RMSE of


0.55. Hierarchical regression models trained with kNN, SVM, RF and XGBoost achieved RMSE around 0.53. Figure 2.6 shows the scatter plots of predicted versus experimental logLD50

(mmol/kg) values obtained by the consensus models and hierarchical models.

Figure 2.4 shows AUROC of binary models. The AUROC of base models ranges from

0.70 to 0.78. The base model trained with XGBoost and Mordred descriptors did outperform other base models, resulting in AUROC of 0.78. The base consensus model has an AUROC of 0.78.

Hierarchical binary models trained with different machine learning algorithms have similar

AUROC around 0.79.

Figure 2.5 shows MCC of multiclass models. The MCC of base models ranges from 0.40 to 0.46. The base model trained with RF and Mordred descriptors did outperform other base models, which has an MCC of 0.46. The base consensus model led to a higher MCC of 0.48.

Hierarchical models trained with kNN, SVM and RF obtained a slightly higher MCC (0.49) than the hierarchical model trained with XGBoost (MCC = 0.48).


Figure 2.3. Performances of Regression Models on External Test Set.

Figure 2.4. Performances of Binary Models on External Test Set.


Figure 2.5. Performances of Multiclass Models on External Test Set.

Table 2.2. External test set performances of base and hierarchical models' consensus.

Model        Metric               Base model consensus   Hierarchical model consensus
Binary       AUROC                0.777                   0.792
             Accuracy             0.790                   0.800
             Balanced Accuracy    0.777                   0.792
             F1-Score             0.788                   0.799
             MCC                  0.567                   0.589
Multiclass   MCC                  0.475                   0.497
             Accuracy             0.669                   0.681
             Balanced Accuracy    0.578                   0.618
             F1-Score             0.651                   0.670
Regression   RMSE                 0.551                   0.526
             R2                   0.623                   0.656
             MAE                  0.398                   0.374



Figure 2.6. Scatter plots of the predicted versus experimentally measured logLD50 (mmol/kg) values. (a) Base Consensus Model; (b) Hierarchical Consensus Model; (c) Hierarchical Model (kNN); (d) Hierarchical Model (SVM); (e) Hierarchical Model (RF); (f) Hierarchical Model (XGBoost)

Figure 2.7 shows the visualizations of test set predictions from different types of base consensus models (Figure 2.7a) and hierarchical consensus models (Figure 2.7b). The hierarchical consensus regression, binary and multiclass models achieved higher consistency in


predictions than those of the base consensus models. This indicates that stacking predictions from different types of base models creates a complementary effect in the hierarchical models' predictions.


Figure 2.7. Scatter plots of consensus model predictions. The x-axis is the predicted logLD50 from the consensus regression model, and the y-axis is the predicted probability of the toxic class from the consensus binary model. The points are colored by the predicted EPA category from the consensus multiclass model. (a) Base Consensus Models; (b) Hierarchical Consensus Models.

To further illustrate the complementarity of those models, we selected several chemicals of interest and looked at the models’ prediction performances. As illustrated in Figure 2.8 for two


test set compounds, the radar plots on the right show the predictions from three types of models:

(1) the predicted logLD50 (mmol/kg) values from the regression models; (2) the predicted probabilities of the toxic class from the binary models; and (3) the predicted probabilities for each EPA toxic class from the multiclass models. The results of the base consensus models and the hierarchical consensus models are shown. The blue lines show the predictions from the base consensus models and the red lines show the predictions from the hierarchical consensus models.

The compound Fusarenon-X [41] (Figure 2.8a) is a type B trichothecene mycotoxin typically derived from Fusarium species and usually found in contaminated cereals. Fusarenon-X is a toxic chemical (experimental logLD50,exp = -1.95, Toxic and EPA class I) that causes the disruption of DNA synthesis by inhibiting protein synthesis. The base regression model gave a predicted logLD50 (mmol/kg) value of -0.39. The base binary model predicted this compound as nontoxic and the base multiclass model predicted this compound as class III. In general, the base models severely underestimated the toxicity of Fusarenon-X. On the other hand, the hierarchical regression model afforded a more accurate estimation with a predicted logLD50 (mmol/kg) value of -1.14. The hierarchical classification models made the right predictions of the toxic classes. The three hierarchical models obtained consistently better estimations of the toxicity for this compound.

VX [42] (Figure 2.8b) is an extremely toxic (logLD50 (mmol/kg) value of -4.34) synthetic compound in the thiophosphonate class. It was developed as a chemical warfare agent for military use. Both regression models failed to accurately estimate the logLD50 value because there are not enough training data with similarly extreme logLD50 values. However, all the models identified this compound as highly toxic (EPA Category I). From the radar plot, it can be seen that the hierarchical classification models predicted the toxicity of this compound more


effectively compared to the base classification models: higher predicted probability of class I

(multiclass model) and higher predicted probability of toxic (binary model).

Figure 2.8. Two selected chemicals in the test set and their respective predictions using the base and hierarchical models.


2.3.3. Applicability Domain for Hierarchical Models

Importantly, we noticed that the performances of QSAR models were not the same across different clusters of molecules. It is indeed well-known that QSAR models provide predictions with varying levels of confidence for different molecules. Defining an applicability domain [43]

(AD) for QSAR models provides users with an estimate of the confidence of each prediction. The predictions for molecules within the AD are more reliable than the predictions for molecules outside the AD.

For a given classification model, one way to identify highly confident predictions is to check the predicted probabilities: for example, if a binary model predicts a molecule as toxic with a probability of 0.99, then this model prediction has higher confidence compared to a molecule with a probability of 0.55. Based on the predicted probabilities, one can create a model’s prediction uncertainty zone, where predictions outside that zone (‘out-zone’ predictions) have higher levels of prediction confidence compared to the predictions inside the zone (‘in-zone’ predictions) [44].

For binary models, the boundary of the model prediction zone was defined as 0.50 ± z, where z is a parameter to control how broad the zone is. If z is 0.1, the molecules with predicted probabilities between 0.4 and 0.6 would be considered as ‘in-zone’ predictions, a.k.a. low confidence predictions and the molecules with predicted probabilities outside the range of [0.4, 0.6] are considered as ‘out-zone’ predictions, a.k.a. high confidence predictions. For multiclass models, a similar method can be applied: the multiclass model outputs one predicted probability for each class and the molecules with the largest class probability larger than a specific threshold (e.g., 0.5) would be considered as ‘out-zone’ predictions, a.k.a. high confidence predictions.

In this study, the zone boundary of binary models was set as 0.4 and 0.6. For the multiclass models, the threshold was set to 0.5. The threshold of 0.5 ensures there is a clear margin between


the largest class probability and the others. Overall, for an 'out-zone' prediction of the binary model, the difference between the two class probabilities is at least 0.2. For an 'out-zone' prediction of the multiclass model, the largest class probability is larger than the sum of the other three class probabilities.

With our hierarchical QSAR method, the toxicity of each molecule is estimated with three predictions: a logLD50 value from the regression model, a prediction of toxic/nontoxic from the binary model and a prediction of the EPA category from the multiclass model. One of the goals of this study is to investigate how regression and classification models can complement each other. The applicability domain of these models was defined according to the combination of the binary model prediction zone with the corresponding multiclass model prediction zone, resulting in four prediction zones: 'in-zone' (low confidence), 'half-zone-binary', 'half-zone-multiclass', and 'out-zone' (high confidence) (Figure 2.9). The applicability domain was also applied to the regression model: the 'out-zone' molecules defined by the classification models were also considered to have higher prediction confidence with respect to the regression models. In general, we considered the out-zone predictions as being the predictions within the applicability domain for all the models.
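A small sketch of this zone assignment is given below. The thresholds follow the text (binary zone between 0.4 and 0.6, multiclass cutoff of 0.5), but the exact mapping of the two 'half-zone' labels to the binary and multiclass models is an assumption made for illustration.

```python
# Hedged sketch of the four prediction zones used to define the applicability domain.
def prediction_zone(p_toxic, class_probs, z=0.1, multiclass_cutoff=0.5):
    """Assign a compound to one of the four prediction zones.

    p_toxic      -- predicted probability of the toxic class (binary model)
    class_probs  -- predicted probabilities of the four EPA categories (multiclass model)
    """
    binary_confident = abs(p_toxic - 0.5) > z            # outside the [0.4, 0.6] zone
    multiclass_confident = max(class_probs) > multiclass_cutoff
    if binary_confident and multiclass_confident:
        return "out-zone"             # high confidence, inside the AD
    if binary_confident:
        return "half-zone-binary"     # assumed label: only the binary model is confident
    if multiclass_confident:
        return "half-zone-multiclass"
    return "in-zone"                  # low confidence, outside the AD

print(prediction_zone(0.92, [0.70, 0.20, 0.05, 0.05]))   # -> 'out-zone'
```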


Figure 2.9. Model Prediction Zones and Applicability Domain.

Herein, the AD of hierarchical consensus models was defined by the prediction zones of the hierarchical consensus classification (binary and multiclass) models. The results are summarized in Figures 2.10-2.12 and Table A11. From the evaluation metrics, we found that the performances followed the trend out-zone (blue) > overall (grey) > in-zone (yellow) for all models.

This trend is clearly in line with our hypothesis that the prediction zones can identify the more confident predictions and improve the model performances for those out-zone compounds. Even though the regression models were not used to define the applicability domain, the hierarchical regression consensus model's predictions for the "out-zone" compounds still afforded better performances (R2 = 0.71, RMSE = 0.48, coverage = 79%, shown in Figure 2.10) than the overall full-set performance (R2 = 0.66, RMSE = 0.52, coverage = 100%, shown in Figure 2.10). This clearly underscores the relevance of having both regression and classification tasks.


Figure 2.10. Performances of Hierarchical Regression Consensus Model based on Prediction Zones.

Figure 2.11. Performances of Hierarchical Binary Consensus Model based on Prediction Zone.


Figure 2.12. Performances of Hierarchical Multiclass Consensus Model based on Prediction Zone.

2.4. Discussion

In the context of chemical toxicity assessment, the desire to work with categorical endpoints (toxic/nontoxic, EPA hazard categories, etc.) along with continuous endpoints is of high importance. Classification endpoints, often easier to interpret and communicate, can result from limit tests that are mostly not suitable for building regression models, and are codified in legal frameworks with specific implications such as personal protective equipment, packaging requirements, transportation restrictions, etc. On the other hand, regression models can be quite useful for property optimization, including the avoidance of detrimental ones via safer design, and for quantitative risk assessment purposes.

In this study, many of the LD50 values were obtained from limit tests which estimate LD50 values as being above/below a specific threshold, e.g., 2,000 mg/kg or 5,000 mg/kg, or acute toxic


class methods which define a range of values within which the LD50 value falls. These types of tests provide less information than explicitly quantified LD50 values, thus creating challenges for model development, particularly for regression models. However, the distribution of logLD50 values is approximately symmetrical/normal (Figure A2). Categorizing the data into binary

(toxic/nontoxic) or multiclass (EPA hazard categories) induces a clear loss of information. Similar compounds with similar LD50 values on either side of the threshold will be placed into different classes and a slight change of threshold(s) may result in a large change in the distribution of categorical classes, which weakens the overall utility of the QSAR models. Therefore, a well- trained regression model would provide a more accurate estimation of the toxicity of a given molecule. In this case, integrating regression and classification models via the proposed hierarchical modeling method can result in a better estimation of the acute toxicity of a given compound. The regression model captures the information of the underlying data distribution while the classification models include the information from the limit test and acute toxic class data.

The resulting hierarchical modeling method represents a promising in silico tool to provide robust predictions for chemical hazard and risk assessment. This type of approach is particularly of interest when only low-reliability regression models (R2 < 0.5) can be obtained (due to a small dataset, a complex endpoint, a non-ideal property distribution, etc.). In that case, obtaining binary and multiclass models is usually more feasible but not necessarily of high value if one really needs the continuous prediction. The H-QSAR models aim at having all three types of models "help" each other and improve the overall prediction performance (especially for the regression model).

This approach could also help in better defining the levels of concordance/agreement between


models for a particular compound to be regulated. Finally, the applicability domain of those H-QSAR models is broader, which is definitely of high value for regulators.

2.4.1. Comparison with other methodologies recently reported in the literature

It would be worthwhile to evaluate our approach by comparing it with other models published in the literature. Herein, we compare the results of our hierarchical consensus models with those obtained by the models from other participating groups in the ICCVAM's global project on the same external test set (Table 2.3). So far, only the modeling methods developed by Alberga et al. [15], Ballabio et al. [45], and Vukovic et al. [46] are officially available in the peer-reviewed literature. The applicability domains (AD) were defined to identify the reliable predictions.

Alberga et al. [15] used a multi-fingerprint similarity approach for building predictive models for all five endpoints. They used 19 different fingerprints to make consensus predictions. The model performance and coverage of the AD (%AD) were determined by the number of similar molecules k, the cut-off of the similarity cs, and the minimum number of fingerprints used Nstored. Particularly, Nstored was considered a tunable parameter that provides an acceptable compromise between the performance and the coverage of the model. For the test set results,

Alberga’s models obtained RMSE = 0.41 with 35% coverage of AD for the regression model, balanced accuracy 0.89 with 26% coverage of AD for the binary model and overall accuracy 0.82 with 15% coverage of AD for the multiclass model. Our models resulted in RMSE = 0.49 with

79% coverage of AD for the regression model, balanced accuracy 0.83 with 79% coverage of AD for the binary model and overall accuracy 0.71 with 79% coverage of AD for the multiclass model.


Table 2.3. Comparison of our hierarchical QSAR models with the results of Alberga's models.

                     Regression               Binary                       Multiclass
Models               RMSE   R2     %AD        Balanced Accuracy   %AD      Accuracy   MCC    %AD
This study           0.52   0.66   100%       0.79                100%     0.68       0.50   100%
                     0.49   0.71   79%        0.83                79%      0.71       0.55   79%
Alberga et al.       0.41   0.74   35%        0.89                26%      0.82       -      15%
Ballabio et al.      -      -      -          0.82                73%      -          -      -
Vukovic et al.       -      -      -          -                   -        0.64       -      100%

Ballabio et al. [45] developed a Bayesian consensus approach integrating three different algorithms (N-nearest neighbors, Binned-nearest neighbors and Naïve Bayes) for modeling the 'very toxic' and 'nontoxic' endpoints. The consensus binary model for the 'nontoxic' endpoint achieved a balanced accuracy of 0.82 with 73% coverage of the AD on the external test set. In comparison, our binary model led to a balanced accuracy of 0.83 with 79% coverage of the AD.

Vukovic et al. [46] proposed the aiQSAR modeling method based on local group selection during model development and at-runtime execution. The application to the EPA category endpoint showed an overall accuracy of 0.64 on the external test set. No AD was defined in that study. Our hierarchical multiclass consensus model achieved an overall accuracy of 0.68.

2.5. Conclusion

In this study, we developed a dual-layer hierarchical QSAR modeling protocol and applied the approach to three acute oral systemic toxicity endpoints: a regression model to estimate the logLD50 values; a binary model to identify the ‘nontoxic’ molecules; and a multiclass model to predict the EPA hazard categories. The hierarchical models (the second layer models) were built


by stacking the predictions of the base models (first layer models). All models were validated using an external test set; the hierarchical models outperformed the individual base models on all three endpoints. The hierarchical models also outperformed the consensus (prediction averaging) of the base models, indicating that our hierarchical modeling method is a more efficient and powerful way to integrate different QSAR models. Overall, this hierarchical H-QSAR modeling represents a promising approach for chemical toxicity assessment.

Supporting Information (Appendix A)

Hyperparameter Search Spaces. (Table A1); Hyperparameter Tuning of Regression Models

(Table A2); Hyperparameter Tuning of Binary Models (Table A3); Hyperparameter Tuning of

Multiclass Models (Table A4); 10-Fold Cross-Validation Performances of Regression Models

(Table A5); 10-Fold Cross-Validation Performances of Binary Models (Table A6); 10-Fold

Cross-Validation Performances of Multiclass Models (Table A7); Test Set Performances of

Regression Models (Table A8); Test Set Performances of Binary Models (Table A9); Test Set

Performances of Multiclass Models (Table A10); Performances of Hierarchical Consensus

Models based on Prediction Zones (Table A11); Distribution of logLD50 (mmol/kg) of Train

and Test Sets (Figure A1); Distribution of logLD50 (mg/kg) of Training Set (Figure A2).

Funding

DF thanks the NC State Chancellor's Faculty Excellence Program for its support.

List of Abbreviations

QSAR, quantitative structure-activity relationships; H-QSAR, hierarchical quantitative structure-activity relationships; LD50, lethal dose, 50%; ECFP, extended connectivity fingerprints; Pnontoxic, predicted probabilities of nontoxic class; PI, predicted probabilities of EPA class I; PII, predicted


probabilities of EPA class II; PIII, predicted probabilities of EPA class III; OFF, out-of-fold; FP, false positives; FN, false negatives; TP, true positives; TN, true negatives; MCC, Matthews correlation coefficient; t-SNE, t-Distributed Stochastic Neighbor Embedding; RF, random forest;

SVM, support vector machine; kNN, k-nearest neighbors; SVR, support vector regression; AD, applicability domain; VX, Ethyl ({2-[bis(propan-2-yl)amino]ethyl}sulfanyl)(methyl)phosphinate; BA, balanced accuracy; aiQSAR, ab initio QSAR.


References

1. Strickland J, Clippinger AJ, Brown J, et al (2018) Status of acute systemic toxicity testing requirements and data uses by U.S. regulatory agencies. Regul. Toxicol. Pharmacol. 94:183–196 2. Burden N, Sewell F, Chapman K (2015) Testing Chemical Safety: What Is Needed to Ensure the Widespread Application of Non-animal Approaches? PLoS Biol 13:e1002156. https://doi.org/10.1371/journal.pbio.1002156 3. Clippinger AJ, Allen D, Jarabek AM, et al (2018) Alternative approaches for acute inhalation toxicity testing to address global regulatory and non-regulatory data requirements: An international workshop report. Toxicol Vitr 48:53–70. https://doi.org/10.1016/j.tiv.2017.12.011 4. Hamm J, Sullivan K, Clippinger AJ, et al (2017) Alternative approaches for identifying acute systemic toxicity: Moving from research to regulatory testing. Toxicol Vitr 41:245– 259. https://doi.org/10.1016/j.tiv.2017.01.004 5. Raies AB, Bajic VB (2016) In silico toxicology: computational methods for the prediction of chemical toxicity. Wiley Interdiscip Rev Comput Mol Sci 6:147–172. https://doi.org/10.1002/wcms.1240 6. Kleinstreuer NC, Karmaus AL, Mansouri K, et al (2018) Predictive models for acute oral systemic toxicity: A workshop to bridge the gap from research to regulation. Comput Toxicol 8:21–24. https://doi.org/10.1016/j.comtox.2018.08.002 7. Cherkasov A, Muratov EN, Fourches D, et al (2014) QSAR modeling: Where have you been? Where are you going to? J Med Chem 57:4977–5010. https://doi.org/10.1021/jm4004285 8. Fourches D, Muratov E, Tropsha A (2016) Trust, but Verify II: A Practical Guide to Chemogenomics Data Curation. J Chem Inf Model 56:1243–1252. https://doi.org/10.1021/acs.jcim.6b00129 9. Sushko I, Novotarskyi S, Körner R, et al (2010) Applicability domains for classification problems: Benchmarking of distance to models for ames mutagenicity set. J Chem Inf Model 50:2094–2111. https://doi.org/10.1021/ci100253r 10. Alves VM, Capuzzi SJ, Muratov EN, et al (2016) QSAR models of human data can enrich or replace LLNA testing for human skin sensitization. Green Chem 18:6501–6515. https://doi.org/10.1039/C6GC01836J 11. Fan T, Sun G, Zhao L, et al (2018) QSAR and Classification Study on Prediction of Acute Oral Toxicity of N-Nitroso Compounds. Int J Mol Sci 19:3015. https://doi.org/10.3390/ijms19103015 12. Lu J, Peng J, Wang J, et al (2014) Estimation of acute oral toxicity in rat using local lazy learning. J Cheminform 6:26. https://doi.org/10.1186/1758-2946-6-26 13. Lagunin A, Zakharov A, Filimonov D, Poroikov V (2011) QSAR modelling of rat acute toxicity on the basis of PASS prediction. Mol Inform 30:241–250. https://doi.org/10.1002/minf.201000151


14. Drwal MN, Banerjee P, Dunkel M, et al (2014) ProTox: A web server for the in silico prediction of rodent oral toxicity. Nucleic Acids Res 42:53–58. https://doi.org/10.1093/nar/gku401 15. Alberga D, Trisciuzzi D, Mansouri K, et al (2019) Prediction of Acute Oral Systemic Toxicity Using a Multifingerprint Similarity Approach. Toxicol Sci 167:484–495. https://doi.org/10.1093/toxsci/kfy255 16. Ballabio D, Grisoni F, Consonni V, Todeschini R (2018) Integrated QSAR Models to Predict Acute Oral Systemic Toxicity. Mol Inform minf.201800124. https://doi.org/10.1002/minf.201800124 17. Xu Y, Pei J, Lai L (2017) Deep Learning Based Regression and Multiclass Models for Acute Oral Toxicity Prediction with Automatic Chemical Feature Extraction. J Chem Inf Model 57:2672–2685. https://doi.org/10.1021/acs.jcim.7b00244 18. Dietterich TG (2000) Ensemble Methods in Machine Learning. In: Ensemble Methods in Machine Learning. Springer, Berlin, Heidelberg, pp 1–15 19. Zhao C, Boriani E, Chana A, et al (2008) A new hybrid system of QSAR models for predicting bioconcentration factors (BCF). Chemosphere 73:1701–1707. https://doi.org/10.1016/j.chemosphere.2008.09.033 20. Gramatica P, Fourches D, Cherkasov A, et al (2008) Combinatorial QSAR Modeling of Chemical Toxicants Tested against Tetrahymena pyriformis. J Chem Inf Model 48:766– 784. https://doi.org/10.1021/ci700443v 21. Pradeep P, Povinelli RJ, White S, Merrill SJ (2016) An ensemble model of QSAR tools for regulatory risk assessment. J Cheminform 8:1–9. https://doi.org/10.1186/s13321-016- 0164-0 22. Kuz’min VE, Muratov EN, Artemenko AG, et al (2009) Consensus QSAR Modeling of Phosphor-Containing Chiral AChE Inhibitors. QSAR Comb Sci 28:664–677. https://doi.org/10.1002/qsar.200860117 23. Alves VM, Capuzzi SJ, Braga RC, et al (2018) A Perspective and a New Integrated Computational Strategy for Skin Sensitization Assessment. ACS Sustain Chem Eng 6:2845–2859. https://doi.org/10.1021/acssuschemeng.7b04220 24. Güneş F, Wolfinger R, Tan P-Y (2017) Stacked Ensemble Models for Improved Prediction Accuracy 25. Sill J, Takacs G, Mackey L, Lin D (2009) Feature-Weighted Linear Stacking 26. Basak SC, Gute BD, Grunwald GD (1997) Use of topostructural, topochemical, and geometric parameters in the prediction of vapor pressure: A hierarchical QSAR approach. J Chem Inf Comput Sci 37:651–655. https://doi.org/10.1021/ci960176d 27. Basak SC, Majumdar S (2015) Current Landscape of Hierarchical QSAR Modeling and its Applications: Some Comments on the Importance of Mathematical Descriptors as well as Rigorous Statistical Methods of Model Building and Validation, 1st ed. Elsevier Ltd. 28. Basak SC, Mills DR, Balaban AT, Gute BD (2001) Prediction of Mutagenicity of Aromatic and Heteroaromatic Amines from Structure: A Hierarchical QSAR Approach. J


Chem Inf Comput Sci 41:671–678. https://doi.org/10.1021/ci000126f 29. Basak SC, Balasubramanian K, Gute BD, et al (2003) Prediction of cellular toxicity of halocarbons from computed chemodescriptors: A hierarchical QSAR approach. J Chem Inf Comput Sci 43:1103–1109. https://doi.org/10.1021/ci020054n 30. Fourches D, Muratov E, Tropsha A (2015) Curation of chemogenomics data. Nat Chem Biol 11:535–535. https://doi.org/10.1038/nchembio.1881 31. Fourches D, Muratov E, Tropsha A (2010) Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J Chem Inf Model 50:1189–1204. https://doi.org/10.1021/ci100176x 32. Berthold MR, Cebron N, Dill F, et al (2009) KNIME - the Konstanz information miner. ACM SIGKDD Explor Newsl 11:26. https://doi.org/10.1145/1656274.1656280 33. Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inf Model 50:742– 754. https://doi.org/10.1021/ci100050t 34. Landrum G RDKit: Open-source cheminformatics. http://www.rdkit.org 35. Moriwaki H, Tian YS, Kawashita N, Takagi T (2018) Mordred: A molecular descriptor calculator. J Cheminform 10:1–14. https://doi.org/10.1186/s13321-018-0258-y 36. Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta - Protein Struct 405:442–451. https://doi.org/10.1016/0005-2795(75)90109-9 37. Gorodkin J (2004) Comparing two K-category assignments by a K-category correlation coefficient. Comput Biol Chem 28:367–374. https://doi.org/10.1016/j.compbiolchem.2004.09.006 38. Pedregosa Fabian, Michel V, Grisel OLIVIER, et al (2011) Scikit-learn: Machine Learning in Python. J Mach Learn Res 12:2825–2830. https://doi.org/10.1007/s13398- 014-0173-7.2 39. Chen T, Guestrin C (2016) {XGBoost}: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, pp 785–794 40. Van Der Maaten L, Hinton G (2008) Visualizing Data using t-SNE. J Mach Learn Res 9:2579–2605 41. Aupanun S, Poapolathep S, Giorgi M, et al (2017) An overview of the toxicology and toxicokinetics of fusarenon-X, a type B trichothecene mycotoxin. J Vet Med Sci 79:6–13. https://doi.org/10.1292/jvms.16-0008 42. Schneider C, Bierwisch A, Koller M, et al (2016) Detoxification of VX and Other V-Type Nerve Agents in Water at 37 °C and pH 7.4 by Substituted Sulfonatocalix[4]arenes. Angew Chemie - Int Ed 55:12668–12672. https://doi.org/10.1002/anie.201606881 43. Mathea M, Klingspohn W, Baumann K (2016) Chemoinformatic Classification Methods and their Applicability Domain. Mol Inform 35:160–180. https://doi.org/10.1002/minf.201501019 44. Kuhn M, Johnson K (2013) Applied Predictive Modeling. Springer New York, New York, 45

NY 45. Ballabio D, Grisoni F, Consonni V, Todeschini R (2018) Integrated QSAR Models to Predict Acute Oral Systemic Toxicity. Mol Inform minf.201800124. https://doi.org/10.1002/minf.201800124 46. Vukovic K, Gadaleta D, Benfenati E (2019) Methodology of aiQSAR: a group-specific approach to QSAR modelling. J Cheminform 11:27. https://doi.org/10.1186/s13321-019- 0350-y


Chapter 3. Inductive Transfer Learning for Molecular Activity Prediction

Xinhao Li & Denis Fourches*

Department of Chemistry, Bioinformatics Research Center, North Carolina State University,

Raleigh, NC 27695, United States.

Published in Journal of Cheminformatics (J Cheminform. 2020, 12, 27)


Abstract

Deep neural networks can directly learn from chemical structures without extensive, user- driven selection of descriptors in order to predict molecular properties/activities with high reliability. But these approaches typically require large training sets to learn the endpoint-specific structural features and ensure reasonable prediction accuracy. Even though large datasets are becoming the new normal in drug discovery, especially when it comes to high-throughput screening or metabolomics datasets, one should also consider smaller datasets with challenging endpoints to model and forecast. Thus, it would be highly relevant to better utilize the tremendous compendium of unlabeled compounds from publicly-available datasets for improving the model performances for the user’s particular series of compounds. In this study, we propose the Molecular Prediction Model Fine-Tuning (MolPMoFiT) approach, an effective transfer learning method based on self-supervised pre-training + task-specific fine-tuning for QSPR/QSAR modeling. A large-scale molecular structure prediction model is pre-trained using one million unlabeled molecules from ChEMBL in a self-supervised learning manner, and can then be fine- tuned on various QSPR/QSAR tasks for smaller chemical datasets with specific endpoints. Herein, the method is evaluated on four benchmark datasets (lipophilicity, FreeSolv, HIV, and blood-brain barrier penetration). The results showed the method can achieve strong performances for all four datasets compared to other state-of-the-art machine learning modeling techniques reported in the literature.


3.1. Introduction

Predicting properties/activities of chemicals from their structures is one of the key objectives in cheminformatics and molecular modeling. Quantitative structure property/activity relationship (QSPR/QSAR) modeling [1–6] relies on machine learning techniques to establish quantified links between molecular structures and their experimental properties/activities. When using a classic machine learning approach, the training process is divided into two main steps: feature extraction/calculation and the actual modeling. The features (also called descriptors) characterizing the molecular structures are critical for the model performances. They typically encompass 2D molecular fingerprints, topological indices, or substructural fragments, as well as more complex 3D and 4D descriptors [7, 8] directly computed from the molecular structures [9].

Deep learning methods have demonstrated remarkable performances in several QSPR/QSAR case studies. In addition to using expert-engineered molecular descriptors as input, those techniques can also directly take molecular structures (e.g., molecular graphs [10–21], SMILES strings [22–24], and molecular 2D/3D grid images [25–30]) and learn data-driven feature representations for predicting properties/activities. As a result, this type of approach is potentially able to capture and extract underlying, complex structural patterns and feature-property relationships given a sufficient amount of training data. The knowledge derived from these dataset-specific descriptors can then be used to better interpret and understand the structure-property relationships as well as to design new compounds. In a large-scale benchmark study, Yang et al. [14] showed that a graph convolutional model that constructs a learned representation from the molecular graph consistently matches or outperforms models trained with expert-engineered molecular descriptors/fingerprints.


Graph convolutional neural networks (GCNNs) directly operate on molecular graphs [10]. A molecular graph is an undirected graph whose nodes correspond to the atoms of the molecule and whose edges correspond to chemical bonds. GCNNs iteratively update the node representations by aggregating the representations of their neighboring nodes and/or edges. After k iterations of aggregation, the final node representations capture the local structural information within their k-hop graph neighborhood (which is somewhat similar to augmented substructural fragments [31] but in a more data-driven manner). Moreover, the Simplified Molecular-Input Line-Entry System (SMILES) [32, 33] encodes molecular structures as strings of text. Widely used in the field of cheminformatics, the SMILES format can be considered an analogue of natural language. As a result, deep learning model architectures such as RNNs [34, 35], CNNs [36] and transformers [37] can be directly applied to SMILES for QSAR/QSPR tasks. While deep learning models have achieved state-of-the-art results on a variety of molecular property/activity prediction tasks, these end-to-end models require very large amounts of training data to learn useful feature representations. The learned representations are usually endpoint-specific, which means the models need to be built and retrained from scratch for each new endpoint/dataset of interest. Small chemical datasets with challenging endpoints to model are thus still disadvantaged with these techniques and unlikely to lead to models with reasonable prediction accuracy. As of today, this is considered a grand challenge for QSAR modelers facing small sets of compounds without a clear path for obtaining reliable models for the endpoint of interest.

Meanwhile, transfer learning is a quickly emerging technique based on the general idea of reusing a pre-trained model built on a large dataset as the starting point for building a new, more optimized model for a target endpoint of interest. It is now widely used in the fields of computer vision (CV) and natural language processing (NLP). In CV, a deep learning model pre-trained on ImageNet [38] can be used as the starting point to fine-tune for a new task [39]. Transfer learning in NLP has historically been restricted to shallow word embeddings: NLP models start with embedding layers initialized with pretrained weights from Word2Vec [40], GloVe [41] or fastText [42]. This approach only uses the prior knowledge for the first layer of a model; the remaining layers still need to be trained and optimized from scratch. Language model pre-training [43–47] extends this approach by transferring all the learned, optimized weights from multiple layers, providing contextualized word embeddings for the downstream tasks. Large-scale pre-trained language models have greatly improved the performance on a variety of language tasks. The default task for a language model is to predict the next word given the past sequence. The input and labels of the dataset used to train a language model are provided by the text itself. This is known as self-supervised learning. Self-supervised learning opens up a huge opportunity for better utilizing unlabeled data.

Due to the limited amount and sparsity of labeled datasets for certain types of endpoints in chemistry (e.g., inhibitor residence times, allosteric inhibition, renal clearance), several transfer learning methods have been developed to allow the development of QSPR/QSAR models for those types of endpoints/datasets. Inspired by ImageNet pretraining, Goh et al. proposed ChemNet [26] for transferable chemical property prediction. A deep neural network was pre-trained in a supervised manner on the ChEMBL [48] database using computed molecular descriptors as labels, then fine-tuned on other QSPR/QSAR tasks. Jaeger et al. [49] developed Mol2vec, which employs the same idea as Word2Vec in NLP. Mol2vec learns the vector representations of molecular substructures in an unsupervised learning manner. Vectors of closely related molecular substructures are close to each other in the vector space. Molecular representations are computed by summing up the vectors of the individual substructures and can be used as input for QSPR/QSAR models. Hu et al. pre-trained graph neural networks (GNNs) using both unlabeled data and labeled data from related auxiliary supervised tasks. The pre-trained GNNs were shown to significantly increase the model performances [50]. Multitask learning (MTL) is a field related to transfer learning, aiming at improving the performance of multiple tasks by learning them jointly. Multitask DNNs (deep neural networks) for QSAR were notably introduced by the winning team in the Kaggle QSAR competition and then applied in other QSAR/QSPR studies [51–58]. MTL is particularly useful if the endpoints share a significant relationship. However, MTL requires the tasks to be trained from scratch every time.

Herein, we propose the Molecular Prediction Model Fine-Tuning (MolPMoFiT – pronounced MOLMOFIT), an effective transfer learning method based on self-supervised pre- training + task-specific fine-tuning for QSPR/QSAR modeling. In the current version, a molecular structure prediction model (MSPM) is pre-trained using one million bioactive molecules from

ChEMBL and then fine-tuned for various QSPR/QSAR tasks. This method is “universal” in the sense that the pre-trained molecular structure prediction model can be used as a source for any other QSPR/QSAR models dedicated to a specific endpoint and a smaller dataset (e.g., molecular series of congeneric compounds). This approach could constitute a first look at next-gen QSAR models being capable of high prediction reliability even for small series of compounds and highly challenging endpoints.

3.2. Method

3.2.1. ULMFiT

The MolPMoFiT method we proposed here is adapted from the ULMFiT (Universal

Language Model Fine-Tuning) [45], a transfer learning method developed for any NLP


classification tasks. The original implementation of ULMFiT breaks the training process into three stages:

1. Train a general-domain language model in a self-supervised manner on a large corpus (e.g., Wikitext-103 [59]). Language models aim to predict the next word in a sentence given the context that precedes it. The input and labels of the dataset used to train a language model are provided by the text itself. After training on millions of unlabeled sentences, the language model captures extensive and in-depth knowledge [60–62] of a language and can provide useful features for other NLP tasks.

2. Fine-tune the general language model on the task corpus to create a task-specific language model.

3. Fine-tune the task-specific language model for the downstream classification/regression model.

As described above, the ULMFiT is a three-stage transfer learning process that includes two types of models: language models and classification/regression models. A language model is a model that takes in a sequence of words and predicts the most likely next word. A language model is trained in a self-supervised manner and no label is required. This means the training data can be generated from a huge amount of unlabeled text data. The classification/regression model is a model that takes a whole sequence and predicts the class/value associated to the sequence, requiring labeled data.

3.2.2. MolPMoFiT

In this study, we adapted the ULMFiT method to handle molecular property/activity prediction. Specifically, we trained a molecular structure prediction model (MSPM) using one million molecules extracted from ChEMBL with self-supervised learning. The pre-trained MSPM was then fine-tuned for the given QSAR/QSPR tasks.

Model Architecture: The architectures of the MSPMs and the QSAR/QSPR models follow similar structures (Figure 3.1): the embedding layer, the encoder and the classifier. The embedding layer converts the numericized tokens into fixed-length vector representations (see Section 3.2.4 for details); the encoder processes the sequence of embedding vectors into feature representations which contain the contextualized token meanings; and the classifier uses the extracted feature representations to make the final prediction. The model architecture used for modeling is the AWD-LSTM (ASGD Weight-Dropped LSTM) [63]. The main idea of the AWD-LSTM is to use an LSTM (Long Short-Term Memory [64]) model with dropouts in all the possible layers (embedding layer, input layer, weights, and hidden layers). The model hyperparameters are the same as the ones initially implemented for ULMFiT. An embedding vector length of 400 was used for the models. The encoder consists of three LSTM layers: the input size of the first LSTM layer is 400, the number of hidden units is 1152, and the output size of the last LSTM layer is 400. The classifiers use the output of the encoder to make predictions. The MSPMs and QSPR/QSAR models use the output of the encoder in different ways for different prediction purposes. The MSPM classifier consists of just a single softmax layer. The MSPMs predict the next token in a SMILES string using the hidden state at the last time step hT of the final LSTM layer of the encoder. The QSPR/QSAR model classifier consists of two feedforward neural network layers. The first layer takes the concatenation of output vectors from the last LSTM layer of the encoder (concatenation of max pooling, mean pooling and the last time step hT [45]), followed by a ReLU activation function. The final output size is determined by the QSPR/QSAR endpoint, e.g., for regression models, a single output node is used; for classification models, the output size equals the number of classes.
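A hedged PyTorch sketch of this concat-pooling prediction head is given below; apart from the 400-dimensional encoder output stated in the text, the layer sizes and names are placeholders rather than the exact implementation.

```python
# Hedged sketch of the QSPR/QSAR prediction head: max pooling, mean pooling and the
# last hidden state of the final LSTM layer are concatenated and fed to two FC layers.
import torch
import torch.nn as nn

class ConcatPoolHead(nn.Module):
    def __init__(self, enc_dim=400, hidden=50, n_out=1):
        super().__init__()
        self.fc1 = nn.Linear(3 * enc_dim, hidden)   # concatenation of 3 pooled vectors
        self.fc2 = nn.Linear(hidden, n_out)          # 1 output (regression) or n classes

    def forward(self, enc_out):
        # enc_out: (batch, seq_len, enc_dim) outputs of the last LSTM layer
        last = enc_out[:, -1, :]                     # hidden state at the last time step
        max_pool = enc_out.max(dim=1).values
        mean_pool = enc_out.mean(dim=1)
        x = torch.cat([last, max_pool, mean_pool], dim=1)
        return self.fc2(torch.relu(self.fc1(x)))

head = ConcatPoolHead(n_out=1)            # e.g., a single node for a regression endpoint
dummy = torch.randn(8, 120, 400)          # batch of 8 encoded SMILES sequences
print(head(dummy).shape)                  # torch.Size([8, 1])
```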

Figure 3.1. Scheme illustrating the MolPMoFiT Architecture: During the fine-tuning, learned weights are transferred between models. Vocab Size corresponds to the number of unique characters tokenized (see Section 3.2.4) from the SMILES in a data set. The stage of Task-Specific Molecular Structure Prediction Model fine-tuning is optional.

General-Domain MSPM Training: In the first stage of training, a general domain MSPM is trained on one million molecules curated from ChEMBL. The model is trained using the one cycle policy with a constant learning rate for 10 epochs. One cycle policy is a learning rate schedule method proposed by Smith [65]. The MSPM forms the source for all the subsequent QSPR/QSAR models. The training of the general-domain MSPM model requires about one day on a single

NVIDIA Quadro P4000 GPU but it only needs to be trained once and can be reused for other

QSPR/QSAR tasks.


Task-Specific MSPM Fine-Tuning (Optional): This stage is optional for MolPMoFiT. The MSPM trained on ChEMBL covers a large and diverse chemical space of bioactive molecules and can be directly fine-tuned to predict physical properties of molecules such as lipophilicity and solubility. For bioactivities such as HIV inhibition and other drug activities, scientists are more interested in compounds with desired activities. The experimentally tested data (target task dataset) may have a different distribution from ChEMBL. Fine-tuning the general-domain MSPM on the target task data to adapt to the idiosyncrasies of the task data would be helpful to the downstream QSAR models. The impact of task-specific MSPM fine-tuning will be analyzed in Section 3.3.1.

In this stage, the goal is to fine-tune the general-domain MSPM on the target QSAR datasets to create the task-specific (endpoint-specific) MSPM. The initial weights (embedding, encoder and linear head) of the task-specific MSPM are transferred from the general-domain MSPM. The task-specific MSPMs are fine-tuned using the one-cycle policy and discriminative fine-tuning [45]. In a neural network, different layers encode different levels of information [66]. Higher layers contain less general knowledge relevant to the target task and need more fine-tuning compared to lower layers. Instead of using the same learning rate for fine-tuning all the layers, discriminative fine-tuning trains higher layers with higher learning rates. Learning rates are adjusted using the same function η_layer−1 = η_layer / 2.6 as in the original ULMFiT approach, where η is the learning rate.
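As a quick illustration of that rule, the sketch below derives one learning rate per layer group from a base rate; the function name and the example base rate are illustrative only.

```python
# Minimal sketch: discriminative learning rates, each lower layer group trained with
# the rate of the group above it divided by 2.6 (as in ULMFiT).
def discriminative_lrs(base_lr, n_groups):
    """Return learning rates ordered from the lowest to the highest layer group."""
    lrs = [base_lr / (2.6 ** i) for i in range(n_groups)]
    return list(reversed(lrs))

print(discriminative_lrs(5e-3, 4))
# -> roughly [2.8e-4, 7.4e-4, 1.9e-3, 5e-3]: higher layers get higher rates
```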

QSAR/QSPR Model Fine-Tuning: When fine-tuning the QSAR/QSPR model, only the embedding layer and the encoder are transferred from the pre-trained model, as the QSAR/QSPR model requires a different classifier. In other words, the weights of the classifier are initialized randomly and need to be trained from scratch for each task [45]. The QSPR/QSAR model is fine-tuned using the one-cycle policy, discriminative fine-tuning and gradual unfreezing [45]. During the fine-tuning, the model is gradually unfrozen over four layer groups: (i) classifier; (ii) classifier + final LSTM layer; (iii) classifier + final two LSTM layers; and (iv) full model. Gradual unfreezing first trains the classifier of the model with the embedding and encoder layers frozen (weights not updated). The next layer group is then unfrozen and the model is fine-tuned further. This process continues until all the layer groups are unfrozen and fine-tuned.
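The fastai library used in this chapter offers freeze_to/fit_one_cycle helpers for this schedule; the plain-PyTorch sketch below only illustrates the layer-group logic with placeholder modules and is not the study's training loop.

```python
# Hedged sketch of gradual unfreezing over four layer groups with plain PyTorch.
import torch.nn as nn

embedding = nn.Embedding(100, 400)               # placeholder vocabulary size
lstm1 = nn.LSTM(400, 1152, batch_first=True)
lstm2 = nn.LSTM(1152, 1152, batch_first=True)
lstm3 = nn.LSTM(1152, 400, batch_first=True)
head = nn.Linear(1200, 1)                        # concat-pooling classifier head

# Groups ordered from last-unfrozen (bottom) to first-unfrozen (top).
groups = [[embedding, lstm1], [lstm2], [lstm3], [head]]

def freeze_to(n_trainable_groups):
    """Unfreeze only the top n groups, in the spirit of fastai's freeze_to."""
    for i, group in enumerate(groups):
        trainable = i >= len(groups) - n_trainable_groups
        for module in group:
            for p in module.parameters():
                p.requires_grad = trainable

freeze_to(1)   # stage (i): classifier only
freeze_to(2)   # stage (ii): + final LSTM layer
freeze_to(3)   # stage (iii): + final two LSTM layers
freeze_to(4)   # stage (iv): full model
# After each call, a few epochs would be run with the one-cycle policy and
# discriminative learning rates before moving to the next stage.
```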

Implementation. We implemented our model using the PyTorch [67]

(https://pytorch.org/) deep learning framework and fastai v1 library [68] (https://docs.fast.ai). To ensure the reproducibility of this study, the data and code used in this study are freely available at: https://github.com/XinhaoLi74/MolPMoFiT.

3.2.3. Dataset preparation

SMILES of all molecules in ChEMBL [48] were downloaded and curated following this procedure: (1) removing mixtures and molecules with more than 50 heavy atoms; (2) standardizing with the MolVS [69] package; (3) sanitizing and canonicalizing with the RDKit [70] package. After curation, one million SMILES were randomly selected for training and testing the molecular structure prediction model.
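A hedged sketch of these curation steps with RDKit and MolVS is shown below; the exact filters and ordering in the study may differ, and the function name is illustrative.

```python
# Hedged sketch of the curation steps listed above (mixtures, heavy-atom filter,
# MolVS standardization, RDKit sanitization/canonicalization).
from rdkit import Chem
from molvs import Standardizer

def curate(smi, max_heavy_atoms=50):
    if "." in smi:                          # (1) skip mixtures (multi-fragment SMILES)
        return None
    mol = Chem.MolFromSmiles(smi)           # (3) sanitization happens on parsing
    if mol is None or mol.GetNumHeavyAtoms() > max_heavy_atoms:
        return None
    mol = Standardizer().standardize(mol)   # (2) MolVS standardization
    return Chem.MolToSmiles(mol)            # canonical SMILES

print(curate("CC(=O)Oc1ccccc1C(=O)O"))
```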

We tested our method on four publicly available benchmark datasets [17]: (1) molecular lipophilicity; (2) experimentally measured solvation energy in kcal/mol (FreeSolv); (3) HIV inhibition; and (4) blood-brain barrier penetration (BBBP). The detailed descriptions are summarized in Table 3.1.


Table 3.1. Description of QSAR/QSPR datasets.

Data Set       Description                                           Size     # of Active Compounds   Task
Lipophilicity  Octanol/water distribution coefficient                4,200    -                       Regression
FreeSolv       Experimentally measured solvation energy (kcal/mol)   642      -                       Regression
HIV            Inhibition of HIV replication                         41,127   1,443                   Classification
BBBP           Ability to penetrate the blood-brain barrier          2,039    1,560                   Classification

3.2.4. Molecular Representation

In this study, we use SMILES strings as the textual representation of molecules. SMILES is a linear notation for representing molecular structures. For SMILES to be processed by machine learning models, they need to be transformed into numeric representations. SMILES strings are tokenized at the character level with a few specific treatments: (1) 'Cl' and 'Br' are two-character tokens; (2) special characters encoded between brackets are considered as single tokens (e.g., '[nH]', '[O-]' and '[Te]'). The unique tokens are mapped to integers to be used as input for the deep learning models.

3.2.5. Data Augmentation

Deep learning models are data-hungry, so various data augmentation techniques have been developed for different types of data and applications [71–74]. Data augmentation usually helps deep learning models generalize better to new data. Each SMILES corresponds to one unique molecular structure, whereas several SMILES strings can be derived from the same molecule. In fact, for a single molecular structure, many SMILES can be generated by simply randomizing the atom ordering (Figure 3.2a). Bjerrum showed that SMILES enumeration, used as a data augmentation technique for QSAR models based on SMILES input, can improve robustness and accuracy [75]. It has also been shown that generative models trained on both augmented and canonical SMILES can create a larger chemical space of structures [76, 77]. Herein, we used SMILES enumeration as the basis of our data augmentation technique. The SMILES augmentation technique was applied to both the MSPM and the QSAR/QSPR models. For the MSPM, SMILES augmentation ensures the trained model can cover a large and diverse chemical space (characterized by SMILES). For unbalanced classification QSAR/QSPR datasets, SMILES augmentation can be applied to re-balance the class distribution of the training data. In addition to SMILES augmentation, for regression QSAR/QSPR models, Gaussian noise (mean set at 0 and standard deviation σnoise) is added to the labels of the augmented SMILES, which can be considered a simulation of experimental errors [78] (Figure 3.2b). The standard deviation σnoise is considered a hyperparameter of the models and needs to be tuned from task to task. The impact of training data augmentation will be analyzed in Section 3.3.2.
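A hedged sketch of this augmentation is given below, using the classic atom-renumbering recipe for SMILES enumeration in RDKit plus the Gaussian label noise for regression targets; the function names and parameter values are illustrative only, not the study's exact implementation. The same enumeration routine can also supply the extra SMILES averaged at prediction time in the test-time augmentation described next.

```python
# Hedged sketch of SMILES enumeration (randomized atom ordering) with optional
# Gaussian label noise simulating experimental error for regression endpoints.
import random
import numpy as np
from rdkit import Chem

def randomize_smiles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    order = list(range(mol.GetNumAtoms()))
    random.shuffle(order)                              # random atom ordering
    return Chem.MolToSmiles(Chem.RenumberAtoms(mol, order), canonical=False)

def augment(smiles, label=None, n_aug=4, sigma_noise=0.0, seed=0):
    """Return (SMILES, label) pairs: the original plus n_aug randomized copies."""
    random.seed(seed)
    rng = np.random.default_rng(seed)
    records = [(smiles, label)]
    for _ in range(n_aug):
        noisy = label if label is None else label + rng.normal(0.0, sigma_noise)
        records.append((randomize_smiles(smiles), noisy))
    return records

for smi, y in augment("CC(=O)Oc1ccccc1C(=O)O", label=-1.2, n_aug=4, sigma_noise=0.3):
    print(smi, y)
```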

We also applied test-time augmentation (TTA): briefly, the final predictions are generated by averaging the predictions of the canonical SMILES and four augmented SMILES

(Figure 3.2c). The impact of TTA will be discussed in Section 3.3.1.


Figure 3.2. SMILES and Data Augmentation. (a) SMILES Augmentation; (b) Training Data Augmentation (training sets); (c) Test-Time Augmentation (TTA).


3.2.6. Baselines and Comparison Models

To evaluate the performance of our method, we compared our models to the models reported by Yang et al. [14], including the directed message passing neural network (D-MPNN), D-MPNN with RDKit features, a random forest (RF) model on binary Morgan fingerprints, a feed-forward network (FFN) on binary Morgan fingerprints, an FFN on count-based Morgan fingerprints and an FFN on RDKit descriptors. We evaluated all models based on the original random and scaffold splits from Yang et al. for a fair and reproducible comparison. All the models were evaluated on the test sets of 10 randomly seeded 80:10:10 data splits. For regression models, we use the root-mean-square error (RMSE) as the key metric. For classification models, we use the area under the receiver operating characteristic curve (AUROC) as the key metric.

3.2.7. Hyperparameters and Training Procedure

QSAR/QSPR Model Fine-Tuning: We are interested in obtaining a model that performs robustly across a variety of QSPR/QSAR tasks. Herein, we used the same set of hyperparameters for fine-tuning the QSPR/QSAR models across different tasks, which we tuned on the HIV dataset (Table 3.2). The batch size is set to 128 (64 for the HIV dataset due to the GPU memory limit). The optimal hyperparameters for the HIV dataset were determined based on the validation set results of three random 80:10:10 data splits. Specifically, we optimized the dropout rates, the base learning rate and the number of training epochs.

Table 3.2. Hyperparameters for QSPR/QSAR Model Fine-Tuning.

Layer Groups                           Base Learning Rate   Epochs
Linear head only                       3e-2                 4
Linear head + final LSTM layer         5e-3                 4
Linear head + final two LSTM layers    5e-4                 4
Full Model                             5e-5                 6


Data Augmentation: In order to train a molecular structure prediction model that can be applied to a large chemical space, ChEMBL data is augmented by 4 times in addition to the original canonical SMILES. For the lipophilicity and FreeSolv datasets (regression), the number of augmented SMILES and the label noise σnoise were tuned on the validation set on three 80:10:10 random split. Specifically, the SMILES of lipophilicity training data were augmented 25 times with the label noise σnoise = 0.3 and the SMILES of FreeSolv training data were augmented 50 times with the label noise σnoise = 0.5. For classification tasks, we used data augmentation to balance the class distribution. Specifically, for HIV data, the SMILES of active class were augmented 60 times and the SMILES of inactive class were augmented 2 times. For BBBP data, the SMILES of positive class were augmented 10 times and the SMILES of negative class were augmented 30 times.

3.3. Results and Discussion

3.3.1. Benchmark

Yang et al. [14] developed a graph convolutional model based on a directed message passing neural network (D-MPNN) and benchmarked it across a wide variety of public and proprietary datasets, achieving consistently strong performance. We benchmarked our MolPMoFiT method against the state-of-the-art models from Yang et al. on four well-studied chemical datasets: lipophilicity, FreeSolv, HIV and BBBP. Both random and scaffold splits were evaluated. The scaffold split enforces that the training and test sets share no common molecular scaffolds, which represents a more challenging and realistic evaluation compared to a random split. All the models were evaluated on the test set of the exact same ten 80:10:10 splits from Yang et al. to ensure a fair and reproducible benchmark. Results for the lipophilicity and FreeSolv data were evaluated by the root mean square error

(RMSE), whereas results for HIV and BBBP were evaluated by area under the receiver operating


characteristic curve (AUROC). For physical properties lipophilicity and FreeSolv data, the regression models were fine-tuned on the general domain MSPM. For bioactivities HIV and BBBP data, the classification models were fine-tuned on both the general and task-specific MSPMs (See

Section 3.2.2). Evaluation metrics were computed in two settings: (1) testing on canonical

SMILES only and (2) test-time augmentation (TTA, see Section 3.2.5).

The results for test sets are summarized in Figures 3.3-3.6. Across all four data sets,

MolPMoFiT models achieved comparable or better prediction performances compared to the baselines. Generally, a scaffold split resulted in a worse performance compared to a random split.

But a scaffold split can better measure the generalization ability of a model, which is very useful

[79] for new molecular series with scaffolds being dissimilar to any other compounds in the modeling set.

For lipophilicity data, MolPMoFiT models tested on TTA outperform those tested on canonical SMILES. On random split, MolPMoFiT achieved a test set RMSE of 0.565±0.037 and

0.625±0.032 with and without TTA, respectively (Figure 3.3a). On scaffold split, MolPMoFiT achieved a test set RMSE of 0.635±0.031 and 0.695±0.036 with and without TTA, respectively

(Figure 3.3b).

The FreeSolv dataset only contains 642 compounds, different data splits resulted in a large variance in RMSE (Figure 3.4). MolPMoFiT models tested on TTA outperform those tested on canonical SMILES on random split but have no significant difference on scaffold split. On random split, MolPMoFiT achieved a test set RMSE of 1.197±0.127 and 1.338±0.144 with and without

TTA, respectively (Figure 3.4a). On scaffold split, MolPMoFiT achieved a test set RMSE of

2.082±0.460 and 2.185±0.448 with and without TTA, respectively (Figure 3.4b).


Figure 3.3. Comparison of MolPMoFiT to Reported Results from Yang et al. [14] on Lipophilicity. (a) Random split; (b) Scaffold split. MolPMoFiT: Molecular Prediction Model Fine-Tuning; D-MPNN: Directed Message Passing Neural Network; RF: Random Forest; FFN: Feed-Forward Network.


Figure 3.4. Comparison of MolPMoFiT to Reported Results from Yang et al. [14] on FreeSolv. (a) Random split; (b) Scaffold split.

For bioactivities like BBBP and HIV inhibition, the molecules of interest (tested experimentally) may have a different distribution from ChEMBL. Fine-tuning the general-domain MSPM on the target task data to adapt to the idiosyncrasies of the task data would be helpful to the downstream QSAR models. We evaluated the QSAR models fine-tuned both on the general-domain MSPM (named general MolPMoFiT) and on the task-specific MSPM (named task-specific MolPMoFiT) on the BBBP and HIV datasets. For both the BBBP (Figure 3.5) and HIV (Figure 3.6) datasets, the performance of models fine-tuned on the general-domain MSPM is on par with the performance of models fine-tuned on the task-specific MSPMs. More case studies are required to show whether fine-tuning on a task-specific MSPM is beneficial.

On the BBBP dataset, MolPMoFiT models outperform the other comparison models (Figure 3.5). Specifically, the general MolPMoFiT models achieved a test set AUROC of 0.950±0.020 (canonical SMILES) and 0.945±0.023 (TTA) on the random split and a test set AUROC of 0.931±0.025 (canonical SMILES) and 0.929±0.023 (TTA) on the scaffold split. The task-specific MolPMoFiT models achieved a test set AUROC of 0.950±0.022 (canonical SMILES) and 0.942±0.023 (TTA) on the random split and a test set AUROC of 0.933±0.023 (canonical SMILES) and 0.926±0.026 (TTA) on the scaffold split. It is worth noting that TTA brought no improvement in model accuracy here.

For the HIV data (Figure 3.6), the general MolPMoFiT models achieved a test set AUROC of 0.801±0.032 (canonical SMILES) and 0.828±0.029 (TTA) on the random split and a test set AUROC of 0.794±0.023 (canonical SMILES) and 0.816±0.022 (TTA) on the scaffold split. The task-specific MolPMoFiT models achieved a test set AUROC of 0.811±0.021 (canonical SMILES) and 0.834±0.025 (TTA) on the random split and a test set AUROC of 0.782±0.018 (canonical SMILES) and 0.805±0.014 (TTA) on the scaffold split.

Figure 3.5. Comparison of MolPMoFiT to reported results from Yang et al. [14] on BBBP. (a) Random split; (b) scaffold split.

Figure 3.6. Comparison of MolPMoFiT to reported results from Yang et al. [14] on HIV. (a) Random split; (b) scaffold split.


3.3.2. Analysis

Impact of Transfer Learning: MolPMoFiT models were compared to models trained from scratch. The models were trained on different amounts of training data and tested on the test set of a single 80:10:10 random split. The hyperparameters (learning rate, dropout rate and training epochs) were kept fixed: the hyperparameters of the MolPMoFiT models were the same as those used in the benchmark, and the hyperparameters of the models trained from scratch were tuned on the validation set using the full training set. The results are illustrated in Figure 3.7. With any amount of training data, the MolPMoFiT model always outperforms the model trained from scratch, indicating that the MolPMoFiT transfer learning technique provides a robust improvement in model performance.

Figure 3.7. Performance of models trained on different sizes of the training set. (a) Lipophilicity; (b) FreeSolv; (c) BBBP; (d) HIV.


Impact of Training Data Augmentation: In Section 3.3.1, we showed that test-time augmentation (TTA) can improve the accuracy of predictions. Herein, we analyze the effect of training data augmentation. All models were evaluated with metrics computed with TTA on the test sets of three 80:10:10 random splits. The hyperparameters of the models were the same as those used in the benchmark.

For classification tasks (BBBP and HIV), models were trained on different sizes of augmented training data. On the HIV dataset, when the model was trained on the original data (no augmentation), the AUROC was 0.816±0.005, which is significantly lower than the values obtained with data augmentation. The models achieved similar performance once data augmentation was applied, regardless of class re-balancing (Table 3.3). Similarly, training data augmentation significantly improved the accuracy of the model on the BBBP data, whereas class re-balancing brought no additional improvement (Table 3.4).

Table 3.3. Impact of SMILES Augmentation on HIV Dataset.
Iterations of Augmentation (Positive Class / Negative Class)    Class Ratio (Positive:Negative)    AUROC
0 / 0      0.037    0.816±0.005
4 / 4      0.037    0.831±0.003
30 / 1     0.52     0.830±0.007
60 / 1     1        0.835±0.007

Table 3.4. Impact of SMILES Augmentation on BBBP Dataset.
Iterations of Augmentation (Positive Class / Negative Class)    Class Ratio (Positive:Negative)    AUROC
0 / 0      3.38     0.894±0.004
4 / 4      3.10     0.937±0.005
10 / 10    3.25     0.949±0.002
9 / 30     1.1      0.946±0.002


For regression tasks (lipophilicity and FreeSolv), models were trained on different sizes of augmented training data, whose labels were perturbed with different levels of Gaussian noise σnoise. The evaluated numbers of augmented SMILES per compound were {0, 5, 25, 50} and {0, 25, 50, 100} for lipophilicity and FreeSolv, respectively. The evaluated Gaussian noise σnoise values were {0, 0.1, 0.3, 0.5} and {0, 0.3, 0.5, 1} for lipophilicity and FreeSolv, respectively. The results on the test sets are shown in Figure 3.8. For both the lipophilicity and FreeSolv datasets, when the model was trained only on the original training data (no augmented SMILES and no label perturbation), the performance is significantly worse than that of the models trained on augmented training data.
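To make this training-set augmentation concrete, the sketch below shows one way it could be implemented with RDKit: random (non-canonical) SMILES are enumerated for each compound and the copied labels are perturbed with Gaussian noise. The function name and the default values for n_aug and sigma_noise are illustrative assumptions, not the exact code used in this work.

```python
import random
import numpy as np
from rdkit import Chem

def augment_regression_data(smiles_list, labels, n_aug=25, sigma_noise=0.3, seed=42):
    """Illustrative sketch: enumerate random SMILES for each compound and
    perturb the duplicated labels with Gaussian noise of width sigma_noise."""
    rng = np.random.default_rng(seed)
    random.seed(seed)
    aug_smiles, aug_labels = [], []
    for smi, y in zip(smiles_list, labels):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        # Keep the canonical SMILES with the original (unperturbed) label.
        aug_smiles.append(Chem.MolToSmiles(mol))
        aug_labels.append(y)
        atoms = list(range(mol.GetNumAtoms()))
        for _ in range(n_aug):
            # A random atom ordering yields a random, valid SMILES of the same molecule.
            random.shuffle(atoms)
            rand_smi = Chem.MolToSmiles(Chem.RenumberAtoms(mol, atoms), canonical=False)
            aug_smiles.append(rand_smi)
            aug_labels.append(y + rng.normal(0.0, sigma_noise))
    return aug_smiles, aug_labels
```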

The results above highlight one limitation of using SMILES as input for deep learning models: the model actually learns to map individual SMILES to molecular properties/activities instead of linking the actual molecular structures to their properties/activities. SMILES augmentation therefore acts as a regularization technique, making the model more robust to the various SMILES representations of the same molecule. Appropriately adding random label noise to the augmented SMILES led to improved predictive power of the regression models. For the same data augmentation setting, testing results with TTA were found to be almost always better than the results on canonical SMILES only. While training-set augmentation helps in building models that generalize well on new data, prediction accuracy can be further improved by TTA.

Figure 3.8. Performance of models trained with different numbers of augmented SMILES per compound and different levels of Gaussian noise (σnoise) added to the original experimental values. (a) Lipophilicity; (b) FreeSolv. TTA: test-time augmentation.


3.4. Conclusion

In this study, we introduced MolPMoFiT, a novel transfer learning method for QSPR/QSAR tasks. We pre-trained a molecular structure prediction model (MSPM) using one million bioactive molecules from ChEMBL and then fine-tuned it for various QSPR/QSAR tasks. This pre-training + fine-tuning approach enables knowledge learned from large chemical data sets to transfer to smaller data sets, thereby improving model performance and generalization. Without endpoint-specific hyperparameter tuning, this method showed comparable or better results than the state-of-the-art results reported in the literature for four benchmark datasets. In addition to the strong out-of-the-box performance, this method reuses the pre-trained MSPM across QSPR/QSAR tasks, thereby reducing the burden of hyperparameter tuning and model training. We posit that transfer learning techniques such as MolPMoFiT could significantly contribute to boosting the reliability of next-generation QSPR/QSAR models, especially for small/medium size datasets that are extremely challenging for QSAR modeling.

Availability of data and materials

The curated datasets (.smi and .csv files) and the full updated code used in this study are freely available at https://github.com/XinhaoLi74/MolPMoFiT.

Funding

We gratefully acknowledge the financial support from DARPA and ARO (grant number W911NF-18-1-0315).

Acknowledgements

We gratefully acknowledge the financial support from DARPA and ARO (grant number W911NF-18-1-0315).


List of Abbreviations

MolPMoFiT: Molecular Prediction Model Fine-Tuning; QSPR/QSAR: quantitative structure property/activity relationship; SMILES: Simplified Molecular-Input Line-Entry System; GCNN: graph convolutional neural network; RNN: recurrent neural network; CNN: convolutional neural network; CV: computer vision; NLP: natural language processing; MLT: multitask learning; MSPM: molecular structure prediction model; ULMFiT: Universal Language Model Fine-Tuning; AWD-LSTM: ASGD Weight-Dropped LSTM; D-MPNN: Directed Message Passing Neural Network; RF: random forest; FFN: feed-forward network; TTA: test-time augmentation.


References

1. Cherkasov A, Muratov EN, Fourches D, et al (2014) QSAR modeling: Where have you been? Where are you going to? J Med Chem 57:4977–5010. https://doi.org/10.1021/jm4004285 2. Mater AC, Coote ML (2019) Deep Learning in Chemistry. J Chem Inf Model 59:2545– 2559. https://doi.org/10.1021/acs.jcim.9b00266 3. Tropsha A (2010) Best Practices for QSAR Model Development, Validation, and Exploitation. Mol Inform 29:476–488. https://doi.org/10.1002/minf.201000061 4. Ma J, Sheridan RP, Liaw A, et al (2015) Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships. J Chem Inf Model 55:263–274. https://doi.org/10.1021/ci500747n 5. Fourches D, Williams AJ, Patlewicz G, et al (2018) Computational Tools for ADMET Profiling. In: Computational Toxicology. pp 211–244 6. Li X, Kleinstreuer NC, Fourches D (2020) Hierarchical Quantitative Structure–Activity Relationship Modeling Approach for Integrating Binary, Multiclass, and Regression Models of Acute Oral Systemic Toxicity. Chem Res Toxicol 33:353–366. https://doi.org/10.1021/acs.chemrestox.9b00259 7. Ash J, Fourches D (2017) Characterizing the Chemical Space of ERK2 Kinase Inhibitors Using Descriptors Computed from Molecular Dynamics Trajectories. J Chem Inf Model 57:1286–1299. https://doi.org/10.1021/acs.jcim.7b00048 8. Fourches D, Ash J (2019) 4D- quantitative structure–activity relationship modeling: making a comeback. Expert Opin Drug Discov 1–9. https://doi.org/10.1080/17460441.2019.1664467 9. Xue L, Bajorath J (2012) Molecular Descriptors in Chemoinformatics, Computational Combinatorial Chemistry, and Virtual Screening. Comb Chem High Throughput Screen 3:363–372. https://doi.org/10.2174/1386207003331454 10. Gilmer J, Schoenholz SS, Riley PF, et al (2017) Neural Message Passing for Quantum Chemistry. http://arxiv.org/abs/1704.01212 11. Chen C, Ye W, Zuo Y, et al (2019) Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals. Chem Mater 31:3564–3572. https://doi.org/10.1021/acs.chemmater.9b01294 12. Tang B, Kramer ST, Fang M, et al (2020) A self-attention based message passing neural network for predicting molecular lipophilicity and aqueous solubility. J Cheminform 12:15. https://doi.org/10.1186/s13321-020-0414-z 13. Withnall M, Lindelöf E, Engkvist O, Chen H (2020) Building attention and edge message passing neural networks for bioactivity and physical-chemical property prediction. J Cheminform 12:1–18. https://doi.org/10.1186/s13321-019-0407-y 14. Yang K, Swanson K, Jin W, et al (2019) Analyzing Learned Molecular Representations for Property Prediction. J Chem Inf Model 59:3370–3388.


https://doi.org/10.1021/acs.jcim.9b00237 15. Duvenaud D, Maclaurin D, Aguilera-Iparraguirre J, et al (2015) Convolutional Networks on Graphs for Learning Molecular Fingerprints. Adv Neural Inf Process Syst 2015- Janua:2224–2232 16. Coley CW, Barzilay R, Green WH, et al (2017) Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction. J Chem Inf Model 57:1757–1772. https://doi.org/10.1021/acs.jcim.6b00601 17. Wu Z, Ramsundar B, Feinberg EN, et al (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9:513–530. https://doi.org/10.1039/C7SC02664A 18. Pham T, Tran T, Venkatesh S (2018) Graph Memory Networks for Molecular Activity Prediction. In: Proceedings - International Conference on Pattern Recognition. pp 639– 644 19. Wang X, Li Z, Jiang M, et al (2019) Molecule Property Prediction Based on Spatial Graph Embedding. J Chem Inf Model acs.jcim.9b00410. https://doi.org/10.1021/acs.jcim.9b00410 20. Feinberg EN, Sur D, Wu Z, et al (2018) PotentialNet for Molecular Property Prediction. ACS Cent Sci 4:1520–1530. https://doi.org/10.1021/acscentsci.8b00507 21. Stokes JM, Yang K, Swanson K, et al (2020) A Deep Learning Approach to Antibiotic Discovery. Cell 180:688-702.e13. https://doi.org/10.1016/j.cell.2020.01.021 22. Goh GB, Hodas NO, Siegel C, Vishnu A (2017) SMILES2Vec: An Interpretable General- Purpose Deep Neural Network for Predicting Chemical Properties. http://arxiv.org/abs/1712.02034 23. Zheng S, Yan X, Yang Y, Xu J (2019) Identifying Structure–Property Relationships through SMILES Syntax Analysis with Self-Attention Mechanism. J Chem Inf Model 59:914–923. https://doi.org/10.1021/acs.jcim.8b00803 24. Kimber TB, Engelke S, Tetko I V, et al (2018) Synergy Effect between Convolutional Neural Networks and the Multiplicity of SMILES for Improvement of Molecular Prediction. http://arxiv.org/abs/1812.04439 25. Goh GB, Siegel C, Vishnu A, et al (2017) Chemception: A Deep Neural Network with Minimal Chemistry Knowledge Matches the Performance of Expert-developed QSAR/QSPR Models. https://arxiv.org/pdf/1706.06689.pdf 26. Goh GB, Siegel C, Vishnu A, Hodas NO (2017) Using Rule-Based Labels for Weak Supervised Learning: A ChemNet for Transferable Chemical Property Prediction. 9:. https://doi.org/10.475/123 27. Paul A, Jha D, Al-Bahrani R, et al (2018) CheMixNet: Mixed DNN Architectures for Predicting Chemical Properties using Multiple Molecular Representations. http://arxiv.org/abs/1811.08283 28. Goh GB, Siegel C, Vishnu A, et al (2018) How Much Chemistry Does a Deep Neural Network Need to Know to Make Accurate Predictions? In: Proceedings - 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018. pp 1340–1349


29. Fernandez M, Ban F, Woo G, et al (2018) Toxic Colors: The Use of Deep Learning for Predicting Toxicity of Compounds Merely from Their Graphic Images. J Chem Inf Model 58:1533–1543. https://doi.org/10.1021/acs.jcim.8b00338 30. Asilar E, Hemmerich J, Ecker GF (2020) Image Based Liver Toxicity Prediction. J Chem Inf Model acs.jcim.9b00713. https://doi.org/10.1021/acs.jcim.9b00713 31. Varnek A, Fourches D, Hoonakker F, Solov’ev VP (2005) Substructural fragments: an universal language to encode reactions, molecular and supramolecular structures. J Comput Aided Mol Des 19:693–703. https://doi.org/10.1007/s10822-005-9008-0 32. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Model 28:31–36. https://doi.org/10.1021/ci00057a005 33. Weininger D, Weininger A, Weininger JL (1989) SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Model 29:97–101. https://doi.org/10.1021/ci00062a008 34. Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci 79:2554–2558. https://doi.org/10.1073/pnas.79.8.2554 35. Lipton ZC, Berkowitz J, Elkan C (2015) A Critical Review of Recurrent Neural Networks for Sequence Learning. http://arxiv.org/abs/1506.00019 36. Kim Y (2014) Convolutional Neural Networks for Sentence Classification. http://arxiv.org/abs/1408.5882 37. Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. In: Advances in Neural Information Processing Systems. pp 5999–6009 38. Deng J, Dong W, Socher R, et al (2009) ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp 248–255 39. Canziani A, Paszke A, Culurciello E (2016) An Analysis of Deep Neural Network Models for Practical Applications. http://arxiv.org/abs/1605.07678 40. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient Estimation of Word Representations in Vector Space. http://arxiv.org/abs/1301.3781 41. Pennington J, Socher R, Manning CD (2014) GloVe: Global Vectors for Word Representation. In: Empirical Methods in Natural Language Processing (EMNLP). pp 1532–1543 42. Joulin A, Grave E, Bojanowski P, et al (2016) FastText.zip: Compressing text classification models. http://arxiv.org/abs/1612.03651 43. Peters ME, Neumann M, Iyyer M, et al (2018) Deep contextualized word representations. http://allennlp.org/elmo 44. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. http://arxiv.org/abs/1810.04805 45. Howard J, Ruder S (2018) Universal language model fine-tuning for text classification. In: 77

ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers). pp 328–339 46. Yang Z, Dai Z, Yang Y, et al (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding. http://arxiv.org/abs/1906.08237 47. Liu Y, Ott M, Goyal N, et al (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. http://arxiv.org/abs/1907.11692 48. Gaulton A, Bellis LJ, Bento AP, et al (2012) ChEMBL: A large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:1100–1107. https://doi.org/10.1093/nar/gkr777 49. Jaeger S, Fulle S, Turk S (2018) Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. J Chem Inf Model 58:27–35. https://doi.org/10.1021/acs.jcim.7b00616 50. Hu W, Liu B, Gomes J, et al (2019) Strategies for Pre-training Graph Neural Networks. https://arxiv.org/pdf/1905.12265.pdf 51. Xu Y, Ma J, Liaw A, et al (2017) Demystifying Multitask Deep Neural Networks for Quantitative Structure-Activity Relationships. J Chem Inf Model 57:2490–2504. https://doi.org/10.1021/acs.jcim.7b00087 52. Sosnin S, Karlov D, Tetko I V, Fedorov M V (2019) Comparative Study of Multitask Toxicity Modeling on a Broad Chemical Space. J Chem Inf Model 59:1062–1072. https://doi.org/10.1021/acs.jcim.8b00685 53. de la Vega de León A, Chen B, Gillet VJ (2018) Effect of missing data on multitask prediction methods. J Cheminform 10:26. https://doi.org/10.1186/s13321-018-0281-z 54. Wu K, Wei G-W (2018) Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks. J Chem Inf Model 58:520–531. https://doi.org/10.1021/acs.jcim.7b00558 55. Varnek A, Gaudin C, Marcou G, et al (2009) Inductive Transfer of Knowledge: Application of Multi-Task Learning and Feature Net Approaches to Model Tissue-Air Partition Coefficients. J Chem Inf Model 49:133–144. https://doi.org/10.1021/ci8002914 56. Ramsundar B, Liu B, Wu Z, et al (2017) Is Multitask Deep Learning Practical for Pharma? J Chem Inf Model 57:2068–2076. https://doi.org/10.1021/acs.jcim.7b00146 57. Wu K, Wei G-W (2018) Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks. J Chem Inf Model 58:520–531. https://doi.org/10.1021/acs.jcim.7b00558 58. Varnek A, Gaudin C, Marcou G, et al (2009) Inductive Transfer of Knowledge: Application of Multi-Task Learning and Feature Net Approaches to Model Tissue-Air Partition Coefficients. J Chem Inf Model 49:133–144. https://doi.org/10.1021/ci8002914 59. Merity S, Xiong C, Bradbury J, Socher R (2016) Pointer Sentinel Mixture Models. http://arxiv.org/abs/1609.07843 60. Linzen T, Dupoux E, Goldberg Y (2016) Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies. http://arxiv.org/abs/1611.01368 61. Gulordava K, Bojanowski P, Grave E, et al (2018) Colorless green recurrent networks 78

dream hierarchically. http://arxiv.org/abs/1803.11138 62. Radford A, Jozefowicz R, Sutskever I (2017) Learning to Generate Reviews and Discovering Sentiment. http://arxiv.org/abs/1704.01444 63. Merity S, Keskar NS, Socher R (2018) Regularizing and optimizing LSTM language models. In: 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings 64. Hochreiter S, Schmidhuber J (1997) Long Short-Term Memory. Neural Comput 9:1735– 1780. https://doi.org/10.1162/neco.1997.9.8.1735 65. Smith LN (2018) A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. http://arxiv.org/abs/1803.09820 66. Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? In: Advances in Neural Information Processing Systems. pp 3320–3328 67. Adam Paszke; Sam Gross; et al (2017) Automatic differentiation in PyTorch. 31st Conf Neural Inf Process Syst (NIPS 2017) 68. Howard J, Gugger S (2020) Fastai: A Layered API for Deep Learning. Information 11:108. https://doi.org/10.3390/info11020108 69. Swain M MolVS: Molecule Validation and Standardization. https://github.com/mcs07/MolVS 70. Landrum G RDKit: Open-source cheminformatics. http://www.rdkit.org 71. Fadaee M, Bisazza A, Monz C (2017) Data Augmentation for Low-Resource Neural Machine Translation. http://arxiv.org/abs/1705.00440 72. Kobayashi S (2018) Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, Stroudsburg, PA, USA, pp 452–457 73. Kafle K, Yousefhussien M, Kanan C (2017) Data Augmentation for Visual Question Answering. In: Proceedings of the 10th International Conference on Natural Language Generation. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 198– 202 74. Lei C, Hu B, Wang D, et al (2019) A preliminary study on data augmentation of deep learning for image classification. In: ACM International Conference Proceeding Series 75. Bjerrum EJ (2017) SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules. http://arxiv.org/abs/1703.07076 76. Arús-Pous J, Blaschke T, Ulander S, et al (2019) Exploring the GDB-13 chemical space using deep generative models. J Cheminform 11:20. https://doi.org/10.1186/s13321-019- 0341-z 77. Arús-Pous J, Johansson SV, Prykhodko O, et al (2019) Randomized SMILES strings improve the quality of molecular generative models. J Cheminform 11:71. https://doi.org/10.1186/s13321-019-0393-0 79

78. Cortes-Ciriano I, Bender A (2015) Improved Chemical Structure–Activity Modeling Through Data Augmentation. J Chem Inf Model 55:2682–2692. https://doi.org/10.1021/acs.jcim.5b00570 79. Sheridan RP (2013) Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. J Chem Inf Model 53:783–790. https://doi.org/10.1021/ci400084k


Chapter 4. SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning

Xinhao Li & Denis Fourches*

Department of Chemistry, Bioinformatics Research Center, North Carolina State University,

Raleigh, NC 27695, United States.

* To whom correspondence should be sent. Email: [email protected]


Abstract

SMILES-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES Pair Encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performances for both molecular generation (by boosting the validity and novelty of generated SMILES) and QSAR prediction tasks. In particular, we evaluated the performance of SPE-based QSAR prediction models using 24 benchmark datasets, where SPE consistently matched or outperformed atom-level tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open source Python package, SmilesPE, was developed to implement this algorithm and is freely available at https://github.com/XinhaoLi74/SmilesPE.


4.1. Introduction

Over the past few years, the cheminformatics community has witnessed dramatic advances in using deep learning neural networks (DLNN) to tackle challenging tasks ranging from molecular property prediction [1–5] to de novo molecular generation and optimization [6–8]. The success of deep learning techniques in natural language processing (NLP) makes the use of text-based molecular representations an attractive research area [9]. Processing text-based chemical representations for deep learning models requires breaking those structures into a sequence of standard units (or 'tokens'), a process called tokenization. The tokens are supposed to encode the essential structural features that can reliably and consistently characterize each compound. Specific neural network architectures such as recurrent neural networks (RNN) [10], convolutional neural networks (CNN) [11], or Transformers [12] then process those string-based tokens and use them to learn molecular representations for various modeling tasks.

In that context, SMILES [13, 14] (Simplified Molecular Input Line Entry System) is the most popular text representation of chemicals; it encodes a molecular graph as a fairly simple, human-readable sequence of characters. SMILES strings are typically used to store chemical structures, but one should underline that they lack explicit 2D (and 3D) information on atom/bond connectivity/coordinates and the overall molecular graph. Therefore, Quantitative Structure-Activity Relationship (QSAR) models based on SMILES strings are generally seen as less reliable than models based on 2D (and/or 3D) molecular descriptors. The standard approach for SMILES tokenization is to simply break the SMILES string character by character. Such character-level tokenization has several issues: chemically meaningful information regarding a single atom can be represented by multiple characters, which may result in ambiguous meanings. With character-level tokenization, '[C@@H]' is tokenized into six characters '[', 'C', '@', '@', 'H' and ']' even though it encodes the stereochemistry information of a single carbon atom. The token 'C' refers to the symbol of carbon but can also be part of the symbol of chlorine ('C' and 'l'). Atom-level tokenization is a more commonly used method that follows the character-level tokenization with some modifications to ensure atoms are extracted as tokens: (1) multi-character element symbols such as 'Cl' and 'Br' are considered as individual tokens; (2) special characters encoded between brackets are considered as tokens (e.g., '[nH]', '[O-]' and '[C@]').
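For illustration, atom-level tokenization is often implemented with a single regular expression that keeps bracketed atoms and two-character element symbols together. The pattern and function name below are a commonly used, simplified sketch rather than the exact rules of any particular package.

```python
import re

# Simplified atom-level SMILES pattern: bracket atoms ([nH], [O-], [C@@H]),
# two-character elements (Cl, Br), two-digit ring closures (%10), then single characters.
SMILES_PATTERN = re.compile(r"(\[[^\]]+\]|Br|Cl|%\d{2}|.)")

def atomwise_tokenize(smiles: str):
    """Split a SMILES string into atom-level tokens."""
    return SMILES_PATTERN.findall(smiles)

print(atomwise_tokenize("C[C@@H](Cl)Br"))
# ['C', '[C@@H]', '(', 'Cl', ')', 'Br']
```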

In this application note, we present SMILES Pair Encoding (SPE), a data-driven substructural tokenization algorithm for deep learning applications. SPE is inspired by the byte pair encoding (BPE) algorithm [15], a major tokenization method in NLP. BPE was initially developed as a data compression algorithm and was later adopted as a subword tokenization algorithm. BPE identifies common words and frequent subword units from a large text corpus and assigns them as unique tokens. During the tokenization process, less common words are further broken into frequent subword units (e.g., 'goodness' is broken into 'good' and 'ness'). Similarly to BPE, SPE identifies and keeps the frequent SMILES substrings as unique tokens. Starting from atom-level tokens, SPE generates the SMILES substring tokens by iteratively merging the high frequency token pairs found in a large chemical dataset. SPE enhances the widely used atom-level tokenization in two major aspects:

1. Chemically meaningful substructures: SPE ensures that the most common SMILES substrings are represented as unique tokens. The SMILES substrings encode molecular substructures that include richer information and better reflect the molecular functionalities compared to the atom-level tokens;

2. Shorter input for deep learning models: The input token sequences from SPE are shorter compared to those from atom-level tokenization. Shorter inputs can reduce the computational cost and accelerate DLNN model training.

Herein, we performed two case studies to showcase the potential of SPE for both molecular generation and predictive QSAR models. The goal of these case studies is to evaluate whether SPE tokenization could represent a valuable alternative to atom-level tokenization. This study demonstrates that for both generative and predictive QSAR tasks, SPE tokenization led to superior performances compared to atom-level tokenization. In the molecular generation case study, we trained RNN-based language models (LM) with SPE and atom-level tokenization, respectively. One common issue of LM-based molecular generative models is the low validity rate of generated SMILES. Our results show that SPE tokenization significantly improved the validity rate compared to atom-level tokenization. In the second case study, we compared the two tokenization methods using 24 benchmark datasets for QSAR modeling purposes. SPE achieved better or comparable prediction performances for 23 of the 24 datasets and, on average, offered a significant 5-fold speed-up in model training.

The major contributions of this study are:

1. Propose a new SMILES tokenization algorithm that would be useful for a wide range of cheminformatics / DLNN modeling tasks.

2. Develop an open source Python package, SmilesPE, which enables the training of SMILES pair encoding on a large dataset and the use of a trained SPE vocabulary to tokenize SMILES for deep learning applications. SmilesPE is freely available at https://github.com/XinhaoLi74/SmilesPE and can be installed via pip.


4.2. Method

4.2.1. SMILES Pair Encoding

The SMILES pair encoding algorithm consists of two major steps: the vocabulary training step, which learns the high frequency SMILES substrings from a large chemical dataset, and the tokenization step, which applies the trained vocabulary to a given set of SMILES and returns a sequence of tokens. In this section, we describe how to train an SPE vocabulary and how to use the trained vocabulary to tokenize SMILES for deep learning.

A SMILES Pair Encoding (SPE) vocabulary is trained according to the following steps:

• Step 1: Tokenize SMILES from a large dataset (e.g., ChEMBL [16]) at atom-level;

• Step 2: Initialize the vocabulary with all unique tokens;

• Step 3: Iteratively count the occurrence of all token pairs in the tokenized SMILES, merge the most frequently occurring token pair into a new token, and add it to the vocabulary. This step stops when one of two conditions is met: (1) a desired vocabulary size is reached, or (2) no pair of tokens has a frequency larger than a given frequency threshold. The maximum vocabulary size (MVS) and the frequency threshold (FT) are the hyperparameters for training SMILES pair encoding. A simplified Python sketch of this merge loop is given after these steps.
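The snippet below is a naive, illustrative version of the vocabulary-training loop. It reuses the atomwise_tokenize sketch shown earlier; the function and parameter names (train_spe_vocab, max_merges, min_frequency) are hypothetical and the code ignores the efficiency optimizations used in the actual SmilesPE package.

```python
from collections import Counter

def train_spe_vocab(smiles_corpus, max_merges=30000, min_frequency=2000):
    """Simplified SPE vocabulary training: start from atom-level tokens and
    iteratively merge the most frequent adjacent token pair in the corpus."""
    corpus = [atomwise_tokenize(smi) for smi in smiles_corpus]
    merges = []  # ordered list of learned merge operations
    for _ in range(max_merges):
        # Count all adjacent token pairs across the tokenized corpus.
        pair_counts = Counter()
        for tokens in corpus:
            for a, b in zip(tokens, tokens[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        best_pair, freq = pair_counts.most_common(1)[0]
        if freq < min_frequency:  # stop when no pair is frequent enough
            break
        merges.append(best_pair)
        merged_token = best_pair[0] + best_pair[1]
        # Apply the merge to every tokenized SMILES before the next iteration.
        new_corpus = []
        for tokens in corpus:
            out, i = [], 0
            while i < len(tokens):
                if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best_pair:
                    out.append(merged_token)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges
```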

After training the SPE vocabulary, we can then tokenize any given set of SMILES strings based on the trained vocabulary. Importantly, the SMILES substrings in the trained vocabulary are ordered by their frequency and can be used for the chemical analysis of that particular database.

During the tokenization process, each SMILES string is first tokenized at the atom level. SPE then iteratively merges the pair of adjacent tokens with the highest frequency count in the trained SPE vocabulary until no further merging operation can be conducted.
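A matching sketch of this tokenization step is given below. It consumes the merges list returned by the hypothetical training sketch above; the real SmilesPE implementation may organize and apply the learned merges differently.

```python
def spe_tokenize(smiles, merges):
    """Simplified SPE tokenization: start from atom-level tokens and repeatedly
    apply the highest-priority learned merge until no further merge is possible."""
    rank = {pair: i for i, pair in enumerate(merges)}  # lower rank = more frequent pair
    tokens = atomwise_tokenize(smiles)
    while True:
        # Find the adjacent pair with the best (lowest) rank in the trained vocabulary.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank
        ]
        if not candidates:
            break
        _, i = min(candidates)
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]
    return tokens
```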

It is worth noting that the proposed algorithm can also be applied to other popular text-based representations of chemicals for DLNN applications, such as DeepSMILES [17] and SELFIES [18]. DeepSMILES is a variant of SMILES with a different representation of branches and rings, but it still shares the same atom-level characters with SMILES. Moreover, SELFIES represents all information of a molecular graph (atoms, bonds, branches and rings) as characters in brackets, which can be directly recognized as tokens by the atom-level tokenization. As a result, one could train a specific SPE vocabulary dedicated to DeepSMILES or SELFIES without any modification.

4.2.2. Dataset Preparation

ChEMBL25 [16] was used to train the SPE vocabulary and benchmark the generative models. The QSAR benchmark datasets were directly taken from a previous study by Cortés-Ciriano et al. [19] that includes curated pIC50 values for 24 protein targets. All molecules were standardized with the following steps using the MolVS [20] and RDKit [21] packages in Python: (1) sanitize with RDKit; (2) replace all atoms with the most abundant isotope for that element; (3) remove counterions in the salts and neutralize the molecules; (4) remove the mixtures. The canonical SMILES were then generated for modeling. After curation, about 1.7 million ChEMBL25 SMILES remained. The QSAR benchmark datasets are summarized in Table 4.1.
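A minimal sketch of such a curation step is shown below, using RDKit only (the original workflow also relied on MolVS; neutralization is omitted here for brevity). The function name, the largest-fragment heuristic, and the treatment of mixtures are assumptions made for illustration.

```python
from rdkit import Chem

def standardize_smiles(smiles: str):
    """Illustrative curation sketch: sanitize, strip isotopes, keep the largest
    fragment, and return a canonical SMILES (or None for unparsable input)."""
    mol = Chem.MolFromSmiles(smiles)      # parsing also sanitizes the molecule
    if mol is None:
        return None
    for atom in mol.GetAtoms():           # drop explicit isotope labels
        atom.SetIsotope(0)
    frags = Chem.GetMolFrags(mol, asMols=True)
    if len(frags) > 1:
        # Keep only the largest fragment; true mixtures could instead be discarded.
        mol = max(frags, key=lambda m: m.GetNumHeavyAtoms())
    return Chem.MolToSmiles(mol)           # canonical SMILES for modeling

# e.g. the sodium salt of aspirin keeps the acetylsalicylate fragment, dropping [Na+]
print(standardize_smiles("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))
```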


Table 4.1. Summary of QSAR benchmark datasets.
Targets                 Number of Molecules*
A2a                     199
Dopamine                469
Dihydrofolate           573
Carbonic                591
ABL1                    755
opioid                  777
Cannabinoid             1,086
COX-1                   1,306
Monoamine               1,307
LCK                     1,336
Glucocorticoid          1,387
Ephrin                  1,507
Caspase                 1,584
Coagulation             1,591
Estrogen                1,622
B-raf                   1,717
Glycogen                1,724
Vanilloid               1,761
Aurora-A                2,084
JAK2                    2,388
COX-2                   2,759
Acetylcholinesterase    2,966
erbB1                   4,742
HERG                    5,010
* Sorted from small to large

4.2.3. Machine Learning

Molecular generation was formulated as a language/text modeling task, which was first introduced by Waller et al. [22] for de novo molecular design. The RNN-based language models were trained on a large chemical data set to predict the next token tᵢ₊₁ given the sequence of preceding tokens {t₁, t₂, …, tᵢ}. The models learn a probability distribution over the training molecules and can then sample from the learned distribution to generate new molecules.

The QSAR models were developed using the MolPMoFiT framework we developed recently [23]. MolPMoFiT is an effective transfer learning method for QSAR modeling which uses the chemical language model pre-training + task-specific fine-tuning strategy [24]. Fine-tuning the pre-trained language model on QSAR datasets enables the knowledge learned from the large unlabeled chemical data to be transferred to smaller supervised datasets.

4.2.4. Evaluation Metrics

4.2.4.1. Evaluation Metrics for Generative Models

• Validity: the percentage of generated SMILES that can be converted to valid molecules;

• Novelty: the percentage of valid molecules that are not included in the training set;

• Uniqueness: the percentage of valid molecules that are unique.
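The three metrics above can be computed directly from the sampled SMILES; a minimal sketch with RDKit is shown below (the function name is an assumption, and the training SMILES are assumed to be valid).

```python
from rdkit import Chem

def generation_metrics(generated_smiles, training_smiles):
    """Compute validity, novelty and uniqueness on canonicalized SMILES."""
    valid = []
    for smi in generated_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            valid.append(Chem.MolToSmiles(mol))  # canonicalize for comparison
    training_set = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in training_smiles}
    validity = len(valid) / len(generated_smiles) if generated_smiles else 0.0
    novelty = sum(s not in training_set for s in valid) / len(valid) if valid else 0.0
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, novelty, uniqueness
```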

4.2.4.2. Evaluation Metrics for QSAR Models

All 24 QSAR benchmark datasets correspond to regression tasks. The root-mean-square error (RMSE), coefficient of determination (R²) and mean absolute error (MAE) were used as evaluation metrics for the regression models. Cohen's d [25] (Eq 1) measures the relative performance of two methods: x̄₁ and x̄₂ are the mean values for each group of results, and SD₁ and SD₂ are the corresponding standard deviations. A positive d value means method 1 has a larger mean than method 2, while a negative d value means method 1 has a smaller mean than method 2. The thresholds for small, medium and large effects are set to 0.2, 0.5 and 0.8, as recommended [25, 26]. An effect with |d| (absolute value of d) less than 0.2 is considered no difference; between 0.2 and 0.5, a minor difference; between 0.5 and 0.8, a medium difference; and greater than 0.8, a large difference. In the following analysis, method 1 refers to SPE tokenization and method 2 refers to atom-level tokenization.

Cohen's d = (x̄₁ − x̄₂) / √((SD₁² + SD₂²) / 2)    (Eq 1)
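As a worked illustration of Eq 1, the short function below computes Cohen's d from two groups of results; the example numbers are purely illustrative.

```python
import statistics

def cohens_d(results_1, results_2):
    """Cohen's d (Eq 1): difference of the means scaled by the pooled standard
    deviation. Positive values mean method 1 has the larger mean."""
    mean1, mean2 = statistics.mean(results_1), statistics.mean(results_2)
    sd1, sd2 = statistics.stdev(results_1), statistics.stdev(results_2)
    return (mean1 - mean2) / (((sd1 ** 2 + sd2 ** 2) / 2) ** 0.5)

# Example: RMSE values of two methods over repeated splits (illustrative numbers).
spe_rmse = [0.61, 0.63, 0.60, 0.62, 0.64]
atom_rmse = [0.66, 0.68, 0.65, 0.67, 0.69]
print(cohens_d(spe_rmse, atom_rmse))  # negative d: method 1 has the lower RMSE
```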


4.2.5. Experiments

Training an SPE vocabulary. SPE is a data-driven algorithm; therefore, both data quality and quantity are crucial. SMILES augmentation [23, 27–30] is widely used as a data augmentation technique in deep learning applications. In order to capture common SMILES substrings in both canonical and non-canonical SMILES, we generated one non-canonical SMILES for each canonical SMILES in the curated ChEMBL dataset. As a result, 3.4M SMILES were obtained for training the actual SPE vocabulary. The maximum vocabulary size (MVS) was set to 30,000 and the frequency threshold (FT) was set to 2,000 to ensure that the common SMILES substrings can be included in the vocabulary.

Language models. We trained two language models, with SPE tokenization and atom-level tokenization, respectively. A high-quality language model requires a large training corpus. Herein, 9 million SMILES (1 canonical + 5 non-canonical SMILES for each compound) generated from the curated ChEMBL25 dataset were used for model training. The model architecture we chose for language modeling is AWD-LSTM [31] (ASGD Weight-Dropped LSTM), a variant of LSTM (long short-term memory) models enhanced with various kinds of dropout and regularization. Specifically, dropout is applied to the embedding layer, input layer, weights and hidden layers. It has been shown to yield strong performances for language modeling in NLP. We chose the same model hyperparameters used in our previous MolPMoFiT study [23]: the models have an embedding layer with a size of 400, three LSTM layers with 1,152 hidden units per layer, and a softmax layer. We applied an embedding dropout of 0.1, input dropout of 0.6, weight dropout of 0.5 and hidden dropout of 0.2. Both models were trained with a base learning rate of 0.008 for 10 epochs using the one-cycle policy [32].


Molecular Generation. For each language model, ten sampled sets of 1,000 SMILES strings were generated and evaluated for their validity, novelty, and uniqueness. The validity of generated SMILES was assessed with RDKit.

QSAR models. The QSAR models were fine-tuned on the pre-trained language models following the MolPMoFiT procedure [23]. All models were tuned with respect to base learning rates and training epochs on the validation sets and evaluated on the test sets of ten random 80:10:10 splits. SMILES augmentation was applied as described in our previous study [23]. During training, the SMILES of the training sets were augmented 25 times and the SMILES of the validation sets were augmented 15 times. Test-time augmentation (TTA) was applied to compute the final predictions: for each compound, the final prediction is generated by averaging the predictions for the canonical SMILES and four augmented SMILES. These SMILES augmentation settings were found to perform well on a variety of datasets.
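The TTA averaging described above could be sketched as follows; predict_fn is a placeholder for the trained QSAR model's prediction function, and the helper name and default n_aug are illustrative assumptions.

```python
import random
from rdkit import Chem

def predict_with_tta(predict_fn, smiles, n_aug=4, seed=0):
    """Sketch of test-time augmentation: average the prediction for the canonical
    SMILES with predictions for several randomized SMILES of the same molecule."""
    random.seed(seed)
    mol = Chem.MolFromSmiles(smiles)
    variants = [Chem.MolToSmiles(mol)]  # canonical SMILES
    atoms = list(range(mol.GetNumAtoms()))
    for _ in range(n_aug):
        random.shuffle(atoms)  # random atom order -> random SMILES of the same molecule
        variants.append(Chem.MolToSmiles(Chem.RenumberAtoms(mol, atoms), canonical=False))
    preds = [predict_fn(s) for s in variants]
    return sum(preds) / len(preds)
```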

4.2.6. Implementation

We implemented all the machine learning models using PyTorch [33], fastai [34] and MolPMoFiT. The MolPMoFiT code is available at https://github.com/XinhaoLi74/MolPMoFiT.

4.3. Results and Discussion

4.3.1. SMILES Pair Encoding on ChEMBL

A dataset with ~3.4 million SMILES generated from the curated ChEMBL dataset, containing both canonical and non-canonical SMILES, was used to train an SPE vocabulary. The trained SPE vocabulary contained 3,002 unique SMILES substrings with lengths ranging from 1 to 22 (Figure 4.1); the length was computed by counting the number of atom-level characters in the SMILES substrings. As shown in Figure 4.2, the SMILES substrings are human-readable and mostly correspond to chemically meaningful substructures and functional groups. The full SPE vocabulary can be downloaded from the project GitHub repository. Several machine learning architectures [35] and techniques [30, 36] can interpret model predictions by computing importance/contribution scores of the input tokens. In this regard, the SMILES substrings are more interpretable than individual atom-level characters.

Figure 4.1. Distribution of length of SMILES Pair Encoding substrings trained on ChEMBL.


Figure 4.2. Representative SPE fragments.

Table 4.2 shows some examples of tokenized SMILES from SPE. Compared to atom-level tokenization, SPE provides a more compact representation of SMILES for deep learning models. Figure 4.3 shows the results of SPE and atom-level tokenization for the ChEMBL25 dataset. The SPE tokenization has a mean length of approximately 6 tokens while the atom-level tokenization has a mean length of approximately 40. Such shorter input sequences can dramatically benefit DLNN models in different aspects. Due to the sequential nature of RNN-based models, they require longer training times and suffer from long-term dependency issues with long input sequences. As a result, for the same deep learning application, SPE can reduce the computational cost and accelerate the training and inference processes.

Table 4.2. Examples of Tokenized SMILES.

SMILES: CC(CCCCC(=O)Nc1ccc(C(F)(F)F)cc1)NCC(O)c1cccc(Cl)c1
SMILES substrings: 'CC(', 'CCCC', 'C(=O)Nc1ccc(', 'C(F)(F)F)cc1)', 'N', 'CC(O)', 'c1cccc(Cl)c1'

SMILES: CCC(O)(C(=O)Nc1ccccc1Cl)C(F)(F)F
SMILES substrings: 'CCC(O)(', 'C(=O)N', 'c1ccccc1Cl)', 'C(F)(F)F'

SMILES: O=C1CS/C(=N/N=C\c2ccco2)N1Cc1ccccc1
SMILES substrings: 'O=C1', 'CS', '/C(', '=N/N', '=C\\', 'c2ccco2)', 'N1', 'Cc1ccccc1'


Figure 4.3. Distribution of length of tokenized SMILES of ChEMBL. Blue: SMILES Pair Encoding tokenization; Orange: Atom-level tokenization.

4.3.2. Molecular Generation Case Study

We evaluated the performance of SPE versus atom-level tokenization using the RNN-based language model architecture described in the Experiments section. The models were trained using 9 million SMILES (1 canonical + 5 non-canonical SMILES for each compound) generated from the curated ChEMBL25 dataset. We compared the validity, novelty and uniqueness of ten sets of 1,000 sampled SMILES generated from each model. The results are summarized in Table 4.3. The model trained with atom-level tokenization produced only 58.1% valid SMILES, whereas the model trained with SPE tokenization produced 93.1% valid SMILES. Invalid SMILES from SMILES-based generative models are mainly due to violations of the SMILES syntax: (1) missing ring or branch closures; (2) incorrect atomic valences. Instead of generating a molecule atom by atom, the model trained with SPE tokenization uses a fragment-by-fragment approach, which is naturally less error-prone. In addition, SPE tokenization also achieved a higher novelty score compared to the atom-level tokenization (97.3% vs. 96.7%). Both models generated 100% unique molecules. Figure 4.4 shows some examples of generated molecules.

Table 4.3. Metrics for Molecular Generation.
Metric        SMILES Pair Encoding    Atom-level
Validity      0.931 ± 0.008           0.581 ± 0.011
Novelty       0.973 ± 0.006           0.967 ± 0.006
Uniqueness    1.0                     1.0

Figure 4.4. Randomly sampled examples of generated molecules. (a) Examples from the model trained with SMILES Pair Encoding tokenization; (b) examples from the model trained with atom-level tokenization.


4.3.3. Molecular Property Prediction Case Study

We also compared the performances of molecular activity prediction models trained with the two tokenization methods using 24 regression datasets (pIC50). The models were evaluated on ten 80:10:10 random splits. RMSE (Figure 4.5a), R² and MAE (Tables B1 and B2) were used as evaluation metrics. Cohen's d was used to measure the effect size between the two methods (Figure 4.5b). The thresholds for small, medium and large effects were set to 0.2, 0.5 and 0.8, as recommended [25, 26]. As shown in Figure 4.5, models trained with SPE tokenization afforded comparable or better performances for 23 out of 24 datasets compared to those trained with atom-level tokenization. Specifically, SPE tokenization resulted in a large effect for Cannabinoid and a medium effect for A2a, LCK, Estrogen and Aurora-A. In addition to the strong performances, the models with SPE were trained on average 5 times faster due to the shorter input sequences.

Figure 4.5. Results of the QSAR benchmark. (a) Test set RMSE. (b) Effect size (Cohen's d value) of the difference between models trained with SPE tokenization and atom-level tokenization. A positive d value means atom-level tokenization performs better than SPE tokenization; a negative d value means SPE tokenization performs better than atom-level tokenization. An effect with |d| (absolute value of d) less than 0.2 indicates no difference; between 0.2 and 0.5, a minor difference; between 0.5 and 0.8, a medium difference; greater than 0.8, a large difference.


4.4. Conclusion

In this study, we proposed SMILES Pair Encoding (SPE), a data-driven substructure tokenization algorithm for deep learning. SPE learns a vocabulary of high frequency SMILES substrings from ChEMBL and then tokenizes new SMILES into a sequence of tokens for deep learning models. SPE splits SMILES into human-readable and chemically explainable substrings and shows superior performances on both generative and predictive tasks compared to atom-level tokenization. In the generative task, it led to a significantly higher validity and novelty of generated SMILES. In the predictive tasks, SPE showed better or comparable performances on 23 out of 24 datasets. In addition to the strong performances, SPE produces shorter input sequences, which reduces the computational cost of both model training and inference. Overall, SPE could represent a better tokenization method for the development of future deep learning applications in cheminformatics.

Supporting Information (Appendix B)

Performance of QSAR models trained with SPE tokenization (Table B1). Performance of QSAR models trained with atom-level tokenization (Table B2).

List of Abbreviations

SPE, SMILES Pair Encoding; DLNN, deep learning neural networks; NLP, natural language processing; RNN, recurrent neural network; CNN, convolutional neural network; BPE, byte pair encoding; SMILES, Simplified Molecular Input Line Entry System; QSAR, quantitative structure activity relationship; LSTM, long short-term memory; TTA, test-time augmentation; MolPMoFiT, Molecular Prediction Model Fine-Tuning.


References

1. Chen H, Engkvist O, Wang Y, et al (2018) The rise of deep learning in drug discovery. Drug Discov Today 23:1241–1250. https://doi.org/10.1016/J.DRUDIS.2018.01.039 2. Lavecchia A (2019) Deep learning in drug discovery: opportunities, challenges and future prospects. Drug Discov Today 24:2017–2032. https://doi.org/10.1016/j.drudis.2019.07.006 3. Stokes JM, Yang K, Swanson K, et al (2020) A Deep Learning Approach to Antibiotic Discovery. Cell 180:688-702.e13. https://doi.org/10.1016/j.cell.2020.01.021 4. Maziarka Ł, Danel T, Mucha S, et al (2020) Molecule Attention Transformer. http://arxiv.org/abs/2002.08264 5. Muratov EN, Bajorath J, Sheridan RP, et al (2020) QSAR without borders. Chem Soc Rev. http://xlink.rsc.org/?DOI=D0CS00098A 6. Elton DC, Boukouvalas Z, Fuge MD, Chung PW (2019) Deep learning for molecular design - A review of the state of the art. Mol Syst Des Eng 4:828–849. https://doi.org/10.1039/c9me00039a 7. Polykovskiy D, Zhebrak A, Sanchez-Lengeling B, et al (2018) Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. https://arxiv.org/abs/1811.12823 8. Brown N, Fiscato M, Segler MHS, Vaucher AC (2019) GuacaMol: Benchmarking Models for de Novo Molecular Design. J Chem Inf Model 59:1096–1108. https://doi.org/10.1021/acs.jcim.8b00839 9. Öztürk H, Özgür A, Schwaller P, et al (2020) Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today 00: https://doi.org/10.1016/J.DRUDIS.2020.01.020 10. Lipton ZC, Berkowitz J, Elkan C (2015) A Critical Review of Recurrent Neural Networks for Sequence Learning. http://arxiv.org/abs/1506.00019 11. Kim Y (2014) Convolutional Neural Networks for Sentence Classification. http://arxiv.org/abs/1408.5882 12. Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. In: Advances in Neural Information Processing Systems. pp 5999–6009 13. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Model 28:31–36. https://doi.org/10.1021/ci00057a005 14. Weininger D, Weininger A, Weininger JL (1989) SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Model 29:97–101. https://doi.org/10.1021/ci00062a008 15. Sennrich R, Haddow B, Birch A (2016) Neural Machine Translation of Rare Words with Subword Units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational


Linguistics, Stroudsburg, PA, USA, pp 1715–1725 16. Gaulton A, Bellis LJ, Bento AP, et al (2012) ChEMBL: A large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:1100–1107. https://doi.org/10.1093/nar/gkr777 17. O’Boyle N, Dalke A (2018) DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures. chemRxiv. https://github.com/nextmovesoftware/deepsmiles 18. Krenn M, Häse F, Nigam A, et al (2019) Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation. http://arxiv.org/abs/1905.13741 19. Cortés-Ciriano I, Bender A (2019) Deep Confidence: A Computationally Efficient Framework for Calculating Reliable Prediction Errors for Deep Neural Networks. J Chem Inf Model 59:1269–1281. https://doi.org/10.1021/acs.jcim.8b00542 20. Swain M MolVS: Molecule Validation and Standardization. https://github.com/mcs07/MolVS 21. Landrum G RDKit: Open-source cheminformatics. http://www.rdkit.org 22. Segler MHS, Kogej T, Tyrchan C, Waller MP (2018) Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent Sci 4:120–131. https://doi.org/10.1021/acscentsci.7b00512 23. Li X, Fourches D (2020) Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. J Cheminform 12:27. https://doi.org/10.1186/s13321-020-00430-x 24. Howard J, Ruder S (2018) Universal language model fine-tuning for text classification. In: ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers). pp 328–339 25. Cohen J (1988) Statistical power analysis for the behavioral sciences. L. Erlbaum Associates, Hillsdale, N.J. 26. Nicholls A (2016) Confidence limits, error bars and method comparison in molecular modeling. Part 2: Comparing methods. J Comput Aided Mol Des 30:103–126. https://doi.org/10.1007/s10822-016-9904-5 27. Bjerrum EJ (2017) SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules. http://arxiv.org/abs/1703.07076 28. Arús-Pous J, Johansson SV, Prykhodko O, et al (2019) Randomized SMILES strings improve the quality of molecular generative models. J Cheminform 11:1–13. https://doi.org/10.1186/s13321-019-0393-0 29. Arús-Pous J, Blaschke T, Ulander S, et al (2019) Exploring the GDB-13 chemical space using deep generative models. J Cheminform 11:20. https://doi.org/10.1186/s13321-019- 0341-z 30. Karpov P, Godin G, Tetko I V. (2020) Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J Cheminform 12:17. https://doi.org/10.1186/s13321-020- 00423-w 31. Merity S, Keskar NS, Socher R (2018) Regularizing and optimizing LSTM language 101

models. In: 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings 32. Smith LN (2018) A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. http://arxiv.org/abs/1803.09820 33. Adam Paszke; Sam Gross; et al (2017) Automatic differentiation in PyTorch. 31st Conf Neural Inf Process Syst (NIPS 2017) 34. Howard J, Gugger S (2020) Fastai: A Layered API for Deep Learning. Information 11:108. https://doi.org/10.3390/info11020108 35. Zheng S, Yan X, Yang Y, Xu J (2019) Identifying Structure–Property Relationships through SMILES Syntax Analysis with Self-Attention Mechanism. J Chem Inf Model 59:914–923. https://doi.org/10.1021/acs.jcim.8b00803 36. Goh GB, Hodas NO, Siegel C, Vishnu A (2017) SMILES2Vec: An Interpretable General- Purpose Deep Neural Network for Predicting Chemical Properties. http://arxiv.org/abs/1712.02034


Chapter 5. Benchmarking Transfer Learning Approaches on Small Chemical Datasets

Abstract

In the lead optimization stage of drug discovery, the structure of a molecule is modified to improve its usefulness as a drug. This process typically involves iteratively synthesizing and experimentally testing hundreds or even thousands of molecules. Therefore, it is highly relevant to utilize in silico predictive tools such as QSAR models to aid decision making and reduce the cycle time. Lead optimization endpoints usually contain a small set of congeneric molecules, which limits the application of deep learning. Transfer learning enables training a machine learning model with less data. Recently, we developed a transfer learning approach based on language model pre-training combined with task-specific fine-tuning for QSAR modeling. The SMILES representation learned by the language model can also be used by other machine learning methods. In this study, we explored this transfer learning approach to QSAR modeling for lead optimization endpoints. A set of lead-optimization-like benchmark datasets was created from 8 datasets that include potency values against protein targets. Two language models, an LSTM model and a RoBERTa model, were trained on 10M SMILES from ChEMBL and then used as the knowledge source for the downstream QSAR tasks. The results show that transfer learning indeed provides performance gains for lead optimization endpoints.


5.1. Introduction

Lead optimization is a vital part of the drug discovery process which involves several iterations of optimizing the structures of a set of small molecules to improve their desired pharmacological profile. It typically requires the design, synthesis, and experimental testing of hundreds of compounds over several years. Therefore, in silico predictive tools are crucial for rational, hypothesis-driven decision making and cycle time reduction [1–3]. High quality QSAR models can provide insights and guidance in the lead optimization process by predicting the potential modifications that are most likely to improve the activity/property [4–7]. Recently, deep neural networks have been successfully applied to molecular activity/property predictions and have shown remarkable performances in several QSAR case studies [8–13]. However, deep learning models typically require large training sets, which makes them more useful in the early stages of drug discovery where abundant data are available. In the lead optimization stages, only a small set of congeneric molecules is available, which in turn is a challenge for either physics-based or data-driven modeling approaches. Transfer learning has proven able to make modeling more effective with less labelled data [14, 15]. It enables knowledge learned from one task to be transferred to another related task. Our recent work shows that the knowledge transferred from a language model trained on millions of SMILES can provide significant performance gains for downstream QSAR tasks [16].

In this chapter, we show the preliminary results of our efforts to explore transfer learning approaches to QSAR modeling for lead optimization endpoints and to understand the value and limits of such approaches. A set of lead-optimization-like benchmark datasets was created from 8 datasets that include pIC50 values for protein targets. We used the language model pre-training + task-specific fine-tuning transfer learning framework we developed recently [16]. We compared the transfer learning method with a baseline traditional shallow QSAR model trained with structural fingerprints and with the state-of-the-art graph-based message passing neural network. We demonstrate that transfer learning indeed achieves strong performance on small datasets.

5.2. Method

5.2.1. Datasets

The ChEMBL25 [17] dataset was used for training the language models. The QSAR benchmark datasets were created from the Cortés-Ciriano et al. dataset, which includes curated potency values (pIC50) for 8 protein targets [18]. All molecules were standardized with the following steps using the MolVS [19] and RDKit [20] packages in Python: (1) sanitize with RDKit; (2) replace all atoms with the most abundant isotope for that element; (3) remove counterions in the salts and neutralize the molecules; (4) remove the mixtures. The canonical SMILES were then generated for modeling.

In this study, we focus on lead optimization (LO) datasets, which usually contain a small number of structurally similar molecules (usually sharing a common molecular scaffold/substructure). Thus, starting from the curated Cortés-Ciriano et al. dataset, sphere exclusion clustering (https://docs.chemaxon.com/display/docs/Sphere+Exclusion+clustering) was performed using Morgan fingerprints of radius 2 and a Tanimoto coefficient threshold of 0.65. Structurally similar compounds from singletons or less populated clusters were added manually to larger clusters after visual inspection.
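The clustering itself was done with the ChemAxon implementation; the sketch below only illustrates the sphere-exclusion idea with RDKit Morgan fingerprints, assigning each compound to the first cluster center it matches at or above the Tanimoto threshold. The function name and parameters are illustrative assumptions.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def sphere_exclusion_clusters(smiles_list, threshold=0.65, radius=2, n_bits=2048):
    """Minimal sphere-exclusion sketch: a compound joins the first cluster whose
    center it matches (Tanimoto >= threshold); otherwise it seeds a new cluster."""
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, nBits=n_bits)
        for s in smiles_list
    ]
    centers, clusters = [], []
    for i, fp in enumerate(fps):
        for c, center_fp in enumerate(centers):
            if DataStructs.TanimotoSimilarity(fp, center_fp) >= threshold:
                clusters[c].append(i)
                break
        else:
            centers.append(fp)   # compound outside all existing spheres -> new center
            clusters.append([i])
    return clusters  # lists of indices into smiles_list
```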

The final lead optimization-like datasets are summarized in Table 5.1 and Figure 5.1. The datasets were split into training and test sets using a scaffold split with a ratio of 80:20 (Figure 5.2).
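A scaffold split of this kind could be sketched as below, grouping compounds by Bemis-Murcko scaffold with RDKit so that no scaffold is shared between training and test sets; the group-assignment heuristic and function name are assumptions made for illustration.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    """Sketch of a scaffold split: group compounds by Bemis-Murcko scaffold, then
    assign whole scaffold groups (largest first) to the training set until the
    target ratio is reached; remaining groups form the test set."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(i)
    train_idx, test_idx = [], []
    n_train_target = int((1.0 - test_fraction) * len(smiles_list))
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train_idx) + len(group) <= n_train_target:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx
```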


Table 5.1. Lead Optimization-like Datasets.

Target                  Number of Compounds
Acetylcholinesterase    117
Aurora-A                290
B-raf                   230
Caspase                 110
HERG                    117
JAK2                    307
Vanilloid               133
erbB1                   374

Figure 5.1. Property Distribution of Lead Optimization-like Datasets.


Figure 5.2. Training and Test Set Property Distribution Following the Scaffold Split of Lead Optimization-like Datasets (Ratio = 80:20).

5.2.2. Transfer Learning

In this study, we use the language model pre-training + task-specific fine-tuning framework for transfer learning. A language model can learn molecular representations from SMILES and provide significant performance gains for downstream QSAR tasks. In downstream QSAR tasks, the pre-trained language model can be either (1) used as a feature generator or (2) directly fine-tuned on the QSAR datasets. SMILES is a text representation of molecules that encodes a molecular graph as a string of characters. The characters represent the atoms, bonds, and branches of the molecular structure. A single molecular structure can be represented by multiple

SMILES. Thus, SMILES augmentation has been used as a data augmentation technique for

SMILES-based deep learning models. As the choice of tokenization method also affects model performance, we chose the SMILES pair encoding (SPE) tokenization method we developed recently [21]. SPE is a data-driven tokenization method that tokenizes a SMILES into


chemical fragments (SMILES substrings) based on a learned vocabulary. The SPE vocabulary used in this study, which contains 3072 unique tokens, was downloaded from https://github.com/XinhaoLi74/SmilesPE.
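
As an illustration of SPE tokenization, the snippet below follows the usage shown in the SmilesPE repository (the vocabulary file name is an assumption; see the repository for the released file):

```python
import codecs
from SmilesPE.tokenizer import SPE_Tokenizer

spe_vocab = codecs.open("SPE_ChEMBL.txt")   # pre-trained SPE vocabulary (file name assumed)
spe = SPE_Tokenizer(spe_vocab)

# The SMILES is split into chemically meaningful substrings rather than single atoms/characters.
print(spe.tokenize("CC[N+](C)(C)Cc1ccccc1Br"))
```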

Language modelling: We implemented two language modelling approaches: the classic language model and the masked language model (Figure 5.3). Both models were trained on ~10M

SMILES generated from ChEMBL. This corresponds to one canonical SMILES and four augmented SMILES from each molecule.
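
SMILES augmentation can be performed with RDKit's randomized SMILES writer; a minimal sketch consistent with the one-canonical-plus-four-augmented scheme described above:

```python
from rdkit import Chem

def augment(smiles, n_augmented=4):
    """Return the canonical SMILES plus n_augmented randomized SMILES of the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    out = [Chem.MolToSmiles(mol)]                                              # canonical form
    out += [Chem.MolToSmiles(mol, doRandom=True) for _ in range(n_augmented)]  # randomized forms
    return out

print(augment("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: 5 equivalent SMILES strings
```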

A classic language model aims to predict the next possible token based on the given

SMILES token sequence (Figure 5.3a). It is usually trained with a recurrent neural network (RNN).

At each time step, the language model takes the hidden state from the previous time step and the current token as input, generating a hidden state to capture the contextual meaning of the current token. Due to the sequential nature of RNN models, the language model can only use the preceding tokens as its context. The language model was trained using the same setting and procedure as

MolPMoFiT (see Chapter 3 for details). The model was trained for 20 epochs.

A masked language model aims to reconstruct the original SMILES token sequence from an altered version in which some tokens are randomly masked (Figure 5.3b). Bidirectional Encoder

Representations from Transformers (BERT) was the first approach to use a masked language model task to learn the contextual relations between words in a sentence or text. The model architecture of BERT is a Transformer encoder consisting of multiple blocks of multi-headed self-attention and feed-forward layers. Compared to an LSTM-based language model, BERT captures the bi-directional context of a word/token: the BERT [22] model learns the contextual meaning of a token based on all the other tokens in the sequence, whereas the LSTM model can only use the tokens before the current token as its context. Here, we used the Robustly optimized


BERT approach (RoBERTa) [23], a variant of BERT with an improved training methodology, for the language model pre-training. The model follows a small BERT architecture: 6 layers, a hidden size of 768, and

12 attention heads. The model was trained for 5 epochs.
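
For reference, a RoBERTa model of this size can be instantiated with the Huggingface Transformers library roughly as follows; the vocabulary size (SPE tokens plus special tokens) and the other defaults are assumptions for illustration:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=3072 + 5,      # SPE vocabulary plus special tokens (assumed)
    num_hidden_layers=6,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = RobertaForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```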

Figure 5.3. Classic language model (a) vs. masked language model (b).

Molecular representations from language models: The language model encoders process the input tokens and output the hidden states that encode the contextual meanings of the input tokens (Figure 5.3). Molecular representations can be extracted from the hidden states of


pre-trained language models for downstream tasks. Different methods can be applied to generate molecular representation from the pre-trained language model. For LSTM-based language models, the hidden state of the last time step of the last LSTM layer encodes the information of the whole sequence which can be used as the molecular representation. For BERT-based language models, a

[CLS] token is added to the beginning of the input sequence (Figure 5.3b), and the hidden state of the [CLS] token is used as the molecular representation. In addition, the molecular representation can also be generated by pooling over the hidden states (Figure 5.4). The commonly used pooling methods are mean pooling and max pooling. The molecular representation can also be the concatenation of max pooling, mean pooling, and/or the hidden state of the last time step (concat pooling).

Figure 5.4. Pooled Molecular Representation.


In this study, we evaluated multiple methods for converting the token embeddings into a whole-SMILES representation (a minimal pooling sketch is given after the list):

• For the masked language model (BERT model):
  o [CLS] token (MLM_cls)
  o Max pooling (MLM_max)
  o Mean pooling (MLM_mean)
  o Concatenation of max and mean pooling (MLM_max_mean)
• For the classic language model (LSTM model):
  o Max pooling (LM_max)
  o Mean pooling (LM_mean)
  o Concatenation of max and mean pooling (LM_max_mean)
  o Hidden state of the last time step (LM_hidden)
  o Concatenation of max pooling, mean pooling, and the hidden state of the last time step (LM_all)
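
A minimal PyTorch sketch of these pooling operations over the per-token hidden states (shapes and names are illustrative):

```python
import torch

def pool(hidden_states, method="max_mean"):
    """hidden_states: (seq_len, hidden_size) tensor from a language model encoder."""
    max_pool = hidden_states.max(dim=0).values
    mean_pool = hidden_states.mean(dim=0)
    last_step = hidden_states[-1]                      # last time step (LSTM case)
    if method == "max":
        return max_pool
    if method == "mean":
        return mean_pool
    if method == "max_mean":
        return torch.cat([max_pool, mean_pool])
    if method == "all":
        return torch.cat([max_pool, mean_pool, last_step])
    raise ValueError(f"unknown pooling method: {method}")

embedding = pool(torch.randn(30, 768), "all")   # e.g., 30 tokens -> 2304-dimensional embedding
```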

Language model fine-tuning: The pre-trained language model can also be directly fine-tuned on the downstream QSAR tasks. Currently, we have only implemented fine-tuning of the classic language model, using the MolPMoFiT framework we developed previously (see Chapter 3 for details).

5.2.3. Machine Learning

In this study, we evaluated three modelling approaches:

LightGBM: LightGBM [24] is an efficient gradient boosting decision tree algorithm featuring fast training, low memory usage, and high accuracy. It also supports parallel and

GPU learning. LightGBM models were trained with Morgan fingerprints (ECFP6) and the

SMILES embeddings from pre-trained language models on the QSAR benchmark datasets.
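
A condensed sketch of the fingerprint-based baseline (ECFP6-like Morgan fingerprints fed to a LightGBM regressor) is shown below; the molecules and hyperparameters are placeholders:

```python
import numpy as np
import lightgbm as lgb
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp6(smiles, n_bits=2048):
    """Morgan fingerprint of radius 3 (ECFP6-like) as a numpy bit array."""
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 3, nBits=n_bits)
    return np.array([int(b) for b in fp.ToBitString()])

train_smiles = ["CCO", "CCCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]   # placeholder molecules
train_pic50 = [5.1, 5.4, 6.3, 4.8]                                     # placeholder pIC50 values

X = np.vstack([ecfp6(s) for s in train_smiles])
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X, train_pic50)
print(model.predict(X[:1]))
```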

MolPMoFiT: The LSTM language model was fine-tuned on the QSAR benchmark datasets using the MolPMoFiT procedure (see Chapter 3 for details). SMILES augmentation and


test time augmentation (TTA) were applied: for each molecule, 25 augmented SMILES were generated.

Chemprop: Chemprop [8] is a state-of-the-art graph-based deep learning method for molecular property prediction using a message passing neural network (MPNN). The graph features learned by Chemprop can also be combined with other features: the additional features are concatenated with the learned graph features before being fed into the feed-forward neural network. In this study, we evaluated four versions of Chemprop models: (1) graph features only

(Chemprop_default); (2) graph features and RDKit descriptors (Chemprop_rdkit); (3) graph features and SMILES embedding from pre-trained LSTM language model

(Chemprop_molpmofit); and (4) graph features and SMILES embedding from pre-trained

RoBERTa model (Chemprop_roberta). All models were trained on the QSAR benchmark datasets with the default settings.

5.2.4. Implementation

The deep learning models were implemented using PyTorch (https://pytorch.org/) deep learning framework. The RoBERTa model was trained using Huggingface Transformers [25] library (https://github.com/huggingface/transformers). The implementation of MolPMoFiT can be found at https://github.com/XinhaoLi74/MolPMoFiT. The implementation of Chemprop can be found at https://github.com/chemprop/chemprop. The LightGBM models were trained using

LightGBM library (https://github.com/microsoft/LightGBM).

5.3. Results and Discussion

In order to evaluate different machine learning approaches, especially transfer learning, on lead optimization endpoints, we created 8 lead optimization-like datasets. Each dataset contains a


small number of structurally similar molecules. Figure 5.5 shows some representative molecules from the HERG dataset; all of them share the same molecular scaffold.

Figure 5.5. Representative Molecules from HERG.

The transfer learning method used in this study is language model pre-training + task-specific fine-tuning. Two language models, (1) an LSTM and (2) a RoBERTa, were trained on the

ChEMBL dataset. We evaluated two forms of transfer learning: (1) using the pre-trained language model as a feature generator and (2) fine-tuning the pre-trained language model directly on the downstream QSAR tasks. The SMILES embeddings (molecular representations) extracted from the language models were used to train the LightGBM models. Different methods for generating

SMILES embeddings were evaluated (see Section 5.2.2). The LSTM language model was fine-tuned on the downstream QSAR tasks. The transfer learning models (the LightGBM models trained with SMILES embeddings and the fine-tuned models) were compared with (1) LightGBM models trained with ECFP6 and (2) Chemprop models.

All models were compared based on their average ranks on 8 QSAR datasets (Figure 5.6).

The fine-tuning method outperforms the other methods, with an average rank of 5.5 based on Pearson R and an average rank of 6.6 based on RMSE. The LightGBM model trained with ECFP6 has a


competitive performance whereas the Chemprop models do not. Since Chemprop learns the graph features from scratch, it is not surprising that it did not perform well on the small datasets.

The results suggest that the SMILES embeddings learned by the language model can indeed provide performance gains for small datasets, though the conventional ECFP6 is almost equally performant. Among the models trained with SMILES embeddings, the concat pooling

(LM_max_mean and LM_all) of the LSTM model performs better. Fine-tuning the language model provides larger performance gains than using the language model as a feature extractor, even on small datasets. The choice of SMILES embedding generation method affects model performance.


Figure 5.6. The average rank on the test sets across the 8 QSAR benchmark datasets based on (a) Pearson R and (b) RMSE.


5.4. Conclusion and Future Direction

In this study, we explored the language model pre-training + task-specific fine-tuning transfer learning approach to QSAR modelling for lead optimization endpoints. We created 8 lead optimization-like datasets. Two types of language models, an LSTM model and a RoBERTa model, were trained on 10 million SMILES from ChEMBL. The molecular representations (SMILES embeddings) learned by the pre-trained language models were used to train QSAR models with LightGBM. We also fine-tuned the LSTM model on the downstream QSAR tasks. These models were compared to a LightGBM model trained with ECFP6 and to a state-of-the-art graph-based deep learning model. The preliminary results show that transfer learning approaches, especially fine-tuning, indeed provide performance gains for lead optimization endpoints by expanding the molecular representation required for better learning.

We observed that the LSTM performs better than RoBERTa when used as a feature extractor and that different SMILES feature generation methods result in different performances. In future work, we will examine the models to understand the underlying mechanisms that lead to the increase in performance. This would help us learn more about the usefulness and limitations of transfer learning for QSAR modelling of lead optimization endpoints.

Acknowledgements

This project is part of the work performed during my internship at GSK (summer 2020). I thank Dr. Jamel

Meslamani and Dr. Constantine Kreatsoulas for their support and guidance for this project. I also thank Dr. Jamel Meslamani for his advice and editing of this manuscript.


References

1. Popov VM, Yee WA, Anderson AC (2006) Towards in silico lead optimization: Scores from ensembles of protein/ligand conformations reliably correlate with biological activity. Proteins Struct Funct Bioinforma 66:375–387. https://doi.org/10.1002/prot.21201
2. Vaz RJ, Zamora I, Li Y, et al (2010) The challenges of in silico contributions to drug metabolism in lead optimization. Expert Opin Drug Metab Toxicol 6:851–861
3. Lewis RA (2005) A general method for exploiting QSAR models in lead optimization. J Med Chem 48:1638–1648. https://doi.org/10.1021/jm049228d
4. Cherkasov A, Muratov EN, Fourches D, et al (2014) QSAR modeling: Where have you been? Where are you going to? J Med Chem 57:4977–5010. https://doi.org/10.1021/jm4004285
5. Muratov EN, Bajorath J, Sheridan RP, et al (2020) QSAR without borders. Chem Soc Rev. http://xlink.rsc.org/?DOI=D0CS00098A
6. Tropsha A (2010) Best Practices for QSAR Model Development, Validation, and Exploitation. Mol Inform 29:476–488. https://doi.org/10.1002/minf.201000061
7. Mittal RR, McKinnon RA, Sorich MJ (2009) Comparison data sets for benchmarking QSAR methodologies in lead optimization. J Chem Inf Model 49:1810–1820. https://doi.org/10.1021/ci900117m
8. Yang K, Swanson K, Jin W, et al (2019) Analyzing Learned Molecular Representations for Property Prediction. J Chem Inf Model 59:3370–3388. https://doi.org/10.1021/acs.jcim.9b00237
9. Mater AC, Coote ML (2019) Deep Learning in Chemistry. J Chem Inf Model 59:2545–2559. https://doi.org/10.1021/acs.jcim.9b00266
10. Unterthiner T, Mayr A, Klambauer G, et al. Deep Learning as an Opportunity in Virtual Screening
11. Lavecchia A (2019) Deep learning in drug discovery: opportunities, challenges and future prospects. Drug Discov Today 24:2017–2032. https://doi.org/10.1016/j.drudis.2019.07.006
12. Stokes JM, Yang K, Swanson K, et al (2020) A Deep Learning Approach to Antibiotic Discovery. Cell 180:688-702.e13. https://doi.org/10.1016/j.cell.2020.01.021
13. Chen H, Engkvist O, Wang Y, et al (2018) The rise of deep learning in drug discovery. Drug Discov Today 23:1241–1250. https://doi.org/10.1016/J.DRUDIS.2018.01.039
14. Simões RS, Maltarollo VG, Oliveira PR, Honorio KM (2018) Transfer and Multi-task Learning in QSAR Modeling: Advances and Challenges. Front Pharmacol 9:74. https://doi.org/10.3389/fphar.2018.00074
15. Cai C, Wang S, Xu Y, et al. Transfer Learning for Drug Discovery. https://dx.doi.org/10.1021/acs.jmedchem.9b02147
16. Li X, Fourches D (2020) Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. J Cheminform 12:27. https://doi.org/10.1186/s13321-020-00430-x
17. Gaulton A, Bellis LJ, Bento AP, et al (2012) ChEMBL: A large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:1100–1107. https://doi.org/10.1093/nar/gkr777
18. Cortés-Ciriano I, Bender A (2019) Deep Confidence: A Computationally Efficient Framework for Calculating Reliable Prediction Errors for Deep Neural Networks. J Chem Inf Model 59:1269–1281. https://doi.org/10.1021/acs.jcim.8b00542
19. Swain M. MolVS: Molecule Validation and Standardization. https://github.com/mcs07/MolVS
20. Landrum G. RDKit: Open-source cheminformatics. http://www.rdkit.org
21. Li X, Fourches D (2020) SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning. ChemRxiv
22. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. http://arxiv.org/abs/1810.04805
23. Liu Y, Ott M, Goyal N, et al (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. http://arxiv.org/abs/1907.11692
24. Ke G, Meng Q, Finley T, et al (2017) LightGBM: A highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems. pp 3147–3155
25. Wolf T, Debut L, Sanh V, et al (2019) HuggingFace's Transformers: State-of-the-art Natural Language Processing. https://arxiv.org/abs/1910.03771


Chapter 6. CryptoChem: Encoding and Storing Information Using Chemicals

Phyo Phyo Kyaw Zin1,2┼, Xinhao Li1,2┼, Dhoha Triki1,2, and Denis Fourches1,2*

(┼ equal contribution)

1 Department of Chemistry, North Carolina State University, Raleigh, NC, USA.

2 Bioinformatics Research Center, North Carolina State University, Raleigh, NC, USA.


Abstract

This study presents CryptoChem, a new method and associated software to securely store and transfer information using chemicals. Relying on the concept of Big Chemical Data, molecular descriptors and machine learning techniques, CryptoChem offers a highly complex and robust system with multiple layers of security for transmitting confidential information. This revolutionary technology adds fully untapped layers of complexity and is thus of relevance for different types of applications and users. The algorithm directly uses chemical structures and their properties as the central element of the secured storage. QSDR (Quantitative Structure-Data

Relationship) models are used as private keys to encode and decode the data. Herein, we validate the software with a series of five datasets consisting of numerical and textual information with increasing size and complexity. We discuss (i) the initial concept and current features of

CryptoChem, (ii) the associated MOLREAD and MOLWRITE programs, which encode messages as series of molecules and decode them with an ensemble of QSDR machine learning models, (iii) the Analogue Retriever and Label Swapper methods, which enforce additional layers of security,

(iv) the results of encoding and decoding the five datasets using CryptoChem, and (v) the comparison of CryptoChem to contemporary encryption methods. CryptoChem is freely available for testing at https://github.com/XinhaoLi74/CryptoChem


6.1. Introduction

As the amount of data produced worldwide is growing exponentially, the need to reliably, durably, and securely store and transfer that data has never been so critical. However, data storage is at a pivotal crossroad. Physical storage devices such as optical drives and tapes are not only reaching their technical limit in terms of capacity and density of storage but also in terms of durability. Also on the rise is the need for disruptive encryption methods based on advanced technologies requiring highly technical skills and/or appropriate equipment. To solve these problems, next-generation DNA storage [1, 2] has been developed to encode and store enormous amounts of information on DNA molecules [3]; but this technology suffers from many practical challenges, especially when it comes to obfuscation protocols [4]. Meanwhile, the chemical space is estimated to contain 10^60 unique small molecules that can be characterized by properties directly computed from their two-, three-, or even four-dimensional structures and conformations

[5, 6]. There have been a few attempts to use different subsets of chemicals for establishing novel steganography and/or cryptography methods, especially with dyes [7]. But as of today, to the best of our knowledge, there is no available technology capable of exploiting the chemical space to directly store information in a secure, reproducible, and efficient way.

Herein, we propose to use chemical structures and their properties as the central element for a completely novel data encoding method. It is based on the high complexity and uniqueness of the thousands of structural characteristics and properties that can be computed for every single molecule of the chemical universe. As the chemical universe is estimated to be on the order of 10^60 molecules [5], it is nearly impossible to enumerate all possible molecules and their properties using a brute force algorithm. Therefore, storing information using chemicals could simultaneously offer very high levels of storage density [8] and security. Ultimately, this approach could also be used


to physically store encryption keys or any other information in physical storage technology (e.g., direct encoding by chemicals packaged within a DNA storage cell [9]).

Moreover, for the past decade, robust quantitative structure-activity relationship (QSAR) models [10–13] have been built to predict very specific physical, chemical, and biological endpoints of compounds. Those QSAR models are based on the hypothesis that similar compounds have similar properties. But chemicals first need to be encoded into numerical data that are fully amenable to computer-based calculations. To do so, we use molecular descriptors, which are well-defined, reproducible, interpretable, numerical parameters directly and solely computed from the chemical structure of a compound. For a given chemical, thousands of descriptors can be computed solely based on its two-dimensional structure, and thousands more based on its three-dimensional structure (i.e., one 3D conformation of that compound). Importantly, those 3D conformation-dependent descriptors can also be computed for thousands of 3D conformations of that same particular compound (e.g., time-dependent descriptors computed from molecular dynamics trajectories [6, 14]). Overall, every chemical can be characterized by tens of thousands of numerical descriptors directly computed in silico.
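
As an illustration of how such descriptors are obtained in practice, RDKit can compute structural keys and 2D descriptors directly from a SMILES string (the specific descriptors shown are examples only):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")            # aspirin

maccs = MACCSkeys.GenMACCSKeys(mol)                          # MACCS structural keys
print(maccs.GetNumOnBits(), "bits set out of", maccs.GetNumBits())

# a few of the many 2D descriptors computable from the structure alone
print(Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol))
```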

For this study, we developed Quantitative Structure-Data Relationships (QSDR) models that use machine learning to establish non-linear, quantified links between computed molecular descriptors and numerical/textual data. A collection of QSDR models was built using several types of machine learning techniques (e.g., deep learning neural networks, random forests) and families of 2D, 3D structural descriptors and fingerprints directly computed from various series of molecules. In this CryptoChem project, we designed, implemented, and validated two proof-of- concept programs, MOLWRITE and MOLREAD, enabling the encoding and retrieval of information using series of molecular structures. MOLWRITE program stores and encodes a given message in


the molecular CryptoChem format, whereas the MOLREAD program decodes a given CryptoChem message using the right QSDR model. The feasibility and capabilities of this innovative technology have been tested on a series of five different datasets consisting of numerical and textual information. To have a better understanding of CryptoChem in relation to other popular encoding algorithms, we also compared it to contemporary methods such as block and stream ciphers on the basis of features, weaknesses and strengths.

6.2. Method

6.2.1. Overview of CryptoChem

CryptoChem is comprised of two major components: (1) MOLWRITE and (2) MOLREAD.

The former encodes textual/numerical information into in silico molecules, and the latter decodes the molecules back into the original textual/numerical message. An overview of the CryptoChem encryption and decryption algorithm is provided in Figure 6.1.


Figure 6.1. Simplified workflow of the CryptoChem Algorithm.

MOLWRITE (the encryption part) is composed of four major functions: ASCII Encoder,

Label Swapper (LS) Encoder, Molecular Encoder (ME), and Analogue Retriever (AR). First, the input message is encoded into digits using ASCII Encoder (section 6.2.2. ASCII Encoder). In the current version of the method/software, the algorithm can encode the first 128 characters in the standard ASCII table [15]. To introduce an extra layer of protection in the algorithm, the digits are then transformed into new digits using Label Swapper (LS, see section 6.2.3. Label Swapper).

Next, Molecular Encoder (ME, see section 6.2.4. Molecular Encoder) is applied to replace each value by a virtual reference chemical which has been tagged with labels. Then, an additional layer of security and confusion is added into the algorithm with the use of Analogue Retriever (AR, see section 6.2.5. Analogue Retriever). AR replaces the previously tagged reference molecules with new chemical analogues which have no explicit tags and/or labels associated with them (also defined as the reference set). These new molecular analogues are used to generate the encoded


message, also known as the CryptoChem message, which carries the original textual/numerical information in encoded molecular format (SMILES). The Simplified Molecular-Input Line-Entry System (SMILES) encodes molecular structures as strings of text [16].

MOLREAD (the decryption part) is comprised of three major functions: Molecular Decoder,

Label Swapper (LS, Decoder) and ASCII Decoder. Molecular Decoder converts the molecules from CryptoChem message into digits. Then, LS (Decoder) is applied to transform these digits back into the original digits. These digits are then decoded into characters from the initial message using ASCII Decoder.

In the following sections, we will explain each function accordingly.

6.2.2. ASCII Encoder

ASCII stands for American Standard Code for Information Interchange. ASCII maps numeric values to characters and symbols. For example, the ASCII value of the character

“d” is 100. The complete ASCII table can be found at https://theasciicode.com.ar/. In this study, the original character “d” is converted to the ASCII value of 100. Label Swapper is then applied to switch 100 to a different number (e.g., 91). The new number is then encoded with a chemical molecule from the reference set. For the current version of the method, the algorithm can encode the first 128 characters in the standard ASCII table.
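
The ASCII Encoder/Decoder step itself is a direct character-to-code mapping, as the short sketch below illustrates:

```python
def ascii_encode(message):
    """Map each character to its ASCII code; only the first 128 codes are supported."""
    codes = [ord(c) for c in message]
    assert all(code < 128 for code in codes), "only standard ASCII characters are supported"
    return codes

def ascii_decode(codes):
    return "".join(chr(code) for code in codes)

codes = ascii_encode("d")
print(codes, "->", ascii_decode(codes))   # [100] -> d
```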

6.2.3. Label Swapper

Label swapper (LS) employs a permutation key (also known as the molecular key) and 128

(pre-shared between the message sender and receiver) neighbor molecules to switch the ASCII digits to new digits. The algorithm of LS is directly inspired by the famous Enigma machine used by the Germans during World War II [17]. Instead of using a real keyboard and rotors to encrypt


the information, LS uses the permutation key and the neighbor molecules to generate a virtual look-up table that changes each ASCII digit into another. Each neighbor molecule is assigned a digit (0-127). During label swapping, the distances between the permutation key and the neighbor molecules are computed, and the neighbor molecules are ranked according to the computed distances. The virtual look-up table is formulated based on the originally assigned digits and the computed distance ranks of the neighbor molecules. The Label Swapper is an important component for ensuring a higher security level in CryptoChem.
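
The distance-ranking idea behind the Label Swapper can be sketched as follows. This is a hypothetical simplification: the actual descriptors, distance functions, and rotor mechanism of CryptoChem are not disclosed here.

```python
import numpy as np

def build_lookup_table(key_vector, neighbor_vectors):
    """Rank the 128 pre-shared neighbor molecules by distance to the molecular key and use the
    ranking as a permutation of the ASCII codes 0-127 (hypothetical sketch, no rotors)."""
    distances = np.linalg.norm(neighbor_vectors - key_vector, axis=1)
    ranking = np.argsort(distances)                      # neighbor (digit) at each rank position
    return {int(orig): rank for rank, orig in enumerate(ranking)}

rng = np.random.default_rng(0)
lookup = build_lookup_table(rng.random(166), rng.random((128, 166)))  # toy descriptor vectors
print([lookup[code] for code in [100, 101, 102]])        # three ASCII codes after swapping
```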

Essentially, the native CryptoChem system (without Label Swapper) can be seen as a form of substitution encryption. The machine learning model acts as a substitution key. The assumption of QSDR is that 'similar' molecules (characterized by the specific molecular descriptors used) will be used to represent the same ASCII code. The trained QSDR machine learning models learn the patterns of the text-molecule relationship so that they can map the molecules to the text.

LS is designed to resist several major cryptanalysis techniques against the substitution encryption part in case the machine learning model itself is compromised by an adversary. The concept of LS is shown in Figure 6.2. The central part of Label Swapper is the molecular key. It is of high importance to underline that it can be any molecule chosen from the whole molecular universe (~10^60). Based on selected molecular properties/descriptors of the molecular key, an initial look-up table and some rotors are generated by one or several mathematical functions (Figure

6.2a). The look-up table maps each ASCII code to another. Every time a new text is encoded, the rotors generate a new look-up table. For example, the text 'aaaaaa' would be changed to

'pomcfr'. In the CryptoChem system, the molecular key acts as a permutation key. In the encoding process (MOLWRITE), the original text is changed to a 'new' text before being translated into molecules. In the decoding process (MOLREAD), the text decoded by the machine learning model


will be further translated back to the original text by the Label Swapper. To fully “crack” the Label

Swapper, an adversary would thus need to know (1) the exact molecular key; (2) the exact set of molecular properties/descriptors used for computing the initial look-up table and rotors; (3) the exact set of functions used to compute the initial look-up table and rotors; and (4) how the rotors work.

Figure 6.2. Graphical Illustration of Label Swapper.

6.2.4. Molecular Encoder

The reference set contains 128 clusters of molecules with tagged digits. Molecular Encoder

(ME) takes the output from the previous step of LS (see Figure 6.1), which is a sequence of numerical digits.


Based on each digit, ME selects the corresponding chemical cluster from the reference set and randomly picks a molecule from that cluster. Hence, in this step, the digits from LS are encoded into chemical molecules.
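
Conceptually, the Molecular Encoder is a table lookup followed by a random draw, as in the toy sketch below (the reference set here is a stand-in for the real 128-cluster set):

```python
import random

# toy reference set: digit -> SMILES tagged with that digit
reference_set = {
    12: ["CCO", "CCN", "CCC"],
    91: ["c1ccccc1", "c1ccncc1"],
}

def molecular_encoder(digits, reference_set, rng=random.Random(0)):
    """For each digit, randomly pick one molecule from the cluster tagged with that digit."""
    return [rng.choice(reference_set[d]) for d in digits]

print(molecular_encoder([91, 12, 12], reference_set))
```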

6.2.5. Analogue Retriever

As the name suggests, Analogue Retriever (AR) aims at identifying and retrieving

“analogues” from a very large external set (millions randomly chosen and/or generated among

10^60 chemicals) of molecules and replaces a target reference molecule. It is incorporated into

MOLWRITE to add an additional layer of security in encoding CryptoChem messages. In this disclosed case study, we split the chemical library into two datasets: the reference set (a small set of chemicals with tagged labels) and the analogue set (a large chemical space containing multiple clusters of chemicals).

The reference set contains molecules with associated cluster labels, and the analogue set contains multiple clusters (e.g., thousands) of molecules without any labels attached to them. AR is an essential part of MOLWRITE for improving the security level by further obscuring the relationships between labels and molecules. Without AR in MOLWRITE, chemicals could potentially be mapped to labels.

When the MOLWRITE program encodes texts into molecules, all these molecules are taken from the reference set. CryptoChem messages encoded with molecules directly retrieved from the reference set could be vulnerable to a security breach, since having access to the reference set enables not only encoding but also decoding CryptoChem messages through SMILES matching and identification of their cluster labels. Thus, it is important to devise a strategy to ensure that chemicals used in the

CryptoChem messages are not easily traceable to the original labels associated with them.

Therefore, we developed and implemented Analogue Retriever (AR), which makes it impossible to


directly decode the message, as the same compound is never used twice to encode the same character in the same message or in different messages.

In this method, the molecules with classified labels in the reference set are not actually used in encoding CryptoChem messages; instead, they are only used as reference molecules for picking compounds from an extremely large external set which, in our case, is called the Analogue

Set. The highlight of applying DNN models (see next sections) in CryptoChem is that well-trained

DNN models can, however, predict the right clusters for these analogues.

We incorporated multistage sampling (clustering, stratified sampling) into designing AR to make the process of searching and selecting analogues efficient. There is certainly ample room for optimizing the algorithm and integrating parallel computing on GPUs to further improve the efficiency and runtime. Currently, the software can run on multiple CPUs of a standard desktop computer. The scheme of AR is provided in Figure 6.3.

Figure 6.3. Analogue Retriever scheme for replacing a reference molecule with an analogue from the Analogue Set.


AR takes the output molecules from Molecular Encoder (Figure 6.1) generated initially with the reference set. For readers’ convenience, we will refer to them as target molecules from now on. These reference molecules are later replaced with new molecules from the analogue set.

Then, AR computes the Euclidean distance (among other possible metrics) between each target molecule and multiple centroids (hundreds or thousands) from the list of internal_centroids. The closest centroid is chosen based on the smallest Euclidean distance (and/or other metrics). To make this process efficient, we extracted all the centroids from all selected chemical clusters and assigned a code name to each centroid beforehand. Once the closest centroid is identified, its code name can be extracted. Using the code name, the chemical cluster is identified, and an analogue is randomly selected from that chemical cluster to replace the target molecule. Finally, the molecules from the original CryptoChem message generated initially with the reference set are thus, one by one, entirely replaced by molecules from the analogue set.
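
The nearest-centroid lookup at the heart of AR can be sketched with numpy as follows; the descriptor dimensionality, code names, and analogue sets below are illustrative placeholders:

```python
import numpy as np

def retrieve_analogue(target_vector, centroid_matrix, codenames, analogue_set, rng):
    """Find the internal centroid closest (Euclidean distance) to the target molecule's descriptor
    vector and return a random analogue from the corresponding cluster."""
    distances = np.linalg.norm(centroid_matrix - target_vector, axis=1)
    codename = codenames[int(np.argmin(distances))]
    return rng.choice(analogue_set[codename])

rng = np.random.default_rng(1)
centroids = rng.random((1000, 166))                                 # toy internal centroids
names = [f"C{i}" for i in range(1000)]
analogues = {n: [f"SMILES_{n}_{j}" for j in range(5)] for n in names}
print(retrieve_analogue(rng.random(166), centroids, names, analogues, rng))
```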

6.2.6. Molecular Data Preparation

In this disclosed case study, 1 million molecules were randomly selected from the purchasable ‘drug-like’ molecules in the ZINC15 library [18]. The selected molecules (called V1 library) were used as the training set to develop the deep learning neural network DNN model

(molecular decoder). The full V1 library was grouped into 128 clusters using the k-means algorithm from the Scikit-learn package [19] in Python, based on 166-bit MACCS keys (but any custom pool of descriptors could be used for a given user/application). We then extracted a few molecules from each of the 128 clusters, assigned them labels 0 to 127, and exported them to the reference set. Our machine learning model was trained to associate the molecular MACCS keys with the labels. After removing the reference set, several internal clusters within each of these aforementioned 128 clusters were then generated (Figure C1 in Appendix C), resulting in thousands of chemical


clusters. These chemical clusters, also known as the analogue set, have no explicit label or digit associated with them. The centroids from all the internal clusters were assigned code names and saved as internal_centroids which is an integral part of AR (see section 6.2.5. Analogue

Retriever).
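
The clustering step can be reproduced in outline with scikit-learn's k-means on MACCS keys; the snippet below uses a handful of molecules and two clusters so it runs as-is (128 clusters and ~1M molecules in the real workflow):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from sklearn.cluster import KMeans

smiles = ["CCO", "CCN", "c1ccccc1", "c1ccncc1", "CC(=O)O", "CCCCCC"]   # stand-in for the V1 library
X = np.array([[int(b) for b in MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)).ToBitString()]
              for s in smiles])                                        # RDKit MACCS bit vectors

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)        # 128 clusters in practice
print(kmeans.labels_)              # cluster label (digit) assigned to each molecule
print(kmeans.cluster_centers_.shape)
```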

6.2.7. Model Development

We used the V1 Library as training set to develop the model for our molecular decoder.

The disclosed Decoder model was trained using the Keras 2.2.2 functional API with a TensorFlow backend, GPU acceleration, and the NVIDIA cuDNN libraries [20, 21]. We used the RMSprop algorithm to train for 1,000 epochs on all molecules in one batch with default learning parameters. The model is composed of seven hidden dense layers activated with ReLU. Since it is an integer classification (128 classes) task, we used a multiclass classification model with 128 nodes in the output layer, a softmax activation function, the sparse categorical crossentropy loss function, and accuracy as the evaluation metric. We then evaluated it by predicting the clusters for the same dataset, and the accuracy of our model is above ~99.7%. A few outliers (0.3%, ~140 compounds) which were predicted to have wrong cluster labels were simply discarded to ensure 100% accuracy for chemical-to-character and character-to-chemical QSDR-based recognition.
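
A Keras sketch matching that description (seven ReLU dense layers, a 128-way softmax output, RMSprop with sparse categorical crossentropy) is given below; the layer widths are assumptions, since the exact architecture is not disclosed:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_classes = 166, 128          # MACCS-key input size and number of cluster labels

inputs = keras.Input(shape=(n_features,))
x = inputs
for units in [1024, 1024, 512, 512, 256, 256, 128]:   # seven hidden dense layers (widths assumed)
    x = layers.Dense(units, activation="relu")(x)
outputs = layers.Dense(n_classes, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```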

6.2.8. MOLWRITE Software

The MOLWRITE program takes an input text file and encodes it into molecules (SMILES). The workflow is summarized in Figure 6.1. The input message is first translated into ASCII values (a string of digits). For each digit in the ASCII values, the LS encoder (section 6.2.3. Label Swapper) is then applied to change the original digits into new ones. Based on the new digits, molecules from the reference set are randomly picked (e.g., for digit '12', a molecule tagged with label 12 is


picked) using ME (section 6.2.4. Molecular Encoder). Then, AR (section 6.2.5. Analogue Retriever) replaces the picked molecules with unlabeled analogue molecules from the analogue set, generating CryptoChem messages as output. That CryptoChem message can be further treated with a modern encryption algorithm (e.g., AES).

6.2.9. MOLREAD Software

MOLREAD program takes a CryptoChem message as input file, which contains molecules

(SMILES). The available set of descriptors, developed in Python using RDKit modules, is reported in Table C1. Chemical descriptors (both the sender and the receiver must pre-share and/or know the type and exact number of descriptors applied) are then computed on these molecules and fed into the machine learning model that was initially trained with the V1 library (the substitution key; it could be any model trained with any chemical library). The model then predicts and decodes the cluster labels for each molecule based on its chemical descriptors. The generated cluster labels are stored as a string of digits in the same linear order as the molecules. The LS decoder (section 6.2.3. Label

Swapper) transforms these digits back into the original labels. These digits are translated into characters using ASCII decoder, resulting in the decoded original message.

6.3. Results

6.3.1 Assessment of CryptoChem with five datasets

To demonstrate the feasibility and capabilities of CryptoChem, we validated the QSDR and Molecular Encoder approach by encoding and decoding a series of five datasets consisting of numerical and textual information of increasing size and complexity (Table 6.1). Often considered as benchmarks for storage and encoding techniques, these datasets have different sizes, with complexity increasing from dataset 1 to 5.


Table 6.1. Description of the five datasets to be encoded in this project.

ID   Dataset to be stored using the Molecular Informatics technique                              Size     Complexity
1    "123456789"                                                                                 +        +
2    "abcdefghijklmnopqrstuvwxyz"                                                                +        +
3    "Operation start at 11:00PM"                                                                +        ++
4    Latitude, Longitude, and Corresponding Time Zones for Major Cities.
     Source: https://www.infoplease.com/world/world-geography/major-cities-latitude-longitude-and-corresponding-time-zones    +++      ++
5    Full text of the Declaration of Independence.
     Source: http://www.ushistory.org/declaration/document/                                      +++++    +++++

MOLWRITE directly uses text files (e.g., “.txt”, “.csv”) as input and stores the encoded molecules (SMILES) in CryptoChem messages. MOLREAD takes the generated CryptoChem messages as input and decodes the molecules into original textual message. All the five datasets were saved as plain texts in txt files.

For each of the five datasets, we used MOLWRITE to encode the original numerical and/or textual information with and without analogue retriever. The encoded messages were then decoded by MOLREAD. The results are summarized in Table 6.2.

The first and second datasets consist of the Arabic numerals and the English alphabet, respectively, which are considered the basic elements of English texts. The results for the first and second datasets showed that the MOLWRITE and MOLREAD programs can quickly encode and decode the Arabic numerals and English alphabet with perfect accuracy. The result for the third dataset demonstrated the ability of the MOLWRITE and MOLREAD programs to encode and decode a simple sentence.


Table 6.2. Results of encoding and decoding the five datasets using the MOLWRITE and MOLREAD programs (running in single-CPU mode).

Dataset   No. of characters^a   No. of compounds   Accuracy of original molecules set   Encoding Time   Decoding Time
1         10                    10                 100%                                 1 s             1 s
2         26                    26                 100%                                 1 s             1 s
3         26                    26                 100%                                 1 s             1 s
4         5,163                 5,163              100%                                 144 s           10 s
5         8,685                 8,685              100%                                 247 s           16 s

^a With spaces.

The real challenges are the 4th and 5th datasets. The 4th dataset (Figure 6.4) lists the latitude, longitude, and corresponding time zones for major cities. It is worth noting that this dataset is stored in a tabular format in the txt file. Impressively, the output of MOLREAD for this dataset is identical to the original text. This demonstrates that our programs can encode and decode not only the content of a text but also its format. The 5th dataset is the entire text of the US

Declaration of Independence. We tested our programs on the entire text, comprising 8,685 characters, and the encoded text was fully recovered. This validated our programs and demonstrated the feasibility of storing complex information with molecules.



CCCCOc1cccc(NCCCO)c1.O=C(NCCC[N@H+]1CCCC[C@@H]1CO)c2ccc(O)c(Cl)c2.C[C@@H](C(=O)N1CC Cc2cc(OC(F)(F)F)ccc21)n3cccn3.CCc1nccn1Cc2cc(C(=O)O)ccc2OC.FC(F)(F)c1ccc(C[C@@H]2CCC[NH2+]C2) cc1.Cc1ncnc2c1ncn2c3ccc([C@H](C)O)cc3.CCOC(=O)[C@H]1C(C)=NC(=O)N[C@@H]1c2ccccc2C.COc1cc([C @H](C)NC(=O)[C@@H]2C[C@@H](O)C[NH2+]2)ccc1OC(C)C.Cc1nc(Cl)c2c(I)n[nH]c2n1.COc1cc([C@@H]2[ NH2+]CCc3cc(O)c(O)cc32)ccc1O.COc1ccc(Cl)cc1C(=O)N2CCN([C@@H]3CCC[C@H]3O)CC2.C[C@H]1C[N @@H+](CCc2ccccc2)[C@H](C)CC1=O.Cc1nnc(CN(C)Cc2nc(c3ccc(F)cc3)oc2C)o1.C[C@H](c1ncc(C(C)(C)C)o1 )N2CCC[C@@H](CS(N)(=O)=O)C2.CC[C@H](C)[NH2+]Cc1cccc2[nH]ccc12.OCCCn1cnc(c2ccccc2)c1c3cccc4c 3OCO4.O=C\1NC(=O)N(c2ccccc2F)C(=O)/C1=C/c3ccc(O)cc3.CCN(C(=O)COc1cc(C)ccc1C(N)=O)C2=CCCCC2. Fc1ccc(C2=Nc3nnnn3[C@H](c4ccc(F)cc4)C2)cc1.CCOCCC[NH2+]Cc1cc(Cl)c(OCC)c(OC)c1.Cc1cc(C)c(NC(=O )COc2ccc(F)nc2)c(C)c1.CCn1cnnc1CNC(=O)[C@H](c2ccccc2)N3CCSCC3.C#C[C@@H]1C[C@@H]2CC(=O)[C @@H]3[C@@H]4CCC(=O)[C@@]4(C)CC[C@@H]3[C@@]2(C)C[C@H]1O.Clc1cccc(S[C@H]2CC[NH2+]C2) c1.Cc1ccc(C(=O)C[n+]2ccccc2C)c(C)c1.NC(=O)CN1CCN(C(=O)C=C2CCCCC2)CC1.NNCc1ccccc1OCc2ccccc2. CCOc1cccc(/C=C/2\SC(=S)NC2=O)c1.CN1/C(=C\C=C\2/SC(=S)NC2=O)/C(C)(C)c3ccccc31.CCCOc1ccc(C[NH2 +]CCOCCO)cc1Br.Cc1cc(NC(=O)C[C@H](C)c2ccccc2)no1.Cc1cccc(OC[C@H](O)C[NH2+][C@H](C)c2cncc(F) c2)c1.CCN1C(=O)/C(=C/c2ccc(OC(C)=O)cc2)/SC1=S.CC(C)N1C(=O)/C(=C/c2ccc(N(C)C)cc2)/SC1=S.COc1ccc( Cl)c([C@@H]2CCC[NH2+]C2)c1.Cc1ccc(c2nnc(COC(=O)c3cnccn3)o2)cc1.C[C@@H](O)C[NH2+]CCNCc1cccc c1OCc2ccccc2Cl.O=C(CSc1nncc2ccccc12)NCc3ccccc3F.CC[C@H]1CC[NH2+][C@@H]1Cc2ccc(F)cc2.Cc1cc(C N2CC[NH+](C[C@@H](O)COc3ccccc3)CC2)on1.

Figure 6.4. (a) The 4th dataset, and (b) the encoded version (the SMILES string shown above) generated using MOLWRITE.

6.4. Discussion

CryptoChem builds on the Quantitative Structure-Data Relationship (QSDR) modeling method to encode and store information using machine learning, Big Chemical Data, molecular structures, and their properties. There are several key advantages of using machine learning to create those relationships: (i) the descriptor-to-data relationship is encoded non-linearly, meaning there is no direct, obvious link between a particular chemical scaffold, shape, or volume and the actual character to be stored; (ii) different molecules can still encode the same character, meaning that character frequency analysis (traditionally used to crack methods that


substitute a given alphabet by another one) is useless with this technology; and (iii) the relationships are quantified, meaning there is an actual weight (or set of weights) associating each molecular descriptor with a character. Therefore, a CryptoChem message cannot be read and/or written without using the right QSDR model, which includes the compounds used to train the model, the type of machine learning algorithm, the set of descriptors used to characterize these compounds, and the actual parameters of the model (e.g., descriptor weights, number of neurons/hidden layers if using a DNN). In this disclosed study, we utilized implicit and explicit 2D molecular descriptors/fingerprints and deep learning neural network (DNN) models to encrypt and decipher

CryptoChem messages containing textual and numerical data.

We have successfully developed our first QSDR-DNN models that directly link the molecular descriptors of chemicals in the V1 dataset to numerical/textual characters. It should be noted that the application of our DNN models differs from others found in the literature. In contemporary cheminformatics research, DNN models are usually used to build quantitative relationships between molecular structures and their properties/activities (e.g., binding affinity, toxicity) in the context of drug discovery or computational toxicology. For those models, the relevance and choice of descriptor sets are crucial to the model's ability to predict these desired outputs. DNN models must be trained with the most appropriate and relevant set of molecular descriptors to correctly predict the targeted endpoint. In other words, the accuracy of those predictions relies heavily on choosing the right set of features, the relevance between these features and the targeted characteristics, data curation, etc. If such important criteria are not met, it is usually challenging to obtain high accuracy. It is therefore important to follow standard protocols [22] in preprocessing data, training such QSDR models, and evaluating their accuracy to ensure that the model's performance will be good enough to predict values on new data.


However, in this study, we train our models to associate manually created labels (assigned by the k-means clustering algorithm) with the molecular descriptors of compounds. The major difference between contemporary approaches and ours is that we can manipulate the descriptor sets and cluster labels as we see fit, and our model will still learn to associate certain descriptors with cluster labels. If it fails to predict the right clusters, we can simply discard those data. In essence, our models learn to distinguish molecules based on specific molecular descriptors. Our goal is to use DNN models as keys to decode information, so the standard protocols used in developing machine learning models are not strictly relevant here. For a given compound, trained DNN models must assign the right cluster for our large libraries of molecules given the right set of descriptors. So, we trained and fitted our models (envisioned as substitution keys) with the entire dataset, instead of splitting it into training and test sets as in other contemporary cheminformatics studies. The wrongly predicted compounds, which represent a very small percentage, are discarded. This way, our fitted models have learned to associate molecules in our libraries with cluster labels with 100% accuracy.

MOLWRITE and MOLREAD programs were implemented to respectively write and read

CryptoChem messages encoded with chemical molecules that could only be decoded using our

QSDR DNN model as the substitution key together with the specific descriptor sets. In the Assessment of CryptoChem with five datasets section, we showed the true accomplishment of our system by encoding and decoding five datasets containing textual and numerical information and tabular formats of increasing complexity with 100% accuracy, not only in content but also in format. Besides the fact that this is, to our knowledge, the first time the full text of the US Declaration of Independence has been encoded with chemicals, this result validates this proof-of-concept project and demonstrates the feasibility of storing complex information with molecules and their properties.


Additionally, we implemented Analogue Retriever (AR) to replace molecules in

CryptoChem messages generated from the reference set in MOLWRITE with new "analogues" from the analogue set. By introducing new molecules from an unlabeled external set, it ensures better data integrity even if the initial reference database and CryptoChem messages are compromised.

However, our well-trained QSDR models can decipher them correctly provided the right descriptor set and the correct sets of parameters are used. We tested AR by applying it to the five datasets mentioned previously. The molecules from these CryptoChem messages were replaced with new molecules using AR, and we decoded them using the MOLREAD program. The QSDR model used in

MOLREAD was able to decipher the analogue-retrieved CryptoChem messages with 100% accuracy. This accomplishment showed two important things: (1) our DNN models successfully learned the intricate features and representations mapping molecular descriptors to cluster labels, and (2) AR was designed with a good understanding of how our DNN model works.

In terms of implementing the AR, we tested several different metrics such as Tanimoto,

Euclidean distance, string similarity, etc. We initially assumed that we could sample a few molecules from different clusters in the external set and select the cluster containing a molecule with the highest Tanimoto score, highest string match score, or lowest Euclidean distance to the target molecule. These approaches were tested with different sample sizes and clustering methods

(DL-model based, initial k-means based), but all failed to retrieve the original messages. None of them yielded an accuracy score of more than 51%. The analogues are not chosen simply based on Tanimoto scores, Euclidean distances, or string matching. More complexity is involved, and selecting the analogues relies heavily on the initial k-means clustering. Unless the centroids from the initial k-means clustering generated with the exact random seed are known, the actual decoding could be near impossible. On the other hand, this showed that our model may have learned to


recognize these centroids and decipher the cluster labels on its own. By successfully implementing

AR, we may have gained a better understanding of the inner workings of our QSDR model.

The summary of security layers in CryptoChem is shown in Figure 6.5. In order to encode the message, one can select a particular chemical space including a set of highly specific and similar chemical compounds or a combination of various chemical scopes. The chemical universe is vast and there are many possibilities of designing and enumerating one’s in-silico libraries. For instance, cheminformatics software such as PKS Enumerator [23] or SIME [24] can automatically build extremely large libraries of macrocycles, macrolactones or macrolides with multiple constitutional and structural constraints.

Next, the choice of descriptor sets or fingerprints is also highly relevant in establishing the

QSDR relationships between chemicals and cluster labels as well. Both the sender and receiver must know the type and exact descriptor set applied in order to successfully encode and decode

CryptoChem messages. For even further improved security, one can specify in-house, hand-picked chemical descriptors or use 3D descriptors for which both the sender and the receiver have to know the method to specify the conformational arrangement used for each molecule to derive the correct

3D descriptors. Going one step further, one can even use 4D/MD descriptors [14], where chemicals are docked or run through a molecular dynamics simulation in a specified binding site of a protein or ribosome target of interest for a specific duration. One can then extract 4D/MD chemical descriptors from that process and use them to establish the QSDR relationships. This is, however, highly complex and thus very challenging and rewarding at the same time, if the proper protocol and regulations are meticulously applied.


Figure 6.5. Summary of security layers in CryptoChem

Another security layer is AR, which obscures and adds confusion to the link between chemical structures and the digits associated with them. Since new analogues without any associated cluster labels (the analogue set) are used in CryptoChem messages, it is difficult for third parties to decode the labels by attempting to identify the chemicals, even with access to the reference set. It is also highly improbable to reproduce the same QSDR model without using the exact parameters, such as the number of decision trees (in RF) or the nodes, layers, activation functions, dropout layers, regularizations, etc. (in DNN); thus, it is treated as an additional security layer in

CryptoChem technology. LS is another security layer protecting CryptoChem messages. It can protect them from cryptanalysis attacks purely based on frequency analysis by generating new labels based on the permutation key (a molecular key) and the 128 neighbor molecules.


We also compared CryptoChem to contemporary encryption methods such as block cipher

(e.g., AES - Advanced Encryption Standard [25]) and stream cipher (e.g., One-Time Pad [26]). A summary of encryption features among these three encryption methods is provided in Table 6.3.

Both CryptoChem and block ciphers are based on substitution and permutation concepts, whereas stream ciphers are based on substitution alone. In terms of cryptographic keys, CryptoChem requires both substitution and permutation keys; the former can be any virtual chemical library of the user's choice and the latter can be any chemical molecule that needs to be passed through a different secure channel or in a physical form. With the application of the Label Swapper, even if the permutation key (any chemical molecule of the user's choice) and the substitution key (any large virtual chemical library) are compromised, the key space for the permutation key is still rather large (factorial of 128

= 128!) due to the application of the Label Swapper in CryptoChem. Meanwhile, block cipher keys are bit-dependent (56, 128, 256 bits, etc.) and consequently the key space depends on the number of bits employed (e.g., 2^256 ≈ 1.2 x 10^77). Stream ciphers rely on a pseudorandom number generator. A block cipher is deterministic when the same initialization vector is used; an encryption system is deterministic if the same ciphertext is reproducible given the same plaintext and key.

Stream ciphers are deterministic as well. On the contrary, CryptoChem is not deterministic, meaning that different encoded messages are generated even if the same key molecule and models are applied to generate them.

The XOR operation [27] is not applicable to CryptoChem, although it is used in both block and stream ciphers. Next, we compared the avalanche effect, an important trait in determining the strength of a cryptographic algorithm, wherein a higher avalanche effect is desired for a stronger encryption system [28]. An avalanche effect is achieved if a substantial change in the cipher message can be triggered by a slight change in the plaintext for a fixed key [29]. CryptoChem has


a high level of avalanche effect since one character can be expressed by a cluster of chemicals, and the use of the Analogue Retriever and Label Swapper switches chemicals and changes labels accordingly. Consequently, very different CryptoChem messages are generated even when the same message is used as input. Block ciphers possess the avalanche effect whereas stream ciphers do not.

The operation unit of CryptoChem is the byte (character), which is converted to molecules in the encrypted message. On the other hand, block ciphers use blocks of bits and stream ciphers use bytes. Both CryptoChem and block ciphers use the principles of confusion and diffusion, which are central to the security of conventional encryption algorithms [30], while stream ciphers use the confusion principle alone. Confusion is the concealment of the relation between the secret key and the ciphertext, whereas diffusion refers to the complexity of the relationship between the plaintext and the ciphertext [30]. Regarding reversibility, both CryptoChem and block ciphers make it highly improbable to reverse the ciphertext back to the original text, unlike stream ciphers, which apply XOR and can easily reverse the ciphertext to the plaintext. CryptoChem and stream ciphers are strongly resistant to errors, meaning that an error in a character will not affect the rest. However, block ciphers are more susceptible to errors because an error in a byte will affect the entire block.

It is important to underscore that, despite the results of this comparison, we are not claiming that CryptoChem can be seen as a robust, fully secure, quantum-ready encryption technique in its current state. In order to fully assess those claims, a specific and detailed cryptanalysis study would need to be conducted (far beyond the scope of this proof-of-concept study).


Table 6.3. Comparison with Contemporary Encryption Methods.

Feature                  | CryptoChem                                                                  | Block cipher                                   | Stream cipher
Principle                | Substitution-permutation                                                    | Substitution-permutation                       | Substitution
Cryptographic key        | Substitution key + permutation key                                          | 56-, 128-, 256-bit, etc.                       | Pseudorandom number generator
Deterministic            | No (even with the same key molecule and model)                              | Yes (if using the same initialization vector)  | Yes
Key space                | Substitution key: any chemical library; permutation key: 128! ≈ 4 x 10^215  | Depends on key length (256-bit: 2^256 ≈ 1.2 x 10^77) | ---
XOR based                | No                                                                          | Yes                                            | Yes
Avalanche effect         | Yes                                                                         | Yes                                            | No
Operation unit           | Character (byte) to molecule                                                | Block of bits                                  | Byte
Confusion and diffusion  | Uses both confusion and diffusion                                           | Uses both confusion and diffusion              | Relies on confusion only
Reversibility            | Reversing the encrypted text is hard                                        | Reversing the encrypted text is hard           | Uses XOR for encryption, which can be easily reversed to the plain text
Susceptibility to error  | Strong (an error in a character will not affect the rest)                   | Weak (an error in a byte will affect the entire block) | Strong (an error in a byte will not affect the rest)

6.5. Conclusion

In this study, we conceived and implemented the CryptoChem method, which uses chemical structures and their properties as the central elements for encoding and storing information. In this proof-of-concept version, we achieved the implementation of both MOLWRITE and MOLREAD programs to write and read CryptoChem messages. The additional levels of complexity afforded by Label Swapper and Analogue Retriever significantly strengthen the security of CryptoChem. The program validation on five datasets showed that we accomplished our goal of encoding and accurately decoding data/information using molecules. Additionally, the comparison of CryptoChem to contemporary encryption methods could justify launching a detailed cryptanalysis of the method to fully understand its strong characteristics as well as its weak points. Such a full cryptanalysis, together with stress tests, will need to be conducted in the future to further assess and potentially demonstrate the security of CryptoChem, especially with the rise of quantum computers.

Moreover, one could use CryptoChem as an additional layer of protection prior to encrypting a message using AES or another modern cryptographic technology.
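As a sketch of that layered use, the output of the CryptoChem encoder can simply be treated as the plaintext of a conventional cipher. The example below relies on a hypothetical `molwrite` placeholder standing in for the CryptoChem encoder and on the Fernet recipe (AES-128 in CBC mode with HMAC authentication) from the third-party cryptography package; it illustrates the layering idea only and is not part of the released CryptoChem code.

```python
# Layered protection sketch: encode with CryptoChem first (hypothetical
# `molwrite` placeholder), then encrypt the molecular message with a
# modern symmetric cipher (Fernet, from the `cryptography` package).
from cryptography.fernet import Fernet

def molwrite(message: str) -> str:
    """Hypothetical stand-in for the CryptoChem encoder (returns SMILES)."""
    return "CCO CC(=O)O c1ccccc1"  # dummy molecular message for illustration

key = Fernet.generate_key()
cipher = Fernet(key)

molecular_message = molwrite("secret text")         # CryptoChem layer
token = cipher.encrypt(molecular_message.encode())  # AES-based layer
recovered = cipher.decrypt(token).decode()          # back to the molecular message
assert recovered == molecular_message
```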

In further phases, we plan to develop different machine learning models and create multiple libraries of varying sizes based on different sets of descriptors (including 3D and 4D/MD) and fingerprints. To encode larger texts, we will certainly need much bigger chemical libraries. There is also more work to be done on the optimization of both AR and LS, in terms of method improvement and GPU-accelerated parallel computing; GPU acceleration is key to making CryptoChem ready for real-world and commercial applications. The software is freely available for testing on our GitHub repository.

Software Availability

The current version of CryptoChem is available at: https://github.com/XinhaoLi74/CryptoChem

Author Contributions

PPKZ and XL contributed equally to the development of CryptoChem. PPKZ designed Analogue Retriever, and XL designed Label Swapper. DF conceived and designed the study. All authors edited and revised the manuscript and have given approval to its final version.

Supporting Information (Appendix C)

Molecular descriptors currently used by CryptoChem (Table C1). Illustration of the clustering process in Analogue Retriever (AR) (Figure C1).

Funding

PPKZ and XL gratefully acknowledge DARPA/ARO for financial support (grant W911NF1810315). PPKZ held an International Fellowship from AAUW and an Olive Ruth Russell Fellowship from Berea College. DF thanks the NC State Chancellor’s Faculty Excellence Program for funding this project.

Acknowledgments

We thank DARPA/ARO for funding. The goal of the DARPA-funded component of this work was to explore additional dimensions of molecular storage that would be possible using molecular descriptors. This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.


References

1. Goldman N, Bertone P, Chen S, et al (2013) Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494:77–80. https://doi.org/10.1038/nature11875
2. Adleman LM, Rothemund PWK, Roweis S, Winfree E (1999) On applying molecular computation to the data encryption standard. J Comput Biol 6:53–63. https://doi.org/10.1089/cmb.1999.6.53
3. Malhotra M, G (2019) DNA Cryptography: A Novel Approach for Data Security Using Flower Pollination Algorithm. SSRN Electron J
4. Extance A (2016) How DNA could store all the world’s data. Nature 537:22–24. https://doi.org/10.1038/537022a
5. Fourches D (2014) Cheminformatics: At the Crossroad of Eras. pp 539–546. https://doi.org/10.1007/978-94-017-9257-8_16
6. Ash J, Fourches D (2017) Characterizing the Chemical Space of ERK2 Kinase Inhibitors Using Descriptors Computed from Molecular Dynamics Trajectories. J Chem Inf Model 57. https://doi.org/10.1021/acs.jcim.7b00048
7. Leone L, Pezzella A, Crescenzi O, et al (2015) Trichocyanines: A Red-Hair-Inspired Modular Platform for Dye-Based One-Time-Pad Molecular Cryptography. ChemistryOpen 4:370–377. https://doi.org/10.1002/open.201402164
8. Arcadia CE, Kennedy E, Geiser J, et al (2020) Multicomponent molecular memory. Nat Commun. https://doi.org/10.1038/s41467-020-14455-1
9. Chen K, Zhu J, Boskovic F, Keyser UF (2019) Secure data storage on DNA hard drives. bioRxiv 857748. https://doi.org/10.1101/857748
10. Gramatica P, Fourches D, Cherkasov A, et al (2008) Combinatorial QSAR Modeling of Chemical Toxicants Tested against Tetrahymena pyriformis. J Chem Inf Model 48:766–784. https://doi.org/10.1021/ci700443v
11. Alves VM, Muratov E, Fourches D, et al (2015) Predicting chemically-induced skin reactions. Part II: QSAR models of skin permeability and the relationships between skin permeability and skin sensitization. Toxicol Appl Pharmacol 284. https://doi.org/10.1016/j.taap.2014.12.013
12. Kuenemann MA, Fourches D (2017) Cheminformatics Modeling of Amine Solutions for Assessing their CO2 Absorption Properties. Mol Inform 36. https://doi.org/10.1002/minf.201600143
13. Cherkasov A, Muratov EN, Fourches D, et al (2014) QSAR modeling: Where have you been? Where are you going to? J Med Chem 57. https://doi.org/10.1021/jm4004285
14. Kyaw Zin PP, Borrel A, Fourches D (2020) Benchmarking 2D/3D/MD-QSAR Models for Imatinib Derivatives: How Far Can We Predict? J Chem Inf Model
15. Mudawwar MF (1997) Multicode: A truly multilingual approach to text encoding. Computer (Long Beach Calif) 30:37–43. https://doi.org/10.1109/2.585152
16. Weininger D (1988) SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 28:31–36. https://doi.org/10.1021/ci00057a005
17. Keegan J (2003) Intelligence in War: Knowledge of the Enemy from Napoleon to Al-Qaeda. Hutchinson, London
18. Sterling T, Irwin JJ (2015) ZINC 15 - Ligand Discovery for Everyone. J Chem Inf Model 55:2324–2337. https://doi.org/10.1021/acs.jcim.5b00559
19. Géron A (2017) Hands-On Machine Learning with Scikit-Learn and TensorFlow. O’Reilly Media
20. Peris Á (2018) NMT-Keras Documentation
21. Chollet F (2017) Deep Learning with Python, 1st edition. Manning Publications
22. Fourches D, Muratov E, Tropsha A (2010) Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J Chem Inf Model 50:1189–1204. https://doi.org/10.1021/ci100176x
23. Zin PPK, Williams G, Fourches D (2018) Cheminformatics-based enumeration and analysis of large libraries of macrolide scaffolds. J Cheminform 10:53. https://doi.org/10.1186/s13321-018-0307-6
24. Zin PPK, Williams G, Fourches D (2020) SIME: Synthetic insight-based macrolide enumerator to generate the V1B library of 1 billion macrolides. J Cheminform 12:23. https://doi.org/10.1186/s13321-020-00427-6
25. Sanchez-Avila C, Sanchez-Reillo R (2001) The Rijndael block cipher (AES proposal): A comparison with DES. In: IEEE Annual International Carnahan Conference on Security Technology, Proceedings. pp 229–234
26. Zeng K, Yang CH, Wei DY, Rao TRN (1991) Pseudorandom Bit Generators in Stream-Cipher Cryptography. Computer (Long Beach Calif) 24:8–17. https://doi.org/10.1109/2.67207
27. Huo F, Gong G (2015) XOR Encryption Versus Phase Encryption, an In-Depth Analysis. IEEE Trans Electromagn Compat 57:903–911. https://doi.org/10.1109/TEMC.2015.2390229
28. Karuppiah M, Ramanujam S (2011) Designing an algorithm with high Avalanche Effect
29. Dawson E, Gustafson H, Pettitt AN Strict Key Avalanche Criterion
30. Coskun B, Memon N (2006) Confusion/diffusion capabilities of some robust hash functions. In: 2006 IEEE Conference on Information Sciences and Systems, CISS 2006 - Proceedings. Institute of Electrical and Electronics Engineers Inc., pp 1188–1193


APPENDICES


Appendix A

Supporting Information Chapter 2. Hierarchical H-QSAR Modeling Approach for Integrating Binary/Multi Classification and Regression Models of Acute Oral Systemic Toxicity

Table A1. Hyperparameter Search Spaces.

kNN
  RDKit and Mordred: {'n_neighbors': [5, 9, 15, 19, 25, 35, 45, 55, 71], 'weights': ['distance'], 'p': [1, 2]}
  ECFP6_Bits and MACCS: {'n_neighbors': [5, 9, 15, 19, 25, 35, 45, 55, 71], 'weights': ['distance'], 'metric': ['jaccard', 'dice', 'rogerstanimoto']}
  ECFP6_Counts: {'n_neighbors': [5, 9, 15, 19, 25, 35, 45, 55, 71], 'weights': ['distance'], 'metric': ['hamming', 'canberra', 'braycurtis']}

SVM
  [{'C': [0.01, 0.1, 1, 10, 100, 200, 400, 1000], 'kernel': ['linear']},
   {'C': [0.01, 0.1, 1, 10, 100, 200, 400, 1000], 'gamma': [100, 10, 1, 1e-1, 1e-2, 1e-3], 'kernel': ['rbf']}]

RF
  {'bootstrap': [True, False], 'max_depth': [5, 20, 35, 50, 65, 80, None], 'max_features': ['log2', 'sqrt'], 'min_samples_leaf': [2, 4, 6], 'min_samples_split': [2, 5, 10], 'n_estimators': [500, 1500]}

XGBoost
  {'learning_rate': [0.01, 0.1], 'max_depth': [3, 6, 10], 'min_child_weight': [1, 3, 5], 'gamma': [0, 1, 5], 'subsample': [0.6, 0.7, 0.8, 0.9, 1.0], 'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 'n_estimators': [500, 1500]}
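For reference, the search spaces in Table A1 map directly onto scikit-learn's grid-search utilities. The snippet below is a minimal sketch of how such a search could be run for the kNN regressor with the RDKit/Mordred search space, using 10-fold cross-validation as in the main text; the data matrix and endpoint values are placeholders.

```python
# Minimal sketch: hyperparameter grid search for the kNN base model using
# the RDKit/Mordred search space from Table A1. X and y are placeholders
# for the descriptor matrix and logLD50 values.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

param_grid = {
    "n_neighbors": [5, 9, 15, 19, 25, 35, 45, 55, 71],
    "weights": ["distance"],
    "p": [1, 2],
}

X = np.random.rand(200, 115)  # placeholder descriptor matrix
y = np.random.rand(200)       # placeholder endpoint values

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=10,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```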


Table A2. Hyperparameter Tuning of Regression Models.

Algorithm | Feature | Best Parameter Setting
kNN | ECFP6_Bits | {'metric': 'dice', 'n_neighbors': 15, 'weights': 'distance'}
kNN | ECFP6_Counts | {'metric': 'braycurtis', 'n_neighbors': 15, 'weights': 'distance'}
kNN | MACCS | {'metric': 'rogerstanimoto', 'n_neighbors': 9, 'weights': 'distance'}
kNN | RDKit | {'n_neighbors': 9, 'p': 1, 'weights': 'distance'}
kNN | Mordred | {'n_neighbors': 9, 'p': 1, 'weights': 'distance'}
kNN | H-Feature | {'n_neighbors': 45, 'p': 2, 'weights': 'distance'}
SVM | ECFP6_Bits | {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
SVM | ECFP6_Counts | {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}
SVM | MACCS | {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
SVM | RDKit | {'C': 1, 'gamma': 1, 'kernel': 'rbf'}
SVM | Mordred | {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
SVM | H-Feature | {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
RF | ECFP6_Bits | {'n_estimators': 1500, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
RF | ECFP6_Counts | {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
RF | MACCS | {'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 80, 'bootstrap': False}
RF | RDKit | {'n_estimators': 1500, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 65, 'bootstrap': False}
RF | Mordred | {'n_estimators': 1500, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 65, 'bootstrap': False}
RF | H-Feature | {'n_estimators': 1500, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 20, 'bootstrap': False}
XGBoost | ECFP6_Bits | {'subsample': 0.9, 'n_estimators': 1500, 'min_child_weight': 3, 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.7}
XGBoost | ECFP6_Counts | {'subsample': 1.0, 'n_estimators': 500, 'min_child_weight': 5, 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.6}
XGBoost | MACCS | {'subsample': 0.6, 'n_estimators': 1500, 'min_child_weight': 1, 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.7}
XGBoost | RDKit | {'subsample': 1.0, 'n_estimators': 500, 'min_child_weight': 1, 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.5}
XGBoost | Mordred | {'subsample': 0.9, 'n_estimators': 500, 'min_child_weight': 1, 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.8}
XGBoost | H-Feature | {'subsample': 0.6, 'n_estimators': 500, 'min_child_weight': 3, 'max_depth': 6, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.6}


Table A3. Hyperparameter Tuning of Binary Models Algorithm Feature Best Parameter Setting ECFP6_Bits {'metric': 'dice', 'n_neighbors': 19, 'weights': 'distance'} ECFP6_Counts {'metric': 'braycurtis', 'n_neighbors': 19, 'weights': 'distance'} MACCS {'metric': 'rogerstanimoto', 'n_neighbors': 23, 'weights': 'distance'} kNN RDKit {'n_neighbors': 15, 'p': 1, 'weights': 'distance'} Mordred {'n_neighbors': 15, 'p': 1, 'weights': 'distance'} H-Feature {'n_neighbors': 71, 'p': 1, 'weights': 'distance'} ECFP6_Bits {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'} ECFP6_Counts {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'} MACCS {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'} SVM RDKit {'C': 1, 'gamma': 1, 'kernel': 'rbf'} Mordred {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'} H-Feature {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'} {'n_estimators': 1500, 'min_samples_split': 2, 'min_samples_leaf': 4, ECFP6_Bits 'max_features': 'sqrt', 'max_depth': 80, 'bootstrap': False} {'n_estimators': 1500, 'min_samples_split': 5, 'min_samples_leaf': 2, ECFP6_Counts 'max_features': 'sqrt', 'max_depth': 35, 'bootstrap': True} {'n_estimators': 1500, 'min_samples_split': 2, 'min_samples_leaf': 2, MACCS 'max_features': 'log2', 'max_depth': 35, 'bootstrap': False} RF {'n_estimators': 1500, 'min_samples_split': 2, 'min_samples_leaf': 2, RDKit 'max_features': 'log2', 'max_depth': 65, 'bootstrap': False} {'n_estimators': 1500, 'min_samples_split': 5, 'min_samples_leaf': 4, Mordred 'max_features': 'sqrt', 'max_depth': 80, 'bootstrap': False} {'n_estimators': 1500, 'min_samples_split': 5, 'min_samples_leaf': 6, H-Feature 'max_features': 'log2', 'max_depth': 35, 'bootstrap': False} {'subsample': 0.6, 'n_estimators': 1500, 'min_child_weight': 5, ECFP6_Bits 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.9} {'subsample': 0.6, 'n_estimators': 1500, 'min_child_weight': 1, ECFP6_Counts 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.7} {'subsample': 0.6, 'n_estimators': 500, 'min_child_weight': 5, MACCS 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.8} XGBoost {'subsample': 0.8, 'n_estimators': 1500, 'min_child_weight': 1, RDKit 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.6} {'subsample': 1.0, 'n_estimators': 1500, 'min_child_weight': 3, Mordred 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.7} {'subsample': 0.6, 'n_estimators': 500, 'min_child_weight': 1, H-Feature 'max_depth': 3, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.5}


Table A4. Hyperparameter Tuning of Multiclass Models Algorithm Feature Best Parameter Setting ECFP6_Bits {'metric': 'dice', 'n_neighbors': 9, 'weights': 'distance'} ECFP6_Counts {'metric': 'braycurtis', 'n_neighbors': 9, 'weights': 'distance'} MACCS {'metric': 'rogerstanimoto', 'n_neighbors': 15, 'weights': 'distance'} kNN RDKit {'n_neighbors': 9, 'p': 1, 'weights': 'distance'} Mordred {'n_neighbors': 9, 'p': 1, 'weights': 'distance'} H-Feature {'n_neighbors': 35, 'p': 1, 'weights': 'distance'} ECFP6_Bits {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'} ECFP6_Counts {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'} MACCS {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'} SVM RDKit {'C': 10, 'gamma': 1, 'kernel': 'rbf'} Mordred {'C': 10, 'gamma': 1, 'kernel': 'rbf'} H-Feature {'C': 0.1, 'kernel': 'linear'} {'n_estimators': 1500, 'min_samples_split': 2, 'min_samples_leaf': ECFP6_Bits 2, 'max_features': 'sqrt', 'max_depth': 80, 'bootstrap': False} {'n_estimators': 1500, 'min_samples_split': 2, 'min_samples_leaf': ECFP6_Counts 2, 'max_features': 'sqrt', 'max_depth': 65, 'bootstrap': False} {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': MACCS 2, 'max_features': 'sqrt', 'max_depth': 35, 'bootstrap': False} RF {'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': RDKit 2, 'max_features': 'sqrt', 'max_depth': 80, 'bootstrap': False} {'n_estimators': 500, 'min_samples_split': 5, 'min_samples_leaf': Mordred 2, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False} {'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': H-Feature 6, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False} {'subsample': 0.7, 'n_estimators': 500, 'min_child_weight': 1, ECFP6_Bits 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 1, 'colsample_bytree': 0.9} {'subsample': 0.8, 'n_estimators': 500, 'min_child_weight': 1, ECFP6_Counts 'max_depth': 10, 'learning_rate': 0.1, 'gamma': 1, 'colsample_bytree': 1.0} {'subsample': 0.6, 'n_estimators': 1500, 'min_child_weight': 3, MACCS 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.5} XGBoost {'subsample': 0.6, 'n_estimators': 1500, 'min_child_weight': 1, RDKit 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 0, 'colsample_bytree': 0.8} {'subsample': 0.8, 'n_estimators': 1500, 'min_child_weight': 3, Mordred 'max_depth': 6, 'learning_rate': 0.1, 'gamma': 5, 'colsample_bytree': 0.8} {'subsample': 0.6, 'n_estimators': 1500, 'min_child_weight': 3, H-Feature 'max_depth': 10, 'learning_rate': 0.01, 'gamma': 5, 'colsample_bytree': 0.7}


Table A5. 10-Fold Cross-Validation Performances of Regression Models Algorithm Feature RMSE R2 MAE MSE Base Models ECFP6_Bits 0.646±0.017 0.487±0.026 0.477±0.014 0.418±0.022 ECFP6_Counts 0.647±0.017 0.486±0.027 0.477±0.012 0.419±0.023 kNN MACCS 0.633±0.028 0.508±0.043 0.462±0.016 0.402±0.035 RDKit 0.588±0.026 0.576±0.032 0.43±0.019 0.346±0.031 Mordred 0.609±0.023 0.545±0.031 0.443±0.015 0.371±0.028 ECFP6_Bits 0.668±0.02 0.454±0.024 0.49±0.016 0.446±0.026 ECFP6_Counts 0.623±0.013 0.524±0.02 0.465±0.009 0.388±0.016 SVM MACCS 0.61±0.026 0.543±0.031 0.452±0.019 0.373±0.031 RDKit 0.628±0.025 0.516±0.036 0.458±0.017 0.395±0.031 Mordred 0.613±0.024 0.54±0.026 0.447±0.015 0.376±0.03 ECFP6_Bits 0.628±0.022 0.517±0.026 0.466±0.016 0.394±0.028 ECFP6_Counts 0.619±0.02 0.531±0.023 0.459±0.015 0.383±0.024 RF MACCS 0.597±0.024 0.562±0.032 0.44±0.017 0.357±0.029 RDKit 0.637±0.025 0.503±0.031 0.461±0.015 0.406±0.032 Mordred 0.597±0.027 0.562±0.032 0.44±0.018 0.357±0.033 ECFP6_Bits 0.62±0.025 0.529±0.03 0.46±0.017 0.385±0.031 ECFP6_Counts 0.621±0.019 0.527±0.021 0.461±0.014 0.386±0.024 XGBoost MACCS 0.595±0.027 0.565±0.029 0.435±0.018 0.355±0.032 RDKit 0.594±0.025 0.567±0.028 0.435±0.018 0.353±0.03 Mordred 0.584±0.028 0.582±0.033 0.428±0.018 0.341±0.033 Hierarchical Models kNN H-Features 0.559±0.024 0.617±0.028 0.405±0.018 0.312±0.027 SVM H-Features 0.551±0.024 0.628±0.026 0.395±0.018 0.304±0.026 RF H-Features 0.549±0.025 0.631±0.028 0.396±0.019 0.301±0.027 XGBoost H-Features 0.550±0.024 0.628±0.027 0.398±0.018 0.303±0.027
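The "H-Features" rows above refer to the hierarchical (stacked) models built on the out-of-fold predictions of the base models. A minimal sketch of how such stacking features could be assembled with scikit-learn is shown below; the base estimators and the data are placeholders, and the exact pipeline used in Chapter 2 may differ.

```python
# Minimal stacking sketch: build hierarchical features (H-Features) from
# out-of-fold predictions of base regressors, then fit a second-layer model.
# X, y and the estimator choices are placeholders, not the exact Chapter 2 setup.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

X = np.random.rand(300, 166)  # placeholder descriptor/fingerprint matrix
y = np.random.rand(300)       # placeholder logLD50 values

base_models = [
    KNeighborsRegressor(n_neighbors=9, weights="distance"),
    SVR(C=10, gamma=0.1),
    RandomForestRegressor(n_estimators=500, random_state=0),
]

# Out-of-fold predictions from each base model become the H-Features.
h_features = np.column_stack(
    [cross_val_predict(m, X, y, cv=10) for m in base_models]
)

# Second-layer (hierarchical) model trained on the stacked features.
meta_model = RandomForestRegressor(n_estimators=500, random_state=0)
meta_model.fit(h_features, y)
```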


Table A6. 10-Fold Cross-Validation Performances of Binary Models Balance Algorithm Feature Accuracy F1-Score MCC AUROC Accuracy Base Models ECFP6_Bits 0.746±0.014 0.737±0.013 0.745±0.014 0.479±0.028 0.737±0.013 ECFP6_Counts 0.749±0.012 0.74±0.011 0.748±0.012 0.486±0.023 0.74±0.011 kNN MACCS 0.758±0.011 0.747±0.01 0.756±0.011 0.503±0.022 0.747±0.01 RDKit 0.763±0.013 0.754±0.013 0.762±0.013 0.514±0.027 0.754±0.013 Mordred 0.77±0.016 0.763±0.016 0.769±0.016 0.529±0.032 0.763±0.016 ECFP6_Bits 0.7±0.009 0.665±0.007 0.675±0.01 0.393±0.016 0.665±0.007 ECFP6_Counts 0.764±0.013 0.753±0.012 0.762±0.012 0.515±0.027 0.753±0.012 SVM MACCS 0.782±0.014 0.772±0.014 0.78±0.014 0.552±0.029 0.772±0.014 RDKit 0.766±0.015 0.755±0.015 0.764±0.015 0.519±0.031 0.755±0.015 Mordred 0.774±0.012 0.767±0.011 0.773±0.012 0.537±0.023 0.767±0.011 ECFP6_Bits 0.758±0.011 0.743±0.012 0.754±0.011 0.502±0.025 0.743±0.012 ECFP6_Counts 0.758±0.014 0.74±0.014 0.753±0.014 0.505±0.03 0.74±0.014 RF MACCS 0.784±0.013 0.774±0.013 0.782±0.013 0.556±0.028 0.774±0.013 RDKit 0.786±0.011 0.776±0.009 0.784±0.011 0.561±0.022 0.776±0.009 Mordred 0.789±0.014 0.778±0.013 0.787±0.014 0.567±0.027 0.778±0.013 ECFP6_Bits 0.751±0.017 0.74±0.017 0.749±0.017 0.488±0.034 0.74±0.017 ECFP6_Counts 0.766±0.013 0.756±0.014 0.764±0.013 0.519±0.028 0.756±0.014 XGBoost MACCS 0.772±0.015 0.762±0.015 0.77±0.015 0.532±0.03 0.762±0.015 RDKit 0.788±0.01 0.78±0.009 0.787±0.01 0.566±0.02 0.78±0.009 Mordred 0.792±0.014 0.783±0.013 0.791±0.014 0.573±0.028 0.783±0.013 Hierarchical Models kNN H-Features 0.796±0.017 0.79±0.017 0.796±0.017 0.584±0.034 0.79±0.017 SVM H-Features 0.799±0.015 0.792±0.015 0.798±0.015 0.588±0.03 0.792±0.015 RF H-Features 0.805±0.015 0.799±0.014 0.804±0.015 0.601±0.029 0.799±0.014 XGBoost H-Features 0.802±0.011 0.795±0.01 0.801±0.011 0.594±0.021 0.795±0.01


Table A7. 10-Fold Cross-Validation Performances of Multiclass Models Balance Algorithm Feature Accuracy MCC F1-Score Accuracy Base Models ECFP6_Bits 0.605±0.022 0.537±0.023 0.373±0.034 0.594±0.023 ECFP6_Counts 0.607±0.014 0.542±0.017 0.375±0.022 0.596±0.015 kNN MACCS 0.605±0.013 0.529±0.014 0.37±0.024 0.592±0.015 RDKit 0.61±0.018 0.542±0.014 0.382±0.022 0.6±0.019 Mordred 0.614±0.009 0.547±0.011 0.388±0.015 0.604±0.011 ECFP6_Bits 0.589±0.019 0.522±0.019 0.348±0.029 0.579±0.02 ECFP6_Counts 0.606±0.018 0.542±0.017 0.374±0.027 0.597±0.018 SVM MACCS 0.604±0.01 0.545±0.013 0.377±0.012 0.598±0.011 RDKit 0.609±0.016 0.548±0.015 0.384±0.019 0.602±0.018 Mordred 0.606±0.01 0.545±0.016 0.376±0.021 0.598±0.012 ECFP6_Bits 0.615±0.018 0.494±0.015 0.374±0.02 0.577±0.02 ECFP6_Counts 0.619±0.015 0.494±0.012 0.38±0.017 0.58±0.018 RF MACCS 0.627±0.015 0.531±0.018 0.398±0.029 0.608±0.017 RDKit 0.636±0.011 0.535±0.008 0.412±0.012 0.616±0.012 Mordred 0.638±0.011 0.541±0.013 0.416±0.022 0.619±0.012 ECFP6_Bits 0.603±0.019 0.521±0.02 0.361±0.025 0.587±0.02 ECFP6_Counts 0.617±0.019 0.535±0.02 0.384±0.029 0.601±0.019 XGBoost MACCS 0.629±0.011 0.547±0.01 0.407±0.015 0.616±0.012 RDKit 0.633±0.013 0.538±0.011 0.409±0.014 0.616±0.014 Mordred 0.618±0.016 0.522±0.017 0.383±0.024 0.599±0.018 Hierarchical Models kNN H-Features 0.654±0.018 0.576±0.013 0.447±0.023 0.64±0.02 SVM H-Features 0.655±0.015 0.569±0.013 0.448±0.018 0.639±0.017 RF H-Features 0.659±0.016 0.587±0.015 0.458±0.023 0.648±0.017 XGBoost H-Features 0.658±0.015 0.588±0.012 0.456±0.023 0.647±0.017


Table A8. Test Set Performances of Regression Models Algorithm Feature RMSE R2 MAE MSE Base Models ECFP6_Bits 0.616 0.528 0.444 0.380 ECFP6_Counts 0.620 0.523 0.449 0.384 kNN MACCS 0.598 0.555 0.427 0.358 RDKit 0.596 0.559 0.428 0.355 Mordred 0.590 0.566 0.422 0.349 ECFP6_Bits 0.627 0.512 0.458 0.393 ECFP6_Counts 0.598 0.556 0.440 0.357 SVM MACCS 0.576 0.589 0.418 0.331 RDKit 0.616 0.529 0.4452 0.379 Mordred 0.602 0.550 0.430 0.362 ECFP6_Bits 0.604 0.548 0.445 0.365 ECFP6_Counts 0.600 0.553 0.440 0.360 RF MACCS 0.580 0.583 0.424 0.336 RDKit 0.561 0.610 0.404 0.315 Mordred 0.575 0.590 0.417 0.331 ECFP6_Bits 0.590 0.568 0.437 0.348 ECFP6_Counts 0.588 0.571 0.430 0.346 XGBoost MACCS 0.566 0.602 0.404 0.321 RDKit 0.563 0.606 0.412 0.318 Mordred 0.567 0.600 0.411 0.322 Base Model Consensus 0.551 0.623 0.398 0.304 Hierarchical Models kNN H-Features 0.534 0.647 0.381 0.285 SVM H-Features 0.529 0.652 0.376 0.280 RF H-Features 0.529 0.652 0.376 0.280 XGBoost H-Features 0.531 0.649 0.377 0.283 Hierarchical Model Consensus 0.526 0.656 0.374 0.277


Table A9. Test Set Performances of Binary Models Balance Algorithm Feature Accuracy F1-Score MCC AUROC Accuracy Base Models ECFP6_Bits 0.748 0.735 0.745 0.478 0.735 ECFP6_Counts 0.756 0.743 0.754 0.495 0.743 kNN MACCS 0.768 0.754 0.765 0.520 0.754 RDKit 0.771 0.759 0.769 0.527 0.759 Mordred 0.770 0.760 0.768 0.525 0.760 ECFP6_Bits 0.734 0.709 0.724 0.449 0.709 ECFP6_Counts 0.757 0.745 0.755 0.498 0.745 SVM MACCS 0.780 0.767 0.777 0.545 0.767 RDKit 0.763 0.751 0.761 0.510 0.751 Mordred 0.775 0.765 0.774 0.536 0.765 ECFP6_Bits 0.763 0.745 0.758 0.510 0.745 ECFP6_Counts 0.756 0.735 0.750 0.497 0.735 RF MACCS 0.783 0.770 0.780 0.551 0.770 RDKit 0.779 0.766 0.777 0.544 0.766 Mordred 0.782 0.770 0.780 0.550 0.770 ECFP6_Bits 0.746 0.733 0.744 0.475 0.733 ECFP6_Counts 0.766 0.753 0.764 0.517 0.753 XGBoost MACCS 0.772 0.758 0.769 0.528 0.758 RDKit 0.782 0.772 0.781 0.550 0.772 Mordred 0.787 0.777 0.785 0.560 0.777 Base Model Consensus 0.790 0.777 0.788 0.567 0.777 Hierarchical Models kNN H-Features 0.799 0.790 0.798 0.586 0.790 SVM H-Features 0.796 0.788 0.795 0.581 0.788 RF H-Features 0.801 0.793 0.800 0.590 0.793 XGBoost H-Features 0.801 0.793 0.800 0.591 0.793 Hierarchical Consensus 0.800 0.792 0.799 0.589 0.792


Table A10. Test Set Performances of Multiclass Models Balance Algorithm Feature Accuracy F1-Score MCC Accuracy Base Models ECFP6_Bits 0.625 0.555 0.614 0.407 ECFP6_Counts 0.636 0.572 0.626 0.426 kNN MACCS 0.627 0.563 0.616 0.411 RDKit 0.633 0.572 0.624 0.422 Mordred 0.636 0.578 0.627 0.427 ECFP6_Bits 0.624 0.522 0.598 0.394 ECFP6_Counts 0.634 0.545 0.614 0.413 SVM MACCS 0.631 0.542 0.614 0.411 RDKit 0.633 0.548 0.617 0.413 Mordred 0.635 0.548 0.618 0.416 ECFP6_Bits 0.633 0.516 0.602 0.411 ECFP6_Counts 0.641 0.524 0.610 0.426 RF MACCS 0.652 0.566 0.637 0.447 RDKit 0.659 0.569 0.642 0.456 Mordred 0.663 0.575 0.648 0.464 ECFP6_Bits 0.631 0.550 0.618 0.412 ECFP6_Counts 0.639 0.555 0.624 0.425 XGBoost MACCS 0.649 0.574 0.638 0.445 RDKit 0.653 0.567 0.638 0.448 Mordred 0.635 0.550 0.619 0.417 Base Model Consensus 0.669 0.578 0.651 0.475 Hierarchical Models kNN H-Features 0.678 0.612 0.668 0.493 SVM H-Features 0.678 0.611 0.668 0.492 RF H-Features 0.678 0.615 0.667 0.492 XGBoost H-Features 0.671 0.615 0.663 0.483 Hierarchical Model Consensus 0.681 0.618 0.670 0.497


Table A11. Performances of Hierarchical Consensus Models Based on Prediction Zones.

Model | Metric | OutZone | InZone | HalfZone_B | HalfZone_M | Overall
Regression | RMSE | 0.490 | 0.622 | 0.804 | 0.588 | 0.526
Regression | R2 | 0.714 | 0.464 | 0.183 | 0.315 | 0.656
Regression | MAE | 0.349 | 0.558 | 0.575 | 0.375 | 0.374
Regression | MSE | 0.240 | 0.387 | 0.646 | 0.238 | 0.277
Binary | Accuracy | 0.839 | 0.571 | 0.819 | 0.528 | 0.800
Binary | BA | 0.831 | 0.583 | 0.775 | 0.527 | 0.792
Binary | F1-Score | 0.838 | 0.571 | 0.814 | 0.529 | 0.799
Binary | MCC | 0.666 | 0.167 | 0.585 | 0.053 | 0.589
Binary | AUROC | 0.831 | 0.583 | 0.775 | 0.527 | 0.792
Multiclass | Accuracy | 0.705 | 0.533 | 0.437 | 0.706 | 0.681
Multiclass | BA | 0.668 | 0.343 | 0.401 | 0.250 | 0.618
Multiclass | F1-Score | 0.698 | 0.444 | 0.433 | 0.585 | 0.670
Multiclass | MCC | 0.548 | 0.125 | 0.182 | 0.000 | 0.497
--- | %AD | 79% | 1% | 9% | 11% | 100%


Figure A1. Distribution of logLD50 (mmol/kg) of Train and Test Sets.

Figure A2. Distribution of logLD50 (mg/kg) of the Training Set.


Appendix B

Supporting Information Chapter 4. SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning

Table B1. Performance of QSAR models trained with SPE tokenization. Targets RMSE R2 MAE A2a 0.669±0.111 0.724±0.116 0.523±0.078 Dopamine 0.719±0.055 0.522±0.080 0.551±0.045 Dihydrofolate 0.771±0.053 0.545±0.049 0.586±0.042 Carbonic 0.582±0.035 0.781±0.046 0.427±0.032 ABL1 0.746±0.050 0.637±0.067 0.566±0.045 opioid 0.636±0.046 0.742±0.042 0.477±0.032 Cannabinoid 0.671±0.041 0.715±0.036 0.500±0.021 COX-1 0.655±0.064 0.486±0.064 0.489±0.046 Monoamine 0.604±0.054 0.645±0.046 0.454±0.037 LCK 0.758±0.035 0.673±0.048 0.589±0.028 Glucocorticoid 0.541±0.042 0.692±0.057 0.426±0.030 Ephrin 0.652±0.051 0.647±0.037 0.494±0.026 Caspase 0.586±0.098 0.835±0.079 0.440±0.093 Coagulation 0.771±0.037 0.622±0.039 0.580±0.031 Estrogen 0.630±0.037 0.780±0.024 0.459±0.032 B-raf 0.599±0.057 0.756±0.046 0.452±0.036 Glycogen 0.731±0.028 0.593±0.041 0.548±0.023 Vanilloid 0.669±0.055 0.543±0.080 0.522±0.032 Aurora-A 0.711±0.049 0.723±0.033 0.530±0.035 JAK2 0.623±0.046 0.725±0.041 0.467±0.026 COX-2 0.728±0.043 0.605±0.050 0.537±0.032 Acetylcholinesterase 0.675±0.0049 0.749±0.033 0.495±0.033 erbB1 0.658±0.023 0.757±0.011 0.492±0.019 HERG 0.536±0.019 0.625±0.033 0.395±0.019


Table B2. Performance of QSAR models trained with atom-level tokenization. Targets RMSE R2 MAE A2a 0.776±0.224 0.612±0.215 0.550±0.151 Dopamine 0.748±0.097 0.479±0.140 0.576±0.071 Dihydrofolate 0.794±0.101 0.525±0.101 0.592±0.072 Carbonic 0.578±0.069 0.792±0.046 0.421±0.052 ABL1 0.750 0.046 0.635±0.046 0.574±0.034 opioid 0.642±0.072 0.735±0.066 0.485±0.049 Cannabinoid 0.717±0.055 0.679±0.034 0.552±0.033 COX-1 0.665±0.094 0.484±0.089 0.478±0.058 Monoamine 0.624±0.061 0.633±0.053 0.467±0.048 LCK 0.835±0.165 0.591±0.181 0.617±0.047 Glucocorticoid 0.535±0.058 0.695±0.074 0.411±0.042 Ephrin 0.664±0.055 0.636±0.043 0.508±0.027 Caspase 0.587±0.061 0.837±0.050 0.444±0.046 Coagulation 0.770±0.037 0.622±0.045 0.582±0.023 Estrogen 0.655±0.044 0.761±0.028 0.474±0.031 B-raf 0.599±0.067 0.762±0.046 0.443±0.043 Glycogen 0.744±0.045 0.579±0.052 0.555±0.040 Vanilloid 0.670±0.065 0.542±0.082 0.515±0.040 Aurora-A 0.744±0.073 0.698±0.044 0.547±0.040 JAK2 0.642±0.035 0.708±0.042 0.481±0.023 COX-2 0.736±0.064 0.596±0.074 0.543±0.048 Acetylcholinesterase 0.679±0.060 0.745±0.044 0.485±0.037 erbB1 0.661±0.019 0.754±0.013 0.492±0.016 HERG 0.531±0.025 0.637±0.030 0.391±0.020
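For context on the atom-level baseline used in Tables B1 and B2, SMILES strings are commonly split into atom-level tokens with a regular expression of the kind shown below. This is a widely used pattern, not necessarily the exact tokenizer shipped with the SmilesPE package; the SPE tokenizer instead merges frequent SMILES substrings learned from ChEMBL into single tokens.

```python
# Minimal sketch of atom-level SMILES tokenization using a commonly used
# regular expression; SPE augments this by treating frequent multi-character
# substrings as single tokens.
import re

SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def atomwise_tokenize(smiles: str) -> list[str]:
    return SMILES_TOKEN_PATTERN.findall(smiles)

print(atomwise_tokenize("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', ...]
```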


Appendix C

Supporting Information Chapter 6. CryptoChem for Encoding and Storing Information Using Chemical Structures

Table C1. Molecular descriptors currently used by CryptoChem.

Explicit descriptor sets:
- 6_Lipinski: Molecular Weight, Topological Polar Surface Area, Hydrophobicity, Hydrogen Bond Acceptors/Donors, Rotatable Bonds
- 115_filtered: MaxEStateIndex, MinEStateIndex, MaxAbsEStateIndex, MinAbsEStateIndex, qed, MolWt, HeavyAtomMolWt, ExactMolWt, NumValenceElectrons, NumRadicalElectrons, MaxPartialCharge, MinPartialCharge, MaxAbsPartialCharge, MinAbsPartialCharge, FpDensityMorgan1, FpDensityMorgan2, FpDensityMorgan3, BalabanJ, BertzCT, Chi0, Chi0n, Chi0v, Chi1, Chi1n, Chi1v, Chi2n, Chi2v, Chi3n, Chi3v, Chi4n, Chi4v, HallKierAlpha, Ipc, Kappa1, Kappa2, Kappa3, LabuteASA, PEOE_VSA1, PEOE_VSA10, PEOE_VSA11, PEOE_VSA12, PEOE_VSA13, PEOE_VSA14, PEOE_VSA2, PEOE_VSA3, PEOE_VSA4, PEOE_VSA5, PEOE_VSA6, PEOE_VSA7, PEOE_VSA8, PEOE_VSA9, SMR_VSA1, SMR_VSA10, SMR_VSA2, SMR_VSA3, SMR_VSA4, SMR_VSA5, SMR_VSA6, SMR_VSA7, SMR_VSA8, SMR_VSA9, SlogP_VSA1, SlogP_VSA10, SlogP_VSA11, SlogP_VSA12, SlogP_VSA2, SlogP_VSA3, SlogP_VSA4, SlogP_VSA5, SlogP_VSA6, SlogP_VSA7, SlogP_VSA8, SlogP_VSA9, TPSA, EState_VSA1, EState_VSA10, EState_VSA11, EState_VSA2, EState_VSA3, EState_VSA4, EState_VSA5, EState_VSA6, EState_VSA7, EState_VSA8, EState_VSA9, VSA_EState1, VSA_EState10, VSA_EState2, VSA_EState3, VSA_EState4, VSA_EState5, VSA_EState6, VSA_EState7, VSA_EState8, VSA_EState9, FractionCSP3, HeavyAtomCount, NHOHCount, NOCount, NumAliphaticCarbocycles, NumAliphaticHeterocycles, NumAliphaticRings, NumAromaticCarbocycles, NumAromaticHeterocycles, NumAromaticRings, NumHAcceptors, NumHDonors, NumHeteroatoms, NumRotatableBonds, NumSaturatedCarbocycles, NumSaturatedHeterocycles, NumSaturatedRings, RingCount, MolLogP, MolMR
- 200_complete: The complete set of 200 available 2D descriptors from RDKit

Implicit descriptor sets (fingerprints):
- 1024_Morgan: 1024 bits from the Morgan circular fingerprint
- 2048_Morgan: The complete set of 2048 bits from the Morgan circular fingerprint
- 166_MACCS: The complete set of 166 MACCS keys
- Atom-pair: All available 8,388,608 atom-pair fingerprint bits
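As a brief illustration of how the descriptor families in Table C1 can be generated, the snippet below computes a few RDKit 2D descriptors together with Morgan and MACCS fingerprints for a single molecule; it is a generic RDKit sketch, not the exact CryptoChem featurization code.

```python
# Generic RDKit sketch for the descriptor/fingerprint families in Table C1;
# not the exact CryptoChem featurization pipeline.
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# A few of the "explicit" 2D descriptors (6_Lipinski / 115_filtered sets)
lipinski_like = {
    "MolWt": Descriptors.MolWt(mol),
    "TPSA": Descriptors.TPSA(mol),
    "MolLogP": Descriptors.MolLogP(mol),
    "NumHAcceptors": Descriptors.NumHAcceptors(mol),
    "NumHDonors": Descriptors.NumHDonors(mol),
    "NumRotatableBonds": Descriptors.NumRotatableBonds(mol),
}

# "Implicit" descriptors: Morgan (1024 bits) and MACCS (166 keys) fingerprints
morgan_1024 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
maccs_166 = MACCSkeys.GenMACCSKeys(mol)

print(lipinski_like)
print(morgan_1024.GetNumOnBits(), maccs_166.GetNumOnBits())
```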


Figure C1. Illustration of the clustering process in Analogue Retriever (AR).
