ABSTRACT

LI, XINHAO. Development of Novel Machine Learning Approaches and Exploration of Their Applications in Cheminformatics. (Under the direction of Dr. Christopher Gorman).

Cheminformatics is the scientific field that develops and applies a combination of mathematics, informatics, machine learning, and other computational technologies to solve chemical problems. It aims to help chemists investigate and understand complex chemical and biological systems and to guide experimental design and decision making. With the rapid growth of chemical data (e.g., from high-throughput screening and metabolomics), machine learning has become an important tool for exploring chemical space and mining chemical information. In this dissertation, we present five studies on developing novel machine learning approaches and exploring their applications in cheminformatics.

One of the primary tasks of cheminformatics is to predict the physical, chemical, and biological properties of a given compound. Quantitative Structure-Activity Relationship (QSAR) modeling relies on machine learning techniques to establish quantitative links between molecular structures and their experimental properties or activities. In chapter 2, we developed a dual-layer hierarchical modeling method to fully integrate regression and classification QSAR models for assessing rat acute oral systemic toxicity with respect to regulatory classifications of concern. The first layer of independent regression, binary, and multiclass models (base models) was built solely from computed chemical descriptors and fingerprints. A second layer of models (hierarchical models) was then built by stacking all the cross-validated out-of-fold predictions from the base models. All models were validated using an external test set, and the hierarchical models outperformed the base models for all three endpoints.
The H-QSAR modeling method represents a promising approach for chemical toxicity prediction and, more generally, for stacking and blending individual QSAR models into more predictive ensemble models.

In chapter 3, we proposed the Molecular Prediction Model Fine-Tuning (MolPMoFiT) approach, an effective transfer learning method for QSPR/QSAR modeling based on self-supervised pre-training followed by task-specific fine-tuning. It enables knowledge learned from large chemical datasets to transfer to smaller datasets, thereby improving model performance and generalization. A large-scale molecular structure prediction model is pre-trained in a self-supervised manner on one million unlabeled molecules from ChEMBL and can then be fine-tuned on various QSPR/QSAR tasks for smaller chemical datasets with specific endpoints. The benchmark results show that this transfer learning method achieves strong performance compared to other state-of-the-art machine learning modeling techniques.

SMILES-based deep learning models are slowly emerging as an important research topic in cheminformatics. In chapter 4, we introduced SMILES Pair Encoding (SPE), a data-driven tokenization algorithm. SPE learns a vocabulary of high-frequency SMILES substrings from ChEMBL and then tokenizes new SMILES into a sequence of tokens for deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performance both for molecular generation (by boosting the validity and novelty of generated SMILES) and for QSAR prediction tasks. In particular, we evaluated the performance of SPE-based QSAR prediction models on 24 benchmark datasets, where SPE consistently matched or outperformed atom-level tokenization. SPE could represent a better tokenization method for the development of future deep learning applications in cheminformatics.
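The idea of learning frequent SMILES substrings on top of atom-level tokens can be illustrated with a minimal sketch. It assumes a byte-pair-encoding-style merge procedure, which the abstract does not spell out, and a simplified atom-level regular expression; the function names (`atom_tokenize`, `learn_merges`, `spe_tokenize`) and the `min_count` threshold are hypothetical and are not taken from the dissertation's software.

```python
import re
from collections import Counter

# Simplified atom-level tokenizer (an assumption, not the thesis's tokenizer):
# bracket atoms, two-letter halogens, two-digit ring closures, then any single character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def apply_merge(seq, left, right):
    """Fuse every adjacent (left, right) token pair in seq into one token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges, min_count=2):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < min_count:
            break  # remaining pairs are too rare to become vocabulary entries
        merges.append((left, right))
        sequences = [apply_merge(seq, left, right) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Tokenize a new SMILES by replaying the learned merges in order."""
    seq = atom_tokenize(smiles)
    for left, right in merges:
        seq = apply_merge(seq, left, right)
    return seq


if __name__ == "__main__":
    toy_corpus = ["CCO", "CCN", "CCC", "c1ccccc1O"]  # toy data, not ChEMBL
    rules = learn_merges(toy_corpus, num_merges=3)
    print(rules)
    print(spe_tokenize("CCOc1ccccc1", rules))
```

Because merges only concatenate neighboring tokens, joining the output tokens always reproduces the input SMILES, which keeps the learned substrings human-readable.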
Lead optimization endpoints usually contain a small set of congeneric molecules, which limits the application of deep learning. Transfer learning enables training a deep learning model with less data. In chapter 5, we explored our transfer learning approach (see chapter 3) for lead optimization endpoints. A set of lead-optimization-like benchmark datasets was created from 8 datasets that include pIC50 values against the protein targets. Two language models, an LSTM model and a RoBERTa model, were trained on 10M SMILES from ChEMBL and then used as the knowledge source for the downstream QSAR tasks. The results show that transfer learning does provide performance gains for lead optimization endpoints.

In chapter 6, we present CryptoChem, a new method and associated software to securely store and transfer information using chemicals. Relying on the concept of big chemical data, molecular descriptors, and machine learning techniques, CryptoChem offers a highly complex and robust system with multiple layers of security for transmitting confidential information. The algorithm directly uses chemical structures and their properties as the central element of the secured storage. QSDR (Quantitative Structure-Data Relationship) models are used as private keys to encode and decode the data. The software is validated with a series of five datasets consisting of numerical and textual information of increasing size and complexity.
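The out-of-fold stacking scheme summarized for chapter 2 can be illustrated with a small sketch. It shows only how cross-validated out-of-fold predictions from several base models are assembled into inputs for a second-layer model; the two toy base learners, the fold assignment, and all function names are assumptions made for illustration and do not reproduce the models or descriptors used in the dissertation.

```python
def kfold_indices(n_samples, n_folds):
    """Assign sample i to fold i % n_folds (no shuffling, for reproducibility)."""
    return [[i for i in range(n_samples) if i % n_folds == f]
            for f in range(n_folds)]


def out_of_fold_predictions(fit, X, y, n_folds=5):
    """Predict each sample with a model trained on the folds that exclude it."""
    oof = [None] * len(X)
    for held_out in kfold_indices(len(X), n_folds):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            oof[i] = predict(X[i])
    return oof


def fit_mean(X_train, y_train):
    """Toy base learner: always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


def fit_nearest_neighbor(X_train, y_train):
    """Toy base learner: returns the label of the closest training sample."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X_train]
        return y_train[dists.index(min(dists))]
    return predict


def build_meta_features(base_fitters, X, y, n_folds=5):
    """Stack one out-of-fold prediction column per base learner, row by row."""
    columns = [out_of_fold_predictions(fit, X, y, n_folds) for fit in base_fitters]
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0], [3.0]]  # toy descriptors
    y = [0.0, 1.0, 2.0, 3.0]          # toy endpoint values
    meta_X = build_meta_features([fit_mean, fit_nearest_neighbor], X, y, n_folds=2)
    print(meta_X)  # rows of base-model predictions for a second-layer model
```

Any second-layer learner can then be trained on the stacked rows against the original labels. Since every row comes from models that never saw that sample, the second layer is not fitted on leaked training predictions.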
© Copyright 2020 by Xinhao Li

All Rights Reserved

Development of Novel Machine Learning Approaches and Exploration of Their Applications in Cheminformatics

by
Xinhao Li

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Chemistry

Raleigh, North Carolina
2020

APPROVED BY:

Christopher Gorman (Committee Chair)
Gavin Williams
Caroline Proulx
Brian Reich
Denis Fourches (External Member)

DEDICATION

This dissertation is dedicated to Anqi Tu and my family.

BIOGRAPHY

Xinhao Li grew up in Dalian, China. After some very good experiences in his high school chemistry class, he decided to study chemistry at Beijing University of Chemical Technology, where he earned his bachelor’s and master’s degrees. During his master’s studies, his research focused on organic synthesis. In 2017, he joined Dr. Denis Fourches’ lab at North Carolina State University and began pursuing a Ph.D. with a focus on cheminformatics and machine learning.

ACKNOWLEDGMENTS

I would like to thank Dr. Denis Fourches for all the guidance, support, and encouragement he has given me over the past three years. His knowledge, vision, and suggestions have been a valuable source of inspiration for my research. I would like to thank Dr. Christopher Gorman, Dr. Gavin Williams, Dr. Caroline Proulx, and Dr. Brian Reich for serving as my committee members. I would like to thank Dr. Jamel Meslamani for his mentoring during my internship at GSK. I would also like to thank the North Carolina State Chemistry Department and DARPA for their financial support. Finally, I would like to thank all my family and friends for their continued support over the years. I would especially like to thank Anqi Tu for her constant love and support along the way.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1. Introduction
    1.1. Machine Learning
    1.2. QSAR Modeling
    References

Chapter 2. Hierarchical H-QSAR Modeling Approach for Integrating Binary/Multi Classification and Regression Models of Acute Oral Systemic Toxicity
    Abstract
    2.1. Introduction
    2.2. Materials and Methods
    2.3. Results
        2.3.1. Chemical space of the full dataset
        2.3.2. Models development, validation, and comparison
        2.3.3. Applicability Domain for Hierarchical Models
    2.4. Discussion
        2.4.1. Comparison with other methodologies recently reported in the literature
    2.5. Conclusion