ABSTRACT ZIN, PHYO PHYO KYAW. Development

ABSTRACT ZIN, PHYO PHYO KYAW. Development and Applications of Next-Generation Cheminformatics Approaches (Under the Direction of Dr. Denis Fourches). Cheminformatics is the field that solves chemical problems using computers. In the context of drug discovery, one important goal of cheminformatics is to explore and design potential therapeutic drugs by screening very large chemical libraries in a cost-and-time effective manner. Macrolides are glycosylated macrocyclic lactones with twelve to sixteen atoms in the core cyclic frame. They possess unusual druglike properties, often violating Lipinski’s and Veber’s rules of drug likeness and bioavailability. Gaining insight into this unique structural class can potentially transform our understanding and development of novel antibiotic drugs, which has become critical due to increasing antimicrobial resistance. However, macrolides remain under- exploited due to the difficulty in their organic synthesis, and the lack of publicly available screening-ready virtual libraries of macrolides to perform molecular modeling. To overcome these challenges, it is crucial to develop virtual library generating software to diversify and expand the chemical space of macrolides. Subsequently, these libraries can be used for in silico studies to screen, model and prioritize macrolides to be biosynthesized. Herein, our goal is to implement cheminformatics approaches to construct well-designed and diversity-controlled virtual libraries of screening-ready macrolides. In Chapter 2, we developed a novel cheminformatics approach and software called PKS Enumerator software in order to design virtual macrolide/macrocyclic libraries by permuting and adding the building blocks one by one with multiple user-defined constraints. As a case study, we generated our first sample library of V1M (Virtual 1 million Macrolide scaffolds) using PKS Enumerator and conducted a cheminformatics analysis to determine the chemical diversity of all macrolide scaffolds generated. We additionally performed a comparative analysis of the V1M scaffolds to those of eighteen well-known bioactive macrolides compiled from different studies. In Chapter 3, we developed another innovative cheminformatics approach along with our new software SIME, Synthetic Insight-based Macrolide Enumerator, which enumerates fully assembled in-silico macrolides with sugars present and high synthetic feasibility. One can utilize constitutional and structural knowledge extracted from biosynthetic and important chemical insights of macrolides and apply them to design biosynthetically feasible macrolides. SIME allows users for specifying a macrolide core structure, identifying positions in the core structure at which chemical components aka structural motifs and/or sugars can be inserted, and selecting the types of structural motifs and sugars to be incorporated at those positions. With additional parameters for constitutional and structural constraints, one can design highly specific libraries of biosynthetically feasible macrolides towards a biological target of interest (e.g. ribosome) and/or based on a privileged scaffold with therapeutic potential. As proof-of-concept, we generated V1B library containing 1 billion in-silico macrolides and conducted chemical distribution analysis of important molecular properties. To have the most success in screening, identifying and designing most potent macrolides, it is important to extract SAR (structure-activity relationships) of known, experimentally tested macrolides. The knowledge and information from such SAR studies can be used to aid in the design of in-silico macrolide libraries with high success rate. However, there are currently no centralized database of known macrolactones/macrolides with bioactivity data. Thus, in Chapter 4, we developed the MacrolactoneDB, a web application hosting 14k macrolactones mined from public repositories with associated biological information. To accommodate different research projects focusing on various chemical scopes of macrolactones, we included options to filter them based on criteria such as ring size, number of sugars, molecular weight, available biological data, etc. This unique database has the potential to inspire the development of new macrolide therapeutics with multiple interesting targets. We also developed 91 mrc descriptors to better characterize macrolactones/macrolides. We additionally conducted large machine learning workflow with three common disease targets to investigate whether macrolactones/macrolides can be trained and validated using various state-of-the-art machine learning algorithms and chemical descriptors. Virtual screening methods such as QSAR modeling, 3D molecular docking are essential cheminformatics techniques that can tremendously facilitate drug discovery researches by effectively identifying potential lead compounds. In Chapter 5, we developed a hybrid MD-QSAR modeling approach relying on deep neural nets (DNN) and random forest (RF) using combinations of 2D, 3D, and 4D/MD (molecular docking) based chemical descriptors directly computed from molecular dynamics simulations. An ensemble of QSAR models using this approach yielded good success in predicting the biological endpoints of imatinib derivatives towards BCR-ABL kinase protein, a key target for treating Chronic Myeloid Leukemia (CML). We additionally conducted a thorough chemical analysis and visualization of Imatinib datasets, evaluated the binding modes of top analogues based on the 3D molecular docking procedure and compared the pioneering cheminformatics technologies of molecular docking and MD-QSAR modeling. Finally, another unique opportunity for cheminformatics is the use of chemical molecules to store and transfer information. Herein, in Chapter 6, we developed a novel molecular informatics-based encryption technology, CryptoChem, which uses Quantitative Structure-Data Relationships (QSDR) and complex machine learning approaches to encipher and decipher information in molecules. This study introduces an innovative application of machine learning techniques as well as the big chemical data concept. The preliminary results from encoding and decoding five datasets in virtual molecules confirmed the success of our working prototype. © Copyright 2020 by Phyo Phyo Kyaw Zin All Rights Reserved Development and Applications of Next-Generation Cheminformatics Approaches by Phyo Phyo Kyaw Zin A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy Chemistry Raleigh, North Carolina 2020 APPROVED BY: ___________________________________ ___________________________________ Denis Fourches Gavin Williams Committee Chair ___________________________________ ___________________________________ Jonathan Lindsey Steffen Heber DEDICATION This dissertation is dedicated to my loving and devoted husband, Andrés and my family. Without you, this document would not exist. Thank you from the bottom of my heart! ii BIOGRAPHY I graduated from Berea College, Kentucky with double majors in Chemistry and Computer Science with a 4-year full-ride scholarship. I worked as a research assistant in 2014 summer with Associate Professor of Chemistry, Paul Smithson, at Berea College, wherein I analyzed lead deposits in soil around Berea campus and nearby areas. In a 2015 summer internship, I developed Sustainability Dashboard Management System with Assistant Professor of Computer Science, Scott Heggen, at Berea College. My interest in cheminformatics was fueled by my humble roots in Myanmar, a third-world country with abundant tropical neglected diseases such as Malaria, Dengue Fever, Tuberculosis, etc. My goal is to develop skillsets to find potent therapeutics in a highly cost-and-time effective manner so that I can one day contribute to drug discovery efforts to help third world countries like my native land. During the last year of my undergrad, I began to realize the potential, impact and usefulness of merging computer science with chemistry. Innovative computational methods can be developed and applied to gain insights into pharmacokinetic properties of medicinal drugs, to better understand molecular mechanisms and to design better, safer and more potent therapeutics in the context of drug discovery. Thus, right after earning my bachelor’s degree at Berea College, I joined the cheminformatics lab of Dr. Denis Fourches and worked on my PhD with a focus on Cheminformatics at the Chemistry Department from North Carolina State University. During the past three and a half years, I have gained cheminformatics experience in developing cheminformatics approaches/software to design in-silico chemical libraries based on constitutional and structural constraints set by user’s choice, virtual screening, molecular docking, molecular dynamics simulations, chemical data mining and curation, data analysis and visualization of large chemical databases, QSAR modelling, algorithm development and scientific writing. During my PhD years, I was awarded the Olive Ruth Russell Fellowship from my alma mater, Berea College for three academic years in 2016, 2017 and 2019, and the highly prestigious AAUW International Doctoral Fellowship for the academic year of 2019 – 2020. iii ACKNOWLEDGMENTS I would like to thank Dr. Denis Fourches for all the guidance and support he has given me these past 4 years. I would also like to thank Dr. Gavin Williams for his expertise and generous contributions on a majority of the collaboration

ABSTRACT ZIN, PHYO PHYO KYAW. Development

Semantic Web Integration of Cheminformatics Resources with the SADI Framework Leonid L Chepelev1* and Michel Dumontier1,2,3

Joelib Tutorial

Cheminformatics and Chemical Information

Committee on Publications and Cheminformatics Data Standards (CPCDS)

Scalable Analysis of Large Datasets in Life Sciences

Cheminformatics for Genome-Scale Metabolic Reconstructions

Print Special Issue Flyer

Trust, but Verify: on the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research

Chemoinformatics for Green Chemistry

Harnessing Synthetic Biology for the Bioprospecting and Engineering of Aromatic Polyketide Synthases

The Literature of Chemoinformatics: 1978–2018

Cheminformatics-Aided Pharmacovigilance: Application To