Exploration of Chemical Space by Machine Learning
Total Page:16
File Type:pdf, Size:1020Kb
Skolkovo Instutute of Science and Technology EXPLORATION OF CHEMICAL SPACE BY MACHINE LEARNING Doctoral Thesis by Sergey Sosnin Doctoral Program in Computational and Data Science and Engineering Supervisor Vice-President for Artificial Intelligence and Mathematical Modelling, Professor, Maxim Fedorov Moscow – 2020 © Sergey Sosnin, 2020 I hereby declare that the work presented in this thesis was carried out by myself at Skolkovo Institute of Science and Technology, Moscow, except where due acknowledgement is made, and has not been submitted for any other degree. Candidate: Sergey Sosnin Supervisor (Prof. Maxim V. Fedorov) 2 EXPLORATION OF CHEMICAL SPACE BY MACHINE LEARNING by Sergey Sosnin Submitted to the Center for Computational and Data-Intensive Science and Engineering and Innovation on September 2020, in partial fulfillment of the requirements for the Doctoral Program in Computational and Data Science and Engineering Abstract The enormous size of potentially reachable chemical space is a challenge for chemists who develop new drugs and materials. It was estimated as 1060 and, given such large numbers, there is no way to analyze chemical space by brute-force search. However, the extensive development of techniques for data analysis provides a basis to create methods and tools for the AI-inspired exploration of chemical space. Deep learning revolutionized many areas of science and technology in recent years, i.e., computer vision, natural language processing, and machine translation. However, the potential of application of these methods for solving chemoinformatics challenges has not been fully realized yet. In this research, we developed several methods and tools to probe chemical space for predictions of properties of organic compounds as well as for visualizing regions of chemical space and for the sampling of new compounds. We explored a new type of 3D spatial descriptors and demonstrated that one can use these descriptors with 3D Convolutional neural networks for the bioconcentration factor prediction. In our research, we proved that multitask deep learning can achieve better performance been compared with single-task learning. To improve the navigation trough chemical space, we have developed a parametric t-SNE method for visualization of large chemical datasets. We developed molecular grammars for the generation of the organic structures and implemented a library to work with these grammars: Legogram. These methods and tools provide the environment for the AI-driven exploration of chemical space. We believe that our findings will accelerate the drug discovery process. 3 Publications The main results of the work have been published in papers: 1. Sergey Sosnin, Dmitry Karlov, Igor V. Tetko, and Maxim V. Fedorov. Com- parative Study of Multitask Toxicity Modeling on a Broad Chemical Space. Journal of Chemical Information and Modeling, 2018a. ISSN 1549-9596; 1549- 960X. doi:10.1021/acs.jcim.8b00685. Place: United States Publisher: United States (Q1 journal, IF=4.54) 2. Sergey Sosnin, Mariia Vashurina, Michael Withnall, Pavel Karpov, Maxim Fedorov, and Igor V. Tetko. A Survey of Multi-Task Learning Methods in Chemoinformatics. Molecular informatics, 37, 2018c. ISSN 1868-1751; 1868- 1743. doi:10.1002/minf.201800108. Place: Weinheim, Germany, Germany Pub- lisher: Weinheim, Germany, Germany (Q2 journal, IF=2.74) 3. Sergey Sosnin, Maksim Misin, David Palmer, and Maxim Fedorov. 3D mat- ters! 3D-RISM and 3D convolutional neural network for accurate bioaccumu- lation prediction. Journal of Physics Condensed Matter, 30(32), 2018b. ISSN 1361-648X; 0953-8984. doi:10.1088/1361-648x/aad076. Place: United King- dom Publisher: United Kingdom (Q1 journal, IF=2.7) 4. Dmitry S. Karlov, Sergey Sosnin, Igor V. Tetko, and Maxim V. Fedorov. Chem- ical space exploration guided by deep neural networks. RSC advances, (9): 5151–5157, 2019. ISSN 2046-2069. doi:10.1039/c8ra10182e. Place: United Kingdom Publisher: United Kingdom (Q1 journal, IF=3.0) These publications describe findings which are related to the study: 1. Yury I. Kostyukevich, Gleb Vladimirov, Elena Stekolschikova, Daniil G. Ivanov, Arthur Yablokov, Alexander Ya Zherebker, Sergey Sosnin, Alexey Orlov, Maxim Fedorov, Philipp Khaitovich, and Evgeny N. Nikolaev. Hydrogen/Deuterium exchange aids compounds identification for LC-MS and MALDI imaging lipidomics. Analytical Chemistry, 2019. ISSN 1520-6882; 0003-2700. doi:10.1021/acs.analchem.9b02461. Place: United States Publisher: United States (Q1 journal, IF=6.35) 4 2. Dmitry S. Karlov, Sergey Sosnin, Maxim V. Fedorov, and Petr Popov. graphDelta: MPNN scoring function for the affinity prediction of protein–ligand complexes. ACS Omega, 5(10):5150–5159, March 2020. doi:10.1021/acsomega.9b04162. URL https://doi.org/10.1021/acsomega.9b04162 (Q1 journal, IF=2.87) 3. Sergey Osipenko, Inga Bashkirova, Sergey Sosnin, Oxana Kovaleva, Maxim Fedorov, Eugene Nikolaev, and Yury Kostyukevich. Machine learning to pre- dict retention time of small molecules in nano-HPLC. Analytical and Bio- analytical Chemistry, August 2020. doi:10.1007/s00216-020-02905-0. URL https://doi.org/10.1007/s00216-020-02905-0 (Q1 journal, IF=3.63) 5 Acknowledgments Firstly, I would like to express my sincere gratitude to my advisor, Vice-President for Artificial Intelligence and Mathematical Modelling of Skoltech, Prof. Maxim Fedorov, for the support during the PhD program in Skoltech. His openness to new ideas inspired me for this research project. I deeply grateful for Prof. Igor V. Tetko for the fruitful cooperation and ultimate support during this research. I acknowledge my co-authors and colleagues Dmitry Karlov, Petr Popov, Olga Novitskaya and Yury I. Kostyukevich for the beneficial teamwork. I am very grateful to my wife and colleague, Ekaterina Sosnina, who supported me a lot during my PhD program. 6 Contents 1 Introduction9 2 The Literature Review 17 2.1 Molecular Descriptors........................... 20 2.2 Machine Learning Methods in QSAR/QSPR Studies......... 24 2.3 Models Validation and Performance Measurement........... 31 2.4 3D Reference Interaction Site Model (3D-RISM)............ 36 2.5 Multitask Learning for Chemical Data Analysis............ 37 3 3D RISM and 3D CNNs for bioconcentration prediction 44 3.1 Materials and Methods.......................... 46 3.2 Results and Discussion.......................... 52 3.3 Conclusions................................ 55 4 Multitask learning for acute toxicity modelling 56 4.1 Materials and Methods.......................... 58 4.2 RTECS Chemical Space......................... 60 4.3 Correlation Analysis of Endpoints.................... 64 4.4 Comparison of Models.......................... 64 4.5 Attributed Modeling........................... 68 4.6 Conclusions................................ 73 5 Chemical space visualization guided by deep learning 75 5.1 Materials and Methods.......................... 76 5.1.1 Datasets.............................. 77 5.1.2 Parametric t-SNE......................... 78 5.1.3 Dimensionality Reduction Methods............... 82 5.1.4 Validation protocols....................... 82 5.2 Results and Discussion.......................... 82 5.3 Conclusions................................ 87 6 Legogram: Molecular grammars 88 6.1 Formal Definition of Molecular Grammars............... 90 6.2 Implementation of Molecular Grammars................ 92 6.3 Validation of the Algorithm....................... 97 6.4 Grammar Compression.......................... 98 6.5 Generative Models............................ 100 7 Contents Contents 6.5.1 Legogram-based Generative Modeling.............. 101 6.5.2 Optimization of Drug-likeness.................. 102 6.5.3 Synthetic Accessibility...................... 104 6.6 Sampling Compounds from Chemical Space.............. 106 6.7 Conclusions................................ 110 7 Conclusions 114 Glossary 120 Bibliography 123 8 Supplementary Material 147 8 Chapter 1 Introduction There is a common opinion that pharmaceutical R & D is in crisis Pammolli et al. [2011]. Companies have to spend more than ten years and more than a billion dol- lars on marketing a new drug globally. But the problem is not only about high costs and risks. The "gold-mines" – scaffolds that produced blockbuster drugs – are dried out. Investigations of new, unexplored regions of chemical space is a risky business because the clinical trial failure ratio is high. A possible solution is the intensive investigation of new areas of chemical space – to find new scaffolds for new drugs. The researchers should go to terra-incognita of chemical space, because there is a supreme request for exploration of chemical space for a search of new drug candi- dates. COVID-19 outbreak stressed the basic fact that humanity does not have an adequate response to viral diseases, and infection outbreaks are not something that we thought was a problem of the XIX century. Antibiotics resistance is another possible pain spot for humankind Ventola[2015]. Antibiotics saved millions of lives, but now we faced with the rise of resistant bacteria. Horizontal gene transfer allows bacteria to exchange genes responsible for the deactivation of antibacterial drugs Barlow[2009], Lerminiaux and Cameron[2019]. On the other hand, the pharma- ceutical industry requires time for clinical trials, limiting our chances to combat the next infection crisis. Non-communicable diseases are the leading cause of death all around the world. Despite the developed treatment strategies, there are no "silver bullets" for the majority of chronic