Molecular Similarity and Diversity Analysis of Bioactive Small Molecules Using Chemoinformatics Approaches

Molecular similarity and diversity analysis of bioactive small molecules using chemoinformatics approaches Varun Khanna A thesis submitted to Macquarie University in fulfilment of the degree of Doctor of Philosophy Department of Chemistry and Biomolecular Sciences Macquarie University Sydney, Australia March 2011 DEDICATED TO ALL THE PIONEERING SCIENTISTS OF ARYAVARTA (ANCIENT INDIA), WHO GAVE THE KNOWLEDGE OF VEDIC MATHEMATICS , YOGA AND AYURVEDA TO THE WORLD i DECLARATION This thesis contains original work, which was performed by me. Several aspects of this work have been carried out with the hep and guidance of many researchers; these people have been acknowledged and their contributions recognised in the section in which their assistance was received. This thesis contains no material that has been accepted for the award of any higher degree or diploma at any University or Institution, and to the best of my knowledge, contains no material previously published or written by another person, except where due reference is made in the text of the thesis. Varun Khanna March 2011 ii ACKNOWLEDGEMENTS Working at a scientific problem can be fun, stimulating and great. However, at times it can also be tough, time consuming and persevering and in the tough times one often needs help. It is my pleasure to thank many people who extended their support to me in difficult times and made this thesis possible. Professional My supervisor, Prof. Shoba Ranganathan, for her constant support and invaluable suggestions during this work. I thank her for giving me the freedom to follow up on interesting topics in chemoinformatics but also keeping me on track. I am very grateful for her patience, motivation, never ending enthusiasm, intellectual support and especially for encouraging me to visit Strasbourg Summer School on Chemoinformatics, Obernai very early in my PhD to participate and interact with leading scientists in the area of computational chemistry. I would also like to thank her for educating me on time management in academic research. Prof. Joanne Jamie, my co-supervisor for the support throughout my PhD tenure, for the motivation and useful discussions during meetings. I also thank her for providing the excellent scientific and social environment at the department. Dr. Dominique Gorse, for giving an opportunity to visit QFAB and helping me out with the final manuscript. I also thank him for suggestions on post PhD plans. I am grateful to my past and current lab members Dr. Durgaprasad Bollina, Dr. Shivashankar Hiriyur Nagaraj, Javed Khan, Dr. Jitendra Gaikwad, Dr. Adrian Cootes, Gagan Garg, , Gaurav Kumar, Elsa Chacko, Ranjeeta Menon and FoS staff members Michael Baxter, Chris McRae (CBMS), Maria Hyland (CBMS) Catherine Wong (CBMS) for their support. Personal My deepest gratitude goes to my parents and all other family members for their love, care and encouragement when I was miles away from home. Without their support it would not been possible to reach this important milestone in my research career. iii I am deeply indebted to my dear younger brother Dheeraj Khanna for sharing wonderful times and emotional support. I would like to thank all my following friends for making these three years a memorable experience with their good company and for their moral support: Javed Iqubal Khan, Sosuke (gym buddy), Gulshan Kumar, Long Vu (table tennis pal), Annaf, John (for gym training), Sudhir, Vikas Dhiyia (India), Bhupinder Jangara (India), Aneesh, Carlos, Rajesh Mudugal (Adelaide), Faraaz, Jens Moll, Lucky Tur, Nandan Deshpanday, Orisi, Pankaj, Ruksana Master. My hearty thanks to my best friend, Sumit Kalra, India. I got the opportunity to know him in B.Sc. and I thank him for the wonderful times we shared since then. I will especially never forget the train journeys which we took together. Thanks to another great friend Nogc Pham, Australia who has always enjoyed all my publications and work even though he never understood anything about it. I would also like to thank all my science teachers especially, Ruby madam (4th and 5th standard), Mona madam (6th - 8th standard), Dr. Rajpal Yadav (9th and 10th standard), Dr. Monga (Graduation, Botany), Dr. Dharam Singh Chopra (Graduation, Biochemistry) and Dr. Sudhir Kumar (Masters, Bioinformatics) for helping me maintain interest in science and encouraging me to take science as a career. iv TABLE OF CONTENTS Declaration ii Acknowledgements iii Table of Contents v List of Abbreviations viii List of Figures ix List of Tables xii List of Publications included in this thesis xiii Abstract xiv CHAPTER 1 Introduction and literature survey 1 1.1 Overview 1 1.2 Bioinformatics, chemoinformatics and drug discovery 2 1.2.1 Molecular similarity and diversity analyses 1.2.2 Descriptors used in chemoinformatics 1.2.2.1 1D descriptors 1.2.2.2 2D descriptors 1.2.2.3 3D descriptors 1.2.3 Descriptor selection 1.2.3.1 Wrapper methods 1.2.3.2 Filter methods 1.2.4 Molecular coefficients in chemoinformatics 1.2.5 Comparison of drug-like and non drug-like compounds 1.2.5.1 Physicochemical properties analyses 1.2.5.2 Scaffold and molecular fragment data analysis 1.2.5.3 Machine learning and datamining methods 1.3 Databases and resources available 30 Publication 1: Molecular similarity and diversity 38 approaches in drug discovery Publication 2: In silico methods for the analysis of 51 metabolites and drug molecules. 1.4 Objectives 73 CHAPTER 2 Methods and Applications 74 v CHAPTER 3 Development of CMKb chemoinformatics module to digitize 76 and store chemical information. 3.1 Summary 76 Publication 2: CMKb: a web-based prototype for 77 integrating Australian Aboriginal customary medicinal plant knowledge. 3.2 Conclusions 85 CHAPTER 4 Comparison of physicochemical property space among 86 human metabolites, drugs and toxins 4.1 Summary 86 Publication 3: Physicochemical property space 87 distribution among human metabolites, drugs and toxins 4.2 Conclusions 107 CHAPTER 5 Scaffold and fragment co-occurrence studies on datasets of 108 biological interest. 5.1 Summary 108 Publication 4: Scaffold and fragment co-occurrence 109 studies on datasets of biological interest. 5.2 Conclusions 149 CHAPTER 6 Virtual screening of compounds active against parasitic 150 nematodes of major socio-economic importance 6.1 Summary 150 Publication 5: In silico approach to screen compounds 151 active against parasitic nematodes of major socio- economic 6.2 Conclusion CHAPTER 7 General discussion and conclusion 7.1 Principal findings and conclusions 7.2 Contributions to the field of chemoinformatics highlighting the significance of the work 7.3 Future directions References vi ABBREVIATIONS 1D One-dimensional 2D Two-dimensional 3D Three-dimensional ACD Available Chemicals Directory ANN Artificial neural network AUC Area under the curve BCI Barnard Chemical Information BE Backward elimination BFGS Broyden–Fletcher–Goldfarb–Shanno CAS Chemical Abstract Service CASRN Chemical Abstract Service Reference Number CMC Comprehensive Medicinal Chemistry CMKb Customary Medicinal Knowledgebase CoMFA Comparative Molecular Field Analysis CoMSIA Comparative Molecular Similarity Indices Analysis CPDB The Carcinogenic Potency Database DT Decision trees DUD Directory of useful decoys ECFP Extended connectivity fingerprints FCFP Functional class fingerprints FS Forward selection GA Genetic algorithm GRIND Grid-Independent Descriptors HMDB Human Metabolome database HTS High-throughput screening InChI International Chemical Identifier IUPAC International Union of Pure and Applied Chemistry KEGG Kyoto Encyclopaedia of Genes and Genomes MACCS Molecular Access System MDDR MDL Drug Data Report MIP Molecular Interaction Potentials NCI National Cancer Institute vii PDB Protein Data Bank QSAR Quantitative structure activity relationships Ro5 Rule of Five ROC Receiver Operation Characteristic SA Simulated annealing SCAM Statistical Classification of Activities of Molecules SD Structure data SMILES Simplified Molecular Input Line Entry Specification SOM Self-organising maps SVM Support vector machines TI Topological indices TTD Therapeutic Target Database VS Virtual screening viii LIST OF FIGURES Figure 1.1 Steps in a typical drug design program. 3 Identification and validation of biological target is followed by identification and optimization of leads. After testing the potential drug in the laboratory, it is approved by FDA to release into the market. The whole process on an average takes around 15 years and $ 1billion of investment. Figure 1.2 Examples to show compounds with the same biological function 4 but different scaffolds. The Tanimoto similarity of methadone is quite low to typical opioid receptor agonist such as morphine and codeine; however, it performs the same function as others. Figure 1.3 Representation of a molecule M at position i in N dimensional 5 descriptor space. The molecules that lie close to each other in the descriptor space are likely to exhibit similar biological properties. Figure 1.4 Sample molecular descriptors for vasicine. 6 1D descriptors are the simplest and mostly refer to whole molecule properties like molecular weight, while 2D descriptors refer to topological characteristics of the molecule. 3D descriptors on the other hand encode information regarding ligand-receptor binding. Figure 1.5. Representing a. a labelled graph as b. a bond (adjacency) 9 matrix. Connected nodes are represented as 1 while nodes with no connection are represented as 0. ix Figure 1.6. Connection table for a chemical compound, aspirin. 9 The connection table is divided into two blocks: atom block and bond block. The atom block comprises all the information regarding atoms, including 3D coordinates while the bond block contains information on all the bonds present in the molecule. The first row of the connection table contains the total number of atoms in a molecule followed by total number of bonds. Figure 1.7 Generic scheme for filter- and wrapper-based feature selection. 15 Filter-based methods are fast and easy to apply especially in case of large dataset whereas wrapper-based approaches require higher computational resources but are more accurate. Figure 1.8 Different levels of abstraction used in the literature for 22 molecular scaffold analysis.

Load more