Virtual Screening of One Billion Compound Libraries Using Novel

Dr. Olexandr Isayev and Prof. Denis Fourches
Laboratory for Molecular Modeling, University of North Carolina at Chapel Hill, USA

Decline in Pharmaceutical R&D Efficiency

The cost of developing a new drug roughly doubles every nine years.
Scannell et al. Nature Reviews Drug Discovery, 2012, 11, 191-200

- 10^33 drug-like chemicals*
- 10^8 compounds in PubChem
- 10^6 compounds in ChEMBL with ≥1 known bioactivity

Need for novel bio/cheminformatics methods that
(i) Fully exploit the potential of modern chemical biological data streams;
(ii) Reliably forecast compounds' bioactivity and safety profiles;
(iii) Accelerate the translation from basic research to drug candidates.

* Polishchuk, Madzhidov, Varnek. J Comput Aided Mol Des. 2013, 27(8):675-9.

Ligand-based Virtual Screening to Identify Potential Hits

~10^6 – 10^9 molecules -> VIRTUAL SCREENING (Empirical Rules/Filters, Similarity Search, Consensus QSAR Models) -> ~10^2 – 10^3 potential hits

Quantitative Structure-Activity Relationships

Thousands of molecular descriptors are available for organic compounds: constitutional, topological, structural, quantum mechanics based, fragmental, steric, pharmacophoric, geometrical, thermodynamical, conformational, etc.

Descriptor matrix: samples (compounds) as rows, features (descriptors) as columns, with one activity value per compound.

Compound   X1    X2    ...   Xm
1          X11   X12   ...   X1m
2          X21   X22   ...   X2m
...        ...   ...   ...   ...
n          Xn1   Xn2   ...   Xnm

[Figure: example compound structures with activity values 0.613, 0.380, -0.222, 0.708, 1.146, 0.491, 0.301, 0.141, 0.956, 0.256, 0.799, 1.195, 1.005]

- Building of models using machine learning methods (NN, SVM, RF)
- Validation of models according to numerous statistical procedures, and assessment of their applicability domains.
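The model-building and validation steps listed above can be illustrated with a very small sketch. It is not the authors' workflow: it uses a plain k-nearest-neighbour regressor with leave-one-out validation, written in standard-library Python, and the descriptor matrix `X` and activities `y` are made-up values chosen only to show the shape of the data (compounds as rows, descriptors as columns).

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Predict activity as the mean activity of the k nearest training compounds."""
    # Euclidean distance in descriptor space; ties are broken by activity value.
    ranked = sorted((math.dist(row, query), act) for row, act in zip(train_x, train_y))
    nearest = ranked[:k]
    return sum(act for _, act in nearest) / len(nearest)

def leave_one_out_rmse(x, y, k=3):
    """Hold out each compound in turn, predict it from the rest, and report the RMSE."""
    squared_errors = []
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        predicted = knn_predict(rest_x, rest_y, x[i], k)
        squared_errors.append((predicted - y[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical descriptor matrix (6 compounds x 2 descriptors) and activities.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y = [0.2, 0.3, 0.1, 1.0, 1.1, 0.9]

if __name__ == "__main__":
    print("prediction:", knn_predict(X, y, [0.2, 0.2], k=3))
    print("LOO RMSE:  ", leave_one_out_rmse(X, y, k=2))
```

A real QSAR study would use many more compounds and descriptors, an external test set, and an explicit applicability-domain check; none of that is shown here.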
Discovery of Novel Antimalarial Compounds Enabled by QSAR-Based Virtual Screening

- Severe infectious disease (~700,000 deaths per year worldwide)
- Caused by unicellular eukaryotic parasites, mainly Plasmodium falciparum
- Modeling set: 158 actives, 2,975 inactives; kNN and SVM with ISIDA descriptors

External predictive power of QSAR models is critical to enable their application to virtual screening.

ChemBridge Database (454,638 chemicals) -> Similarity Filters (44,112 chemicals) -> Drug-likeness Filters (39,944 chemicals) -> QSAR models -> 176 putative hits and 42 putative inactives

EXPERIMENTAL CONFIRMATION (Dr. Guy, St Jude Res. Hosp)
- Most potent hit (SJ000565000) with EC50 = 95.6 nM and a novel molecular scaffold
- 7 compounds with EC50 less than 2 μM
- 18 compounds with moderate activity (EC50 2-8 μM)
- All of the 42 putative inactives have EC50 >10 μM

14.2% hit rate >> HTS hit rate (0.1 – 5%)

Zhang, Fourches, et al. JCIM, 2013, 53, 475-492

It is technically challenging to compute molecular properties and descriptors for more than 10^9 compounds. No cheminformatics architecture is able to screen >10^9 compounds.

Study Design

Chemical Datasets

Largest publicly available virtual libraries:
- GDB-13 (ref. 1): 955 M compounds
- GDB-13-ABCDE subset: 141 M
- GDB-17 (ref. 2) subset: 50 M

1 Blum and Reymond, 2009, J Am Chem Soc, 131, 8732–8733
2 Ruddigkeit et al., 2012, J Chem Inf Model, 52, 2864-2875

Setup

Hardware Stack
• Intel Core i7 4770 CPU 3.4GHz
• Intel H87-based motherboard
• 32GB of DDR3 1600 memory
• Nvidia Tesla K20 for GPU accelerated calculations

Software Stack
• Ubuntu 12.04
• Anaconda Scientific Python 2.7.6 Distribution
• Pandas / PyTables
• MKL optimized NumPy
• NumbaPro for CPU optimization
• RDKit
• C / CUDA

Data Processing (input: Chemical Library)
- Data parsing from SMILES
- 2D structure generation
- Automatic curation
- High throughput descriptor generator (30M/hr): Mol weight, logP, H acceptors/donors, Rot bonds, Daylight fingerprints.
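As a rough illustration of the data-processing stage, the sketch below streams a SMILES file record by record and hands it on in fixed-size chunks, so a library never has to fit in memory at once. It is an assumption-laden toy, not the platform's parser: the function names `read_smiles` and `chunks` are hypothetical, "curation" is reduced to skipping blank lines and exact duplicate strings, and no chemistry toolkit is involved.

```python
from itertools import islice

def read_smiles(lines):
    """Yield (smiles, name) records, skipping blank lines and repeated SMILES strings.

    The in-memory `seen` set is for illustration only; at billion-compound
    scale a different de-duplication strategy would be needed.
    """
    seen = set()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # blank line
        smiles = parts[0]
        if smiles in seen:
            continue  # exact string duplicate
        seen.add(smiles)
        name = parts[1] if len(parts) > 1 else ""
        yield smiles, name

def chunks(records, size):
    """Group an iterable of records into lists of at most `size` items."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

if __name__ == "__main__":
    sample = ["CCO ethanol", "", "CCO duplicate", "c1ccccc1 benzene", "CC"]
    for block in chunks(read_smiles(sample), size=2):
        print(block)
```

Because both functions are generators, the same code works on an open file handle as well as on the small list used here.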
Storage & Manipulation
- Smaller datasets (<1M) directly allocated in RAM
- Interactive analytics with IPython
- Indexed, fully searchable, and accessible via high level API, e.g., (data.MolWt > 150) & (data.logP == 3)
- Access in chunks or streaming compound by compound

Modeling & Screening
- GPU accelerated similarity search (~1M Tanimoto/s)
- Rapid screening of extremely large libraries with multiple molecular probes and QSAR/QSPR models

GPU - Case Study 1
Fast Computation of Molecular Properties for Extremely Large Chemical Libraries

[Figure panels: GDB-13, subset of 141 M; GDB-17, random sample of 50 M]

Our GPU-accelerated cheminformatics platform is able to compute key molecular properties for GDB-13 (855M), GDB-13-ABCDE (141M), and a subset of GDB-17 (50M) compounds.
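The throughput figures quoted in the slides (30M compounds/hr for descriptor generation, ~1M Tanimoto comparisons/s on the GPU) translate into wall-clock estimates with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: it assumes the quoted rates stay constant across a whole run, and it takes the GDB-13 size from the Chemical Datasets slide (955 M).

```python
# Rates quoted in the slides, treated here as constants (an assumption).
DESCRIPTORS_PER_HOUR = 30e6   # "30M/hr" descriptor generation
TANIMOTO_PER_SECOND = 1e6     # "~1M Tanimoto/s" GPU similarity search

# Library sizes as listed on the Chemical Datasets slide.
LIBRARY_SIZES = {
    "GDB-13": 955e6,
    "GDB-13-ABCDE": 141e6,
    "GDB-17 subset": 50e6,
}

def descriptor_hours(n_compounds):
    """Estimated hours to generate descriptors for n_compounds."""
    return n_compounds / DESCRIPTORS_PER_HOUR

def search_seconds(n_compounds, n_probes=1):
    """Estimated seconds to compare n_probes query molecules against n_compounds."""
    return n_compounds * n_probes / TANIMOTO_PER_SECOND

if __name__ == "__main__":
    for name, size in LIBRARY_SIZES.items():
        print(f"{name}: {descriptor_hours(size):.1f} h of descriptor generation")
    print(f"200M library, 1 probe: {search_seconds(200e6):.0f} s")
```

Under these assumptions, descriptor generation for 955 M compounds takes roughly 32 hours on the single workstation described, and one probe against a 200 M compound library takes about 200 seconds.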
GPU - Case Study 2
Virtual Screening of Very Large Chemical Libraries to Identify Bioactive Compounds

Lacosamide
- Lacosamide (trade name Vimpat) is an anticonvulsant drug used to prevent seizures in patients treated for epilepsy;
- Functionalized amino acid;
- Many active analogues have been synthesized in Prof. Harold Kohn's laboratory* at UNC-CH.

*Wang et al., 2011, ACS Chem Neurosci, 2, 90–106

Similarity search using a 200M compound subset of GDB-13/17, with lacosamide as the molecular probe.

[Figure: structures of Analog 1, Analog 2, Analog 3, Analog 4, Analog 5]

Compound ID         Tanimoto (Ts)
Analog 2            0.997
Analog 3            0.995
Analog 1            0.994
Analog 4            0.992
Analog 5            0.978
Gdb13-a10573585     0.977
Gdb13-b28137563     0.977
Gdb13-a36264983     0.976
Gdb13-a36264952     0.976
Gdb13-a10616005     0.976
Gdb13-a3011053      0.976
Gdb13-b21242261     0.976
Gdb17-44140083      0.976
Gdb13-a30878321     0.975
Gdb13-b3485216      0.975

The GPU-accelerated screening platform was able to retrieve:
- known active analogues of lacosamide,
- several functionalized amino acids present in GDB-13,
- a novel compound (Gdb17-44140083) fully matching the pharmacophore of lacosamide.

In Summary
• GPU-accelerated cheminformatics platform for high performance virtual screening of extremely large chemical libraries.
• Tested for (1) the analysis of the largest publicly available dataset, GDB-13 (~900M compounds), and (2) the screening of a ~200M compound library by similarity search using an anticonvulsant drug as the molecular probe.
• Our platform aims to virtually screen billions of compounds using similarity filters and QSAR models.
Acknowledgements
• Professor Alex Tropsha (UNC-CH)
• Colleagues at MML laboratory
• NVIDIA & Mark Berger for generous hardware donation

Funding
- NSF ABI program
- Office of Naval Research

Molecular Fingerprints

Bit string encodings of structural features and/or calculated molecular properties.

Information about the presence of molecular fragments:
1 – fragment is present
0 – fragment is absent

Similarity Search

Similarity searching using fingerprint representations of molecules is one of the most widely used approaches for chemical database mining: it assumes that similar compounds possess similar biological activities.

Tanimoto Coefficient

From J. Bajorath, SSS Cheminformatics, Obernai 2008.
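For two bit-string fingerprints with a and b bits set and c bits set in both, the Tanimoto coefficient is c / (a + b - c), i.e. the number of shared bits divided by the number of bits set in either fingerprint. The sketch below shows this on the CPU with Python integers standing in for fingerprints, followed by a ranked search with a single probe. It is only an illustration: the platform described in the slides runs this comparison on a GPU, the compound names and bit patterns are invented, `top_hits` is a hypothetical helper, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as Python integers."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both (c)
    union = bin(fp_a | fp_b).count("1")    # bits set in either (a + b - c)
    return common / union if union else 0.0

def top_hits(probe, library, n=3):
    """Return the n (name, score) pairs most similar to the probe.

    Sorted by descending score, with names as a tie-breaker.
    """
    scored = [(tanimoto(probe, fp), name) for name, fp in library.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:n]]

# Invented 6-bit fingerprints, for illustration only.
PROBE = 0b101101
LIBRARY = {
    "cpd-a": 0b101101,
    "cpd-b": 0b101100,
    "cpd-c": 0b010010,
    "cpd-d": 0b111101,
}

if __name__ == "__main__":
    for name, score in top_hits(PROBE, LIBRARY, n=3):
        print(f"{name}\t{score:.3f}")
```

Ranking a library by this score against one probe is the operation behind a results table of compound IDs and Tanimoto values.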
