Virtual Screening of One Billion Compound Libraries Using Novel
Dr. Olexandr Isayev and Prof. Denis Fourches
Laboratory for Molecular Modeling, University of North Carolina at Chapel Hill, USA

Decline in Pharmaceutical R&D efficiency
- The cost of developing a new drug roughly doubles every nine years.
- 10^33 drug-like chemicals*
- 10^8 compounds in PubChem
- 10^6 compounds in ChEMBL with ≥1 known bioactivity
Scannell et al. Nature Reviews Drug Discovery, 2012, 11, 191-200
Need for novel bio/cheminformatics methods that
(i) fully exploit the potential of modern chemical biological data streams;
(ii) reliably forecast compounds' bioactivity and safety profiles;
(iii) accelerate the translation from basic research to drug candidates.
* Polishchuk, Madzhidov, Varnek. J Comput Aided Mol Des. 2013, 27(8):675-9.

Ligand-based Virtual Screening to identify potential hits
- Methods: empirical rules/filters, similarity search, consensus QSAR models.
- Virtual screening narrows ~10^6 – 10^9 molecules down to ~10^2 – 10^3 potential hits.

Quantitative Structure-Activity Relationships (QSAR)
- Thousands of molecular descriptors are available for organic compounds: constitutional, topological, structural, quantum mechanics based, fragmental, steric, pharmacophoric, geometrical, thermodynamical, conformational, etc.
- Descriptor matrix: n samples (compounds) by m features (descriptors), with entries X11 ... Xnm, paired with the measured activity of each compound.
- Building of models using machine learning methods (NN, SVM, RF).
- Validation of models according to numerous statistical procedures, and their applicability domains.
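To make the descriptor-matrix idea concrete, here is a minimal sketch of a k-nearest-neighbour classifier operating on such a matrix. It is only an illustration in plain Python: the descriptor values, activity labels, and the function names `euclidean` and `knn_predict` are made up for this example and are not taken from the talk or from any specific QSAR package.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict a label by majority vote among the k nearest training compounds."""
    ranked = sorted(
        zip(X_train, y_train),
        key=lambda pair: euclidean(pair[0], x_query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy descriptor matrix: rows are compounds, columns are descriptors.
# All values are invented purely for illustration.
X_train = [
    [0.61, 0.38],
    [0.71, 0.49],
    [0.80, 0.30],
    [0.14, 0.96],
    [0.26, 1.15],
    [0.20, 1.00],
]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = active, 0 = inactive (invented labels)

prediction_a = knn_predict(X_train, y_train, [0.65, 0.40])
prediction_b = knn_predict(X_train, y_train, [0.20, 1.05])
print(prediction_a, prediction_b)
```

A real model would use many more descriptors, scaled features, and external validation, as the slide stresses.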
Discovery of Novel Antimalarial Compounds Enabled by QSAR-Based Virtual Screening
- Severe infectious disease (~700,000 deaths per year worldwide).
- Caused by unicellular eukaryotic parasites, mainly Plasmodium falciparum.
- Modeling set: 158 actives, 2,975 inactives. kNN and SVM with ISIDA descriptors.
- Screening workflow: ChemBridge database (454,638 chemicals) → similarity filters (44,112 chemicals) → drug-likeness filters (39,944 chemicals) → QSAR models → 176 putative hits and 42 putative inactives.
Experimental confirmation (Dr. Guy, St. Jude Res. Hosp.)
- Most potent hit (SJ000565000) with EC50 = 95.6 nM and a novel molecular scaffold.
- 7 compounds with EC50 less than 2 μM.
- 18 compounds with moderate activity (EC50 2-8 μM).
- All of the 42 putative inactives have EC50 > 10 μM.
- 14.2% hit rate >> HTS hit rate (0.1 – 5%).
Zhang, Fourches, et al. JCIM, 2013, 53, 475-492

- External predictive power of QSAR models is critical to enable their application to virtual screening.
- Technically challenging to compute molecular properties and descriptors for more than 10^9 compounds.
- No cheminformatics architecture is able to screen >10^9 compounds.

Study Design

Chemical Datasets
Largest publicly available virtual libraries:
- GDB-13 (ref. 1): 955 M compounds
- GDB-13-ABCDE subset: 141 M
- GDB-17 (ref. 2) subset: 50 M
1 Blum and Reymond, 2009, J Am Chem Soc, 131, 8732–8733
2 Ruddigkeit et al., 2012, J Chem Inf Model, 52, 2864-2875

Setup
Hardware stack
• Intel Core i7 4770 CPU, 3.4 GHz
• Intel H87-based motherboard
• 32 GB of DDR3 1600 memory
• Nvidia Tesla K20 for GPU-accelerated calculations
Software stack
• Ubuntu 12.04
• Anaconda Scientific Python 2.7.6 Distribution
• Pandas / PyTables
• MKL-optimized NumPy
• NumbaPro for CPU optimization
• RDKit
• C / CUDA
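The antimalarial screening funnel and its reported hit rate can be checked with a few lines of arithmetic. The sketch below assumes that the 14.2% figure counts the 7 + 18 experimentally confirmed compounds against the 176 putative hits; the variable and function names are illustrative only.

```python
# Compound counts at each stage of the screening funnel, as given on the slide.
funnel = [
    ("ChemBridge database", 454638),
    ("after similarity filters", 44112),
    ("after drug-likeness filters", 39944),
    ("putative hits from QSAR models", 176),
]


def retention(stages):
    """Fraction of the starting library that survives each stage."""
    start = stages[0][1]
    return [(name, count / start) for name, count in stages]


# Confirmed actives reported on the slide:
# 7 compounds with EC50 < 2 uM plus 18 with EC50 between 2 and 8 uM.
confirmed_actives = 7 + 18
tested_hits = 176
hit_rate_percent = round(100 * confirmed_actives / tested_hits, 1)

for name, fraction in retention(funnel):
    print(f"{name}: {fraction:.4%} of the library")
print(f"hit rate: {hit_rate_percent}%")
```

Under that assumption the computed value reproduces the 14.2% on the slide.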
Chemical Library
Data Processing
- Data parsing from SMILES
- 2D structure generation
- Automatic curation
High-throughput descriptor generator
- Mol weight, logP, H acceptors/donors, rot bonds, Daylight fingerprints.
- Throughput: 30M/hr.
Storage & Manipulation
- Smaller datasets (<1M) directly allocated in RAM.
- Interactive analytics with IPython.
- Indexed, fully searchable, and accessible via high-level API, e.g., (data.MolWt > 150) & (data.logP == 3)
- Access in chunks or streaming compound by compound.
Modeling & Screening
- GPU-accelerated similarity search (~1M Tanimoto/s).
- Rapid screening of extremely large libraries with multiple molecular probes and QSAR/QSPR models.

GPU - Case Study 1
Fast Computation of Molecular Properties for Extremely Large Chemical Libraries
- Our GPU-accelerated cheminformatics platform is able to compute key molecular properties for GDB-13 (855M), GDB-13-ABCDE (141M), and a subset of GDB-17 (50M) compounds.
- Figure panels: GDB-13 (subset of 141 M); GDB-17 (random sample of 50 M).
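The "access in chunks or streaming" idea from the storage layer can be illustrated with the standard library alone. This is a sketch, not the platform's Pandas / PyTables code: the in-memory CSV stands in for an on-disk store, the compound IDs and property values are invented, and `iter_chunks` and `select` are hypothetical helper names. The filter mirrors the example query `(data.MolWt > 150) & (data.logP == 3)`.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def select(rows, chunk_size=2):
    """Stream rows chunk by chunk, keeping IDs with MolWt > 150 and logP == 3."""
    hits = []
    for chunk in iter_chunks(rows, chunk_size):
        for row in chunk:
            if float(row["MolWt"]) > 150 and float(row["logP"]) == 3:
                hits.append(row["id"])
    return hits


# Invented property table standing in for a large on-disk library.
table = io.StringIO(
    "id,MolWt,logP\n"
    "cpd1,120.0,3\n"
    "cpd2,180.5,3\n"
    "cpd3,210.2,1.5\n"
    "cpd4,305.7,3\n"
    "cpd5,151.0,3\n"
)
selected = select(csv.DictReader(table), chunk_size=2)
print(selected)
```

Because only one chunk is held in memory at a time, the same loop shape works when the library is far larger than RAM.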
GPU - Case Study 2
Virtual Screening of Very Large Chemical Libraries to Identify Bioactive Compounds
- Lacosamide (trade name Vimpat) is an anticonvulsant drug used to prevent seizures in patients treated for epilepsy.
- Functionalized amino acid.
- Many active analogues have been synthesized in Prof. Harold Kohn's laboratory* at UNC-CH.
* Wang et al., 2011, ACS Chem Neurosci, 2, 90–106

Similarity search using a 200M-compound subset of GDB-13/17, with lacosamide as the molecular probe (structures shown: Analogs 1 to 5).

Compound ID        Tanimoto Ts
Analog 2           0.997
Analog 3           0.995
Analog 1           0.994
Analog 4           0.992
Analog 5           0.978
Gdb13-a10573585    0.977
Gdb13-b28137563    0.977
Gdb13-a36264983    0.976
Gdb13-a36264952    0.976
Gdb13-a10616005    0.976
Gdb13-a3011053     0.976
Gdb13-b21242261    0.976
Gdb17-44140083     0.976
Gdb13-a30878321    0.975
Gdb13-b3485216     0.975

The GPU-accelerated screening platform was able to retrieve:
- known active analogues of lacosamide,
- several functionalized amino acids present in GDB-13,
- a novel compound (Gdb17-44140083) fully matching the pharmacophore of lacosamide.

In Summary
• GPU-accelerated cheminformatics platform for high-performance virtual screening of extremely large chemical libraries.
• Tested for (1) the analysis of the largest publicly available dataset, GDB-13 (~900M compounds), and (2) the screening of a ~200M-compound library by similarity search using an anticonvulsant drug as the molecular probe.
• Our platform aims to virtually screen billions of compounds using similarity filters and QSAR models.
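The ranking step of a probe-based similarity search, as used in Case Study 2, can be sketched on the CPU in a few lines. This is not the CUDA implementation from the talk: fingerprints are stored here as plain Python integers, the compound names and bit patterns are invented, and `tanimoto` and `top_k` are illustrative helper names.

```python
import heapq


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in at least one
    return common / total if total else 0.0


def top_k(probe_fp, library, k=3):
    """Return the k (score, name) pairs most similar to the probe, best first."""
    scored = ((tanimoto(probe_fp, fp), name) for name, fp in library)
    return heapq.nlargest(k, scored)


probe = 0b11110000
library = [
    ("cpd_a", 0b11110000),  # identical to the probe
    ("cpd_b", 0b11100000),  # 3 common bits out of 4 in the union
    ("cpd_c", 0b00001111),  # no bits in common
    ("cpd_d", 0b11111100),  # 4 common bits out of 6 in the union
]
ranking = top_k(probe, library, k=2)
print(ranking)
```

`heapq.nlargest` consumes the library as a stream and keeps only k candidates, so the library can be a generator over a file rather than a list in memory.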
Acknowledgements
• Professor Alex Tropsha (UNC-CH)
• Colleagues at MML laboratory
• NVIDIA & Mark Berger for generous hardware donation
Funding
- NSF ABI program
- Office of Naval Research

Molecular fingerprints
Bit string encodings of structural features and/or calculated molecular properties. Each bit carries information about the presence of a molecular fragment:
1 – fragment is present
0 – fragment is absent

Similarity Search
Similarity searching using fingerprint representations of molecules is one of the most widely used approaches for chemical database mining: it assumes that similar compounds possess similar biological activities.

Tanimoto Coefficient
From J. Bajorath, SSS Cheminformatics, Obernai 2008.
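For reference, the standard Tanimoto coefficient for two binary fingerprints A and B can be written as below. The symbols a, b, and c are notation chosen for this note rather than taken from the slide: a and b are the numbers of bits set in A and B, and c is the number of bits set in both.

```latex
\[
T(A, B) = \frac{c}{a + b - c}, \qquad 0 \le T(A, B) \le 1
\]
% Worked example: A = 11110000 and B = 11100000 give a = 4, b = 3, c = 3, so
\[
T(A, B) = \frac{3}{4 + 3 - 3} = 0.75
\]
```

A value of 1 means the two fingerprints set exactly the same bits; a value of 0 means they share none.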