The Pubchemqc Project: a Large Chemical Database from the First

Home , Chemical database

arXiv:1512.08572v1 [physics.chem-ph] 29 Dec 2015 Rseta ,0 oeue o Vseta n other and spectra, UV for molecules 1,600 for spectra, molecules 16,000 IR over contains Chem- It NIST [6]. be WebBook would istry one largest au- the as knowledge, far thors’ As database properties. large their and a molecules have many don’t having we however, data, from istry [5]. begun just also has many are: chemistry preliminary to to and cases [4], application recognition successful applied speech and very been [3] [2] Some has image engineering. approach of this areas and data, promising very in abstractions a high-level model is to learning attempt than which Deep further technique now. little do we very if can do them we of to what part system a imitation do an can restrict we that believe we however, one by imitate to program. difficult computer and important chemists very indeed sophisticated both are by of experiments intuitions and and system calculations Experiences have such difficult. which course of of molecules Development is find properties. to desired system some expert and gine do. chemists experimental other the chemistry what quantum helps still only implementations, good and [1], re- their packages. program conduct chemistry quantum cannot without search chemists molecules exper- Nowadays, than of iments. better properties sometimes predict even experiments, and is without obtain chemistry can quantum of we achievements that great the of One ene uedtstfrcmonst er chem- learn to compounds for dataset huge a need We implement, to hard are experiences and intuitions Such en- search molecular a develop to is goal final Our foundation theoretical a powerful have we Although PACS: th using propertie desired is, Keywords: have goal which molecules final find Our to database. system expert domain) la (public the free http://pubchemqc.riken.jp/. of and one at is public Chemi which to Applied (http://pubchem.ncbi.nlm.nih.gov/) and open g Pure are of optimized InC Union results B3LYP the International refer 6-31G* the only by of we formula entries molecules, calculate million To 1.53 calculations. over have we Abstract. h uCeQ rjc:alreceia aaaefrom database chemical large a Project: PubChemQC The FAvne etrfrCmuigadCmuiain RIKEN, Communication, and Computing for Center Advanced 2F 11.- 31.15.vj 31.15.A-, nti eerh ehv encntutn ag database large a constructing been have we research, this In aaae 3Y,goer optimization geometry B3LYP, Database, INTRODUCTION h rtpicpecalculations principle first the AAAMaho NAKATA 5-18JAPAN 351-0198 gs ntewrd(prxmtl 3mlinmlclsael are molecules million 63 (approximately world the in rgest h oeua aahv entknfo h uCe Project PubChem the from taken been have data molecular The I(nentoa hmclIetfir ersnaino c of representation Identifier) Chemical (International hI s. s aa odvlpamlclrsac nieo molecular or engine search molecular a develop to data, ese ty(UA) hs orfrnet xeietldt.Thes data. experimental to reference no thus, (IUPAC), stry iey oee,ol ml opud r itd(typi- listed are pub- compounds exhaus- small the molecules only in however, enumerate tively, and GDB-17 The databases), domain. other lic and ChEMBL molecules from (contains enough large of is number molecules the registered that are for Compounds reasons PubChem The employing Compounds. PubChem from molecules database curated well activity. pharmacological a and was [9] bio-medical It for ChEMBL popular. also purpose, in is biochemical database molecules For billion database. 166 this are and there chlorine Actually, sulfur, hydrogen. oxygen, nitrogen, carbon, atoms It of seventeen to consist [8]. up GDB-17 molecules possible be for all enumerates would use combinations to largest using hard The database. database thus, public database, a developing proprietary use: secondary a How- is substances. it inorganic ever, and more organic has million It 71 database. than Service would Abstracts documents Chemical the scientific be in published from database pure, of compounds. million non-duplicating and 63 the standardized, over and provides Compound, compound Pubchem PubChem Substance, and Pubchem three BioAssay projects: are Pubchem Pubchem There the molecules. in projects small of activity NIH biological the through provides and (NIH) initiative, Roadmap Health assembled Libraries of Molecular was Institutes database National the This by [7]. project Pubchem the compounds. many for much experiments too real costs doing it for unfortunately, learning; deep very for apparently are small numbers the Still, data. experimental u oli hsppri eeoigadtbs for database a developing is paper this in goal Our largest The available. also are databases Other be would molecule for database public largest The oere n e xie ttsb -1G TDDFT 6-31+G* by states excited ten and eometries fmlclsby molecules of -,Hrsw,Wk-iy Saitama, Wako-City, Hirosawa, 2-1, binitio ab acltos Currently, calculations. hemical isted) e cally, molecular weights are less than 300), and we do (structure-data file). There are approximately 3,000 SDF not want to restrict atoms to carbon, nitrogen, oxygen, files and each file contain almost 25,000 molecules. All sulfur, chlorine and hydrogen. the molecules are almost unique and some molecules are The calculation will complete in seventeen years, even obsoleted and removed. A SDF of PubChem Compound though if we calculate ten thousand molecules per day. contains Our resources are limited to do calculation for thou- • Three dimensional structures without hydrogen via sand to ten thousand molecules per day, therefore our ap- PubChem 3D project[13]. proach may not feasible at the moment. However, we are optimistic about it, as algorithms for quantum chemistry • IUPAC names. have been improving and computer resources increase by • InChI and SMILES representations of molecule Moore’s law. • Molecular weights, We consider our project, which provides molecular in- etc. We actually use the InChI representation of formation by quantum chemistry calculations, is benefi- molecules and the molecular weights. Other information cial as following reasons: are not referenced. • We can provide very accurate results comparable to experiments by the ab initio calculations. • We can perform efficient and low-cost molecular InChI representation of molecule design and search by high throughput screening. The IUPAC International Chemical Identifier is a tex- In our PubChemQC project, the general calculation tual identifier for chemical substances, designed to pro- scheme for each molecule is following: vide a standard and human-readable way to encode • Download molecular information from Pubchem molecular information and to facilitate the search for compound, and we extract molecular information in such information in databases and on the web (from International Chemical Identifier (InChI) represen- Wikipedia). tation [14]. Ethanol is represented as follows: • We generate initial geometry by OpenBABEL and InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 empirical PM3 geometry optimization. • We further do geometry optimization by Hartree- L-ascorbic is represented as follows: Fock in STO-6G basis set, followed by the final InChI=1S/C6H8O6/c7-1-2(8)5-3(9)4(10)6(11)12-5/h2,5,7-8,10-11H,1H2/t2-,5+/m0/s1 geometry optimization by density functional theory (DFT) calculation employing B3LYP functional us- . ing 6-31G* basis set. • Using the final molecular geometry, we calculate ten lowest excited states by time depended den- Details of quantum chemical calculation sity functional theory (TDDFT) calculation using 6- 31+G* basis set. First, we made a file containing compound identi- • Upload the results with input files to fication number (CID) with InChI representation and http://pubchemqc.riken.jp/. molecular weight in each line of PubChem Compound. This file was approximately 7.8G bytes and contained 51,520,346 lines. HOW TO PROCEED WITH THE Then, we sorted by molecular weight in ascending or- der so that earlier lines contain lighter molecules; our CALCULATION calculation started from hydrogen and will end in very heavy molecule. In Figure 1, we show how molecular In the following subsections, we show the outline how weight distributes in PubChem Compound. Interestingly, PubChemQC project has been doing. the sorted CID does not show as log scale. It shows that the number of molecules known to us scale as lin- early whereas we can just enumerate astronomical num- Information stored in PubChem Compound ber of molecules by combinatrics. Roughly saying, we can calculate lighter molecules much faster and much We acquired information of the molecule from Pub- easier than heavier molecules. For heavier molecules, we Chem project[11]. There were 63,159,671 molecules in should consider many conformers, better basis sets and 2015/2/4 [12] and this number has been updated daily. relativistic effects, however, these points are out of scope The molecules at PubChem Project are provided in SDF at this moment. 600 Issues and discussion

500 There are some issues on calculations. 400 The ﬁrst one is issue on theoretical parameters: which 300 basis function, calculation method should be used. Of Molecular Weight 200 course, better basis is better (e.g., cc-pVQZ) and better method is better (e.g., CCSD(T)). However, if we 100 employ such choices, calculation would take extremely

0 0 1e+07 2e+07 3e+07 4e+07 5e+07 6e+07 long time. We believe our choice, 6-31G* basis set and Sorted PubChem Compound ID by Molecular Weight DFT with B3LYP, are fairly good compromise between quality of results and resources we have. Usually, properties obtained by B3LYP functional are comparable to FIGURE 1. Molecular weight distribution of PubChem MP2 calculations. In some cases, we may need to em- Compounds. ploy some more expensive methods. We just left such molecules as calculation failed. The second one is that the resultant molecular struc- There are some compounds not suitable for calcula- tures are not necessarily global minimum energy ones, as tion: a mixture like ionic salts (Ni2+SO2−). These sys- 4 initial structure issued by OpenBABEL from InChI, each η5 tems are omitted. Unfortunately, molecules contains optimization process may not appropriate ones. More- bonding (e.g. Ferrocene) are also not calculated, due to over, there should be conformers and secondary struc- problem in InChI representation. Moreover, we only cal- tures. If we want to make sure that obtained geometries culate molecules containing H, He, Li, Be, B, C, N, O, are really global minimum of the molecule, we need F, Ne, Na, Mg, Al, Si, P, S, Cl, Ar, K, Ca, Sc, Ti, V, Cr, to perform global geometry search like GRRM method Mn, Fe, Co, Ni, Cu and Zn. This is due to limitation of [23], which will take extremely very long time. Never- 6-31G* basis set. theless, we expect current resultant structures are reason- The initial geometry for a molecule was made by able, not so funny. OpenBABEL with “–gen3d -addH” options from InChI The third one is limitation of InChI representation representation. Then, the first geometry optimization cal- of molecule. A clear example for InChI representation culation is performed by PM3 method, resultant geome- does not work is Ferrocene. We may need to extend try was further optimized by Hartree-Fock method using InChI representation. Unfortunately, any representation STO-6G basis set. Finally, we re-optimized by density has its limitation. Most general one may be store in three functional theory of B3LYP functional using 6-31G* ba- dimensional coordinates of nucleus. Of course, we need sis set. The final optimization process has actually three to compromise between usefulness and having many steps; the first step was done by FireFly [21], then the corner cases. second step was done by GAMESS [20], the third step The fourth one is which molecules to be calculated. was again done by GAMESS, since calculation time by The GDB-17 has astronomical number of molecules [? FireFly was substantially faster, and slightly less accu- ], even, these molecules consist of at most seventeen of rate. And why we do optimization by GAMESS twice is atoms. Which molecules considered to be important is to make sure that the final optimization was really final; a difficult problem. One possible answer might be “All only one geometry search should be done. important molecules are listed in PubChem or in CAS Subsequent excited state calculation was done by database, but there should be more”. TDDFT with 6-31+G* basis set employing same geometry as the final optimization of previous calculation. Final results and input files are uploaded at SUMMARY AND OUTLOOK http://pubchemqc.riken.jp/. All calculations are done on RICC supercomputer We described research outline of the PubChemQC (Intel Xeon 5570 (2.93GHz) x 2, 1024 nodes) at RIKEN, project, we have been constructing a database listed in Quest supercomputer (Intel Core2 CPU L7400 @ PubChem Compound by NIH by ab initio calculations. 1.50GHz, 700 nodes) at RIKEN, and Oakleaf-FX su- We only refer chemical formula in InChI of PubChem percomputer (Fujitsu PRIMEHPC FX10, SPARC64 compound, therefore, no reference to experimental data. IX [email protected]) at the University of Tokyo. With Currently, optimized geometries, ground state energies above computational resources, it is possible to calculate and excited state energies of 15.3 million molecules thousand to ten thousand molecules per day. are calculated by GAMESS, FireFly and OpenBABEL, and uploaded to http://pubchem.riken.jp/. We employed calculation method as density functional theory with B3LYP functional and 6-31G* and 6-31+G* basis set. 10. The PubChemQC Project http://pubchemqc. Since the results are presented as input and output riken.jp/ (2014.06.30). files, the reproducible results and reuse Is easy. Within 11. ftp://ftp.ncbi.nlm.nih.gov/pubchem/ the current computational resources, it is possible to Compound/CURRENT-Full/SDF/ (2014.06.30). 12. PubChem Compound https://www.ncbi.nlm. calculate thousand to ten thousand molecules per day. nih.gov/pccompound?term=all[filt] As the future plan, we will construct web site, for (2014.06.30). compound search, and calculation of other properties like 13. E. E. Bolton, J. Chen, S. Kim, L. Han, S. He, W. Shi, V. vibrating structure, NMR chemical shift, and structure Simonyan, Y. Sun, P. A Thiessen, J. Wang, B. Yu, J. Zhang optimization in the excited state and with solvent effects. and S. H Bryant PubChem3D: a new resource for scientists, Journal of Cheminformatics (2011), 3:32. 14. Heller, S.; McNaught, A.; Stein, S.; Tchekhovskoi, D.; Pletnev, I. (2013). “InChI - the worldwide chemical ACKNOWLEDGMENTS structure identifier standard”. Journal of Cheminformatics 5 (1): 7. We are very grateful for Motoyoshi KUROKAWA and 15. NIST Computational Chemistry Comparison and Ryutaro HIMENO, and Masashi Horikoshi for kind Benchmark Database NIST Standard Reference Database advices, maintenance of computers, and providing re- Number 101 Release 16a, August 2013, Editor: Russell D. Johnson III http://cccbdb.nist.gov/ sources. The calculations were performed by using the (2014.06.30). RIKEN Integrated Cluster of Clusters (RICC) facility 16. J. Hachmann, R. Olivares-Amaya, A. Jinich, A. L. and the research is partially supported by the Initiative on Appleton, M. A. Blood-Forsythe, L. R. Seress, C. Promotion of Supercomputing for Young or Women Re- RomÃ ˛an-Salgado, K. Trepte, S. Atahan-Evrenk, S. Er, searchers, Supercomputing Division, Information Tech- S. Shrestha, R. Mondal, A. Sokolov, Z. Bao, and A. nology Center, The University of Tokyo. Aspuru-Guzik. Energy and Environmental Science 7, no. 2 (2014): 698-704. 17. Anderson, E.; Veith, G. D.; Weininger, D. (1987). SMILES: A line notation and computerized interpreter for REFERENCES chemical structures. Duluth, MN: U.S. EPA, Environmental Research Laboratory-Duluth. Report No. EPA/600/M- 1. P. A. M. Dirac, Proceedings of the Royal Society of 87/021. London. Series A, Containing Papers of a Mathematical 18. Noel M OâA˘ ZBoyle,´ Towards a Universal SMILES and Physical Character, Vol. 123, No. 792 (6 April 1929). representation - A standard method to generate canonical 2. Q. V. Le, M. Ranzato, R. Monga, M. Devin, K Chen, G. S. SMILES based on the InChI, Journal of Cheminformatics Corrado, J. Dean, A. Y. Ng, “Building High-level Features 2012, 4:22. Using Large Scale Unsupervised Learning”, ICML, 2012. 19. N. M O’Boyle, M. Banck, C A James, C. Morley, T. 3. A. Krizhevsky, I. Sutskever, G. H. Hinton, “ImageNet Vandermeersch, and G R Hutchison, Open Babel: An open Classification with Deep Convolutional Neural Networks”, chemical toolbox, J. Cheminf. 2011, 3:33. Advances in Neural Information Processing Systems 25 20. “Advances in electronic structure theory: GAMESS a (NIPS 2012). decade later” M.S.Gordon, M.W.Schmidt pp. 1167-1189, 4. F. Seide, G. Li and D. Yu. Conversational Speech in “Theory and Applications of Computational Chemistry: Transcription Using Context-Dependent Deep Neural the first forty years” C.E.Dykstra, G.Frenking, K.S.Kim, Network, in INTERSPEECH, pp. 437-440 (2011). G.E.Scuseria (editors), Elsevier, Amsterdam, 2005. 5. M. Rupp, A. Tkatchenko, K.-R. Müller, O. A. von 21. Alex A. Granovsky, Firefly version 8.0, http: Lilienfeld: Fast and Accurate Modeling of Molecular //classic.chem.msu.su/gran/firefly/ Atomization Energies with Machine Learning, Physical index.html (2014.06.30). Review Letters, 108(5):058301, 2012, G. Montavon, M. 22. Stewart, James J. P. (1989). “Optimization of parameters Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. for semiempirical methods I. Method”. J. Comput. Chem. Tkatchenko, K.-R. Müller, O.A. von Lilienfeld, Machine 10 (2): 209. Learning of Molecular Electronic Properties in Chemical 23. K. Ohno and S. Maeda, A Scaled Hypersphere Search Compound Space, New Journal of Physics, 2013, to appear. Method for the Topography of Reaction Pathways on the 6. NIST Chemistry WebBook http://webbook.nist. Potential Energy Surface. Chem. Phys. Lett. 384(4-6), gov/chemistry/ (2014.06.30). 277-282 (2004). 7. Blton E, Wang Y, Thiessen PA, Bryant SH. PubChem: Integrated Platform of Small Molecules and Biological Activities. Chapter 12 IN Annual Reports in Computational Chemistry, Volume 4, American Chemical Society, Washington, DC, 2008 Apr. 8. Ruddigkeit Lars, van Deursen Ruud, Blum L. C.; Reymond J.-L. J. Chem. Inf. Model., 2012, 52, 2864-2875. 9. ChEMBL: a large-scale bioactivity database for drug discovery” Nucleic Acids Research. Volume 40. Issue D1. Pages D1100-D1107.2011.