
ACCDB: A Collection of Chemistry DataBases for Broad Computational Purposes Pierpaolo Morgante,1 Roberto Peverati1* 1 P. Morgante, R. Peverati Chemistry Program, Florida Institute of Technology, 150 W. University Blvd., Melbourne, FL 32901, United States. E-mail: [email protected] ABSTRACT. The importance of databases of reliable and accurate data in chemistry has substantially increased in the past two decades. Their main usage is to parametrize electronic structure theory methods, and to assess their capabilities and accuracy for a broad set of chemical problems. The collection we present here—ACCDB—includes data from 16 different research groups, for a total of 44,931 unique reference data points, all at a level of theory significantly higher than DFT, and covering most of the periodic table. It is composed of five databases taken from literature (GMTKN, MGCDB84, Minnesota2015, DP284, and W4-17), two newly developed reaction energy databases (W4-17-RE, and MN-RE), and a new collection of databases containing transition metals. A set of expandable software tools for its manipulation is also presented here for the first time, as well as a case study where ACCDB is used for benchmarking commercial CPUs for chemistry calculations. Introduction heats of formation), anharmonicity effects (for vibrational frequencies), vector-relativistic Databases of atomic and molecular energies effects (spin-orbit couplings), etc. Moreover, have always been an important tool in high-level calculated data do not suffer from computational chemistry for the assessment and experimental errors. Albeit they do still suffer parametrization of semi-empirical methods.1,2 from the computational accuracy error of the Large sets of experimental data of small level of theory that is used to obtain them, such molecules became increasingly important with error is easier to assess and to control than the the development of methods for the calculation stochastic experimental error, and in many of heats of formation of molecules, starting from cases, it is preferable. the early days of semi-empirical methods (for a historical perspective see Ref. 3), to the Gaussian The importance of computational chemistry composite methods of Curtiss and co-workers,4 databases containing high-level data has grown and the multi-coefficient correlation methods of exponentially in the last two decades because of Truhlar and co-workers.5 More recently, the the advent of semi-empirical exchange- development of high-accuracy ab initio correlation (xc) functionals in density functional calculation methods has shifted the interest theory (DFT). While the necessity of extremely from databases of experimental data to large databases for parametrization of new xc databases of purely calculated data. The functionals—and the increasing number of advantages of the latter for the assessment and parameters in the functional forms—is a parametrization of semi-empirical electronic controversial topic that is outside the scope of 6,7 structure methods is clear, since they include this work (see for example refs. , and for a 8-10 data that are directly comparable between recent debate, see ), we believe that calculations, without the need of corrections for databases will undoubtedly remain a experimental conditions, such as zero-point fundamental tool for the assessment of the energies (in most cases), thermal corrections (for applicability, accuracy, and reliability of new and 1 existing xc functionals in the years to come, main-group and transition-metals regardless of the underlying wars between thermochemistry, non-covalent interactions, optimization philosophies (from first principle vs. and chemical kinetics. parametrized). It is objectively true that more parameters in the xc functionals require more In the next two section we describe the structure data in the training set (a statistical necessity to of ACCDB, including details for all the databases avoid over-training), but the availability of larger that compose it, and for the automation tools training sets does not necessarily translate into that we developed to simplify the management more parametrized functionals, nor into more of such a large quantity of data. Before the parameters in a functional.11 conclusion, we also present a case study where ACCDB is used to benchmark the performance of The modern use of computational chemistry three commercial CPUs recently introduced onto databases containing a large number of high- the market (as of August 2018) by AMD, for level calculated data is not limited to the routine computational chemistry calculation. parametrization and assessment of semi- empirical composite or DFT methods: other Structure of the database interesting applications include validation of high-accuracy ab initio method,12 benchmarking ACCDB includes five different databases from 21 22,23 of software or hardware,13-15 and the application five different sources: MGCDB84, GMTKN, 24,25 26,27 12 of modern data mining techniques to chemistry, Minnesota, DP284, and W4-17. The such as artificial intelligence and machine data from these sources cover primarily main learning.16 group elements. Some transition metals (TMs) are present in the Minnesota database, but we In light of this increasing importance of large sets significantly expand this number by introducing of calculated data with high accuracy in here a database of reactions involving first-, chemistry, we present here a collection of second-, and third-row TMs, as well as elements almost 45k unique reference data points (with ~ from the second transition. In addition to these 37k of them presented here for the first time), all “traditional” data points, we also introduce here at a level of theory significantly higher than DFT. two new databases obtained using automatic This collection—that we named ACCDB— generation of reaction energies,28 containing includes data from 16 different research groups, further 36,275 “non-traditional” reference data a total of more than 10k atomic and molecular points. structures files, as well as a set of software tools for their manipulation, automation of In total, ACCDB includes 191 subsets and 44,931 corresponding jobs, and statistical analysis. Its data points. A brief description of each database primary difference with respect to other recent is presented below, and a summary is in Table 1. large databases of chemical compounds17-20 is We suggest the user of each pre-existing the high level of accuracy at which our data are database to refer to the original publications for obtained. Another key aspect of ACCDB is its more information on the subsets and to give broad applicability to different areas of proper credit to its primary authors. chemistry, including—but not limited to—both 2 Table 1. Summary of all databases included in ACCDB.a Name of the Number of Unique Reference Brief description of what is included in the database: Ref. database: Structures: Data Points: MGCDB84 Main Group Chemistry DataBase 5,931 4,985 21 General Main Group, Thermochemistry, Kinetics, and Non- GMTKN 2,639 1,664 22,23 covalent interactions and Mindless Benchmarks Thermochemistry, Kinetics, Non-Covalent interactions Minnesota 719 471 24,25,29 (Database2015, Database2015A, and Database2015B) Automatically Generated Reaction Energies from - MN-RE - 9,135 This work Minnesota 2015B DP284 Dipole Moments and Polarizabilities 181 284 26,27 W4-17 Total Atomization Energies 215 1,042 12 - W4-17-RE Automatically Generated Reaction Energies from W4-17 - 27,140 This work Metals&EE Collection of subsetscontaining metals and excitation energies 364 210 30-40 8,656 ACCDB 10,049 (44,931)b [a] For further details of all subsets of each database—and clarifications on the corresponding reference—see the supporting information. [b] The number in parentheses includes the “non-traditional” RE databases. MGCDB84. The Main Group Chemistry DataBase database,23 both from Goerigk’s and Grimme’s has been introduced by Mardirossian and Head- groups. GMTKN55 is the extension of two Gordon.21 It is the largest database included in previously published databases by Goerigk et al., ACCDB, with 84 subsets, 5,931 single-point called GMTKN2446 and GMTKN30.47,48 It includes geometries, and 4,985 reference data. Part of it 55 subsets, 2,459 single-point geometries, and was used as training set for new xc 1,499 relative energies. The four areas of functionals,11,41-43 while it has been used as interest are: non-covalent interactions (both benchmark set in its entirety, for assessing the inter- and intra-molecular interactions, 595 performance of the Minnesota functionals,44 as data), basic properties (atomization energies, well as about 200 more functionals.21 Its main electron and proton affinities, dissociation focus areas are: non-covalent interactions, energies of various compounds, 467 data), which make up almost half of the database thermochemistry (reaction energies and (2,647 data), thermochemistry (1,205 data), isomerization reactions, 243 data), and kinetics isomerization energies (910 data) and kinetics (barrier heights, 194 data). MB08-165 is a (barrier heights: 206 data). Additionally, the database containing 165 “Mindless Benchmark”, electronic energies of the first 18 atoms are also obtained with randomly-generated, artificial included. Quite recently, a statistically significant molecules each containing 8 atoms.23 This version of the MGCDB84 database, called MG8, database was included in both GMTKN24 and has been proposed by Bun Chan.45 GMTKN30, but it was replaced in GMTKN55 by a new database of 43 artificial molecules of 16 GMTKN. This database is composed by the atoms each (MB16-43). Because of its relevance GMTKN55 database,22 and the MB08-165 3 as a chemically “unbiased” test set (the W4-17. W4-1712 is an extension of two previous molecules are not real ones, reducing the databases—namely W4-08,51 and W4-1152— possibility of forecasting the outcome of the developed by Martin’s and Karton’s groups. W4- calculations), we have decided to keep MB08- 17 includes 203 total atomization energies of 165 in our collection, in addition to GMTKN55.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages24 Page
-
File Size-