<<

Downloaded by guest on September 23, 2021 nih notesotoig fDTi h rdcino this of polarizability dipole prediction the in DFT interest. of some of quantity shortcomings offers fundamental approach the -learning decom- into atom-centered our insight The in DFT. hybrid accu- implicit of an position that at exceeds isomers) that (includ- hydrocarbon racy molecules and drugs, nucleobases, larger small acids, 52 carbohydrates, amino molecular systems, of conjugated yielding set challenging density diverse transferable, ing The hybrid a and cost. for of robust computational polarizabilities that is negligible than a model smaller at resultant that (DFT) error magnitude theory an of with functional order molecules molec- an small LR-CCSD the these is predict of to polarizabilities possible approach, is ular it using (LR- machine-learning that theory symmetry-adapted computed demonstrate doubles we a molecules and Using singles organic cluster CCSD). small coupled 7,211 response accurate linear polarizabil- of dipole highly tensors static present the ity we of calculations work, electronic mechanical this underlying quantum the In to description. sensitive of this structure quite as prediction is difficult, more quantity accurate considerably response Compared is an fields. polarizability properties, molecular force the ground-state polarizable in other vital ingredient with a essential plays and an molecules; dispersion; of is signatures and spectroscopic intermolecu- the induction and determining in as intra- role such key applied an interactions, governs to quantity lar response a This in of moment field. tendency dipole electric its the change describes to (received polarizability molecule 2018 26, dipole December approved molecular and PA, The Philadelphia, University, Temple Science, 2018) Molecular 23, Computational September of review Institute for Klein, L. Michael by Edited b a Wilkins M. learning David machine and theory cluster coupled with polarizabilities molecular Accurate www.pnas.org/cgi/doi/10.1073/pnas.1816132116 for predictions reliable considerably and provide accurate to the- more shown doubles been and has (17–19) singles (LR-CCSD) cluster ory provide coupled to response important linear regard, is for it genera- values next (12–16), of benchmark fields development force the polarizable in tion ingredient and key (8–11), and a spectroscopy (5–7) generation represents interactions frequency sum dispersion and and Raman induction that underlies fact simultane- that the be of light must in determining error when incompleteness for accounted set ously correlation basis electron and nontrivial such, underly- effects As the structure. of electronic description mechanical ing quantum the to that sensitive fact the elec- to applied due an of presence field the in changes dipole of molecular description the reliable polarizability and property dipole accurate molecular every an the for instance, cost accurate For equally computational interest. for not of manageable is sufficient a DFT providing is However, at by (1–3). that applications endeavor accuracy this useful an in many with role properties pivotal a ground-state theory played functional has density (DFT) Kohn–Sham materials. and molecules T theory functional density aoaoyo opttoa cec n oeig ntttdsMat des Institut Modeling, and Science Computational of Laboratory eateto hmsr n hmclBooy onl nvriy taa Y14853 NY Ithaca, University, Cornell Biology, Chemical and Chemistry of Department els eaehsse ra rgesi h rtprinci- first the in of progress properties and great stabilities, structures, the seen of evaluation has ples decade last he a eqiedfcl ooti 4.Ti sprimarily is This (4). obtain to difficult quite be can E, | a ahn learning machine nraGrisafi Andrea , α | sarsos rpryta sparticularly is that property response a is α asinpoesregression process Gaussian α eodteacrc fDT nthis In DFT. of accuracy the beyond safnaetlqatt finterest of quantity fundamental a is | esrta ecie how describes that tensor a α, ope lse theory cluster coupled a agYang Yang , o hs esn and reasons these For α. α hnue ncon- in used when b aU Lao Un Ka , | eriaux, ´ b oetA itsoJr. DiStasio A. Robert , cl oyehiu F Polytechnique Ecole ´ ulse nieFbur ,2019. 7, February online Published 1073/pnas.1816132116/-/DCSupplemental. y at online information supporting contains article This Materials the in deposited the been database, have 1 acenes QM7b and the (https://archive.materialscloud.org/2019.0002/v1).y alkenes Archive of Cloud the polarizabilities and dipole dataset, and showcase Coordinates deposition: Data the under Published Submission.y Direct PNAS a is article This interest.y paper. of y the conflict wrote no M.C. declare and authors R.A.D., research; The K.U.L., Y.Y., and designed A.G., R.A.D., D.M.W., K.U.L., Y.Y., M.C. and A.G., data; D.M.W., and analyzed research; M.C. R.A.D., performed K.U.L. K.U.L., and Y.Y., Y.Y., A.G., A.G., D.M.W., D.M.W., contributions: Author opiain,asmer-dpe asinpoesregression process these Gaussian as avoid symmetry-adapted To a well decomposition. complications, cum- fragment molecules—as model a inelegant the require flexible transferable and would of bersome for thought a of frame suitable line obtain compounds—this be reference different to also the However, would in that 32). written com- (31, the tensor learning molecule by the achieved of easily is ponents this predicted the molecules, of the rigid symmetries to For nature, the challenge to tensorial according stone. its additional form stepping to an a Due poses ML. as computation- however, method more polarizability, but structure The reached accurate and electronic be less 28) efficient can a achieved accuracy (27, ally using be (30) when properties can cluster easily coupled molecular DFT more or many than) accu- (29) better of DFT that that even prediction shown (or been the with has in par it on particular, molecular racy structure In of electronic (24–26). prediction complementing the methods or to substituting approach alternative properties, an even as prohibitive tion quite as become few can as with which molecules treating size), sub- when power system a sixth the the by with accompanied of (scaling cost is computational set prediction larger basis a stantially one-particle such (diffuse) However, large (20–23). sufficiently a with junction owo orsodnemyb drse.Eal itsocreleuo michele. or [email protected] Email: addressed. be ceriotti@epfl.ch. may correspondence whom To hrfr eoe osdrbeosal oacrt and accurate to matter. of obstacle modeling atomistic-based considerable reliable a and atoms, removes of therefore dozens containing molecules the of estimating polarizabilities for strategy transferable is and that inexpensive, rate, error model an machine-learning with a tensor 50% about polarizability for the data predict training can that as highly well theory as The functional density (DFT) hybrid use. of accuracy much-needed routine the a of for provide benchmark herein demanding reported calculations too accurate often are methods chemistry evalua- that quantum accurate requires principles Its first from experiments. tion cen- many is of and interpretation materials techniques, modeling the and phenomena, physical molecules several to of tral polarizability dipole The Significance ntels e er,mcielann M)hsgie trac- gained has (ML) learning machine years, few last the In mle hnDT hsfaeokpoie naccu- an provides framework This DFT. than smaller y PNAS NSlicense.y PNAS ed ´ rl eLuan,11 asne wteln;and Switzerland; Lausanne, 1015 Lausanne, de erale | ´ eray2,2019 26, February b,1 n ihl Ceriotti Michele and , | o.116 vol. www.pnas.org/lookup/suppl/doi:10. 10 SO − 15 (3) | atoms. a,1 o 9 no. oaingroup. rotation α uttrans- must | 3401–3406

CHEMISTRY (SA-GPR) scheme has recently been derived to naturally incor- LR-CCSD results, which will be referred to as DFT and cou- porate this SO(3) covariance into an ML scheme that is suitable pled cluster singles and doubles theory (CCSD), respectively, to predict tensorial quantities of arbitrary order (33). In this throughout the remainder of the manuscript. paper, we present comprehensive coupled cluster-level bench- marks for the polarizabilities of the ∼ 7, 000 small organic Improved SA-GPR. The formalism underlying the SA-GPR molecules contained in the QM7b database (34). We use scheme in general and the λ-SOAP (smooth overlap of atomic these reference calculations to assess the accuracy of different positions) descriptors on which our model is based have been hybrid DFT schemes and train an SA-GPR–based ML scheme introduced elsewhere (33) and are summarized in Materials and (AlphaML) that can accurately predict the polarizability tensor Methods. In this work, we include several substantial improve- with nominal computational cost. We then test the extrapolative ments that increase the accuracy and speed of the SA-GPR prediction capabilities of AlphaML on a showcase dataset com- model, and these are worth a separate discussion. For one, eval- posed of 52 larger molecules and demonstrate that this approach uation of the λ-SOAP representation is greatly accelerated by provides a viable alternative to state-of-the-art and computa- choosing the most significant few hundred spherical harmonic tionally prohibitive electronic structure methods for predicting components (of several tens of thousands) using farthest point molecular polarizabilities. sampling (FPS) (43). The calculation of the kernel in Eq. 1 can be carried out with essentially the same result as if all Results components were retained but with a much lower computa- Electronic Structure Calculations. The QM7b database (26, 34, 35) tional cost. A second improvement is the generalization of the includes N = 7,211 molecules containing up to seven “heavy” λ-SOAP kernels beyond the linear kernels used in ref. 33. It atoms (i.e., C, N, O, S, Cl) with varying levels of H satura- has been shown that, in many cases, taking an integer power tion. This dataset is based on a systematic enumeration of small of the scalar SOAP kernel improves the performance of the organic compounds (35) and contains a rich diversity of chemical associated ML model. This can be understood in terms of the groups, making it a challenging test of the accuracy associated order (two body, three body,. . .) of the interatomic correla- with DFT and quantum chemical methodologies. DFT-based tions that are described by different kernels (44, 45). In the molecular polarizabilities were obtained by (numerical) differ- tensorial case, one should be careful, as the linear nature of entiation of the molecular dipole moment, µ, with respect to the kernel is essential to ensure the correct covariant behav- an external electric field E using the hybrid B3LYP (36, 37) ior. To include nonlinearity and increase the order of the model and SCAN0 (38) functionals. Reference molecular polarizabil- without affecting the symmetry properties, we multiplied the ities were obtained using LR-CCSD. To account for basis set λ > 0 kernels by the scalar λ = 0 kernel raised to the power of incompleteness error, which can be even more important than ζ − 1 as in Eq. 2. Finally, we combined multiple kernels com- higher-order electron correlation effects in an accurate and reli- puted with different environment radii, rc, which have been able determination of α (21–23, 39), we used the d-aug-cc-pVDZ shown to be beneficial in the scalar case (30). Together, these basis set (39) for all calculations herein. Although this double-ζ improvements halve the error on QM7b as discussed in detail in basis set has only a moderate number of polarization func- SI Appendix. tions, augmentation with an additional set of diffuse functions almost always increases the convergence of α with respect to Learning on the QM7b Database. These highly accurate reference aug-cc-pVDZ (21, 22, 39–41). The alternative choice of retain- CCSD calculations and the SA-GPR scheme lay the foundation ing a single set of diffuse functions and simply increasing the for a transferable model to predict molecular polarizabilities. angular momentum by using the slightly larger aug-cc-pVTZ In this first incarnation of the AlphaML model, we use the basis set yields α values of comparable quality to d-aug-cc- reference DFT and CCSD calculations on the QM7b set for pVDZ (SI Appendix) (21, 22, 39–41), albeit with a significant training (34). As a first verification of its performance, we com- increase in the computational effort required to treat the entire puted learning curves for the DFT and CCSD polarizabilities of QM7b dataset. A more detailed description of the electronic the QM7b dataset. We used up to 5,400 structures for training structure calculations performed in this work is in Materials with subsequent assessment of the accuracy and reliability of the α and Methods. AlphaML model in the prediction of for the 1,811 structures that were not included in the training set. The structures were To enable comparisons between molecules of different sizes, added to the training set according to their FPS order (43) (i.e., all error estimates (explicit expressions for which are given starting from the most diverse configurations). This procedure is in Materials and Methods) are computed based on molecular representative of an efficient learning strategy that aims to obtain polarizabilities divided by the number of atoms, ni , contained uniform accuracy with the minimum number of reference calcu- within a given molecule. On the QM7b database, the popular lations (30). Using the best kernel hyperparameters (as described B3LYP hybrid DFT functional predicts α with a mean signed in SI Appendix), we trained a model to learn the CCSD polariz- error (MSE) of 0.259 a.u., a mean absolute error (MAE) of abilities. We report ML errors in terms of the percentage of the 0.302 a.u., and a root mean square error (RMSE) of 0.404 intrinsic variability of the CCSD dataset (σ = 2.216 a.u. per a.u. with respect to the reference LR-CCSD values. These CCSD atom) so as to provide a direct measure of the learning perfor- errors, which include both scalar and anisotropic contributions, mance. As illustrated by the learning curves in Fig. 1, using up are quite substantial and correspond to 18.3% of the intrin- to 75% of the QM7b database for training yields a 2.5% RMSE sic variability within the QM7b database, defined as σCCSD = 2 with respect to σCCSD in predicting CCSD polarizabilities. 1 P (CCSD) (CCSD) 2 1/2 [ N i kαi − hα ikF /ni ] . The large MSE value To get a clearer idea of the accuracy associated with these obtained with B3LYP indicates a systematic overestimation of ML-based predictions, one can compare these values against α by this functional (4, 42); results from the SCAN0 hybrid func- hybrid DFT. Using the same metric, the intrinsic error of tional show a substantially reduced MSE of 0.059 a.u. Despite DFT is 18% of σCCSD in the prediction of CCSD polarizabil- the smaller systematic overestimation of α in comparison with ities. This demonstrates that an ML model based on SA-GPR B3LYP, the statistical errors obtained with SCAN0 are still quite can yield polarizabilities with an accuracy that is approximately large, with computed MAE (RMSE) values of 0.217 (0.316) a.u. one order of magnitude greater than DFT. At the same time, From the ML point of view, the AlphaML model presented the corresponding DFT polarizabilities can be learned with herein performs almost equally well for B3LYP and SCAN0. an error of 3.2% of σCCSD. As seen in other cases (29, 30), For this reason, we focus our discussion on the B3LYP and highly accurate quantum chemistry calculations are smoother

3402 | www.pnas.org/cgi/doi/10.1073/pnas.1816132116 Wilkins et al. Downloaded by guest on September 23, 2021 Downloaded by guest on September 23, 2021 usata eraei cuay hc st eexpected be to is which accuracy, in observe decrease we accuracy, substantial for DFT the here than a the of better observed polarizabilities with using that molecules CCSD when showcase predicts to model AlphaML similar the While is B3LYP. of which behavior functional, the SCAN0 discuss further the we in the of baseline reduction However, the additional polariz- DFT. as using CCSD DFT simply predict of than use to accurate model more again AlphaML is we abilities the section, previous using the that in note seen components As both efficiency. the learns similar in AlphaML with that that with demonstrates comparable this response trace, anisotropic the in error an hnuigDTt prxmt CD al lobreaks also 1 Table the CCSD. made into error approximate error the the to down as DFT well using as when AlphaML using molecules for the showcase test by periphery challenging spanned a the space constitute at compound therefore AlphaML. are and chemical molecules dataset of QM7b these portion of the of many Appendix, SI uloae,du oeue,croyrts n 3ioesof show- isomers a acids, 23 QM7b amino and in entire includes C carbohydrates, which molecules, polarizabilities the molecules, drug large extrap- the on nucleobases, 52 this of predicted model dataset in then this case AlphaML trained and predict of we database to behavior environ- regime, one the each olative allows test for feature To polarizabilities molecules. This predicted are of ker- (30). AlphaML sum environmental ment by a of predicted as polarizabilities average given the an that means as nels molecules two Molecules. between Larger to Extrapolation observed that as SCAN0 for B3LYP. structure In for accuracy electronic target similar calculation. the showing of is DFT method, details AlphaML the baseline of to performance insensitive a the rather that performing error demonstrate of prediction we the Appendix, cost reduce the further to at way a provides therefore ikn tal. et Wilkins of learning CCSD direct learn to the of baseline factor to a additional as an relative (29, by DFT itself error of the quantity reduces use raw polarizabilities the the instance, learning For than 30). error smaller much com- as in correction to a referred theory, of monly levels different methods, between correction approximate more than learn to DFT. like easier slightly the of and fraction a as as 1,811 well RMSE of as the polarizability, consists CCSD DFT shows set the testing of axis or variability The right-hand intrinsic CCSD two. the either the and between using molecules, (∆) calculated difference the database for QM7b the in 1. Fig. 8 nTbe1 eso h MEerr npredicting in errors RMSE the show we 1, Table In h lhM oe a lob rie oeaut the evaluate to trained be also can model AlphaML The H n erigcre o h e tmplrzblte ftemolecules the of polarizabilities atom per the for curves Learning temlcl e sin is key molecule (the ∆ ∼ λ erigta sotnfudt result to found often is that learning 20 0 = − and 30% .A icse in discussed As Appendix). SI u ento ftekernel the of definition Our λ ∆ α nteerr In error. the in 2 = lann es ed oan to leads sense -learning CCSD σ CCSD opnnsof components Fg 1). (Fig. . α IAppendix, SI ∆ o larger for α learning with α; o the for ∼ 2× SI al .RS ntepeito fteprao polarizabilities atom Method per the of molecules prediction showcase the 52 in of RMSE 1. Table nanotechnological prototypical this For well. gap remarkably sizable forms a with molecules C For (like CCSD. pre- and mechanical DFT quantum divergent reference) like and from methods, far unphysical (but costlier the of avoid dictions are completely AlphaML and of predictions stable the gap, HOMO-LUMO vanishing the on demonstrated we that dataset. showcase predictions transferable accu- the and obtaining for rate instrumental local also the (i.e., is physics, AlphaML nonlocal dataset of limita- and structure collective QM7b a learn is to the trying this when in Although tion respectively). included benzene, those s-trans and than hexatriene the larger for 3, are value Fig. that constant in therefore a shown AlphaML of by to As size predicted saturate set. the polarizabilities training and carbon the per domains the in local included the that molecules of behavior struc- the range (nonlocal) represent the collective beyond to any extends AlphaML, environments disregards like atomic quan- tacitly framework local reference tures, ML on of An relies source ML. the which for as data reliable chemical like longer tum methods and no not practice, 48), are are (47, In CCSD polarizabilities that wavefunction. divergent to CCSD) multireference leads a and this on DFT based (like elec- explicitly for methods challenge structure significant a tronic represent states, (46). systems character these near-metallic multireference such, strong to As exhibit to lead known are hydrocarbons which conjugated (highest in HOMO-LUMO model. sys- gaps AlphaML narrowing orbital) molecular the these the unoccupied orbital-lowest of molecular in CCSD, occupied structure and physics inherent DFT the underlying For as nonlocal the well the of as reflects tems nature octatetraene) collective (like and alkenes for comparatively observed exhibit also QM7b, dra- in AlphaML errors. large represented which poorly for which structures, are alkenes, except Sulfur-containing reference DFT. polarizable DFT outperforms the matically highly than the accurate less polarizability, for slightly the be of to component tend tensorial of the degree pre- for ML dictions high complex The predictions. more accurate a underly- ensure and to by cutoffs molecules the reference larger characterized systems, requires is which these delocalization, structure For alkenes electronic long-chain nucleobases. as ing purine such predomi- actually conjugation, the occur of are and degree that errors those large errors particularly Large a compounds, the the show molecules. polarizable in that highly most for molecules shows nantly for individual 2 small of Fig. very errors dataset. the showcase analyzing by the in detail molecules larger the to extrapolated dataset. is showcase model the when CCSD/ML xrse naoi nt au)prao n rkndw noteerrors the is into RMSE down total broken (λ The and scalar database. atom the QM7b per with (a.u.) associated full predic- units the ML atomic on All in training polarizabilities. expressed polar- on DFT DFT based and are and CCSD tions CCSD the between predicting differences in the errors AlphaML. the using give izabilities DFT/ML and CCSD/ML CCSD/DFT CS-F)M 0.181 ∆(CCSD-DFT)/ML DFT/ML vnwe tcmst hlegn ojgtdsseswt a with systems conjugated challenging to comes it when Even AlphaML and CCSD, DFT, between discrepancy large The more in AlphaML of performance the investigate can We CDDTdntstedsrpnybtenCS n F aus while values, DFT and CCSD between discrepancy the denotes CCSD/DFT 60 ,tennoaiyi esptooia,adApaLper- AlphaML and pathological, less is nonlocality the ), PNAS MERS (λ RMSE RMSE 0.244 0.573 0.302 | CS-F)M ie h ro npredicting in error the gives ∆(CCSD-DFT)/ML = eray2,2019 26, February )adtnoil(λ tensorial and 0) 0.083 0.348 0.143 0.120 | = = o.116 vol. )RS (λ RMSE 0) )cmoet of components 2) lee n acenes and alkenes | o 9 no. 0.161 0.456 0.266 0.212 | α α. 3403 = (2) 2) ,

CHEMISTRY αi ; a large fraction of the polarizability of −OH and −COOH groups is assigned to the environments centered around nearby H and C atoms, but O atoms systematically contribute another anisotropic term oriented perpendicularly to the highly polariz- able lone pairs (e.g., fructose as well as the carboxyl group in the amino acids). The sulfur-centered environments in cysteine and methionine have the largest contribution to the total polar- izability in the showcase set and exhibit a strongly anisotropic response. All of these examples suggest that AlphaML can use relatively local structural information to determine an atom- centered decomposition of α that encodes nontrivial quantum mechanical contributions from each functional group (or moiety) contained within a given molecule. It is this ability to predict such an environment-dependent decomposition of α that underlies the observed better than DFT performance of AlphaML when faced with the often insurmountable challenge of transferability to a sector of chemical compound space that contains molecules that are quite distinct and notably larger than those included in the training set. A similar atom-centered decomposition can also be performed in the context of ∆ learning, revealing the molecular features that are associated with the most substantial errors of the approximate methods. As shown in SI Appendix, Fig. 2. RMSE made in approximating the λ = 0 (Lower) and λ = 2 (Upper) this approach reveals how the large discrepancy between DFT components of the per atom polarizability in the showcase dataset. The x and CCSD for alkenes is associated primarily with the extended axis corresponds to the numerical indices provided in the showcase molecule key in SI Appendix, and the vertical lines show the partitioning of the conjugate system. dataset into the different groups outlined in the same figure. Red squares show the ML error, blue circles show the error made in using DFT to approx- Discussion imate CCSD, and black crosses show the error made when ∆ learning the Polarizability calculations with traditional quantum chemical CCSD correction with respect to DFT. methods have always implied a tradeoff between accuracy and computational cost. While CCSD calculations give more accu- rate predictions for the polarizabilities of molecules (especially system, ML predictions are within 10% of DFT and CCSD large molecules) than DFT with various functionals (4, 21), the results and within the range of experimental values, despite the associated computational cost can be prohibitive. In our case, extrapolation to a system size that is one order of magnitude the CCSD calculations for the largest molecules in the show- larger than the molecules in the training set. case dataset required thousands of central processing unit (CPU) hours and approximately 500 GB of RAM. In this paper, we have Atom-Centered Environmental Polarizabilities. The atom-centered demonstrated that the AlphaML framework, which combines structure of AlphaML provides a natural additive decomposi- SA-GPR with λ-SOAP kernels and CCSD reference calcula- α P α tion of into a sum of local terms, i i , which can be used tions on small molecules, allows us to sidestep these expensive to better understand how different functional groups contribute calculations and obtain results with an accuracy that almost to the molecular polarizability. Unlike other methods for decom- posing the polarizability, such as an atoms-in-molecules scheme (52) or a self-consistent decomposition (53), the approach used in this section does not require any additional calculations on top of the molecular polarizability, as the atom-centered polar- izabilities are obtained as a by-product of the local nature of the SA-GPR scheme. When interpreting the αi , one should keep in mind that each term corresponds to the contribution from the entire atom-centered environment, and the way that the polar- izability is split between neighboring atoms is entirely inductive, reflecting the interplay between data, structure (as represented by the kernels), and regression rather than explicit physico- chemical considerations. For instance, a few atoms within the showcase dataset (in particular, several H environments) have αi with negative eigenvalues, which reflects the fact that they reduce the dielectric response of the functional group to which they belong. With this in mind, one can recognize physically meaningful features in the magnitude and anisotropy of the αi . Fig. 4 depicts eight representative examples. Comparing saturated and unsatu- rated hydrocarbons (e.g., 2,2-dimethylhexane, cis-4-octene, and octatetraene), one sees that AlphaML predicts the contribu- tion from the unsaturated carbon atoms to be large and very Fig. 3. Polarizability per C atom for the series of s-trans alkenes (from C6H8 to C22H24) and acenes (from benzene to pentacene) as well as (C60). anisotropic, which is consistent with the higher degree of elec- The reference CCSD results for anthracene and tetracene were taken from tron delocalization along conjugated molecules. Similarly large ref. 49, and the reference CCSD result for C60 was taken from ref. 50. The and anisotropic contributions are associated with aromatic sys- green squares (and error bars) indicate the experimental measurements for tems as seen, for example, in guanine and the indole ring of C60 (51). Results are provided from DFT and CCSD calculations as well as the tryptophan. Oxygen atoms are associated with a very anisotropic corresponding AlphaML models.

3404 | www.pnas.org/cgi/doi/10.1073/pnas.1816132116 Wilkins et al. Downloaded by guest on September 23, 2021 Downloaded by guest on September 23, 2021 nFg )wr eae olwn h rtclue o Mb(4.All (34). a QM7b using for method used finite-field the protocol showcase with the 52 computed following All were relaxed 35). molecule polarizabilities were (34, fullerene DFT and 3) database series, acene used were Fig. QM7b series, geometries alkene (38) in the the molecular as from functional well the (as taken SCAN0 of molecules were All the ML (55). with v5.0 for Psi4 calculations Q-Chem with with DFT performed were performed while calculations LR-CCSD (54), all v1.1 and 37) (36, functional Calculations. Principles First Methods and Materials from obtained be can that experiments. insight of the power increasing predictive and the simulations improving thereby gener- frequency spectroscopies, sum ation and Raman well evaluate as computationally simulations to atomistic as for fields force polarizable accurate anisotropic fully eventually, the of atom-centered and estimates inexpensive polarizabil- of systems divergent availability the molecular The predict phases. have complex condensed to increasingly These to possible for gaps. it CCSD ity HOMO-LUMO make and will vanishing DFT improvements to alkenes, by due conjugated predicted polarizabilities of series for are challenging analysis need the which an of The by behavior model. the underscored the of is developments into include fundamental to physics efforts these nonlocal more as and well as collective predictions, the the estimate underlying in to the uncertainty methods of of incorporation accuracy will calculations, the reference in work molecules larger improvements include future oligomers, to set and training molecules, the of small extensions on focus of polarizabilities learning these that scheme. than decomposition mind rather atoms-in-molecules in environments an keep chemical should to one correspond however, contributions so, and doing delocalized In from large tems. originate the that instance contributions for anisotropic revealing considerations, ochemical of of predictions decomposition by ML atom-centered an improved the The with systematically set. be compounds training can the larger and extending predict DFT to organic rivals small used that of accuracy be database can a cost. it on computational trained the molecules, was of model fraction this a at Although but DFT exceeds always has scale) to drawn not is (which key positive figure are details. The they eigenvalues additional (red). and corresponding negative shown, the or are whether (black) axes on based principal colored The are eigenvalue. along aligned corresponding are the ellipsoids of The of molecules. axes showcase principal of the selection a for sor 4. Fig. ikn tal. et Wilkins aigsontepoieo h lhM rmwr by framework AlphaML the of promise the shown Having rdce tmccnrbtost h oa CDplrzblt ten- polarizability CCSD total the to contributions atomic Predicted α i n hi xeti rprinlt h qaeroot square the to proportional is extent their and , α nti ok F acltoswt h B3LYP the with calculations DFT work, this In a eitrrtdi em fphysic- of terms in interpreted be can α ilb sflt einmore design to useful be will π sys- rtl nteetocmoet,since components, two these on arately (i steps. scalar following the the nents: on tensor, based ity is polarizability the SA-GPR. sizes. different of molecules between comparison the polarizabilities, MSE containing anisotropic of and (each scalar sets both tures two includes and Given invariant components. rotationally is that way Assessment. P Error (56). al. et Yang polarizabilities CCSD in used, was given method as finite-field core obtained the frozen were where The cases computa- LR-CCSD. the in by large In scf needed listed prohibitively direct disk) and as the and approximation 28 to (memory (orbital and resources due the tional using 26, method obtained 25, finite-field were molecules 23, unrelaxed) these 21, for for include polarizabilities 20, except (which CCSD dataset 19, LR-CCSD, showcase the using 18, in molecules calculated molecules largest eight were the polarizabilities of of those size CCSD step All a a.u. with formula difference central weights (v the of nature linear properties, the symmetry correct preserve and the To enforce use it to we power. crucial normalizing integer is which an by kernels, λ-SOAP to three-body introduced it to be up raising can correlations then correlations atomic (iv describe Many-body (43). can terms. procedure kernel sampling SOAP linear farthest-point The a descriptors, with the selected of automatically shorthand components the harmonic spherical use we where order of components (iii 33. r X components vector √ [ ainlSproptn etrPoetI 83adteEF Scientific EPFL Contract the Energy and Swiss of s843 from Department ID grant sup- US a Project Center. Computing by is Center the Leader- supported Supercomputing which of also Argonne National was Laboratory, Science the work National of This of DE-AC02-06CH11357. Argonne Office resources by at used through ported Facility University research Computing K.U.L., Cornell Y.Y., This ship from Nanoscience. support funding. A.G. Molecular partial startup from 677013-HBMAP. for acknowledge R.A.D. Center support Grant and (MPG-EPFL) acknowledge 2020 Planck– Lausanne Max M.C. de Horizon the from and funding Council acknowledges D.M.W. Research discussions. European helpful for ACKNOWLEDGMENTS. at available also is for framework, tool AlphaML prediction online An and environments. basis, representative the as used environments which in c N 1 ftecnrlatom central the of , j o ahcmoetof component each For ) 2 i P ,j n ecieitrtmccreain ihnapecie uofradius, cutoff prescribed a within correlations interatomic describe and h ∈{x ≡ α i h aekre ewe w niomns utbet er tensor learn to suitable environments, two between kernel base The ) xy

,y N 1 α , w ,z h AGRfaeokue eent ul nM oe for model ML an build to herein used framework SA-GPR The ` α IAppendix SI N P } i 2 k yz − α µ sdcmoe noisirdcbe(elshrcl compo- spherical) (real irreducible its into decomposed is α, stetann set, training the is = i , (kα ij 2 htaedtrie yotmzn h loss the optimizing by determined are that α k α oass h cuayo oaiaiiyestimate polarizability a of accuracy the assess to , µ,A∈N µj λ i 0 xz X

k ,µ i , k F 2 µµ λ,ζ 0 2α F /n α α k eueteFoeisnr,dfie as defined norm, Frobenius the use We − 0 hαnl (0) ≡

zz PNAS i 2 (X = h oaiaiiydt eeae a efudin found be can generated data polarizability The . α ste endas defined then is λ, ]

−α 1/2 k ( = j ∂ µ (λ) 2 α j h ento fteecmoet sgvni ref. in given is components these of definition The . h uhr hn ei ui n ihe Willatt Michael and Musil Felix thank authors The µµ λ , √ α 2 X n xx i 0 k rosaedfie naprao ai osimplify to basis atom per a on defined are Errors . U (A) 3 α

0 0 i j λ ebidakre ig ersinmdlwith model regression ridge kernel a build we α, k −α n ,k /∂ (X | F xx ) 0 tm) edfietefloigquantities: following the define we atoms), )/n ← ← l − + 0 j eray2,2019 26, February E yy yewr sddrn l CDcalculations. CCSD all during used were type , |X M 2 X i k k k α j X , diinldtiso h acltosare calculations the of details Additional . MAE ; ∈A {J ∈M j λ µµ λ j k (2) ,k sa(osbysas)sto representative of set sparse) (possibly a is α ,λµ yy ) } . xx = 0 + w (X −α i oidct usto h possible the of subset a indicate to r 2 K X k α {J r optdfrec environment each for computed are ≡ MM µ j

zz , yy } 0 kα k X http://alphaml.org. (k hX N ) 1 j λ i ,j / k stemti fkresbetween kernels of matrix the is n a opt h MEsep- RMSE the compute can One . j λ P

,k ) √ j ,λµ F 2 F k ) µµ n h five-vector the and 3

i = cl oyehiu F Polytechnique Ecole 00 0 ´

k |J δ (X 0 | |α α k λ E ihJ

,k 2 i j α (0)

= o.116 vol. , − + X |X F i | 1.8897261250 .

F 2 α k σ and αnl k ) + 2 i 0 ,λµ ζ

w ahpolarizabil- Each ) −1 |α α F T 0 α /n 0 ; K i ae nthe on based α, (2) | n ? i 0 MM 0 for , i IAppendix SI , | l o 9 no. n RMSE and ; F 2 0 (ii . w, i htare that , ) × N kα λ-SOAP ed ´ α | α 10 struc- erale ´ 3405 (2) na in −5 F 2 [1] [2] [3] ≡ = = ). )

CHEMISTRY 1. Engel E, Dreizler RM (2011) Density Functional Theory: An Advanced Course (Springer, 28. Faber FA, et al. (2017) Prediction errors of learning models lower Berlin). than hybrid DFT error. J Chem Theory Comput 13:5255–5264. 2. Burke K (2012) Perspective on density functional theory. J Chem Phys 136:150901. 29. Ramakrishnan R, Dral PO, Rupp M, von Lilienfeld OA (2015) Big data meets quan- 3. Lejaeghere K, et al. (2016) Reproducibility in density functional theory calculations of tum chemistry approximations: The ∆-machine learning approach. J Chem Theory solids. Science 351:145–152. Comput 11:2087–2096. 4. Hait D, Head-Gordon M (2018) How accurate are static polarizability predictions from 30. Bartok´ AP, et al. (2017) Machine learning unifies the modeling of materials and density functional theory? An assessment over 132 species at equilibrium geometry. molecules. Sci Adv 3:e1701816. Phys Chem Chem Phys 20:19800–19810. 31. Bereau T, Andrienko D, von Lilienfeld OA (2015) Transferable atomic multipole 5. Stone A (1997) The Theory of Intermolecular Forces, International Series of machine learning models for small organic molecules. J Chem Theory Comput Monographs on Chemistry (Clarendon, Oxford, United Kingdom). 11:3225–3233. 6. Hermann J, DiStasio RA, Jr, Tkatchenko A (2017) First-principles models for van der 32. Liang C, et al. (2017) Solvent fluctuations and nuclear quantum effects modulate the Waals interactions in molecules and materials: Concepts, theory, and applications. molecular hyperpolarizability of water. Phys Rev B 96:041407. Chem Rev 117:4714–4758. 33. Grisafi A, Wilkins DM, Csanyi´ G, Ceriotti M (2018) Symmetry-adapted machine 7. Grimme S (2014) Dispersion interaction and chemical bonding. The Chemical Bond: learning for tensorial properties of atomistic systems. Phys Rev Lett 120:036002. Chemical Bonding Across the Periodic Table, eds Frenking G, Shaik S (Wiley-VCH, 34. Montavon G, et al. (2013) Machine learning of molecular electronic properties in Hoboken, NJ), pp 477–500. chemical compound space. New J Phys 15:095003. 8. Shen YR (1989) Surface properties probed by second harmonic and sum-frequency 35. Blum LC, Reymond JL (2009) 970 million druglike small molecules for virtual screening generation. Nature 337:519–525. in the chemical universe database GDB-13. J Am Chem Soc 131:8732–8733. 9. Luber S, Iannuzzi M, Hutter J (2014) Raman spectra from ab initio molecular dynamics 36. Becke AD (1993) Density-functional thermochemistry. III, the role of exact exchange. and its application to liquid s-methyloxirane. J Chem Phys 141:094503. J Chem Phys 98:5648–5652. 10. Morita A, Hynes JT (2000) A theoretical analysis of the sum frequency generation 37. Stephens PJ, Devlin FJ, Chabalowski CF, Frisch MJ (1994) Ab initio calculation of vibra- spectrum of the water surface. Chem Phys 258:371–390. tional absorption and circular dichroism spectra using density functional force fields. 11. Medders GR, Paesani F (2016) Dissecting the molecular structure of the air/water J Phys Chem 98:11623–11627. interface from quantum simulations of the sum-frequency generation spectrum. 38. Hui K, Chai JD (2016) Scan-based hybrid and double-hybrid density functionals from Chem Phys Lett 138:3912–3919. models without fitted parameters. J Chem Phys 144:044114. 12. Sprik M, Klein ML (1988) A polarizable model for water using distributed charge sites. 39. Woon DE, Dunning TH, Jr (1994) Gaussian basis sets for use in correlated molecu- J Chem Phys 89:7556–7560. lar calculations. IV. calculation of static electrical response properties. J Chem Phys 13. Fanourgakis GS, Xantheas SS (2008) Development of transferable interaction poten- 100:2975–2988. + tials for water. v. extension of the flexible, polarizable, thole-type model potential 40. Christiansen O, Hattig¨ C, Gauss J (1998) Polarizabilities of CO, N2, HF, Ne, BH, and CH (TTM3-F, v. 3.0) to describe the vibrational spectra of water clusters and liquid water. from ab initio calculations: Systematic studies of electron correlation, basis set errors J Chem Phys 128:074506. and vibrational contributions. J Chem Phys 109:4745–4757. 14. Ponder JW, et al. (2010) Current status of the AMOEBA polarizable force field. J Phys 41. Reis H, Papadopoulos MG, Avramopoulos A (2003) Calculation of the microscopic Chem B 114:2549–2564. and macroscopic linear and nonlinear optical properties of acetonitrile. I. Accurate 15. Medders GR, Babin V, Paesani F (2014) Development of a “first-principles” water molecular properties in the gas phase and susceptibilities of the liquid in onsager’s potential with flexible monomers. III. Liquid phase properties. J Chem Theory Comput reaction-field model. J Phys Chem A 107:3907–3917. 10:2906–2910. 42. Karne AS, et al. (2015) Systematic comparison of DFT and CCSD dipole moments, 16. Bereau T, DiStasio RA, Jr, Tkatchenko A, von Lilienfeld OA (2018) Non-covalent polarizabilities and hyperpolarizabilities. Chem Phys Lett 635:168–173. interactions across organic and biological subsets of chemical space: Physics-based 43. Imbalzano G, et al. (2018) Automatic selection of atomic fingerprints and reference potentials parametrized from machine learning. J Chem Phys 148:241706. configurations for machine-learning potentials. J Chem Phys 148:241730. 17. Monkhorst HJ (1977) Calculation of properties with the coupled-cluster method. Int J 44. Bartok´ AP, Kondor R, Csanyi´ G (2013) On representing chemical environments. Phys Quantum Chem 12:421–432. Rev B 87:184115. 18. Koch H, Jørgensen P (1990) Coupled cluster response functions. J Chem Phys 93:3333– 45. Glielmo A, Zeni C, De Vita A (2018) Efficient nonparametric n-body force fields from 3344. machine learning. Phys Rev B 97:184307. 19. Christiansen O, Jørgensen P, Hattig¨ C (1998) Response functions from Fourier compo- 46. Voloshina E, Paulus B (2014) First multireference correlation treatment of bulk metals. nent variational perturbation theory applied to a time-averaged quasienergy. Int J J Chem Theory Comput 10:1698–1706. Quantum Chem 68:1–52. 47. Smith SM, et al. (2004) Static and dynamic polarizabilities of conjugated molecules 20. Christiansen O, Gauss J, Stanton JF (1999) Frequency-dependent polarizabilities and and their cations. J Phys Chem A 108:11063–11072. first hyperpolarizabilities of CO and H2O from coupled cluster calculations. Chem Phys 48. Gruning¨ M, Gritsenko OV, Baerends EJ (2002) Exchange potential from the common Lett 305:147–155. energy denominator approximation for the Kohn–Sham Green’s function: Applica- 21. Hammond JR, de Jong WA, Kowalski K (2008) Coupled-cluster dynamic polarizabili- tion to (hyper)polarizabilities of molecular chains. J Chem Phys 116:6435–6442. ties including triple excitations. J Chem Phys 128:224102. 49. Huzak M, Deleuze MS (2013) Benchmark theoretical study of the electric polarizabil- 22. Hammond JR, Govind N, Kowalski K, Autschbach J, Xantheas SS (2009) Accurate ities of naphthalene, anthracene, and tetracene. J Chem Phys 138:024319. dipole polarizabilities for water clusters n=2-12 at the coupled-cluster level of theory 50. Kowalski K, Hammond JR, de Jong WA, Sadlej AJ (2008) Coupled cluster calculations and benchmarking of various density functionals. J Chem Phys 131:214103. for static and dynamic polarizabilities of C60. J Chem Phys 129:226101. 23. Lao KU, Jia J, Maitra R, DiStasio RA, Jr (2018) On the geometric dependence of the 51. Sabirov DS (2014) Polarizability as a landmark property for fullerene chemistry and molecular dipole polarizability in water: A benchmark study of higher-order elec- materials science. RSC Adv 4:44996. tron correlation, basis set incompleteness error, core electron effects, and zero-point 52. Laidig KE, Bader RFW (1990) Properties of atoms in molecules: Atomic polarizabilities. vibrational contributions. J Chem Phys 149:204303. J Chem Phys 93:7213–7224. 24. Behler J, Parrinello M (2007) Generalized neural-network representation of high- 53. Applequist J, Carl JR, Fung KK (1972) Atom dipole interaction model for molecu- dimensional potential-energy surfaces. Phys Rev Lett 98:146401. lar polarizability. Application to polyatomic molecules and determination of atom 25. Bartok´ AP, Payne MC, Kondor R, Csanyi´ G (2010) Gaussian approximation poten- polarizabilities. J Am Chem Soc 94:2952–2960. tials: The accuracy of quantum mechanics, without the electrons. Phys Rev Lett 104: 54. Parrish RM, et al. (2017) Psi4 1.1: An open-source electronic structure program empha- 136403. sizing automation, advanced libraries, and interoperability. J Chem Theory Comput 26. Rupp M, Tkatchenko A, Muller¨ KR, von Lilienfeld OA (2012) Fast and accurate 13:3185–3197. modeling of molecular atomization energies with machine learning. Phys Rev Lett 55. Shao Y, et al. (2015) Advances in molecular quantum chemistry contained in the Q- 108:058301. Chem 4 program package. Mol Phys 113:184–215. 27. De S, Bartok´ AP, Csanyi´ G, Ceriotti M (2016) Comparing molecules and solids across 56. Yang Y, et al. (2019) Coupled-cluster polarizabilities in the QM7b and a showcase structural and alchemical space. Phys Chem Chem Phys 18:13754–13769. database. Materials Cloud Archive (2019), doi:10.24435/materialscloud:2019.0002/v1.

3406 | www.pnas.org/cgi/doi/10.1073/pnas.1816132116 Wilkins et al. Downloaded by guest on September 23, 2021