
Accurate and transferable multitask prediction of chemical properties with an atom-in-molecule neural network

Roman Zubatyuk1,2, Justin S. Smith2,3, Jerzy Leszczynski1, Olexandr Isayev4*

1 Interdisciplinary Nanotoxicity Center, Department of Chemistry, Physics and Atmospheric Sciences, Jackson State University, Jackson, MS, USA
2 Center for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
3 Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
4 Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

* Corresponding author; email: [email protected]

Abstract

Atomic and molecular properties could be evaluated from the fundamental Schrödinger equation and therefore represent different modalities of the same quantum phenomena. Here we present AIMNet, a modular and chemically inspired deep neural network potential. We used AIMNet with multitarget training to learn multiple modalities of the state of an atom in a molecular system. The resulting model shows state-of-the-art accuracy on several benchmark datasets, comparable to the results of DFT methods that are orders of magnitude more expensive. It can simultaneously predict several atomic and molecular properties without an increase in computational cost. With AIMNet we show a new dimension of transferability: the ability to learn new targets utilizing multimodal information from previous training. The model can learn an implicit solvation energy (such as SMD) utilizing only a fraction of the original training data, and achieves a MAD error of 1.1 kcal/mol compared to experimental solvation free energies in the MNSol database.

Introduction

The high computational cost of QM methods has become a critical bottleneck that limits a researcher's ability to study larger, realistic atomistic systems, as well as the long time scales relevant to experiment. Hence, robust, approximate, but accurate methods are required for continued scientific progress. Machine learning (ML) has been successfully applied to approximate potential energy surfaces1–5, obtain atomic forces6, and even predict reaction synthesis7,8. ML techniques have become popular for predicting QM molecular properties.9,10 ML models seek to learn a "black box" function that maps a molecular structure to the property of interest.

Until recently, many ML-based potentials relied on a philosophy of parametrization to one chemical system at a time.11 Such methods can achieve high accuracy12 with relatively small amounts of QM data13 but are not transferable to new chemical systems. Using such an approach for any new system requires a new set of QM calculations and extra parametrization time for each new study. Recent breakthroughs in the development of ML models in chemistry have produced general-purpose models that predict potential energies and other molecular properties accurately for a broad class of chemical systems.2,3,14,15 General-purpose models promise to make ML a viable alternative to classical empirical potentials (EPs) and force fields, since EPs are known to have many weaknesses, for example, poor description of the underlying physics, lack of transferability, and difficulty of systematic improvement in accuracy.

Various techniques for improving the accuracy and transferability of general-purpose ML potentials have been employed. Active learning methods16,17, which provide a consistent and automated improvement in accuracy and transferability, have contributed greatly to the success of general-purpose models. An active learning algorithm achieves this by deciding what new QM calculations should be performed and then adding the new data to the training dataset. Letting the ML algorithm drive sampling has been shown to greatly improve the transferability of an ML potential. Further, transfer learning methods allow the training of accurate ML potentials by combining multiple QM approximations18. Several recent reviews have summarized the rapid progress in this field.19,20

The success of modern ML may be largely attributed to a highly flexible functional form for fitting high-dimensional data. ML is known to extract complex patterns and correlations from such data. These data can contain counterintuitive statistical correlations that are difficult for humans to comprehend. With a few notable exceptions,21,22 these models do not capture the underlying physics of electrons and atoms. These statistical fits are often fragile, and the behavior of an ML potential far from the training data can be non-physical. The essential challenge for ML is to capture the correct physical behavior. Therefore, the immediate frontiers in ML lie in physics-aware artificial intelligence (PAI) and explainable artificial intelligence (XAI).23 Future PAI methods will learn a model with relevant physics-inspired constraints included. The inclusion of physics-inspired constraints promises to deliver better performance by forcing the model to obey physical laws and to cope better with sparse and/or noisy data.
XAI will go a step further, complementing models with logical reasoning and explanations of their actions, to ensure that researchers are getting the right answer for the right reasons.23

Natural phenomena often inspire the structure of ML models and techniques.24 For example, the human brain constantly interacts with various types of information related to the physical world; each piece of information is called a modality. In many systems, multiple data modalities can be used to describe the same process. One such physical system is the human brain, which provides more reliable information processing based on multimodal information.25 Many ML-related fields of research have successfully applied multimodal ML model training.26 In chemistry, molecules, which are often represented by structural descriptors, can also be described by accompanying properties (dipole moments, partial atomic charges) and even electron densities. Multimodal learning, which treats multimodal information as inputs, has been an actively developing field in recent years.27 Multimodal and multitask learning aims at improving the generalization performance of a learning task by jointly learning multiple related tasks together, and can increase the predictivity of a model.28 This boost is caused by the use of additional information that captures the implicit mapping between the learnable endpoints.

In this paper, we present AIMNet (Atom-in-Molecule Network), a chemically inspired, modular deep neural network molecular potential. We use multimodal and multitask learning to obtain an information-rich representation of an atom in a molecule. We show state-of-the-art accuracy of the model in predicting energies, atomic charges, and volumes simultaneously. We also show how the multimodal information about the atom state can be used to efficiently learn new properties, such as solvation free energies, with much less training data.

Results

AIMNet architecture. We propose a new architecture for the development of neural network potentials (NNPs) we call AIMNet (see Figure 1). It extends the ANI-1 model3 with learnable atomic feature vectors and a single network for all elements. The overall architecture name is inspired by Bader's quantum theory of Atoms in Molecules (AIM).29 AIM is a model of molecular systems in which atoms and bonds are expressed via an observable electron density distribution function. The electron density distribution of a molecule is a probability distribution that describes the average electron charge distribution in the attractive field of the nuclei. In AIMNet, instead of electron density, atoms are characterized by feature vectors, and the interaction with the neighbors is passed through an environmental field layer F_i.

The AIMNet potential builds on the success of Behler-Parrinello (BP)30 type NNPs, especially the ANI-1/ANI-1x models3,31. These potentials are based on continuous and differentiable Gaussian-type radial and angular symmetry functions that encode the atomic environment up to a cutoff radius. The so-called atomic environment vectors (AEVs) are used as the input features for a set of deep feed-forward neural networks, one for each chemical element type. The outputs of the networks are values referred to as atomic energies, which are summed to produce the molecular energy. The structure of the AEV input layer determines most of the limitations of BP- or ANI-type models. It is a concatenation of radial sub-AEVs for every chemical element in the training set and angular sub-AEVs for every pair combination of chemical elements in the training set. As an example, for a model with four chemical elements, e.g. CHNO, the radial part of the AEV will contain 4 radial sub-AEVs, one for each element, and the angular part will contain 10 sub-AEVs for the neighbor pairs CC, CH, CN, etc. This means that the size of the input layer grows as O(N^2) with the number N of included chemical elements. Therefore, such models can be trained only for a specific (small) number of chemical elements, and the only way to add a new element is to rebuild and re-train the whole set of neural networks. One solution to this problem was introduced with the Deep Tensor Neural Network (DTNN)32: learnable vectors of atomic features, which are used as an embedding of atomic symmetry functions to make a unified representation of each atom's chemical environment. This eliminates the dependence of the input layer size on the number of included chemical elements and allows the use of a single neural network for all chemical elements.

The diagram in Figure 1d shows the design of the AIMNet model. It consists of inputs, an iterative block, a learnable atomic interaction (AIM) layer, and several DNN multitask blocks. The first is an embedding block. The embedding block takes atomic Cartesian coordinates as input and encodes information about the local chemical environment of each of the atoms. This block ensures the representation of the molecule is rotationally and translationally invariant with respect to an external frame of reference. For this, we use symmetry functions, implemented as described in the development of the ANI-1 model (see SI for the details): for each i-th atom we calculate G_ij, a vector of 16 values generated by Gaussian functions that probe different regions of distances to the neighboring atom j, up to a cutoff radius of 4.6 Å (Figure 1a).
The angular symmetry functions G_ijk (where j and k are indices of a unique pair of neighbors) are calculated with 4 radial and 8 angular probes and a cutoff radius of 3.1 Å. To differentiate between chemical atom types in the environment, ANI and BP-type NNPs use a concatenation of the sums of symmetry vectors for different species (Figure 1b). In contrast, to represent both geometry and atom types in the environment, the AIMNet model uses an outer product of atomic environment vectors with learnable atomic feature vectors A of the neighboring atoms (Figure 1c). The symmetry functions are designed such that their sums \sum_j G_ij and \sum_jk G_ijk efficiently encode the local chemical environment of the i-th atom. Therefore, the outer products with a vector A can be summed together without introducing additional information loss. This operation eliminates the dependence of the AEV representation on the types of neighboring atoms. In contrast to DTNN32, SchNet33, and HIP-NN34, we use both radial and angular symmetry functions. For embedding the angular symmetry functions G_ijk we need feature vectors A^(p)_jk of the atom pairs. Learning these values in the same way as the atomic feature vectors would make the number of model parameters scale as O(N^2) in the number of included chemical elements. Instead of learning A^(p)_jk directly, we derive these values with an additional deep neural network (DNN) applied to a commutative combination of the atomic feature vectors of the pair; this produces the required atom-pair features. We define the commutative combination as a concatenation of pairwise products and sums of two atomic feature vectors:

$$A^{(p)}_{jk} = \mathrm{DNN}\left(\left\{ A_j \times A_k,\; A_j + A_k \right\}\right)$$

where DNN is a series of transformations $\mathbf{x}^{l+1} = f(\mathbf{x}^{l}\mathbf{W}^{l} + \mathbf{b}^{l})$, with $\mathbf{W}^{l}$ and $\mathbf{b}^{l}$ as learnable parameters and $f$ the ELU35 activation function.
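To make the embedding step concrete, the sketch below shows one way it could look in PyTorch; the layer sizes, the toy pair DNN, and the function names are illustrative assumptions rather than the actual AIMNet implementation.

```python
import torch
import torch.nn as nn

n_feat, n_rad = 16, 16  # assumed sizes of the atomic feature vector and radial sub-AEV

# toy DNN producing pair feature vectors from the commutative combination
pair_dnn = nn.Sequential(
    nn.Linear(2 * n_feat, 64), nn.ELU(),
    nn.Linear(64, n_feat), nn.ELU(),
)

def pair_features(A_j, A_k):
    """Commutative combination of two atomic feature vectors: concatenation of
    their pairwise product and sum, passed through a small DNN."""
    comb = torch.cat([A_j * A_k, A_j + A_k], dim=-1)
    return pair_dnn(comb)

def radial_embedding(G_ij, A_j):
    """Outer product of radial symmetry functions with neighbor feature vectors,
    summed over neighbors j; returns a per-atom radial AEV of size n_rad * n_feat."""
    # G_ij: (n_neighbors, n_rad), A_j: (n_neighbors, n_feat)
    outer = G_ij.unsqueeze(-1) * A_j.unsqueeze(-2)  # (n_neighbors, n_rad, n_feat)
    return outer.sum(dim=0).flatten()

# usage with random stand-in data for one atom with 5 neighbors
G_ij = torch.rand(5, n_rad)
A_j = torch.randn(5, n_feat)
G_i_radial = radial_embedding(G_ij, A_j)   # shape (n_rad * n_feat,)
A_p = pair_features(A_j[0], A_j[1])        # shape (n_feat,)
```

The angular part of the AEV is built analogously, with the pair feature vectors taking the place of the single-atom feature vectors.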

Figure 1. Representation of atomic environment vectors in AIMNet and BP-type models, and the design of AIMNet. (a) Radial symmetry functions for the carbon atom of benzene highlighted in blue, for carbon and hydrogen neighbors. Each row represents a neighboring carbon or hydrogen atom. (b) Atomic environment vector (AEV) in BP-type networks as a concatenation of sums of symmetry functions for each neighboring atomic element. (c) AEV in AIMNet as a matrix product of radial symmetry functions with learnable atomic vectors. The angular component of the AEV is computed analogously. (d) Design diagram of the AIMNet model. Inputs are shown in green, learnable parameters in blue and outputs in yellow. In the text the following abbreviations are used: A – atomic feature vector (AFV), F – environment field (EF), M – atom-in-molecule layer (AIM).

The final vector G_i, which is the AEV of the i-th atom, is defined as the concatenation of radial and angular components:

$$G_i = \left\{\, G^{(r)}_{ij} \cdot A_j,\; G^{(a)}_{ijk} \cdot A^{(p)}_{jk} \,\right\}$$

The non-linear transformation of the vector G_i with the layers of a neural network block produces the vector F_i, which can be regarded as a field introduced by the neighboring atoms on the central atom. The vector that combines (via concatenation) the environment field F_i and the properties A_i of the central atom should contain all the information needed to obtain the properties of the i-th atom. However, following the paradigm of multi-target learning, we use a DNN to transform it further through the interaction block into a vector we call the atom-in-molecule layer M_i. This layer should contain the information about all the properties of the atom the network is trained on. Finally, we use DNNs to approximate a set of functions that map the AIM vector to the target atomic properties.

One fundamental limitation of BP-type models is the inability to pass information between atoms at distances greater than twice the cutoff for molecular properties such as energy. This short-range problem is worse for atomic properties, because intermediate atomic information cannot be passed to neighbors since each neural network works independently. Therefore, the effective range for atomic properties is the cutoff radius. In AIMNet, the solution to this problem is inspired by mean field theory (MFT). The main idea of MFT is to replace all interactions with any one body by an average or effective interaction, sometimes called a molecular field. This reduces any multi-body problem to an effective one-body problem. In quantum chemistry, MFT is also known as the self-consistent field (SCF) method.

For the information flow in the AIMNet model described above, the critical part is the properties of the atoms encoded in the atomic feature vectors A. We hypothesize that these vectors represent averaged properties of the atoms of a given chemical type over the training dataset. One could make an intuitive analogy with pro-atoms (non-interacting atoms) in electronic structure calculations. Some properties of the molecule can be obtained from the electron density of pro-atoms, such as the regions of noncovalent interactions between atoms.36 However, in precise quantum-chemical calculations the atomic properties (electron densities) change due to interactions with other atoms (nuclei and electrons). This is accounted for within the one-electron approximation in the SCF procedure: the distribution of each of the electrons in the system is iteratively updated in the field of the other electrons. In the AIMNet model we use a similar idea: given the information about the field around an atom and the properties of the atom itself, we calculate an update to the properties of the atom using another DNN block. More details about the architecture of the neural network blocks are given in the SI.

With the design of the AIMNet model, our goal was to achieve the best possible accuracy while maintaining a modular structure. The present architecture was obtained with a limited hyperparameter search. Each of the DNN blocks consists of 3 layers; the overall model has about 1M trainable parameters. When the model is trained with three "SCF-like" passes, the neural network is 33 layers deep. The "Update" block, however, serves as a residual connection, which makes such a deep model trainable.
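The information flow described above can be summarized in a schematic PyTorch sketch; all module names, layer sizes, and the build_aev callable are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class AIMNetSketch(nn.Module):
    """Toy illustration of the T-pass information flow; names and sizes are assumptions."""
    def __init__(self, n_feat=16, n_aev=384, n_hidden=128, n_passes=3):
        super().__init__()
        self.n_passes = n_passes
        self.env_field = nn.Sequential(nn.Linear(n_aev, n_hidden), nn.ELU(),
                                       nn.Linear(n_hidden, n_hidden), nn.ELU())
        self.interaction = nn.Sequential(nn.Linear(n_hidden + n_feat, n_hidden), nn.ELU())
        self.energy_head = nn.Linear(n_hidden, 1)   # one of several multitask heads
        self.charge_head = nn.Linear(n_hidden, 1)
        self.update = nn.Linear(n_hidden, n_feat)   # residual update to the AFVs

    def forward(self, afv, build_aev):
        """afv: (n_atoms, n_feat) atomic feature vectors; build_aev: callable returning
        (n_atoms, n_aev) AEVs from the current AFVs (geometry held fixed)."""
        outputs = []
        for _ in range(self.n_passes):
            F = self.env_field(build_aev(afv))             # environment field F_i
            M = self.interaction(torch.cat([F, afv], -1))  # atom-in-molecule layer M_i
            outputs.append((self.energy_head(M).sum(), self.charge_head(M)))
            afv = afv + self.update(M)                     # "SCF-like" residual update
        return outputs
```

The update line is the residual connection mentioned above: each pass refines the atomic feature vectors rather than replacing them, which keeps the effectively 33-layer model trainable.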

Multimodal learning. The AIMNet model was trained to learn six diverse atomic or molecular properties. In addition to molecular energies, we used modules for atomic electric moments up to l = 3, i.e. atomic charge, dipole, quadrupole, and octupole, as well as atomic volumes. The intent was to provide as much information about the state of the atom as possible, to obtain an information-rich AIM layer that reinforces the learning of new molecular properties. The atomic properties were derived from the QTAIM-style analysis of the DFT electron density, calculated using the Minimal Basis Iterative Stockholder (MBIS) partitioning scheme.37 MBIS is a variant of iterative Hirshfeld partitioning that does not require information about pro-atomic electron densities. The calculations were performed with the HORTON software library.38

For learning molecular properties, we used an extended ANI-1x dataset31 (about 8.5M nonequilibrium conformers of organic molecules containing {H, C, N, O, S, F, Cl} atoms, with a median size of 13 atoms). The extension was constructed with the active learning approach described by Smith et al.17 using the ANI neural network potential. We also extended the COMP6 benchmark dataset17 in the same manner; the extended set is dubbed herein COMP6-SFCl. In contrast to the original ANI-1x dataset, all DFT calculations used to construct the training and test datasets were performed with the ωB97x39 functional and the larger def2-TZVPP basis set. We trained an ensemble of AIMNet models using 5-fold cross-validation of the dataset. All prediction values for AIMNet reported here are the average prediction over the 5 models. See the SI for complete technical details.

The molecular energies, forces, atomic charges, and volumes (which are closely related to atomic polarizabilities40,41) are regarded as primary and practically useful properties. Figure 2 shows that the AIMNet model achieves state-of-the-art performance for all of them, as judged by the COMP6-SFCl benchmark suite. Other endpoints like atomic dipoles, quadrupoles, and octupoles are probably not very useful per se and can be considered "auxiliary". Hence, the accuracy of their fits is summarized in the SI (Table S4).
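Conceptually, such multitarget training combines the per-property errors into a single objective so that one set of AIM features must serve all modalities. A minimal sketch, assuming placeholder property names and unit weights (not the published training settings, which are in the SI):

```python
import torch.nn.functional as F

def multitask_loss(pred, ref, weights=None):
    """pred/ref: dicts of tensors keyed by property, e.g. 'energy', 'charges',
    'volumes', 'dipoles', 'quadrupoles', 'octupoles'. Weights are placeholders."""
    weights = weights or {key: 1.0 for key in pred}
    total = 0.0
    for key in pred:
        total = total + weights[key] * F.mse_loss(pred[key], ref[key])
    return total
```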

Figure 2. Performance of the AIMNet model on the DrugBank subset of the COMP6-SFCl benchmark.

Figure 2 provides correlation plots of the AIMNet potential vs. DFT for normalized atomization energies (see SI for details), forces (computed as negative first derivatives of the molecular energy w.r.t. atomic coordinates), atomic charges, and volumes, assessed on the DrugBank subset of the COMP6-SFCl benchmark. This benchmark contains properties of 23,203 non-equilibrium conformations of 1,253 drug-like molecules. The median molecule size is 43 atoms, more than 3 times larger than the molecules in the training dataset. Thus, this benchmark probes the transferability and extensibility of the AIMNet model. The RMS error of the energy predictions is 4.0 kcal/mol within a range of about 1500 kcal/mol. For comparison, an ANI model trained and evaluated on the same datasets has an energy RMS error of 5.8 kcal/mol. Predicted components of atomic forces have an RMSD of 4.7 kcal/mol/Å, highlighting the ability of the AIMNet model to reproduce the curvature of molecular potential energy surfaces accurately and thus its applicability to geometry optimization and molecular dynamics. Atomic charges could be learned up to a "chemical accuracy" of 0.01 e. Overall, this level of accuracy for both AIMNet and ANI-1x is on par with the best single-molecule potentials constructed with the original BP descriptor.30
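The forces in Figure 2 are negative gradients of the predicted energy with respect to the atomic coordinates; a generic autograd sketch (the model call signature is a placeholder) illustrates how any differentiable NNP yields such forces:

```python
import torch

def predicted_forces(model, species, coords):
    """Forces as -dE/dR for a differentiable NNP; `model(species, coords)` returning
    a scalar molecular energy is a placeholder call signature, not the actual API."""
    coords = coords.clone().requires_grad_(True)
    energy = model(species, coords)                  # scalar molecular energy
    forces = -torch.autograd.grad(energy, coords)[0]
    return forces
```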

Iterative "SCF-like" update. The AIMNet model was trained with outputs from every "SCF-like" pass contributing equally to the cost function (see SI for details). This way, the model learns to give the best possible answer for each pass, given the input AFVs. The AFVs are improved with every iteration, leading to lower errors. The model with T = 1 is conceptually similar to the ANI-1x network, since no updates are made to the atomic features in either model (in BP-type networks the representation of atomic features is latent) and the receptive field of the AIMNet model is roughly equal to the size of the AEV descriptor in ANI-1x. Figure 3 shows the aggregated performance for predicted energies (E), relative conformer energies (ΔE), and forces (F) on the complete COMP6-SFCl benchmark with an increasing number of passes T. As expected, the accuracy of AIMNet with T = 1 is very similar to or better than that of the ANI-1x network. The second iteration (T = 2) provides the biggest boost in performance for all quantities. By T = 3 the results converge, so larger numbers of iterations are not shown. Overall, the biggest gains in accuracy were up to 0.75 kcal/mol for relative energies and 1.5 kcal/mol/Å for forces. This corresponds to about a 15-20% error reduction.
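Since every pass contributes equally to the cost function, a training step can simply accumulate the same loss over all T outputs; a minimal sketch building on the placeholder multitask_loss above:

```python
def total_loss_over_passes(outputs_per_pass, reference):
    """outputs_per_pass: list of length T with per-pass prediction dicts.
    Every pass contributes with equal weight, so intermediate passes are also
    trained to give the best possible answer from their current AFVs."""
    losses = [multitask_loss(pred, reference) for pred in outputs_per_pass]
    return sum(losses) / len(losses)
```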

Figure 3. Comparison of the performance of the ANI potential vs. AIMNet with different numbers of iterative updates (T = 1, 2, 3) on the COMP6-SFCl benchmark.

In addition to molecular properties such as energy, iterative updates are especially important for the accurate prediction of atomic properties. To illustrate, let us consider a simple but realistic series of substituted thioaldehydes (Figure 4a). Here a substituent R is located as far as 5 Å away from the sulfur atom. This distance is longer than the cutoff radius of 4.6 Å used in constructing the AEVs. However, due to the conjugated chain and the high polarizability of sulfur, the partial atomic charge on sulfur can change by as much as 0.15 e when R is varied among different electron-donating or electron-withdrawing groups. Here, AIMNet with T = 1 (as well as any neural network potential with local geometric descriptors) will incorrectly predict that the sulfur partial charge is equal in all molecules. The correct trend can be recovered either by increasing the radius of the local environment (usually at a substantial computational cost and with a potential impact on model extensibility) or with iterative updates to the AFVs. AIMNet with T = 2 correctly reproduces the DFT charges on the sulfur atom for most substituents, except the most polar ones. At T = 3, the charge redistribution in the AIMNet model is complete, and it quantitatively reproduces the DFT charges.

Figure 4. DFT ωB97x/def2-TZVPP atomic charges on the sulfur atom of substituted thioaldehydes and AIMNet predictions with different numbers of iterative passes T.

Among the learned (E, q, V) and derived (ΔE, F) properties, we noticed that per-atom quantities like charge and volume are much more sensitive to the iterative updates. Atomic charges and volumes have the highest relative RMSE for AIMNet with no updates (T = 1). After two updates (T = 3), the errors on atomic charges and volumes are roughly halved (Figure 5).

Figure 5. Change of relative RMSE for learned (E, q, V) and derived (ΔE, F) properties vs. the number of iterative updates T.

The nature of the AFV representation. To gain insight into the latent information learned inside the atomic feature vectors, we performed a parametric t-SNE42 embedding of this 16-dimensional space into 2D. Parametric t-SNE (pt-SNE) is an unsupervised dimensionality reduction technique. pt-SNE learns a parametric mapping between the high-dimensional data space and the low-dimensional latent space using a deep neural network, in such a way that the local structure of the data in the high-dimensional space is preserved as much as possible in the low-dimensional space. Figure 6 shows the 2D pt-SNE for 3742 molecules from the DrugBank database (https://drugbank.ca), with about 327k atoms in total.

In the AIMNet model the AFVs are used to discriminate atoms by their chemical types. A trivial discrimination could be achieved with orthogonal vectors (which would effectively be a one-hot encoding). Instead, the pt-SNE embedding of the atomic feature vectors in Figure 6a shows that the distances between clusters corresponding to different chemical elements resemble their positions in the periodic table. The t-SNE component on the horizontal axis roughly corresponds to a period and the vertical component to a group in the periodic table. Embeddings for the hydrogen atoms are closer to the halogens than to any other set of elements. It is interesting to note the wide spread of the points corresponding to sulfur and hydrogen atoms. In the case of sulfur, this is the only element in the set that may adopt different valence states (6 and 2) in common organic molecules. The spread for the H atoms is determined mainly by the parent sp2 or sp3 carbon atoms or heteroatom environments. The most structure and diversity in the pt-SNE embedding plot is observed for carbon atoms. In Figure 6b we show a zoom into the region of the carbon atoms, colored by hybridization and the structure of the local chemical environment. Two main distinct clusters correspond to sp2 (or aromatic) and sp3 C atoms. Inside every cluster, atoms are grouped by the local bonding environment. There is also a clear trend of increasing substituent polarity towards the upper left corner of the plot.
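As an illustration of this type of analysis, a non-parametric t-SNE from scikit-learn can serve as a stand-in for the parametric t-SNE used here; the input file names and plotting choices are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# afv: (n_atoms, 16) array of atomic feature vectors; elements: per-atom symbols
afv = np.load("afv_drugbank.npy")                      # placeholder file name
elements = np.load("elements.npy", allow_pickle=True)  # placeholder file name

# 2D embedding that preserves local structure of the 16-dimensional AFV space
emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(afv)

for elem in np.unique(elements):
    mask = elements == elem
    plt.scatter(emb[mask, 0], emb[mask, 1], s=2, label=str(elem))
plt.legend(markerscale=5)
plt.savefig("afv_tsne.png", dpi=200)
```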

Figure 6. Parametric t-SNE embedding of atomic feature vectors (T = 3) for a set of drug-like molecules. (a) Feature vectors for different chemical elements. (b) Feature vectors for several common types of carbon atoms.

Conformations and dihedral profiles benchmark. One of the most promising applications of NNPs in computational drug discovery is conformer generation and ranking. Therefore, we evaluated the performance of the AIMNet model against two distinct external datasets. Both are tailored to benchmark the ability of molecular potentials to describe conformer energies and have high-quality CCSD(T)/CBS reference data.

Figure 7. Performance of AIMNet compared to other QM and MM methods on the torsion benchmark of Sellers et al.43 (a) and the MPCONF19644 benchmark (b). All methods are compared against CCSD(T)/CBS calculations (with the DLPNO approximation for some of the largest molecules from MPCONF196). The basis set is 6-311+G** for (a) and def2-TZVPD for (b) where applicable but not specified. The dots provide: (a) the mean absolute error (MAE) for each of the 62 torsion profiles for each method; (b) the MAE of relative conformer energies for all 196 conformers in the dataset. The methods are ordered by descending median MAE (black line in the center of the boxes) from bottom to top. The boxes represent the upper and lower quartiles, while the whiskers represent 1.5 times the interquartile range or the min/max values.

The small-molecule torsion benchmark of Sellers et al.43 measures the applicability of an atomistic potential in drug discovery. This benchmark includes 62 torsion profiles from molecules containing the elements C, H, N, O, S, and F, computed with various force fields, semi-empirical QM, DFT, and ab initio QM methods. These methods are compared to CCSD(T)/CBS reference calculations. Figure 7a shows the performance of the methods presented in Sellers et al. together with AIMNet single-point calculations on MP2/6-311+G** optimized geometries. According to this benchmark, the AIMNet potential is much more accurate than semi-empirical methods and OPLS-type force fields, which are specifically tailored to describe the conformational energies of drug-like molecules. The performance of the AIMNet model is directly comparable to MP2/6-311+G** and DFT methods.

Another benchmark set, MPCONF19644, measures the performance of various potentials in ranking both low- and high-energy conformers of acyclic and cyclic peptides and several macrocycles, comprising 13 compounds in total. The reference data were obtained as single-point calculations at the CCSD(T)/CBS level (Tight-DLPNO approximation for the largest molecules in the dataset) on MP2/cc-pVTZ geometries. Figure 7b shows a comparison of AIMNet to a subset of the methods benchmarked in the original paper of Řezáč et al.44 The DFT methods fall into two categories, depending on whether dispersion corrections were included or not, with the highly empirical M06-2X being somewhere in between. The AIMNet model has been trained to reproduce the ωB97x functional without explicit dispersion correction. Therefore, its performance is clearly worse than that of the dispersion-corrected counterpart. However, much of the error can be removed by adding an empirical D3 dispersion correction to the AIMNet energies. The benchmark data show that the AIMNet model is on par with traditional DFT functionals without dispersion correction, and it clearly outperforms semiempirical methods, even those with a built-in dispersion correction.
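For reference, the per-profile MAE used in such comparisons can be computed as in the sketch below; aligning each profile to its own minimum is an assumption about the protocol, not a statement of the benchmark's exact procedure:

```python
import numpy as np

def torsion_profile_mae(e_model, e_ref):
    """MAE (kcal/mol) between a model torsion energy profile and the reference profile.
    Each profile is shifted to its own minimum before comparison (assumed convention)."""
    e_model = np.asarray(e_model) - np.min(e_model)
    e_ref = np.asarray(e_ref) - np.min(e_ref)
    return float(np.mean(np.abs(e_model - e_ref)))
```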

Learning new properties. The multimodal knowledge residing inside the AIM layer can be exploited to efficiently learn new atomic properties without re-training the whole model. This is done by using the pre-computed AIM layer as atom descriptors and training a neural network model with a relatively small number of parameters that maps the AIM layer to the new property. We show the ability to learn atomic energies with an implicit solvent approximation. For this exercise, a subset of 480k structures was taken from the training data and molecular energies were calculated at the SMD(water)-ωB97x/def2-TZVPP level of theory. The AIM layer was computed using a pre-trained AIMNet model and was then used as a descriptor for training a simple DNN model to learn the SMD energy. The shape and activation function of the DNN were selected to be the same as in the base AIMNet model. The total number of trained parameters was about 32k. If this newly trained DNN model is placed as the "AIM to energy" block, the AIMNet model (Figure 1d) predicts energies with an implicit solvent correction in addition to the six other properties.

The performance of the AIMNet model trained this way was assessed against the experimental solvation free energies of neutral molecules in the MNSol database.45 The geometries of 414 molecules were optimized using the gas-phase and SMD versions of AIMNet, and the Hessian matrix was calculated by means of analytical second derivatives of the energies w.r.t. coordinates using the PyTorch autograd module. The thermal correction to the Gibbs free energy was computed in the harmonic oscillator and rigid rotor approximations for the ideal gas at 298 K. In the MNSol database the Gibbs free energies are reported for a 1 mol/L concentration (Ben-Naim standard state); therefore, the calculated values were corrected by RT ln(24.46) (1.89 kcal/mol), which is the free energy change of 1 mol of an ideal gas going from 1 atm (24.46 L/mol) to 1 mol/L. The results are shown in Figure 8a. The AIMNet model clearly outperforms SMD applied with semiempirical and DFTB methods, which have RMSD and MAD errors for solvation energies of 2.8-2.9 kcal/mol on this benchmark set, even after re-optimization of the atomic radii.46
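A hedged sketch of this transfer-learning setup is shown below: the pre-trained AIMNet model is kept frozen and only a small head is fit to the SMD energies, with the AIM vectors used as fixed descriptors. The head architecture, sizes, and training loop are illustrative assumptions; the standard-state constant is the value quoted above.

```python
import torch
import torch.nn as nn

STD_STATE_CORR = 1.89  # kcal/mol, RT*ln(24.46) at 298 K, applied as described in the text

class SolvationHead(nn.Module):
    """Small DNN mapping pre-computed (frozen) AIM vectors to a molecular SMD energy."""
    def __init__(self, n_aim=128, n_hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_aim, n_hidden), nn.ELU(),
                                 nn.Linear(n_hidden, n_hidden), nn.ELU(),
                                 nn.Linear(n_hidden, 1))

    def forward(self, aim_vectors):            # (n_atoms, n_aim) for one molecule
        return self.net(aim_vectors).sum()     # sum of per-atom contributions

head = SolvationHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

def training_step(aim_vectors, e_smd_ref):
    """aim_vectors come from the frozen, pre-trained AIMNet model and are not updated here."""
    optimizer.zero_grad()
    loss = (head(aim_vectors) - e_smd_ref) ** 2
    loss.backward()
    optimizer.step()
    return loss.item()
```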

Figure 8. (a) Experimental versus AIMNet-predicted solvation free energies (kcal/mol) for 414 neutral molecules from the MNSol database. (b) Performance of AIMNet and other solvation models on the torsion benchmark of Sellers et al.43 See the note to Figure 7.

Finally, we also compared the performance of the AIMNet model with other solvent-corrected QM and MM methods for predicting the torsional profiles of small drug-like molecules on the aforementioned benchmark data of Sellers et al.43 This benchmark is important for predicting the molecular conformations of solvated molecules. The meaning of the data points in Figure 8b is the same as in Figure 7a. The results show that OPLS(GBSA), PCM-B3LYP and AIMNet have very similar accuracy in predicting the conformations of solvated molecules. It should be noted that the solvent-corrected energy evaluation with AIMNet comes at no additional computational cost. It takes 90 seconds to compute the single-point energies of 36 angles for each of the 62 torsions (40 ms per energy evaluation) using an ensemble of 5 AIMNet models on a single NVIDIA M5000 GPU.

Discussion

The AIMNet framework presented here transforms ML methods from simple potential energy predictors into fully functional simulation methods. Most ML property predictors introduced to date require an individual ML model or some set of QM calculations for every quantity and system of interest. In most practical applications, multiple physical quantities are required for a complete analysis of the problem at hand. For example, molecular dynamics (MD) requires conservative energy predictions to provide the correct forces for nuclear propagation, while the distribution of other properties over the simulation might be of interest for computing results comparable to experiment (e.g. dipole moments for IR spectra). In contrast to the straightforward training of separate models for each individual property, AIMNet is trained on these quantities simultaneously with multimodal and multitask training techniques. The AIMNet potential achieves these predictions with a negligible increase in computational cost over a single energy evaluation.

The AIMNet model makes it possible to discover a joint latent representation, via the AIM layer M, that captures relationships across various modalities. Different modalities typically carry different kinds of information. Since there is much structure in these complex data, it is difficult to discover the highly non-linear relationships that exist between features across different modalities. This might explain why traditional ML research in the chemical and materials sciences has focused on the design of optimal descriptors or representation learning, rather than maximizing learning by taking advantage of the information inherent in different modalities of information-rich datasets. Inside the AIMNet model, the AIM (atom-in-molecule) layer is trained to automatically produce an information-rich representation for every atom, but is also constrained by the different modalities to implicitly encode more physical information. The primary benefit of such an information-rich representation is the ability to learn secondary (additional) tasks without retraining the overall model. This is very useful for properties that are hard to compute or for which experimental data are scarce. For example, we have shown that after training the AIMNet model to the energy, partial atomic charges, and atomic volumes, a new additional task, predicting the SMD solvation energy based only on the AIM layer vector, achieves predictions of the Gibbs free energy of solvation with an accuracy of 1.8 kcal/mol using just 6% of the ANI-1x data. This accuracy is comparable to the differences between DFT methods with different solvation models.

Three key ingredients allow AIMNet to achieve its high level of accuracy. First, it overcomes the sparsity of training data with multi-task and multimodal learning. Second, to make this learning transferable, it uses a joint, information-rich representation of an atom in a molecule that allows learning of multiple modalities. Finally, for the best performance and to account for long-range interactions, AIMNet uses an iterative "SCF-like" procedure. The number of iterative passes T can serve as a single, easy-to-use parameter that determines the length scale of interactions inside the chemical system, since more iterations increase the effective information transfer distance. As with any supervised ML method, the critical property of the trained model is the ability to generalize to new samples not seen within the training dataset.
In the case of NNPs, this is usually discussed in terms of transferability and extensibility – the ability to generalize to vast chemical compound space while retaining applicability to larger molecules than those in the training set. The AIMNet model introduces multimodality as a new dimension for generalization of NNPs – applicability to a wide range of molecular and atomic properties, and ease of learning new properties.

Acknowledgements

O.I. acknowledges support from DOD-ONR (N00014-16-1-2311) and the National Science Foundation (NSF CHE-1802789). J.S.S. acknowledges the Center for Non-linear Studies (CNLS) at Los Alamos National Laboratory (LANL) for support, and the support of the U.S. Department of Energy (DOE) through the LANL LDRD Program. The authors acknowledge Extreme Science and Engineering Discovery Environment (XSEDE) award DMR110088, which is supported by NSF grant number ACI-1053575. This research was done in part using resources provided by the Open Science Grid47,48, which is supported by award 1148698 and the U.S. DOE Office of Science. This work was performed, in part, at the Center for Integrated Nanotechnologies, an Office of Science User Facility operated for the U.S. DOE Office of Science. We gratefully acknowledge the support and hardware donation from NVIDIA Corporation and express our special gratitude to Mark Berger.

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website. A PyTorch implementation of AIMNet and pre-trained models are available from GitHub (https://github.com/aiqm/aimnet **To be uploaded upon paper acceptance**). We also provide iPython notebook examples explaining how to run calculations with AIMNet.

Authors Contributions

O.I., R.Z. and J.L. designed the study. R.Z. developed and implemented the method. R.Z. and J.S.S. ran calculations and prepared all QM data. All authors contributed to the writing of the paper and to the analysis of the results.

References

(1) Rupp, M.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Phys. Rev. Lett. 2012, 108 (5), 058301.
(2) Yao, K.; Herr, J. E.; Toth, D. W.; Mcintyre, R.; Parkhill, J. The TensorMol-0.1 Model Chemistry: A Neural Network Augmented with Long-Range Physics. Chem. Sci. 2017, 9 (8), 2261–2269.
(3) Smith, J. S.; Isayev, O.; Roitberg, A. E. ANI-1: An Extensible Neural Network Potential with DFT Accuracy at Force Field Computational Cost. Chem. Sci. 2017, 8 (4).
(4) Gastegger, M.; Behler, J.; Marquetand, P. Machine Learning Molecular Dynamics for the Simulation of Infrared Spectra. Chem. Sci. 2017, 8 (10), 6924–6935.
(5) Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Müller, K.-R.; Tkatchenko, A. Machine Learning Predictions of Molecular Properties: Accurate Many-Body Potentials and Nonlocality in Chemical Space. J. Phys. Chem. Lett. 2015, 6 (12), 2326–2331.
(6) Chmiela, S.; Tkatchenko, A.; Sauceda, H. E.; Poltavsky, I.; Schütt, K. T.; Müller, K.-R. Machine Learning of Accurate Energy-Conserving Molecular Force Fields. Sci. Adv. 2017, 3 (5), e1603015.
(7) Segler, M. H. S.; Preuss, M.; Waller, M. P. Planning Chemical Syntheses with Deep Neural Networks and Symbolic AI. Nature 2018, 555 (7698), 604–610.
(8) Raccuglia, P.; Elbert, K. C.; Adler, P. D. F.; Falk, C.; Wenny, M. B.; Mollo, A.; Zeller, M.; Friedler, S. A.; Schrier, J.; Norquist, A. J. Machine-Learning-Assisted Materials Discovery Using Failed Experiments. Nature 2016, 533 (7601), 73–76.
(9) Mansouri Tehrani, A.; Oliynyk, A. O.; Parry, M.; Rizvi, Z.; Couper, S.; Lin, F.; Miyagi, L.; Sparks, T. D.; Brgoch, J. Machine Learning Directed Search for Ultraincompressible, Superhard Materials. J. Am. Chem. Soc. 2018, 140 (31), 9844–9853.
(10) Ryan, K.; Lengyel, J.; Shatruk, M. Crystal Structure Prediction via Deep Learning. J. Am. Chem. Soc. 2018, 140 (32), 10158–10168.
(11) Behler, J.; Parrinello, M. Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Phys. Rev. Lett. 2007, 98 (14), 146401.
(12) Behler, J. First Principles Neural Network Potentials for Reactive Simulations of Large Molecular and Condensed Systems. Angew. Chem. Int. Ed. 2017, 56 (42), 12828–12840.
(13) Rupp, M.; Ramakrishnan, R.; von Lilienfeld, O. A. Machine Learning for Quantum Mechanical Properties of Atoms in Molecules. J. Phys. Chem. Lett. 2015, 3309–3313.
(14) Huo, H.; Rupp, M. Unified Representation for Machine Learning of Molecules and Crystals. 2017.
(15) Sifain, A. E.; Lubbers, N.; Nebgen, B. T.; Smith, J. S.; Lokhov, A. Y.; Isayev, O.; Roitberg, A. E.; Barros, K.; Tretiak, S. Discovering a Transferable Charge Assignment Model Using Machine Learning. J. Phys. Chem. Lett. 2018, 9 (16), 4495–4501.
(16) Podryabinkin, E. V.; Shapeev, A. V. Active Learning of Linear Interatomic Potentials. 2016, 1–19.
(17) Smith, J. S.; Nebgen, B.; Lubbers, N.; Isayev, O.; Roitberg, A. E. Less Is More: Sampling Chemical Space with Active Learning. J. Chem. Phys. 2018, 148 (24), 241733.
(18) Smith, J. S.; Nebgen, B. T.; Zubatyuk, R.; Lubbers, N.; Devereux, C.; Barros, K.; Tretiak, S.; Isayev, O.; Roitberg, A. Outsmarting Quantum Chemistry Through Transfer Learning. ChemRxiv 2018.
(19) Butler, K. T.; Davies, D. W.; Cartwright, H.; Isayev, O.; Walsh, A. Machine Learning for Molecular and Materials Science. Nature 2018, 559 (7715), 547–555.
(20) Sanchez-Lengeling, B.; Aspuru-Guzik, A. Inverse Molecular Design Using Machine Learning: Generative Models for Matter Engineering. Science 2018, 361 (6400), 360–365.
(21) Welborn, M.; Cheng, L.; Miller, T. F. Transferability in Machine Learning for Electronic Structure via the Molecular Orbital Basis. J. Chem. Theory Comput. 2018, 14 (9), 4772–4779.
(22) Li, H.; Collins, C.; Tanha, M.; Gordon, G. J.; Yaron, D. J. A Density Functional Tight Binding Layer for Deep Learning of Chemical Hamiltonians. 2018.
(23) Gunning, D. Explainable Artificial Intelligence (XAI); 2017.
(24) Forbes, N. Imitation of Life: How Biology Is Inspiring Computing; MIT Press, 2004.
(25) Pulvermüller, F. Brain Mechanisms Linking Language and Action. Nat. Rev. Neurosci. 2005.
(26) Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A. Y. Multimodal Deep Learning. Proc. 28th Int. Conf. Mach. Learn. 2011.
(27) Baltrušaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal Machine Learning: A Survey and Taxonomy. 2017.
(28) Caruana, R. Multitask Learning. Mach. Learn. 1997, 28 (1), 41–75.
(29) Bader, R. F. W. Atoms in Molecules: A Quantum Theory; Clarendon Press, 1990.
(30) Behler, J. High-Dimensional Neural Network Potentials for Complex Systems. Angew. Chem. Int. Ed. 2017.
(31) Smith, J. S.; Nebgen, B.; Lubbers, N.; Isayev, O.; Roitberg, A. E. Less Is More: Sampling Chemical Space with Active Learning. J. Chem. Phys. 2018, 148 (24).
(32) Schütt, K. T.; Arbabzadah, F.; Chmiela, S.; Müller, K. R.; Tkatchenko, A. Quantum-Chemical Insights from Deep Tensor Neural Networks. Nat. Commun. 2017, 8, 13890.
(33) Schütt, K. T.; Kindermans, P.-J.; Sauceda, H. E.; Chmiela, S.; Tkatchenko, A.; Müller, K.-R. SchNet: A Continuous-Filter Convolutional Neural Network for Modeling Quantum Interactions. 2017, 1 (1), 1–10.
(34) Lubbers, N.; Smith, J. S.; Barros, K. Hierarchical Modeling of Molecular Energies Using a Deep Neural Network. J. Chem. Phys. 2018, 148 (24).
(35) Clevert, D.-A.; Unterthiner, T.; Hochreiter, S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). 2015, 1–13.
(36) Johnson, E. R.; Keinan, S.; Mori-Sánchez, P.; Contreras-García, J.; Cohen, A. J.; Yang, W. Revealing Noncovalent Interactions. J. Am. Chem. Soc. 2010, 132 (18), 6498–6506.
(37) Verstraelen, T.; Vandenbrande, S.; Heidar-Zadeh, F.; Vanduyfhuys, L.; Van Speybroeck, V.; Waroquier, M.; Ayers, P. W. Minimal Basis Iterative Stockholder: Atoms in Molecules for Force-Field Development. J. Chem. Theory Comput. 2016, 12 (8), 3894–3912.
(38) Verstraelen, T.; Tecmer, P.; Heidar-Zadeh, F.; González-Espinoza, C. E.; Chan, M.; Kim, T. D.; Boguslawski, K.; Fias, S.; Vandenbrande, S.; Berrocal, D.; et al. HORTON 2.1.0. 2017.
(39) Chai, J.-D.; Head-Gordon, M. Systematic Optimization of Long-Range Corrected Hybrid Density Functionals. J. Chem. Phys. 2008, 128 (8), 084106.
(40) Laidig, K. E.; Bader, R. F. W. Properties of Atoms in Molecules: Atomic Polarizabilities. J. Chem. Phys. 1990, 93 (10), 7213–7224.
(41) Tkatchenko, A.; Scheffler, M. Accurate Molecular Van Der Waals Interactions from Ground-State Electron Density and Free-Atom Reference Data. Phys. Rev. Lett. 2009, 102 (7), 073005.
(42) Maaten, L. Learning a Parametric Embedding by Preserving Local Structure. In Artificial Intelligence and Statistics; 2009; pp 384–391.
(43) Sellers, B. D.; James, N. C.; Gobbi, A. A Comparison of Quantum and Molecular Mechanical Methods to Estimate Strain Energy in Druglike Fragments. J. Chem. Inf. Model. 2017, 57 (6), 1265–1275.
(44) Řezáč, J.; Bím, D.; Gutten, O.; Rulíšek, L. Toward Accurate Conformational Energies of Smaller Peptides and Medium-Sized Macrocycles: MPCONF196 Benchmark Energy Data Set. J. Chem. Theory Comput. 2018, 14 (3), 1254–1266.
(45) Marenich, A. V.; Kelly, C. P.; Thompson, J. D.; Hawkins, G. D.; Chambers, C. C.; Giesen, D. J.; Winget, P.; Cramer, C. J.; Truhlar, D. G. Minnesota Solvation Database. Univ. Minnesota, Minneapolis, MN, 2012.
(46) Kromann, J. C.; Steinmann, C.; Jensen, J. H. Improving Solvation Energy Predictions Using the SMD Solvation Method and Semiempirical Electronic Structure Methods. J. Chem. Phys. 2018, 149 (10), 104102.
(47) Pordes, R.; Petravick, D.; Kramer, B.; Olson, D.; Livny, M.; Roy, A.; Avery, P.; Blackburn, K.; Wenaus, T.; Würthwein, F.; et al. The Open Science Grid. In Journal of Physics: Conference Series; IOP Publishing, 2007; Vol. 78, p 012057.
(48) Sfiligoi, I.; Bradley, D. C.; Holzman, B.; Mhashilkar, P.; Padhi, S.; Würthwein, F. The Pilot Way to Grid Resources Using GlideinWMS. In 2009 WRI World Congress on Computer Science and Information Engineering, CSIE 2009; IEEE, 2009; Vol. 2, pp 428–432.