Arxiv:1305.7074V1 [Physics.Chem-Ph]

Machine Learning of Molecular Electronic Properties in Chemical Compound Space Grégoire Montavon,1 Matthias Rupp,2 Vivekanand Gobre,3 Alvaro Vazquez-Mayagoitia,4 Katja Hansen,3 Alexandre Tkatchenko,3,5, ∗ Klaus-Robert Müller,1,6, † and O. Anatole von Lilienfeld4, ‡ 1Machine Learning Group, Technical University of Berlin, Franklinstr 28/29, 10587 Berlin, Germany 2Institute of Pharmaceutical Sciences, ETH Zurich, 8093 Zürich, Switzerland 3Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin, Germany 4Argonne Leadership Computing Facility, Argonne National Laboratory, Argonne, Illinois 60439, USA 5Department of Chemistry, Pohang University of Science and Technology, Pohang 790–784, Korea 6Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 136-713, Korea (Dated: May 31, 2013) The combination of modern scientific computing with electronic structure theory can lead to an unprecedented amount of data amenable to intelligent data analysis for the identification of meaningful, novel, and predictive structure-property relationships. Such relationships enable high- throughput screening for relevant properties in an exponentially growing pool of virtual compounds that are synthetically accessible. Here, we present a machine learning (ML) model, trained on a data base of ab initio calculation results for thousands of organic molecules, that simultaneously predicts multiple electronic ground- and excited-state properties. The properties include atomization energy, polarizability, frontier orbital eigenvalues, ionization potential, electron affinity, and excitation energies. The ML model is based on a deep multi-task artificial neural network, exploiting underlying correlations between various molecular properties. The input is identical to ab initio methods, i.e. nuclear charges and Cartesian coordinates of all atoms. For small organic molecules the accuracy of such a “Quantum Machine” is similar, and sometimes superior, to modern quantum-chemical methods—at negligible computational cost. I. INTRODUCTION neural networks (NNs)3,4, basis-set effects using support vector machines5,6, or molecular reorganization energies affecting charge transfer rates7,8, or for solid ternary ox- The societal need for novel computational tools and ides9. Ordinarily, these applications rely on association, data treatment that serve the accelerated discovery of using regression methods that create statistically opti- improved and novel materials has gained considerable mized relationships between so called descriptor variables momentum in the form of the materials genome initia- and electronic property of interest. Not surprisingly, the tive1. Modern electronic structure theory and compute heuristic ad hoc identification and formatting of appro- hardware have progressed to the point where electronic priate descriptors represents a crucial and challenging as- properties of virtual compounds can routinely be calcu- pect of any QSPR, and is to be repeated for every prop- lated with satisfying accuracy. For example, using quan- erty and class of chemicals. tum chemistry and distributed computing, members of the widely advertised Harvard Clean Energy Project en- We make use of an alternative ML approach, recently introduced by some of us for the modeling of molec- deavor to calculate relevant electronic properties for mil- 10 lions of chromophores2. A more fundamental challenge ular atomization energies . This approach is based on a strict first principles view on chemical compound persists, however: It is not obvious how to distill from 11 the resulting data the crucial insights that relate struc- space . Specifically, solutions to Schrödinger’s equa- ture to property in a predictive and quantitative manner. tion (SE) are inferred for organic query molecules using How are we to systematically construct robust models of the same variables that also enter the electronic Hamil- R 65 electronic structure properties that properly reflect the tonian H, i.e. nuclear charges ZI and positions I , arXiv:1305.7074v1 [physics.chem-ph] 30 May 2013 information already obtained for thousands to millions and that are mapped to the corresponding total poten- Ψ 11,12 of different chemical compounds? tial energy, H( ZI , RI ) E. Unlike the afore- { } 7−→ With increasing amounts of data and available compu- mentioned QSPRs this ML model is free of any heuris- tational resources, increasingly sophisticated statistical tics: It exactly encodes the supervised learning problem data analysis, or machine learning (ML) methods, have posed by SE, i.e. instead of finding the wavefunction already been applied to predicting not only outcomes of Ψ which maps the system’s Hamiltonian to its energy, experimental measurements but also outcomes of compu- it directly maps system to energy (based on examples ML tationally demanding high-level electronic structure cal- given for training), ZI , RI E. The employed de- culations. In close analogy to the quantitative struc- scriptor, dubbed “Coulomb”-matrix,{ } 7−→ is directly obtained ture property relationships (QSPRs) prevalent in chem- from Zi, RI . As such this constitutes a well defined informatics and bioinformatics, QSPRs can also be con- supervised-learning{ } problem and in the limit of converged structed for electronic structure properties. Examples number of training examples the ML model becomes a include QSPRs for exchange-correlation potentials using formally exact inductive equivalent to the deductive so- 2 lution of SE. It is advantageous that the training data amide, amines, alcohol, epoxy, sulfide, ether, ester, chlo- can come from experiment just as well as from numeri- ride, aliphatic, and aromatic groups. For each of the cal evaluation of the corresponding quantum mechanical many possible stoichiometries, many constitutional iso- observable using approximate wave-functions (separated mers are considered, each being represented only by a nuclear and electronic wavefunctions, Slater determinant single conformational isomer. expansions etc.), Hamiltonians (such as Hückel or any Based on the string representation (SMILES17,18) of exchange-correlation potential), and self-consistent field molecules in the data base, we used the universal force- procedures. 19 10 field to generate reasonable Cartesian molecular geome- Building on our previously introduced work , we here tries, as implemented in OpenBabel20. The resulting ge- present a more mature ML model developed to accom- ometries were relaxed using the PBE approximation21 to plish the following two additional tasks, (i) simultane- Kohn-Sham DFT22 in converged numerical basis, as im- ously predict a variety of different electronic properties plemented in the FHI-aims code23 (tight settings/tier2 for a single query, and (ii) reach an accuracy comparable basis set). All geometries are provided in the supplemen- with the employed reference method used for generating tary material. the training set. The presented ML model is based on a multi-task deep artificial NN approach that captures correlations between seemingly related and unrelated properties and levels of theory. Remarkable predictive accuracy for “out-of-sample” molecules (i.e. molecules that B. Molecular representation (descriptor) were not part of the training set) has been obtained through the use of random Coulomb-matrices that intro- One of the most important aspects for creating a func- duce invariance with respect to atom indexing. For train- tional ML model is the choice of an appropriate data ing we generated a quantum chemical data base contain- representation (descriptor) that reflects important con- ing nearly 105 entries for over seven thousand stable or- straints and properties due to the underlying physics, SE ganic molecules, made of up to 7 atoms from main-group in our case. While there is a wide variety of descriptors elements, consisting of C, N, O, S, and Cl, saturated with used in chem- and bio-informatics applications24–28 they hydrogens to satisfy valence rules.13,14 For each molecule conventionally are based on prior knowledge about chem- atomization energy, static polarizabilities, frontier orbital ical binding, electronic configuration, or other quantum eigenvalues, and excitation energies and intensities have mechanical observables. Instead, we derive our repre- been calculated using a variety of widely used electronic sentation without any pre-conceived knowledge, i.e. ex- structure methods, including state-of-the-art first prin- clusively from stoichiometry and configurational infor- ciples methods, such as hybrid density-functional the- mation, from the generated according to the previous ory and the many-body GW approach (see Methods and subsection. As such, the molecular representation is in Ref.15). Fig. 1 illustrates the complete property data complete analogy to the electronic Hamiltonian used in base, and how it has been used for model training and ab initio methods. prediction. For this study we use a randomized variant of the recently introduced “Coulomb matrix”, M.10. The II. METHODS Coulomb matrix is an inverse atom-distance matrix representation that is unique (i.e. no two molecules will A. Molecular structures (input) have the same Coulomb matrix unless they are identical or enantiomers), and retains invariance with respect to molecular translation and rotation by construction. While the present ML model approach is generally ap- plicable, for the purpose of this study we restrict our- 2.4 selves to the chemical space of small organic

Arxiv:1305.7074V1 [Physics.Chem-Ph]

Open Data, Open Source, and Open Standards in Chemistry: the Blue Obelisk Five Years On" Journal of Cheminformatics Vol

A Java-Based Platform for Evolutionary De Novo Molecular Design

Molecular Structure Input on the Web Peter Ertl

Designing Universal Chemical Markup (UCM) Through the Reusable Methodology Based on Analyzing Existing Related Formats

Open Data, Open Source and Open Standards in Chemistry: the Blue Obelisk ﬁve Years On

Biomolecular Simulation Data Management In

Open Source Visualization of Scientific Data 8 August 2011 Dr

The Chemistry Development Kit (CDK). 3

Open Source Molecular Modeling

A Multi-Scale Molecular Dynamic Approach to the Study of the Outer Membrane of the Bacteria Psudomonas Aeruginosa PA01 and the Biocide Chlorhexidine

Cheminformatics and the Semantic Web: Adding Value with Linked Data and Enhanced Provenance

UNIVERSITY of CALIFORNIA RIVERSIDE Small Molecule