3DMolNet: A Generative Network for Molecular Structures

Vitali Nesterov, Mario Wieser, Volker Roth
Department of Mathematics and Computer Science
University of Basel, Switzerland
[email protected]

Preprint. Work in progress.
arXiv:2010.06477v1 [q-bio.BM] 8 Oct 2020

Abstract

With the recent advances in machine learning for quantum chemistry, it is now possible to predict the chemical properties of compounds and to generate novel molecules. Existing generative models mostly use a string- or graph-based representation, but the precise three-dimensional coordinates of the atoms are usually not encoded. First attempts in this direction have been proposed, where autoregressive or GAN-based models generate atom coordinates. Those either lack a latent space in the autoregressive setting, such that a smooth exploration of the compound space is not possible, or cannot generalize to varying chemical compositions. We propose a new approach to efficiently generate molecular structures that are not restricted to a fixed size or composition. Our model is based on the variational autoencoder, which learns a translation-, rotation-, and permutation-invariant low-dimensional representation of molecules. Our experiments yield a mean reconstruction error below 0.05 Å, outperforming the current state-of-the-art methods by a factor of four, which is even lower than the spatial quantization error of most chemical descriptors. The compositional and structural validity of newly generated molecules has been confirmed by quantum chemical methods in a set of experiments.

Figure 1: A schematic illustration of the decoding process of the 3DMolNet. The model learns a low-dimensional representation Z of molecules (red plane). Each latent encoding can be decoded to a nuclear charge matrix C, a Euclidean distance matrix D, and a bond matrix B using neural networks (blue dotted arrows). From this, we recover the 3-d molecular structure of a molecule (brown arrows).

Introduction

An exploration of the chemical compound space (CCS) is essential in the design process of new materials. Considering that the number of small organic molecules is on the order of 10^60 (Kenakin 2006), a systematic coverage of molecules in a combinatorial manner is intractable. Furthermore, a precise approximation of quantum chemical properties by solving the Schrödinger equation is computationally demanding even for a tiny number of mid-sized molecules. In recent years, an entirely new research field emerged, aiming at a statistical approach that trains regression models for quantum chemical property prediction to overcome these computational drawbacks (Schütt et al. 2018; Faber et al. 2017; Christensen, Faber, and von Lilienfeld 2019). More recently, deep latent variable models have been introduced to solve the inverse problem and to generate novel molecules. Here, the major focus of the research community is on deep latent variable models that use either string- or graph-based molecular representations (Gómez-Bombarelli et al. 2018; Kusner, Paige, and Hernández-Lobato 2017; Jin, Barzilay, and Jaakkola 2018; Janz et al. 2018). The precise spatial configuration of the molecules is usually not encoded. This leads to various drawbacks: for instance, molecules which have the same chemical composition but different geometries, so-called isomers, cannot be distinguished. Furthermore, a targeted exploration of the CCS with respect to quantum chemical properties is rather poor, and with respect to geometries it is impossible.

To account for these challenges, Gebauer, Gastegger, and Schütt proposed G-SchNet (Gebauer, Gastegger, and Schütt 2019), an autoregressive model for the generation of atom types and corresponding 3-d points. Rotation- and translation-invariance are achieved by using Euclidean distances between atoms. The permutation-invariant features are extracted with continuous-filter convolutions (Schütt et al. 2018). Due to the autoregressive nature of the model, it has no continuous latent space representation for a smooth exploration of the CCS. Furthermore, the probability of completing complex structures decays with the number of sampling steps. Hoffmann and Noé introduced EDMNet, which generates Euclidean distance matrices (EDMs) in a one-shot fashion; these can be further transformed into actual coordinates. For the generation of the EDMs a GAN architecture (Goodfellow et al. 2014) is used, while the critic network is based on the SchNet architecture (Schütt et al. 2018). This approach, however, is limited to a fixed chemical composition to circumvent the permutation problem. In addition, GANs are frequently difficult to train, and a smooth exploration of nearby areas in the latent space is not possible either.

In order to overcome the aforementioned limitations, we propose 3DMolNet, a generative network for 3-d molecular structures. A graphical sketch of our approach is illustrated in Figure 1. The molecule representation consists of two core components, a nuclear charge matrix and an EDM. Since some bond types have high variances in their bond lengths, an exact assignment of bond types often cannot be derived from the distances alone. For this reason, we additionally include an explicit bond matrix in our representation. To overcome the permutation problem, we use an InChI-based canonical ordering of the atoms.

Due to the absence of canonical identifiers for most of the hydrogen atoms, we include only heavy atoms in our representation. This is not problematic, since the backbone of a molecule is the most important part. The hydrogens are explicitly defined once the core structure of a molecule is known. A quantum mechanical approximation of the hydrogen coordinates is a computationally easy optimization task: given a fixed core structure, the many-body problem has significantly fewer degrees of freedom. Additionally, the positions of the hydrogen atoms are entirely determined only if the stereospecificity is defined.

Our model is based on the variational autoencoder (Kingma and Welling 2014; Rezende, Mohamed, and Wierstra 2014) architecture to learn a low-dimensional latent representation of the molecules. To recover a molecule, we decode each representation component separately and obtain the entire molecular structure in post-processing steps. In summary, the main contributions of this work are:

1. We introduce a translation-, rotation-, and permutation-invariant molecule representation. To overcome the permutation problem, a canonical ordering of the atoms is used.

2. We propose a VAE-based model to learn a continuous low-dimensional representation of the molecules and to generate 3-d molecular structures in a one-shot fashion.

3. We demonstrate on the QM9 dataset that the proposed model is able to reconstruct almost all chemical compositions and structures with a mean error below 0.05 Å, outperforming the state-of-the-art method by almost a factor of four.

Related Work

Variational Autoencoder. In recent years, the variational autoencoder (VAE) (Kingma and Welling 2014; Rezende, Mohamed, and Wierstra 2014) became an important tool for various machine learning tasks such as semi-supervised learning (Kingma et al. 2014) and conditional image generation (Sohn, Lee, and Yan 2015). An important line of research are VAEs for disentangled representation learning (Bouchacourt, Tomioka, and Nowozin 2018; Klys, Snell, and Zemel 2018; Lample et al. 2017; Creswell et al. 2017). More recently, variational autoencoders have shown strong connections to the information bottleneck principle (Tishby, Pereira, and Bialek 1999; Alemi et al. 2017; Achille and Soatto 2018), where multiple model extensions have been proposed in the context of sparsity (Wieczorek et al. 2018), causality (Parbhoo et al. 2020b,a), archetypal analysis (Keller et al. 2019, 2020), or invariant subspace learning (Wieser et al. 2020).

Graph-based Deep Generative Models. Deep generative models are heavily used in the context of computational chemistry to generate novel molecular structures based on graph representations. With a text-based graph representation, the SMILES string, Gómez-Bombarelli et al. (2018) introduced a chemical VAE to generate novel molecules. However, this method has limitations in generating syntactically correct SMILES strings. Several methods have been proposed to overcome this drawback (Kusner, Paige, and Hernández-Lobato 2017; Dai et al. 2018; Janz et al. 2018). Another line of research deals with explicit graph-based molecular descriptors. Liu et al. introduce a constrained graph VAE to generate valid molecular structures. Subsequently, improved approaches have been introduced by (Jin, Barzilay, and Jaakkola 2018; Jin et al. 2019; Li et al. 2018).

Coordinate-based Deep Generative Models. More recently, deep generative models for spatial structures have been proposed to overcome the limitations of graph-based descriptors, which do not take into account the precise three-dimensional configurations of the atoms. Mansimov et al. estimated the 3-d configurations of molecules from molecular graphs. Hoffmann and Noé learned the distribution of spatial structures by generating translation- and rotation-invariant Euclidean distance matrices. However, their approach is limited to molecules with the same chemical composition. Current state-of-the-art results are provided by G-SchNet (Gebauer, Gastegger, and Schütt 2019), which is based on an autoregressive model which, however, is lacking a latent representation of
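The post-processing step that both EDMNet and the decoding path of Figure 1 rely on, turning a generated Euclidean distance matrix D back into 3-d coordinates, is the standard classical multidimensional scaling (MDS) construction. The sketch below is not the authors' implementation (the function name `coords_from_edm` is ours); it only illustrates the well-known recovery, which is exact for a valid EDM up to rotation, reflection, and translation:

```python
import numpy as np

def coords_from_edm(D, dim=3):
    """Recover Cartesian coordinates (up to a rigid motion and
    reflection) from a Euclidean distance matrix D via classical MDS."""
    n = D.shape[0]
    # Double-center the squared distances to obtain the Gram matrix G = X X^T.
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J
    # Eigendecomposition of the symmetric Gram matrix (ascending order).
    w, V = np.linalg.eigh(G)
    # Keep the `dim` largest eigenvalues; clip tiny negatives from round-off.
    idx = np.argsort(w)[::-1][:dim]
    w = np.clip(w[idx], 0.0, None)
    return V[:, idx] * np.sqrt(w)

# Example: four heavy atoms on a planar square with 1.5 Å sides.
X = np.array([[0.0, 0.0, 0.0],
              [1.5, 0.0, 0.0],
              [1.5, 1.5, 0.0],
              [0.0, 1.5, 0.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
X_rec = coords_from_edm(D)
D_rec = np.linalg.norm(X_rec[:, None, :] - X_rec[None, :, :], axis=-1)
assert np.allclose(D, D_rec, atol=1e-8)  # pairwise distances are preserved
```

Because only the Gram matrix, not the coordinates themselves, is determined by D, the recovered geometry is unique only up to a rigid motion and reflection, which is exactly why such representations are translation- and rotation-invariant.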