arXiv:2107.12375v2 [physics.chem-ph] 30 Jul 2021

Geometric Deep Learning on Molecular Representations

Kenneth Atz1,†, Francesca Grisoni2,1,†,*, Gisbert Schneider1,3,*

1 ETH Zurich, Dept. Chemistry and Applied Biosciences, RETHINK, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland.
2 Eindhoven University of Technology, Dept. Biomedical Engineering, Groene Loper 7, 5612AZ Eindhoven, Netherlands.
3 ETH Singapore SEC Ltd, 1 CREATE Way, #06-01 CREATE Tower, Singapore, Singapore.
† These authors contributed equally to this work.
*[email protected], [email protected]

Abstract

Geometric deep learning (GDL), which is based on neural network architectures that incorporate and process symmetry information, has emerged as a recent paradigm in artificial intelligence. GDL bears particular promise in molecular modeling applications, in which various molecular representations with different symmetry properties and levels of abstraction exist. This review provides a structured and harmonized overview of molecular GDL, highlighting its applications in drug discovery, chemical synthesis prediction, and quantum chemistry. Emphasis is placed on the relevance of the learned molecular features and their complementarity to well-established molecular descriptors. This review provides an overview of current challenges and opportunities, and presents a forecast of the future of GDL for molecular sciences.

1 Introduction

Recent advances in deep learning, which is an instance of artificial intelligence (AI) based on neural networks [1, 2], have led to numerous applications in the molecular sciences, e.g., in drug discovery [3, 4], quantum chemistry [5], and structural biology [6, 7]. Two characteristics of deep learning render it particularly promising when applied to molecules. First, deep learning methods can cope with "unstructured" data representations, such as text sequences [8, 9], speech signals [10, 11], images [12–14], and graphs [15, 16]. This ability is particularly useful for molecular systems, for which chemists have developed many models (i.e., "molecular representations") that capture molecular properties at varying levels of abstraction (Figure 1). The second key characteristic is that deep learning can perform feature extraction (or feature learning) from the input data, that is, produce data-driven features from the input data without the need for manual intervention. These two characteristics make deep learning a promising complement to "classical" machine learning applications (e.g., Quantitative Structure-Activity Relationship [QSAR]), in which molecular features (i.e., "molecular descriptors" [17]) are encoded a priori with rule-based algorithms. The capability to learn from unstructured data and obtain data-driven molecular features has led to unprecedented applications of AI in the molecular sciences.

Figure 1: Exemplary molecular representations for a selected molecule (i.e., the penam substructure of penicillin).
a. Two-dimensional (2D) depiction (Kekulé structure).
b. Molecular graph (2D), composed of vertices (atoms) and edges (bonds).
c. SMILES string [20] (shown in the figure as CC1(C)[C@H](C(O)=O)N2[C@@H](CC2)S1), in which atom type, bond type and connectivity are specified by alphanumerical characters.
d. Three-dimensional (3D) graph, composed of vertices (atoms), their position (x, y, z coordinates) in 3D space, and edges (bonds).
e. Molecular surface represented as a mesh colored according to the respective atom types.

One of the most promising advances in deep learning is geometric deep learning (GDL). Geometric deep learning is an umbrella term encompassing emerging techniques which generalize neural networks to Euclidean and non-Euclidean domains, such as graphs, manifolds, meshes, or string representations [15]. In general, GDL encompasses approaches that incorporate a geometric prior, i.e., information on the structure space and symmetry properties of the input variables. Such a geometric prior is leveraged to improve the quality of the information captured by the model. Although GDL has been increasingly applied to molecular modeling [5, 18, 19], its full potential in the field is still untapped.

The aim of this review is to (i) provide a structured and harmonized overview of the applications of GDL on molecular systems, (ii) delineate the main research directions in the field, and (iii) provide a forecast of the future impact of GDL. Three fields of application are highlighted, namely drug discovery, quantum chemistry, and computer-aided synthesis planning (CASP), with particular attention to the data-driven molecular features learned by GDL methods. A glossary of selected terms can be found in Box 1.

2 Principles of geometric deep learning

The term geometric deep learning was coined in 2017 [15]. Although GDL was originally used for methods applied to non-Euclidean data [15], it now extends to all deep learning methods that incorporate geometric priors [21], that is, information about the structure and symmetry of the system of interest. Symmetry is a crucial concept in GDL, as it encompasses the properties of the system with respect to manipulations (transformations), such as translation, reflection, rotation, scaling, or permutation (Box 2).

Symmetry is often recast in terms of invariance and equivariance to express the behavior of any mathematical function with respect to a transformation T (e.g., rotation, translation, reflection or permutation) of an acting symmetry group [22]. Here, the mathematical function is a neural network F applied to a given molecular input X. F(X) can therein transform equivariantly, invariantly or neither with respect to T, as described below:

• Equivariance. A neural network F applied to an input X is equivariant to a transformation T if the transformation of the input X commutes with the transformation of F(X), via a transformation T′ of the same symmetry group, such that: F(T(X)) = T′F(X). Neural networks are therefore equivariant to the actions of a symmetry group on their inputs if each layer of the network transforms "equivariantly" under any transformation of that group.

• Invariance. Invariance is a special case of equivariance, where F(X) is invariant to T if T′ is the trivial group action (i.e., identity): F(T(X)) = T′F(X) = F(X).

• F is neither invariant nor equivariant to T when (i) the transformation of the input X does not commute with the transformation of F(X): F(T(X)) ≠ T′F(X), and (ii) T′ is not the trivial group action.

The symmetry properties of a neural network architecture vary depending on the network type and the symmetry group of interest and are individually discussed in the following sections. Readers can find an in-depth treatment of equivariance and group equivariant layers in neural networks elsewhere [23–26].

[...] X. For instance, many molecular descriptors are invariant to the rotation and translation of the molecular representation by design [17], e.g., the Moriguchi octanol-water partitioning coefficient [27], which relies only on the occurrence of specific molecular substructures for calculation. The symmetry properties of molecular features extracted by a neural network depend on both the symmetry properties of the input molecular representation and of the utilized neural network.

Many relevant molecular properties (e.g., equilibrium energies, atomic charges, or physicochemical properties such as permeability, lipophilicity or solubility) are invariant to certain symmetry operations (Box 2). In many tasks in chemistry, it is thus desirable to design neural networks that transform equivariantly under the actions of pre-defined symmetry groups. Exceptions occur if the targeted property changes upon a symmetry transformation of the molecules (e.g., chiral properties which change under inversion of the molecule, or vector properties which change under rotation of the molecule). In such cases, the inductive bias (learning bias) of equivariant neural networks would not allow for the differentiation of symmetry-transformed molecules.

While neural networks can be considered as universal function approximators [28], incorporating prior knowledge such as reasonable geometric information (geometric priors) has evolved as a core design principle of neural network modeling [21]. By incorporating geometric priors, GDL can increase the quality of the model and bypass several bottlenecks related to the need to force the data into Euclidean geometries (e.g., by feature engineering). Moreover, GDL provides novel modeling opportunities, such as data augmentation in low data regimes [29, 30].

3 Molecular GDL

The application of GDL to molecular systems is challenging, in part because there are multiple valid ways of representing the same molecular entity. Molecular representations can be categorized based on their different levels of abstraction and the physicochemical and geometrical aspects they capture. Importantly, all of these representations are models of the same reality and are thus "suitable for some purposes, not for others" [63]. GDL provides the opportunity to experiment with different representations of the same molecule and leverages their intrinsic geometrical features to increase the quality of the model. Moreover, GDL has repeatedly proven useful in providing insights into relevant molecular properties for the task at hand, thanks to its feature extraction (feature learning) capabilities. In the following sections, we delineate
