Geometric Deep Learning on Molecular Representations

Kenneth Atz 1,†, Francesca Grisoni 2,1,†,*, Gisbert Schneider 1,3,*
1 ETH Zurich, Dept. Chemistry and Applied Biosciences, RETHINK, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland.
2 Eindhoven University of Technology, Dept. Biomedical Engineering, Groene Loper 7, 5612AZ Eindhoven, Netherlands.
3 ETH Singapore SEC Ltd, 1 CREATE Way, #06-01 CREATE Tower, Singapore, Singapore.
† These authors contributed equally to this work.
* [email protected], [email protected]

Abstract

Geometric deep learning (GDL), which is based on neural network architectures that incorporate and process symmetry information, has emerged as a recent paradigm in artificial intelligence. GDL bears particular promise in molecular modeling applications, in which various molecular representations with different symmetry properties and levels of abstraction exist. This review provides a structured and harmonized overview of molecular GDL, highlighting its applications in drug discovery, chemical synthesis prediction, and quantum chemistry. Emphasis is placed on the relevance of the learned molecular features and their complementarity to well-established molecular descriptors. This review provides an overview of current challenges and opportunities, and presents a forecast of the future of GDL for molecular sciences.

arXiv:2107.12375v2 [physics.chem-ph] 30 Jul 2021

1 Introduction

Recent advances in deep learning, which is an instance of artificial intelligence (AI) based on neural networks [1, 2], have led to numerous applications in the molecular sciences, e.g., in drug discovery [3, 4], quantum chemistry [5], and structural biology [6, 7]. Two characteristics of deep learning render it particularly promising when applied to molecules. First, deep learning methods can cope with "unstructured" data representations, such as text sequences [8, 9], speech signals [10, 11], images [12–14], and graphs [15, 16]. This ability is particularly useful for molecular systems, for which chemists have developed many models (i.e., "molecular representations") that capture molecular properties at varying levels of abstraction (Figure 1). The second key characteristic is that deep learning can perform feature extraction (or feature learning) from the input data, that is, produce data-driven features from the input data without the need for manual intervention. These two characteristics are promising for deep learning as a complement to "classical" applications (e.g., Quantitative Structure-Activity Relationship [QSAR]), in which molecular features (i.e., "molecular descriptors" [17]) are encoded a priori with rule-based algorithms. The capability to learn from unstructured data and obtain data-driven molecular features has led to unprecedented applications of AI in the molecular sciences.

One of the most promising advances in deep learning is geometric deep learning (GDL). Geometric deep learning is an umbrella term encompassing emerging techniques which generalize neural networks to Euclidean and non-Euclidean domains, such as graphs, manifolds, meshes, or string representations [15]. In general, GDL encompasses approaches that incorporate a geometric prior, i.e., information on the structure space and symmetry properties of the input variables. Such a geometric prior is leveraged to improve the quality of the information captured by the model. Although GDL has been increasingly applied to molecular modeling [5, 18, 19], its full potential in the field is still untapped.

[Figure 1, panels a–e; panel c shows the SMILES string CC1(C)[C@H](C(O)=O)N2[C@@H](CC2)S1.]
Figure 1: Exemplary molecular representations for a selected molecule (i.e., the penam substructure of penicillin).
a. Two-dimensional (2D) depiction (Kekulé structure).
b. Molecular graph (2D), composed of vertices (atoms) and edges (bonds).
c. SMILES string [20], in which atom type, bond type and connectivity are specified by alphanumerical characters.
d. Three-dimensional (3D) graph, composed of vertices (atoms), their position (x, y, z coordinates) in 3D space, and edges (bonds).
e. Molecular surface represented as a mesh colored according to the respective atom types.

The aim of this review is to (i) provide a structured and harmonized overview of the applications of GDL on molecular systems, (ii) delineate the main research directions in the field, and (iii) provide a forecast of the future impact of GDL. Three fields of application are highlighted, namely drug discovery, quantum chemistry, and computer-aided synthesis planning (CASP),
with particular attention to the data-driven molecular features learned by GDL methods. A glossary of selected terms can be found in Box 1.

2 Principles of geometric deep learning

The term geometric deep learning was coined in 2017 [15]. Although GDL was originally used for methods applied to non-Euclidean data [15], it now extends to all deep learning methods that incorporate geometric priors [21], that is, information about the structure and symmetry of the system of interest. Symmetry is a crucial concept in GDL, as it encompasses the properties of the system with respect to manipulations (transformations), such as translation, reflection, rotation, scaling, or permutation (Box 2).

Symmetry is often recast in terms of invariance and equivariance to express the behavior of any mathematical function with respect to a transformation T (e.g., rotation, translation, reflection or permutation) of an acting symmetry group [22]. Here, the mathematical function is a neural network F applied to a given molecular input X. F(X) can therein transform equivariantly, invariantly or neither with respect to T, as described below:

• Equivariance. A neural network F applied to an input X is equivariant to a transformation T if the transformation of the input X commutes with the transformation of F(X), via a transformation T' of the same symmetry group, such that: $F(T(X)) = T'F(X)$. Neural networks are therefore equivariant to the actions of a symmetry group on their inputs if and only if each layer of the network "equivalently" transforms under any transformation of that group.

• Invariance. Invariance is a special case of equivariance, where F(X) is invariant to T if T' is the trivial group action (i.e., identity): $F(T(X)) = T'F(X) = F(X)$.

• F is neither invariant nor equivariant to T when (i) the transformation of the input X does not commute with the transformation of F(X): $F(T(X)) \neq T'F(X)$, and (ii) T' is not the trivial group action.

The symmetry properties of a neural network architecture vary depending on the network type and the symmetry group of interest and are individually discussed in the following sections. Readers can find an in-depth treatment of equivariance and group equivariant layers in neural networks elsewhere [23–26].

The concept of equivariance and invariance can also be used in reference to the molecular features obtained from a given molecular representation (X), depending on their behaviour when a transformation is applied to X. For instance, many molecular descriptors are invariant to the rotation and translation of the molecular representation by design [17], e.g., the Moriguchi octanol-water partitioning coefficient [27], which relies only on the occurrence of specific molecular substructures for calculation. The symmetry properties of molecular features extracted by a neural network depend on both the symmetry properties of the input molecular representation and of the utilized neural network.

Many relevant molecular properties (e.g., equilibrium energies, atomic charges, or physicochemical properties such as permeability, lipophilicity or solubility) are invariant to certain symmetry operations (Box 2). In many tasks in chemistry, it is thus desirable to design neural networks that transform equivariantly under the actions of pre-defined symmetry groups. Exceptions occur if the targeted property changes upon a symmetry transformation of the molecules (e.g., chiral properties which change under inversion of the molecule, or vector properties which change under rotation of the molecule). In such cases, the inductive bias (learning bias) of equivariant neural networks would not allow for the differentiation of symmetry-transformed molecules.

While neural networks can be considered as universal function approximators [28], incorporating prior knowledge such as reasonable geometric information (geometric priors) has evolved as a core design principle of neural network modeling [21]. By incorporating geometric priors, GDL can increase the quality of the model and bypass several bottlenecks related to the need to force the data into Euclidean geometries (e.g., by feature engineering). Moreover, GDL provides novel modeling opportunities, such as data augmentation in low data regimes [29, 30].

3 Molecular GDL

The application of GDL to molecular systems is challenging, in part because there are multiple valid ways of representing the same molecular entity. Molecular representations¹ can be categorized based on their different levels of abstraction and the physicochemical and geometrical aspects they capture. Importantly, all of these representations are models of the same reality and are thus "suitable for some purposes, not for others" [63]. GDL provides the opportunity to experiment with different representations of the same molecule and leverages their intrinsic geometrical features to increase the quality of the model. Moreover, GDL has repeatedly proven useful in providing insights into relevant molecular properties for the task at hand, thanks to its feature extraction (feature learning) capabilities. In the following sections, we delineate the most prevalent molecular GDL approaches and their applications in chemistry, grouped according to the respective molecular representations used for deep learning: molecular graphs, grids, strings, and surfaces.
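
The definitions of equivariance and invariance given in Section 2 can be checked numerically on a toy example. The following is a minimal sketch and is not taken from the review: the coordinates, the rotation about the z-axis, and the three functions `centroid`, `sum_pairwise_distances`, and `first_x` are illustrative assumptions that stand in for a neural network F.

```python
import math

def rotate_z(points, angle):
    """Rotate 3D points about the z-axis (the transformation T in this sketch)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def centroid(points):
    """Vector-valued F: rotating the input rotates the output (equivariant)."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def sum_pairwise_distances(points):
    """Scalar-valued F: unchanged by rotation of the input (invariant)."""
    return sum(math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])

def first_x(points):
    """Scalar-valued F that depends on the coordinate frame (not invariant)."""
    return points[0][0]

# Hypothetical toy coordinates, not a real molecular geometry.
X = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0), (2.5, 1.2, 0.0), (1.3, 0.8, 1.0)]
theta = 0.7
TX = rotate_z(X, theta)

# Equivariance: F(T(X)) equals T'(F(X)), with T' the same rotation applied to the output.
lhs = centroid(TX)
rhs = rotate_z([centroid(X)], theta)[0]
equivariant = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(lhs, rhs))

# Invariance: F(T(X)) equals F(X).
invariant = math.isclose(sum_pairwise_distances(TX), sum_pairwise_distances(X), rel_tol=1e-9)

# Frame dependence: the output changes under T, so this F is not invariant.
frame_dependent = not math.isclose(first_x(TX), first_x(X), abs_tol=1e-9)
```

In this sketch a single rotation is tested; a property of the whole symmetry group would require the check to hold for every transformation in that group.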

¹ Note that in this review the term "representation" is used solely to denote human-made models of molecules (e.g., molecular graphs, 3D conformers, SMILES strings). To avoid confusion with other usages of the word "representation" in deep learning, we will use the term "feature" whenever referring to any numerical description of molecules, obtained either with rule-based algorithms (molecular descriptors) or learned (extracted) by neural networks.

Box 1: Glossary of selected terms

CoMFA and CoMSIA. Comparative Molecular Field Analysis (CoMFA) [31] and Comparative Molecular Similarity Indices Analysis (CoMSIA) [32] are popular 3D QSAR methods developed in the 1980s and 1990s, in which three-dimensional grids are used to capture the distributions of molecular features (e.g., steric, hydrophobic, and electrostatic properties). The obtained molecular descriptors serve as inputs to a regression model for quantitative bioactivity prediction.

Convolution. Operation within a neural network that transforms a feature space into a new feature space and thereby captures the local information found in the data. Convolutions were first introduced for pixels in images [33, 34] but the term "convolution" is now used for neural network architectures covering a variety of data structures such as graphs, point clouds, spheres, grids, or manifolds.

Density Functional Theory (DFT). A quantum mechanical modeling approach used to investigate the electronic structure of molecules.

Data augmentation. Artificial increase of the data volume available for model training, often achieved by leveraging symmetrical properties of the input data which are not captured by the model (e.g., rotation or permutation).

Feature. An individually measurable or computationally obtainable characteristic of a given sample (e.g., molecule), in the form of a scalar. In this review, the term refers to a numeric value characterizing a molecule. Such molecular features can be computed with rule-based algorithms ("molecular descriptors") or generated automatically by deep learning from a molecular representation ("hidden" or "learned" features).

Geometric prior. An inductive bias incorporating information on the symmetric nature of the system of interest into the neural network architecture. Also known as symmetry prior.

Inductive bias. Set of assumptions that a learning algorithm (e.g., a neural network) uses to learn the target function and to make predictions on previously unseen data points.

One-hot encoding. Method for representing categorical variables as numerical arrays by obtaining a binary variable (0, 1) for each category. It is often used to convert sequences (e.g., SMILES strings) into numerical matrices, suitable as inputs and/or outputs of deep learning models (e.g., chemical language models).

Quantitative Structure-Activity Relationship (QSAR). Machine learning techniques aimed at finding an empirical relationship between the molecular structure (usually encoded as molecular descriptors) and experimentally determined molecular properties, such as pharmacological activity or toxicity.

Reinforcement learning. A technique used to steer the output of a machine learning algorithm toward user-defined regions of optimality via a predefined reward function [35].

Transfer learning. Transfer of knowledge from an existing deep learning model to a related task for which fewer training samples are available [36].

Unstructured data. Data that are not arranged as vectors of (typically handcrafted) features. Examples of unstructured data include graphs, images, and meshes. Molecular representations are typically unstructured, whereas numerical molecular descriptors (e.g., molecular properties, molecular "fingerprints") are examples of structured data.

Voxel. Element of a regularly spaced, 3D grid (equivalent to a pixel in 2D space).
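
The "one-hot encoding" glossary entry can be illustrated with a short sketch. It assumes a naive character-level tokenizer, which is a simplification: practical chemical language models usually treat multi-character tokens such as `Cl` or `[C@@H]` as single tokens. The helper names `tokenize` and `one_hot_encode` are hypothetical, and the SMILES string is the one shown in Figure 1c.

```python
def tokenize(smiles):
    """Naive character-level tokenizer (assumption: no multi-character tokens are merged)."""
    return list(smiles)

def one_hot_encode(tokens, vocabulary):
    """Return one binary row per token, with a single 1 at the token's vocabulary index."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for tok in tokens:
        row = [0] * len(vocabulary)
        row[index[tok]] = 1
        matrix.append(row)
    return matrix

smiles = "CC1(C)[C@H](C(O)=O)N2[C@@H](CC2)S1"  # SMILES string from Figure 1c
tokens = tokenize(smiles)
vocabulary = sorted(set(tokens))               # vocabulary built from this single string
encoded = one_hot_encode(tokens, vocabulary)

# The encoding is lossless: each row maps back to exactly one token.
decoded = "".join(vocabulary[row.index(1)] for row in encoded)
```

The resulting matrix has one row per token and one column per vocabulary entry, which is the shape typically fed to a sequence model.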

3.1 Learning on molecular graphs

3.1.1 Molecular graphs

Graphs are among the most intuitive ways to represent molecular structures [65]. Any molecule can be thought of as a mathematical graph G = (V, E), whose vertices ($v_i \in V$) represent atoms, and whose edges ($e_{ij} \in E$) constitute their connection (Figure 1b). In many deep learning applications, molecular graphs can be further characterized by a set of vertex and edge features.

3.1.2 Graph neural networks

Deep learning methods devoted to handling graphs as input are commonly referred to as graph neural networks (GNNs). When applied to molecules, GNNs allow for feature extraction by progressively aggregating information from atoms and their molecular environments (Figure 2a, [66, 67]). Different architectures of GNNs have been introduced [68], the most popular of which fall under the umbrella term of message passing neural networks [5, 69, 70]. Such networks iteratively update the vertex features of the l-th network layer ($v_i^{l} \rightarrow v_i^{l+1}$) via graph convolutional operations, employing at least two learnable functions $\psi$ and $\phi$, and a local permutation-invariant aggregation operator (e.g., sum): $v_i^{l+1} = \psi\left(\sum_{j \in \mathcal{N}(i)} \phi\left(v_i^{l}, v_j^{l}, e_{ij}\right)\right)$.

Since their introduction as a means to predict quantum chemical properties of small molecules at the density functional theory (DFT) level [5], GNNs have found many applications in quantum chemistry [71–75], drug discovery [37, 76, 77], CASP [39], and molecular property prediction [78, 79]. When applied to quantum chemistry tasks, GNNs often use E(3)-invariant 3D information by including radial and angular information

Table 1: Summary of selected geometric deep learning (GDL) approaches for molecular modeling. For each approach, the utilized molecular representation(s) and selected applications are reported. 1D, one-dimensional; 2D, two-dimensional; 3D, three-dimensional.

GDL approach | Molecular representation(s) | Applications
--- | --- | ---
Graph neural networks | 2D molecular graph | Molecular property prediction [37, 38], CASP [39, 40], and generative molecular design [41, 42]
Equivariant graph neural networks | 3D molecular graph and point cloud | Prediction of quantum chemical energies [43–45], forces [45–47] and wave-functions [48]
3D convolutional neural networks | 3D grid | Structure-based drug design and property prediction [49–51]
Mesh convolutional neural networks | Surface encoded as a mesh (represented as 2D grid or 3D graph) | Protein-protein interaction prediction and ligand-pocket fingerprinting [18]
Recurrent neural networks | String notation (1D grid) | Generative molecular design [19, 52, 53], synthesis planning [54], protein structure prediction [55, 56] and prediction of properties in drug discovery [57, 58]
Transformers | String notation (encoded as a graph) | Prediction of properties in drug discovery [59], synthesis planning [54, 60], prediction of reaction yields [61], generative molecular design [62] and protein structure prediction [6, 7]
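
The message passing update rule of Section 3.1.2 can be made concrete with a small sketch. It is an illustration under stated assumptions rather than any published architecture: `phi` and `psi` are fixed hand-written functions here (in a real GNN both are learnable), edge features are single scalars, and the three-vertex graph with two-dimensional vertex features is a hypothetical toy input.

```python
def phi(v_i, v_j, e_ij):
    """Message function (hypothetical fixed form; learnable in a real GNN)."""
    return [e_ij * (a + b) for a, b in zip(v_i, v_j)]

def psi(aggregated):
    """Update function (hypothetical choice: element-wise ReLU; learnable in a real GNN)."""
    return [max(0.0, x) for x in aggregated]

def message_passing_step(vertices, edges):
    """One update v_i^l -> v_i^(l+1) with sum aggregation over the neighbors N(i).

    vertices: {vertex index: feature list}
    edges: {(i, j): scalar edge feature}, interpreted as undirected bonds
    """
    neighbors = {i: [] for i in vertices}
    for (i, j), e_ij in edges.items():
        neighbors[i].append((j, e_ij))
        neighbors[j].append((i, e_ij))
    updated = {}
    for i, v_i in vertices.items():
        aggregated = [0.0] * len(v_i)
        for j, e_ij in neighbors[i]:
            message = phi(v_i, vertices[j], e_ij)
            aggregated = [a + m for a, m in zip(aggregated, message)]
        updated[i] = psi(aggregated)
    return updated

# Toy graph with three heavy atoms in a chain; features are [is_carbon, is_oxygen].
vertices = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
edges = {(0, 1): 1.0, (1, 2): 1.0}
updated = message_passing_step(vertices, edges)

# Sum aggregation is permutation invariant: the edge order does not change the result.
reordered = message_passing_step(vertices, dict(reversed(list(edges.items()))))
```

After one step the middle vertex carries information from both neighbors, which is how repeated steps progressively enlarge the molecular environment each atom "sees".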

into the edge features of the graph [47, 71, 72, 80, 81], thereby improving the prediction accuracy of quantum chemical forces and energies for equilibrium and non-equilibrium molecular conformations, as in the case of SchNet [82, 83] and PaiNN [47]. SchNet-like architectures were used to predict quantum mechanical wave-functions in the form of Hartree-Fock and DFT density matrices [84], and differences in quantum properties obtained by DFT and coupled cluster level-of-theory calculations [85].

GNNs for molecular property prediction have been shown to outperform human-engineered molecular descriptors for several biologically relevant properties [86]. Although including 3D information into molecular graphs generally improved the prediction of drug-relevant properties, no marked difference was observed between using a single or multiple molecular conformers for network training [87]. Because of their natural connection with molecular representations, GNNs seem particularly suitable in the context of explainable AI (XAI) [88, 89], where they have been used to interpret models predicting molecular properties of preclinical relevance [38] and quantum chemical properties [90].

GNNs have been used for de novo molecule generation [41, 91–93], for example by performing vertex and edge addition from an initial vertex [91] (Figure 2b). GNNs have also been combined with variational autoencoders [42, 92–94] and reinforcement learning [41, 95, 96]. Finally, GNNs have been applied to CASP [39, 97, 98]; however, the current approaches are limited to reactions in which one bond is removed between the products and the reactants.

3.1.3 Equivariant message passing

A recent area of development of graph-based methods is SE(3)- and E(3)-equivariant GNNs (equivariant message passing networks), which deal with the absolute coordinate systems of 3D graphs [99, 100] (Figure 2b). Thus, these networks may be particularly well-suited to be applied to 3D molecular representations. Such networks exploit Euclidean symmetries of the system (Box 2).

3D molecular graphs $G_{3D} = (V, E, R)$, in addition to their vertex and edge features ($v_i \in V$ and $e_{ij} \in E$, respectively), also encode information on the vertex position in a 3D coordinate system ($r_i \in R$). By employing E(3)- [45] and SE(3)-equivariant [99] convolutions, such networks have shown high accuracy for predicting several quantum chemical properties such as

Box 2: Euclidean symmetries in molecular systems

Molecular systems (and three-dimensional representations thereof) can be considered as objects in Euclidean space. In such a space, one can apply several symmetry operations (transformations) that are (i) performed with respect to three symmetry elements (i.e., line, plane, point), and (ii) rigid, that is, they preserve the Euclidean distance between all pairs of atoms (i.e., isometry). The Euclidean transformations are as follows:

• Rotation. Movement of an object with respect to the radial orientation to a given point.
• Translation. Movement of every point of an object by the same distance in a given direction.
• Reflection. Mapping of an object to itself through a point (inversion), a line or a plane (mirroring).

All three transformations and their arbitrary finite combinations are included in the Euclidean group [E(3)]. The special Euclidean group [SE(3)] comprises only translations and rotations. Molecules are always symmetric in the SE(3) group, i.e., their intrinsic properties (e.g., biological and physicochemical properties, and equilibrium energy) are invariant to coordinate rotation and translation, and combinations thereof. Several molecules are chiral, that is, some of their (chiral) properties depend on the absolute configuration of their stereogenic centers, and are thus non-invariant to molecule reflection. Chirality plays a key role in chemical biology; relevant examples of chiral molecules are DNA, and several drugs whose isomers exhibit markedly different pharmacological and toxicological properties [64].

[Box 2 illustration: a molecule shown in its original pose and after rotation, translation, reflection (inversion), and reflection (mirroring).]
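
The statements in Box 2 can be verified numerically. The sketch below is an illustration only: the four points are hypothetical substituent positions around a stereocenter, and the signed tetrahedron volume is used as a simple stand-in for handedness (an assumption of this example, not a descriptor discussed in the review). It checks that rotation, translation, and mirroring all preserve pairwise distances, while only mirroring changes the sign of the handedness measure.

```python
import math

def rotate_z(points, angle):
    """Rotation about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def translate(points, shift):
    """Translation of every point by the same vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in points]

def mirror_xy(points):
    """Reflection through the xy-plane (mirroring)."""
    return [(x, y, -z) for x, y, z in points]

def pairwise_distances(points):
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

def same_distances(p, q):
    return all(math.isclose(a, b, abs_tol=1e-9)
               for a, b in zip(pairwise_distances(p), pairwise_distances(q)))

def signed_volume(a, b, c, d):
    """Signed volume of the tetrahedron (a, b, c, d); its sign flips under reflection."""
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    w = [d[k] - a[k] for k in range(3)]
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return det / 6.0

# Hypothetical substituent positions around a stereocenter.
substituents = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (-0.5, -0.5, -0.5)]
moved = translate(rotate_z(substituents, 1.1), (2.0, -1.0, 0.5))  # element of SE(3)
mirrored = mirror_xy(substituents)                                # in E(3) but not in SE(3)

rigid_ok = same_distances(substituents, moved) and same_distances(substituents, mirrored)
handedness_kept = math.isclose(signed_volume(*moved), signed_volume(*substituents), abs_tol=1e-9)
handedness_flipped = math.isclose(signed_volume(*mirrored), -signed_volume(*substituents), abs_tol=1e-9)
```

A feature built only from pairwise distances therefore cannot separate the original from its mirror image, whereas a signed quantity such as this volume can.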

energies [43, 44, 46, 47, 101–103], interatomic potentials for molecular dynamics simulations [45, 46, 104], and wave-functions [48]. SE(3) equivariant neural networks possess reflection-equivariance, and thereby enable the model to distinguish between chiral molecules [99]. SE(3) neural networks are computationally expensive due to their use of spherical harmonics [105] and Wigner D-functions [106] to compute learnable weight kernels. E(3)-equivariant neural networks are computationally more efficient and have been shown to perform equal to, or better than, SE(3)-equivariant networks, e.g., for the modeling of quantum chemical properties and dynamic systems [45]. Equivariant message passing networks have been applied to predict the quantum mechanical wave-function of nuclei and electron-based representations in an end-to-end fashion [107–109]. However, such networks are currently limited to small molecular systems because of the large size of the learned matrices, which scale quadratically with the number of electrons in the system.

3.2 Learning on grids

Grids capture the properties of a system at regularly spaced intervals. Based on the number of dimensions included in the system, grids can be 1D (e.g., sequences), 2D (e.g., RGB images), 3D (e.g., cubic lattices), or higher-dimensional. Grids are defined by a Euclidean geometry and can be considered as a graph with a special adjacency, where (i) the vertices have a fixed ordering that is defined by the spatial dimensions of the grid, and (ii) each vertex has an identical number of adjacent edges and is therefore indistinguishable from all other vertices structure-wise [21]. These two properties

render local convolutions applied to a grid inherently permutation invariant, and provide a strong geometric prior for translation invariance (e.g., by weight sharing in convolutions). These grid properties have critically determined the success of convolutional neural networks (CNNs), e.g., in computer vision [33, 34], natural language processing [9, 110], and speech recognition [10, 11].

[Figure 2, panels a and b: feature labeling, feature updates (a: message passing; b: equivariant message passing), and aggregation, yielding molecular and atomic properties (and, in panel a, bond properties).]
Figure 2: Deep learning on molecular graphs.
a. Message passing graph neural networks applied to two-dimensional (2D) molecular graphs: 2D molecular graph G = (V, E) with its labeled vertex (atom) features ($v_i \in \mathbb{R}^{d_v}$), and edge (bond) features ($e_{ij} \in \mathbb{R}^{d_e}$). Vertex features are updated by iterative message passing for a defined number of time steps T across each pair of vertices $v_i$ and $v_j$, connected via an edge $e_{ji}$. After the last message passing convolution, the final vertex $v_i^{t}$ can be (i) mapped to a bond ($y_{ij}$) or atom ($y_i$) property, or (ii) aggregated to form molecular features (that can be mapped to a molecular property $y$).
b. E(3)-equivariant message passing graph neural networks applied to three-dimensional (3D) molecular graphs: 3D graphs G = (V, E, R) that are labeled with atom features ($v_i \in \mathbb{R}^{d_v}$), their absolute coordinates in 3D space ($r_i \in \mathbb{R}^{3}$) and their edge features ($e_{ij} \in \mathbb{R}^{d_e}$). Iterative spherical convolutions are used to obtain data-driven atomic features ($v_i^{t}$), which can be mapped to atomic properties or aggregated, and mapped to molecular properties ($y_i$ and $y$, respectively).

3.2.1 Molecular grids

Molecules can be represented as grids in different ways. 2D grids (e.g., molecular structure drawings) are generally more useful for visualization than for prediction, with few exceptions [111]. Analogous to some popular pre-deep learning approaches, for example comparative molecular field analysis (CoMFA) [31] and comparative molecular similarity indices analysis (CoMSIA) [32], 3D grids are often used to capture the spatial distribution of the properties within one (or more) molecular conformer. Such representations are then used as inputs to the 3D CNNs. 3D CNNs are characterized by a greater resource efficiency than equivariant GNNs, which until now have mainly been applied to molecules with fewer than approximately 100 atoms. Thus, 3D CNNs are a method of choice when the protein structure has to be considered, e.g., for protein-ligand binding affinity prediction [49, 50, 112–114], or active site recognition [51].

3.3 Learning on molecular surfaces

Molecular surfaces are defined by the surface enclosing the 3D structure of a molecule at a certain distance from each atom center. Each point on such a continuous surface is characterized by its chemical (e.g., hydrophobicity, electrostatics) and geometric features (e.g., shape, curvature). From a geometrical perspective, molecular surfaces are considered as 3D meshes, i.e., a set of polygons called faces described in terms of a set of vertices that describe how the mesh coordinates exist in the 3D space [115]. The vertices can be represented by a 2D grid structure (where four vertices on the mesh define a pixel) or by a 3D graph structure. The grid- and graph-based structures of meshes enable applications of 2D CNNs and GNNs to learn on mesh-based

6 molecular surfaces. Recently, 2D CNNs have been ap- 3.4.2 Chemical language models plied to learn on mesh-based representations of protein Chemical language models are machine learning meth- surfaces to predict protein-protein complexes and pro- ods that can handle molecular sequences as inputs tein binding sites [18]. However, 2D CNNs applied to and/or outputs. The most common algorithms for meshes come with certain limitations, such as the need chemical language modeling are Recurrent neural net- for rotational data augmentation (due to the network works (RNNs) and Transformers: invariance) and for homogenous mesh resolution. Re- cently introduced GNNs for mesh-based representations • RNNs (Figure 3b) [130] are neural networks that have been shown to incorporate rotational equivariance process sequence data as Euclidean structures, into their network architecture and allow for hetero- usually via one-hot-encoding. RNNs model a dy- geneous mesh resolution (i.e., non-consistent distance namic system in which the hidden state (ht) of between the vertices in 3D) [116]. Such GNNs are the network at any t-th time point (i.e., at any t- computationally efficient and have potential for mod- th position in the sequence) depends on both the eling macromolecular structures; however, they have current observation (st) and the previous hidden not yet found applications in molecular systems. Other state (ht−1). RNNs can process sequence inputs studies have used 3D voxel-based surface representa- of arbitrary lengths and provide outputs of arbi- tions of (macro)molecules as inputs to 3D CNNs, e.g., trary lengths. RNNs are often used in an "auto- for protein-ligand affinity [117] and protein binding-site regressive" fashion, i.e., to predict the probability [118] prediction. distribution over the next possible elements (to- kens) at the time step t+1, given the current hid- 3.4 Learning on string representations den state (ht) and the preceding portions of the sequence. 
Several RNN architectures have been 3.4.1 Molecular strings proposed to solve the gradient vanishing or ex- Molecules can be represented as molecular strings, i.e., ploding problems of "vanilla" RNNs [131, 132], linear sequences of alphanumeric symbols. Molecu- such as long short-term memory [110] and gated lar strings were originally developed as manual cipher- recurrent units [133]. ing tools to complement systematic chemical nomen- • Transformers (Figure 3c) process sequence data clature [119, 120] and later became suitable for data as non-Euclidean structures, by encoding se- storage and retrieval. Some of the most popular string- quences as either (i) a fully connected graph, or based representations are the Wiswesser Line Notation (ii) a sequentially connected graph, where each [121], the Sybyl line notation [122], the International token is only connected to the previous tokens Chemical Identifier (InChI) [123], Hierarchical Editing in the sequence. The former approach is of- Language for Macromolecules [124], and the Simplified ten used for feature extraction in general (e.g., Molecular Input Line Entry System (SMILES) [20]. in a Transformer-encoder), whereas the latter is Each type of linear representation can be considered employed for next-token prediction e.g. in a as a "chemical language." In fact, such notations pos- Transformer-decoder). The positional informa- sess a defined syntax, i.e., not all possible combinations tion of tokens is usually encoded by positional of alphanumerical characters will lead to a “chemically embedding or sinusoidal positional encoding [8]. valid” molecule. Furthermore, these notations possess Transformers combine graph-like processing with semantic properties: depending on how the elements the so-called attention layers. 
Attention layers of the string are combined, the corresponding molecule allow Transformers to focus on ("pay attention will have different physicochemical and biological prop- to") the perceived relevant tokens for each pre- erties. These characteristics make it possible to extend diction. Transformers have been particularly suc- the deep learning methods developed for language and cessful in sequence-to-sequence tasks, such as lan- sequence modeling to the analysis of molecular strings guage translation. for "chemical language modeling" [125, 126]. SMILES strings – in which letters are used to Extending early studies [19, 134, 135], RNNs for represent atoms, and symbols and numbers are used next-token prediction have been routinely applied to to encode bond types, connectivity, branching, and the de novo generation of molecules with desired bi- stereochemistry (Figure 3a) – have become the most ological or physicochemical properties, in combination frequently employed data representation method for with transfer [19, 53, 136, 137] or reinforcement learn- sequence-based deep learning [19, 54]. Whereas sev- ing [138, 139]. In this context, RNNs have shown re- eral other string representations have been tested in markable capability to learn the SMILES syntax [19, combination with deep learning, e.g., InChI [127], 136], and capture high-level molecular features ("se- DeepSMILES [128], and self-referencing embedded mantics"), such as physicochemical [19, 136] and biolog- strings (SELFIES) [129], SMILES remains the de facto ical properties [53, 134, 137, 140]. In this context, data representation of choice for chemical language model- augmentation based on SMILES randomization [135, ing. The following text introduces the most prominent 141] or bidirectional learning [142] have proven to be chemical language modeling methods, along with se- efficient for improving the quality of the chemical lan- lected examples of their application to chemistry. guage learned by RNNs. 
Most published studies have used SMILES strings or derivative representations. In a few studies, one-letter amino acid sequences were employed for peptide design [52, 143–146]. RNNs have also been applied to predict ligand–protein interactions and the pharmacokinetic properties of drugs [57, 58], protein secondary structure [55, 56], and the temporal evolution of molecular trajectories [147]. RNNs have been applied for molecular feature extraction [148, 149], showing that the learned features outperformed both traditional molecular descriptors and graph-convolution methods for virtual screening and property prediction [148]. The Fréchet ChemNet distance [150], which is based on the physicochemical and biological features learned by an RNN model, has become the de facto reference method to capture molecular similarity in this context.

[Figure 3, panels a–c. Panel a shows three SMILES strings of the same molecule: N12[C@@H](CC1)SC(C)(C)[C@@H]2(C(O)=O); CC1(C)[C@H](C(O)=O)N2[C@@H](CC2)S1; O=C(O)C[C@@H]1N2[C@@H](CC2)SC1(C)(C). Panel c labels: Sequence; Graph; Feature labelling; Feature updates (residual attention blocks).]

Figure 3: Chemical language modeling.
a. SMILES strings, in which atom types are represented by their element symbols, and bond types and branching are indicated by other predefined alphanumeric symbols. For each molecule, via the SMILES algorithm a string of $T$ symbols ("tokens") is obtained ($s = \{s_1, s_2, \ldots, s_T\}$), which encodes the molecular connectivity, herein illustrated via the color that indicates the corresponding atomic position in the graph (left) and string (right). A molecule can be encoded via different SMILES strings depending on the chosen starting atom. Three random permutations incorporating identical molecular information are presented.
b. Recurrent neural networks, at any sequence position $t$, learn to predict the next token $s_{t+1}$ of a sequence $s$ given the current sequence ($\{s_1, s_2, \ldots, s_t\}$) and hidden state $h_t$.
c. Transformer-based language models, in which the input sequence is structured as a graph. Vertices are featurized according to their token identity (e.g., via token embedding, $v_i \in \mathbb{R}^{d_v}$) and their position in the sequence (e.g., via sinusoidal positional encoding, $p_i \in \mathbb{R}^{d_v}$). During transformer learning, the vertices are updated via $T$ residual attention blocks. After passing $T$ attention layers, an individual feature representation $s_t$ for each token is obtained.

Molecular Transformers have been applied to CASP, which can be cast as a sequence-to-sequence translation task, in which the string representations of the reactants are mapped to those of the corresponding product, or vice versa. Since their initial applications [151], Transformers have been employed to predict multi-step syntheses [152], regio- and stereoselective reactions [153], enzymatic reaction outcomes [154], and reaction yields and classes [61, 155, 156]. Recently, Transformers have been applied to molecular property prediction [157, 158] and optimization [159]. Transformers have also been used for de novo molecule design by learning to translate the target protein sequence into SMILES strings of the corresponding ligands [62]. Representations learned from SMILES strings by Transformers have shown promise for property prediction in low-data regimes [160]. Furthermore, Transformers have recently been combined with E(3) and SE(3) equivariant layers to learn the 3D structures of proteins from their amino-acid sequence [6, 7]. These equivariant Transformers achieve state-of-the-art performance in protein structure prediction.

Other deep learning approaches have relied on string-based representations for de novo design, e.g., conditional generative adversarial networks [161–163] and variational autoencoders [164, 165]. Most of these models, however, have limited or equivalent ability to automatically learn SMILES syntax, as compared to RNNs. 1D CNNs [166, 167] and self-attention networks [168–170] have been used with SMILES for property prediction. Recently, deep learning on amino acid sequences for property prediction was shown to perform on par with approaches based on human-engineered features [171].
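The vertex featurization mentioned in the Figure 3c caption (a token embedding $v_i$ combined with a sinusoidal positional encoding $p_i$, both of dimension $d_v$) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the embedding table is randomly initialized rather than learned, and the two vectors are combined by summation as in the original Transformer [8], which the caption itself does not specify.

```python
import math
import random


def sinusoidal_encoding(position, dim):
    """Positional vector p_i in R^dim: sine on even indices, cosine on odd."""
    vec = []
    for k in range(dim):
        # Index pairs (2j, 2j+1) share one frequency.
        freq = 1.0 / (10000 ** ((2 * (k // 2)) / dim))
        angle = position * freq
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


def featurize(tokens, dim=8, seed=0):
    """Return one input vector per token: embedding v_i plus encoding p_i."""
    rng = random.Random(seed)
    # Assumption: random embeddings stand in for a learned embedding table.
    table = {
        tok: [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for tok in sorted(set(tokens))
    }
    return [
        [v + p for v, p in zip(table[tok], sinusoidal_encoding(i, dim))]
        for i, tok in enumerate(tokens)
    ]


if __name__ == "__main__":
    for vector in featurize(["C", "C", "O"], dim=4):
        print([round(x, 3) for x in vector])
```

Because the positional term differs at every position, two identical tokens (here the two carbon atoms) receive different input vectors, which lets the attention blocks distinguish them.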

Box 3: Structure-activity landscape modeling with geometric deep learning

This worked example shows how geometric deep learning (GDL) can be used to interpret the structure-activity landscape learned by a trained model. Starting from a publicly available molecular dataset containing estrogen receptor binding information [172], we trained an E(3)-equivariant graph neural network (six hidden layers, 128 hidden neurons per layer) and analyzed the learned features and their relationship to ligand binding to the estrogen receptor. The figure shows an analysis of the learned molecular features (third hidden layer, analyzed via principal component analysis; the first two principal components are shown), and how these features relate to the density of active and inactive molecules in the chemical space. The network successfully separated the molecules based on both their experimental bioactivity and their structural features (e.g., atom scaffolds [173]) and might offer novel opportunities for explainable AI with GDL.

[Box 3 figure: first two principal components of the learned molecular features, with a scale from low to high density of actives and representative molecular structures with their atom scaffolds.]
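The projection step described in Box 3 (principal component analysis of hidden-layer features, keeping the first two components) can be sketched in plain Python as below. This is an illustrative sketch only: the function names are hypothetical, the components are found by power iteration with orthogonalization rather than by a library eigensolver, and the toy input stands in for the 128-dimensional hidden features, which are not reproduced here.

```python
import math
import random


def _orthogonalize(vec, basis):
    """Remove from vec its components along the unit vectors in basis."""
    for b in basis:
        dot = sum(x * y for x, y in zip(vec, b))
        vec = [x - dot * y for x, y in zip(vec, b)]
    return vec, math.sqrt(sum(x * x for x in vec))


def pca_2d(features, iters=200, seed=0):
    """Project feature vectors onto their first two principal components."""
    rng = random.Random(seed)
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in features]
    # Sample covariance matrix of the centered features (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    components = []
    for _component in range(2):
        start = [rng.gauss(0.0, 1.0) for _ in range(d)]
        vec, norm = _orthogonalize(start, components)
        vec = [x / norm for x in vec]
        for _step in range(iters):
            new = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
            new, norm = _orthogonalize(new, components)
            if norm < 1e-12:  # remaining variance is numerically zero
                break
            vec = [x / norm for x in new]
        components.append(vec)
    return [
        [sum(x * c for x, c in zip(row, comp)) for comp in components]
        for row in centered
    ]


if __name__ == "__main__":
    toy = [[float(t), 2.0 * t, 0.0] for t in range(5)]
    for point in pca_2d(toy):
        print([round(x, 3) for x in point])
```

In an analysis like the one in Box 3, each row of `features` would be the hidden representation of one molecule, and the resulting two-dimensional points could then be compared with the experimental activity labels.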

4 Conclusions and outlook

Geometric deep learning in chemistry has allowed researchers to leverage the symmetries of different unstructured molecular representations, resulting in a greater flexibility and versatility of the available computational models for molecular structure generation and property prediction. Such approaches represent a valid alternative to classical chemoinformatics approaches that are based on molecular descriptors or other human-engineered features. For modeling tasks that are usually characterized by the need for highly engineered rules (e.g., chemical transformations for de novo design, and reactive site specification for CASP), the benefits of GDL have been consistently shown. In published applications of GDL, each molecular representation has shown characteristic strengths and weaknesses.

Molecular strings, like SMILES, have proven particularly suited for generative deep learning tasks, such as de novo design and CASP. This success may be due to the relatively easy syntax of such a chemical language, which facilitates next-token and sequence-to-sequence prediction. For molecular property prediction, SMILES strings could be limited due to their non-univocity.

Molecular graphs have shown particular usefulness for property prediction, partly because of their human interpretability and ease of inclusion of desired edge and node features. The incorporation of 3D information (e.g., with equivariant message passing) is useful for quantum chemistry related modeling, whereas in drug discovery applications, this approach has often failed to clearly outbalance the increased complexity of the model. E(3)-equivariant graph neural networks have also been applied for conformation-aware de novo design [174], but prospective experimental validation studies have not yet been published.

Molecular grids have become the de facto standard for 3D representations of large molecular systems, due to (i) their ability to capture information at a user-defined resolution (voxel density) and (ii) the Euclidean structure of the input grid.

Finally, molecular surfaces are currently at the forefront of GDL. We expect many interesting applications of GDL on molecular surfaces in the near future.

To further the application and impact of GDL in chemistry, an evaluation of the optimal trade-off between algorithmic complexity, performance, and model interpretability will be required. These aspects are crucial for reconciling the “two QSARs” [175] and connecting the computer science and chemistry communities. We encourage GDL practitioners to include aspects of interpretability in their models (e.g., via XAI [89]) whenever possible and transparently communicate with domain experts. The feedback from domain experts will also be crucial to develop new "chemistry-aware" architectures, and further the potential of molecular GDL for concrete prospective applications.

The potential of GDL for molecular feature extraction has not yet been fully explored. Several studies have shown the benefits of learned representations compared to classical molecular descriptors, but in other

cases, GDL failed to live up to its promise in terms of superior learned features. Although there are several benchmarks for evaluating machine learning models for property prediction [176, 177] and molecule generation [178, 179], at present, there is no such framework to enable the systematic evaluation of the usefulness of data-driven features learned by AI. Such benchmarks and systematic studies are key to obtaining an unvarnished assessment of deep representation learning. Moreover, investigating the relationships between the learned features and the physicochemical and biological properties of the input molecules will augment the interpretability and applicability of GDL, e.g., to modeling structure-function relationships like structure-activity landscapes (Box 3).

Compared to conventional QSAR approaches, in which the assessment of the applicability domain (i.e., the region of the chemical space where model predictions are considered reliable) has been routinely performed, contemporary GDL studies lack such an assessment. This systematic gap might constitute one of the limiting factors to the more widespread use of GDL approaches for prospective studies, as it could lead to unreliable predictions, e.g., for molecules with different mechanisms of action, functional groups, or physicochemical properties than the training data. In the future, it will be necessary to devise “geometry-aware” approaches for applicability domain assessment.

Another opportunity will be to leverage less explored molecular representations for GDL. For instance, the electronic structure of molecules has vast potential for tasks such as CASP, molecular property prediction, and prediction of macromolecular interactions (e.g. protein-protein interactions). Although accurate statistical and quantum mechanical simulations are computationally expensive, modern quantum machine learning models [180–184] trained on large quantum data collections [185–187] allow quantum information to be accessed much faster with high accuracy. This aspect could enable quantum and electronic featurization of extensive molecular datasets, to be used as input molecular representations for the task of interest.

Deep learning can be applied to a multitude of biological and chemical representations. The corresponding deep neural network models have the potential to augment human creativity, paving the way for new scientific studies that were previously unfeasible. However, research has only explored the tip of the iceberg. One of the most significant catalysts for the integration of deep learning in molecular sciences may be the responsibility of academic institutions to foster interdisciplinary collaboration, communication, and education. Picking the "high hanging fruits" will only be possible with a deep understanding of both chemistry and computer science, along with out-of-the-box thinking and collaborative creativity. In such a setting, we expect molecular GDL to increase the understanding of molecular systems and biological phenomena.

5 Acknowledgements

This research was supported by the Swiss National Science Foundation (SNSF, grant no. 205321_182176) and the ETH RETHINK initiative.

6 Competing interest

G.S. declares a potential financial conflict of interest as co-founder of inSili.com LLC, Zurich, and in his role as scientific consultant to the pharmaceutical industry.

7 List of abbreviations

AD: Applicability Domain
AI: Artificial Intelligence
CASP: Computer-aided Synthesis Planning
CNN: Convolutional Neural Network
DFT: Density Functional Theory
E(3): Euclidean Symmetry Group
GDL: Geometric Deep Learning
GNN: Graph Neural Network
QSAR: Quantitative Structure-Activity Relationship
RNN: Recurrent Neural Network
SE(3): Special Euclidean Symmetry Group
SMILES: Simplified Molecular Input Line Entry Systems
XAI: Explainable Artificial Intelligence
1D: One-dimensional
2D: Two-dimensional
3D: Three-dimensional

References

1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
2. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks 61, 85–117 (2015).
3. Gawehn, E., Hiss, J. A. & Schneider, G. Deep learning in drug discovery. Molecular Informatics 35, 3–14 (2016).
4. Jiménez-Luna, J., Grisoni, F., Weskamp, N. & Schneider, G. Artificial intelligence in drug discovery: Recent advances and future perspectives. Expert Opinion on Drug Discovery, 1–11 (2021).
5. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212 (2017).
6. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature, 1–11 (2021).
7. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science (2021).
8. Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems, 5998–6008 (2017).

9. Brown, T. B. et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
10. Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 82–97 (2012).
11. Mikolov, T., Deoras, A., Povey, D., Burget, L. & Černocky, J. Strategies for training large scale neural network language models. 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 196–201 (2011).
12. Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Communications of the ACM 60, 84–90 (2017).
13. Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Learning hierarchical features for scene labeling. IEEE transactions on pattern analysis and machine intelligence 35, 1915–1929 (2012).
14. Tompson, J. J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. Advances in Neural Information Processing Systems, 1799–1807 (2014).
15. Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A. & Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34, 18–42 (2017).
16. Monti, F., Frasca, F., Eynard, D., Mannion, D. & Bronstein, M. M. Fake news detection on social media using geometric deep learning. arXiv preprint arXiv:1902.06673 (2019).
17. Todeschini, R. & Consonni, V. Molecular descriptors for chemoinformatics: volume I: alphabetical listing/volume II: appendices, references (John Wiley & Sons, 2009).
18. Gainza, P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nature Methods 17, 184–192 (2020).
19. Segler, M. H., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science 4, 120–131 (2018).
20. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences 28, 31–36 (1988).
21. Bronstein, M. M., Bruna, J., Cohen, T. & Veličković, P. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478 (2021).
22. Marsden, J. & Weinstein, A. Reduction of symplectic manifolds with symmetry. Reports on mathematical physics 5, 121–130 (1974).
23. Cohen, T. & Welling, M. Group equivariant convolutional networks. International conference on machine learning, 2990–2999 (2016).
24. Cohen, T. S. & Welling, M. Steerable cnns. arXiv preprint arXiv:1612.08498 (2016).
25. Cohen, T. S., Geiger, M., Köhler, J. & Welling, M. Spherical cnns. arXiv preprint arXiv:1801.10130 (2018).
26. Kondor, R. & Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. International Conference on Machine Learning, 2747–2755 (2018).
27. Moriguchi, I., Hirono, S., Liu, Q., Nakagome, I. & Matsushita, Y. Simple method of calculating octanol/water partition coefficient. Chemical and pharmaceutical bulletin 40, 127–130 (1992).
28. Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 2, 303–314 (1989).
29. Tetko, I. V., Karpov, P., Van Deursen, R. & Godin, G. State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis. Nature communications 11, 1–11 (2020).
30. Skinnider, M. A., Stacey, R. G., Wishart, D. S. & Foster, L. J. Chemical language models enable navigation in sparsely populated chemical space. Nature Machine Intelligence, 1–12 (2021).
31. Cramer, R. D., Patterson, D. E. & Bunce, J. D. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. Journal of the American Chemical Society 110, 5959–5967 (1988).
32. Klebe, G. Comparative molecular similarity indices analysis: CoMSIA. 3D QSAR in drug design, 87–104 (1998).
33. LeCun, Y., Bengio, Y., et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361, 1995 (1995).
34. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324 (1998).
35. Sutton, R. S. & Barto, A. G. Reinforcement learning: An introduction (MIT press, 2018).
36. Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 1345–1359 (2009).
37. Feinberg, E. N. et al. PotentialNet for molecular property prediction. ACS Central Science 4, 1520–1530 (2018).

38. Jiménez-Luna, J., Skalic, M., Weskamp, N. & Schneider, G. Coloring molecules with explainable artificial intelligence for preclinical relevance assessment. Journal of Chemical Information and Modeling 61, 1083–1094 (2021).
39. Somnath, V. R., Bunne, C., Coley, C. W., Krause, A. & Barzilay, R. Learning Graph Models for Retrosynthesis Prediction. arXiv preprint arXiv:2006.07038 (2020).
40. Jin, W., Coley, C., Barzilay, R. & Jaakkola, T. Predicting organic reaction outcomes with weisfeiler-lehman network. Advances in Neural Information Processing Systems, 2607–2616 (2017).
41. Zhou, Z., Kearnes, S., Li, L., Zare, R. N. & Riley, P. Optimization of molecules via deep reinforcement learning. Scientific Reports 9, 1–10 (2019).
42. Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364 (2018).
43. Miller, B. K., Geiger, M., Smidt, T. E. & Noé, F. Relevance of rotationally equivariant convolutions for predicting molecular properties. arXiv preprint arXiv:2008.08461 (2020).
44. Anderson, B., Hy, T. S. & Kondor, R. Cormorant: Covariant molecular neural networks. Advances in Neural Information Processing Systems, 14510–14519 (2019).
45. Satorras, V. G., Hoogeboom, E. & Welling, M. E(n) Equivariant Graph Neural Networks. arXiv preprint arXiv:2102.09844 (2021).
46. Fuchs, F. B., Worrall, D. E., Fischer, V. & Welling, M. SE(3)-transformers: 3D roto-translation equivariant attention networks. arXiv preprint arXiv:2006.10503 (2020).
47. Schütt, K. T., Unke, O. T. & Gastegger, M. Equivariant message passing for the prediction of tensorial properties and molecular spectra. arXiv preprint arXiv:2102.03150 (2021).
48. Unke, O. T. et al. SE(3)-equivariant prediction of molecular wavefunctions and electronic densities. arXiv: 2106.02347 [physics.chem-ph] (2021).
49. Jiménez, J., Skalic, M., Martinez-Rosell, G. & De Fabritiis, G. KDEEP: Protein–ligand absolute binding affinity prediction via 3d-convolutional neural networks. Journal of Chemical Information and Modeling 58, 287–296 (2018).
50. Jiménez, J. et al. DeltaDelta neural networks for lead optimization of small molecule potency. Chemical Science 10, 10911–10918 (2019).
51. Jiménez, J., Doerr, S., Martinez-Rosell, G., Rose, A. S. & De Fabritiis, G. DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics 33, 3036–3042 (2017).
52. Grisoni, F. et al. Designing anticancer peptides by constructive machine learning. ChemMedChem 13, 1300–1302 (2018).
53. Merk, D., Grisoni, F., Friedrich, L. & Schneider, G. Tuning artificial intelligence on the de novo design of natural-product-inspired retinoid X receptor modulators. Communications Chemistry 1, 1–9 (2018).
54. Schwaller, P., Gaudin, T., Lanyi, D., Bekas, C. & Laino, T. Found in Translation: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical Science 9, 6091–6098 (2018).
55. Senior, A. W. et al. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins: Structure, Function, and Bioinformatics 87, 1141–1148 (2019).
56. Zhou, S., Zou, H., Liu, C., Zang, M. & Liu, T. Combining Deep Neural Networks for Protein Secondary Structure Prediction. IEEE Access 8, 84362–84370 (2020).
57. Wang, X. et al. Optimizing Pharmacokinetic Property Prediction Based on Integrated Datasets and a Deep Learning Approach. Journal of Chemical Information and Modeling 60, 4603–4613 (2020).
58. Zheng, S., Li, Y., Chen, S., Xu, J. & Yang, Y. Predicting drug–protein interaction using quasi-visual question answering system. Nature Machine Intelligence 2, 134–140 (2020).
59. Mayr, A. et al. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chemical Science 9, 5441–5451 (2018).
60. Schwaller, P. et al. Molecular transformer for chemical reaction prediction and uncertainty estimation. arXiv preprint arXiv:1811.02633 (2018).
61. Schwaller, P., Vaucher, A. C., Laino, T. & Reymond, J.-L. Prediction of chemical reaction yields using deep learning. Machine Learning: Science and Technology 2, 015016 (2021).
62. Grechishnikova, D. Transformer neural network for protein-specific de novo drug generation as a machine translation problem. Scientific Reports 11, 1–13 (2021).
63. Hoffmann, R. & Laszlo, P. Representation in chemistry. Angewandte Chemie International Edition in English 30, 1–16 (1991).
64. Nguyen, L. A., He, H. & Pham-Huy, C. Chiral drugs: an overview. International Journal of Biomedical Science: IJBS 2, 85 (2006).
65. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
66. Battaglia, P. W., Pascanu, R., Lai, M., Rezende, D. & Kavukcuoglu, K. Interaction networks for learning about objects, relations and physics. arXiv preprint arXiv:1612.00222 (2016).

67. Battaglia, P. W. et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018).
68. Zhou, J. et al. Graph neural networks: A review of methods and applications. AI Open 1, 57–81 (2020).
69. Geerts, F., Mazowiecki, F. & Pérez, G. A. Let’s Agree to Degree: Comparing Graph Convolutional Networks in the Message-Passing Framework. arXiv preprint arXiv:2004.02593 (2020).
70. Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints. Advances in Neural Information Processing Systems, 2224–2232 (2015).
71. Klicpera, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123 (2020).
72. Zhang, S., Liu, Y. & Xie, L. Molecular Mechanics-Driven Graph Neural Network with Multiplex Graph for Molecular Structures. arXiv preprint arXiv:2011.07457 (2020).
73. Withnall, M., Lindelöf, E., Engkvist, O. & Chen, H. Building attention and edge message passing neural networks for bioactivity and physical–chemical property prediction. Journal of Cheminformatics 12, 1 (2020).
74. Tang, B. et al. A self-attention based message passing neural network for predicting molecular lipophilicity and aqueous solubility. Journal of Cheminformatics 12, 1–9 (2020).
75. Goodall, R. E. & Lee, A. A. Predicting materials properties without crystal structure: Deep representation learning from stoichiometry. Nature Communications 11, 1–9 (2020).
76. Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702 (2020).
77. Torng, W. & Altman, R. B. Graph convolutional neural networks for predicting drug-target interactions. Journal of Chemical Information and Modeling 59, 4131–4149 (2019).
78. Li, J., Cai, D. & He, X. Learning graph-level representation for drug discovery. arXiv preprint arXiv:1709.03741 (2017).
79. Liu, K. et al. Chemi-Net: a molecular graph convolutional network for accurate drug property prediction. International journal of Molecular Sciences 20, 3389 (2019).
80. Unke, O. T. & Meuwly, M. PhysNet: a neural network for predicting energies, forces, dipole moments, and partial charges. Journal of Chemical Theory and Computation 15, 3678–3693 (2019).
81. Klicpera, J., Giri, S., Margraf, J. & Günnemann, S. Fast and Uncertainty Aware Directional Message Passing for Non Equilibrium Molecules. arXiv preprint arXiv:2011.14115 (2020).
82. Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet–A deep learning architecture for molecules and materials. The Journal of Chemical Physics 148, 241722 (2018).
83. Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R. & Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nature Communications 8, 1–8 (2017).
84. Schütt, K., Gastegger, M., Tkatchenko, A., Müller, K.-R. & Maurer, R. J. Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions. Nature Communications 10, 1–10 (2019).
85. Bogojeski, M., Vogt-Maranto, L., Tuckerman, M. E., Müller, K.-R. & Burke, K. Quantum chemical accuracy from density functional approximations via machine learning. Nature Communications 11, 1–11 (2020).
86. Yang, K. et al. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling 59, 3370–3388 (2019).
87. Axelrod, S. & Gomez-Bombarelli, R. Molecular machine learning with conformer ensembles. arXiv preprint arXiv:2012.08452 (2020).
88. Gunning, D. Explainable artificial intelligence. Defense Advanced Research Projects Agency 2 (2017).
89. Jiménez-Luna, J., Grisoni, F. & Schneider, G. Drug discovery with explainable artificial intelligence. Nature Machine Intelligence 2, 573–584 (2020).
90. Schnake, T. et al. XAI for Graphs: Explaining Graph Neural Network Predictions by Identifying Relevant Walks. arXiv preprint arXiv:2006.03589 (2020).
91. Li, Y., Vinyals, O., Dyer, C., Pascanu, R. & Battaglia, P. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324 (2018).
92. Simonovsky, M. & Komodakis, N. Graphvae: Towards generation of small graphs using variational autoencoders. International Conference on Artificial Neural Networks, 412–422 (2018).
93. De Cao, N. & Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973 (2018).
94. Flam-Shepherd, D., Wu, T. C. & Aspuru-Guzik, A. MPGVAE: Improved Generation of Small Organic Molecules using Message Passing Neural Nets. Machine Learning: Science and Technology (2021).
95. You, J., Liu, B., Ying, Z., Pande, V. & Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. Advances in Neural Information Processing Systems, 6410–6421 (2018).

96. Jin, W., Barzilay, R. & Jaakkola, T. Multi-objective molecule generation using interpretable substructures. International Conference on Machine Learning, 4849–4859 (2020).
97. Coley, C. W. et al. A graph-convolutional neural network model for the prediction of chemical reactivity. Chemical Science 10, 370–377 (2019).
98. Lei, T., Jin, W., Barzilay, R. & Jaakkola, T. Deriving neural architectures from sequence and graph kernels. arXiv preprint arXiv:1705.09037 (2017).
99. Thomas, N. et al. Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219 (2018).
100. Smidt, T. E., Geiger, M. & Miller, B. K. Finding symmetry breaking order parameters with Euclidean neural networks. Physical Review Research 3, L012002 (2021).
101. Kondor, R., Lin, Z. & Trivedi, S. Clebsch–gordan nets: a fully fourier space spherical convolutional neural network. Advances in Neural Information Processing Systems, 10117–10126 (2018).
102. Smidt, T. E. Euclidean symmetry and equivariance in machine learning. Trends in Chemistry (2020).
103. Hutchinson, M. et al. LieTransformer: Equivariant self-attention for Lie Groups. arXiv preprint arXiv:2012.10885 (2020).
104. Batzner, S. et al. SE(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate Interatomic Potentials. arXiv preprint arXiv:2101.03164 (2021).
105. Müller, C. Spherical harmonics (Springer, 2006).
106. Dray, T. A unified treatment of Wigner D functions, spin-weighted spherical harmonics, and monopole harmonics. Journal of mathematical physics 27, 781–792 (1986).
107. Hermann, J., Schätzle, Z. & Noé, F. Deep-neural-network solution of the electronic Schrödinger equation. Nature Chemistry 12, 891–897 (2020).
108. Pfau, D., Spencer, J. S., Matthews, A. G. & Foulkes, W. M. C. Ab initio solution of the many-electron Schrödinger equation with deep neural networks. Physical Review Research 2, 033429 (2020).
109. Choo, K., Mezzacapo, A. & Carleo, G. Fermionic neural-network states for ab-initio electronic structure. Nature Communications 11, 1–7 (2020).
110. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Computation 9, 1735–1780 (1997).
111. Tabchouri, S. A machine learning approach to molecular structure recognition in chemical literature PhD thesis (Massachusetts Institute of Technology, 2019).
112. Ragoza, M., Hochuli, J., Idrobo, E., Sunseri, J. & Koes, D. R. Protein–ligand scoring with convolutional neural networks. Journal of Chemical Information and Modeling 57, 942–957 (2017).
113. Li, Y., Rezaei, M. A., Li, C. & Li, X. Deepatom: A framework for protein-ligand binding affinity prediction. 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 303–310 (2019).
114. Karimi, M., Wu, D., Wang, Z. & Shen, Y. DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 35, 3329–3338 (2019).
115. Ahmed, E. et al. A survey on deep learning advances on different 3D data representations. arXiv preprint arXiv:1808.01462 (2018).
116. Pfaff, T., Fortunato, M., Sanchez-Gonzalez, A. & Battaglia, P. W. Learning Mesh-Based Simulation with Graph Networks. arXiv preprint arXiv:2010.03409 (2020).
117. Liu, Q. et al. OctSurf: Efficient hierarchical voxel-based molecular surface representation for protein-ligand affinity prediction. Journal of Molecular Graphics and Modelling 105, 107865 (2021).
118. Mylonas, S. K., Axenopoulos, A. & Daras, P. DeepSurf: A surface-based deep learning approach for the prediction of ligand binding sites on proteins. arXiv preprint arXiv:2002.05643 (2020).
119. Barnard, J. M. Representation of Molecular Structures-Overview. Handbook of Chemoinformatics: From Data to Knowledge in 4 Volumes, 27–50 (2003).
120. Wiswesser, W. J. Historic development of chemical notations. Journal of Chemical Information and Computer Sciences 25, 258–263 (1985).
121. Wiswesser, W. J. The Wiswesser Line Formula Notation. Chemical & Engineering News Archive 30, 3523–3526 (1952).
122. Ash, S., Cline, M. A., Homer, R. W., Hurst, T. & Smith, G. B. SYBYL line notation (SLN): A versatile language for chemical structure representation. Journal of Chemical Information and Computer Sciences 37, 71–79 (1997).
123. Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D. & Pletnev, I. InChI the worldwide chemical structure identifier standard. Journal of Cheminformatics 5, 1–9 (2013).
124. Zhang, T., Li, H., Xi, H., Stanton, R. V. & Rotstein, S. H. HELM: A Hierarchical Notation Language for Complex Biomolecule Structure Representation. Journal of Chemical Information and Modeling 52, 2796–2806 (2012).

125. Öztürk, H., Özgür, A., Schwaller, P., Laino, T. & Ozkirimli, E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discovery Today 25, 689–705 (2020).
126. Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler-Doty, M. & Grzybowski, B. A. Organic Chemistry as a Language and the Implications of Chemical Linguistics for Structural and Retrosynthetic Analyses. Angewandte Chemie International Edition 53, 8108–8112 (2014).
127. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science 4, 268–276 (2018).
128. O’Boyle, N. & Dalke, A. DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures. ChemRxiv. Preprint. chemrxiv.7097960.v1.
129. Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology 1, 045024 (2020).
130. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning internal representations by error propagation tech. rep. (California Univ San Diego La Jolla Inst for Cognitive Science, 1985).
131. Hochreiter, S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6, 107–116 (1998).
132. Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. International Conference on Machine Learning, 1310–1318 (2013).
133. Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
134. Yuan, W. et al. Chemical Space Mimicry for Drug Discovery. Journal of Chemical Information and Modeling 57. PMID: 28257191, 875–882 (2017).
135. Bjerrum, E. J. & Threlfall, R. Molecular generation with recurrent neural networks (RNNs). arXiv preprint arXiv:1705.04612 (2017).
136. Gupta, A. et al. Generative recurrent networks for de novo drug design. Molecular Informatics 37, 1700111 (2018).
137. Merk, D., Friedrich, L., Grisoni, F. & Schneider, G. De novo design of bioactive small molecules by artificial intelligence. Molecular Informatics 37, 1700153 (2018).
138. Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics 9, 1–14 (2017).
139. Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Science Advances 4, eaap7885 (2018).
140. Grisoni, F. et al. Combining generative artificial intelligence and on-chip synthesis for de novo drug design. Science Advances 7 (2021).
141. Arús-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. Journal of Cheminformatics 11, 1–13 (2019).
142. Grisoni, F., Moret, M., Lingwood, R. & Schneider, G. Bidirectional Molecule Generation with Recurrent Neural Networks. Journal of Chemical Information and Modeling (2020).
143. Müller, A. T., Hiss, J. A. & Schneider, G. Recurrent neural network model for constructive peptide design. Journal of Chemical Information and Modeling 58, 472–479 (2018).
144. Nagarajan, D. et al. Computational antimicrobial peptide design and evaluation against multidrug-resistant clinical isolates of bacteria. Journal of Biological Chemistry 293, 3492–3509 (2018).
145. Hamid, M.-N. & Friedberg, I. Identifying antimicrobial peptides using word embedding with deep recurrent neural networks. Bioinformatics 35, 2009–2016. ISSN: 1367-4803 (Nov. 2018).
146. Das, P. et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nature Biomedical Engineering 5, 613–623 (2021).
147. Tsai, S.-T., Kuo, E.-J. & Tiwary, P. Learning molecular dynamics with simple language model built upon long short-term memory neural network. Nature Communications 11, 1–11 (2020).
148. Gomez-Bombarelli, R. et al. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science 4, 268–276 (2018).
149. Lin, X., Quan, Z., Wang, Z.-J., Huang, H. & Zeng, X. A novel molecular representation with BiGRU neural networks for learning atom. Briefings in Bioinformatics 21, 2099–2111 (2019).
150. Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. Journal of Chemical Information and Modeling 58, 1736–1741 (2018).
151. Schwaller, P. et al. Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction. ACS Central Science 5, 1572–1583 (2019).
152. Schwaller, P. et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chemical Science 11, 3316–3325 (2020).

153. Pesciullesi, G., Schwaller, P., Laino, T. & Reymond, J.-L. Transfer learning enables the molecular transformer to predict regio- and stereoselective reactions on carbohydrates. Nature Communications 11, 1–8 (2020).
154. Kreutter, D., Schwaller, P. & Reymond, J.-L. Predicting Enzymatic Reactions with a Molecular Transformer. Chemical Science (2021).
155. Schwaller, P., Vaucher, A. C., Laino, T. & Reymond, J.-L. Data augmentation strategies to improve reaction yield predictions and estimate uncertainty (2020).
156. Schwaller, P. et al. Mapping the space of chemical reactions using attention-based neural networks. Nature Machine Intelligence, 1–9 (2021).
157. Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv preprint arXiv:2010.09885 (2020).
158. Morris, P., St. Clair, R., Hahn, W. E. & Barenholtz, E. Predicting Binding from Screening Assays with Transformer Network Embeddings. Journal of Chemical Information and Modeling 60, 4191–4199 (2020).
159. He, J. et al. Molecular Optimization by Capturing Chemist’s Intuition Using Deep Neural Networks (2020).
160. Honda, S., Shi, S. & Ueda, H. R. SMILES transformer: pre-trained molecular fingerprint for low data drug discovery. arXiv preprint arXiv:1911.04738 (2019).
161. Mirza, M. & Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).
162. Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein generative adversarial networks. International Conference on Machine Learning, 214–223 (2017).
163. Méndez-Lucio, O., Baillif, B., Clevert, D.-A., Rouquié, D. & Wichard, J. De novo generation of hit-like molecules from gene expression signatures using artificial intelligence. Nature Communications 11, 1–10 (2020).
164. Griffiths, R.-R. & Hernández-Lobato, J. M. Constrained Bayesian optimization for automatic chemical design using variational autoencoders. Chemical Science 11, 577–586 (2020).
165. Alperstein, Z., Cherkasov, A. & Rolfe, J. T. All smiles variational autoencoder. arXiv preprint arXiv:1905.13343 (2019).
166. Hirohara, M., Saito, Y., Koda, Y., Sato, K. & Sakakibara, Y. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinformatics 19, 83–94 (2018).
167. Kimber, T. B., Engelke, S., Tetko, I. V., Bruno, E. & Godin, G. Synergy effect between convolutional neural networks and the multiplicity of SMILES for improvement of molecular prediction. arXiv preprint arXiv:1812.04439 (2018).
168. Zheng, S., Yan, X., Yang, Y. & Xu, J. Identifying structure–property relationships through SMILES syntax analysis with self-attention mechanism. Journal of Chemical Information and Modeling 59, 914–923 (2019).
169. Lim, S. & Lee, Y. O. Predicting Chemical Properties using Self-Attention Multi-task Learning based on SMILES Representation. arXiv preprint arXiv:2010.11272 (2020).
170. Shin, B., Park, S., Kang, K. & Ho, J. C. Self-attention based molecule representation for predicting drug-target interaction. Machine Learning for Healthcare Conference, 230–248 (2019).
171. ElAbd, H. et al. Amino acid encoding for deep learning applications. BMC Bioinformatics 21, 1–14 (2020).
172. Valsecchi, C., Grisoni, F., Motta, S., Bonati, L. & Ballabio, D. NURA: A curated dataset of nuclear receptor modulators. Toxicology and Applied Pharmacology 407, 115244 (2020).
173. Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. Journal of Medicinal Chemistry 39, 2887–2893 (1996).
174. Satorras, V. G., Hoogeboom, E., Fuchs, F. B., Posner, I. & Welling, M. E(n) Equivariant Normalizing Flows for Molecule Generation in 3D. arXiv preprint arXiv:2105.09016 (2021).
175. Fujita, T. & Winkler, D. A. Understanding the roles of the “two QSARs”. Journal of Chemical Information and Modeling 56, 269–274 (2016).
176. Hu, W. et al. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687 (2020).
177. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chemical Science 9, 513–530 (2018).
178. Polykovskiy, D. et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Frontiers in Pharmacology 11 (2020).
179. Brown, N., Fiscato, M., Segler, M. H. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. Journal of Chemical Information and Modeling 59, 1096–1108 (2019).
180. Von Lilienfeld, O. A., Müller, K.-R. & Tkatchenko, A. Exploring chemical compound space with quantum-based machine learning. Nature Reviews Chemistry, 1–12 (2020).
181. Christensen, A. S., Bratholm, L. A., Faber, F. A. & Anatole von Lilienfeld, O. FCHL revisited: Faster and more accurate quantum machine learning. The Journal of Chemical Physics 152, 044107 (2020).

182. Huang, B. & von Lilienfeld, O. A. Quantum machine learning using atom-in-molecule-based fragments selected on the fly. Nature Chemistry 12, 945–951 (2020).
183. Heinen, S., Schwilk, M., von Rudorff, G. F. & von Lilienfeld, O. A. Machine learning the computational cost of quantum chemistry. Machine Learning: Science and Technology 1, 025002 (2020).
184. Heinen, S., von Rudorff, G. F. & von Lilienfeld, O. A. Quantum based machine learning of competing chemical reaction profiles. arXiv preprint arXiv:2009.13429 (2020).
185. Ramakrishnan, R., Dral, P. O., Rupp, M. & Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1, 1–7 (2014).
186. Isert, C., Atz, K., Jiménez-Luna, J. & Schneider, G. QMugs: Quantum Mechanical Properties of Drug-like Molecules. arXiv preprint arXiv:2107.00367 (2021).
187. Von Rudorff, G. F., Heinen, S. N., Bragato, M. & von Lilienfeld, O. A. Thousands of reactants and transition states for competing E2 and SN2 reactions. Machine Learning: Science and Technology 1, 045026 (2020).
