Molecular Representations in AI-Driven Drug Discovery
Total Page:16
File Type:pdf, Size:1020Kb
David et al. J Cheminform (2020) 12:56 https://doi.org/10.1186/s13321-020-00460-5 Journal of Cheminformatics REVIEW Open Access Molecular representations in AI-driven drug discovery: a review and practical guide Laurianne David1* , Amol Thakkar1,2 , Rocío Mercado1 and Ola Engkvist1 Abstract The technological advances of the past century, marked by the computer revolution and the advent of high-through- put screening technologies in drug discovery, opened the path to the computational analysis and visualization of bioactive molecules. For this purpose, it became necessary to represent molecules in a syntax that would be readable by computers and understandable by scientists of various felds. A large number of chemical representations have been developed over the years, their numerosity being due to the fast development of computers and the complex- ity of producing a representation that encompasses all structural and chemical characteristics. We present here some of the most popular electronic molecular and macromolecular representations used in drug discovery, many of which are based on graph representations. Furthermore, we describe applications of these representations in AI-driven drug discovery. Our aim is to provide a brief guide on structural representations that are essential to the practice of AI in drug discovery. This review serves as a guide for researchers who have little experience with the handling of chemical representations and plan to work on applications at the interface of these felds. Keywords: Molecular representation, Cheminformatics, Drug discovery, Small molecules, Macromolecules, Linear notation, Molecular graphs, Reaction prediction, Artifcial intelligence Introduction the empirical formula of alanine, C3H7NO2, exempli- Te representation of molecules has been of interest to fes the complexity of building a notation. Indeed, while scientists since the nineteenth century [1, 2]. Tradition- information about the atoms is available, it is not possi- ally, molecules are represented as structure diagrams ble to know how the atoms are linked from the molecular with bonds and atoms, and this is likely the representa- formula which, moreover, does not encode information tion most people think of when they think of molecules. related to the molecular geometry. As such, the molecu- However, other representations are required for the com- lar formula above can be associated to alanine as well as putational processing of chemical structures in chemin- to sarcosine and lactamide. Variants of empirical formu- formatics. Here, we defne a chemical representation as las emphasizing any functional groups also exist but are any encoding of a chemical compound; linear representa- loosely defned. In these group-centric representations, tions are referred to as notations. elements are grouped in the formula as they would be in Over the years, scientists have developed many nota- the molecule so as to highlight any functional groups pre- tions depicting various properties of a compound. A sent e.g. CH3CH(NH2)COOH to represent alanine. classic notation is the empirical formula, a non-standard Te advent of computers led to the development of a form of Hill notation. Seemingly simple at frst glance, wide variety of machine-readable chemical representa- tions. Computers allowed for the rapid digital storage *Correspondence: [email protected] and querying of compounds and their structures, swift 1 Hit Discovery, Discovery Sciences, BioPharmaceuticals R&D, Astrazeneca modifcations of digital information, and greater physi- Gothenburg, Sweden cal storage efciency. Algorithms were implemented Full list of author information is available at the end of the article to visualize compounds as 2D depictions [3, 4] and the © The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creat iveco mmons .org/publi cdoma in/ zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. David et al. J Cheminform (2020) 12:56 Page 2 of 22 computational visualization of compounds in 3D space could not consider other mappings. In typical graph rep- was popularized with the development of specialized resentations, the nodes are represented using circles or programs [5–7]. spheres, and the edges using lines. In the case of molecu- Many precursors to computer-readable notations were lar graphs, the nodes are instead often represented using introduced between 1947 and 1964 and were dedicated letters indicating the atom type (as on the periodic table), to small organic molecules [2, 8]. At the time, memory or simply using points where the bonds meet (for carbon efciency was an important factor impacting the devel- atoms). opment of chemical notations. Popular representations A molecular graph representation is formally a 2D used nowadays, however, were largely developed in and object that can be used to represent 3D information (e.g. since the 70′s to represent small molecules [9–11], mac- atomic coordinates, bond angles, chirality). However, any romolecules [12–17] and chemical reactions [18–21]. spatial relationships between the nodes must be encoded In this review, we focus on chemical representations as node and/or edge attributes, as nodes in a graph (the in cheminformatics and drug discovery. We frst intro- mathematical object) do not formally have spatial posi- duce the concept of a molecular graph, which is the tions, only pairwise relationships. Tere are of course most common machine-readable representation, and we limitations to this representation, which are discussed in give a brief overview of the main notations which paved a later section. Te 2D and 3D representations of graphs the way for the current cheminformatics notations. We can easily be visualized by many software packages, then focus on the representations that are used nowa- including ChemDraw [30], Mercury [31], Avogadro [32], days in the feld of applying artifcial intelligence (AI) to VESTA [33], PyMOL [34], and VMD [35] (the latter 5 are cheminformatics and drug discovery. Finally, we provide suitable for small- and macro-molecules, and either free examples of AI-related applications using the chemical or open-source). representations discussed in this review. Tis review is intended to provide an overview of basic cheminformat- Mathematical defnition of a graph 1 ics knowledge to practising cheminformaticians, students A graph is formally defned as a tuple G = (V, E) of a set in chemistry, cheminformatics, bioinformatics, and com- of nodes V and a set of edges E, where each edge e ∊ E puter science, and anyone interested to learn more about connects pairs of nodes in V. In a molecular graph, V is molecular representations in drug discovery. While the intuitively the set of all atoms in a molecule, and E is the coverage of representations introduced herein is not set of all bonds linking the atoms, although this does not intended to be exhaustive, we emphasise that the repre- have to be the case. Molecular graphs are generally undi- sentation used to solve a problem is always dependent rected, meaning that the pairs in E are unordered. [36]. on the task. Tus, the coverage is limited to areas where To map a graph from an abstract mathematical con- there is active research in applying machine learning cept to a concrete representation that can be handled (ML) and AI to cheminformatics and drug discovery. For on a computer, one needs to map the sets of nodes and readers interested in further reading on these topics we edges to linear data structures; a common way to do this recommend references [22–29], in addition to the refer- is using data structures such as matrices or arrays. Lin- ences cited in each section. ear data structures are necessary in order to specify the connectivity of the nodes. To do so, an artifcial node- Graph representations for small molecules ordering must frst be calculated for encoding a molecule Introduction to the molecular graph representation using arrays, even though V and E are formally sets and In order to understand the chemical representations pre- the order of elements in sets is irrelevant. Te informa- sented in this review, it is important that the reader frst tion to be mapped can include (1) how the atoms in the has a solid understanding of molecular graphs, as most molecule are connected, (2) the identity of the atoms, and molecular representations discussed in this work are (3) the identity of the bonds. built on the molecular graph representation. However, How the atoms are connected is commonly repre- there is a distinction between the notations and fle for- sented in the form of an adjacency matrix A; given that mats built using molecular graphs (e.g. SMILES strings, Molfles), and the abstract mathematical structure/data structure of a graph itself. Te latter is introduced here. 1 Note that throughout this section, using the typical convention, bold itali- Te idea behind the molecular graph representa- cized symbols are used to represent matrices and vectors, where an uppercase tion lies in mapping the atoms and bonds that make up symbol specifcally denotes a matrix (e.g.