Atomic Ring Invariant and Modified CANON Extended Connectivity
Total Page:16
File Type:pdf, Size:1020Kb
Krotko J Cheminform (2020) 12:48 https://doi.org/10.1186/s13321-020-00453-4 Journal of Cheminformatics RESEARCH ARTICLE Open Access Atomic ring invariant and Modifed CANON extended connectivity algorithm for symmetry perception in molecular graphs and rigorous canonicalization of SMILES Dmytro G. Krotko1,2* Abstract We propose new invariant (the product of the corresponding primes for the ring size of each bond of an atom) as a simple unambiguous ring invariant of an atom that allows distinguishing symmetry classes in the highly symmetrical molecular graphs using traditional local and distance atom invariants. Also, we propose modifcations of Weininger’s CANON algorithm to avoid its ambiguities (swapping and leveling ranks, incorrect determination of symmetry classes in non-aromatic annulenes, arbitrary selection of atom for breaking ties). The atomic ring invariant and the Modifed CANON algorithm allow us to create a rigorous procedure for the generation of canonical SMILES which can be used for accurate and fast structural searching in large chemical databases. Keywords: Molecular graphs, Symmetry perception, SMILES, Canonicalization Introduction molecular graphs are widely used in the contemporary Te perception of the symmetry of atoms in molecular chemical databases since they allow to compare chemical graphs is an important problem for chemoinformatics. structures as plain strings. Tese strings could be lexico- Te systems for the synthesis design use the constitu- graphically ordered, which allows using very fast binary tional symmetry information to generate and evaluate searching in ultra-large databases of canonicalized linear the reaction pathways [1]. Te constitutional symmetry is representations of molecular graphs [1]. Te accuracy important for the interpretation of NMR and ESR spec- and speed of search in millions of known and billions of tra. Computer-assisted structure elucidation systems also virtual compounds for drug discovery relays on the cor- use this information [1]. Te correct determination of the rectness and the efectiveness of these algorithms. symmetry classes for the atoms in a molecular graph is Te frst algorithm for the symmetry perception of a basis for fnding a canonical ordering of the atoms in the atoms in the molecular graphs was developed in a molecule, which in turn is necessary for generating a 1965 by Morgan [7] for Chemical Abstracts Service. unique representation of the molecule [1]. Tis approach In the frst phase of this algorithm, an initial value is is used in the canonicalization algorithms for the linear assigned to each atom based on its local properties like representations of the molecular graphs like SMILES [2– degree, atomic number, and bond type. In a second so- 5] and InChI [4, 6]. Te canonical linear notations of the called relaxation step, these values are iteratively refned by summing up of the current values of the immediate neighbors of each atom and assigning this new value to *Correspondence: [email protected] 1 Enamine Ltd, 78 Chervonotkatska Street, Kyiv 02094, Ukraine the atom. Te algorithm is terminated when a count of Full list of author information is available at the end of the article diferent values ceases to increase from the previous step © The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creat iveco mmons .org/publi cdoma in/ zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Krotko J Cheminform (2020) 12:48 Page 2 of 11 of the algorithm. Tis algorithm is a classical realization of McKay algorithm (nauty) is implemented in InChI of the extended connectivity algorithm and works pretty canonicalization algorithm [6]. Another approach is well for most of the typical chemical structures. But in shown by Schneider, Sayle, and Landrum [5] in 2015 and 1975 Randic ́ has pointed out [8] that issues exist with this is implemented in an open-source RDKit library. Tey algorithm: the same extended connectivity values can use some nonlocal invariants of the atoms (the chirality be assigned for nonequivalent atoms in some molecular invariant and the special high-symmetry invariant) for graphs. Some of these issues are related to the fact that resolving the symmetry classes in the chiral structures the sum of integers is not unambiguous function and and the highly symmetrical molecular graphs by their addition various integers can give the same sum. Tis version of the extended connectivity algorithm. Using problem could be solved by a method described by Wein- these invariants allows obtaining an accurate count of inger et al. [3] in 1989 by replacing summing at the refn- the symmetry classes in the chiral structures and ‘patho- ing step of the algorithm by the product of corresponding logical’ molecular graphs, respectively. But I would like to primes as an unambiguous function that can be easily propose another (from my point of view, more obvious shown from the prime factorization theorem. But the for the chemists) nonlocal invariants of the atoms for the method of the corresponding primes is not universal: for same purposes. some ‘pathological’ molecular graphs (one of them shown Te mathematical theory of canonical coding of the in Fig. 1) none of the extended connectivity algorithms, graphs with application to the molecular graphs is out- whose initial values assigned to each atom based on the lined in the review by Ivanciuc [1]. For further discus- local properties of atom only, can resolve all symmetry sion, we have to give the major defnitions and facts classes of these highly symmetrical molecular graphs from this review. A code of the labeled graph is a string [9]. Tis fact led lvanciuc to the statement [1] that only obtained from graph by a set of rules. Te code is a com- algorithms with an explicit determination of the graph plete representation of graph because the labeled graph automorphisms (an isomorphism of a graph with itself can be reconstructed from code. Te code is not a struc- is called an automorphism) can accurately determine all tural invariant, because diferent labelings of graph usu- symmetry classes in complex molecular graphs. Such ally give diferent codes. An important property of codes algorithms, like Shelley-Munk algorithm [10], the HOC is that the lexicographical relation between two strings (Hierarchically Ordered Extended Connectivities) algo- induces an order of the codes. A rigorous method to rithm [11], McKay algorithm [12] (which is implemented derive the canonical code and automorphism partition- in open-source library nauty) and Faulon et al. [13, 14] ing of the graph with N vertices is to construct all N! per- canonical signatures algorithm, are quite complex and mutations, to generate their codes and compare them to computationally intensive since they use some sorts of extract the one-to-one correspondence. Te permutation the path enumerations and the label permutations. Te labelings corresponding to the canonical code are iden- simplifed and adapted for the molecular graphs version tifed by a lexicographical comparison of the N! codes, followed by the selection of the maximal (or minimal) code. From the above defnition it is clear that for a given molecular graph the canonical code is unique. Tis prop- 12 12 erty is used in graph isomorphism testing and in stor- age, retrieval and comparison of chemical compounds, 6 6 because two molecular graphs with identical canonical codes represent the same chemical compound. Te pro- cess of generation of the canonical code by investigating 6 6 6 6 automorphism permutations is called canonical code generation by automorphism permutation (CCAP). Since for practical purposes N! permutations is a very large 6 6 6 number, all coding algorithms use a heuristic approach to 6 reducing (in a deterministic way, which does not depend on labeling of the molecular graph) the number of per- 6 6 mutation labelings that have to be investigated in order to detect the canonical labelings. A vertex graph invariant is any vertex property, computed on the basis of the graph 12 12 structure, whose value does not depend on the graph Fig.1 ‘Pathological’ graph with the sizes of the smallest rings for the labeling. Examples of vertex invariants are atomic num- atoms ber, the degree and distance sum. Any vertex invariant Krotko J Cheminform (2020) 12:48 Page 3 of 11 can be used to obtain a preliminary partition of the ver- they are incomplete, i.e. for some molecular graphs they tices from a molecular graph. Two vertices from difer- could give diferent canonical SMILES for the label per- ent atom invariant classes (AIC) cannot be automorphic, mutations of one molecular graph. Unlike them, the while two vertices from the same AIC are not necessarily InChI canonicalization algorithm [6] is complete. So our automorphic. Despite numerous eforts, no vertex graph main goal is to create a complete but fast algorithm for invariant is known which is sufcient to establish the SMILES canonicalization. To achieve it, we propose an automorphism partitioning, because for certain graphs efcient GIAP procedure, which in almost all cases of the non-automorphic vertices are partitioned in the same molecular graphs gives the correct partition of the atoms class.