Effective Algorithms for Searching of Identical Molecules and Their Application

MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Effective algorithms for searching of identical molecules and their application

DIPLOMA THESIS

Bc. Karol Kružel

Brno, spring 2008

Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Advisor: RNDr. Radka Svobodová Vařeková, Ph.D.

Acknowledgement

First of all, I would like to thank Radka Svobodová Vařeková for her time, advice and never-ending support during my work on this thesis. My thanks go to Ján Slaninka, with whom I worked on the IMF project, for all the effort and especially for keeping me awake. From the depth of my heart I want to thank DD. You know that the list would be longer than this work... I would also like to thank my mother and the rest of my family and friends for their support and patience during the whole time I studied at the Faculty of Informatics. Last but not least, I would like to thank Lukáš Peterka for introducing me to the world of programming.

Abstract

This master's thesis focuses on searching for isomorphic molecules. In computers, molecules are usually represented as molecular graphs. While this structure is suitable for further processing, it has the disadvantage that the indexing of atoms in the represented molecule may be ambiguous. The work starts with definitions of all required terms, and then the problem of graph isomorphism is briefly described. Possible solutions, such as the brute-force approach, backtracking and its further optimization using equivalence classes of graph vertices, are listed and their expected performance is briefly described.
The implementation part of this work describes the design and implementation of the IMF tool, software that detects isomorphic molecules and manages a database of molecular data.

Keywords

molecular graph, graph isomorphism, isomorphism detection, molecule visualization, Java, Jmol, Swing, MySQL

Contents

I Introduction
II Theoretical part
  1 Basic definitions
    1.1 Atom and chemical element
    1.2 Molecules and chemical bonds
  2 Graphs and molecular graphs
    2.1 Undirected graph
    2.2 Molecular graph
    2.3 Adjacency matrix
  3 Graph isomorphism
    3.1 Isomers, graph isomerism
    3.2 Graph isomorphism
    3.3 Non-isomorphism detection
    3.4 Isomorphism detection algorithms
      3.4.1 Brute force
      3.4.2 Backtracking
      3.4.3 Equivalence classes of vertices in molecular graphs
III Methods
  4 Development environment
    4.1 Java and NetBeans IDE
    4.2 MySQL database server
    4.3 Other software components used
      4.3.1 Jmol
      4.3.2 JDBC drivers
  5 Chemical data and means of its storage
    5.1 MDL Molfile and SDFile
    5.2 Relational databases
    5.3 SMILES definitions
IV Implementation
  6 Description of IMF Application
    6.1 Basic program description
    6.2 SDFile import
    6.3 Database browser
    6.4 Export list
    6.5 Isomorphism search
    6.6 Backtracking visualization
  7 Design Specification
    7.1 IMF Architecture
    7.2 Molecule representation
  8 Database layer description
    8.1 DB schema description
    8.2 MoleculeDB class
  9 Implementation of isomorphism search algorithms
    9.1 Molecule fingerprints
    9.2 Brute force approach
    9.3 Backtracking
      9.3.1 Backtracking algorithm
  10 Implementation of user interface
    10.1 GUI implementation
    10.2 MoleculePanel
    10.3 JmolPanel
V Conclusion
  11 Conclusion
Bibliography
Index
A Contents of attached DVD

Part I Introduction

Introduction

Pure informatics is a very interesting and important field of study. However, I believe that the greatest potential of computer science lies in its application in other fields. One such interdisciplinary application is computational chemistry. Prof. Schleyer, one of the greatest thinkers of this field, defines it in the following way: “... endeavour to model all aspects of real chemistry as accurately as possible using calculations instead of experiments”.

Computational chemistry was established in the 1960s as a complementary field to experimental chemistry. Experimental research of many chemical processes and properties of chemical substances is very difficult. Chemical substances might, for instance, be toxic, radioactive or just too expensive to use. Chemical processes are often extremely fast, explosive, exothermic or bound to conditions that are very hard to achieve. In cases like these it is feasible to replace experiments with simulations: instead of classical chemical methods we use computational chemistry.

For computational processing of chemical substances it is necessary to provide the computer with a description of the molecule in a format that provides information about its chemical structure (the atoms and the bonds between them). One way to describe a molecule is a graph data structure in which atoms are the vertices and bonds are the edges. This representation requires that all atoms have indices assigned and that each bond is identified by the pair of indices of the two bonded atoms. Graph representation is one of the most widely used molecule descriptions in computational chemistry.
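The indexed representation described here can be illustrated with a minimal Java sketch. The class and method names (`MolecularGraphSketch`, `adjacency`, `relabel`) are hypothetical and are not taken from the thesis or the IMF tool. The sketch assumes that bonds are given as pairs of atom indices and it ignores bond order. It builds the adjacency matrix of a water molecule and then renumbers the atoms, which produces a different matrix for the same molecule.

```java
import java.util.Arrays;

/** Hypothetical sketch of an indexed molecular graph; not code from the thesis. */
final class MolecularGraphSketch {

    /** Builds a symmetric adjacency matrix from bonds given as pairs of atom indices. */
    static int[][] adjacency(int atomCount, int[][] bonds) {
        int[][] m = new int[atomCount][atomCount];
        for (int[] bond : bonds) {
            m[bond[0]][bond[1]] = 1;
            m[bond[1]][bond[0]] = 1;
        }
        return m;
    }

    /** Renumbers atoms: atom i of the input becomes atom perm[i] of the result. */
    static int[][] relabel(int[][] m, int[] perm) {
        int n = m.length;
        int[][] r = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                r[perm[i]][perm[j]] = m[i][j];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Water: atom 0 is the oxygen, atoms 1 and 2 are the hydrogens.
        int[][] water = adjacency(3, new int[][] {{0, 1}, {0, 2}});
        // The same molecule with the oxygen moved to index 1.
        int[][] renumbered = relabel(water, new int[] {1, 0, 2});
        System.out.println(Arrays.deepToString(water));
        System.out.println(Arrays.deepToString(renumbered));
    }
}
```

The two printed matrices differ although both describe water, so comparing matrices entry by entry is not enough to decide whether two descriptions denote the same molecule.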
The main advantage of the graph representation is that it does not involve compression of any kind and that data about atoms, bonds and optionally other information (such as the spatial placement of atoms, bond lengths, or bond and torsional angles) is readily available for algorithmic processing. An unfortunate aspect of the graph representation is that a molecule with N atoms can be indexed in N! different ways. The problem that arises is to decide whether two graphs with different vertex indexing represent the same molecule. The solution lies in finding an isomorphism (a bijective mapping of chemically equivalent atoms) between the two given graphs: if such a mapping exists, both graphs represent the same molecule. However, a naive isomorphism search has complexity proportional to N!, because that is the number of possible bijections between the vertices of the two graphs. The search therefore has to be optimized (backtracking, dividing atoms into equivalence classes, ...) so that the impact of the factorial complexity is reduced as much as possible. It would also be beneficial to skip the isomorphism search entirely when possible, e.g. to use the numerical characteristics of the given molecular graphs (number of atoms, degrees of vertices, etc.) to find out in advance that the graphs cannot describe the same molecule.

Graph representation of molecules and isomorphism testing of two given molecules can be very useful for experimental as well as computational chemists. They allow searching for molecules (as well as additional information about them) in chemical databases. Several such databases are currently available. They contain a large amount of useful information (pharmaceuticals, inorganic molecules, crystal structures, ...), and it would be beneficial to be able to search them quickly.
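The combination described above, a cheap comparison of numerical characteristics followed by a backtracking search for a bijection, can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the algorithm implemented in the IMF tool: all names (`IsomorphismSketch`, `Mol`, `signature`, `isomorphic`) are hypothetical, bonds are treated as plain connections without bond order, and the only numerical characteristic used is the sorted list of element and degree for each atom.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; names are illustrative and not taken from the IMF tool. */
final class IsomorphismSketch {

    /** A molecule as a labelled graph: elements[i] is the symbol of atom i. */
    static final class Mol {
        final String[] elements;
        final boolean[][] bonded;

        Mol(String[] elements, int[][] bonds) {
            this.elements = elements;
            this.bonded = new boolean[elements.length][elements.length];
            for (int[] b : bonds) {
                bonded[b[0]][b[1]] = true;
                bonded[b[1]][b[0]] = true;
            }
        }

        /** Number of atoms bonded to the given atom. */
        int degree(int atom) {
            int d = 0;
            for (boolean x : bonded[atom]) {
                if (x) d++;
            }
            return d;
        }

        /** Sorted "element:degree" labels; the result does not depend on atom indexing. */
        List<String> signature() {
            List<String> sig = new ArrayList<>();
            for (int i = 0; i < elements.length; i++) {
                sig.add(elements[i] + ":" + degree(i));
            }
            Collections.sort(sig);
            return sig;
        }
    }

    /** Cheap pre-check first, then a backtracking search for a bijection. */
    static boolean isomorphic(Mol a, Mol b) {
        if (!a.signature().equals(b.signature())) {
            return false; // different characteristics: cannot be the same molecule
        }
        int n = a.elements.length;
        int[] map = new int[n];          // map[i] = atom of b assigned to atom i of a
        boolean[] used = new boolean[n]; // atoms of b that are already assigned
        return extend(a, b, 0, map, used);
    }

    /** Tries to assign atom i of a, given consistent assignments for atoms 0..i-1. */
    private static boolean extend(Mol a, Mol b, int i, int[] map, boolean[] used) {
        int n = a.elements.length;
        if (i == n) return true; // every atom is mapped consistently
        for (int c = 0; c < n; c++) {
            if (used[c]) continue;
            // Only atoms with the same element and degree are candidates.
            if (!a.elements[i].equals(b.elements[c]) || a.degree(i) != b.degree(c)) continue;
            // Bonds to all previously mapped atoms must be preserved.
            boolean ok = true;
            for (int j = 0; j < i && ok; j++) {
                ok = a.bonded[i][j] == b.bonded[c][map[j]];
            }
            if (!ok) continue;
            map[i] = c;
            used[c] = true;
            if (extend(a, b, i + 1, map, used)) return true;
            used[c] = false; // backtrack and try the next candidate
        }
        return false;
    }
}
```

For example, two descriptions of water with different atom numbering compare as isomorphic. The carbon skeletons of 2-methylpentane and 3-methylpentane have the same signature, so the pre-check cannot separate them, and only the backtracking search shows that they differ. The worst case of the search is still factorial; the pre-check and the candidate filter only reduce how often that cost is paid.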
The goal of my thesis is to create a software tool that is able to work with such databases and to perform searches based on numerical characteristics and isomorphisms of the molecules they contain. This thesis was written in cooperation with the company ANF DATA and is part of the projects in the Life Science & Innovations section.

The goals of the thesis

a) Gain knowledge about the basic terms required to work with molecular structures, in particular: molecular graph, isomerism, canonical indexing, isomorphism, numerical characteristics of a molecule.
b) Understand the file formats used to store molecule descriptions in a computer (SDF, MOL, ...).
c) Design a methodology for molecule search in databases. Find an appropriate application for numerical characteristics of a molecule in combination with efficient improvements of the isomorphism search algorithm (backtracking, equivalence classes of atoms, ...).
d) Implement software that loads several databases of molecules and an input molecule, searches through the loaded databases and outputs the position of the input molecule within them. Develop the software in such a way that the search finishes in an acceptable time.
e) Test the functionality and effectiveness of the software using existing molecule databases.

Note: The topic of this work allows for realization by two cooperating students. The work on the project can be divided into separate parts that can be completed independently. The focus of the task is quite wide, and the given problem is difficult from the viewpoint of understanding the application domain (chemistry), from the algorithmic viewpoint (isomorphism, molecular graphs) and from the implementation viewpoint (use of large databases, molecule visualization).

Part II Theoretical part

Chapter 1 Basic definitions

Before looking at the representation of chemical structures in computers and describing the algorithms that work with these structures, let's have a look at a few basic chemical
