MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Effective algorithms for searching of identical molecules and their application

DIPLOMA THESIS

Bc. Karol Kružel

Brno, spring 2008

Declaration

I hereby declare that this thesis is my original work, which I have written on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed with complete reference to the source.

Advisor: RNDr. Radka Svobodová Vařeková, Ph.D.

Acknowledgement

First of all, I would like to thank Radka Svobodová Vařeková for her time, advice and never-ending support during my work on this thesis. My thanks go to Ján Slaninka, with whom I worked on the IMF project, for all the effort and especially for keeping me awake. From the depth of my heart I want to thank DD. You know that the list would be longer than this work... I would also like to thank my mother and the rest of my family and friends for their support and patience during the whole time I studied at the Faculty of Informatics. Last but not least, I would like to thank Lukáš Peterka for introducing me to the world of programming.

Abstract

This master's thesis focuses on searching for isomorphic molecules. In computers, molecules are usually represented in the form of molecular graphs. While this structure is well suited for further processing, it has the disadvantage that the indexing of atoms in the represented molecule may be ambiguous. This work starts with the definition of all required terms, and then the problem of graph isomorphism is briefly described. Possible solutions, such as the brute-force approach, backtracking and its further optimization using equivalence classes of graph vertices, are listed and their expected performance is briefly described. The implementation part of this work describes the design and implementation of the IMF tool, software that allows detection of isomorphic molecules as well as management of a database of molecular data.
Keywords

molecular graph, graph isomorphism, isomorphism detection, molecule visualization, Java, Jmol, Swing, MySQL

Contents

I Introduction
II Theoretical part
  1 Basic definitions
    1.1 Atom and chemical element
    1.2 Molecules and chemical bonds
  2 Graphs and molecular graphs
    2.1 Undirected graph
    2.2 Molecular graph
    2.3 Adjacency matrix
  3 Graph isomorphism
    3.1 Isomers, graph isomerism
    3.2 Graph isomorphism
    3.3 Non-isomorphism detection
    3.4 Isomorphism detection algorithms
      3.4.1 Brute force
      3.4.2 Backtracking
      3.4.3 Equivalence classes of vertices in molecular graphs
III Methods
  4 Development environment
    4.1 Java and NetBeans IDE
    4.2 MySQL database server
    4.3 Other software components used
      4.3.1 Jmol
      4.3.2 JDBC drivers
  5 Chemical data and means of its storage
    5.1 MDL Molfile and SDFile
    5.2 Relational databases
    5.3 SMILES definitions
IV Implementation
  6 Description of IMF Application
    6.1 Basic program description
    6.2 SDFile import
    6.3 Database browser
    6.4 Export list
    6.5 Isomorphism search
    6.6 Backtracking visualization
  7 Design Specification
    7.1 IMF Architecture
    7.2 Molecule representation
  8 Database layer description
    8.1 DB schema description
    8.2 MoleculeDB class
  9 Implementation of isomorphism search algorithms
    9.1 Molecule fingerprints
    9.2 Brute force approach
    9.3 Backtracking
      9.3.1 Backtracking algorithm
  10 Implementation of user interface
    10.1 GUI implementation
    10.2 MoleculePanel
    10.3 JmolPanel
V Conclusion
  11 Conclusion
Bibliography
Index
A Contents of attached DVD

Part I
Introduction

Introduction

Pure informatics is a very interesting and important field of study. However, I believe that the greatest potential of computer science lies in its application in other fields. One such interdisciplinary application is computational chemistry. Prof. Schleyer, one of the greatest thinkers of this field, defines it in the following way: “... endeavour to model all aspects of real chemistry as accurately as possible using calculations instead of experiments”.

Computational chemistry was established in the 1960s as a complementary field to experimental chemistry. Experimental research of many chemical processes and properties of chemical substances is very difficult. Chemical substances might, for instance, be toxic, radioactive or just too expensive to use. Chemical processes are often extremely fast, explosive, exothermic or bound to conditions that are very hard to achieve. In cases like these it is practical to replace experiments with simulations: instead of classical chemical methods we use computational chemistry.

For computational processing of chemical substances it is necessary to provide the computer with a molecule description in a format that provides information about its chemical structure (the atoms and the bonds between them). One of the ways to describe a molecule is a graph data structure, in which atoms are the vertices and bonds are the edges. This representation requires that all atoms have indices assigned and that each bond is identified by the pair of indices of the two bonded atoms. Graph representation is one of the most used molecule descriptions in computational chemistry. Its main advantage is that it does not involve compression of any kind, and that data about atoms, bonds and optionally other information (such as spatial placement of atoms, length of bonds, or bond and torsional angles) is readily available for algorithmic processing.
An unfortunate aspect of graph representation is that a molecule with N atoms can be indexed in N! different ways. The problem that arises is to identify whether two graphs with different indexing of vertices represent the same molecule or not. The solution of this problem lies in finding an isomorphism (a bijective mapping of chemically equivalent atoms) between the two given graphs: if such a mapping exists, both graphs represent the same molecule. However, a naive isomorphism search has complexity O(N!), because that is the number of possible bijections between the two graphs. The isomorphism search therefore has to be optimized (backtracking, dividing atoms into equivalence classes, ...) so that the impact of the factorial complexity is reduced as much as possible. It is also beneficial to leave out the isomorphism search entirely when possible, e.g. to use the numerical characteristics of the given molecular graphs (number of atoms, degrees of vertices, etc.) to find out in advance that the graphs cannot describe the same molecule.

Graph representation of molecules and isomorphism testing of two given molecules can be very useful for experimental as well as computational chemists. It allows searching for molecules (as well as additional information about them) in chemical databases. Several such databases are currently available. They contain a large amount of useful information (pharmaceuticals, inorganic molecules, crystal structures, ...) and it would be beneficial to perform fast searches in these databases. The goal of my thesis is to create a software tool that is able to work with such databases and to perform searches based on numerical characteristics and isomorphisms of the contained molecules.

This thesis was written in cooperation with the ANF DATA company and it is part of the projects in the Life Science & Innovations section.
The goals of the thesis

a) Gain knowledge about the basic terms required to work with molecular structures, in particular: molecular graph, isomerism, canonical indexing, isomorphism, numerical characteristics of a molecule.
b) Understand the file formats used to store molecule descriptions in a computer (SDF, MOL, ...).
c) Design a methodology for molecule search in databases. Find an appropriate application for numerical characteristics of molecules in combination with efficient improvements of the isomorphism search algorithm (backtracking, equivalence classes of atoms, ...).
d) Implement software that loads several databases of molecules and an input molecule, searches through the loaded databases and outputs the position of the input molecule within them. Develop the software in such a way that the search finishes in acceptable time.
e) Test the functionality and effectiveness of the software using existing molecule databases.

Note: The topic of this work allows for realization by two cooperating students. The work on the project can be divided into separate parts that can be completed independently. The focus of the task is quite wide, and the given problem is difficult from the viewpoint of understanding the application domain (chemistry), from the algorithmic viewpoint (isomorphism, molecular graphs) and from the implementation viewpoint (use of large databases, molecule visualization).

Part II
Theoretical part

Chapter 1
Basic definitions

Before looking at the representation of chemical structures in computers and describing the algorithms that work with these structures, let us have a look at a few basic chemical definitions.

1.1 Atom and chemical element

An atom is the smallest particle that is recognized as a chemical element. Each atom has a nucleus that consists of positively charged protons and electrically neutral neutrons. The nucleus is surrounded by a cloud of negatively charged electrons.
Chemical elements are distinguished by the number of protons contained in the atomic nucleus; this number is called the atomic number. [5, 6]

1.2 Molecules and chemical bonds

Electrons surrounding the nucleus of an atom are often described by dividing them into several electron shells, where all electrons in one shell have the same amount of energy. The outermost shell of electrons is called the valence shell and it determines the ability of the given element to form chemical bonds. Electrons in the valence shell are called valence electrons.

A bond between atoms is formed when the distance between two atoms is small enough for their valence shells to overlap. This overlap causes the valence electrons of both atoms to change the trajectory of their movement. If the newly established system has a lower energy than before, the two atoms remain in their new positions and a chemical bond is created. Depending on the number of electrons that form the chemical bond, we talk about single (one pair of valence electrons), double (two pairs) or triple (three pairs) bonds.

Two or more atoms held together by strong chemical bonds in a stable state form a molecule. Molecules can be described using molecular formulas or chemical names; these usually provide information about the number of atoms and their types. More information is provided by structural formulas, which also describe how the atoms in the molecule are arranged. [1]

In computational chemistry there are several ways of representing molecules in a computer. I will describe some of the file formats commonly used to store molecular structures later in this work. From the informatics point of view, a molecule is very often described as a graph structure. Before describing in detail how a molecule is represented in computer memory, we have to define a few basic terms from graph theory.
Chapter 2
Graphs and molecular graphs

2.1 Undirected graph

An undirected graph G is defined [11] as an ordered pair G = (V, E) where:

• V is a set of vertices of the graph G
• E is a set of edges of the graph G: E ⊆ {{v, w} | v ∈ V ∧ w ∈ V ∧ v ≠ w}

Figure 2.1 shows a simple graph G = (V, E):

V = {v1, v2, v3, v4}
E = {{v1, v3}, {v2, v3}, {v3, v4}}

Figure 2.1: Sample undirected graph

2.2 Molecular graph

To represent a molecule we usually use a labeled pseudograph, which better describes the atoms and chemical bonds in the molecule. A molecular graph [1] can be defined as a quintuple G = (V, E, f, φ, S) where:

• V is a set of vertices of the molecular graph G
• E is a set of edges of the molecular graph G
• f is a function that defines the end vertices of each edge
• φ is a valuation function for vertices
• S is a set of symbols for the valuation of vertices

The following is a sample graph representation of the formaldehyde molecule. The two parallel edges e3 and e4 represent the double bond between carbon and oxygen, and the two loops e5 and e6 represent the free electron pairs of oxygen.

V = {v1, v2, v3, v4}
E = {e1, e2, e3, e4, e5, e6}
f: f(e1) = {v1, v3}; f(e2) = {v2, v3}; f(e3) = {v3, v4}; f(e4) = {v3, v4}; f(e5) = {v4, v4}; f(e6) = {v4, v4}
φ(v1) = H; φ(v2) = H; φ(v3) = C; φ(v4) = O
S = {C, O, H}

Figure 2.2: Structural formula of the formaldehyde molecule and its graph representation

2.3 Adjacency matrix

Another way to represent a molecular structure and a molecular graph is an adjacency matrix. The adjacency matrix A = (a_ij) of a given molecular graph G is a symmetric matrix with rows and columns corresponding to the vertices of graph G. The entries of the adjacency matrix are set as follows:

• a_ii = number of non-bonding valence electrons of the atom represented by vertex v_i
• a_ij = a_ji = number of edges e such that f(e) = {v_i, v_j}, for i ≠ j

The molecular graph of formaldehyde shown in Figure 2.2 is represented by this matrix:

      v1 v2 v3 v4
v1     0  0  1  0
v2     0  0  1  0
v3     1  1  0  2
v4     0  0  2  4

Chapter 3
Graph isomorphism

3.1 Isomers, graph isomerism

Isomers are molecules that can be represented using one molecular formula, while their structural formulas may differ.
This means that these molecules have the same number of atoms of each type, but the bonds between these atoms and their types may be arranged in a different way in each isomer.

Given a molecular graph G = (V, E, f, φ, S) with M edges and a molecular graph G′ = (V′, E′, f′, φ′, S) with M′ edges, we say that G and G′ are isomeric if the following conditions are met:

• The number of edges in both graphs is equal: M = M′.
• A bijection ψ : V → V′ exists such that the valuation of vertices is preserved: φ(w) = φ′(ψ(w)) for each w ∈ V. We then say that the sets V and V′ are similar.

Two molecular graphs are isomeric if they represent molecules that are isomers.

Figure 3.1: Two possible isomers for the molecular formula CH2O and their molecular graphs

3.2 Graph isomorphism

A molecular graph G = (V, E, f, φ, S) and a molecular graph G′ = (V′, E′, f′, φ′, S) are isomorphic if a bijection p : V → V′ exists such that:

• if vertices v, w ∈ V are joined by n edges {v, w} in graph G, then vertices p(v), p(w) ∈ V′ are joined by n edges {p(v), p(w)} in graph G′
• the valuation of vertices remains unchanged: φ(v) = φ′(p(v)) for each v ∈ V

Graph isomorphism is an equivalence relation on graphs. If two molecular graphs G and G′ are isomorphic, then they represent the same molecule.

Figure 3.2: Two different graphical representations of the same molecule

The bijection p : V → V′ can be expressed as a permutation matrix P. Then the following holds for the adjacency matrices A and A′ of two isomorphic graphs G and G′:

A = Pᵀ A′ P

3.3 Non-isomorphism detection

No polynomial-time algorithm is currently known for determining whether two graphs are isomorphic, nor has it been proved that such an algorithm does not exist. Since the isomorphism of two graphs can only be proven by finding the correct permutation of vertices of one of the tested graphs, it is usually time consuming.
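Verifying one candidate permutation, by contrast, is cheap. The following Java sketch checks the relation A = Pᵀ A′ P entry by entry, that is, a_ij = a′_p(i)p(j), together with the preservation of vertex labels. It uses the formaldehyde matrix from Section 2.3 and a re-indexed copy of it; the class name, method name and array-based representation are illustrative assumptions and are not taken from the IMF source code.

```java
final class PermutationCheck {

    /**
     * Returns true if perm is an isomorphism from graph A to graph B, i.e.
     * labelsA[i] equals labelsB[perm[i]] and a[i][j] == b[perm[i]][perm[j]].
     * Assumes perm is a permutation of 0..n-1 (this is not validated here).
     */
    static boolean isIsomorphism(String[] labelsA, int[][] a,
                                 String[] labelsB, int[][] b, int[] perm) {
        int n = a.length;
        if (b.length != n || perm.length != n) {
            return false;
        }
        for (int i = 0; i < n; i++) {
            if (!labelsA[i].equals(labelsB[perm[i]])) {
                return false; // valuation of vertices is not preserved
            }
            for (int j = 0; j < n; j++) {
                if (a[i][j] != b[perm[i]][perm[j]]) {
                    return false; // number of edges (or free electrons) differs
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Formaldehyde indexed as H, H, C, O (matrix from Section 2.3).
        String[] labelsA = {"H", "H", "C", "O"};
        int[][] a = {{0, 0, 1, 0}, {0, 0, 1, 0}, {1, 1, 0, 2}, {0, 0, 2, 4}};
        // The same molecule indexed as O, C, H, H (hypothetical re-indexing).
        String[] labelsB = {"O", "C", "H", "H"};
        int[][] b = {{4, 2, 0, 0}, {2, 0, 1, 1}, {0, 1, 0, 0}, {0, 1, 0, 0}};

        int[] good = {2, 3, 1, 0};     // H->2, H->3, C->1, O->0
        int[] identity = {0, 1, 2, 3}; // maps H onto O, so it must fail

        System.out.println(isIsomorphism(labelsA, a, labelsB, b, good));     // true
        System.out.println(isIsomorphism(labelsA, a, labelsB, b, identity)); // false
    }
}
```

The check costs O(n²) per permutation; the expensive part of isomorphism testing is the number of permutations that may have to be tried.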
However, for many pairs of graphs G and G′ it can easily be seen that they are non-isomorphic. It is therefore more time-effective to first eliminate graphs that cannot be isomorphic.

There are several characteristics of molecular graphs (e.g. number of edges, types of contained atoms) that can be computed in O(n) or O(n²) time. If the values of these basic characteristics differ for the given graphs, then the graphs are not isomorphic.

3.4 Isomorphism detection algorithms

3.4.1 Brute force

The most trivial algorithm for isomorphism detection is the naive “brute force” approach. It is very easy to implement, but it is also the slowest one. For each possible bijection p : V → V′ we test whether p maps G onto G′, i.e. whether it preserves both the edges and the valuation of vertices. The complexity of this algorithm is O(n!), since for sets V and V′ with cardinality n there are n! bijections p : V → V′.

3.4.2 Backtracking

Since the valuation of vertices has to remain the same in isomorphic graphs, the backtracking algorithm limits the search for matching permutations to structures of vertices with the same valuation. We mark one of the input graphs as the source graph. The source graph is then searched depth-first, while the matching structure of vertices is searched for in the second graph. In the second graph a random starting vertex is chosen, and the graph is then traversed from this vertex by choosing edges that lead to vertices with the same valuation as in the source graph. If it is not possible to find a matching partial bijection, the depth-first search steps back and a different permutation of the possible vertices in the second graph is chosen for evaluation.
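The cheap rejection test of Section 3.3 and the brute-force search of Section 3.4.1 can be sketched together in Java as follows. This is a minimal illustration under stated assumptions, not the IMF implementation: the class and method names are hypothetical, molecules are given as a label array plus an adjacency matrix as in Section 2.3, and only two invariants (multiset of atom labels and number of bonds) are compared before the O(n!) search starts.

```java
import java.util.Arrays;

final class BruteForceIsomorphism {

    /** Number of bonds: sum of the entries above the diagonal. */
    static int bondCount(int[][] adj) {
        int sum = 0;
        for (int i = 0; i < adj.length; i++) {
            for (int j = i + 1; j < adj.length; j++) {
                sum += adj[i][j];
            }
        }
        return sum;
    }

    /** Cheap necessary conditions: same atoms (as a multiset) and same number of bonds. */
    static boolean invariantsMatch(String[] la, int[][] a, String[] lb, int[][] b) {
        if (la.length != lb.length || bondCount(a) != bondCount(b)) {
            return false;
        }
        String[] sa = la.clone();
        String[] sb = lb.clone();
        Arrays.sort(sa);
        Arrays.sort(sb);
        return Arrays.equals(sa, sb);
    }

    /** Checks whether perm preserves labels and all matrix entries. */
    static boolean preserves(String[] la, int[][] a, String[] lb, int[][] b, int[] perm) {
        for (int i = 0; i < perm.length; i++) {
            if (!la[i].equals(lb[perm[i]])) {
                return false;
            }
            for (int j = 0; j < perm.length; j++) {
                if (a[i][j] != b[perm[i]][perm[j]]) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Enumerates all permutations of perm[k..n-1] by swapping; stops at the first match. */
    private static boolean search(int k, int[] perm,
                                  String[] la, int[][] a, String[] lb, int[][] b) {
        if (k == perm.length) {
            return preserves(la, a, lb, b, perm);
        }
        for (int i = k; i < perm.length; i++) {
            int t = perm[k]; perm[k] = perm[i]; perm[i] = t;
            if (search(k + 1, perm, la, a, lb, b)) {
                return true;
            }
            t = perm[k]; perm[k] = perm[i]; perm[i] = t; // undo the swap
        }
        return false;
    }

    static boolean isomorphic(String[] la, int[][] a, String[] lb, int[][] b) {
        if (!invariantsMatch(la, a, lb, b)) {
            return false; // rejected without touching any permutation
        }
        int[] perm = new int[la.length];
        for (int i = 0; i < perm.length; i++) {
            perm[i] = i;
        }
        return search(0, perm, la, a, lb, b);
    }

    public static void main(String[] args) {
        // Hydrogen-suppressed skeletons, used purely as an illustration.
        String[] ethanol = {"C", "C", "O"};        // C-C-O
        int[][] ethanolAdj = {{0, 1, 0}, {1, 0, 1}, {0, 1, 4}};
        String[] ether = {"C", "O", "C"};          // C-O-C (dimethyl ether)
        int[][] etherAdj = {{0, 1, 0}, {1, 4, 1}, {0, 1, 0}};
        String[] ethanol2 = {"O", "C", "C"};       // ethanol, re-indexed
        int[][] ethanol2Adj = {{4, 1, 0}, {1, 0, 1}, {0, 1, 0}};

        System.out.println(isomorphic(ethanol, ethanolAdj, ethanol2, ethanol2Adj)); // true
        System.out.println(isomorphic(ethanol, ethanolAdj, ether, etherAdj));       // false
    }
}
```

Ethanol and dimethyl ether are isomers, so both invariants agree and only the full search reveals that the two graphs are not isomorphic. This is why the invariants can reject candidates but can never confirm an isomorphism.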
Isomorphism detection using backtracking

Stack knownEqualVertexes

equalFromVertex(vertex v1, vertex v2) {
    mark current position in knownEqualVertexes stack
    if [v1,v2] in knownEqualVertexes {
        return true
    }
    if v1.valuation <> v2.valuation {
        return false
    }
    set nl1 to ordered list of neighbours of v1
    set nl2 to ordered list of neighbours of v2
    while exists unchecked permutation of items in nl2 {
        for i = 1 to number of neighbours {
            if not equalFromVertex(nl1[i], nl2[i]) {
                jump to next permutation of nl2
            }
        }
        knownEqualVertexes.push(v1, v2);
        return true
    }
    remove all items from knownEqualVertexes above set mark
    return false
}

GraphsIsomorphic(graph G1, graph G2) {
    choose a starting vertex v1 from G1
    for each vertex v2 from G2 {
        if equalFromVertex(v1, v2) {
            return true
        }
    }
    return false
}

In the worst case the backtracking algorithm also has complexity O(n!); however, in most cases it is significantly faster than the brute-force algorithm. The speed increase is inversely proportional to the number of vertices that have the same valuation. In the molecular graphs defined above, the valuation of a vertex represents the chemical element of the atom. Since most molecules in organic chemistry contain many carbon atoms, we might want to change the valuation function so that these large groups of vertices with equal valuation are divided into smaller parts, which increases the speed of isomorphism testing.

3.4.3 Equivalence classes of vertices in molecular graphs

According to the definition of a molecular graph, each vertex is assigned a value that corresponds to the chemical symbol of the given atom. Given a molecular graph G, we can define an equivalence relation on the members of the set V: v1 ∼ v2 ⇐⇒ φ(v1) = φ(v2), for v1, v2 ∈ V. The vertices can then be divided into equivalence classes [a] = {x ∈ V | x ∼ a}.
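The effect of such a division on the search space can be illustrated with a short Java sketch. It groups the atoms of pyruvic acid into classes, first by element symbol only and then by a finer key. The finer key used here, element symbol plus the number of neighbouring atoms, is an assumption chosen for illustration, as are all class and method names and the atom numbering; none of them is taken from the IMF source code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class VertexClasses {

    /** Builds a symmetric adjacency matrix from {from, to, bondOrder} triples (0-based). */
    static int[][] adjacency(int n, int[][] bonds) {
        int[][] adj = new int[n][n];
        for (int[] bond : bonds) {
            adj[bond[0]][bond[1]] = bond[2];
            adj[bond[1]][bond[0]] = bond[2];
        }
        return adj;
    }

    /** Groups vertex indices by label, or by label and number of neighbours. */
    static Map<String, List<Integer>> classesBy(String[] labels, int[][] adj, boolean useDegree) {
        Map<String, List<Integer>> classes = new TreeMap<>();
        for (int v = 0; v < labels.length; v++) {
            String key = labels[v];
            if (useDegree) {
                int neighbours = 0;
                for (int w = 0; w < labels.length; w++) {
                    if (w != v && adj[v][w] > 0) {
                        neighbours++;
                    }
                }
                key = key + "/" + neighbours;
            }
            classes.computeIfAbsent(key, k -> new ArrayList<>()).add(v);
        }
        return classes;
    }

    /** Number of class-preserving bijections: product of the factorials of the class sizes. */
    static long candidateMappings(Map<String, List<Integer>> classes) {
        long product = 1;
        for (List<Integer> members : classes.values()) {
            for (int k = 2; k <= members.size(); k++) {
                product *= k;
            }
        }
        return product;
    }

    public static void main(String[] args) {
        // Pyruvic acid, 10 atoms; the numbering is a hypothetical example.
        String[] labels = {"C", "O", "O", "C", "O", "C", "H", "H", "H", "H"};
        int[][] bonds = {
            {0, 1, 1}, {0, 2, 2}, {0, 3, 1}, {1, 6, 1}, {3, 4, 2},
            {3, 5, 1}, {5, 7, 1}, {5, 8, 1}, {5, 9, 1}
        };
        int[][] adj = adjacency(labels.length, bonds);

        Map<String, List<Integer>> byElement = classesBy(labels, adj, false);
        Map<String, List<Integer>> refined = classesBy(labels, adj, true);

        System.out.println(byElement + " -> " + candidateMappings(byElement)); // 864 mappings
        System.out.println(refined + " -> " + candidateMappings(refined));     // 96 mappings
    }
}
```

For this molecule an unrestricted search would face 10! = 3,628,800 bijections. Restricting the mapping to atoms of the same element leaves 3! · 3! · 4! = 864 candidates, and the finer key leaves 2! · 2! · 4! = 96.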
Since graph isomorphism is an equivalence relation that preserves the valuation of vertices, the division into equivalence classes must remain the same in the isomorphic graph G′ [1, 9]. When testing for isomorphism using backtracking, we only have to permute vertices that belong to the same equivalence class, so reducing the number of vertices in each equivalence class decreases the size of the array to permute. This can be achieved by defining the ∼ relation so that it accounts for other characteristics besides the valuation of the vertex (e.g. the degree of the vertex). A detailed description of this method can be found in [8].

Part III
Methods

Chapter 4
Development environment

The following tools were used for the development part of my thesis, the IMF (Isomorphic Molecule Finder) software.

4.1 Java and NetBeans IDE

Java was chosen as the programming language for the implementation of the IMF tool. There are several popular Java IDEs available; NetBeans IDE 6 was chosen for the development of IMF. Some of the reasons for this decision were: a feature-rich GUI designer for Swing-based applications, support for SVN together with other collaboration tools such as conversation support, an integrated profiler and the availability of a wide range of plug-in modules. Another useful feature is the NetBeans support for refactoring. It was helpful in the stages of development when it was necessary to modify the source code to improve performance after benchmarks were performed on chemical data.

4.2 MySQL database server

MySQL is a widely used database server, available in both commercial and free editions. Data storage in IMF is implemented using MySQL server version 5.0 (Community Server edition). For the administration of the server, as well as for easier manipulation of data and database structure, the MySQL GUI Tools bundle is available. During the development of IMF I actively used MySQL Administrator and MySQL Query Browser. These tools offer an intuitive user interface and all desired functionality.
Since these tools use a TCP/IP connection to the MySQL server, it is also possible to use them for the administration of a remote server.

4.3 Other software components used

Apart from the standard Java libraries, the following software packages were also used.

4.3.1 Jmol

Jmol is an open-source molecule viewer written in Java [12]. Its primary goal is to provide an easy-to-use tool for visualizing various chemical structures in space. Jmol displays a 3D model of a given molecule. This model can be rotated around all three axes and moved around the viewport window, and it is possible to adjust the zoom level. Each atom in the model has a color assigned based on its type.

Jmol provides three primary ways of use: as a web applet, as a desktop application and as a component that can be integrated into new Java applications. We have used this last capability to provide users of our application with a 3D view of the molecules contained in a molecule database. More information about the integration of the Jmol library into IMF can be found in the Implementation part of this work. The stand-alone Jmol application was also very useful in the early stages of the development of IMF, before we had finished our own visualization component. Some of the images in this work were also produced by Jmol.

Figure 4.1: Jmol desktop application

4.3.2 JDBC drivers

JDBC is an API that defines the way of communication between programs written in Java and stand-alone database servers. For each of the tested database servers, the appropriate JDBC driver had to be installed and used.

Chapter 5
Chemical data and means of its storage

During the development of IMF it was necessary to have a sufficiently large sample of chemical data. There are several databases of various chemical substances that are available for free download. Bc. Lucie Kaiserová prepared a list of available chemical databases for her bachelor's thesis, and we have used data from these databases [7].
Data in these databases are available in the SDFile (structure-data file) format [10]. IMF is able to load this format and store the contained molecules in an RDBMS engine. The SMILES molecule representation is also accepted as an alternative input format. A brief description of these data formats follows.

5.1 MDL Molfile and SDFile

Several file formats are generally used for storing information about molecular structures. A large amount of chemical data is available in the MDL Molfile format and its SDFile variant. Molfile is a simple text-based file format that stores molecule information in the form of a simple table with two sections. The first section contains information about the atoms contained in the given molecule, including information about their spatial arrangement. Information about the bonds between atoms and their types is stored in the second section.

The Molfile is intended to store information about only one molecule, with limited space for storing meta information about the contained structure. SDFile is a variant that allows storage of multiple molecules in one file, and additional information can be provided for each contained molecule. Detailed information about these file formats can be found in [8].

Below is a sample description table in the Molfile format (pyruvic acid molecule). See Figure 5.1 for a graphical representation of this molecule.

 10  9  0  0  0  0  0  0  0  0  1 V2000
   -0.0162    1.3417    0.0094 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0021   -0.0041    0.0020 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.0257    1.9623    0.0028 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3171    2.0694    0.0195 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3548    1.4513    0.0211 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3375    3.5762    0.0278 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8460   -0.4761   -0.0086 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8244    3.9417    0.9174 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8329    3.9513   -0.8625 H   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3700    3.9255    0.0346 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  2  0  0  0  0
  1  4  1  0  0  0  0
  2  7  1  0  0  0  0
  4  5  2  0  0  0  0
  4  6  1  0  0  0  0
  6  8  1  0  0  0  0
  6  9  1  0  0  0  0
  6 10  1  0  0  0  0

Figure 5.1: Pyruvic acid

5.2 Relational databases

Information about molecular structures can become rather large. Some of the databases we have used were several gigabytes in size. As mentioned earlier, the input SDFiles used for data import are just simple flat files with a text description of the contained molecules. These can easily be used for sequential processing, but they do not perform well when we are looking for molecules based on their structural information.

Rather than building some kind of indexing system over the SDFiles or using sequential searching, we have decided to store the molecule information in an existing database server. This approach makes it possible to store further meta information about each molecule, as well as to use the built-in indexing to improve search speed.

During the development of IMF we tested several freely available database servers (more about this can be found in the Implementation part of this work). In the end we decided to use the MySQL Community Server; however, with a few minor modifications it is possible to use our application with almost any database server for which a JDBC driver is available.

5.3 SMILES definitions

SMILES (simplified molecular input line entry specification) is a format that represents a molecular structure in the form of a simple string. For example, the SMILES definition of the pyruvic acid mentioned earlier is: CC(C(O)=O)=O. In IMF it is possible to load a SMILES definition of a molecule and run an isomorphism search against the molecules stored in the database.

The advantages of the SMILES format are the small size of the resulting string representing a molecule and the wide availability of SMILES descriptions for many molecules.
A molecule loaded from SMILES is defined unambiguously, but several different SMILES strings can represent one molecule. Thus the fact that two molecules are defined by different SMILES strings does not imply that they are non-isomorphic. The SMILES representation depends on the indices in the graph representation of the molecule from which the SMILES string was generated. Attempts exist to introduce canonical indexing for SMILES, but currently there are several different standards. For more information about SMILES see [9].

Part IV
Implementation

Chapter 6
Description of IMF Application

In this chapter I will describe the IMF application. IMF is the result of an effort to use the theoretical background to implement a usable application that allows molecule information to be stored in an SQL database and isomorphism searches to be performed on the available data.

Figure 6.1: IMF graphical user interface

IMF is written in Java; the source code of the application can be found on the attached DVD. We have decided to use this widespread, object-oriented language on the assumption that our implementation should be easily understood by the majority of possibly interested readers. Another advantage of using Java is the portability to a wide range of platforms. The development itself was done simultaneously on both Linux and Windows XP operating systems, and the code base was maintained on an SVN server. IMF installation and setup instructions can be found on the attached DVD.

6.1 Basic program description

IMF offers basic functionality to work with a database of molecules, view their basic characteristics and search for isomorphic molecules within the database. This functionality is available both through the command line interface (CLI), which can easily be incorporated into scripts, and through the graphical user interface (GUI), which is designed for interactive use.
This chapter focus on description of the GUI, for description of CLI please see [10]. The functionality of GUI is for better clarity divided into five different areas and available to the user in form of a tabbed pane1. Following is the list of used tabs and brief description of their function. • Find isomorphism: find all isomorphic molecules for given input molecule. • Database setup: setup of connection information for database that is used for searches • Export / Import: import of all data from SDFile into database and possibility to select one or more molecules and export them into a SDFile • Database browser: browse the database and display information about molecules • Backtracking visualization: visualization of the run of backtracking algorithm 6.2 SDFile import First step after the setup of IMF and MySQL server (see attached DVD for detailed instruc- tions) is usually to load molecule data into SQL database2. To load data from SDFile or Molfile select the Export / Import tab, choose the file to import and fill in the additional infor- mation about new data source (see Figure 6.2). Data source is a term we use to identify a data file, that was imported into IMF database together with additional information about this file – each molecule in database holds information about its source. In the database setup tab it possible to choose which database sources will be used for isomorphism search and other molecule operations (see Figure 6.3). 6.3 Database browser Once the data are loaded in the database, the database browser (see Figure 6.4) can be used to display model of the molecules in the database as well as information about their properties. In the left part of the window is integrated the Jmol visualisation plug-in that is used to display model of current molecule. 
This model can be rotated and positioned using mouse 1Tab is a navigational widget for switching between sets of controls or documents[2] 2The isomorphism search will work without loading data into the database, however additional functionality that uses database storage will not be available. 20 6.3. DATABASE BROWSER Figure 6.2: Import of new data source Figure 6.3: Selection of data sources wheel and buttons. There is a simple navigation panel that can be used to go through the database or to jump to a molecule with given ID below the model. Summary information about displayed molecule is shown in the right part of window. • Name: Name of the molecule, extracted from the source SDFile. • ID: ID of the molecule in the IMF database • Fingerprint and Bond Fingerprint: These are the characteristic strings of molecule generated by IMF (for more information see Section 9.1) – only molecules that have both of these characteristics same can be isomorphic. • # of potential isomorphisms: number of molecules in currently selected data sources that can be potentially isomorphic with current molecule 21 6.3. DATABASE BROWSER • SMILES: SMILES string for current molecule • Molecule source information: Description of source data file for given molecule Figure 6.4: Database browser Description of available buttons • Find isomorphism: Switches to the Find isomorphism tab sets the source molecule ID to the ID of the current molecule • Add to export list: Adds the molecule to the Export list (see below) • First / Prev / Next / Last: Buttons that allow to jump to the first / previous / next / last molecule in the database. • Go: Jumps to the molecule with chosen ID. • Add all potential isomorphisms to export list: Adds the molecule and all other molecules with the matching fingerprints and bond fingerprints to the export list. • Prev / Next: Buttons that allow to jump to the next / previous molecule with the same fingerprint. 22 6.4. 
6.4 Export list

It might be useful to export molecules from the IMF database for further processing in another software tool. This is possible using the list of molecules to export on the Export / Import tab. Molecules are inserted into this list from the database browser. The molecules in the list can then be displayed and exported into a single SDFile.

Figure 6.5: Export list

6.5 Isomorphism search

The main goal of our application is to provide an easy-to-use search for isomorphic molecules. This function is available in the first tab. For an isomorphism search it is necessary to define the input molecule and the database over which the search will be performed.

Figure 6.6: Defining the source molecule of isomorphism search

There are three ways to define the input molecule: from an existing database, by entering the molecule's ID, Name or Original ID; from an SDFile, by specifying the full path to the file and the index of the molecule that should be used; or by entering a SMILES description of the molecule. The database of potentially isomorphic molecules may be either the internal IMF database (it is possible to connect to a different database than the one the source molecule comes from) or an SDFile.

Figure 6.7: Selecting the input database

After the input parameters are specified, the isomorphism search may be performed. Since some of the available molecule descriptions do not include hydrogen atoms, it is possible to perform the search without taking hydrogens into account (see Figure 6.8). Hydrogen atoms are always ignored if the source molecule is entered in the form of a SMILES description (for an explanation, please see [8]). The search progress and the list of found isomorphic molecules are shown in the bottom part of the window; if the search was performed against the IMF database, each found molecule can be shown in the database browser.
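The hydrogen-free comparison mentioned above amounts to removing all hydrogen atoms and their bonds from both molecular graphs before the search. The following is a minimal sketch of that step, assuming a molecule is stored as an array of element symbols plus a list of bonds {from, to, order}; this representation and the names HydrogenStripper, Mol and stripHydrogens are hypothetical and do not describe the internal IMF classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of hydrogen removal; not the IMF implementation. */
class HydrogenStripper {

    /** Assumed minimal molecular graph: element symbols and bonds as {from, to, order}. */
    static final class Mol {
        final String[] atoms;
        final int[][] bonds;

        Mol(String[] atoms, int[][] bonds) {
            this.atoms = atoms;
            this.bonds = bonds;
        }
    }

    /** Returns a copy of the molecule without "H" atoms; remaining atoms are renumbered from 0. */
    static Mol stripHydrogens(Mol m) {
        int[] newIndex = new int[m.atoms.length];
        List<String> keptAtoms = new ArrayList<>();
        for (int i = 0; i < m.atoms.length; i++) {
            if (m.atoms[i].equals("H")) {
                newIndex[i] = -1; // marks a removed atom
            } else {
                newIndex[i] = keptAtoms.size();
                keptAtoms.add(m.atoms[i]);
            }
        }
        List<int[]> keptBonds = new ArrayList<>();
        for (int[] b : m.bonds) {
            int from = newIndex[b[0]];
            int to = newIndex[b[1]];
            // Keep only bonds whose both end points survived.
            if (from >= 0 && to >= 0) {
                keptBonds.add(new int[] {from, to, b[2]});
            }
        }
        return new Mol(keptAtoms.toArray(new String[0]), keptBonds.toArray(new int[0][]));
    }
}
```

Applying the same reduction to the source molecule and to every candidate makes descriptions with and without explicit hydrogens comparable. The sketch only matches the symbol "H"; how isotopes or other special cases should be treated is left open here.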
Figure 6.8: Results of searching for molecules isomorphic to "NSC 49" from the NCI database [7]

6.6 Backtracking visualization

Even though it is not a functional necessity, the GUI also contains a specialized visualization component that shows the run of the backtracking algorithm on a potentially isomorphic pair of molecules. It allows us to better demonstrate how the backtracking algorithm works, and we have also used this module to find ways to improve backtracking performance.

Figure 6.9: Backtracking visualization component

The component displays both molecules: the “source” one in the blue field and the “examined” one in the red field (Figure 6.9 shows two molecules from the NCI database: NSC 12 and NSC 2629). The source molecule is always the same as the molecule currently displayed in the database browser. The blue field contains four navigational buttons for going through the database; changing the selected molecule also changes the molecule shown in the database browser. The red field also contains navigational buttons, but these only cycle through the molecules in the database that are potentially isomorphic with the source molecule. If there is no potentially isomorphic match, a random isomorphic molecule is generated from the source molecule by shuffling the atom indexing of the molecule in memory (the same can be done for a molecule that does have potential isomorphisms by pressing the RandIso button).

The controls in the top toolbar have the following functions:

• Test Isomorphism: starts the visualization animation.
• Reset colors: after the visualization run is done, both molecules are coloured either green (if they are isomorphic) or red (if they are not), see Figure 6.10. This button resets the colours of the displayed atoms back to their defaults.
• Orientation buttons: change the orientation of the red and blue fields to horizontal / vertical.
(see Figure 6.10)
• Autoresize: if checked, molecules are automatically resized when the size of the window or of the visualization panels changes.
• Show Hydrogen atoms: especially for large molecules, the clarity of the display may be improved by hiding hydrogen atoms.
• Panels linked: if checked, molecules are resized / moved simultaneously in both the blue and the red field.
• Animation speed: slider that controls the speed of the visualization.

Figure 6.10: Display of two isomorphic (left) and non-isomorphic (right) molecules

Chapter 7
Design Specification

7.1 IMF Architecture
<<