MASARYK UNIVERSITY FACULTY OF INFORMATICS

Effective algorithms for searching of identical molecules and their application

DIPLOMA THESIS

Bc. Karol Kružel

Brno, spring 2008

Declaration

I hereby declare that this thesis is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed with complete reference to the due source.

Advisor: RNDr. Radka Svobodová Vařeková, Ph.D.

Acknowledgement

First of all, I would like to thank Radka Svobodová Vařeková for her time, advice and never-ending support during my work on this thesis. My thanks go to Ján Slaninka, with whom I worked on the IMF project, for all the effort and especially for keeping me awake. From the depth of my heart I want to thank DD. You know that the list would be longer than this work... I would also like to thank my mother and the rest of my family and friends for their support and patience during the whole time I studied at the Faculty of Informatics. Last but not least, I would like to thank Lukáš Peterka for introducing me to the world of programming.

Abstract

This master's thesis focuses on searching for isomorphic molecules. In computers, molecules are usually represented in the form of molecular graphs. While this structure is suitable for further processing, it has the disadvantage of possibly ambiguous indexing of atoms in the represented molecule. This work starts with definitions of all required terms, and then the problem of graph isomorphism is briefly described. Possible solutions, such as the brute-force approach, backtracking and its further optimization using equivalence classes of graph vertices, are listed and their expected performance is briefly described. The implementation part of this work describes the design and implementation of the IMF tool – software that allows detection of isomorphic molecules as well as management of a database of molecular data.

Keywords

molecular graph, graph isomorphism, isomorphism detection, molecule visualization, Java, Jmol, Swing, MySQL

Contents

I Introduction
II Theoretical part
  1 Basic definitions
    1.1 Atom and chemical element
    1.2 Molecules and chemical bonds
  2 Graphs and molecular graphs
    2.1 Undirected graph
    2.2 Molecular graph
    2.3 Adjacency matrix
  3 Graph isomorphism
    3.1 Isomers, graph isomerism
    3.2 Graph isomorphism
    3.3 Non-isomorphism detection
    3.4 Isomorphism detection algorithms
      3.4.1 Brute force
      3.4.2 Backtracking
      3.4.3 Equivalence classes of vertices in molecular graphs
III Methods
  4 Development environment
    4.1 Java and NetBeans IDE
    4.2 MySQL database server
    4.3 Other software components used
      4.3.1 Jmol
      4.3.2 JDBC drivers
  5 Chemical data and means of its storage
    5.1 MDL Molfile and SDFile
    5.2 Relational databases
    5.3 SMILES definitions
IV Implementation
  6 Description of IMF Application
    6.1 Basic program description
    6.2 SDFile import
    6.3 Database browser
    6.4 Export list
    6.5 Isomorphism search
    6.6 Backtracking visualization
  7 Design Specification
    7.1 IMF Architecture
    7.2 Molecule representation
  8 Database layer description
    8.1 DB schema description
    8.2 MoleculeDB class
  9 Implementation of isomorphism search algorithms
    9.1 Molecule fingerprints
    9.2 Brute force approach
    9.3 Backtracking
      9.3.1 Backtracking algorithm
  10 Implementation of user interface
    10.1 GUI implementation
    10.2 MoleculePanel
    10.3 JmolPanel
V Conclusion
  11 Conclusion
Bibliography
Index
A Contents of attached DVD

Part I

Introduction

Pure informatics is a very interesting and important field of study. However, I believe that the greatest potential of computer science lies in its application in other fields. One such interdisciplinary application is computational chemistry – Prof. Schleyer, one of the greatest thinkers of this field, defines it in the following way: “... endeavour to model all aspects of real chemistry as accurately as possible using calculations instead of experiments”.

Computational chemistry was established in the 1960s as a complementary field to experimental chemistry. Experimental research of many chemical processes and properties of chemical substances is very difficult. Chemical substances might, for instance, be toxic, radioactive or just too expensive to use. Chemical processes are often extremely fast, explosive, exothermic or bound to conditions that are very hard to achieve. In cases like these it is practical to replace experiments with simulations – instead of classical chemical methods we use computational chemistry.

For computational processing of chemical substances it is necessary to provide the computer with a molecule description in a format that provides information about its chemical structure (number of atoms and bonds between them). One of the ways to describe a molecule is a graph data structure – atoms are the vertices and bonds are the edges. This representation requires that all atoms have indices assigned and that each bond is marked by a pair of indices that belong to the respective bonded atoms. Graph representation is one of the most used molecule descriptions in computational chemistry. Its main advantage is that it does not imply compression of any kind and that data about atoms, bonds and optionally other information (such as spatial placement of atoms, length of bonds or bond and torsional angles) is readily available for algorithmic processing.

An unfortunate aspect of graph representation is that a molecule with N atoms can be indexed in N! different ways. The problem that arises is to identify whether two graphs with different indexing of vertices represent the same molecule or not. The solution of this problem lies in finding an isomorphism (a bijective mapping of chemically equivalent atoms) between the two given graphs – if such a mapping exists, both graphs represent the same molecule. However, a naive isomorphism search has complexity O(N!), because N! is the number of possible bijections between the two graphs. It is obvious that the isomorphism search has to be optimized (backtracking, dividing atoms into equivalence classes, ...), so that the impact of the factorial complexity is eliminated as much as possible. It would also be beneficial to leave out the isomorphism search entirely when possible – e.g. to use the numerical characteristics of the given molecular graphs (number of atoms, degrees of vertices, etc.) to find out in advance that the graphs cannot describe the same molecule.

Graph representation of molecules and isomorphism testing of two given molecules might be very useful for experimental as well as computational chemists. It allows searching for molecules (as well as additional information about them) in chemical databases. Currently there are several such databases available. They contain a large amount of useful information (pharmaceuticals, inorganic molecules, crystal structures, ...) and it would be beneficial to perform fast searches in these databases. The goal of my thesis is to create a software tool that is able to work with such databases and to perform searches based on numerical characteristics and isomorphisms of the contained molecules. This thesis was written in cooperation with the ANF DATA company and it is part of the projects in the Life Science & Innovations section.

The goals of the thesis

a) Gain knowledge of the basic terms required to work with molecular structures, in particular: molecular graph, isomerism, canonical indexing, isomorphism, numerical characteristics of a molecule.
b) Understand the file formats used to store molecule descriptions in a computer (SDF, MOL, ...).
c) Design a methodology for molecule search in databases. Find an appropriate application of numerical characteristics of molecules in combination with efficient improvements of the isomorphism search algorithm (backtracking, equivalence classes of atoms, ...).
d) Implement software that loads several databases of molecules and an input molecule, searches through the loaded databases and outputs the position of the input molecule within them. Develop the software in such a way that the search finishes in acceptable time.
e) Test the functionality and effectiveness of the software using existing molecule databases.

Note: The topic of this work allows for realization by two cooperating students. The work on the project can be divided into separate parts that can be completed independently. The focus of the task is quite wide, and the given problem is difficult from the viewpoint of understanding the application domain (chemistry), from the algorithmic viewpoint (isomorphism, molecular graphs) and from the implementation viewpoint (use of large databases, molecule visualization).

Part II

Theoretical part

Chapter 1 Basic definitions

Before looking at the representation of chemical structures in computers and describing the algorithms that work with these structures, let us have a look at a few basic chemical definitions.

1.1 Atom and chemical element

An atom is the smallest particle that is recognized as a chemical element. Each atom has a nucleus that consists of positively charged protons and electrically neutral neutrons. The nucleus is surrounded by a cloud of negatively charged electrons. Chemical elements are distinguished by the number of protons contained in the atomic nucleus – this number is called the atomic number. [5, 6]

1.2 Molecules and chemical bonds

Electrons surrounding the nucleus of an atom are often described by dividing them into several electron shells – all electrons in one shell have the same amount of energy. The outermost shell of electrons is called the valence shell and it determines the ability of a given element to form chemical bonds. Electrons in the valence shell are also called valence electrons.

A bond between atoms is formed when the distance between two atoms is small enough for an overlap of their valence shells to appear. This overlap causes the valence electrons of both atoms to change the trajectory of their movement. If this newly established system has a lower energy than before, then the two atoms will remain in their new positions and a chemical bond is created. Depending on the number of electrons that form the chemical bond we talk about single (one pair of valence electrons), double (two pairs) or triple (three pairs) bonds.

Two or more atoms held together by strong chemical bonds in a stable state form a molecule. Molecules can be described using molecular formulas or chemical names; these usually provide information about the number of atoms and their types. More information is provided in structural formulas, which also describe how the atoms in the molecule are arranged. [1]

In computational chemistry there are several ways of representing molecules in a computer. I will describe some of the file formats commonly used to store molecular structure later in this work. From the informatics point of view a molecule is very often described as a graph structure. Before describing in detail how a molecule is represented in computer memory, we have to define a few basic terms from graph theory.

Chapter 2 Graphs and molecular graphs

2.1 Undirected graph

An undirected graph G is defined [11] as an ordered pair G = (V, E) where:

• V is a set of vertices of the graph G

• E is a set of edges of the graph G: E ⊆ {{v, w} | v ∈ V ∧ w ∈ V ∧ v ≠ w}

Figure 2.1 shows a simple graph G = (V, E) with
V = {v1, v2, v3, v4}
E = {{v1, v3}, {v2, v3}, {v3, v4}}

Figure 2.1: Sample undirected graph

2.2 Molecular graph

To represent a molecule, we usually use a labeled pseudograph, which better describes the atoms and chemical bonds in a molecule. A molecular graph [1] can be defined as a quintuple G = (V, E, f, φ, S) where:

• V is a set of vertices of the molecular graph G

• E is a set of edges of the molecular graph G

• f is a function that defines end vertices for each edge

• φ is a valuation function for vertices

• S is a set of symbols for valuation of vertices

The following is a sample graph representation of the formaldehyde molecule:

V = {v1, v2, v3, v4}
E = {e1, e2, e3, e4, e5, e6}
f(e1) = {v1, v3}; f(e2) = {v2, v3}; f(e3) = {v3, v4}; f(e4) = {v3, v4}; f(e5) = {v4, v4}; f(e6) = {v4, v4}
φ(v1) = H; φ(v2) = H; φ(v3) = C; φ(v4) = O
S = {C, O, H}

Figure 2.2: Formaldehyde molecule and its graph representation
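The quintuple above maps quite directly onto a small data structure. The following is a minimal Java sketch, not the representation used in IMF: the class name `MolecularGraph`, its fields and its methods are hypothetical, and vertices are numbered from 0 (so v1 becomes index 0). The function f is stored as a list of end-vertex pairs, which allows both parallel edges and loops.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative labeled pseudograph G = (V, E, f, phi, S); a sketch, not the IMF class. */
class MolecularGraph {
    final String[] labels;                        // phi: vertex index -> element symbol
    final List<int[]> edges = new ArrayList<>();  // f: edge index -> pair of end vertices

    MolecularGraph(String... labels) {
        this.labels = labels.clone();
    }

    /** Adds one edge; v == w gives a loop, repeated calls give parallel edges. */
    void addEdge(int v, int w) {
        edges.add(new int[] {v, w});
    }

    /** Number of edges whose end vertices are v and w (the bond order when v != w). */
    int bondOrder(int v, int w) {
        int n = 0;
        for (int[] e : edges) {
            if ((e[0] == v && e[1] == w) || (e[0] == w && e[1] == v)) {
                n++;
            }
        }
        return n;
    }

    /** Formaldehyde from the example above, with 0-based indices (v1 -> 0, ..., v4 -> 3). */
    static MolecularGraph formaldehyde() {
        MolecularGraph g = new MolecularGraph("H", "H", "C", "O");
        g.addEdge(0, 2); // e1 = {v1, v3}
        g.addEdge(1, 2); // e2 = {v2, v3}
        g.addEdge(2, 3); // e3 = {v3, v4}
        g.addEdge(2, 3); // e4 = {v3, v4}, second edge of the double bond
        g.addEdge(3, 3); // e5 = {v4, v4}, loop on the oxygen vertex
        g.addEdge(3, 3); // e6 = {v4, v4}, loop on the oxygen vertex
        return g;
    }
}
```

With this sketch, `MolecularGraph.formaldehyde().bondOrder(2, 3)` counts the two parallel edges between the carbon and oxygen vertices.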

2.3 Adjacency matrix

Another way to represent a molecular structure and a molecular graph is an adjacency matrix. The adjacency matrix A = (aij) of a given molecular graph G is a symmetric matrix with rows and columns corresponding to the vertices of the graph G. The entries of the adjacency matrix are set as follows:

• aii = number of non-bonding valence electrons of the atom represented by vertex vi

• aij = aji = number of edges e with f(e) = {vi, vj}, for i ≠ j

The molecular graph of formaldehyde seen in Figure 2.2 can be represented by this matrix:

\[
A = \begin{pmatrix}
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 \\
1 & 1 & 0 & 2 \\
0 & 0 & 2 & 4
\end{pmatrix}
\]
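As an illustration, the matrix can be derived mechanically from the edge list of the molecular graph. The sketch below is not taken from IMF; the class and method names are hypothetical. It assumes 0-based vertex indices and assumes that each loop {vi, vi} stands for one electron pair, so it adds 2 to the diagonal entry. Under that assumption the formaldehyde edge list yields the matrix given above.

```java
/** Illustrative helper (hypothetical name) that builds an adjacency matrix from an edge list. */
class AdjacencyMatrix {

    /**
     * @param n     number of vertices
     * @param edges end vertices of every edge, 0-based; {v, v} denotes a loop
     */
    static int[][] fromEdges(int n, int[][] edges) {
        int[][] a = new int[n][n];
        for (int[] e : edges) {
            int v = e[0];
            int w = e[1];
            if (v == w) {
                a[v][v] += 2;   // assumption: one loop = one pair of electrons
            } else {
                a[v][w]++;      // parallel edges accumulate into the bond order
                a[w][v]++;      // keep the matrix symmetric
            }
        }
        return a;
    }
}
```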

Chapter 3 Graph isomorphism

3.1 Isomers, graph isomerism

Isomers are molecules that can be represented using one molecular formula, while their structural formulas may differ. This means that these molecules have the same number of atoms of each type, but the bonds between these atoms and their types may be arranged in a different way for each isomer. Given a molecular graph G = (V, E, f, φ, S) with M being the number of edges in G and a molecular graph G′ = (V′, E′, f′, φ′, S) with M′ being the number of edges in G′, we say that G and G′ are isomeric if the following conditions are met:

• The number of edges in both graphs is equal: M = M′.

• A bijection ψ : V → V′ exists such that the valuation of vertices is preserved: φ(w) = φ′(ψ(w)) for each w ∈ V. We say that the sets V and V′ are similar.

Two molecular graphs are isomeric if they represent molecules that are isomers.
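Both conditions are cheap to test. A valuation-preserving bijection ψ exists exactly when both graphs contain every symbol the same number of times, so it is enough to compare the sorted lists of vertex labels. The following Java sketch illustrates this; the class and parameter names are hypothetical and are not part of IMF.

```java
import java.util.Arrays;

/** Illustrative check of the two isomerism conditions (hypothetical helper). */
class IsomerCheck {

    static boolean isomeric(String[] labels1, int edgeCount1,
                            String[] labels2, int edgeCount2) {
        // Condition 1: M = M'
        if (edgeCount1 != edgeCount2) {
            return false;
        }
        // Condition 2: a label-preserving bijection exists
        // if and only if the sorted label lists are identical.
        String[] a = labels1.clone();
        String[] b = labels2.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        return Arrays.equals(a, b);
    }
}
```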

Figure 3.1: Two possible isomers for molecular formula CH2O and their molecular graphs

3.2 Graph isomorphism

A molecular graph G = (V, E, f, φ, S) and a molecular graph G′ = (V′, E′, f′, φ′, S) are isomorphic if a bijection p : V → V′ exists such that:

• if vertices v, w ∈ V form n edges {v, w} in graph G, then vertices p(v), p(w) ∈ V′ form n edges {p(v), p(w)} in graph G′

• the valuation of vertices remains unchanged: φ(v) = φ′(p(v)) for each v ∈ V

Graph isomorphism is an equivalence relation on graphs. If two molecular graphs G and G′ are isomorphic, then they represent the same molecule.

Figure 3.2: Two different graphical representations of the same molecule

The bijection p : V → V′ can be expressed as a permutation matrix P. Then for the adjacency matrices A and A′ of two isomorphic graphs G and G′ the following applies: A = Pᵀ A′ P
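The matrix identity can be checked numerically. The sketch below assumes the convention that P has a 1 in row p(i) of column i, under which the entry (i, j) of Pᵀ A′ P equals the entry (p(i), p(j)) of A′. The class and method names are hypothetical, vertices are 0-based, and the mapping is passed as an array `p` with `p[i]` being the image of vertex i.

```java
import java.util.Arrays;

/** Illustrative check of A = P^T A' P for a given vertex mapping p (hypothetical helper). */
class PermutationCheck {

    /** Builds P with P[p[i]][i] = 1 (assumed convention). */
    static int[][] permutationMatrix(int[] p) {
        int n = p.length;
        int[][] m = new int[n][n];
        for (int i = 0; i < n; i++) {
            m[p[i]][i] = 1;
        }
        return m;
    }

    static int[][] transpose(int[][] x) {
        int n = x.length;
        int[][] t = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                t[j][i] = x[i][j];
            }
        }
        return t;
    }

    static int[][] multiply(int[][] x, int[][] y) {
        int n = x.length;
        int[][] r = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < n; k++) {
                    r[i][j] += x[i][k] * y[k][j];
                }
            }
        }
        return r;
    }

    /** True if A equals P^T A' P for the permutation matrix P of the mapping p. */
    static boolean satisfies(int[][] a, int[][] aPrime, int[] p) {
        int[][] pm = permutationMatrix(p);
        int[][] rhs = multiply(multiply(transpose(pm), aPrime), pm);
        return Arrays.deepEquals(a, rhs);
    }
}
```

For example, if A′ is the formaldehyde matrix with its vertices listed in reverse order, the mapping `{3, 2, 1, 0}` satisfies the identity while the identity mapping does not.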

3.3 Non-isomorphism detection

No polynomial-time algorithm is currently known for determining whether two graphs are isomorphic, nor has it been proved that no such algorithm exists. Since the isomorphism of two graphs can only be proven by finding the correct permutation of the vertices of one of the tested graphs, the test is usually time-consuming. However, for many pairs of graphs G and G′ it can easily be seen that they are non-isomorphic. It is therefore more time-effective to eliminate graphs that cannot be isomorphic first.

There are several characteristics of molecular graphs (e.g. number of edges, types of contained atoms) that can be computed in O(n) or O(n²) time. If the values of these basic characteristics differ for the given graphs, then the graphs are not isomorphic.
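A possible pre-filter of this kind is sketched below. It is only an illustration with hypothetical names and is not the fingerprint scheme used by IMF. For every vertex it forms the pair (element symbol, number of bonds), where the number of bonds is taken as the sum of the off-diagonal entries of the vertex's row in the adjacency matrix; the diagonal is skipped deliberately so that loops do not hide differences between isomers. Two graphs whose sorted lists of such pairs differ cannot be isomorphic; equal lists prove nothing and a full isomorphism test is still needed.

```java
import java.util.Arrays;

/** Illustrative O(n^2) non-isomorphism filter (hypothetical helper, not IMF's fingerprint). */
class QuickReject {

    /** Sorted list of "symbol:bondCount" entries, one per vertex. */
    static String[] signature(String[] labels, int[][] a) {
        String[] sig = new String[labels.length];
        for (int v = 0; v < labels.length; v++) {
            int bonds = 0;
            for (int w = 0; w < labels.length; w++) {
                if (w != v) {
                    bonds += a[v][w];   // off-diagonal entries only
                }
            }
            sig[v] = labels[v] + ":" + bonds;
        }
        Arrays.sort(sig);
        return sig;
    }

    /** False means "certainly not isomorphic"; true means "a full test is required". */
    static boolean mayBeIsomorphic(String[] labels1, int[][] a1,
                                   String[] labels2, int[][] a2) {
        return Arrays.equals(signature(labels1, a1), signature(labels2, a2));
    }
}
```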

3.4 Isomorphism detection algorithms

3.4.1 Brute force

The most trivial algorithm for isomorphism detection is the naive “brute force” approach. It is very easy to implement, but it is also the slowest one. For each possible bijection p : V → V′ we test whether p maps G onto G′, i.e. whether it preserves both the valuation of vertices and the edges. The complexity of this algorithm belongs to O(n!), as for sets V and V′ with cardinality n there are n! bijections p : V → V′.
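A compact Java sketch of the brute-force test follows. It is an illustration under stated assumptions rather than the IMF implementation: molecules are given as a label array and an adjacency matrix, vertices are 0-based, and all names are hypothetical. Candidate mappings are enumerated in lexicographic order, and each one is checked entry by entry.

```java
/** Illustrative brute-force isomorphism test over all n! mappings (hypothetical helper). */
class BruteForceIsomorphism {

    static boolean isomorphic(String[] l1, int[][] a1, String[] l2, int[][] a2) {
        int n = l1.length;
        if (n != l2.length) {
            return false;
        }
        int[] p = new int[n];
        for (int i = 0; i < n; i++) {
            p[i] = i;                       // start with the identity mapping
        }
        do {
            if (isIsomorphism(p, l1, a1, l2, a2)) {
                return true;
            }
        } while (nextPermutation(p));
        return false;
    }

    /** Checks that p preserves labels and every adjacency matrix entry. */
    static boolean isIsomorphism(int[] p, String[] l1, int[][] a1, String[] l2, int[][] a2) {
        for (int i = 0; i < p.length; i++) {
            if (!l1[i].equals(l2[p[i]])) {
                return false;
            }
            for (int j = 0; j < p.length; j++) {
                if (a1[i][j] != a2[p[i]][p[j]]) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Advances p to the next permutation in lexicographic order; false when none is left. */
    static boolean nextPermutation(int[] p) {
        int i = p.length - 2;
        while (i >= 0 && p[i] >= p[i + 1]) {
            i--;
        }
        if (i < 0) {
            return false;
        }
        int j = p.length - 1;
        while (p[j] <= p[i]) {
            j--;
        }
        swap(p, i, j);
        for (int l = i + 1, r = p.length - 1; l < r; l++, r--) {
            swap(p, l, r);
        }
        return true;
    }

    private static void swap(int[] p, int i, int j) {
        int t = p[i];
        p[i] = p[j];
        p[j] = t;
    }
}
```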

3.4.2 Backtracking

Since the valuation of vertices has to remain the same in isomorphic graphs, the backtracking algorithm limits the search for matching permutations to structures of vertices with the same valuation. We mark one of the input graphs as the source graph. The source graph is then searched depth-first, while the matching structure of vertices is searched for in the second graph. In the second graph an arbitrary starting vertex is chosen, and the graph is then traversed from this vertex, choosing edges with the same valuation as in the source graph. If it is not possible to find a matching partial bijection, the depth-first search steps back and a different permutation of possible vertices in the second graph is chosen for evaluation.

Isomorphism detection using backtracking

    Stack knownEqualVertexes

    equalFromVertex(vertex v1, vertex v2) {
        if [v1,v2] in knownEqualVertexes { return true }
        if v1.valuation <> v2.valuation { return false }
        if v1 and v2 differ in number of neighbours { return false }

        mark current position in knownEqualVertexes stack
        // assume the pair is equal before descending, so that the
        // recursion stops when it returns to v1 through a neighbour
        knownEqualVertexes.push(v1,v2)

        set nl1 to ordered list of neighbours of v1
        set nl2 to ordered list of neighbours of v2

        while exists unchecked permutation of items in nl2 {
            set allEqual to true
            for i = 1 to number of neighbours {
                if not equalFromVertex(nl1[i],nl2[i]) {
                    set allEqual to false
                    remove all items from knownEqualVertexes above [v1,v2]
                    jump to next permutation of nl2
                }
            }
            if allEqual { return true }
        }

        remove all items from knownEqualVertexes above set mark
        return false
    }

    GraphsIsomorphic(graph G1, graph G2) {
        choose a starting vertex v1 from G1

        for each vertex v2 from G2 {
            if equalFromVertex(v1,v2) { return true }
        }
        return false
    }
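For comparison, a runnable Java sketch of the backtracking idea is given here. It deliberately uses a simpler variant than the pseudocode above: instead of traversing neighbour lists depth-first, it assigns images to the vertices of the first graph in index order and abandons a partial mapping as soon as a label or an adjacency matrix entry disagrees. The class and method names are hypothetical, and molecules are assumed to be given as a label array and a symmetric adjacency matrix with 0-based vertices.

```java
/** Illustrative backtracking isomorphism test (simplified variant, hypothetical helper). */
class BacktrackingIsomorphism {

    static boolean isomorphic(String[] l1, int[][] a1, String[] l2, int[][] a2) {
        int n = l1.length;
        if (n != l2.length) {
            return false;
        }
        return extend(0, new int[n], new boolean[n], l1, a1, l2, a2);
    }

    /** Tries to map vertex v of graph 1, given that vertices 0..v-1 are already mapped by p. */
    private static boolean extend(int v, int[] p, boolean[] used,
                                  String[] l1, int[][] a1, String[] l2, int[][] a2) {
        int n = p.length;
        if (v == n) {
            return true;                       // every vertex has a consistent image
        }
        for (int c = 0; c < n; c++) {
            if (used[c] || !l1[v].equals(l2[c])) {
                continue;                      // image taken or valuation differs
            }
            if (!consistent(v, c, p, a1, a2)) {
                continue;                      // prune: edges to mapped vertices differ
            }
            p[v] = c;
            used[c] = true;
            if (extend(v + 1, p, used, l1, a1, l2, a2)) {
                return true;
            }
            used[c] = false;                   // step back and try the next candidate
        }
        return false;
    }

    /** Compares the diagonal entry and all entries towards already mapped vertices. */
    private static boolean consistent(int v, int c, int[] p, int[][] a1, int[][] a2) {
        if (a1[v][v] != a2[c][c]) {
            return false;
        }
        for (int u = 0; u < v; u++) {
            if (a1[v][u] != a2[c][p[u]]) {
                return false;
            }
        }
        return true;
    }
}
```

The worst case is still factorial, but a mismatch found for the first few vertices cuts off every permutation that shares that prefix.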

In the worst-case scenario, the backtracking algorithm also has a complexity of O(n!); however, in most cases it is significantly faster than the brute-force algorithm. The speed increase is inversely proportional to the number of vertices that have the same valuation. In the case of the molecular graphs defined before, the valuation of a vertex represents the chemical element. Since most molecules in organic chemistry contain many carbon atoms, we might want to change the valuation function so that these large groups of vertices with equal valuations are divided into smaller parts, thus allowing us to increase the speed of isomorphism testing.

3.4.3 Equivalence classes of vertices in molecular graphs

According to the definition of a molecular graph, each vertex has a value assigned – this value corresponds to the chemical symbol of the given atom. Given a molecular graph G, we can define an equivalence relation on the members of the set V: v1 ∼ v2 ⇐⇒ φ(v1) = φ(v2), for v1, v2 ∈ V. The vertices can then be divided into equivalence classes [a] = {x ∈ V | x ∼ a}. Since graph isomorphism is an equivalence relation that preserves the valuation of vertices, the division into equivalence classes must remain the same in the isomorphic graph G′ [1, 9].

When testing for isomorphism using backtracking, we only have to permute the vertices that belong to the same equivalence class; reducing the number of vertices in each equivalence class therefore decreases the size of the arrays to permute. This can be achieved by defining the ∼ relation so that it accounts for other characteristics besides the valuation of the vertex (e.g. the degree of the vertex). A detailed description of this method can be found in [8].
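The effect of refining the relation can be illustrated with a short Java sketch. All names are hypothetical, and the refinement key used here (element symbol plus the number of distinct neighbouring vertices) is only one possible choice, not necessarily the one described in [8]. The method `mappingCount` multiplies the factorials of the class sizes, which is the number of candidate mappings that remain when vertices are only permuted within their own class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative partition of vertices into equivalence classes (hypothetical helper). */
class EquivalenceClasses {

    /** Classes of the relation v1 ~ v2 <=> phi(v1) = phi(v2). */
    static Map<String, List<Integer>> byLabel(String[] labels) {
        Map<String, List<Integer>> classes = new TreeMap<>();
        for (int v = 0; v < labels.length; v++) {
            classes.computeIfAbsent(labels[v], k -> new ArrayList<>()).add(v);
        }
        return classes;
    }

    /** Refined relation: same symbol and same number of distinct neighbours (assumed key). */
    static Map<String, List<Integer>> byLabelAndNeighbours(String[] labels, int[][] a) {
        Map<String, List<Integer>> classes = new TreeMap<>();
        for (int v = 0; v < labels.length; v++) {
            int neighbours = 0;
            for (int w = 0; w < labels.length; w++) {
                if (w != v && a[v][w] > 0) {
                    neighbours++;
                }
            }
            String key = labels[v] + ":" + neighbours;
            classes.computeIfAbsent(key, k -> new ArrayList<>()).add(v);
        }
        return classes;
    }

    /** Product of the factorials of all class sizes. */
    static long mappingCount(Map<String, List<Integer>> classes) {
        long count = 1;
        for (List<Integer> c : classes.values()) {
            for (int k = 2; k <= c.size(); k++) {
                count *= k;
            }
        }
        return count;
    }
}
```

For a four-vertex skeleton with labels C, C, O, O in which the second carbon is bonded to all three other atoms, the partition by symbol leaves 2! · 2! = 4 candidate mappings, while the refined partition separates the two carbons and leaves 2.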

Part III

Methods

Chapter 4 Development environment

The following tools were used for the development part of my thesis (the IMF¹ software).

4.1 Java and NetBeans IDE

Java was chosen as the programming language for the implementation of the IMF tool. There are several popular Java IDEs available – NetBeans IDE 6 was chosen for the development of IMF. The reasons for this decision included: a feature-rich GUI designer for Swing-based applications, support for SVN together with other collaboration tools such as conversation support, an integrated profiler and the availability of a wide range of plug-in modules. Another useful feature is the NetBeans support for refactoring – it was helpful when, in some stages of development, it was necessary to modify the source code to improve performance after benchmarks were performed on chemical data.

4.2 MySQL database server

MySQL is a widely used database server, available in both commercial and free editions. Data storage in IMF is realized using MySQL server version 5.0 (Community Server edition). For the administration of the server, as well as for easier manipulation of data and database structure, the MySQL GUI Tools bundle is available. During the development of IMF I actively used MySQL Administrator and MySQL Query Browser. These tools offer an intuitive user interface and all the desired functionality. Since these tools use a TCP/IP connection to the MySQL server, it is also possible to use them for the administration of a remote server.

4.3 Other software components used

Apart from the standard Java libraries, the following software packages were also used.

4.3.1 Jmol

Jmol is an open-source molecule viewer written in Java [12]. Its primary goal is to provide an easy-to-use tool for visualizing various chemical structures in space. Jmol displays a 3D model of a given molecule. This model can be rotated around all three axes and moved around the viewport window, and it is possible to adjust the zoom level. Each atom in the model has a color specified based on its type.

Jmol provides three primary ways of use: as a web applet, as a desktop application and as a component that can be integrated into new Java applications. We have used this last capability to provide users of our application with a 3D view of the molecules contained in a molecule database. More information about the integration of the Jmol library into IMF can be found in the Implementation part of this work. The stand-alone Jmol application was also very useful in the early stages of the development of IMF, before we had finished our own visualization component. Some of the images in this work were also produced by Jmol.

¹ Isomorphic Molecule Finder

Figure 4.1: Jmol desktop application

4.3.2 JDBC drivers

JDBC is an API that defines the way of communication between programs written in Java and stand-alone database servers. For each of the tested database servers the appropriate JDBC driver had to be installed and used.

Chapter 5 Chemical data and means of its storage

During the development of IMF it was necessary to have a sufficiently large sample of chemical data. There are several databases of various chemical substances that are available for free download. Bc. Lucie Kaiserová prepared a list of available chemical databases for her bachelor’s thesis [7]; we have used data from these databases. Data in these databases are available in the SDFile¹ format [10]. IMF is able to load this format and store the contained molecules in an RDBMS engine. The SMILES molecule representation is also accepted as an alternative input format. A brief description of these data formats follows.

5.1 MDL Molfile and SDFile

Several file formats are in general use for storing information about molecular structures. A large amount of chemical data is available in the MDL Molfile format and its SDFile variant. Molfile is a simple text-based file format that stores molecule information in the form of a simple table with two sections. The first section contains information about the atoms contained in the given molecule, including information about their spatial arrangement. Information about the bonds between atoms and their types is stored in the second section.

The Molfile is intended to store information about one molecule only, with limited space for storing meta information about the contained structure. SDFile is a variant that allows storage of more molecules in one file, and additional information can be provided for each contained molecule. Detailed information about these file formats can be found in [8].

Below is a sample description table in the Molfile format (pyruvic acid molecule). See Figure 5.1 for a graphical representation of this molecule.

 10  9  0  0  0  0  0  0  0  0  1 V2000
   -0.0162    1.3417    0.0094 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0021   -0.0041    0.0020 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.0257    1.9623    0.0028 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3171    2.0694    0.0195 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3548    1.4513    0.0211 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3375    3.5762    0.0278 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8460   -0.4761   -0.0086 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8244    3.9417    0.9174 H   0  0  0  0  0  0  0  0  0  0  0  0
   -0.8329    3.9513   -0.8625 H   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3700    3.9255    0.0346 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  2  0  0  0  0
  1  4  1  0  0  0  0
  2  7  1  0  0  0  0
  4  5  2  0  0  0  0
  4  6  1  0  0  0  0
  6  8  1  0  0  0  0
  6  9  1  0  0  0  0
  6 10  1  0  0  0  0

¹ Structure-data file
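To show how the two sections of the table are read, here is a minimal Java sketch. It is not the IMF importer, and the class and field names are hypothetical. It assumes that the input starts at the counts line (the first line of the table above), and it assumes that fields are separated by whitespace. The Molfile format itself uses fixed-width columns, so a production reader would have to slice columns by position and would also have to handle the header lines and the rest of the file.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative reader for a Molfile-style description table (hypothetical helper). */
class ConnectionTable {
    final List<String> symbols = new ArrayList<>();       // element symbol of each atom
    final List<double[]> coordinates = new ArrayList<>(); // {x, y, z} of each atom
    final List<int[]> bonds = new ArrayList<>();          // {atom1, atom2, bondType}, 1-based

    /** Parses the counts line followed by the atom section and the bond section. */
    static ConnectionTable parse(List<String> lines) {
        ConnectionTable t = new ConnectionTable();
        String[] counts = lines.get(0).trim().split("\\s+");
        int atomCount = Integer.parseInt(counts[0]);
        int bondCount = Integer.parseInt(counts[1]);

        // First section: one line per atom with x, y, z and the element symbol.
        for (int i = 1; i <= atomCount; i++) {
            String[] f = lines.get(i).trim().split("\\s+");
            t.coordinates.add(new double[] {
                Double.parseDouble(f[0]),
                Double.parseDouble(f[1]),
                Double.parseDouble(f[2])
            });
            t.symbols.add(f[3]);
        }

        // Second section: one line per bond with both atom indices and the bond type.
        for (int i = atomCount + 1; i <= atomCount + bondCount; i++) {
            String[] f = lines.get(i).trim().split("\\s+");
            t.bonds.add(new int[] {
                Integer.parseInt(f[0]),
                Integer.parseInt(f[1]),
                Integer.parseInt(f[2])
            });
        }
        return t;
    }
}
```

Applied to the pyruvic acid table, such a reader would yield ten atoms and nine bonds, two of which are double bonds.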

Figure 5.1: Pyruvic acid

5.2 Relational databases

Information about molecular structures can become rather large. Some of the databases we have used were several gigabytes in size. As was mentioned earlier, the input SDFiles used for data import are just simple flat files with a text description of the contained molecules. These can easily be used for sequential processing, but they do not perform well when we are looking for molecules based on their structural information. Rather than building some kind of indexing system over the SDF files or using sequential searching, we decided to store molecule information in an existing database server. This approach makes it possible to store further meta information about molecules as well as to use the built-in indexing to improve search speed. During the development of IMF, we tested several freely available database servers (more about this can be found in the Implementation part of this work). In the end we decided to use the MySQL Community Server; however, with a few minor modifications it is possible to use our application with almost any database server that has a JDBC driver available.

5.3 SMILES definitions

SMILES² is a format that represents a molecular structure in the form of a simple string. For example, the SMILES definition for the pyruvic acid mentioned earlier is: CC(C(O)=O)=O. In IMF it is possible to load a SMILES definition of a molecule and run an isomorphism search against the molecules stored in the database. The advantages of the SMILES format are the small size of the resulting string representing a molecule and the wide availability of SMILES descriptions for many molecules. A molecule loaded from SMILES is given unambiguously, but several different SMILES strings can represent one molecule. Thus it is not possible to say that two molecules defined by different SMILES strings are not isomorphic. The SMILES representation depends on the indices in the graph representation of the molecule from which the SMILES string was generated. Attempts exist to introduce canonical indexing for SMILES, but currently there are several different standards. For more information about SMILES see [9].

² Simplified molecular input line entry specification

Part IV

Implementation

Chapter 6 Description of IMF Application

In this chapter, I will describe the IMF application. IMF is the result of an effort to use the theoretical background to implement a usable application that allows storing molecule information in an SQL database and then performing isomorphism searches in the available data.

Figure 6.1: IMF – graphical user interface

IMF is written in Java – the source code of the application can be found on the attached DVD. We decided to use this widespread, object-oriented language on the assumption that our implementation should be easily understood by the majority of potentially interested readers. Another advantage of using Java is its portability to a wide range of platforms. The development itself was done simultaneously on both the Linux and Windows XP operating systems; the code base was maintained on an SVN server. IMF installation and setup instructions can be found on the attached DVD.

6.1 Basic program description

IMF offers basic functionality to work with a database of molecules, view their basic characteristics and search for isomorphic molecules within the database. This functionality is available both through the command line interface (CLI), which can easily be incorporated into scripts, and through the graphical user interface (GUI), which is designed for interactive use. This chapter focuses on the description of the GUI; for a description of the CLI please see [10]. For better clarity, the functionality of the GUI is divided into five different areas and is available to the user in the form of a tabbed pane¹. The following is the list of tabs and a brief description of their function.

• Find isomorphism: find all isomorphic molecules for given input molecule.

• Database setup: setup of connection information for database that is used for searches

• Export / Import: import of all data from SDFile into database and possibility to select one or more molecules and export them into a SDFile

• Database browser: browse the database and display information about molecules

• Backtracking visualization: visualization of the run of backtracking algorithm

6.2 SDFile import

The first step after the setup of IMF and the MySQL server (see the attached DVD for detailed instructions) is usually to load molecule data into the SQL database². To load data from an SDFile or Molfile, select the Export / Import tab, choose the file to import and fill in the additional information about the new data source (see Figure 6.2). Data source is a term we use to identify a data file that was imported into the IMF database together with additional information about this file – each molecule in the database holds information about its source. In the Database setup tab it is possible to choose which data sources will be used for isomorphism search and other molecule operations (see Figure 6.3).

6.3 Database browser

Figure 6.2: Import of new data source

Figure 6.3: Selection of data sources

¹ A tab is a navigational widget for switching between sets of controls or documents [2].
² The isomorphism search will work without loading data into the database; however, additional functionality that uses database storage will not be available.

Once the data are loaded in the database, the database browser (see Figure 6.4) can be used to display a model of the molecules in the database as well as information about their properties. The left part of the window contains the integrated Jmol visualisation plug-in, which is used to display a model of the current molecule. This model can be rotated and positioned using the mouse wheel and buttons. Below the model there is a simple navigation panel that can be used to go through the database or to jump to a molecule with a given ID. Summary information about the displayed molecule is shown in the right part of the window.

• Name: Name of the molecule, extracted from the source SDFile.

• ID: ID of the molecule in the IMF database

• Fingerprint and Bond Fingerprint: These are characteristic strings of the molecule generated by IMF (for more information see Section 9.1) – two molecules can be isomorphic only if both of these characteristics are the same for them.

• # of potential isomorphisms: number of molecules in currently selected data sources that can be potentially isomorphic with current molecule

• SMILES: SMILES string for current molecule

• Molecule source information: Description of source data file for given molecule

Figure 6.4: Database browser

Description of available buttons

• Find isomorphism: Switches to the Find isomorphism tab and sets the source molecule ID to the ID of the current molecule.

• Add to export list: Adds the molecule to the Export list (see below)

• First / Prev / Next / Last: Buttons that allow to jump to the first / previous / next / last molecule in the database.

• Go: Jumps to the molecule with chosen ID.

• Add all potential isomorphisms to export list: Adds the molecule and all other molecules with the matching fingerprints and bond fingerprints to the export list.

• Prev / Next: Buttons that allow to jump to the next / previous molecule with the same fingerprint.


6.4 Export list

It might be useful to export molecules from the IMF database for further processing in another software tool. This is possible using the list of molecules to export on the Export / Import tab. Molecules are inserted into this list from the database browser. The molecules in the list can then be displayed and exported into a single SDFile.

Figure 6.5: Export list

6.5 Isomorphism search

The main goal of our application is to provide easy-to-use search of isomorphic molecules. This function is available in the first tab. For isomorphism search it is necessary to define the input molecule and the database over which the search will be performed.

Figure 6.6: Defining the source molecule of isomorphism search


There are three possible ways to define the input molecule: from an existing database by entering the molecule’s ID, Name or Original ID; from an SDFile by specifying the full path to the file and the index of the molecule that should be used; or by entering a SMILES description of the molecule. The database of potentially isomorphic molecules may be either the internal IMF database (it is possible to connect to a different database than the one the source molecule comes from) or an SDFile.

Figure 6.7: Selecting the input database

After the input parameters are specified, the isomorphism search may be performed. Since some of the available chemical databases contain molecule descriptions without hydrogen atoms, it is possible to perform the search without taking hydrogens into account (see Figure 6.8). Hydrogen atoms are always ignored if the source molecule is entered in the form of a SMILES description (for an explanation, see [8]). The search progress and the list of found isomorphic molecules are shown in the bottom part of the window. If the search was performed against the IMF database, each found molecule can be shown in the database browser.

Figure 6.8: Results of searching molecule isomorphic to "NSC 49" from the NCI database [7]


6.6 Backtracking visualization

Even though it is not a functional necessity, the GUI also contains a specialized visualization component that shows the run of the backtracking algorithm on a pair of potentially isomorphic molecules. It allows us to better demonstrate how the backtracking algorithm works, and we have also used this module to find ways to improve backtracking performance.

Figure 6.9: Backtracking visualization component

The component displays both molecules: the “source” one in the blue field and the “examined” one in the red field (Figure 6.9 shows two molecules from the NCI database, NSC 12 and NSC 2629). The source molecule is always the molecule currently displayed in the database browser. The blue field contains four navigational buttons for going through the database; changing the selected molecule also changes the molecule shown in the database browser. The red field also contains navigational buttons, but these only cycle through the molecules in the database that are potentially isomorphic with the source molecule. If there is no potentially isomorphic match, a random isomorphic molecule is generated from the source molecule by shuffling the indexing of its atoms in memory (the same can be done for a molecule that does have potential isomorphisms by pressing the RandIso button). The controls in the top toolbar have the following functions.


• Test Isomorphism: Starts the visualization animation.

• Reset colors: After the visualization run is done, both molecules are coloured either green (if they are isomorphic) or red (if they are not), see Figure 6.10. This button resets the colours of the displayed atoms back to their defaults.

• Orientation buttons: Change the orientation of the red and blue fields to horizontal / vertical (see Figure 6.10).

• Autoresize: If checked, molecules are automatically resized when the size of the window or of the visualization panels changes.

• Show Hydrogen atoms: Especially when displaying large molecules, clarity may be improved by hiding hydrogen atoms.

• Panels linked: If checked, molecules are resized / moved simultaneously in both the blue and the red field.

• Animation speed: Slider that controls the speed of the visualization.

Figure 6.10: Display of two isomorphic (left) and non-isomorphic (right) molecules

Chapter 7 Design Specification

7.1 IMF Architecture

Figure 7.1: IMF Architecture

IMF can be divided into several functional parts. Interaction with the user is provided either through the graphical user interface, which is more suitable for interactive work, or through command line parameters, which allow IMF to be included in batch data processing. The data processing part provides the core functionality, including isomorphism detection and the computation of various characteristics of the examined molecules. The data access layer provides classes for obtaining data from various sources as well as for data export. The main storage is the database; in addition, it is possible to work with SMILES definitions and SDFiles.


7.2 Molecule representation

Figure 7.2: Class diagram for classes Molecule, Atom and Bond

The following is a brief description of the classes used to represent a molecule.

• Atom – class representing an atom

• Bond – class representing a bond between two atoms

• Molecule – class representing a molecule and providing methods for manipulating the molecular structure and for obtaining molecular characteristics.

A detailed description of the most important methods can be found in [8]. The full description of all methods and parameters can be found in the JavaDoc documentation on the attached DVD.
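To make the relationship between the three classes more concrete, the following is a minimal sketch of such a representation. It is not the IMF source code: the class names SketchAtom, SketchBond and SketchMolecule and all of their members are hypothetical and were chosen only for illustration; the real classes are documented in the JavaDoc on the DVD.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified atom: only the element and its bonds are kept. */
class SketchAtom {
    final int atomicNumber;                           // e.g. 6 for carbon
    final List<SketchBond> bonds = new ArrayList<>();

    SketchAtom(int atomicNumber) {
        this.atomicNumber = atomicNumber;
    }

    /** Counts the bonds whose other end is not a hydrogen atom. */
    int nonHydrogenBondCount() {
        int count = 0;
        for (SketchBond bond : bonds) {
            if (bond.other(this).atomicNumber != 1) {
                count++;
            }
        }
        return count;
    }
}

/** Hypothetical bond between two atoms; order 1 = single, 2 = double, ... */
class SketchBond {
    final SketchAtom a;
    final SketchAtom b;
    final int order;

    SketchBond(SketchAtom a, SketchAtom b, int order) {
        this.a = a;
        this.b = b;
        this.order = order;
    }

    /** Returns the atom at the opposite end of the bond. */
    SketchAtom other(SketchAtom from) {
        return from == a ? b : a;
    }
}

/** Hypothetical molecule: a molecular graph with atoms as vertices and bonds as edges. */
class SketchMolecule {
    final List<SketchAtom> atoms = new ArrayList<>();
    final List<SketchBond> bonds = new ArrayList<>();

    SketchAtom addAtom(int atomicNumber) {
        SketchAtom atom = new SketchAtom(atomicNumber);
        atoms.add(atom);
        return atom;
    }

    void addBond(SketchAtom a, SketchAtom b, int order) {
        SketchBond bond = new SketchBond(a, b, order);
        bonds.add(bond);
        a.bonds.add(bond);   // each end keeps a reference to the bond
        b.bonds.add(bond);
    }
}
```

The order in which atoms are added is exactly the arbitrary indexing that makes two descriptions of the same molecule look different, which is why an isomorphism test is needed.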

Chapter 8 Database layer description

8.1 DB schema description

The schema used for the IMF database is very simple; Figure 8.1 contains the ER diagram. Since we wanted to preserve all the information obtained from the original SDF files, the majority of the information about a molecule is stored in the sdf_mol_file field of the molecule table. All the other fields of the molecule table contain additional metadata used either for isomorphism search or for molecule identification. The primary reason for using a database server was to avoid the sequential access required when loading molecule information from SDF files, rather than to decompose the molecular structure and store it in several rows of normalized data fields. In the early stages of development we used a slightly different representation, in which two separate fields stored the list of atoms and the list of bonds (further "normalization" to the level where an atom would be represented by a separate table would not make much sense), but we decided to store the structural data in the form of the SDF record. The original SDF information is still required for visualization, and parsing data stored in this format was not slower than the other means of storage. The source table stores information about the source data files that were used to import molecules into the database. A simple 1:n relation tells us which source a molecule belongs to.

Figure 8.1: ER diagram


Description of the data tables

Molecule

• idmolecule – primary key of the Molecule table, automatically generated number sequence

• name – name of the molecule; the value can be retrieved from some of the SDFiles

• smiles – SMILES representation of the molecule

• weight – molecular weight

• originalid – ID that the molecule had in the original database from which it was obtained

• fingerprint – several numerical characteristics of the molecule stored in the form of a varchar value (see Section 9.1 for more information)

• sdf_mol_file – compressed value of the SDFile description of the molecule

• bfingerprint – second set of numerical characteristics of the molecule (see Section 9.1 for more information)

• sourceid – foreign key to the source table

Source

• idsource – primary key of the Source table

• name – name of the data source

• description – further description of the source

• url – URL of the web site the source was obtained from

• inserted – date and time the source was inserted into the database

• filename – filename of the SDFile the source was imported from

Apart from defining the database schema, it is also necessary to think about indices that will improve database performance. When declaring indices, the typical way the database is used has to be considered. Since our database is mainly intended as a “permanent” store, the number of indices does not need to be limited in any way. The application itself does not use any UPDATE queries. Data are usually inserted from large data files (several thousand records at once) and the time of the insert operation is not the key performance factor; therefore, a slight increase in the time required to run INSERT statements caused by defining indices is an acceptable tradeoff. High performance of SELECT queries, on the other hand, is desired.


Indices defined on the molecule table

• the primary index is created on the idmolecule field

• idx_fingerprint is an index created on two fields, fingerprint and bfingerprint – it is useful because isomorphism testing against the database selects potentially isomorphic molecules based on these two fields

• idx_sourceid is an index created on the sourceid field – it is useful when only selected data sources are used

• idx_originalid is an index created on the originalid field – it is used when a source molecule is defined by this value

• idx_name is an index created on the name field – it is used when a source molecule is defined by this value
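The way these indices serve the candidate selection can be illustrated with a small sketch that only builds the text of such a query. The table and column names come from the schema above; the class CandidateQuerySketch, its build method and the exact shape of the query are assumptions made for illustration, since the SQL actually used by IMF is not reproduced in this text. The placeholders are meant to be bound through java.sql.PreparedStatement.

```java
/** Hypothetical helper: builds only the text of a candidate-selection query. */
class CandidateQuerySketch {

    /**
     * Returns a parameterised SELECT for molecules that share both fingerprints.
     * sourceCount is the number of active data sources; 0 means "use all sources".
     */
    static String build(int sourceCount) {
        // The equality tests on both fields match the two-column idx_fingerprint index.
        StringBuilder sql = new StringBuilder(
                "SELECT idmolecule, sdf_mol_file FROM molecule"
                + " WHERE fingerprint = ? AND bfingerprint = ?");
        if (sourceCount > 0) {
            // Restricting by source is where idx_sourceid can help.
            sql.append(" AND sourceid IN (");
            for (int i = 0; i < sourceCount; i++) {
                sql.append(i == 0 ? "?" : ", ?");
            }
            sql.append(")");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(0));
        System.out.println(build(2));
    }
}
```

Only the rows returned by such a query need to be parsed and passed to the comparatively expensive isomorphism test.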

8.2 MoleculeDB class

All functionality required for manipulating data stored in the SQL database is contained in the MoleculeDB class, and the methods necessary to maintain the connection to the database server are all within the DBConnection class. This isolation from the rest of the application should make it easy to modify IMF for use with a different database server.

Figure 8.2: MoleculeDB – class diagram


After an instance of the MoleculeDB class is created, the connection to the database server can be set up using the setupConnection method. Apart from the methods that perform data manipulation, there are two more “setup” methods:

• void useSources(int[] sourceids) – the sourceids parameter defines which data sources (based on the value of the sourceid field in the molecule table) will be used by all data retrieval methods

• void useAllSources() will reset the active sources list – all molecules in database will be used

There are two different groups of data manipulation methods available:

• Methods for import and export of data: FileToDB, SaveMolToDB, DBtoFile

• Methods to obtain a single molecule from the database. Selection can be based on information from the original database (LoadByName, LoadByOriginalID), on the ID in the IMF database (LoadFirst, LoadById, LoadByIdNext), or on the desired fingerprint (LoadByFingerPrintNext). The ...Prev variants of these methods work in a similar fashion. These methods are used to browse through the molecule database (the method names should be self-explanatory; for the Next/Prev methods, the ID of the “current” molecule is passed as a parameter). Each invocation returns an instance of the Molecule class, or null if there is no next/previous molecule available. If the MoleculeDB.useSources method was invoked, only molecules belonging to the specified sources are returned.

The SQL queries used in these methods use only a limited SQL syntax that should not be affected by differences between SQL servers. The only necessary modification would be the handling of AUTO_INCREMENT values, which are available on the MySQL server. Other servers might use a different approach for these "automatic" values (e.g. sequences on Oracle).

Chapter 9 Implementation of isomorphism search algorithms

9.1 Molecule fingerprints

Searching for two isomorphic molecules can be quite a time-consuming process (as mentioned in the Theoretical part, it has complexity in O(N!)). When we search a large database for molecules isomorphic to a given source molecule, one of the key performance factors is to avoid performing an isomorphism check wherever possible. Numerical characteristics that allow us to determine that two molecules cannot be isomorphic are called molecule fingerprints. The Molecule class has two methods that retrieve these characteristics.

• String getFingerprint() – returns the molecule fingerprint, a string representation of the types and numbers of atoms contained in the molecule as well as the numbers of bond types

• String getBondFingerprint() – returns the bond fingerprint, a string representation of the types of different bonds contained in the molecule

The reason for having two different methods is mostly historical. The getFingerprint() method was the first one implemented. Later, during test runs against a database containing approximately ten million molecules, it became apparent that a finer granularity would be beneficial. The two separate methods (and their respective database representations) were preserved to allow a comparison of their effectiveness.

Fingerprint

The fingerprint string has variable length, depending on the number and types of atoms in the molecule. It consists of several blocks describing the contained atoms, followed by the numbers of contained bond types. It has the following structure:

[{Atom symbol}{Number of Atoms in the molecule}], [{bond type}:{number of given type of bonds},]

Blocks describing numbers of atoms are always ordered by atomic number; blocks describing bond types are always ordered by bond valence. Because some of the available databases contain molecule definitions without hydrogen atoms, hydrogen atoms are omitted from the molecule fingerprint.
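A minimal sketch of how a string of this shape could be assembled is shown here. It is not the code of Molecule.getFingerprint(): the class FingerprintSketch, the small symbol table and the exact separators (atom blocks concatenated, one comma, then "type:count," blocks) are assumptions read off the structure above, and the bond counts are assumed to cover only bonds between non-hydrogen atoms.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical fingerprint builder; separators and symbol table are assumptions. */
class FingerprintSketch {

    // Tiny symbol table for the example; a real tool needs the whole periodic table.
    static String symbol(int atomicNumber) {
        switch (atomicNumber) {
            case 6:  return "C";
            case 7:  return "N";
            case 8:  return "O";
            case 16: return "S";
            default: return "X" + atomicNumber;   // placeholder for unlisted elements
        }
    }

    /**
     * atomCounts maps atomic number to number of atoms,
     * bondCounts maps bond order to number of bonds.
     * A TreeMap iterates in ascending key order, which gives the required ordering.
     */
    static String fingerprint(TreeMap<Integer, Integer> atomCounts,
                              TreeMap<Integer, Integer> bondCounts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, Integer> e : atomCounts.entrySet()) {
            if (e.getKey() == 1) {
                continue;                          // hydrogen atoms are omitted
            }
            sb.append(symbol(e.getKey())).append(e.getValue());
        }
        sb.append(',');
        for (Map.Entry<Integer, Integer> e : bondCounts.entrySet()) {
            sb.append(e.getKey()).append(':').append(e.getValue()).append(',');
        }
        return sb.toString();
    }
}
```

Because the string is built from sorted counts, it does not depend on the indexing of atoms, so two molecules with different strings cannot be isomorphic, while equal strings only mark the pair as a candidate.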


Bond fingerprint

The bond fingerprint is a variable-length string with a fixed structure. It consists of six numbers separated by spaces, in the following order:

{number of single bonds between two carbon atoms}
{number of double bonds between two carbon atoms}
{number of single bonds between an oxygen and a carbon atom}
{number of double bonds between an oxygen and a carbon atom}
{number of single bonds between a nitrogen and a carbon atom}
{number of bonds of any valence between a nitrogen and an oxygen atom}
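The six counters can be computed in a single pass over the bonds. The following sketch assumes that each bond is given as a triple {atomic number, atomic number, bond order}; this input format and the class BondFingerprintSketch are hypothetical and serve only to illustrate the structure described above, not the implementation of Molecule.getBondFingerprint().

```java
/** Hypothetical bond fingerprint builder; the input format is an assumption. */
class BondFingerprintSketch {

    /** Each bond is {atomicNumberA, atomicNumberB, order}. */
    static String bondFingerprint(int[][] bonds) {
        int[] n = new int[6];
        for (int[] bond : bonds) {
            int lo = Math.min(bond[0], bond[1]);   // smaller atomic number of the pair
            int hi = Math.max(bond[0], bond[1]);   // larger atomic number of the pair
            int order = bond[2];
            if (lo == 6 && hi == 6 && order == 1) {
                n[0]++;                            // single C-C
            } else if (lo == 6 && hi == 6 && order == 2) {
                n[1]++;                            // double C=C
            } else if (lo == 6 && hi == 8 && order == 1) {
                n[2]++;                            // single C-O
            } else if (lo == 6 && hi == 8 && order == 2) {
                n[3]++;                            // double C=O
            } else if (lo == 6 && hi == 7 && order == 1) {
                n[4]++;                            // single C-N
            } else if (lo == 7 && hi == 8) {
                n[5]++;                            // N-O of any valence
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(n[i]);
        }
        return sb.toString();
    }
}
```

Bonds that fall into none of the six categories, such as bonds to hydrogen, simply do not contribute to the string.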

9.2 Brute force approach

The brute force approach to isomorphism detection is not usable in practical applications due to its large time complexity; even on very small molecules the computation might last for hours. This algorithm was implemented for comparison purposes only and is not intended for real-life usage. It is implemented in the BruteForce method of the Molecule class:

    public boolean BruteForce(Molecule mol) {
        Permutation per = new Permutation(atoms.size());
        Molecule permuted;
        // Try every possible re-indexing of this molecule's atoms.
        while (per.hasNext()) {
            per.next();
            permuted = this.Permutate(per);
            // The molecules are isomorphic if atoms and bonds match under this permutation.
            if (mol.DBAtoms().equals(permuted.DBAtoms())
                    && mol.DBBonds().equals(permuted.DBBonds())) {
                return true;
            }
        }
        return false;
    }

The Permutation class is designed to iterate through all permutations of an array of int values 1 .. n, where n is the size of the array. The method Molecule.Permutate returns a new instance of the Molecule class with the indexes of atoms reordered according to the given permutation.
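The source of the Permutation class is not listed in this text. As an illustration of what such an iterator involves, the following sketch enumerates the permutations of 1 .. n in lexicographic order. The class PermutationSketch is hypothetical, and its next() method returns the permutation as an array, which may differ from the interface of the real class.

```java
import java.util.NoSuchElementException;

/** Hypothetical iterator over all permutations of 1..n in lexicographic order. */
class PermutationSketch {
    private final int[] current;   // permutation handed out by the next call to next()
    private boolean more;          // false once every permutation has been returned

    PermutationSketch(int n) {
        current = new int[n];
        for (int i = 0; i < n; i++) {
            current[i] = i + 1;    // start with the identity 1, 2, ..., n
        }
        more = true;
    }

    boolean hasNext() {
        return more;
    }

    /** Returns a copy of the current permutation and advances to its successor. */
    int[] next() {
        if (!more) {
            throw new NoSuchElementException();
        }
        int[] result = current.clone();
        advance();
        return result;
    }

    // Standard lexicographic successor; clears 'more' after the last permutation.
    private void advance() {
        int i = current.length - 2;
        while (i >= 0 && current[i] >= current[i + 1]) {
            i--;                                   // find the rightmost ascent
        }
        if (i < 0) {
            more = false;                          // the array is in descending order
            return;
        }
        int j = current.length - 1;
        while (current[j] <= current[i]) {
            j--;                                   // rightmost element larger than current[i]
        }
        swap(i, j);
        for (int l = i + 1, r = current.length - 1; l < r; l++, r--) {
            swap(l, r);                            // reverse the suffix
        }
    }

    private void swap(int a, int b) {
        int t = current[a];
        current[a] = current[b];
        current[b] = t;
    }
}
```

The iterator produces n! arrays, which is 3,628,800 for 10 atoms and more than 10^12 for 15 atoms; this growth is the reason the brute force method serves only as a baseline.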

9.3 Backtracking

The backtracking algorithm is the main method of isomorphism detection used in the IMF application. Based on our testing, it performs well on most kinds of tested molecules. Due to the design of this algorithm, its execution time is in direct proportion to the number of atoms that belong to the same equivalence class (see Section 3.4.3).


9.3.1 Backtracking algorithm

The backtracking algorithm is based on a simultaneous depth-first search [3] of the pair of tested graphs. As long as the algorithm succeeds in visiting atoms with equal valuation in both graphs, it continues further “down”. As soon as none of the possible neighbours of the currently visited atoms can have the same valuation, the algorithm “steps back” to the previously visited atom and tries a different permutation of bonds, which defines a possible path to neighbouring atoms.

When backtracking starts, it is already known that the numbers and types of atoms in both molecules are the same (from the fingerprints and bond fingerprints). The first step of the backtracking algorithm is therefore the selection of an appropriate "starting" atom; it is advantageous to choose the atom that has the fewest possible images in the "matched" molecule. The following code fragment shows how the starting atom is chosen:

    // Element that occurs least often in the molecule.
    minimal_atom = fingerprint.getMinimalAtom();
    // "Ideal" number of non-hydrogen bonds for the starting atom.
    nonH_bond_count = getSmallestBondTypeGroup(minimal_atom);
    atomtocompare = atoms.firstElement();
    for (int i = 0; i < atoms.size(); i++) {
        if (atoms.elementAt(i).getAtomicNumber() == minimal_atom) {
            atomtocompare = atoms.elementAt(i);
            if (atomtocompare.getNonHBondCount() == nonH_bond_count) {
                break;
            }
        }
    }

First, the getMinimalAtom() method is called to discover which kind of atom is the least numerous in the given molecule. Since several atoms of this kind may still differ in the number and types of their bonds, the second line gives us the "ideal" number of bonds that the starting atom should have (e.g. if we have a molecule with three oxygen atoms, of which one is bonded to a carbon by a double bond and the other two are each bonded to two different atoms by single bonds, then the starting atom should be the first oxygen).

After the starting atom is selected, the backtracking search is performed by calling the Atom.btEqualsTo method (which then recursively calls itself to perform the depth-first search) against all atoms that could possibly be equal to the starting atom. If the given molecules are isomorphic, then at least one of these runs returns true. To get a better idea of how the backtracking traverses the molecular structure, it is possible to run IMF and test the isomorphism search in the Backtracking visualization tab.

Due to its length, I do not include the source code of the main backtracking detection routine Atom.btEqualsTo in the printed version of this work. It is available in the Atom.java source file on the attached DVD. A formalized description in pseudocode is given in the Theoretical part of this work.
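For readers without access to the DVD, the general idea of backtracking can be shown on a much simpler matcher. The sketch below is not Atom.btEqualsTo and does not follow bonds depth-first or use equivalence classes; it is a generic backtracking search that assigns atoms of one molecule to atoms of the other in index order and steps back as soon as an assignment contradicts an element or a bond. The class BacktrackingSketch and the array-based input format are assumptions made for illustration.

```java
/** Hypothetical, simplified backtracking matcher for two small molecular graphs. */
class BacktrackingSketch {

    /**
     * elemA / elemB hold the atomic number of each atom;
     * bondA / bondB are symmetric matrices of bond orders (0 = no bond).
     */
    static boolean isomorphic(int[] elemA, int[][] bondA, int[] elemB, int[][] bondB) {
        if (elemA.length != elemB.length) {
            return false;
        }
        int n = elemA.length;
        int[] image = new int[n];          // image[i] = atom of B assigned to atom i of A
        boolean[] used = new boolean[n];   // atoms of B that are already assigned
        return extend(0, n, elemA, bondA, elemB, bondB, image, used);
    }

    private static boolean extend(int i, int n, int[] elemA, int[][] bondA,
                                  int[] elemB, int[][] bondB,
                                  int[] image, boolean[] used) {
        if (i == n) {
            return true;                   // every atom was mapped without contradiction
        }
        for (int c = 0; c < n; c++) {
            if (used[c] || elemA[i] != elemB[c]) {
                continue;                  // candidate taken or of a different element
            }
            boolean consistent = true;
            for (int k = 0; k < i && consistent; k++) {
                // Bonds to all previously mapped atoms must have the same order.
                consistent = bondA[i][k] == bondB[c][image[k]];
            }
            if (!consistent) {
                continue;
            }
            image[i] = c;
            used[c] = true;
            if (extend(i + 1, n, elemA, bondA, elemB, bondB, image, used)) {
                return true;
            }
            used[c] = false;               // step back and try the next candidate
        }
        return false;
    }
}
```

For example, ethanol without hydrogens (C-C-O) matches the same chain indexed as O-C-C, but not dimethyl ether (C-O-C), although both contain two carbon atoms, one oxygen atom and two single bonds. Choosing a rare starting atom, as described above, serves the same purpose as the element test in the loop: it keeps the number of candidates tried at each step small.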

Chapter 10 Implementation of user interface

10.1 GUI implementation

The graphical user interface of IMF is based on the Java Swing [4] libraries. The main window of the application is described in the MainForm class. This class is partly generated automatically by the NetBeans IDE; the code that was added is mainly in the bodies of the event listener methods that define GUI behaviour based on user interactions. Additional classes handling some aspects of the user interface had to be defined:

• GUICheckIsoDB – an isomorphism search may be time consuming; in order not to “freeze” the user interface while the computation is performed, it is desirable to run it in a separate Java thread. This class is a descendant of the Thread class and controls the isomorphism search when it is performed against the database.

• GUICheckIsoSDF – class similar to the previous one, used for isomorphism search in an SDF file.

• GUILoadFile – similarly to the isomorphism search, loading a large source SDFile may also be a very time-consuming operation (during the tests of IMF, the loaded data files were often several gigabytes in size). This class implements a thread to handle the imports.

• JmolPanel – wrapper class for easier integration of the Jmol visualization component (for a description of Jmol see Section 4.3.1)

• MoleculePanel – customized visualization component that displays the 2D molecule structure and visualizes the run of the backtracking algorithm.

• SDFFileFilter – simple class providing the file opening/saving dialog with a description for SDFiles.
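The threading pattern used by the GUICheckIso... and GUILoadFile classes can be sketched as follows. The class SearchThreadSketch, its fields and the placeholder test inside the loop are hypothetical; the sketch only shows a Thread subclass that does long-running work and exposes its progress, not the actual IMF classes.

```java
/** Hypothetical worker that runs a long search outside the GUI thread. */
class SearchThreadSketch extends Thread {
    private final int[] candidates;   // stand-in for the molecules to be tested
    private volatile int progress;    // candidates processed so far, readable by the GUI
    private volatile int matches;     // final result, valid once the thread has finished

    SearchThreadSketch(int[] candidates) {
        this.candidates = candidates.clone();
    }

    @Override
    public void run() {
        int found = 0;
        for (int candidate : candidates) {
            if (candidate % 2 == 0) {  // placeholder for a real isomorphism test
                found++;
            }
            progress++;                // only this thread writes, so ++ is safe here
        }
        matches = found;
    }

    int getProgress() {
        return progress;
    }

    int getMatches() {
        return matches;
    }

    public static void main(String[] args) throws InterruptedException {
        SearchThreadSketch worker = new SearchThreadSketch(new int[] {1, 2, 3, 4});
        worker.start();                // returns immediately; run() executes in the new thread
        worker.join();                 // a GUI would not block here, it would poll or be notified
        System.out.println(worker.getMatches() + " matches out of " + worker.getProgress());
    }
}
```

In a Swing application, any update of visible components from such a thread should be handed back to the event dispatch thread, for example with SwingUtilities.invokeLater.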

10.2 MoleculePanel

The MoleculePanel class extends the Swing JPanel and overrides some of its methods to provide the user of IMF with a 2D representation of the molecular graph.


Figure 10.1: Class diagram – MoleculePanel

MoleculePanel methods

The following list describes some of the methods of the MoleculePanel class. The full list of methods can be found in the JavaDoc documentation on the attached DVD.

• setMolecule(Molecule mol) – Assigns the molecule that should be displayed.

• fillPanel() – Changes the size of the molecule so that it fills the whole display area.

• setSpreadXY(int newspreadx, int newspready) – Sets the distance between atoms.

• rotate(int angle_diff) – Rotates the displayed molecule.

• moveMolecule(int x, int y) – Moves the displayed molecule to the specified location.

• void centerViewToAtom(Atom a) – Centers the displayed image on the given atom.

• isVisible(Atom a) – Method used to find out whether a given atom is in the visible portion of the molecule.

• setLinkedPanel(MoleculePanel linkedpanel) – Sets up a link between two MoleculePanel instances; user actions are then synchronized in both panels.

10.3 JmolPanel

Jmol is a very powerful visualisation tool for molecular structures. For easier use in IMF, the JmolPanel class extends the Swing JPanel component and hides calls to some of the Jmol methods [12], allowing easy integration into the GUI design using the Design view of NetBeans. The most important method from the GUI programmer’s point of view is void LoadSDF(String molfile), which passes the given String representation of a molfile to the Jmol viewer component.

Part V Conclusion

Chapter 11 Conclusion

During the work on this diploma thesis, I tried to gain appropriate knowledge in the field of computational chemistry and to use it during the implementation phase of this work. Concepts of molecular structure and its representation as molecular graphs, as well as an understanding of the various available molecular data formats (such as SDFile or SMILES), were a necessary foundation for fulfilling the goals of the work.

Specifically, the main goal of the work was to design and implement the software IMF (Isomorphic Molecule Finder), which provides functions for the detection of molecular isomorphism in large chemical databases. IMF offers an intuitive graphical user interface and makes it possible to collect and organize large amounts of chemical data from various sources. During the implementation of its isomorphism detection algorithms, various approaches were implemented, and the resulting set of isomorphism matching functions is optimized to perform fast searches. The included techniques (fingerprint generation, detection of equivalence classes) allow the program to work reasonably fast even on very large (gigabytes of data) input databases.

The developed software was tested on real-world data, and in all tested cases it was possible to find matching isomorphic molecules within three seconds in a database containing more than five million records (approximately 5 GB in size). The software implements all the requested features (with the required or better than required performance) and is ready to be deployed as a tool for finding isomorphic structures in databases used in bioinformatics projects of the ANF DATA company.

Further development might focus on a deeper analysis of data preprocessing during the import into the IMF database. It would be helpful to eliminate differences (e.g. in the representation of aromatic bonds) that arise when data from various sources are used. Another improvement would be to make maximum use of the metadata that might already be part of the input SDFiles. Optimization of the backtracking algorithm might be possible by increasing the accuracy of the detection of atom equivalence classes and probably by detecting larger equivalent substructures (e.g. benzene rings).

Bibliography

[1] Svobodová Vařeková, R.: Počítačová chemie [slides], 2005, [Online; last accessed 13-May-2008]. 1.2, 2.2, 3.4.3

[2] Wikipedia: Tab (GUI) – Wikipedia, The Free Encyclopedia, [Online; accessed 20-May-2008]. 1

[3] Wikipedia: Depth-first search – Wikipedia, The Free Encyclopedia, [Online; accessed 20-May-2008]. 9.3.1

[4] Sun Microsystems, Inc.: Creating a GUI with JFC/Swing, [Online; accessed 22-May-2008]. 10.1

[5] Wikipedia: Chemical element – Wikipedia, The Free Encyclopedia, 2008, [Online; accessed 13-May-2008]. 1.1

[6] Wikipedia: Atom – Wikipedia, The Free Encyclopedia, 2008, [Online; accessed 13-May-2008]. 1.1

[7] Kaiserová, L.: Structure databases, 2007, [Online; last accessed 13-May-2008]. 5, 6.8

[8] Slaninka, J.: Diploma thesis – Effective algorithms for searching of identical molecules and their application, 2008. 3.4.3, 5.1, 6.5, 7.2

[9] Wikipedia: Equivalence class – Wikipedia, The Free Encyclopedia, 2008, [Online; accessed 15-May-2008]. 3.4.3, 5.3

[10] MDL Information Systems, Inc.: CTFile Formats, 2007, [Online; accessed 15-May-2008]. 5, 6.1

[11] Hliněný, P.: Teorie Grafů (FI:MA010), 2007, [Online; accessed 12-May-2008]. 2.1

[12] Jmol: an open-source Java viewer for chemical structures in 3D, [Online; accessed 20-May-2008]. 4.3.1, 10.3


Appendix A Contents of attached DVD

The attached DVD contains these folders:

/doc – this thesis and additional documents
/imf – JAR distribution package and the database installation instructions
/imf_project – NetBeans project folder
/VMware – VMware image with preinstalled Linux and MySQL