Classification of Metabolic Reactions based on Physicochemical Properties and Search for Inhibitors

Submitted to the Faculties of Natural Sciences of the Friedrich-Alexander-Universität Erlangen-Nürnberg for the award of the doctoral degree

by Martin Johann Reitz from Forchheim

Approved as a dissertation by the Faculties of Natural Sciences of the Universität Erlangen-Nürnberg

Date of the oral examination: 11 May 2007
Chair of the examination committee: Prof. Dr. E. Bänsch
First reviewer: Prof. Dr. J. Gasteiger
Second reviewer: Prof. Dr. H. Sticht

Acknowledgements

I thank my doctoral supervisor, Prof. Dr. Johann Gasteiger, for entrusting me with this interesting topic and for his wide-ranging support and valuable suggestions, which made the success of this work possible.

I am particularly indebted to:

Mr. Alexander von Homeyer for the collaboration on our joint publication and the book chapter, and for his support with the structure superimpositions.

Dr. Oliver Sacher for the collaboration within the BFAM project and for his constant helpfulness in all matters concerning BioPath, Cactvs, and Cora.

I would also like to thank Dr. Thomas Kleinöder, Dr. Lothar Terfloth, and Dr. Yongquan Han for the stimulating scientific discussions and, together with Dr. Achim Herwig, Mr. Thomas Tröger, and Dr. Markus Sitzmann, for their energetic support with the Linux administration.

Thanks also go to Jan Griebsch, Arno Buchner, and Hanjo Täubig of the Lehrstuhl für Effiziente Algorithmen (Chair for Efficient Algorithms) at TU München for the collaboration within the BFAM project.

Furthermore, I thank all current and former members of the research group who are not mentioned here by name, as well as the secretaries, for their constant helpfulness, for providing a working infrastructure, and for the consistently pleasant working atmosphere.

My most heartfelt thanks go to my girlfriend Silvia and to my whole family, especially my parents, for their patience and support in every respect while I was writing this thesis.

I thank the Bundesministerium für Bildung und Forschung (BMBF, Federal Ministry of Education and Research) for funding this work through the BFAM project.

Contents

1 Introduction
1.1 Scientific Background
1.1.1 Connection Table
1.1.2 Atom-Atom Mapping
1.1.3 Reaction Center Marking
1.2 Exploiting the Data
References

2 Enabling the Exploration of Biochemical Pathways
2.1 General Introduction
References
Abstract
2.2 Introduction
2.3 The BioPath Database
2.3.1 The Data Model
2.3.2 Chemical Structures
2.3.3 Chemical Reactions
2.3.4 Enzymes
2.3.5 Augmenting the Contents
2.3.6 Data Input
2.3.7 Data Processing and Storing
2.4 The C@ROL Retrieval System
2.5 Searching in the BioPath Database
2.5.1 Searching in the Molecule Database
2.5.2 Name and Name Fragment Searching
2.5.3 Gross-formula Searches
2.5.4 Full Structure and Substructure Searches
2.5.5 Property Retrieval
2.5.6 3D-Substructure Searches
2.5.7 Searching in the Reaction Database
2.5.8 Searching with Chemical Structures
2.5.9 Searches on the Reaction Centre
2.5.10 Searching on Enzymes
2.5.11 Combined Searches
2.6 Conclusions
Acknowledgements
Table of Acronyms
References
2.7 Improvements on the BioPath Database
2.7.1 Data Cleanup & Improvement
2.7.2 C@ROL Interface
References

3 Query Generation to Search for Inhibitors of Enzymatic Reactions
3.1 General Introduction
References
Abstract
3.2 Introduction
3.3 Materials and Methods
3.4 Results and Discussion
3.5 Conclusions
Acknowledgement
References
3.6 Further Conclusions
References

4 Database Screening for Enzyme Inhibitors
4.1 Introduction
4.2 Preparation of data
4.3 Computation
4.4 The Fitness Function
4.5 Results
4.6 Conclusions
References

5 Classification of Metabolic Reactions Based on Physicochemical Descriptors: Investigations on ...
5.1 General Introduction
Abstract
5.2 Introduction
5.3 Materials and Methods
5.3.1 BioPath
5.3.2 Datasets
5.3.3 Choice of Descriptors
5.3.4 Kohonen Neural network
5.4 Results and Discussion
5.4.1 EC 3.b.c.d
5.4.2 EC 3.1.c.d
5.4.3 EC 3.2.c.d
5.4.4 EC 3.5.c.d
5.5 Conclusions
Acknowledgement
References
5.6 Further Conclusions
References

6 Conclusions & Outlook
References

7 Summary

8 Zusammenfassung (summary in German)

Publications

Curriculum Vitae

1 Introduction

Metabolic reactions are of great interest because these processes keep us alive. Imbalances or failures in these highly regulated reaction networks can lead to severe diseases. In biotechnology and the agricultural industry, it is likewise important to understand metabolism and to gain control over its regulation in order to increase the yield of desired products from living organisms. Metabolic reactions are very energy-intensive processes. For example, a human at rest consumes about 65 kilograms of ATP per day, nearly their own body weight, all of which has to be regenerated; an active human consumes considerably more [1].

Figure 1.1 Example of a metabolic reaction network. The rectangular boxes represent the catalyzing enzyme with EC number, oval boxes indicate connected pathways, and small circles stand for chemical compounds. Filled arrows indicate a direct molecular interaction while unfilled arrows indicate a connection to another pathway. The example was taken from the KEGG website [2].


This effective network of reactions is driven and regulated by enzymes. These proteins enable the reactions to proceed at the required rate in aqueous solution at body temperature. Understanding enzymes and how they work therefore helps us to understand metabolism, which may in turn lead to new drugs. An example of such a complex reaction network is given in Figure 1.1.

In the past, much effort has been spent on the investigation of metabolic reactions and pathways [3]. One of the first researchers to perform systematic and quantitative studies on metabolism by using a balance was the Italian physiologist Santorio Santorio, around the turn of the 17th century [4]. Then, in the 19th and 20th centuries, the basic concepts of metabolism, including the major metabolic pathways and enzymes, were worked out. Nowadays, a large amount of metabolic data is available, often stored in electronic databases [5]. Since the 1990s, the amount of biochemical and molecular biology data has increased enormously as a consequence of new methods in genomics, proteomics, and high-throughput technology. Starting with the sequencing of the Haemophilus influenzae genome in 1995 [6], a milestone in genomics was reached with the initial sequence of the human genome in 2001 [7]. Nowadays, genome sequencing is an established tool, and the genomes of commercially interesting plants and livestock are being sequenced or have already been sequenced. In parallel with the growth of genomic data, data on proteins are also growing rapidly and are collected and stored in databases. One of the most comprehensive databases in this field is the Protein Data Bank (PDB) [8], founded in 1971. As a consequence of the huge amounts of data now available, bioinformatics has emerged as an important discipline in the biosciences. The methods provided by bioinformatics help to understand the data and to generate new knowledge from them. Even the simulation of whole cells in silico is under development [9].

However, viewing metabolism from the standpoint of a biologist is only one side of the coin. The whole machinery of genes and proteins involved in metabolism is, in the end, only a tool optimized to enable the essential task of metabolism: the chemical reaction. Therefore, to understand metabolism completely, one also has to view it from a chemical standpoint. Here, modern technology has helped to accumulate a multitude of chemical data related to metabolism, such as reaction data or data on chemical compounds, which are often also available in computer-readable form. A quite comprehensive database in this field is the KEGG database [2]. As a counterpart to bioinformatics, the discipline of chemoinformatics arose on the chemistry side in order to analyze these data by computer methods. Therefore, to understand metabolism, which is the basis for developing useful drugs against metabolic disorders, one has to integrate the knowledge of both worlds: bioinformatics and chemoinformatics (Figure 1.2).

Figure 1.2 To gain a complete understanding of metabolism, both methods, bioinformatics and chemoinformatics, have to interact. While bioinformatics investigates the situation from the view of genes and proteins, chemoinformatics handles the data on the chemical compounds and reactions involved in metabolism.

In this work, a database of metabolic reactions, BioPath, was used as a starting point. This database is suited to act as a link between bioinformatics and chemoinformatics and was derived from a preceding project funded by the Bundesministerium für Bildung und Forschung (BMBF) [10]. In that project, the data from the well-known Boehringer Biochemical Pathways wall charts [11] were turned into computer-readable form by storing the compounds and reactions at the atomic level. This was done by marking the reaction center, the sites where the enzymes act on the chemical compounds, and by atom-atom mapping, which matches each atom of the substrate to the corresponding atom of the product. In the following section 1.1, some basic concepts on which this database relies are explained. For a detailed explanation, refer to Refs. [12] and [13].

1.1 Scientific Background

1.1.1 Connection Table

To allow the handling of chemical structures and reactions at the atomic level, the chemical compounds have to be represented in a computer-readable form. For this, a chemical structure can be considered as a graph consisting of edges, the bonds, and nodes, the atoms. The nodes (atoms) are labeled with atom symbols, and there can be multiple connections (bonds) between two nodes, reflecting double and triple bonds. Such a molecular graph can then be represented by a matrix in which the nodes and their connecting edges are described. Figure 1.3 shows such a matrix representation of a molecule. These matrices can be further simplified, e.g. by omitting the hydrogen atoms and zero values, and by eliminating redundancy.

Figure 1.3 Structure diagram of acetaldehyde (left) with the corresponding bond matrix (right). The numbers in the structure diagram indicate the atom index. In the bond matrix, each atom is annotated on the x and y axis by its index. The table shows the bond order between each pair of atoms; zero means no bond at all. This kind of representation contains redundancy, as every bond is encoded twice.
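
As a minimal sketch of the bond matrix described above, the following Python snippet builds the matrix for acetaldehyde. The atom numbering (1 and 2 for the carbon atoms, 3 for the oxygen atom, 4 to 7 for the hydrogen atoms) is an assumption made for illustration and need not match the indices used in Figure 1.3.

```python
# Bond matrix of acetaldehyde, CH3-CHO (illustrative atom numbering, assumed here).
atoms = ["C", "C", "O", "H", "H", "H", "H"]  # atom indices 1..7

# (atom i, atom j, bond order)
bonds = [(1, 2, 1), (2, 3, 2), (1, 4, 1), (1, 5, 1), (1, 6, 1), (2, 7, 1)]


def bond_matrix(n_atoms, bonds):
    """Return the symmetric n x n matrix of bond orders (0 = no bond)."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bonds:
        matrix[i - 1][j - 1] = order  # each bond is stored twice,
        matrix[j - 1][i - 1] = order  # which is the redundancy noted in the caption
    return matrix


matrix = bond_matrix(len(atoms), bonds)
for symbol, row in zip(atoms, matrix):
    print(symbol, row)
```

Only 12 of the 49 matrix entries are non-zero here, which illustrates why such matrices are usually simplified or replaced by a more compact encoding.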

Such matrix representations have a major drawback because the number of entries increases with the square of the number of atoms which make up the molecule. To avoid this problem, the representation of molecules by connection tables was developed and is nowadays the predominant method used for encoding chemical structures. In this representation, a molecule is considered as two tables. One table contains the atoms of the molecule (atom list) and the second table the bonds between the atoms (bond list). This is depicted in Figure 1.4.


Figure 1.4 Structure diagram of acetaldehyde (left) with its corresponding connection table (right), consisting of an atom list and a bond list. The atom indices are indicated in the structure diagram. In the atom list, each atom is assigned an atom label representing the atom type. In the bond list, the bonds between the atoms are represented with their bond order. With this kind of encoding, the number of entries increases linearly with the size of the molecule.
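
The connection table can be sketched in the same way. The snippet below stores acetaldehyde as an atom list and a bond list and compares the number of entries with that of the full bond matrix; the atom numbering and the helper `neighbours` are assumptions introduced for illustration only.

```python
# Connection table of acetaldehyde: an atom list plus a bond list.
# The atom numbering is an assumption for illustration.
atom_list = {1: "C", 2: "C", 3: "O", 4: "H", 5: "H", 6: "H", 7: "H"}
bond_list = [
    (1, 2, 1),  # C-C single bond
    (2, 3, 2),  # C=O double bond
    (1, 4, 1), (1, 5, 1), (1, 6, 1),  # methyl C-H bonds
    (2, 7, 1),  # aldehyde C-H bond
]


def neighbours(atom_index, bond_list):
    """Return {neighbour index: bond order} for one atom."""
    result = {}
    for i, j, order in bond_list:
        if i == atom_index:
            result[j] = order
        elif j == atom_index:
            result[i] = order
    return result


table_entries = len(atom_list) + len(bond_list)  # grows linearly with molecule size
matrix_entries = len(atom_list) ** 2             # grows quadratically
print(neighbours(2, bond_list))
print(table_entries, matrix_entries)
```

For this small molecule the connection table needs 13 entries where the full matrix needs 49, and the gap widens as molecules grow.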

Many file formats for storing chemical structures and reactions rely on connection tables, including the widely used MOL file format [14] introduced by MDL [15].

1.1.2 Atom-Atom Mapping

The concept of a connection table as presented in section 1.1.1 is not restricted to the encoding of single compounds; it can also be extended to represent reactions. A widely used file format for encoding reactions is the MDL RXN file format [14]. This file format is built on the MOL file format mentioned in the previous section and extended for the needs of reactions. One feature of this coding is the possibility to include atom-atom mapping numbers. These numbers can be assigned to a reaction in order to identify each atom on the reactant and product sides and to map them onto each other. Figure 1.5 shows the decarboxylation of pyruvate to acetaldehyde. Here, each atom in the reactant molecule is assigned a unique number, which is carried over to the product molecules. In this way, the fate of each atom can be traced through the reaction process.


Figure 1.5 The decarboxylation of pyruvate as catalyzed by the enzyme pyruvate decarboxylase. Each atom on the reactant side is assigned a unique number, shown here in brackets. This number is carried over to the product side, mapping each reactant atom onto the corresponding product atom.
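
A minimal sketch of atom-atom mapping for this reaction is given below. It covers heavy atoms only, and the mapping numbers are an assumption for illustration that need not coincide with those in Figure 1.5; the functions `check_mapping` and `fate` are hypothetical helpers, not part of any file format.

```python
# Atom-atom mapping for the decarboxylation of pyruvate (heavy atoms only).
# Keys are mapping numbers; values are (molecule, element). The numbering is
# an assumption for illustration.
reactants = {
    1: ("pyruvate", "C"),  # methyl carbon
    2: ("pyruvate", "C"),  # carbonyl carbon
    3: ("pyruvate", "O"),  # carbonyl oxygen
    4: ("pyruvate", "C"),  # carboxylate carbon
    5: ("pyruvate", "O"),  # carboxylate oxygen
    6: ("pyruvate", "O"),  # carboxylate oxygen
}
products = {
    1: ("acetaldehyde", "C"),
    2: ("acetaldehyde", "C"),
    3: ("acetaldehyde", "O"),
    4: ("carbon dioxide", "C"),
    5: ("carbon dioxide", "O"),
    6: ("carbon dioxide", "O"),
}


def check_mapping(reactants, products):
    """A mapping is consistent if both sides carry the same numbers and elements."""
    if reactants.keys() != products.keys():
        return False
    return all(reactants[n][1] == products[n][1] for n in reactants)


def fate(map_number):
    """Return the product molecule in which a reactant atom ends up."""
    return products[map_number][0]


print(check_mapping(reactants, products))
print(fate(4))
```

Following mapping number 4 shows that the carboxylate carbon of pyruvate leaves as carbon dioxide, while atoms 1 to 3 remain together in acetaldehyde.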

1.1.3 Reaction Center Marking

An important piece of information about a reaction is what happens during the reaction process or, more precisely, which bonds are broken, made, or altered. The bonds and atoms directly participating in the reaction are called the reaction center. Information on the reaction center is crucial for reaction searching, as shown in detail in section 2.3.3. In Figure 1.6, the decarboxylation of pyruvate is shown with the reaction center marked by red lines.

Figure 1.6 The decarboxylation of pyruvate as catalyzed by the enzyme pyruvate decarboxylase. Atom-atom mapping numbers are given in brackets. The bonds that participate in the reaction, and are therefore part of the reaction center, are crossed by a red line. A single-crossed bond stands for a change in the bond order; a double-crossed bond indicates the forming or breaking of that bond.
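
Once atoms are mapped, the bonds of the reaction center can be identified by comparing bond orders on both sides of the reaction. The following sketch shows one way this could be done; it is not the procedure used to build BioPath, and the mapping numbers, the explicit proton (number 7), and the function `reaction_center` are assumptions for illustration.

```python
# Deriving the reaction center from atom-mapped bond lists.
# Bonds are keyed by a pair of mapping numbers; the value is the bond order.
# Numbering and the explicit proton (7) are assumptions for illustration.
reactant_bonds = {
    (1, 2): 1,  # CH3-C
    (2, 3): 2,  # C=O
    (2, 4): 1,  # C-COO(-)
    (4, 5): 2,  # carboxylate C=O
    (4, 6): 1,  # carboxylate C-O(-)
}
product_bonds = {
    (1, 2): 1,
    (2, 3): 2,
    (2, 7): 1,  # new C-H bond in acetaldehyde
    (4, 5): 2,
    (4, 6): 2,  # second C=O of carbon dioxide
}


def reaction_center(before, after):
    """Classify every bond whose order differs between the two sides."""
    center = {"broken": [], "made": [], "changed": []}
    for bond in sorted(set(before) | set(after)):
        old, new = before.get(bond, 0), after.get(bond, 0)
        if old == new:
            continue
        if new == 0:
            center["broken"].append(bond)
        elif old == 0:
            center["made"].append(bond)
        else:
            center["changed"].append(bond)
    return center


center = reaction_center(reactant_bonds, product_bonds)
print(center)
```

In the notation of Figure 1.6, the entries under `broken` and `made` correspond to double-crossed bonds and those under `changed` to single-crossed bonds.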

1.2 Exploiting the Data

With a database at hand that stores molecules and reactions at atomic resolution, the task was to exploit this database and to extract new knowledge from it. In section 2, the variety of search possibilities on the chemical structure and reaction information of the BioPath database is shown. This was made possible by integrating the data into the C@ROL system provided by Molecular Networks GmbH [16]. The C@ROL interface not only allows searching the database by a variety of textual information, but also searching at the atomic level by integrating structure and substructure search methods for molecules and reactions that use the reaction center information. This was used extensively to extract diverse subsets of the BioPath database for the subsequent exploration.

In the remaining sections, two applications are presented in which the reaction center information was used to generate new knowledge from the BioPath data. In section 3, a hypothesis put forward by Linus Pauling [17,18] was investigated. He stated that an enzyme stabilizes the transition state of a reaction more strongly than its substrate. An analog of the transition state should therefore bind very tightly to the enzyme and act as a strong inhibitor. We have investigated this hypothesis, that many enzyme inhibitors are analogs of the transition state, for a number of enzymatic reactions. By using the information on the reaction center contained in the BioPath database, that is, the bonds broken and made during the reaction process, we were in a position to automatically generate reaction intermediates for those reactions where an intermediate exists. These intermediates can be considered energetically close to the transition state. We will show that these generated intermediates can be superimposed well on known transition state analog inhibitors of the corresponding enzymes by using a genetic algorithm. The genetic algorithm finds the maximum number of atoms that can be superimposed in 3D space. Conversely, such generated intermediates should then serve as queries to search compound databases for new inhibitors of enzymatic reactions, even if no information on the 3D structure of the enzyme is known. This is shown in section 4, where a large compound database is screened for inhibitors of the enzyme AMP deaminase.


Another application in which the information on the reaction center is used is presented in section 5. Here, this information was used to build a classification of enzymatic reactions and therefore also of the catalyzing enzymes. Traditionally, enzymes are classified using the EC code system [19], but this classification has some drawbacks: it is based on a variety of criteria such as reaction patterns, substrates, transferred groups, or acceptor groups, and the emphasis on these criteria shifts between the classes. The action of an enzyme is the catalysis of a reaction in which bonds are broken and made. For our classification we therefore concentrate on these events. We use the information on the reaction center to extract the bonds of the substrate that participate in the reaction process and to calculate physicochemical descriptors for each bond. The resulting vectors serve as input for a self-organizing neural network. The resulting classification is then compared to the existing EC nomenclature. This is presented for reactions of EC class 3, the hydrolases.
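
To illustrate how descriptor vectors can be fed into a self-organizing (Kohonen) network, a deliberately small sketch follows. It is not the implementation used in this work: the map is one-dimensional for brevity, the learning-rate and neighbourhood schedules are arbitrary choices, and the four two-component input vectors are made-up toy values rather than real bond descriptors.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Index of the neuron whose weight vector is closest to the input."""
    distances = [math.dist(w, vector) for w in weights]
    return distances.index(min(distances))


def train_som(data, n_neurons=4, epochs=50, seed=0):
    """Train a one-dimensional Kohonen map; returns the list of weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        fraction = 1.0 - epoch / epochs              # decays from 1 towards 0
        rate = 0.5 * fraction                        # learning rate
        radius = max(1.0, n_neurons / 2 * fraction)  # neighbourhood width
        for vector in data:
            winner = best_matching_unit(weights, vector)
            for index, w in enumerate(weights):
                # Gaussian neighbourhood on the 1-D neuron grid
                influence = math.exp(-((index - winner) ** 2) / (2 * radius ** 2))
                for k in range(dim):
                    w[k] += rate * influence * (vector[k] - w[k])
    return weights


# Toy input vectors (hypothetical values scaled to [0, 1], not real descriptors).
data = [[0.10, 0.20], [0.15, 0.10], [0.90, 0.80], [0.85, 0.95]]
weights = train_som(data)
for vector in data:
    print(vector, "-> neuron", best_matching_unit(weights, vector))
```

Inputs that end up on the same or on neighbouring neurons are treated as similar, which is the property that makes such a map usable for comparing reactions with an existing classification.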


References

[1] Rich, P. Chemiosmotic coupling: The cost of living. Nature 2003, 421, 583.

[2] Kanehisa, M.; Goto, S.; Hattori, M.; Aoki-Kinoshita, K. F.; Itoh, M.; Kawashima, S.; Katayama, T.; Araki, M.; Hirakawa, M. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006, 34, 354-357. http://www.genome.jp/kegg.

[3] Löffler, G.; Petrides, P. E. Biochemie und Pathobiochemie – 6. Auflage. Springer-Verlag: Berlin, Germany, 1998.

[4] Santorio Santorio. In Encyclopedia Britannica Online 2007. http://www.britannica.com/eb/article-9065653.

[5] von Homeyer, A.; Reitz, M. In Handbook of Chemoinformatics - From Data to Knowledge. Gasteiger, J., Ed.; Wiley-VCH: Weinheim, Germany, 2003, pp 756-789.

[6] Fleischmann, R. D. Whole-genome random Sequencing and Assembly of Haemophilus influenzae Rd. Science 1995, 269, 496-512.

[7] International Human Genome Sequencing Consortium. Initial Sequencing and Analysis of the Human Genome. Nature 2001, 409, 860–921.

[8] Berman, H. M.; Henrick, K.; Nakamura, H. Announcing the worldwide Protein Data Bank. Nature Structural Biology 2003, 10, 980. http://www.pdb.org.

[9] Tomita, M. Whole-cell simulation: a grand challenge of the 21st century. Trends Biotechnol. 2001, 19, 205-210.

[10] Bundesministerium für Bildung und Forschung (BMBF). Projects no. 031U112D and 031U212D.

[11] Michal, G. Biochemical Pathways Wall Chart; Boehringer Mannheim (now Roche): Mannheim, Germany, 1993. http://www.expasy.org/tools/pathways.

[12] Gasteiger, J.; Engel, T. Chemoinformatics – A Textbook; Wiley-VCH: Weinheim, Germany, 2003, pp 30-43, 173-175.


[13] Barnard, J. M. In Handbook of Chemoinformatics - From Data to Knowledge. Gasteiger, J., Ed.; Wiley-VCH: Weinheim, Germany, 2003, pp 27-50.

[14] MDL Information Systems. MDL Ctfile formats reference can be downloaded at http://www.mdl.com/solutions/white_papers/ctfile_formats.jsp.

[15] MDL Information Systems, Inc.: San Leandro, CA, USA. http://www.mdl.com.

[16] C@ROL, Erlangen: Molecular Networks GmbH, http://www.molecular-networks.com.

[17] Pauling, L. The Nature of Forces between large Molecules of biological Interest, Nature 1948, 161, 707-709.

[18] Zhang, X.; Houk, K. N. Why Enzymes Are Proficient Catalysts: Beyond the Pauling Paradigm. Acc. Chem. Res. 2005, 38, 379-385.

[19] Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the Reactions they Catalyse [internet]. http://www.chem.qmul.ac.uk/iubmb/enzyme/

2 Enabling the Exploration of Biochemical Pathways

2.1 General Introduction

In this chapter, a database of metabolic reactions, BioPath, is presented. This database was built within a BMBF-funded project [1] and served as the starting point for this work. The database incorporates all reactions of the well-known Biochemical Pathways wall charts [2] and, in addition, the supplementary information derived from the corresponding atlas [3]. Beyond the explicit information on the enzymes, pathway affiliation, links to other information resources, and other miscellaneous textual information, all reactions are coded in the form of connection tables in a stoichiometrically correct manner; this means that even single protons or water molecules are coded. Furthermore, all reactions are coded with atom-atom mapping numbers and with the reaction centers marked. This precision is absolutely necessary for the applications presented in this thesis. Encoding compounds and reactions at the molecular level makes them accessible to chemoinformatics methods. This chapter first presents the original publication as published in Organic and Biomolecular Chemistry in 2004. In this publication, explicit information is given on the technical background of the BioPath database and on information retrieval with the C@ROL [4] system. The numbering of the chapters, Figures, and Tables was adapted to this thesis. At the end of this chapter, a brief summary of recent improvements to the BioPath database is given.

References

[1] Bundesministerium für Bildung und Forschung (BMBF). Projects no. 031U112D and 031U212D.

[2] Michal, G. Biochemical Pathways Wall Chart; Boehringer Mannheim (now Roche): Mannheim, Germany, 1993. http://www.expasy.org/tools/pathways.

[3] Michal, G. Biochemical Pathways. An Atlas of Biochemistry and Molecular Biology. Spektrum Akademischer Verlag: Heidelberg, Germany, 1999.

[4] C@ROL, Erlangen: Molecular Networks GmbH, http://www.molecular-networks.com.


Enabling the Exploration of Biochemical Pathways

Martin Reitz 1, Oliver Sacher 2, Aleksey Tarkhov 2, Dietrich Trümbach 1, Johann Gasteiger *1,2

1 Computer-Chemie-Centrum and Institute of Organic Chemistry, University of Erlangen-Nuremberg, Naegelsbachstr. 25, 91052 Erlangen, Germany
2 Molecular Networks GmbH, Naegelsbachstr. 25, 91052 Erlangen, Germany

Reitz, M.; Sacher, O.; Tarkhov, A.; Trümbach, D.; Gasteiger, J. Org. Biomol. Chem. 2004, 2, 3226-3237.

Abstract

The Biochemical Pathways Wall Chart [1] has been converted into a molecule and reactions database. Major features of this database are that each molecule is represented by lists of all atoms and bonds (as connection tables), and that in the reactions the reaction centre, i.e. the atoms and bonds directly involved in the bond rearrangement process, is marked. The information in the database has been enriched by a set of diverse 3D structure conformations generated by the programs CORINA and ROTATE. The web-based structure and reaction retrieval system C@ROL provides a wide range of search methods to mine this rich database. The database is accessible at http://www2.chemie.uni-erlangen.de/services/biopath/index.html and http://www.mol-net.de/databases/biopath.html

2.2 Introduction

With the deciphering of the human genome, the blueprint of the human organism and its functions, interest has shifted towards the role of the gene product and the activity of the expressed proteins. In other words, the interest has shifted from genomics to proteomics. Increasingly, the attention is now focused on how some of these proteins, the enzymes, regulate the processes within the cell, how nutrients are metabolized and how energy is produced and transferred through metabolism: Metabolomics has entered the stage.


Over many decades, a large body of information has been accumulated detailing the chemical species that occur and are processed within the cell and how these chemical species are interconverted by series of chemical reactions. Much of this knowledge has been beautifully assembled by G. Michal and colleagues in the Biochemical Pathways Wall Chart distributed initially by Boehringer Mannheim and now by Roche [1]. The principal scientific questions have always concerned the functional, temporal and spatial interpretation of the pathway information. The task has always been to connect, and indeed correlate, the information from enzyme regulation with the elucidation of the intracellular biochemical reactions. To this end, both bioinformatics and chemoinformatics specialists are now in a position to collaborate. Impressive as the Biochemical Pathways Chart is, it has a number of drawbacks. Firstly, it has a high information density dictated by the complexity of known facts about metabolites and biochemical transformations. Secondly, it is difficult to locate specific compounds, particularly if they appear at several places on the map (independent of their temporal and spatial locations). This is particularly true for ubiquitous compounds such as ATP or pyruvate, which are produced and consumed in many chemical reactions. This raises some essential questions: in which reactions does ATP participate? Or similarly, which reactions involve acetyl coenzyme A and in what way is it involved? A critical inspection of the Biochemical Pathways Wall Chart shows the essence of the problem: relationships between many compounds through a large number of reactions have to be laid out in a two-dimensional plane (see Figure 2.1). This can only be achieved by large distortions and an awkward arrangement of reaction arrows, ostensibly for reasons of clarity.


Figure 2.1 View of a segment of the Biochemical Pathways Wall Chart [1]; from Michal, Biochemical Pathways 1999 © Elsevier GmbH, Spektrum Akademischer Verlag, Heidelberg.

Essentially, biochemical pathways form a high-dimensional space that interconnects many compounds by a multitude of reactions, and this space had to be projected into two dimensions in order to produce the Wall Chart. It is highly desirable to analyze this multifunctional nature of the space of biochemical pathways, a task that can be achieved by the structure and reaction search methods that have been developed for compound and reaction databases. In this way, the arsenal of chemoinformatics methods [2,3] can be brought in to analyze biochemical pathways. The task is then to store the structures and reactions of biochemical pathways in a structure and reaction database, in the way that is standard for chemical databases. Structures are stored in the form of connection tables, providing access to each atom and bond of a chemical structure. Reactions are stored not only by giving the structures of starting materials and products but also by specifying the reaction centre, that is, which atoms and bonds are directly involved in the reaction process. Only then can genuine reaction searching be performed by focusing on the changes in the arrangement of bonds in a chemical reaction.

2.3 The BioPath Database

2.3.1 The Data Model

Six years ago, at the outset of our work on biochemical pathways, no information on metabolic networks or biochemical pathways was available in the form required to employ the full power of structure and reaction search methods. In the meantime, KEGG, a database with a long history, also provides chemical structures in the form of connection tables [4], as does the collection of databases on the BioCyc web page [5a]. These databases store chemical structures not only by name, but also as connection tables, which are amenable to structure and substructure search methods. Recently, a database on metabolic reactions annotated for Escherichia coli was published that also contains a mapping of the atoms of the starting materials onto those of the products of a chemical reaction [5b]. Nevertheless, our original data model has a number of features that are not yet contained in all other metabolic or biochemical pathways databases. Thus, our BioPath database allows a depth of search on chemical structures and reactions not achievable with any other metabolic or biochemical pathways database. This is provided by a detailed data model for representing chemical structures, reactions, and enzymes.

2.3.2 Chemical Structures

Chemical structures are represented by connection tables, i.e., by lists of all atoms and all bonds, including those to hydrogen atoms. Stereochemistry at chiral centres and at double bonds is represented by stereochemical descriptors. Such a detailed representation is given to all small molecules, including starting materials, products, coenzymes, and regulators. Furthermore, the names of all molecular species including synonyms have been stored.


2.3.3 Chemical Reactions

Chemical reactions are specified by the starting materials and products of a reaction, and by the enzymes, coenzymes and regulators involved. Enzymes are characterized both by name and by their EC code number. Care was taken to ensure that each reaction equation was stoichiometrically balanced; even the involvement of a proton as a starting material or product was specified in a reaction equation. A unique feature of our handling of reactions is that the reaction centre, the bonds broken and made in a reaction, has been marked. Furthermore, all atoms in the starting materials and products have been mapped against each other. This feature is not available in any other metabolic or biochemical pathways database, yet this kind of information is essential for proper reaction searching. Suppose one is interested in all reactions that reduce a carbonyl group to an alcohol. If the only criteria are that the starting material contains a carbonyl group and the product an alcohol group, one would also obtain the reaction at the top of Figure 2.2 as a hit, because its starting material has a carbonyl group and its product an alcohol group. The actual reaction, however, is the phosphorylation of an alcohol, of D-glyceraldehyde to D-glyceraldehyde-3-phosphate. Only when one specifies, in addition, that the atoms of the carbonyl group must map onto the atoms of the alcohol group is this reaction no longer perceived as a hit. Then only the reaction at the bottom of Figure 2.2, the reduction of D-glyceraldehyde to glycerol, satisfies the query.

[Reaction scheme not reproduced: D-glyceraldehyde to D-glyceraldehyde-3-phosphate (top); D-glyceraldehyde to glycerol (bottom).]

Figure 2.2 Two reactions of glyceraldehyde, phosphorylation (top) and reduction (bottom).

Furthermore, the marking of the reaction centre also allows one to investigate intermediates and transition states of biochemical reactions. On this basis, the transition state hypothesis, formulated by Pauling [6] many decades ago, can be explored. This hypothesis emphasizes that the role of an enzyme lies primarily in strongly binding the transition state of a biochemical reaction in order to decrease the activation energy. Studies probing the transition state hypothesis were made by superimposing CORINA-generated 3D molecular models [7,8] of transition states and intermediates of enzyme-catalyzed reactions, calculated from the information contained in the BioPath database, onto the 3D structures of enzyme inhibitors. These studies are reported in a separate publication [9].

In addition, for each reaction, information is given on whether it belongs to a general pathway, or whether it occurs in animals, in higher plants and yeasts, or in prokarya (bacteria and archaea). Furthermore, it is indicated whether the reaction is reversible or irreversible, and whether it is a catabolic or an anabolic reaction. Moreover, the compartment where a reaction occurs is specified. If a reaction belongs to a certain group of reactions, such as the citrate cycle, this is also specified.


2.3.4 Enzymes

Enzymes are represented by their names and the EC code number; a link to information on this enzyme in the BRENDA enzyme database [10] is provided.

2.3.5 Augmenting the Contents

Representing the chemical structures in the Biochemical Pathways database at the level of detail of a connection table allows them to be processed by chemoinformatics software that provides additional information on chemical structures. Thus, all chemical structures have been processed by the automatic 3D structure generator CORINA [7,8], providing a 3D molecular model for each small molecule. CORINA generates a single low-energy conformation. In order to explore the conformational space more fully, an ensemble of conformations was generated by the program ROTATE [11,12]. Clearly, the conformational space of a chemical structure can be quite large. So as not to consider too many conformations, the number of conformations for each compound was limited by exploring only the three central bonds in a molecule. In this process, the generation of conformations was constrained in such a way as to obtain an ensemble of quite diverse conformations. Other data and properties that were generated by computational methods are molecular mass (weight), number of rotatable bonds, number of atoms, number of rings, and number of hydrogen bond donor and acceptor atoms.

2.3.6 Data Input

The chemical structures and reactions were input with ISIS/Draw, available from MDL Information Systems [13]. Additional information on chemical structures and reactions was input by an Attribute Editor, specifically developed for this purpose. The basis for extracting the necessary data was the Biochemical Pathways Atlas [14] that contains information quite parallel to the Wall Chart as it has been produced by the same author. However, the Atlas provides more detailed and more up-to-date information than the Biochemical Pathways Wall Chart. In particular, the detailed reaction schemes in the Atlas were essential for marking the reaction centres. In order to emphasize the correspondence between Wall Chart and Atlas, the location of each structure and reaction on the Wall Chart has been indicated by giving the grid square of the Wall Chart where the structure, reaction, or enzyme is located. Furthermore, the implementation at the Computer-Chemie-Centrum [18a] also provides a link to the ExPASy server [15] that contains a scanned image of the Wall Chart. The grid square specification, which is also maintained on the ExPASy server, allows direct access to that part of the scanned image where the structure, reaction, or enzyme is contained. Clearly, the input of this detailed information was quite labour-intensive and, in particular, the marking of the reaction centre required detailed chemical analysis. However, we believe that the quality of information that is now available was worth the effort.

2.3.7 Data Processing and Storing

The entire chemical information of structures is stored in MOL and ISIS Sketch files both generated with MDL ISIS/Draw. The reaction information is stored in RXNfiles also generated with MDL ISIS/Draw. The 2D coordinates of the molecules were set while drawing the structures (preferably in the Fischer projection) and stored in the MOLfiles. In order to also get 3D structures of each molecule the program CORINA was then called for each file. Besides the 2D and 3D structure information, stored in MOL and RXNfiles, the textual information (such as compound names, EC numbers, species, and compartments) is stored in so-called attribute files. These plain text files were generated by the program Attribute Editor (AttEd) based on the CACTVS system [16] (see Figure 2.3).

[Flow diagram not reproduced. Components shown: ISIS/Draw, AttEd, CORINA, ROTATE, Converter; file types MOL, RXN, GIF, TXT (attribute); database systems C@ROL, ISIS/Base, IBM DB2.]

Figure 2.3 Data Processing from the raw data into the structure and reaction databases.

The MOL, RXN, attribute, and GIF files are used to generate the various database versions of the Biochemical Pathways. Currently, the system is available for IBM DB2, as MDL RDF for ISIS/Base, and in the C@ROL format.

Unlike the other versions, the IBM DB2 version holds the entire textual information of BioPath but neither connection tables nor reaction centres. The textual content of the IBM DB2 version is accessed through standard SQL statements [17]. The structure information of molecules or reactions is retrieved by displaying single or assembled GIF images, which were generated during data input by using CorelDraw.

The BioPath database was also implemented under ISIS/Base, a proprietary database management and retrieval system of MDL Information Systems [13]. For generating an ISIS/Base version of BioPath, all RXNfiles were concatenated into one RDFile, and the additional information provided in the plain text files was integrated for each reaction and molecule. This task was done by a script based on the CACTVS system [16].

Finally, to take full advantage of the rich information input into the BioPath database, it was integrated into the C@ROL retrieval system. This version was generated from the MOL, RXN, and attribute files. As this retrieval system also supports 3D searches in the conformational space, the program ROTATE was run and each calculated conformation of the unique molecules was stored as a property in the C@ROL database. This version is made available to the scientific community via the internet URLs http://www2.chemie.uni-erlangen.de/services/biopath/index.html and http://www.mol-net.de/databases/biopath.html [18].

2.4 The C@ROL Retrieval System

In order to take full advantage of the rich and diverse information stored in the BioPath database, we had to develop our own structure and reaction retrieval system. The C@ROL system (Compound Access and Retrieval OnLine) [19] is briefly outlined here. It is a web-based retrieval system that allows searches either on chemical structure information or on chemical reaction information. Although only its use for searching in the BioPath database is detailed here, C@ROL is a general retrieval system for searching chemical structure or reaction information in web-based databases. Its use on a large structure database can also be freely explored [19]. Figures 2.4 and 2.5 show the main graphical user interface (GUI) for specifying a query in the structure mode and in the reaction mode, respectively. One switches between these two operation modes by clicking on the upper left-hand corner of the GUI. In both cases, the integrated applet of the JME molecule editor, developed by Peter Ertl at Novartis [20], allows the graphical specification of a query for a chemical structure or a reaction centre.


Figure 2.4 C@ROL GUI for searching in a structure database with the substructure of L-glutamate specified as a query.

Figure 2.4 shows this for the specification of the skeleton of L-glutamate for a stereochemical substructure search. If a substructure search is initiated, any unspecified bonds needed to bring an atom to its standard valence state are treated as open sites that may carry any atom. If a full structure search is specified, these open sites are filled with hydrogen atoms.

Figure 2.5 C@ROL GUI for searching in a reaction database. The conversion of a carbonyl group to an alcohol group is specified as a query.

Figure 2.5 shows the GUI for the specification of the reduction of a carbonyl group to an alcohol group. In each case, whether it is a structure or a reaction search, a variety of additional query specifications can be made. Different search criteria can be combined by logical operators (AND, OR, XOR). Space prohibits a list of all search possibilities. A user manual of the C@ROL system details the full array and potential of search methods provided [19]. Insight into some of the search methods incorporated into the C@ROL system can be obtained from the following examples of searches in the BioPath database as well as from a more detailed publication on the use of the BioPath database on enhancing our insight into biochemical pathways [21].

2.5 Searching in the BioPath Database

The BioPath database has been made accessible to the scientific community on the web [18]. With the following examples using the C@ROL retrieval system we want to assist the users in taking full advantage of the cornucopia of information contained in the BioPath database. A more detailed publication will show the application of the BioPath database to enhancing our insight into biochemical pathways [21]. An analysis of the metabolites of Escherichia coli contained in the EcoCyc database [22] has recently been published [23]. Here, we will first outline queries into the molecule database of the BioPath database and then make investigations into the reaction part of the BioPath database.

2.5.1 Searching in the Molecule Database

2.5.2 Name and Name Fragment Searching

The C@ROL system allows a variety of search queries on names and name fragments. Thus, switching the entry in the drop-down list to “name fragment” and typing “glycer” provides 26 hits covering such compounds as glycerone-P, D-glyceraldehyde, cytidyl-5’-diphosphate-1,2-diacylglycerol, and D-erythro-imidazol-glycerolphosphate (see Figure 2.6).
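A name fragment search of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `records`, its record numbers (apart from no. 92, which is mentioned in the text), and the function `name_fragment_search` are hypothetical and do not reflect how C@ROL stores or searches names.

```python
def name_fragment_search(records, fragment):
    """Return record numbers whose name or any synonym contains the fragment.

    The comparison is case-insensitive; records maps a record number to a
    list of names (primary name first, then synonyms).
    """
    needle = fragment.lower()
    return [
        number
        for number, names in records.items()
        if any(needle in name.lower() for name in names)
    ]


# Hypothetical excerpt; only record no. 92 is taken from the text.
records = {
    1: ["glycerone-P", "dihydroxyacetone phosphate"],
    2: ["D-glyceraldehyde"],
    3: ["L-lactate"],
    92: ["glycero-3-phosphoethanolamine"],
}

hits = name_fragment_search(records, "glycer")
print(hits)
```

Because synonyms are searched as well, a compound is found even when the fragment occurs only in one of its alternative names.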


Figure 2.6 Hit list obtained by inputting the name fragment “glycer” as a query.

By clicking on one of the record numbers, a C@ROL Detail Page will open giving the structure diagram of the compound and additional information such as molecular weight, elemental composition, in which organisms and pathways this compound occurs, on which grid spaces of the Biochemical Pathways Wall Chart the compound is contained, and which enzymes work on this compound. This is illustrated in Figure 2.7 with compound no. 92 of the hit list in Figure 2.6, glycero-3-phosphoethanolamine.

Figure 2.7 Detailed information obtained for the compound with record no. 92, glycero-3-phosphoethanolamine, from the hit list in Figure 2.6.

2.5.3 Gross-formula Searches

C@ROL also allows searches based on the atomic composition of molecules. A search with C3O3, which allows the presence of any other atoms, returns 10 hits. C3O3N0 gives nine hits, and C3O3N0S0 gives seven compounds that contain only hydrogen atoms besides the three carbon and three oxygen atoms. Figure 2.8 lists these compounds.
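The semantics of such a query, in which a specified count must match exactly (a count of 0 excludes the element) while unspecified elements are unrestricted, can be sketched as follows. The functions `parse_formula` and `gross_formula_search` and the five example compounds are assumptions made for illustration; they are neither the C@ROL code nor the actual BioPath hit list.

```python
import re


def parse_formula(formula):
    """Turn 'C3O3N0' into {'C': 3, 'O': 3, 'N': 0}; a missing count means 1."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def gross_formula_search(compounds, query):
    """Return names whose composition matches every element count in the query.

    Elements not mentioned in the query may occur in any number.
    """
    wanted = parse_formula(query)
    hits = []
    for name, formula in compounds.items():
        composition = parse_formula(formula)
        if all(composition.get(el, 0) == n for el, n in wanted.items()):
            hits.append(name)
    return hits


# Illustrative compounds (neutral forms), not the BioPath contents.
compounds = {
    "lactic acid": "C3H6O3",
    "glycerol": "C3H8O3",
    "pyruvic acid": "C3H4O3",
    "L-serine": "C3H7NO3",
    "3-mercaptopyruvic acid": "C3H4O3S",
}

for query in ("C3O3", "C3O3N0", "C3O3N0S0"):
    print(query, gross_formula_search(compounds, query))
```

Each added exclusion narrows the hit list, mirroring the progression from C3O3 to C3O3N0 to C3O3N0S0 described above.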

Figure 2.8 Hit list obtained by inputting “C3O3N0S0” in a gross-formula search.

The structures of these compounds can be visualized as 2D or as 3D structures. The 3D structures were obtained from the automatic 3D molecule structure generator CORINA [7,8]; for visualization the 3D molecule viewer JMol [24] integrated into the C@ROL system was used. Figure 2.9 shows the 3D structures of L- and of D-lactate, respectively. Rotation of the 3D molecular models allows one to obtain an excellent impression of the 3D structure of the corresponding molecule.


Figure 2.9 3D molecular models of L-lactate and D-lactate obtained from the hit list in Figure 2.8.

2.5.4 Full Structure and Substructure Searches

The most powerful searches that provide detailed insights into the chemical nature of the metabolome are certainly structure and substructure searches. The full structure search is particularly helpful for locating chemical compounds on the Wall Chart. After a chemical structure has been input with the molecule editor JME and a search has been initiated, a result is obtained that also gives the grid squares of the Wall Chart where this compound is located. Figure 2.10 shows the detailed information obtained through a full structure search on oxaloacetate. Particular emphasis is given here to the enzymes that transform oxaloacetate, starting with EC number 1.1.1.37, malate dehydrogenase. By clicking on the EC number of a particular enzyme listed here, the reaction catalyzed by this enzyme is shown. Thus, in effect, a switch from the molecule database to the reaction database can be performed. The detailed information also indicates the grid squares where oxaloacetate is contained on the Biochemical Pathways Wall Chart, and, by the same token, on the web site of the ExPASy server [15].

Figure 2.10 Detail Page obtained in a full structure search on oxaloacetate.

By clicking on these specifications of grid squares, the Computer-Chemie-Centrum implementation [18a] establishes a link to the ExPASy server where the Biochemical Pathways Wall Chart is contained in scanned form. Thus, a click on grid square F5 directly leads to that part of the Wall Chart where the reduction of oxaloacetate by malate dehydrogenase is embedded (Figure 2.11).


Figure 2.11 View of the grid square F5 of the Biochemical Pathways Wall Chart as contained on the ExPASy server [15]; from Michal, Biochemical Pathways 1999 © Elsevier GmbH, Spektrum Akademischer Verlag, Heidelberg.

2D substructure searches can be initiated in much the same way as full structure searches: the 2D substructure is drawn with JME, the button “Substructure” is activated, and the search is started by clicking on “Search” of Step 2 in the topmost bar. A search with the substructure indicated in Figure 2.12 provides 60 steroids as hits. The substructure of the query is highlighted in the structures that are obtained as hits.


Figure 2.12 Substructure specified with JME for a 2D substructure search (a) and one of the hits obtained in this query (b).

The same substructure additionally containing an OH group in position 3 gives 49 hits. Changing ring A of the steroid skeleton to an aromatic ring provides eight structures as hits.


2.5.5 Property Retrieval

A variety of properties of chemical structures are offered for searching in the BioPath database. For example, searching for compounds with four rotatable bonds provides 131 structures.

2.5.6 3D-Substructure Searches

The BioPath database contains 3D structures generated by the combined application of the 3D structure generator CORINA [7,8] (giving a single low-energy conformation) and ROTATE [11,12] (providing an ensemble of quite diverse conformations). On this basis, 3D substructure searches can be performed. Diethylstilbestrol, DES (see Figure 2.13a), is a synthetic estrogen that is no longer used as a contraceptive because of its potential carcinogenicity. In order to search for structures with biological properties similar to those of DES, the phenol substructure and the second oxygen atom were considered to be important for biological activity. The 3D structure of DES showed that the two oxygen atoms are at a distance of 11.92 Å. Accordingly, the 3D pharmacophore was specified as consisting of a benzene ring with an OH group and an additional oxygen atom at a distance of 11.92±1.5 Å from the oxygen atom of the OH group (see Figure 2.13b).
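The distance criterion of this pharmacophore can be sketched in Python as follows. The sketch covers only the geometric test; it assumes that the phenolic oxygen atom and the other oxygen atoms have already been identified by a substructure match, and the data layout, the function `matches_distance_constraint`, and all coordinates are invented for illustration.

```python
import math


def matches_distance_constraint(conformers, target=11.92, tolerance=1.5):
    """True if any conformer places another oxygen at target +/- tolerance (in Å)
    from the phenolic oxygen.

    Each conformer is a dict with the key 'phenol_O' (one xyz tuple) and the
    key 'other_O' (a list of xyz tuples).
    """
    for conformer in conformers:
        for other in conformer["other_O"]:
            if abs(math.dist(conformer["phenol_O"], other) - target) <= tolerance:
                return True
    return False


# Invented coordinates in Å: one molecule with two conformers, one with a single conformer.
extended_molecule = [
    {"phenol_O": (0.0, 0.0, 0.0), "other_O": [(7.5, 0.0, 0.0)]},   # folded conformer
    {"phenol_O": (0.0, 0.0, 0.0), "other_O": [(11.0, 0.0, 0.0)]},  # extended conformer
]
compact_molecule = [
    {"phenol_O": (0.0, 0.0, 0.0), "other_O": [(3.0, 4.0, 0.0)]},
]

print(matches_distance_constraint(extended_molecule))
print(matches_distance_constraint(compact_molecule))
```

Testing every conformer of the ensemble rather than a single 3D model is the reason why a flexible molecule can satisfy the query even if its lowest-energy conformation does not.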


Figure 2.13 Structure of diethylstilbestrol, DES, (a) and a 3D substructure search query derived from DES (b).

The 3D substructure search resulted in a hit list of 13 molecules. To further reduce the number of hits, the additional restriction was imposed that the hits should have only two hydrogen bond acceptor atoms, as DES has only two such atoms. This reduced the number of hits to two, estrone and estradiol (see Figure 2.14). Thus, this 3D search found the natural ligands of the estrogen receptor, which also binds the query structure, diethylstilbestrol.


Figure 2.14 Hits found with the query specified in Figure 2.13b and only having two hydrogen bond acceptor atoms.

2.5.7 Searching in the Reaction Database

2.5.8 Searching with Chemical Structures

When C@ROL is switched to reaction searching by choosing this feature in the upper left-hand corner of the GUI, the search window opens with the molecule editor already showing a reaction arrow (see Figure 2.5). Drawing a chemical structure on the left-hand side of this arrow allows one to search for all those reactions where this structure is a starting material. Figure 2.15 shows this for a query for all reactions starting from chorismate. This query provided 10 reactions, as shown in Figure 2.16. Six of these 10 reactions are superpathways, combining several single reactions.


Figure 2.15 Query for searching for all reactions that start with chorismate.

Figure 2.16 Results for the query in Figure 2.15, the reactions that chorismate undergoes.

2.5.9 Searches on the Reaction Centre

It has already been emphasized that the most important searches on chemical reactions have to address the actual event occurring at the molecular level, i.e., the bonds broken or made. In order to be able to perform such searches, we had to go through the laborious task of marking the reaction centre, i.e., marking the bonds broken and made in a reaction and indicating how the atoms in the starting materials are mapped onto the atoms of the products. This allows us to ask questions on chemical reactions that would otherwise have remained unanswered. In particular, questions on reaction types, i.e., on reaction instances having common features, can be asked. Thus, for example, specifying a C–C bond on the right-hand side of the arrow in the JME editor (Figure 2.17) provides all those reactions that form a C–C single bond. In the case of the BioPath database, 119 reactions were obtained as hits. Among them were such diverse reactions as those listed in Figure 2.18, encompassing reactions catalyzed by the enzymes prostaglandin synthase, choline kinase, or geranyl-trans-transferase, etc.
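How a marked reaction centre supports such queries can be sketched in Python. The sketch assumes that each reaction stores its broken and made bonds as pairs of atoms labelled with element symbol and atom-map number; this data layout, the helper names, and the three toy reactions are hypothetical and are not the BioPath or C@ROL representation.

```python
def bond(atom_a, atom_b):
    """An unordered bond between two mapped atoms such as ("C", 1)."""
    return frozenset((atom_a, atom_b))


def carbons_bonded_to(bonds, element):
    """Map numbers of carbon atoms that are bonded to `element` within `bonds`."""
    carbons = set()
    for b in bonds:
        first, second = tuple(b)
        for carbon, partner in ((first, second), (second, first)):
            if carbon[0] == "C" and partner[0] == element:
                carbons.add(carbon[1])
    return carbons


def forms_cc_bond(centre):
    """True if a C-C bond is among the bonds made."""
    return bool(carbons_bonded_to(centre["made"], "C"))


def converts_ch_to_co(centre):
    """True if one and the same carbon loses a C-H bond and gains a C-O bond."""
    return bool(carbons_bonded_to(centre["broken"], "H")
                & carbons_bonded_to(centre["made"], "O"))


# Toy reaction centres with invented atom-map numbers.
reactions = {
    "toy C-C coupling": {
        "made": {bond(("C", 1), ("C", 2))},
        "broken": {bond(("C", 1), ("H", 10))},
    },
    "toy hydroxylation": {
        "made": {bond(("C", 3), ("O", 20))},
        "broken": {bond(("C", 3), ("H", 11))},
    },
    "toy unrelated changes": {
        "made": {bond(("C", 5), ("O", 21))},
        "broken": {bond(("C", 4), ("H", 12))},
    },
}

cc_hits = [name for name, centre in reactions.items() if forms_cc_bond(centre)]
oxidation_hits = [name for name, centre in reactions.items() if converts_ch_to_co(centre)]
print(cc_hits)
print(oxidation_hits)
```

The third toy reaction breaks a C–H bond and makes a C–O bond at different carbon atoms, so it is rejected; without atom-map numbers it could not be told apart from the hydroxylation.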

Figure 2.17 Query for searching for all reactions that form a carbon-carbon bond.

Figure 2.18 Part of the list of the hits obtained with the query in Figure 2.17.

Furthermore, specific bond transformations can be queried, such as the conversion of a C–H bond into a C–O–H group (Figure 2.19). This provided 97 reactions, one of which is shown in Figure 2.20.

Figure 2.19 Query for searching for all reactions that involve the oxidation of a C–H bond to a C–OH bond.

Figure 2.20 One of the hits obtained in the query of Figure 2.19.

2.5.10 Searching on Enzymes

A variety of search methods for chemical reactions based on specifications of enzymes are offered. Enzymes can be searched by their name, by enzyme type through name fragments such as “oxidase”, or by partial or complete EC numbers such as 3.1.*.* or 3.1.3.3. Searching with the EC number 3.1.3.3 provides two hits, one of them being the conversion of 3-phosphoserine to L-serine by the enzyme phosphoserine-phosphatase (Figure 2.21).
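Matching partial EC numbers amounts to a field-by-field comparison in which an asterisk accepts any value. The following Python sketch shows one way to do this; the functions `ec_matches` and `search_by_ec` and the small enzyme table are assumptions for illustration and are not taken from C@ROL.

```python
def ec_matches(ec_number, pattern):
    """Compare the four EC fields one by one; '*' in the pattern matches any field."""
    ec_fields = ec_number.split(".")
    pattern_fields = pattern.split(".")
    if len(ec_fields) != 4 or len(pattern_fields) != 4:
        return False
    return all(p in ("*", e) for e, p in zip(ec_fields, pattern_fields))


def search_by_ec(enzymes, pattern):
    """Return the EC numbers in `enzymes` that match the pattern."""
    return [ec for ec in enzymes if ec_matches(ec, pattern)]


# Small illustrative table of EC numbers and enzyme names.
enzymes = {
    "1.1.1.37": "malate dehydrogenase",
    "3.1.3.3": "phosphoserine phosphatase",
    "3.1.3.11": "fructose-bisphosphatase",
    "3.5.4.6": "AMP deaminase",
    "5.3.1.1": "triose-phosphate isomerase",
}

print(search_by_ec(enzymes, "3.1.*.*"))
print(search_by_ec(enzymes, "3.1.3.3"))
```

Comparing whole fields rather than string prefixes avoids false matches such as 3.1.3.3 against 3.1.3.33.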


Figure 2.21 Reaction obtained when searching with enzyme EC code 3.1.3.3.

Incidentally, this example again emphasizes the difficulty of representing the biochemical pathways on the 2D Wall Chart. Switching to the ExPASy server by following the link obtained in this query gives the grid square shown in Figure 2.22. The starting material, the product and the name of the enzyme had to be highlighted in order to locate this reaction in this complicated scheme, because of the strong distortion of the reaction scheme in this 2D plot.

Figure 2.22 The reaction of Figure 2.21 as contained on the Boehringer Biochemical Pathways Wall Chart and on the ExPASy server; from Michal, Biochemical Pathways 1999 © Elsevier GmbH, Spektrum Akademischer Verlag, Heidelberg.

Thus, the multi-dimensional nature of biochemical pathways is reiterated and the advantages of searching biochemical structures and reactions in a database highlighted. Over and beyond containing links to the ExPASy server, the EC number also provides a link to the BRENDA database [10] where detailed information including kinetic data on the various enzymes can be obtained.

2.5.11 Combined Searches

C@ROL allows the combination of several search queries. As an example, the combination of a reaction search with a search for different organisms is given. L-tryptophan is an essential amino acid that cannot be synthesized by animals. Inputting L-tryptophan as the product of a reaction and selecting prokarya as organism provides a hit showing the synthesis of tryptophan from chorismate. If, however, animals are selected as organism, no hit is obtained, showing that this compound cannot be synthesized by animals, including humans.
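Conceptually, combining queries with the logical operators AND, OR, and XOR corresponds to set operations on the individual hit lists. The following Python sketch illustrates this; the function `combine` and all reaction identifiers are hypothetical, and the sketch makes no claim about how C@ROL evaluates combined queries internally.

```python
import operator

# Logical operators expressed as set operations on hit lists.
OPERATORS = {
    "AND": operator.and_,   # hits present in both lists
    "OR": operator.or_,     # hits present in either list
    "XOR": operator.xor,    # hits present in exactly one list
}


def combine(hits_a, hits_b, op):
    """Combine two sets of record identifiers with AND, OR, or XOR."""
    return OPERATORS[op](hits_a, hits_b)


# Hypothetical reaction identifiers.
forms_tryptophan = {101, 102}        # reactions with L-tryptophan as product
in_prokarya = {101, 102, 200}        # reactions annotated for prokarya
in_animals = {200, 300}              # reactions annotated for animals

print(combine(forms_tryptophan, in_prokarya, "AND"))
print(combine(forms_tryptophan, in_animals, "AND"))
```

With these invented hit lists the second combination is empty, corresponding to the observation above that no tryptophan-forming reaction is found when animals are selected as organism.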

2.6 Conclusions

Storing the metabolites and the chemical reactions that interconvert them in a database opens biochemical pathways for a detailed inspection. The C@ROL retrieval system has been specifically extended for searching in the biochemical pathways database to extract information on these all-important structures and reactions. We are now in a position to take advantage of the wealth of information stored in the BioPath database and contribute to the studies of metabolomics. Results of an investigation will be published in a subsequent paper.

Acknowledgements

We appreciate assistance in the construction of the BioPath database and the analysis of its contents through projects funded by the Bundesministerium fuer Bildung und Forschung (projects no. 08 C 5850 0, 08 C 5879, 031U112D, 031U212D, 031U112A, and 031U212A). We appreciate the initiation of the Biochemical Pathways project by Spektrum Akademischer Verlag and the collaboration with Prof. Guido Moerkotte and Dr. Carl-Christian Kanne, University of Mannheim, Germany, in establishing our data scheme. Dr. Wolf-Dietrich Ihlenfeldt, at that time at the CCC, provided important contributions to this project, most notably the CACTVS system. Discussions with Dr. Gerhard Michal were always stimulating. We are indebted to a number of students who carefully input the structures and reactions into the BioPath database. The BFAM project, initiated by Prof. Hans-Werner Mewes, allowed us to continue the work on the BioPath database.

Table of Acronyms

BioPath               http://www2.chemie.uni-erlangen.de/services/biopath/index.html
                      http://www.mol-net.de/databases/biopath.html
Biochemical Pathways  http://www.expasy.org/tools/pathways/
KEGG                  http://www.genome.jp/kegg/
BioCyc                http://www.biocyc.org/
CORINA                http://www2.chemie.uni-erlangen.de/software/corina/index.html
                      http://www.mol-net.de/software/corina/index.html
ROTATE                http://www.mol-net.de/software/rotate/index.html
C@ROL                 http://www.mol-net.de/software/carol/index.html
CACTVS                http://www2.chemie.uni-erlangen.de/software/cactvs/index.html
JME                   http://www.molinspiration.com/jme/index.html
JMol                  http://jmol.sourceforge.net/


References

[1] Biochemical Pathways Wall Chart, ed. G. Michal, Boehringer Mannheim, Germany; now Roche, also on the internet at: http://www.expasy.org/tools/pathways/

[2] Chemoinformatics – A Textbook, ed. J. Gasteiger, T. Engel, Wiley-VCH, Weinheim, Germany, 2003.

[3] Handbook of Chemoinformatics, ed. J. Gasteiger, Wiley-VCH, Weinheim, Germany, 2003, 4 volumes.

[4] S. Goto, Y. Okuno, M. Hattori, T. Nishioka, H. Kanekisa, Nucl. Acids Res., 2002, 30, 402-404. KEGG/LIGAND, Kyoto University, Japan, http://www.genome.ad.jp/kegg/

[5] BioCyc Database Collection, http://www.biocyc.org.

[5a] M. Arita, Proc. Natl. Acad. Sci USA, 2004, 101, 1543-1547. http://www.metabolome.jp

[6] L. Pauling, Chem. Eng. News, 1946, 24, 1375-1377.

[7] J. Sadowski, J. Gasteiger, G. Klebe, J. Chem. Inf. Comput. Sci., 1994, 34, 1000-1008.

[8] CORINA can be tested on the internet at http://www2.chemie.uni-erlangen.de/software/corina/free_struct.html and is available from Molecular Networks GmbH, Germany, [email protected], http://www.mol-net.de

[9] M. Reitz, J. Gasteiger, in preparation.

[10] BRENDA – The Comprehensive Enzyme Information System http://www.brenda.uni-koeln.de

[11] C.H. Schwab, in Handbook of Chemoinformatics - From Data to Knowledge, ed. J. Gasteiger, Wiley-VCH, Weinheim, Germany, 2003, p. 262-301.

[12] ROTATE is available from Molecular Networks GmbH, Germany, [email protected], http://www.mol-net.de

[13] MDL Information Systems, Inc., San Leandro, CA, USA, http://www.mdl.com


[14] Biochemical Pathways, Biochemistry Atlas, ed. G. Michal, Spektrum Akademischer Verlag, Heidelberg, Germany 1999.

[15] ExPASy Server, University of Geneva, Switzerland, http://www.expasy.org/tools/pathways/

[16] W. D. Ihlenfeldt, Y. Takahashi, H. Abe, S. Sasaki, J. Chem. Inf. Com. Sci., 1994, 34, 109-116.

[17] C.-C. Kanne, F. Schreiber, D. Trümbach, in Proceedings of the 7th International Symposium on Graph Drawing (GD'99) - Lecture Notes in Computer Science, ed. J. Kratochvil, Springer-Verlag, Berlin, 1999, vol. 1731, p. 418-419.

[18] BioPath database available on the web at

a) http://www2.chemie.uni-erlangen.de/services/biopath/index.html and

b) http://www.mol-net.de/databases/biopath.html

[19] C@ROL is available from Molecular Networks GmbH, Germany, [email protected], http://www.mol-net.de

[20] JME molecule editor, developed by P. Ertl, Novartis, available from Molinspiration Cheminformatics, http://www.molinspiration.com

[21] O. Sacher, J. Marusczyk, Y. Han, J. Gasteiger, in preparation.

[22] P.D. Karp, M. Riley, S.M. Paley, A. Pellegrini-Toole, Nucl. Acids Res., 2002, 30, 59-61.

[23] I. Nobeli, H. Ponstingl, E.B. Krissinel, J.M. Thornton, J. Mol. Biol., 2003, 334, 697-719.

[24] The free open source molecule viewer JMol is available at http://jmol.sourceforge.net

2.7 Improvements on the BioPath Database

2.7.1 Data Cleanup & Improvement

After the appearance of the publication shown in the previous section, the BioPath database underwent a revision process in which a number of improvements were made. First, the database was split into a part with metabolic reactions and a part with signal transduction reactions. The latter mainly covers the contents of part two of the Boehringer Biochemical Pathways Wall Charts. In the present version of the BioPath database, only the part with metabolic reactions is included and is accessible on the internet. The part with signal transduction reactions will be incorporated into a separate database in the future, as these data mainly involve macromolecules and therefore require other methods of handling than those used for small molecules. Additionally, all reactions and molecules of the database were incorporated into the version control system Subversion [1], and the index system was rebuilt. For the reactions part, the mass balance of all reactions was checked and corrected if necessary. The same was done for the reaction centre markings. In addition, the EC numbers of all enzymes were updated to match the latest version of the EC nomenclature recommendations. Moreover, the annotation of the reactions to pathways was extended, and a textual description was added for each reaction. Furthermore, more molecules involved in metabolism were added to the database. Thus, the database currently consists of 2,878 compounds and 1,545 metabolic reactions. For all compounds, stereocentres were marked by the R/S nomenclature.

2.7.2 C@ROL Interface

As another improvement to BioPath, the C@ROL interface was made more user-friendly and renamed BioPath.Explore. The search interface now consists of a basic and an advanced search mode, so that the user is not overwhelmed by all of the search parameters on first entering the query page. Furthermore, the depiction of the compounds and reactions was enhanced. First, compounds are now shown in Natta projection instead of Fischer projection, making it easy to recognize all carbon atoms in an alkyl or sugar structure. Second, standardized compound graphics are shown for all reactions, resulting in a more harmonious depiction. Screenshots demonstrating the new interface are shown in Figures 2.23, 2.24, 2.25, and 2.26.

Figure 2.23 Query page of BioPath.Explore in basic mode. Here, a search for the substructure of L-glutamate is started.

Figure 2.24 Hit list for the structure search for L-glutamate shown in Figure 2.23 in BioPath.Explore.

Figure 2.25 Detail view for L-glutamate in BioPath.Explore.

Figure 2.26 Reaction detail view in BioPath.Explore. Here, L-glutamate is produced from L-glutamine.

References

[1] Subversion. http://subversion.tigris.org.

3 Query Generation to Search for Inhibitors of Enzymatic Reactions

3.1 General Introduction

In this chapter, an application is presented in which the explicit reaction encoding in the form of connection tables contained in the BioPath database was used to automatically generate intermediates of enzymatic reactions. This can be done efficiently for reactions in which an intermediate exists, e.g., reactions in which an attack at an sp2-hybridized carbon atom occurs, with an addition step followed by an elimination step. Because the atoms and bonds involved are marked, this information can be used to construct the intermediate of the reaction. These intermediates were then compared to transition state analog inhibitors of the enzymes catalyzing the respective reactions by 3D superimposition using a genetic algorithm. This study was done in preparation for automatic searches for enzyme inhibitors in large compound databases that use the generated intermediate as a search template. These searches will be presented in section 4.

The search for inhibitors by using the intermediate of an enzymatic reaction was developed according to a hypothesis presented by Linus Pauling [1,2]. Pauling stated that enzymes stabilize the transition state of a reaction particularly well and that analogs of these transition states should therefore act as very potent enzyme inhibitors. Modeling the transition state requires sophisticated quantum mechanical calculations. As we were interested in a fast method suited for handling large datasets, we simplified the method by using the intermediate instead. It will be shown that this simplification is sufficient to deliver good results.

This chapter first presents the original publication as published in the Journal of Chemical Information and Modeling in 2006. At the end of the chapter, some brief further conclusions are given. The numbering of the chapters, Figures, and Tables was adapted to this thesis.


References

[1] Zhang, X.; Houk, K. N. Why Enzymes Are Proficient Catalysts: Beyond the Pauling Paradigm. Acc. Chem. Res. 2005, 38, 379-385.
[2] Pauling, L. The Nature of Forces between Large Molecules of Biological Interest. Nature 1948, 161, 707-709.


Query Generation to Search for Inhibitors of Enzymatic Reactions Martin Reitz, Alexander von Homeyer, Johann Gasteiger *

University of Erlangen-Nuernberg, Computer-Chemie-Centrum and Institute for Organic Chemistry, Erlangen, Germany

Reitz, M.; von Homeyer, A.; Gasteiger, J. J. Chem. Inf. Model. 2006, 46, 2333-2341.

Abstract A method for the generation of intermediates of enzyme catalyzed reactions is presented. These intermediates can be used as three-dimensional structural queries for searching for inhibitors of enzymatic reactions. The intermediates can be considered as being structurally quite close to transition state analogs. For this application, a database containing detailed chemical information on metabolic reactions is used. The likely three-dimensional structure of the intermediates of enzyme catalyzed reactions can be generated from the information in the database. For three reactions catalyzed by the enzymes AMP deaminase (EC-code 3.5.4.6), triose phosphate isomerase (EC-code 5.3.1.1), and arginase II (EC-code 3.5.3.1) we show how a 3D model of these intermediates can be superimposed onto known inhibitors of these enzymes by a program that uses a genetic algorithm. For this we are testing different methods for the superimposition: using information on the enzymatic binding site, using physicochemical properties calculated from the molecular structure, or using no additional information in the superimposition process. We show that, regarding the 3D structure, these inhibitors are most similar to the corresponding intermediates.


3.2 Introduction

In recent years, much research has been put into elucidating how genes control biochemical reactions, the endogenous metabolism. Genes express proteins which are often enzymes that catalyze these biochemical reactions. This catalysis is highly efficient, leading to rate enhancements of up to 10^20 compared with the uncatalyzed reactions [1]. These rate enhancements are caused by a variety of factors encompassing geometric, electronic, and bonding effects. Current studies point out that covalent bonding between the transition state and the enzyme might also be involved in explaining such proficient rate enhancements [2]. Clearly, an enzyme must bind the substrate of a reaction but, beyond that, it must bind the transition state of the reaction even more favorably, leading to a substantial lowering of the activation energy. This was pointed out quite some time ago by Linus Pauling [3,4], who stated that enzymes stabilize the transition states of biochemical reactions by binding them very tightly and thus lowering the energy barrier of the reaction. He further postulated that analogs of these transition states should act as potent inhibitors of enzymatic reactions. The inhibitor of an enzyme should be quite similar to the transition state of the reaction catalyzed by this enzyme in terms of geometric arrangement and of physicochemical effects. However, in contrast to the transition state, an inhibitor cannot undergo the bond breaking and making process observed in the enzymatic reaction of the natural substrate. Thus, the transition state analog occupies the catalytic site of the enzyme and blocks it from processing the natural substrate, leading to inhibition. Biochemical reactions catalyzed by enzymes form a network that is regulated by signaling pathways and, most importantly, by the expression of enzymes.
Many dysfunctions of metabolic pathways in humans and in other species result from an imbalance in these reaction networks, and it is therefore of high interest to interfere in the regulation of pathways. The inhibition of enzymes is thus an important tool in drug and agrochemical research [5]. An understanding of the structure of the transition state and of intermediates of an enzyme catalyzed reaction requires atomic resolution in the analysis of substrates and their reactions. Clearly, the structure of transition states and of reaction intermediates can be calculated by quantum mechanical methods of various degrees of sophistication. This often requires substantial computational effort. We, however, were interested in developing a fast method that can be applied to large datasets of molecules. This is where chemoinformatics comes in, in order to model the 3D structure of substrates and to analyze the physicochemical effects that bind small molecules to proteins and that cause bonds to break and new ones to form. In order to support this endeavor we have developed BioPath, a database of biochemical reactions that stores molecules and reactions at atomic resolution [6]. Specifically, molecules are stored as connection tables, i.e., as lists of all atoms and all bonds. Such a standard representation of chemical structures by connection tables allows the interfacing of automatic 3D-structure generators for obtaining 3D molecular models. The bond breaking and making events in the biochemical reactions are indicated by marking the reaction center and by mapping the atoms of the reactants onto those of the products. The marking of the reaction center plays a crucial role in the studies reported here as it allows the generation of intermediates of enzymatic reactions. This, in conjunction with the 3D modeling of all molecules, puts us in a position to explore how inhibitors of enzymes match in 3D space with the starting materials, intermediates, and products of enzyme catalyzed reactions. Based on this, the generation of intermediates from the information contained in the BioPath database provides a 3D structural query for searching for inhibitors of enzyme catalyzed reactions. We are testing this methodology with several enzymatic reactions for which inhibitors are known. This should provide a proof of concept for then using only information on the structure of a reaction intermediate to search for inhibitors in 3D structure databases.

3.3 Materials and Methods

General Outline. The structure of a transition state can be determined best by quantum mechanical calculations. The exact determination where the transition state lies on the reaction coordinate, i.e., which geometry, bond lengths, and energy the transition state has, requires quite sophisticated quantum mechanical calculations. To avoid determining the exact geometry and energy of a transition state by time-consuming calculations we simplify the problem by first investigating those reactions that proceed through a reaction intermediate. Such reactions are predominantly observed when the reaction occurs through an attack at a Csp2-atom involving first an addition and then an elimination step. When the energy of such a reaction intermediate is appreciably above the substrate, the structure of the transition state should be quite close to that of the reaction intermediate according to the Hammond postulate [7].

Figure 3.1 Energy diagram of an uncatalyzed reaction compared to an enzyme catalyzed reaction (ΔG‡u vs. ΔG‡e) with the corresponding transition states Tu and Te and reaction intermediates Iu and Ie.


Figure 3.1 shows the energy diagram of an uncatalyzed and of an enzyme catalyzed reaction proceeding through an intermediate. In this diagram, it is assumed that the binding of the substrate leads to an energy decrease, but that the energy decrease for the binding of the reaction intermediate, Ie, is much more pronounced, in accordance with the Pauling hypothesis [3,4]. Such intermediates of a reaction can be generated automatically if an appropriate data source is available. A suitable database for this task, the BioPath database, is presented in the next section. The general outline of the approach for generating reaction intermediates as transition state models and then searching for transition state analogs is presented in Figure 3.2. The various steps and the software involved are now presented in detail.

Figure 3.2 General outline of the process of comparing reaction intermediates with enzyme inhibitors indicating the different steps and the software programs used.

BioPath Database. The BioPath biochemical pathways database is a database of molecules involved in the endogenous metabolism and of the reactions interconverting them. The database was produced from the information contained on the famous wall chart distributed by Boehringer Mannheim, now Roche [8]. In order to make the wealth of data contained on the poster and the corresponding atlas [9] accessible to computational methods, the effort was made to input all information into a database. For this purpose, all structures were entered as connection tables, i.e., lists of all atoms and their bonds. Reactions were represented by their starting materials, products, and the cofactors involved, giving the full stoichiometry of the reaction including even protons. Furthermore, all atoms of the starting materials were mapped onto those of the products, indicating their correspondence by the numbers of their atoms, and all reaction sites where bonds are broken, made, or altered were marked. This latter feature makes our database unique among databases of metabolic pathways such as KEGG [10] or those on the BioCyc [11] webpage. Additionally, each reaction was enriched by supplementary information such as the enzyme name, the EC number, the pathway the reaction is part of, and the organism in which it occurs. The BioPath database presently consists of about 2,200 reactions and more than 1,500 structures. BioPath has been made accessible through the C@ROL [12] retrieval system on the web at: http://www2.chemie.uni-erlangen.de/services/biopath/index.html and http://www.mol-net.com/databases/biopath.html. Of eminent importance for the application reported here is that all reactions in BioPath have their reaction centers marked, i.e., the bonds broken and made in a reaction are indicated and the atoms of those bonds are mapped from the starting materials onto those in the products. This allows the automatic construction of reaction intermediates. Figure 3.3 illustrates this for the reaction catalyzed by AMP deaminase (EC 3.5.4.6) converting adenosine monophosphate (AMP), 1, into inosine monophosphate (IMP), 2. From the information on which bonds are broken in this reaction, the reaction intermediate, 3, can be generated.
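To make the role of the atom mapping concrete, the following minimal Python sketch shows one way such a reaction record could be represented. It is an illustration only and not the BioPath schema: the class names, the field names, and the `reaction_center` helper are assumptions, hydrogen atoms are omitted, and the reaction center is derived here from the atom mapping, whereas BioPath stores the marking explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    index: int        # position in the molecule's connection table
    element: str
    map_number: int   # same number on the corresponding reactant and product atom


@dataclass(frozen=True)
class Bond:
    begin: int        # index of the first atom
    end: int          # index of the second atom
    order: int


@dataclass
class Molecule:
    name: str
    atoms: list
    bonds: list


@dataclass
class Reaction:
    ec_number: str
    reactants: list
    products: list


def bonds_by_map(molecules):
    """Map each bond to its order, keyed by the map numbers of its two atoms."""
    result = {}
    for mol in molecules:
        to_map = {a.index: a.map_number for a in mol.atoms}
        for b in mol.bonds:
            result[frozenset((to_map[b.begin], to_map[b.end]))] = b.order
    return result


def reaction_center(rxn):
    """Return {mapped atom pair: (old order, new order)} for every changed bond."""
    before = bonds_by_map(rxn.reactants)
    after = bonds_by_map(rxn.products)
    return {
        key: (before.get(key, 0), after.get(key, 0))
        for key in before.keys() | after.keys()
        if before.get(key, 0) != after.get(key, 0)
    }


# Toy heavy-atom fragment of the AMP deaminase reaction (EC 3.5.4.6).
# Hypothetical map numbers: 1 = C6, 2 = amino N, 3 = water O, 4 = ring N1.
amp = Molecule("AMP fragment",
               [Atom(0, "C", 1), Atom(1, "N", 2), Atom(2, "N", 4)],
               [Bond(0, 1, 1), Bond(0, 2, 2)])
water = Molecule("water", [Atom(0, "O", 3)], [])
imp = Molecule("IMP fragment",
               [Atom(0, "C", 1), Atom(1, "O", 3), Atom(2, "N", 4)],
               [Bond(0, 1, 2), Bond(0, 2, 1)])
ammonia = Molecule("ammonia", [Atom(0, "N", 2)], [])

rxn = Reaction("3.5.4.6", [amp, water], [imp, ammonia])
center = reaction_center(rxn)
```

In this toy fragment, `center` contains three entries: the C-N bond to the amino group is broken, a C=O bond is made, and the ring C=N bond is reduced to a single bond.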

Figure 3.3 Hydrolysis of AMP, 1, into IMP, 2, and ammonia by AMP deaminase as stored in the BioPath database. The bonds broken and made are marked by lines crossing the bonds. Also shown are the reaction intermediate, 3, as generated from this reaction center information, and carbocyclic coformycin, 4, an inhibitor of AMP deaminase.

To generate the reaction intermediate, the BioPath database is loaded into the CACTVS system [13]. This program offers an extensive scripting interface which allows the manipulation of data. For this application, a program was implemented which generates intermediates for several reaction types. This is done by a simple algorithm which uses the information on the bonds broken and made in the reaction center for a specific reaction type, so that intermediates can be generated for all reactions matching that type. First, the reaction center for a specific reaction type is defined, and then the BioPath database is scanned for all reactions matching this defined reaction center. The retrieved reactions are stored in a hit list. The reactions from the hit list are then split into a substrate-handle and a product-handle. The handle which is closer to the intermediate, judged by the reaction center and the transformation needed to build the intermediate, is then modified by making and breaking the bonds that are part of the reaction center according to the intermediate. The generated intermediates are saved in a file.
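The workflow described above (define a reaction center, scan for matching reactions, modify one side of each hit) can be sketched in a few lines of Python. This is not the CACTVS script used in this work; the dictionary layout, the pattern encoding, and the function names are hypothetical and serve only to illustrate the sequence of steps on heavy-atom toy data.

```python
# Reaction-center pattern: (element, element, old bond order, new bond order).
# Hypothetical encoding of a hydrolytic deamination at an sp2 carbon.
DEAMINATION = frozenset({("C", "N", 1, 0), ("C", "O", 0, 2), ("C", "N", 2, 1)})


def find_hits(reactions, pattern):
    """Scan the reaction list and keep those whose center contains the pattern."""
    return [r for r in reactions if pattern <= r["center"]]


def build_intermediate(handle_bonds, bond_changes):
    """Apply bond order changes (atom i, atom j, delta) to a copy of the handle."""
    result = dict(handle_bonds)
    for i, j, delta in bond_changes:
        key = frozenset((i, j))
        order = result.get(key, 0) + delta
        if order < 0:
            raise ValueError(f"bond {i}-{j} would get a negative order")
        if order == 0:
            result.pop(key, None)   # bond is broken
        else:
            result[key] = order     # bond is made or altered
    return result


# Toy database with two entries. Atom numbers: 1 = C6, 2 = amino N,
# 3 = water O, 4 = ring N1 (hydrogens omitted).
reactions = [
    {
        "ec": "3.5.4.6",
        "center": DEAMINATION,
        "substrate_bonds": {frozenset((1, 2)): 1, frozenset((1, 4)): 2},
        # Addition step only: water O adds to C6 and the C6=N1 bond becomes single.
        "addition_step": [(1, 3, +1), (1, 4, -1)],
    },
    {
        "ec": "5.3.1.1",
        "center": frozenset({("C", "O", 2, 1), ("C", "O", 1, 2)}),
    },
]

hits = find_hits(reactions, DEAMINATION)
intermediates = {
    r["ec"]: build_intermediate(r["substrate_bonds"], r["addition_step"])
    for r in hits
}
```

For the AMP deaminase entry, the result is a carbon atom that carries single bonds to the amino nitrogen, the new oxygen, and the ring nitrogen, which corresponds to the tetrahedral intermediate, 3.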


CORINA. Compared to the vast number of known compounds, experimental 3D structure information from X-ray data is available only for a small fraction of compounds. To obtain 3D structure information also for those compounds where no experimental data are on hand, computational methods are necessary. The 3D structure generator CORINA (COoRdINAtes) converts the constitution of a molecule as laid down in a connection table into a 3D structure [14,15]. This 3D molecular model is a single low-energy conformation of a molecule. It should be emphasized that this conformation does not necessarily correspond to the biologically active conformation. This problem will be addressed later in the superimposition process by the program GAMMA.

PETRA. The program package PETRA (Parameter Estimation for the Treatment of Reactivity Applications) [16] allows the calculation of a variety of physicochemical effects in organic molecules by using various empirical methods. PETRA can calculate properties for atoms, for bonds, or for the whole molecule. In our studies, we used a variety of atom properties such as total charges, lone pair electronegativities, effective polarizabilities, and the lipophilicity represented by the octanol/water partition coefficient. These values can be used in the superimposition process by the program GAMMA.

GAMMA. The method developed for the superimposition of three-dimensional structures is based on atom-atom matching of non-hydrogen atoms and takes the conformational flexibility of the compounds into account. The key algorithms have been described elsewhere in detail [17]. In this approach, two additional functions are used to superimpose molecules automatically. First, atoms can optionally be characterized by physicochemical properties. The atoms to be overlaid must then conform to a given interval of the physicochemical property. For example, if the matching criterion is chosen to be the total atomic charge, qtot, and the interval is selected to be Δqtot = ±0.05 e, then for an atom of the first molecule with qtot = -0.2 e, only atoms with qtot in the interval [-0.25 e, -0.15 e] are allowed to build match tuples with this first atom. If several physicochemical properties are combined, all of them have to be satisfied at the same time. The physicochemical properties are calculated by the program package PETRA [16]. Secondly, GAMMA allows the selection of sets of atom tuples that are forced to match. For this, indices have to be given for all those atoms of the molecules that must build match tuples with each other. All remaining atoms have to fit the resulting spatial or, if given, physicochemical demands. The quality of a superimposition is scored by the root mean square (RMS) error and the size of the achieved substructure.
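The two ingredients named above, the property interval that restricts which atoms may form match tuples and the RMS error that scores a superimposition, can be illustrated with a short Python sketch. It does not reproduce GAMMA or its genetic algorithm: the function names and data layout are assumptions, and the RMS function expects coordinates that have already been superimposed.

```python
import math


def allowed_pairs(props_a, props_b, tolerances):
    """Return index pairs (i, j) whose properties all agree within tolerance.

    props_a, props_b: one dict of property values per non-hydrogen atom.
    tolerances: {property name: maximum allowed absolute difference}.
    """
    pairs = []
    for i, pa in enumerate(props_a):
        for j, pb in enumerate(props_b):
            # Every listed property has to be satisfied at the same time.
            if all(abs(pa[k] - pb[k]) <= tol for k, tol in tolerances.items()):
                pairs.append((i, j))
    return pairs


def rms_error(coords_a, coords_b, pairs):
    """RMS distance over matched atom pairs of already superimposed structures."""
    if not pairs:
        raise ValueError("no matched atoms")
    total = sum(math.dist(coords_a[i], coords_b[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))


# One atom with qtot = -0.2 e and three candidate partners (illustrative values).
first = [{"qtot": -0.20}]
second = [{"qtot": -0.22}, {"qtot": -0.10}, {"qtot": -0.16}]
pairs = allowed_pairs(first, second, {"qtot": 0.05})

# Two matched atoms, one of them displaced by 0.2 Å along z (illustrative values).
xyz_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
xyz_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.2)]
rms = rms_error(xyz_a, xyz_b, [(0, 0), (1, 1)])
```

With a tolerance of 0.05 e, only the candidates with qtot = -0.22 e and -0.16 e remain as possible partners, in line with the interval [-0.25 e, -0.15 e] given in the text.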

3.4 Results and Discussion

In the following, the method for generating intermediates of enzyme catalyzed reactions and their superimposition with inhibitors of these reactions was tested for three enzymes. The examples were chosen so as to cover quite different substrates and reaction types. The study with AMP deaminase (EC-code 3.5.4.6) investigates a hydrolysis reaction at an aromatic heterocyclic system. With triose phosphate isomerase (TIM, EC-code 5.3.1.1), an isomerization reaction in an aliphatic system showing a fair amount of conformational flexibility was investigated. Furthermore, the study of two different inhibitors provided deeper insights into the structure of the intermediate of this reaction and into important features of the binding pocket. The last example, arginase II (EC-code 3.5.3.1), investigates a hydrolysis reaction of an aliphatic system having substantial conformational flexibility. Furthermore, the hydrolysis of a guanidine group can serve as a model reaction for a large group of hydrolysis reactions involving ester and amide groups. Three different inhibitors were studied in order to gain deeper insights into the validity of the transition state hypothesis and the performance of our approach. The studies presented here also investigate the use of different types of 3D structures. In the first three investigations, all 3D structures, i.e., those of the starting materials, of the products, of the intermediates, and of the inhibitors, were generated by CORINA and then submitted to the 3D structure superimposition by GAMMA (see section ‘Materials & Methods’). This is the approach which has to be taken when searching in databases for new potential enzyme inhibitors where no information about existing drugs or the enzyme binding site is available. In the last study with arginase II, the 3D structures of the inhibitors were taken from the experimental 3D structures stored in the Protein Data Bank (PDB) [18]. Structures of the intermediates, on the other hand, had to be generated by CORINA. In this last study it was also explored how the quality of the superimposition is affected if different levels of knowledge are fed into the superimposition process. For this, three different overlay procedures were performed. First, the atoms which are known to participate in the hydrogen bonding of the inhibitor in the catalytic pocket of the enzyme were forced to match the corresponding atoms of the intermediate. In the second approach, constraints were provided that allowed only atoms of both molecules with similar physicochemical properties to be matched to each other. In the third superimposition process, no further constraints were provided in order to see whether the program can find a solution if no information on binding is available. For all superimpositions shown, the intermediate structure generated from BioPath was handled as flexible while the superimposition partner (the substrate, the product, or the inhibitor, respectively) served as a rigid template.

AMP Deaminase. As a first example, we selected the enzyme AMP deaminase (EC-code 3.5.4.6) from the database, an enzyme which provides a relevant function in purine nucleotide metabolism. AMP deaminase (AMPDA) catalyzes the conversion of adenosine-5’-monophosphate (AMP), 1, to inosine-5’-monophosphate (IMP), 2, by hydrolytic deamination and plays a key role in maintaining the relative concentrations of adenylate nucleotides. During this conversion a tetrahedral intermediate, 3, is formed, in which the leaving ammonia group and the attacking water molecule are simultaneously bound (addition mechanism, see Figure 3.3). This enzyme has been well examined in recent years, and efficient transition state analogs are known. Deficiency of AMPDA leads to disruption of muscle energy production. Symptoms are rapid fatigue, pain, and cramps. In plants, an inhibition of AMPDA results in a strong herbicidal effect [19]. One of these known inhibitor molecules is carbocyclic coformycin, 4, (Figure 3.3), a fermentation product of Saccharothrix spec. with herbicidal activity [20]. In our investigations, the inhibitor, carbocyclic coformycin, 4, was superimposed by the program GAMMA with the starting material, AMP, 1, with the product, IMP, 2, and with the tetrahedral intermediate, 3, respectively. As explained in section ‘Materials & Methods’, we always selected the reaction intermediate for our investigations as, in contrast to a transition state, the bonds in a reaction intermediate are clearly defined. In this reaction, the transition state should be structurally close to the intermediate according to the Hammond postulate [7]. Figure 3.4 shows the resulting superimpositions obtained from GAMMA. Clearly, in all three cases, the ribose part and the five-membered imidazole part of the bicyclic ring system match nearly exactly the corresponding parts of carbocyclic coformycin. It is also clear that the seven-membered ring of the inhibitor cannot completely match the six-membered ring of AMP, of IMP, or of the reaction intermediate. Differences in the three superimpositions show up at the reaction site where the NH2-group is exchanged for an OH-group (into its tautomeric form, to be exact). In the superimposition of AMP and the inhibitor, the NH2-group of AMP and the OH-group of the inhibitor point in substantially different directions in space (Figure 3.4a). The RMS value for this superimposition is 0.19 Å with a substructure size of 16 matching atoms. For the superimposition of the inhibitor with the reaction product IMP there are also discernible structural differences: the OH-group of the inhibitor and the carbonyl group of IMP clearly point to different positions in space (Figure 3.4b). Here, an RMS value of 0.20 Å was obtained with a substructure size of 16 matching atoms. In contrast, the superimposition of the tetrahedral intermediate with the inhibitor shows a good match of both OH-groups (Figure 3.4c). For this superimposition an RMS value of 0.13 Å was reached with a substructure size of 16 atoms. This emphasizes the close geometric correspondence of the two structures and points out that the OH-group, and its direction in space, is apparently very important for the binding of the reaction intermediate and of the inhibitor. Thus, the geometry of the intermediate of this reaction and that of the inhibitor are indeed in close correspondence.


Figure 3.4 Superimposition of the inhibitor carbocyclic coformycin, 4, a) with the reaction substrate AMP, 1, b) with the reaction product IMP, 2, c) with the AMP deaminase reaction intermediate, 3.

Triose Phosphate Isomerase. Triose phosphate isomerase (TIM, EC-code 5.3.1.1) is an enzyme consisting of two identical subunits, each comprising about 250 amino acids, which together form the active enzyme [21]. TIM plays an important role in glycolysis, converting D-glyceraldehyde-3-phosphate (DGAP), 5, into dihydroxyacetone-phosphate (DHAP), 6, (sometimes also called glycerone-phosphate) and vice versa (Figure 3.5). The conversion passes through two enediolate intermediates, 7 and 8, which differ in the localization of one proton. It should be noted that this is only one of several proposed reaction mechanisms, as the reaction mechanism of TIM is not fully elucidated [22].

Figure 3.5 Isomerization of D-glyceraldehyde-3-phosphate (DGAP), 5, into dihydroxyacetone-phosphate (DHAP), 6, catalyzed by triose phosphate isomerase (TIM). In this reaction, two intermediates, 7 and 8, are formed differing in the position of one proton. At the bottom two inhibitors of TIM are shown: 2-phosphoglycolohydroxamate (2PGH), 9, and 2-phosphoglycolate (2PG), 10.

The catalytic mechanism of the TIM reaction has been studied in detail [23,24]. TIM deficiency leads to a severe multisystemic disease with hemolytic anemia and neurological disorders [25]. Several inhibitors of this enzyme are known; two of them, 2-phosphoglycolohydroxamate (2PGH), 9, [24] and 2-phosphoglycolate (2PG), 10, [24], are shown in Figure 3.5. Both have been proposed to be analogs of the enediolate intermediate state [22,23]. First, the situation with the inhibitor 2PGH, 9, will be investigated. For this analysis, three superimpositions with this inhibitor were performed: first with the substrate DGAP, 5, secondly with the product DHAP, 6, and finally with the reaction intermediate, 7. For the superimposition only one intermediate was modeled, as the intermediates 7 and 8 have the same 3D structure and differ only in the position of one proton. The first superimposition, of the substrate DGAP with the inhibitor, shows that both molecules are quite similar in shape, as they have the same number of non-hydrogen atoms (Figure 3.6a). The phosphate groups of both molecules match very well, but the other ends of the molecules do not match exactly, as the configuration at the C-2 atom of DGAP and the C-1 atom of 2PGH is not the same: there is sp3-hybridization in the case of DGAP and sp2-hybridization in the case of 2PGH. This results in a different geometry for the two molecules. The RMS value for this superimposition is 0.79 Å at a substructure size of 10 atoms, which is the maximum substructure size that can be reached in this case as DGAP and 2PGH both have only 10 non-hydrogen atoms. The superimposition of the product DHAP and the inhibitor 2PGH (Figure 3.6b), on the other hand, shows quite a close match between both molecules, especially regarding the terminal O-atoms. However, the close match of the terminal atoms is obtained at the expense of the medial atoms, whose match is not perfect; this results in an RMS value of 1.64 Å at a substructure size of 10 atoms. The superimposition of the inhibitor 2PGH with the intermediate, in contrast, shows a perfect match (Figure 3.6c), as the hydroxamic acid group is an excellent bioisostere of the enediolate transition state [26], owing to the partial double bond character of the amide bond. For this superimposition an RMS value of 0.04 Å for a substructure size of 10 atoms was obtained.


Figure 3.6 At the top, the superimposition of the inhibitor 2PGH, 9, with the substrate DGAP, 5 (a), with the product DHAP, 6 (b), and with the reaction intermediate, 7 (c), of the triose phosphate isomerase catalyzed reaction is shown. At the bottom, the superimposition of the inhibitor 2PG, 10, with the substrate DGAP, 5 (d), with the product DHAP, 6 (e), and with the reaction intermediate, 7 (f), of the triose phosphate isomerase catalyzed reaction is shown.

The superimposition of the second inhibitor, 2PG, 10, with the starting material D-glyceraldehyde-3-phosphate, 5, (Figure 3.6d) clearly shows the structural differences between both molecules. The first and quite obvious difference is in the number of atoms: DGAP, 5, has 10 non-hydrogen atoms, whereas the inhibitor 2PG, 10, has 9 such atoms. This does not allow a full superimposition of both molecules. A further difference lies in the hybridization state of carbon atom C-2 of D-glyceraldehyde-3-phosphate and carbon atom C-1 of 2PG: while the C-2 atom of D-glyceraldehyde-3-phosphate is tetrahedral, the matching carboxyl group of 2PG is planar, so the two cannot be completely matched. The RMS value for this superimposition is 0.76 Å at a substructure size of 9 atoms. The superimposition of the reaction product, DHAP, 6, with the inhibitor 2PG, 10, (Figure 3.6e) shows the same difference in molecular size as before (10 vs. 9 atoms), but in this case the configuration at the C-atoms is the same in both molecules and, thus, the OH-group of 2PG matches the OH-groups of DHAP quite well. Additionally, as in the superimposition of DHAP with the first inhibitor 2PGH, the matching of the medial atoms is not perfect. The RMS value for this superimposition is 1.55 Å for a substructure size of 9 atoms. The superimposition of the intermediate with the inhibitor 2PG (Figure 3.6f) shows a perfect match. The carboxylate function in 2PG mimics the planar enediolate in the intermediate state [27]. The RMS value for this superimposition is 0.03 Å for a substructure size of 9 atoms. A summary of the RMS values obtained for the superimpositions of both inhibitors is given in Table 3.1.

Table 3.1 RMS values obtained for the superimpositions of triose phosphate isomerase. All values are given in Å. The substructure size for 2PGH is 10 atoms and for 2PG it is 9 atoms.

             molecule species from the reaction
inhibitor    DGAP    DHAP    intermediate
2PGH         0.79    1.64    0.04
2PG          0.76    1.55    0.03
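The comparison summarized in Table 3.1 amounts to selecting, for each inhibitor, the reaction species with the lowest RMS value. A minimal Python sketch of this ranking, using only the values from the table (the variable names are illustrative), could look as follows:

```python
# RMS values in Å, copied from Table 3.1.
rms = {
    "2PGH": {"DGAP": 0.79, "DHAP": 1.64, "intermediate": 0.04},
    "2PG": {"DGAP": 0.76, "DHAP": 1.55, "intermediate": 0.03},
}

# For each inhibitor, pick the reaction species with the smallest RMS value.
best = {inhibitor: min(row, key=row.get) for inhibitor, row in rms.items()}

for inhibitor, species in best.items():
    print(f"{inhibitor}: best match is {species} ({rms[inhibitor][species]:.2f} Å)")
```

Note that such a ranking is only meaningful when the substructure sizes are equal within a row, as is the case here (10 atoms for 2PGH, 9 atoms for 2PG).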

The study of the superimpositions of both inhibitors of TIM, 2PGH and 2PG, leads to the following conclusions: 1) The reaction involves two intermediate states that differ in the arrangement of a proton and, accordingly, the reaction progresses through three transition states. 2) The geometry at C-atom 1 of dihydroxyacetone-phosphate is not important for the transition state of this reaction. 3) The C-atom 1 can even be deleted and still an inhibitor can be obtained. Thus, it seems that the geometry on carbon atom C-1 of DGAP, which is no longer present in 2PG, is not that important for the binding.


Arginase II. The previously presented studies work with the assumption that no 3D information is available from experimental sources, as is the case when searching for new inhibitors in a database. Therefore, CORINA generated models have been used for the intermediate as well as for the inhibitor compounds. In contrast, in the following example we studied the situation in which experimentally derived 3D structural information is available and is brought into the superimposition process. This may give us an indication of whether we can indeed work only with computed 3D information for the alignments. Additionally, we were interested in whether the quality of the alignment can be increased noticeably if we incorporate known information into the superimposition. For the first case, atoms that are known to interact with the binding pocket of the enzyme through hydrogen bonds were forced to match each other. In the second case, physicochemical properties, calculated by PETRA (see section ‘Materials & Methods’), which were assumed to be relevant for the receptor-ligand interaction, were introduced. In the last and simplest case, only 3D structural information of the inhibitor was used. Arginase is a binuclear manganese metalloenzyme that catalyzes the hydrolytic cleavage of L-arginine, 11, into L-ornithine, 12, and urea, 13, through a metal-activated hydroxide mechanism [28]. In mammals, two isoenzymes have been identified: arginase I is found predominantly in hepatocytes, and arginase II occurs extrahepatically. The arginase isoenzymes differ from each other in terms of their catalytic, molecular, and immunological properties. Human penile arginase is a potential target for the treatment of sexual dysfunction in males [29]. The reaction and the invoked intermediate, 14, are given in Figure 3.7.


Figure 3.7 Hydrolysis of L-arginine, 11, to L-ornithine, 12, and urea, 13, catalyzed by arginase II with the corresponding reaction intermediate, 14, shown.

In this experiment we have used the 3D structures of three inhibitors: (S)-2-amino-6-boronohexanoic acid (ABH), 15, (PDB-Id: 1D3V) [30,31], S-(2-boronoethyl)-L-cysteine (BEC, sometimes also S2C), 17, (PDB-Id: 1HQ5) [30,31], and S-(2-sulfonamidoethyl)-L-cysteine (SDC), 19, (PDB-Id: 1R1O) [28] from Rattus norvegicus, all shown in Figure 3.8.



Figure 3.8 Inhibitors of arginase II:

- ABH, 15, and in its active form as hydrated ABH, 16.

- BEC, 17, and in its active form as hydrated BEC, 18.

- SDC, 19.

ABH and BEC are slow-binding competitive inhibitors belonging to the class of boronic acid inhibitors, while SDC contains a sulfonamide group. Bound in the active site of the enzyme, ABH and BEC form tetrahedral boronate anions, 16 and 18, respectively. These mimic the tetrahedral intermediate of the arginase hydrolysis reaction. The same function is fulfilled by the sulfonamide group of SDC. For all three inhibitors, the experimentally derived 3D structure as bound to arginase is available from the Protein Data Bank (PDB) [18]. After extraction of the 3D structures of the inhibitors from the PDB, the data were processed for the superimposition while conserving the original 3D structural data. Then, the five physicochemical properties used for the superimposition were calculated by PETRA: lone pair electronegativity, σ-electronegativity, effective atom polarizability, total charge, and octanol/water partition coefficient. The same steps were performed for the intermediate of the reaction, which was generated from the BioPath database.

For each inhibitor, three superimpositions were performed. In the first superimposition, the atoms of the inhibitor and of the intermediate that should match were assigned as constraints for the superimposition process. This information was derived from refs. 28, 29, and 31. The atoms assigned to match between the intermediate and each inhibitor are indicated in structures 14, 16, 18, and 19 by dashed boxes. For the second kind of superimposition, similarity ranges in physicochemical properties that describe the electronic effects relevant for binding into the binding pocket of the enzyme were taken as matching criteria. This allows only those atoms to match which are similar with regard to these properties and should therefore bind into the same region of the binding pocket. The physicochemical properties used in the superimposition and the defined ranges are given in Table 3.2. The ranges, Δp, were applied such that two atoms were allowed to be superimposed only if their values of each property, p, deviated by less than Δp.

Table 3.2 Ranges, Δp, of physicochemical properties assigned to the superimposition process.

physicochemical property                  inhibitor
                                          ABH     BEC     SDC
lone pair electronegativity (eV)          2.10    2.10    not used
σ-electronegativity (eV)                  2.10    2.10    not used
effective atom polarizability (Å³)        0.60    1.00    not used
total charge (e.u.)                       0.25    0.25    0.35
octanol/water partition coefficient       0.50    0.50    0.60
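
The range criterion can be illustrated with a minimal Python sketch. It is not the GAMMA implementation: the dictionary keys and the two example atoms are hypothetical, and only the Δp values are taken from Table 3.2. Properties marked "not used" are simply left out of the range dictionary.

```python
# Ranges (delta p) copied from Table 3.2. The property keys and the atom
# values further down are hypothetical and serve only as a demonstration.
RANGES_ABH = {
    "lone_pair_en": 2.10,
    "sigma_en": 2.10,
    "polarizability": 0.60,
    "total_charge": 0.25,
    "logp": 0.50,
}
RANGES_SDC = {"total_charge": 0.35, "logp": 0.60}  # other properties not used


def may_match(atom_a, atom_b, ranges):
    """Return True if every constrained property deviates by less than its range."""
    return all(abs(atom_a[name] - atom_b[name]) < delta
               for name, delta in ranges.items())


# Invented property values for one atom of the intermediate and one of an inhibitor.
atom_intermediate = {"lone_pair_en": 11.0, "sigma_en": 9.0,
                     "polarizability": 3.0, "total_charge": -0.30, "logp": 0.10}
atom_inhibitor = {"lone_pair_en": 12.0, "sigma_en": 9.5,
                  "polarizability": 3.9, "total_charge": -0.20, "logp": 0.30}

print(may_match(atom_intermediate, atom_inhibitor, RANGES_ABH))
print(may_match(atom_intermediate, atom_inhibitor, RANGES_SDC))
```

With the invented values, the pair is rejected under the ABH ranges because the polarizabilities differ by more than 0.60 Å³, but accepted under the SDC ranges, where polarizability is not constrained.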


The ranges were set by initial inspection of the properties of the atoms given as match tuples in the first superimposition experiment. For the third superimposition, no constraints were specified, so that the match is determined entirely by the geometry of the molecules. This is the typical situation when scanning a database of compounds for new potential inhibitors.

First, the superimposition with the inhibitor ABH, 16, was analyzed. All three experiments showed a good overlap between the inhibitor and the reaction intermediate. The RMS values show how close the superimpositions are to each other: with given match tuples the RMS value is 0.30 Å (Figure 3.9a), with given constraints on physicochemical properties 0.24 Å (Figure 3.9b), and without any constraints on the superimposition 0.29 Å (Figure 3.9c). As can be seen, the superimposition without constraints performs as well as the one with given match tuples. For all three superimpositions the maximum substructure size is 13 atoms.
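
For orientation, the following sketch shows how an RMS value over matched atom pairs can be computed. It assumes that the reported RMS is the root-mean-square distance between the matched atoms after superimposition, which the text does not spell out, and the coordinates are invented.

```python
import math


def rms_deviation(coords_a, coords_b):
    """Root-mean-square distance between matched atom pairs (already superimposed)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Invented coordinates (in Å) of two matched atom pairs.
intermediate = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
inhibitor = [(0.3, 0.0, 0.0), (1.5, 0.3, 0.0)]
print(round(rms_deviation(intermediate, inhibitor), 2))
```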



Figure 3.9 Superimposition of the arginase reaction intermediate, 14, with ABH, 16, with given match tuples (a), based on physicochemical properties (b), and without any constraints (c).

Superimposition of the arginase reaction intermediate, 14, with BEC, 18, with given match tuples (d), based on physicochemical properties (e), and without any constraints (f).

Superimposition of the arginase reaction intermediate, 14, with SDC, 19, with given match tuples (g), based on physicochemical properties (h), and without any constraints (i).

For the inhibitor BEC, 18, the RMS value of the superimposition onto the intermediate is 0.31 Å when match tuples are given (Figure 3.9d), 0.23 Å when physicochemical properties are given (Figure 3.9e), and 0.43 Å without any constraints (Figure 3.9f). A substructure size of 13 atoms was obtained for all superimpositions except the one based on physicochemical properties, where the maximum substructure size was 10 atoms. In this case, the sulfur atom and both flanking C-atoms of BEC were not recognized as match partners of the corresponding atoms of the intermediate because they exceeded the given property ranges. Here, the purely geometric superimposition performs slightly worse than the others.


For the last inhibitor, SDC, 19, the RMS value is 1.00 Å with given matching atoms (Figure 3.9g), 0.41 Å with given physicochemical properties (Figure 3.9h), and 0.50 Å without any constraints (Figure 3.9i). Here, the geometric superimposition is again better than the one with given match tuples, but slightly worse than the one with physicochemical properties. An overview of the RMS values for all superimpositions is given in Table 3.3. For all three inhibitors, the differences between the three methods can hardly be recognized by visual inspection.

Table 3.3 RMS values (in Å) obtained in the superimposition experiments with arginase II. The substructure size for each superimposition is given in parentheses.

Information given into the superimposition process

inhibitor   match tuples given   ranges of physicochemical properties given   no constraints given
ABH         0.30 (13)            0.24 (13)                                    0.29 (13)
BEC         0.31 (13)            0.23 (10)                                    0.43 (13)
SDC         1.00 (13)            0.41 (13)                                    0.50 (13)

3.5 Conclusions

In this paper it was shown that 3D molecular models of intermediates of enzyme-catalyzed reactions can automatically be generated from a database of biochemical reactions and can serve as templates for matching inhibitors of the enzymes that catalyze the corresponding reaction. By superimposing these generated intermediates onto known transition state analog inhibitors, it was shown that the similarity between both is sufficient to use the intermediate as a template to search for new transition state analog inhibitors. This can be performed by a superimposition method which uses a genetic algorithm enriched with a numerical optimization method. If no experimental 3D information on the inhibitors is available, computed 3D molecular information can be used instead and still delivers good results. As the superimposition process also allows conformational changes, detailed information on the steric requirements of enzyme-catalyzed reactions can be gained. The consideration of physicochemical effects in the superimpositions allows one to draw conclusions on the electronic effects operating in the enzyme pocket. The comparison of several inhibitors of a specific enzyme also allows knowledge to be accumulated on the crucial features of an inhibitor of that enzyme. This approach provides a three-dimensional structure query that can be used for searching databases of chemical structures for new potential enzyme inhibitors without using elaborate and time-consuming ab initio methods. This opens up prospects for finding new drugs and agrochemicals.

Acknowledgement We gratefully acknowledge funding for this project by the ‘Bundesministerium fuer Bildung und Forschung’ (BMBF), projects no. 031U112D and 031U212D, ‘Bioinformatics for the Functional Analysis of Mammalian genomes’ (BFAM) which is part of the ‘Nationales Genomforschungsnetz Deutschland’ (NGFN). The work on this project was also supported by KONWIHR (ENZYMECH), funded by the state of Bavaria through the ‘High-Tech-Offensive Bayern’.


References

[1] Borman, S. Much ado about enzyme mechanisms. Chem. Eng. News 2004, 82, 35-39.

[2] Zhang, X.; Houk, K. N. Why Enzymes Are Proficient Catalysts: Beyond the Pauling Paradigm. Acc. Chem. Res. 2005, 38, 379-385.

[3] Pauling, L. The nature of forces between large molecules of biological interest. Nature 1948, 161, 707-709.

[4] Pauling, L. Molecular Architecture and Biological Reactions. Chem. Eng. News 1946, 24, 1375-1377.

[5] Robertson, J.G. Mechanistic Basis of Enzyme-Targeted Drugs. Biochemistry 2005, 44, 5561-5571.

[6] Reitz, M.; Sacher, O.; Tarkhov, A.; Trümbach, D.; Gasteiger, J. Enabling the exploration of biochemical pathways. Org. Biomol. Chem. 2004, 2, 3226-3237.

[7] Hammond, G. S. A correlation of reaction rates. J. Am. Chem. Soc. 1955, 77, 334-338.

[8] Biochemical Pathways Wall Chart. Michal, G., Ed. Boehringer Mannheim: Germany; now Roche, 1993. http://www.expasy.org/tools/pathways.

[9] Biochemical Pathways Biochemistry Atlas. Michal, G., Ed., Spektrum Akademischer Verlag: Heidelberg, Germany, 1999.

[10] Goto, S.; Okuno, Y.; Hattori, M.; Nishioka, T.; Kanehisa, H. LIGAND: database of chemical compounds and reactions in biological pathways. Nucl. Acids Res. 2002, 30, 402-404.

[11] BioCyc Database Collection, http://www.biocyc.org.

[12] C@ROL, Molecular Networks GmbH, Germany, http://www.mol-net.com.

[13] Ihlenfeldt, W. D.; Takahashi, Y.; Abe, H.; Sasaki, S. Computation and management of chemical properties in CACTVS: An extensible networked approach toward modularity and compatibility. J. Chem. Inf. Comput. Sci. 1994, 34, 109-116.


[14] Sadowski, J.; Gasteiger, J.; Klebe, G. Comparison of Automatic Three-Dimensional Model Builders Using 639 X-Ray Structures. J. Chem. Inf. Comput. Sci. 1994, 34, 1000-1008.

[15] CORINA, Molecular Networks GmbH, Germany, http://www.mol-net.com. CORINA can be tested on the internet at http://www2.chemie.uni-erlangen.de/software/free_struct.html.

[16] Gasteiger, J. In Physical Property Prediction in Organic Chemistry, Jochum, C.; Hicks, M. G.; Sunkel, J., Eds., Springer Verlag: Heidelberg, 1988; pp 119-138.

[17] Handschuh, S.; Wagener, M.; Gasteiger, J. Superposition of Three-Dimensional Chemical Structures Allowing for Conformational Flexibility by a Hybrid Method. J. Chem. Inf. Comput. Sci. 1998, 38, 220-232.

[18] Bernstein, F. C.; Koetzle, T. F.; Williams, G. F.; Meyer, E. F. jr.; Brice, M. D.; Rodgers, J. R.; Tasumi, M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 1977, 112, 535-542.

[19] Bojack, G.; Earnshaw, C. G.; Klein, R.; Lindell, S. D.; Lowinski, C.; and Preuss, R. Design and Synthesis of Inhibitors of Adenosine and AMP Deaminase. Organic Letters 2001, 3, 839-842.

[20] Dancer, J. E.; Hughes, R. G.; and Lindell, S. D. Adenosine-5'-Phosphate Deaminase - A Novel Herbicide Target. Plant Physiol. 1997, 114, 119-129.

[21] Witmans, C. J. An approach to the rational design of new inhibitors for Trypanosoma brucei Triosephosphate Isomerase. Dissertation University Groningen, Netherlands, 1995. http://www.ub.rug.nl/eldoc/dis/science/c.j.witmans/index.html.

[22] Cui, Q.; Karplus, M. Catalysis and Specificity in Enzymes: A Study of Triosephosphate Isomerase and Comparison with Methyl Glyoxal Synthase. In Advances in Protein Chemistry, Daggett, V., Ed., Elsevier Academic Press: San Diego, CA, 2003; 66, pp 315-320.

[23] Aqvist, J.; Fothergill, M. Computer simulation of the triosephosphate isomerase catalyzed reaction. J. Biol. Chem. 1996, 271, 10010-10016.


[24] Collins, K. D. An activated intermediate analogue. The use of phosphoglycolohydroxamate as a stable analogue of a transiently occurring dihydroxyacetone phosphate-derived enolate in enzymatic catalysis. J. Biol. Chem. 1974, 249, 136-142.

[25] Livet, M. O. Triose-phosphate isomerase deficiency, Orphanet encyclopedia 2003, http://www.orpha.net/data/patho/GB/uk-TPI.pdf.

[26] Lewis, D. J.; Lowe, G. Phosphoglycollohydroxamic acid: an inhibitor of class I and II aldolases and triosephosphate isomerase. A potential antibacterial and antifungal agent. J. Chem. Soc. Chem. Commun. 1973, 19, 713-715.

[27] Wolfenden, R. Transition state analogues for enzyme catalysis. Nature 1969, 233, 704-705.

[28] Cama, E.; Hyunshun, S.; and Christianson, D. W. Design of amino acid sulfonamides as transition-state analogue inhibitors of arginase. J. Am. Chem. Soc. 2003, 125, 13052-13057.

[29] Kim, N. N.; Cox, J. D.; Baggio, R. F.; Emig, F. A.; Mistry, S. K.; Harper, S. L.; Speicher, D. W.; Morris, S. M. Jr.; Ash, D. E.; Traish, A.; Christianson, D. W. Probing Erectile Function: S-(2-Boronoethyl)-L-Cysteine Binds to Arginase as a Transition State Analogue and Enhances a Smooth Muscle Relaxation in Human Penile Corpus Cavernosum. Biochemistry 2001, 40, 2678-2688.

[30] Cox, J. D.; Kim, N. N.; Traish, A. M.; Christianson, D. W. Arginase-Boronic Acid Complex Highlights a Physiological Role in Erectile Function. Nat. Struct. Biol. 1999, 6, 1043-1047.

[31] Cama, E.; Colleluori, D. M.; Emig, F. A.; Hyunshun, S.; Kim, S. W.; Kim, N. N.; Traish, A. M.; Ash, D. E.; Christianson, D. W. Human Arginase II: Crystal Structure and Physiological Role in Male and Female Sexual Arousal. Biochemistry 2003, 42, 8445-8451.

3.6 Further Conclusions

In this publication we have shown that intermediates are suited to recognizing analogs of the transition state of enzymatic reactions. With the reactions stored at the molecular level and the reaction center marked in BioPath, these intermediates can easily be constructed and then used as a query template for database searching. For 3D information, a low-energy conformation is created. As this conformation may not correspond to the situation found in nature, a genetic algorithm was used to access the conformational space. It was shown that this method is still sufficient even without knowledge of the enzyme-bound conformation. In these studies, we have also learned that the results can be improved by introducing physico-chemical constraints into the superimposition process. A multitude of physico-chemical descriptors can be calculated by rapid empirical methods that are also suited to handling large compound databases [1,2]. The descriptors can be selected individually to fit the needs of the problem. Here, descriptors were chosen that describe the situation in the catalytic site of an enzyme. Further, it was shown that the investigated inhibitors are in fact more similar to the intermediate state than to the substrate or reaction product, supporting the theory stated by Linus Pauling described in section 3.1. In the following section, a database search performed with the method just presented will be shown.

References

[1] Kleinöder, T. 2007, unpublished results. The calculation time with PETRA averages 250 ms per compound on a standard Linux PC.

[2] Gasteiger, J. Modeling Chemical Reactions for Drug Design. J. Comput. Aided Mol. Des. DOI 10.1007/s10822-006-9097-4.

4 Database Screening for Enzyme Inhibitors

4.1 Introduction

To test the procedure described in the preceding section, a database was searched for inhibitors of AMP deaminase using a computer-generated low-energy 3D conformation of the reaction intermediate. This ligand-based virtual screening approach should provide an efficient method to downsize a compound database to the molecules of interest, in this case to extract potential inhibitors of AMP deaminase. Whereas in high-throughput screening (HTS) libraries of chemicals are tested against biological targets, virtual screening applies computer methods to virtual compounds which might not have been synthesized at all. Therefore, virtual screening has become established as a fast and inexpensive method to scale down the search space for further screening. With the availability of high-resolution protein structures, the idea of receptor-based ligand design surfaced. Here, a new ligand is docked into the receptor structure without further knowledge of existing ligands. This allows getting away from a design which is based on existing substrates and gives access to new scaffolds and leads. Although this approach indeed delivers viable results and has led to marketed drugs such as Viracept [1], an HIV protease inhibitor, or Relenza [2], an anti-influenza drug, receptor-based drug design faces serious problems. One problem is the small number of biologically active and synthetically accessible compounds in relation to the enormous chemical space. Another problem is the highly flexible and complicated receptor structure with unpredictable shape and solvent structure. Furthermore, there are difficulties in calculating the ligand-receptor binding affinities [3].

In contrast, in the approach presented in the preceding section we have shown that a low-energy 3D structure of a computer-generated reaction intermediate can help in finding potential candidates for enzyme inhibitors without knowledge of the enzyme binding site. Clearly, such an approach will not provide the same prediction quality as a full knowledge-based approach, but it can still be a useful tool for a preliminary database screening that bypasses the problems of receptor-based screening. In the approach presented here, the computed 3D structure of a molecule is used as a query template to search for molecules with similar 3D structure. For this, the rigid query molecule was superimposed onto the compounds in the database using a genetic algorithm incorporated in the program system GAMMA, as described in section 3.3. The use of a genetic algorithm allows a large conformational space to be explored for each compound in the database without calculating different conformers in advance.

4.2 Preparation of data

As already stated, the automatically generated intermediate of the AMP deaminase reaction (Figure 4.1) was chosen as query compound. This reaction and its intermediate were already described in section 3.4. As search space, the database MDL Drug Data Report (MDDR) version 2005.1 was used, covering biologically active compounds and derivatives with a multitude of properties derived from patent literature, journals, meetings, and congresses [4]. Initially this database contains 159,662 compounds. Before screening, a prefiltering step was applied cutting down the database to molecules suited to act as inhibitors of AMP deaminase. The data were filtered using the descriptors introduced by Lipinski et al. [5] and the values were chosen to be in range of the query structure. Therefore, as filtering criteria, the calculated octanol/water partition coefficient (cLogP), the number of hydrogen-bond donors and acceptors and the molecular weight were chosen. Additionally, the number of rotatable bonds was restricted to a maximum of seven in order to control the flexibility of the compounds and to therefore restrict the complexity of the alignment process. The applied filtering parameters are given in Table 4.1. The filtering was performed using MDL ISIS/Base [6].


Figure 4.1 Reaction intermediate of the AMP deaminase reaction.


Table 4.1 Properties used for prefiltering.

Property            Value
cLogP               < 10
Rotatable bonds     ≤ 7
H-donors            ≤ 5
H-acceptors         ≤ 9
Molecular weight    100-1000
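
The prefilter itself was run in MDL ISIS/Base; the following Python sketch only restates the criteria of Table 4.1 as a predicate. The `Compound` record and the example values are hypothetical, and treating the molecular weight limits as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Hypothetical record holding the descriptors needed for prefiltering."""
    name: str
    clogp: float
    rotatable_bonds: int
    h_donors: int
    h_acceptors: int
    mol_weight: float


def passes_prefilter(c):
    """Apply the thresholds of Table 4.1 (weight bounds assumed inclusive)."""
    return (c.clogp < 10
            and c.rotatable_bonds <= 7
            and c.h_donors <= 5
            and c.h_acceptors <= 9
            and 100 <= c.mol_weight <= 1000)


# Invented example entries, not taken from MDDR.
library = [
    Compound("rigid_small", 1.2, 3, 4, 8, 268.0),
    Compound("too_flexible", 2.5, 12, 2, 6, 410.0),
    Compound("too_many_acceptors", 0.3, 5, 3, 11, 350.0),
]
kept = [c.name for c in library if passes_prefilter(c)]
print(kept)
```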

After the prefiltering step, 32,482 compounds were left for searching. These structure data were then prepared in the following way: first, all small fragments such as counter-ions were removed and all charges were neutralized. Then, missing hydrogen atoms were added and a low-energy 3D structure was calculated. These steps were performed with CORINA 3.2. Afterwards, the octanol/water partition coefficients based on atomic increments (LogP) [7] and total atomic partial charges [8,9] were calculated for all molecules using PETRA 4 [10]. The 3D structure, the LogP increments, and the partial charges were also calculated for the query structure, the AMP deaminase reaction intermediate. For an illustration of the process see Figure 4.2. This clean-up process is similar to the one presented in the publication of chapter 3 (see Figure 3.2).


Figure 4.2 Flowchart of the steps involved in the database screening process. For the software involved refer to Figure 3.2

4.3 Computation

The genetic algorithm used by GAMMA is computationally demanding because of the high number of conformations generated for the superimposition. Therefore, a parallelized version of GAMMA 2.7 was run on a Linux computing cluster maintained by the Rechenzentrum of the University Erlangen-Nuernberg (RRZE). Eight computing nodes were used, each with a dual Xeon 3.20 GHz "Nocona" processor (800 MHz FSB / 666 MHz RAM), two GByte RAM, and an 80 GB IDE hard disk. The total runtime for the calculations was 139.63 hours.

4.4 The Fitness Function

The fitness function used to rank the quality of a superimposition result is a linear combination given by the following equation:

\[ F = N - D_r - S \]

Here, for the fitness (F) three criteria were considered. First, the size of the substructure, as given by the number, N, of matching atoms; second, the geometric fit of the matching atoms, Dr, defined by the relative differences of corresponding atom distances; third, the deviations in stereochemistry of the substructure atoms, S. The size of the substructure is hereby used as a measure for the quality of the superimposition, whereby two penalty parameters, the geometric differences and the deviation in stereochemistry, are subtracted.

The term Dr indicates the geometrical quality of the superimposition of two molecules and is defined by the following equation:

\[ D_r = \sum_{i}^{N} \sum_{\substack{j \\ j \neq i}}^{N} \frac{\left| d_1(i,j) - d_2(i,j) \right|}{\max\left( d_1(i,j),\, d_2(i,j) \right)} \]

In this equation, i and j represent two of the N match pairs. The distances of the atoms in molecule 1 and molecule 2, respectively, are represented by d1(i,j) and d2(i,j). The two arguments, i and j, describe the match pair of the atoms for which the distance is defined.
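
As a hedged illustration of the two equations above, the sketch below evaluates Dr for two lists of matched atom coordinates and combines it with given values of N and S. It is not taken from GAMMA: the function names and coordinates are invented, and the double sum is read literally, so each pair of match pairs is counted in both orders.

```python
import math
from itertools import permutations


def geometric_penalty(coords1, coords2):
    """D_r: sum over ordered pairs (i, j), i != j, of relative distance differences."""
    total = 0.0
    for i, j in permutations(range(len(coords1)), 2):
        d1 = math.dist(coords1[i], coords1[j])  # distance in molecule 1
        d2 = math.dist(coords2[i], coords2[j])  # distance in molecule 2
        longest = max(d1, d2)
        if longest > 0.0:  # skip coincident atoms to avoid division by zero
            total += abs(d1 - d2) / longest
    return total


def fitness(n_matched, d_r, s):
    """F = N - D_r - S."""
    return n_matched - d_r - s


# Invented coordinates of two match pairs: the distance is 1 Å in molecule 1
# and 2 Å in molecule 2.
mol1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
mol2 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
d_r = geometric_penalty(mol1, mol2)
print(d_r, fitness(len(mol1), d_r, 0.0))
```

Because every term lies between 0 and 1, identical distance matrices give Dr = 0 and leave the fitness equal to the substructure size minus the stereochemistry penalty.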

As the parameter Dr is not able to discriminate between enantiomeric structures, another parameter, S, is necessary which describes the local spatial environment of the atoms.


Therefore, the term S is defined as the sum of all stereochemistry-parameters over all atom tuples:

\[ S = \sum_{i} S_i \]

Here, the stereochemistry descriptor of a match tuple, Si, is determined by comparing two planes which are defined by the atoms of the nearest three match pairs. One plane is spanned by the three atoms of the first molecule; the second plane is spanned by the three atoms of the second molecule. If the compared atoms lie on the same side of their respective planes, they are considered identical with respect to their stereochemistry. If they lie on different sides, they behave like mirror images. The descriptor Si is then defined as the largest distance between the respective planes and the corresponding central atom. This means that not only the spatial difference, but also the deviation of the two atoms is taken into account.

\[ S_i = \begin{cases} 0, & \text{if } d_{1i} \cdot d_{2i} > 0 \\ \max(d_{1i}, d_{2i}), & \text{if } d_{1i} \cdot d_{2i} < 0 \end{cases} \]
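
The stereochemistry penalty can be sketched as follows. Several points are assumptions rather than statements of the original method: d1i and d2i are read as signed distances of the central atom from the plane in molecule 1 and molecule 2, the plane orientation is fixed by the order of its three atoms, the "largest distance" is taken as the larger magnitude, and a product of exactly zero (not covered by the definition) is treated as no penalty.

```python
import math


def signed_plane_distance(point, a, b, c):
    """Signed distance of point from the plane through a, b, c (normal = ab x ac)."""
    ab = [b[k] - a[k] for k in range(3)]
    ac = [c[k] - a[k] for k in range(3)]
    normal = [ab[1] * ac[2] - ab[2] * ac[1],
              ab[2] * ac[0] - ab[0] * ac[2],
              ab[0] * ac[1] - ab[1] * ac[0]]
    length = math.sqrt(sum(x * x for x in normal))
    if length == 0.0:
        raise ValueError("the three plane atoms are collinear")
    return sum((point[k] - a[k]) * normal[k] for k in range(3)) / length


def stereo_penalty(d1, d2):
    """S_i: zero on the same side, otherwise the larger distance magnitude."""
    if d1 * d2 < 0:
        return max(abs(d1), abs(d2))  # assumption: magnitudes are compared
    return 0.0  # same side, or product zero (assumed to carry no penalty)


# Invented example: the same plane in both molecules, central atoms on opposite sides.
plane = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
d1 = signed_plane_distance((0.2, 0.2, 0.5), *plane)
d2 = signed_plane_distance((0.2, 0.2, -0.8), *plane)
print(d1, d2, stereo_penalty(d1, d2))
```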


#1: Pentostatin Fitness: 14.79, 16 atoms substructure size

#2: 183843 Fitness: 14.65, 17 atoms substructure size

#3: 257335 Fitness: 13.70, 17 atoms substructure size

#4: 260526 Fitness: 13.66, 15 atoms substructure size

#5: 177436 Fitness: 13.64, 15 atoms substructure size

Figure 4.3 Structures and superimpositions of the five best hits retrieved from the database search. Besides the compound name, the fitness value and the matched substructure size are given. For each superimposition, 24 experiments were performed and the best result kept. The structure of the superimposed reaction intermediate is shown in Figure 4.1.

4.5 Results

Besides the 3D structure as steric feature, two physico-chemical properties reflecting hydrophobicity and electrostatics were brought into the superimposition process: the octanol/water partition coefficient based on atomic increments, LogP, and the total charges, qtot. Both play an important role in binding a ligand into the catalytic site of an enzyme. An interval of +/- 0.4 of the van der Waals radius was specified. Each of the compounds in the database was superimposed with the query structure in 24 experiments. From these 24 experiments, the results were ranked by linear rank scaling [11] and the result with the highest fitness was kept. Afterwards, all compounds in the database were sorted by the fitness of their best result from highest to lowest value. For the five highest ranked compounds, the structure and the superimposition with the AMP deaminase reaction intermediate are shown (Figure 4.3).

The first hit, pentostatin, is described as a transition state drug analog of adenosine deaminase. This compound is structurally very close to the AMP deaminase transition state drug analog carbocyclic coformycin investigated in section 3.4. Both compounds contain a diazepine ring system with a seven-membered ring, while the natural substrate AMP contains a purine ring system with a six-membered ring. Despite this difference, the compounds were considered similar by GAMMA. The second hit, 183848, shows structural similarities regarding the ring system and might therefore even have been found by a 2D substructure search. This is also true for the fourth and fifth hits, 260526 and 177436. In contrast, the third hit, 257335, shows clear structural differences to the intermediate of the AMP deaminase reaction and no obvious similarity. The high similarity recognized by GAMMA is not apparent by visual inspection. However, with 17 atoms the matched substructure is quite large. In Figure 4.4 the two matched compounds are shown in their 2D structures with the matched atoms marked.
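
The post-processing of the runs described above (keep the best of the repeated experiments per compound, then sort all compounds by that value) can be sketched in a few lines of Python. The fitness values are invented and only three runs per compound are shown instead of 24. Linear rank scaling is not reproduced here.

```python
def rank_compounds(runs_by_compound):
    """Keep the highest fitness per compound and sort compounds by it, best first."""
    best = {name: max(values) for name, values in runs_by_compound.items()}
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Invented fitness values of repeated superimposition experiments.
runs = {
    "compound_a": [9.1, 12.4, 11.0],
    "compound_b": [13.2, 10.5, 12.9],
    "compound_c": [7.7, 8.3, 8.1],
}
ranking = rank_compounds(runs)
for position, (name, value) in enumerate(ranking, start=1):
    print(position, name, value)
```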


Figure 4.4 Comparison of the 2D structures of the compound 257335 (top) and the intermediate of the AMP deaminase reaction (bottom). The 17 atoms recognized as matching on each other by GAMMA are marked with red boxes. The match partners are indicated by letters.

In the MDDR database, for each compound several pharmacological activities are annotated. Using this information, in Figure 4.5 the activities of the compounds rated as the 50 best molecular superimpositions are shown. Interestingly, the database contains no molecules annotated with the function ‘AMP deaminase inhibitor’ but several with the function ‘Adenosine deaminase inhibitor’. Two compounds having this activity are contained in the 50 highest ranked superimpositions (marked in Figure 4.5).


Figure 4.5 Activities annotated with the 50 highest ranked compounds. Note that a compound can have more than one activity annotated.


The best hit in the search, pentostatin (2’-deoxycoformycin), is classified as an Adenosine deaminase inhibitor and is structurally quite similar to the inhibitor of AMP deaminase shown in section 3.4, carbocyclic coformycin (see Figure 4.6). The substrates of the two enzymes, adenosine and adenosine monophosphate, are also structurally quite similar (see Figure 4.7), except that AMP has a phosphate group, and the catalyzed reactions, the hydrolysis of an amino group, are identical.


Figure 4.6 Pentostatin, an inhibitor of Adenosine deaminase (a) and carbocyclic coformycin, an inhibitor of AMP deaminase (b) in comparison.


Figure 4.7 Adenosine, the substrate of Adenosine deaminase (a) and AMP, the substrate of AMP deaminase (b) in comparison.

Besides its function as an inhibitor of Adenosine deaminase, pentostatin is also described as a transition state analog inhibitor of AMP deaminase [12]. Therefore, the activity ‘Adenosine deaminase inhibitor’ was chosen to serve as reference in this investigation.


The searched database contains 15 compounds annotated with this activity. Table 4.2 lists the molecules with this activity and their rank after the database search, and Figure 4.8 shows the enrichment for this activity. While the molecules with this activity are expected to be distributed uniformly over the initial database, after the search they are all contained in the top 20% of the search results. This indicates a clear enrichment of molecules with this activity.

Table 4.2: Adenosine deaminase inhibitors contained in the database of 32,482 compounds, ranked by fitness. The position within the database is given in percent.

Mol. No.   Rank   % of Database   Name          Fitness
1          1      0.00            PENTOSTATIN   14.79
2          50     0.15            211612        11.88
3          175    0.54            ADECYPENOL    10.46
4          226    0.70            CLADRIBINE    10.07
5          513    1.58            212605        8.92
6          530    1.63            GP-3367       8.84
7          737    2.27            215462        8.33
8          847    2.61            FR-234938     8.15
9          848    2.61            285795        8.15
10         1320   4.06            235147        7.53
11         1670   5.14            334503        7.15
12         5166   15.90           285796        5.32
13         5179   15.94           285791        5.32
14         5245   16.15           CPC-407       5.30
15         6453   19.87           FR-221647     4.96
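
The enrichment can be quantified directly from the ranks in Table 4.2. The sketch below counts how many of the 15 annotated compounds fall within a given top fraction of the ranked database and derives an enrichment factor relative to random selection. The function names are illustrative, and the enrichment factor is a commonly used measure that is not itself reported in the text.

```python
import math

# Ranks of the 15 annotated Adenosine deaminase inhibitors, from Table 4.2.
ACTIVE_RANKS = [1, 50, 175, 226, 513, 530, 737, 847, 848,
                1320, 1670, 5166, 5179, 5245, 6453]
DATABASE_SIZE = 32482


def actives_in_top(ranks, database_size, fraction):
    """Number of active compounds ranked within the top fraction of the database."""
    cutoff = math.floor(database_size * fraction)
    return sum(1 for rank in ranks if rank <= cutoff)


def enrichment_factor(ranks, database_size, fraction):
    """Fraction of actives recovered divided by the fraction of the database screened."""
    found = actives_in_top(ranks, database_size, fraction)
    return (found / len(ranks)) / fraction


for fraction in (0.01, 0.05, 0.20):
    print(fraction,
          actives_in_top(ACTIVE_RANKS, DATABASE_SIZE, fraction),
          round(enrichment_factor(ACTIVE_RANKS, DATABASE_SIZE, fraction), 1))
```

A random ordering would give an enrichment factor of about 1 at every cutoff, so values well above 1 at small fractions correspond to the steep initial rise of the curve in an enrichment plot.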


Figure 4.8 Enrichment plot for activity ‘Adenosine deaminase inhibitor’. 15 compounds with this activity are contained in the database of 32,482 compounds. The black line gives the expectation of active compounds which are expected to be found by a random selection. The red line shows the result of the screening in percent. For selected points, the number of found active compounds is annotated.

Other activities connected to the 50 highest ranked compounds are listed in Figure 4.5. As can be seen, seven of the 50 best hits have the pharmacological function ‘antianginal’, which means that they are used for the treatment of angina pectoris, a symptom of ischaemic heart disease. In fact, Dhasmana et al. reported an antianginal effect of several adenosine deaminase inhibitors, among them 2’-deoxycoformycin (pentostatin) [13]. Curiously, this activity is not annotated for this compound in MDDR. Another activity annotated with the 50 highest ranked compounds is ‘antiviral’. The inhibitor pentostatin is reported to remarkably support therapy with antiviral drugs [14].

4.6 Conclusions

The study presented here clearly supports the hypothesis, presented in section 3, that the intermediates of reactions can serve as queries to search for inhibitors of these reactions. The database searched here contained no compounds with the activity ‘AMP deaminase inhibitor’, but several with the activity ‘Adenosine deaminase inhibitor’. Considering the close structural similarity between the substrates of AMP deaminase and adenosine deaminase, and the similarity of the catalyzed reactions, an enrichment of the ‘Adenosine deaminase inhibitor’ activity could be expected. In fact, a clear enrichment of this activity is observed, together with further activities that are connected to adenosine deaminase inhibitors and AMP deaminase inhibitors. Taking this into account, the compounds retrieved with high fitness from this search might serve as starting points for the development of new inhibitors of AMP deaminase. The variety of activities connected to the retrieved compounds may also indicate an involvement of the query enzyme in these indications and help to detect new areas of application for already known compounds. Besides compounds with apparent structural similarity to the query compound, compounds were also retrieved whose similarity is not recognizable in 2D but becomes evident in 3D space. Therefore, using the 3D structure of a reaction intermediate as a search query retrieves compounds that would not have been found by 2D substructure searching and makes it possible to explore a much wider variety of structural scaffolds, possibly leading to new lead structures. In addition, by exploring the structural similarities of the retrieved molecules, conclusions on structural and electronic features of the enzyme’s binding pocket can be drawn.

References

[1] Henry, C. M. Structure Based Drug Design. Chemical & Engineering News 2001, 79, 69-74.

[2] Itzstein, M. V.; et al. Rational Design of potent Sialidase-based Inhibitors of Influenza Virus Replication. Nature 1993, 363, 418-423.

[3] Shoichet, B. K. Virtual Screening of Chemical Libraries. Nature 2004, 432, 862-865.

[4] MDDR. MDL Information Systems Inc. www.mdli.com.

[5] Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and Computational Approaches to estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Deliv. 1997, 23, 3-25.

[6] ISIS/Base. MDL Information Systems Inc. www.mdli.com.

[7] Wang, R.; Fu, Y.; Lai, L. A New Atom-Additive Method for Calculating Partition Coefficients. J. Chem. Inf. Comput. Sci. 1997, 37, 615-621.

[8] Bauerschmidt, S.; Gasteiger, J. Overcoming the Limitations of a Connection Table Description: A Universal Representation of Chemical Species. J. Chem. Inf. Comput. Sci. 1997, 37, 705-714.

[9] Gasteiger, J.; Marsili, M. Iterative Partial Equalization of Orbital Electronegativity - A rapid Access to Atomic Charges. Tetrahedron 1980, 36, 3219-3228.

[10] PETRA. Molecular Networks GmbH: Erlangen, Germany. http://www.mol-net.com.

[11] Baker, J. E. Adaptive Selection Methods for Genetic Algorithms. In Proceedings of the 1st International Conference on Genetic Algorithms; Grefenstette, J. J., Ed.; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 1985; pp 101-111.

[12] Merkler, D. J.; Brenowitz, M.; Schramm, V. L. The Rate Constant describing slow-onset Inhibition of Yeast AMP Deaminase by Coformycin Analogues is independent of Inhibitor Structure. Biochemistry 1990, 29, 8358-8364.

[13] Dhasmana, J. P.; Digerness, S. B.; Geckle, J. M.; Ng, T. C.; Glickson, J. D.; Blackstone, E. H. Effect of Adenosine Deaminase Inhibitors on the Heart's Functional and Biochemical Recovery from Ischemia: a Study utilizing the isolated Rat Heart adapted to 31P Nuclear Magnetic Resonance. J. Cardiovasc. Pharmacol. 1983, 5, 1040-1047.

[14] Naesens, L.; Hatse, S.; Andrei, G.; Balzarini, J.; Neyts, J.; Snoeck, R.; De Clercq, E. Biological activities of 9-(2-phosphonylmethoxyethyl)-N6-cyclopropyl-2,6-diaminopurine. Antiviral Research 1998, 37, 74-74.

5 Classification of Metabolic Reactions Based on Physicochemical Descriptors: Investigations on Hydrolases

5.1 General Introduction

In this chapter, a second application is presented that uses the reaction center marking contained in the BioPath database. Here, this information was used to build a classification of metabolic reactions and, thereby, of the enzymes catalyzing them. To this end, the reaction center information was used to extract the bonds of the substrate that are involved in the reaction. For these bonds, a set of physicochemical descriptors was calculated. The descriptors were chosen to capture the electronic effects acting in the reaction. The resulting vector can then be used as input for data analysis methods. This process is shown in Figure 5.1. The method for reaction-center extraction and descriptor calculation is implemented in the program system CORA [13]. In the investigation presented here, the classification was built with an unsupervised Kohonen neural network in order to illustrate intuitively the similarities to the established EC nomenclature used for classifying enzymes. A classification of reactions based on the reaction center information can overcome some of the drawbacks of a manual classification system such as the EC system, because it is based solely on chemical features. The following sections first present the original publication as published in the journal PROTEINS: Structure, Function, and Bioinformatics in 2007. At the end of this chapter, some further conclusions are given. The numbering of the sections and Figures was adapted to this thesis.

Figure 5.1 Flowchart describing the process used to build a classification of metabolic reactions based on the reaction-center information. Here, the bonds participating in the reaction process are extracted from the reagents and then physico-chemical properties are calculated for these bonds. The resulting data vector serves as input for data analysis methods.

Classification of Metabolic Reaction Based on Physicochemical Descriptors: Investigation on Hydrolases

Martin Reitz and Johann Gasteiger *

University of Erlangen-Nuernberg, Computer-Chemie-Centrum and Institute of Organic Chemistry, Naegelsbachstr. 25, 91052 Erlangen, Germany

Reitz, M.; Gasteiger, J.; submitted to PROTEINS: Structure, Function, and Bioinformatics 2007.

Abstract

A classification method for enzyme-catalyzed metabolic reactions is presented and illustrated with reactions catalyzed by hydrolases. The classification is exclusively based on physicochemical descriptors calculated for the bonds reacting in the substrate of the reaction. This classification method tries to overcome some of the drawbacks of the classical EC code system, where different criteria are used in different hierarchies. In order to foster an understanding of this classification method we have, however, selected enzymatic reactions where the EC code system is largely built on criteria based on the reaction mechanism. This is true for hydrolysis reactions, falling into the domain of the EC class 3 (EC 3.b.c.d). The comparison is made by a Kohonen neural network based on an unsupervised learning algorithm. For these hydrolysis reactions, the classification based on physicochemical effects produces results that are, by and large, similar to the EC code. However, this classification method reveals finer details of the reaction mechanism and thus can provide a better basis for the comparison of enzymes.

5.2 Introduction

A widely accepted and used method for enzyme classification is the Enzyme Commission (EC) code, maintained by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) and constituted in 1992 [1]. This classification system builds on a variety of criteria such as reaction patterns, substrates, transferred groups, and acceptor groups. In this system, a unique number is assigned to each enzyme, following a scheme consisting of four numbers separated by dots: a.b.c.d. The first number, a, gives the main class, of which there are six: Oxidoreductases (1), Transferases (2), Hydrolases (3), Lyases (4), Isomerases (5), and Ligases (6). These main classes are divided into several subclasses, b, which are in turn divided into sub-subclasses, c. The last number, d, is arbitrarily assigned to provide a unique number for each enzyme. For example, the enzyme glucose-6-phosphatase is classified as EC 3.1.3.9: hydrolases (a=3), acting on ester bonds (b=1), in this case phosphoric monoester hydrolases (c=3). Because of the diversity of criteria used at the different levels of the classification scheme, the EC classification is not quite coherent: depending on the EC class, the emphasis shifts between different criteria. For example, at the second level, b, the subdivision criteria for the Oxidoreductases (EC 1) alternate between the electron donor and the electron acceptor compounds. The Transferases (EC 2) are sub-grouped by the type of transferred group. The Hydrolases (EC 3) are sub-grouped by the type of bond that is hydrolyzed. The Lyases (EC 4) are sub-grouped by the type of bond that is broken. The Isomerases (EC 5) are sub-grouped by the type of isomerization, and the Ligases (EC 6) by the bond that is formed during the reaction. Clearly, the most important action of an enzyme is the catalysis of a reaction, an event that breaks and makes bonds. We therefore investigated the classification of reactions based on the site where the reaction occurs, the reaction center. The reaction center was characterized by physicochemical effects calculated for the bonds that take part in the reaction. This information can be used as input to data analysis methods. In our case, we used a self-organizing (Kohonen) neural network for grouping together all reactions considered similar on the basis of these properties. The results were then compared with the EC classification system. In this paper, we restricted our studies to the class of hydrolases, as these reactions follow similar reaction mechanisms and the EC classification system also uses features based on the reaction type for the classification of this class. This provides an opportunity for the comparison of both classification systems.
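The four-level structure of the EC code described above can be illustrated with a short sketch. The helper names `parse_ec` and `same_group` are hypothetical and serve only to show how two enzymes can be compared at the class, subclass, or sub-subclass level.

```python
# Main classes of the EC system, numbered as in the EC code a.b.c.d.
EC_MAIN_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def _levels(code):
    """Strip an optional 'EC' prefix and split the code at the dots."""
    return code.replace("EC", "").strip().split(".")


def parse_ec(code):
    """Split an EC code such as 'EC 3.1.3.9' into its four levels a, b, c, d."""
    levels = _levels(code)
    if len(levels) != 4:
        raise ValueError(f"expected four levels a.b.c.d, got {code!r}")
    a, b, c, d = (int(level) for level in levels)
    return {"class": a, "subclass": b, "sub_subclass": c, "serial": d,
            "class_name": EC_MAIN_CLASSES[a]}


def same_group(code_1, code_2, level):
    """True if two EC codes agree on their first `level` numbers (1 to 4)."""
    return _levels(code_1)[:level] == _levels(code_2)[:level]


print(parse_ec("EC 3.1.3.9"))
print(same_group("EC 3.1.3.9", "EC 3.1.4.1", level=2))  # same subclass EC 3.1
```

Comparing at `level=2` corresponds to the subclass b, and `level=3` to the sub-subclass c.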

Through the classification of reactions, similarities between reactions can be perceived. This can form the basis for searching for other enzymes that can catalyze a given reaction. Or, vice versa, for a given enzyme, different or novel reactions might be found that are catalyzed by the same enzyme. The classification of enzymatic reactions is a field of high interest, as data on proteins with unknown function grow fast and the modeling of biochemical reaction networks depends on the knowledge of protein functions [2]. Izrailev and Farnum [3] used ligand-binding data of proteins with unknown function to classify them into the EC system by comparing the sets of processed ligands. Dobson and Doig [4] used several computable attributes from the crystal structure for this task. This bypasses the widely used method of protein function annotation by sequence comparison with other proteins of known function. Kotera et al. [5] introduced a method for identifying the reaction center of a reaction by matching the substrate onto the product and used this information to assign an RC number (reaction classification), which they compared with the EC system. A quite similar approach was taken by Zhang and Aires-de-Sousa [6], where a Molmap was computed for the reagent and the product of each reaction using a set of physicochemical descriptors. From these Molmaps, a reaction Molmap can be computed that codes the reaction center. This information can then be used for an automatic reaction classification. Ruepp et al. [7] presented a classification system, FunCat, for the functional description of proteins, which is based on the pathways in which a protein is involved and which is not restricted to enzymes and transporters as the EC system is. Clearly, the kinds of reactions that can be catalyzed by a given enzyme are influenced by two dominant factors, the shape of the active site and the physicochemical effects that are operative at this site.
In this investigation we concentrate on the elementary steps of a reaction, the bonds broken in a reaction and how this process can be characterized by physicochemical effects. This will then form the basis for a classification of enzymes. Thus, in a first step, we neglected the size and the shape of the active site as often this information is not available.

5.3 Materials and Methods

5.3.1 BioPath

The reactions that were used in this study were taken from the BioPath database which has been presented in detail in ref [8]. This database was compiled from the information of the well-known Biochemical Pathways wall chart [9] and the corresponding atlas [10]. In this database, all compounds and reactions are stored as connection tables giving access to each atom and bond of a molecule. Reactions were input in a stoichiometrically correct manner and enriched by information such as enzyme name with EC code, regulators, reactant and product name, as well as by information on the compartment where the reaction occurs. Important for the present application is the fact that all atoms of the substrates are mapped onto the corresponding atoms of the products by atom-atom mapping numbers and that all bonds which are broken, made, or altered during the reaction are marked. This feature of the BioPath database allows the extraction of the reaction center of each reaction and the calculation of physicochemical descriptors describing the reaction center which is used for the studies described here. The data of the BioPath database can be accessed through the web-based retrieval system C@ROL which offers a variety of structure- and text-based search possibilities [11]. The BioPath database is publicly available at http://www2.chemie.uni-erlangen.de/services/biopath/index.html and http://www.molecular-networks.com/biopath/index.html.

5.3.2 Datasets

From all reactions of the BioPath database, those reactions were extracted which are catalyzed by enzymes of class EC 3.b.c.d, the hydrolases. From this dataset, subsets of the subclasses EC 3.1.c.d, EC 3.2.c.d, and EC 3.5.c.d were created for a more detailed investigation. Other subclasses of EC 3.b.c.d were not examined separately, as too few reactions were available for these subclasses in the database. The datasets are described in more detail in the following sections. After extraction from BioPath, the datasets were preprocessed before the calculation of physicochemical properties. In reactions containing labels, R, for residues, each R was substituted by an H atom; otherwise the calculation of the physicochemical descriptors would fail, as the methods require exact atom information. Also, reactions occurring at macromolecules, e.g. RNA, were removed, as this study is focused on small molecules. Further, reversible reactions in which the hydrolyzed product was coded on the substrate side were inverted to match the EC nomenclature for hydrolases, where the hydrolyzed compound is always considered on the product side. After cleanup, six physicochemical descriptors (see section 5.3.3) were calculated for all reacting bonds using the program PETRA [12]. The bond in the water molecule was not considered, as this bond is the same in all hydrolysis reactions and thus provides no specific information for classification. Data analysis methods such as Kohonen networks require the individual objects that are analyzed to be represented by the same number of descriptors. Therefore, vectors of different length had to be filled up with standard values; here we chose values of zero. To avoid vectors with too many zero values, each dataset was restricted to reactions with a fixed maximum number of reacting bonds. For classes EC 3.1.c.d and EC 3.2.c.d, only reactions with one reacting bond were considered, resulting in a vector of length six. For classes EC 3.b.c.d and EC 3.5.c.d, all reactions with one or two reacting bonds were considered, resulting in a vector of length twelve. After calculation of the physicochemical descriptors, all values were scaled to allow a better comparison of the vectors.
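The zero-filling and scaling steps can be sketched as follows. This is an illustration under stated assumptions: the text does not specify the scaling scheme, so a column-wise min-max scaling to the range 0 to 1, applied after padding, is assumed here, and the descriptor values are invented for the example (they are not PETRA output). The function names `pad_vector` and `min_max_scale` are hypothetical.

```python
DESCRIPTORS_PER_BOND = 6  # six descriptors are calculated for each reacting bond


def pad_vector(bond_descriptors, max_bonds):
    """Concatenate the per-bond descriptor lists and fill up with zeros."""
    if len(bond_descriptors) > max_bonds:
        raise ValueError("reaction has more reacting bonds than the dataset allows")
    flat = [value for bond in bond_descriptors for value in bond]
    return flat + [0.0] * (max_bonds * DESCRIPTORS_PER_BOND - len(flat))


def min_max_scale(vectors):
    """Scale each column to [0, 1]; constant columns become 0 (assumed scheme)."""
    columns = list(zip(*vectors))
    lows = [min(column) for column in columns]
    spans = [max(column) - min(column) for column in columns]
    return [[(value - low) / span if span else 0.0
             for value, low, span in zip(vector, lows, spans)]
            for vector in vectors]


# Hypothetical descriptor values for two reactions (illustration only).
one_bond = [[0.30, 1.2, 0.0, 4.1, 0.5, 0.2]]
two_bonds = [[0.10, 0.8, 0.4, 3.0, 0.1, 0.6],
             [0.25, 1.0, 0.2, 5.2, 0.3, 0.4]]

vectors = [pad_vector(one_bond, max_bonds=2), pad_vector(two_bonds, max_bonds=2)]
scaled = min_max_scale(vectors)
print(len(vectors[0]), vectors[0][6:])  # length twelve, second half filled with zeros
```

With `max_bonds=1` the same function yields vectors of length six, as used for the EC 3.1.c.d and EC 3.2.c.d datasets.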

EC 3.b.c.d

All reactions catalyzed by enzymes with the code EC 3.b.c.d (hydrolases) were extracted from the BioPath database. Hydrolases catalyze the hydrolysis of a variety of compounds such as esters, ethers, peptides, glycosides, phosphoric acid esters, acid anhydrides, or C-C bonds using water. The dataset contained 134 reactions with one or two reacting bonds in the substrate, ignoring the bond in the water molecule. The largest share of the dataset consists of reactions from the subclass EC 3.1.c.d, hydrolases acting on ester bonds, with 54 reactions. Subclass EC 3.2.c.d, the glycosylases, had a share of 18 reactions. There was no reaction of subclass EC 3.3.c.d (hydrolases acting on ether bonds), and the subclass EC 3.4.c.d, peptidases, comprised two reactions. A large part, consisting of 44 reactions, is constituted by subclass EC 3.5.c.d, hydrolases acting on carbon-nitrogen bonds other than in peptides. EC 3.6.c.d (hydrolases acting on acid anhydrides) is represented by twelve reactions, all of which are of sub-subclass EC 3.6.1.d (acting on phosphorus-containing anhydrides). The subclass EC 3.7.c.d (hydrolases acting on carbon-carbon bonds) contained three reactions, and there was one reaction of subclass EC 3.8.c.d (hydrolases acting on carbon-sulfur bonds).

EC 3.1.c.d

All reactions catalyzed by enzymes falling into the subclass EC 3.1.c.d (hydrolases acting on ester bonds) were extracted from the database. The original dataset contained 52 reactions; all have one reacting bond on the substrate side. The dataset is composed of 14 reactions of sub-subclass EC 3.1.1.d (carboxylic ester hydrolases), four reactions of sub-subclass EC 3.1.2.d (thioester hydrolases), 23 reactions of sub-subclass EC 3.1.3.d (phosphoric monoester hydrolases), nine reactions of sub-subclass EC 3.1.4.d (phosphoric diester hydrolases), and one reaction each of sub-subclasses EC 3.1.5.d (triphosphoric monoester hydrolases) and EC 3.1.6.d (sulfuric ester hydrolases). An overview of the reactions catalyzed by the enzymes of these sub-subclasses is given in Figure 5.2.

Figure 5.2 Reaction equations for the reactions catalyzed by the sub-subclasses of EC 3.1.c.d (hydrolases acting on ester bonds): EC 3.1.1.d (carboxylic ester hydrolases), EC 3.1.2.d (thioester hydrolases), EC 3.1.3.d (phosphoric monoester hydrolases), EC 3.1.4.d (phosphoric diester hydrolases), EC 3.1.5.d (triphosphoric monoester hydrolases), EC 3.1.6.d (sulfuric ester hydrolases).

EC 3.2.c.d

The dataset of subclass EC 3.2.c.d (glycosylases) contained 18 reactions, each having one reacting bond on the substrate side. The dataset consisted of two sub-subclasses: EC 3.2.1.d (glycosidases) with ten reactions and EC 3.2.2.d (glycosylases hydrolyzing N-glycosylic compounds) with eight reactions, whose substrates in all cases contain a purine ring system. The reactions catalyzed by the enzymes of sub-subclasses EC 3.2.1.d and EC 3.2.2.d are shown in Figure 5.3.

Figure 5.3 Reaction equations for the reactions catalyzed by the sub-subclasses of EC 3.2.c.d (glycosylases): EC 3.2.1.d (glycosidases hydrolyzing O- and S-glycosylic compounds), EC 3.2.2.d (hydrolases hydrolyzing N-glycosylic compounds).

EC 3.5.c.d

All reactions catalyzed by enzymes of subclass EC 3.5.c.d (hydrolases acting on carbon-nitrogen bonds other than peptide bonds) were collected. The dataset contained 44 reactions with one or two reacting bonds. The dataset consisted of four sub-subclasses: EC 3.5.1.d (hydrolases acting on linear amides) with 16 reactions, EC 3.5.2.d (hydrolases acting on cyclic amides (lactams)) with nine reactions, EC 3.5.3.d (hydrolases acting on linear amidines) with five reactions, and EC 3.5.4.d (hydrolases acting on cyclic amidines) with 14 reactions. An overview of the reactions considered is given in Figure 5.4.

Figure 5.4 Reaction equations for the reactions catalyzed by the sub-subclasses of EC 3.5.c.d (hydrolases acting on carbon-nitrogen bonds other than peptide bonds): EC 3.5.1.d (acting on open-chain amides), EC 3.5.2.d (acting on lactams), EC 3.5.3.d (acting on guanidines), EC 3.5.4.d (acting on cyclic amidines).

5.3.3 Choice of Descriptors

To represent a reaction, each reacting bond on the substrate side was described by six physicochemical descriptors, suited to describe the electronic character of the bonds taking part in the reaction. The choice of descriptors was based on the work reported in [13] and modified for the needs of biochemical reactions. All descriptors were calculated by rapid empirical procedures implemented in the program PETRA [12]. The following descriptors were used for the study:

● difference in total charges, Δqtot, describes the polarity of a bond,

● difference in σ-electronegativities, Δχσ, [14] describes the ability of an atom to attract the electrons in a σ-bond,

● difference in π-electronegativities, Δχπ, [15], describes the ability of an atom to attract the electrons in a π-bond,

● effective bond polarizability, αb, [16], describes the tendency of the bond electrons to be perturbed by an external electrical field,

● delocalization stabilization of a negative charge, D-, [17] describes the stabilization of a negative charge generated by heterolysis of the bond,

● delocalization stabilization of a positive charge, D+, [17] describes the stabilization of a positive charge generated by heterolysis of the bond.
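For the data analysis, the six descriptors of a bond have to be kept in a fixed order so that the same position in every vector refers to the same property. A minimal sketch of such a record is given below; the class name `BondDescriptors`, the field names, and the ordering are assumptions made for illustration, and the numeric values are invented.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class BondDescriptors:
    """Six physicochemical descriptors of one reacting bond (hypothetical container)."""
    delta_q_tot: float      # difference in total charges
    delta_chi_sigma: float  # difference in sigma-electronegativities
    delta_chi_pi: float     # difference in pi-electronegativities
    alpha_b: float          # effective bond polarizability
    d_minus: float          # delocalization stabilization of a negative charge
    d_plus: float           # delocalization stabilization of a positive charge

    def as_vector(self):
        """Return the descriptors as a list in the fixed field order."""
        return list(astuple(self))


# Invented values, for illustration only.
bond = BondDescriptors(0.30, 1.2, 0.0, 4.1, 0.5, 0.2)
print(bond.as_vector())
```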

5.3.4 Kohonen Neural Network

By calculating the six descriptors described above, each bond of a reaction is represented by six values. The breaking of each bond is therefore an event in a six-dimensional space spanned by these six descriptors as coordinates. In order to determine the position of each reaction in this six-dimensional space, each reaction is projected into a two-dimensional plane using a Kohonen neural network. This method has already been used successfully for the classification of organic reactions [18, 19]. The method projects data from a multidimensional space into a two-dimensional map while preserving the topological relationships of the multidimensional space. The software used for the generation of the Kohonen maps was SONNIA [20, 21]. In SONNIA, each neuron is initialized with random weights and has a dimension, m, identical to that of the input vectors (Figure 5.5). During the training process, each input vector is presented to the network and projected into the neuron with the weights most similar to the input vector. The weights of this neuron and of its neighbors are then adapted, with the size of the correction decreasing with increasing distance from this neuron. This process is iterated several times until convergence is obtained.
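The training procedure can be sketched in a few lines of plain Python. This is a generic planar Kohonen map, not the SONNIA implementation: the learning-rate and neighborhood schedules, the number of epochs, and the function names `winner` and `train_som` are assumptions chosen for illustration, and the input data are random numbers standing in for scaled bond descriptors.

```python
import math
import random


def winner(weights, vector):
    """Return (row, column) of the neuron whose weights are closest to `vector`."""
    best, best_distance = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            distance = sum((a - b) ** 2 for a, b in zip(vector, w))
            if distance < best_distance:
                best, best_distance = (r, c), distance
    return best


def train_som(vectors, rows=7, cols=9, epochs=50, seed=0):
    """Train a planar Kohonen map; weights[row][col] is a list of m floats."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Each neuron starts with random weights of the same dimension as the input.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        # Learning rate and neighborhood radius shrink linearly (assumed schedule).
        progress = epoch / epochs
        rate = 0.5 * (1.0 - progress)
        radius = max(1.0, max(rows, cols) / 2 * (1.0 - progress))
        for vector in rng.sample(vectors, len(vectors)):  # random presentation order
            win_r, win_c = winner(weights, vector)
            for r in range(rows):
                for c in range(cols):
                    distance = math.hypot(r - win_r, c - win_c)
                    if distance > radius:
                        continue
                    # The correction decreases with the distance from the winner.
                    factor = rate * (1.0 - distance / (radius + 1.0))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += factor * (vector[i] - w[i])
    return weights


data_rng = random.Random(1)
# Synthetic six-dimensional inputs in [0, 1]; stand-ins for scaled descriptors.
data = [[data_rng.random() for _ in range(6)] for _ in range(20)]
som = train_som(data)
print(winner(som, data[0]))  # neuron that receives the first input vector
```

After training, each reaction is assigned to the neuron returned by `winner`, which yields the two-dimensional map inspected in the following sections.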

Figure 5.5 Schematic illustration of a Kohonen neural network. The network is composed of a two-dimensional arrangement of neurons. Each neuron has a number of weights equal to the number of descriptors representing the reacting bonds (input vector). In the training process, the input vectors are presented to the network, and the neuron having weights most similar to the descriptors of the reacting bonds receives the considered reaction. A weight adjustment process is then initiated.

For comparison with the classical EC nomenclature, each neuron in the 2D map is colored according to the EC subclass to which the majority of its reactions belong. Neurons in which reactions of different EC subclasses come together are called ‘conflicting neurons’. These are the cases where the method described here considers reactions similar that are put into different subclasses or sub-subclasses by the EC system. In the following Figures, the Kohonen network is viewed from the top of Figure 5.5 in order to show how the various reactions are distributed into the individual neurons.

5.4 Results and Discussion

5.4.1 EC 3.b.c.d

In this study, all reactions catalyzed by EC class 3, hydrolases, were examined. For this dataset containing 134 reactions, the six descriptors described in section 5.3.3 were calculated for each reacting bond, and the resulting data were then mapped into a planar 9x7 Kohonen map. The resulting map is shown in Figure 5.6. The content of the map, the projection of the reactions into the individual neurons, is shown in more detail in Figures 5.7 and 5.8. In these Figures, the reaction center and its adjacent atoms are shown for the reactions that are mapped into the individual neurons. To allow a comparison with the EC system, the neurons in the Figures were colored by their affiliation with the different subclasses b (EC 3.b.c.d). Neurons into which reactions of different EC 3 subclasses b are mapped are called ‘conflicting’ neurons and are indicated by a cross in Figure 5.6. The color of these conflicting neurons is defined by the subclass to which the majority of their reactions belong.
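The coloring rule and the detection of conflicting neurons can be expressed compactly. The sketch below is an illustration only: the function name `label_neurons` is hypothetical, the reaction counts in the example are invented, and ties between subclasses are resolved in favor of the subclass encountered first, which is an assumption not stated in the text.

```python
from collections import Counter, defaultdict


def label_neurons(assignments):
    """Map each occupied neuron to (majority subclass, is_conflicting).

    `assignments` is an iterable of (neuron, ec_subclass) pairs, one per reaction.
    """
    per_neuron = defaultdict(Counter)
    for neuron, subclass in assignments:
        per_neuron[neuron][subclass] += 1
    labels = {}
    for neuron, counts in per_neuron.items():
        majority, _ = counts.most_common(1)[0]
        # A neuron is conflicting if it holds reactions of more than one subclass.
        labels[neuron] = (majority, len(counts) > 1)
    return labels


# Invented example: neuron C9 receives reactions of three subclasses, A9 of one.
assignments = ([("C9", "EC 3.1")] * 3 + [("C9", "EC 3.4")] * 2
               + [("C9", "EC 3.5")] + [("A9", "EC 3.1")] * 2)
labels = label_neurons(assignments)
print(labels["C9"])  # ('EC 3.1', True): colored as EC 3.1 and marked by a cross
print(labels["A9"])  # ('EC 3.1', False)
```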

Figure 5.6 Projection of the reactions of class EC 3.b.c.d (hydrolases; 134 reactions) into a two-dimensional Kohonen neural network. The coloring of the neurons is assigned by the subclass b. Neurons where reactions of different subclasses collide are marked by a cross. In these neurons, the color is determined by the subclass with the majority of reactions. The dotted line marks the border between the area with reactions where one bond is broken and the area where two bonds are broken during the reaction process.

Inspection of Figure 5.6 shows a clear separation of the active neurons into two groups, indicated by a dashed line. On the left-hand side, there is a widespread cluster of reactions belonging to subclass EC 3.5.c.d. They all have in common that two bonds of the substrate are involved in the reaction. This cluster is separated by empty neurons in column 4 from the reactions on the right-hand side, where only one bond takes part in the reaction. Clearly, the different number of descriptors, caused by the different number of reacting bonds, must result in a strong separation of the two groups of reactions. Reactions of subclass EC 3.5.c.d break a carbon-nitrogen bond; where a second bond is broken, it is in most cases also a carbon-nitrogen bond or, sometimes, a phosphorus-oxygen bond or, in one case, an oxygen-hydrogen bond. Observe that changes in bond order, converting a double bond into a single bond, are also indicated. In the following analysis we will go through the reactions of each subclass and observe in which neurons they were deposited. For a better overview, the map is dissected into two parts. Figure 5.7 shows the left-hand part of Figure 5.6, covering the reactions with two bonds broken. Figure 5.8 shows the right-hand part with reactions where only one bond is broken.

Figure 5.7 Detail of Figure 5.6 (left-hand part). In this Figure, the contents of the neurons containing reactions breaking two bonds are labeled by the reaction centers of the mapped reactions. All reactions in this map are catalyzed by enzymes belonging to EC 3.5.c.d. For each neuron, the reaction center and its neighboring atoms and bonds are shown. Crossed bonds indicate a change in bond order; bonds doubly crossed indicate a bond broken. Substructures surrounded by a circle indicate that the bond breaks occur simultaneously in one reaction. The letter captions are explained in the text.

Figure 5.8 Detail of Figure 5.6 (right-hand part). In this Figure, the contents of the neurons containing reactions breaking one bond in the reaction process are labeled by the reaction centers of the mapped reactions. For each neuron, the reaction center and its neighboring atoms are shown. Crossed bonds indicate a change in bond order; bonds doubly crossed indicate a bond broken. The color of the neurons is assigned by the subclass to which the majority of reactions belong. The letter captions are explained in the text.

The first subclass, EC 3.1.c.d, is distributed into four areas on the right-hand side of the map (indicated in red). Neuron A9 contains reactions catalyzed by thioester hydrolases, EC 3.1.2.d, catalyzing the hydrolysis of a thioester. Neuron C9, clearly separated from the reactions of neuron A9, contains reactions catalyzed by carboxylic ester hydrolases, EC 3.1.1.d, hydrolyzing a carboxylic acid ester. In neuron E8, there is one reaction catalyzed by sterylsulfatase, belonging to sub-subclass EC 3.1.6.d, sulfuric ester hydrolases. This is the only reaction of this sub-subclass in the dataset. Its placement in a neuron flanked by reactions of subclass EC 3.6.c.d in neurons E9 and F8, in which a phosphate group is cleaved from the substrate, reflects the similarity of the phosphate and sulfate groups in reactivity. Finally, neurons G8 and G9 contain reactions of sub-subclasses EC 3.1.3.d, phosphoric monoester hydrolases, and EC 3.1.4.d, phosphoric diester hydrolases. Neuron G8 additionally contains one reaction of sub-subclass EC 3.1.5.d, triphosphoric monoester hydrolases (structure a in Figure 5.8). Here, a phosphoric diester bond is also broken; these two sub-subclasses therefore cover reactions with the same mechanism and should be considered together. The reactions of the different sub-subclasses of EC 3.1.c.d are clearly separated from each other. The hydrolyses of thioesters, carboxylic esters, sulfuric esters, and phosphoric esters occur with different ease and reaction rates because of the substantial differences in the physicochemical effects that operate on the bonds being hydrolyzed. Putting them into the same subclass EC 3.1.c.d, as the EC classification scheme does, is questionable. Rather, we believe that they should be separated not at the third level of the classification scheme but already at the second level. This more detailed classification will be investigated in section 5.4.2.
The reactions of subclass EC 3.2.c.d, glycosylases, indicated in orange, are all projected into two neurons, A6 and A7: neuron A7 contains only a single reaction, and all other reactions occurring at glycosidic bonds fall into neuron A6. In these reactions, both N- and O-glycosylic bonds are broken. The concentration of these reactions in two adjacent neurons underlines their strong similarity and reproduces the EC classification system for this subclass.

The reactions of subclass EC 3.4.c.d, peptidases (see structure b in Figure 5.8), fall into neuron C9. There they coincide with reactions of subclass EC 3.1.c.d, carboxylic ester hydrolases, and reactions of subclass EC 3.5.c.d, hydrolases acting on carbon-nitrogen bonds other than peptide bonds. In our opinion, grouping the peptidase reactions with those of subclass EC 3.5.c.d makes sense on mechanistic grounds: the separation of the hydrolysis of peptides from that of amides is purely phenomenological and not warranted by the reaction mechanism, which is of the same nature, and thus our method does not separate them.

Now let us concentrate on the reactions of subclass EC 3.5.c.d, the most widely spread reactions on the map. Most obviously, they are separated into reactions with two reacting bonds (B1, C1, D1, E2, F3, and G1) on the left-hand side of Figure 5.6 and reactions with one reacting bond (B5, C8, C9, D8, and D9) on the right-hand side. On the right-hand side, a cluster is formed by neurons C8, D8, D9, and also C9 (see structure b in Figure 5.8). Here, the reactions hydrolyze an amide or an amidine while breaking one bond. In fact, the same type of bond is always broken in these reactions: a carbon-nitrogen bond with partial double-bond character caused by mesomerism. Neuron B5, in the center of the map, holds only one reaction, catalyzed by Methenyl-THF-cyclohydrolase (EC 3.5.4.9), in which a cyclic amidine bond is broken. This reaction is separated from all other reactions of subclass EC 3.5.c.d because one nitrogen atom is charged. The charge certainly has a strong effect on the hydrolysis of this substrate and is perceived by our classification. Now let us consider the reactions involving two bonds. Many of these reactions, in fact, break only one bond but simultaneously change a double bond into a single bond, as observed for amidines.
These reactions are shown in detail in Figure 5.7. In neuron B1, a lactam is broken and simultaneously a phosphate group from ATP is transferred by the enzyme 5-Oxoprolinase (EC 3.5.2.9). Reactions in which two amide bonds are broken in a substituted urea with concomitant release of carbon dioxide are found in neurons C1 and D1 (structure a in Figure 5.7). Two other types of reactions are also mapped into neuron C1: one in which a bond in an amidine is broken with a concomitant change in the bond order of the carbon-nitrogen double bond (structure b in Figure 5.7), and one reaction of sub-subclass EC 3.1.3.d, in which an oxygen-hydrogen bond is broken with simultaneous cleavage of a phosphate group from ATP (structure c in Figure 5.8). Although this latter reaction is quite different from all other reactions breaking two bonds, which all involve amidines, , or amides, the map apparently does not reserve enough space for the reactions breaking two bonds to separate this reaction of sub-subclass EC 3.1.3.d from the reactions of EC 3.5.c.d. In neuron E2, an amide bond in a cyclic compound is broken together with cleavage of the terminal phosphate group of ATP. Neuron F3 holds deamination reactions on pyrimidine substrates (structure d in Figure 5.7) and also one reaction of Barbiturase, in which two amide bonds are broken simultaneously (structure e in Figure 5.7). Deamination reactions are also located in neuron G1, but there on purine substrates, and are therefore well separated from the deamination of pyrimidines (structures f and g in Figure 5.7). Further, reactions that open the ring of cyclic substrates, catalyzed by cyclohydrolase enzymes (structures h, i, j in Figure 5.7), are mapped into this neuron.

One further reaction of subclass EC 3.5.c.d is located in neuron A6, the reaction of Allantoicase. The enzyme catalyzing this reaction is classified as EC 3.5.3.4, acting on carbon-nitrogen bonds in open-chain amidines. The substrate does contain an amidine substructure, but the reaction does not break an amide bond. Instead, the adjacent carbon-nitrogen bond is broken (structure c in Figure 5.7). In our opinion, the classification of this reaction by the EC system is not supported by our results.
Rather, the classification method presented here finds a higher similarity to the reactions of subclass EC 3.2.c.d located in neuron A6, which break N-glycosylic bonds, among others.

The reactions of subclass EC 3.6.c.d, cleaving acid anhydrides, are clustered into three neurons: E9, F8, and also G8. All reactions of this subclass in the dataset belong to sub-subclass EC 3.6.1.d, hydrolases cleaving phosphorus anhydrides. Neuron E9 holds reactions in which a diphosphate is cleaved. In the reactions in neuron F8, a diphosphate group is always cleaved from a nucleotide triphosphate, and in the reactions in neuron G8, the terminal phosphate group is always cleaved from either a nucleotide triphosphate or a nucleotide diphosphate (structures d and e in Figure 5.8). In this neuron they occur together with reactions of sub-subclass EC 3.1.4.d because, in fact, the same type of bond is broken. This shows that the reactivity of phosphoester bonds varies considerably with their environment. These differences are clearly detected by the method based on physicochemical properties.

Three reactions of subclass EC 3.7.c.d, acting on carbon-carbon bonds in 1,3-diketones, are in the dataset. They all belong to sub-subclass EC 3.7.1.d, which is in fact the only sub-subclass of subclass EC 3.7.c.d. One reaction, Fumaryl-aceto-acetase (EC 3.7.1.2), in which a carbon-carbon bond attached to a 1,3-diketone is broken, is located in neuron A6 (structure f in Figure 5.8). A closer look at this structure reveals that the reaction center is connected to the terminal carboxylic acid group by a conjugated system. This is likely to give the carbon atom a character similar to that of a heteroatom, and the reaction is therefore considered similar to those of the glycosylases (EC 3.2.c.d) located in this neuron. Two further reactions of EC 3.7.c.d are located in neuron C9 (structure g in Figure 5.8). There they coincide with reactions of subclass EC 3.1.c.d, carboxylic ester hydrolases, and subclass EC 3.5.c.d, hydrolases breaking amide bonds. In this case, too, the first carbon atom of the reaction center is attached to a keto group and the second is very close to an amino and a carboxylic acid group. As for the preceding reaction, this is likely to give the carbon atom heteroatom character, which results in similarity to a carboxylic ester or amide bond.

For subclass EC 3.8.c.d, the dataset contains only one reaction, catalyzed by Thyroxine deiodinase (EC 3.8.1.4), in which a carbon-iodine bond is broken (structure d in Figure 5.8). The reaction is located in neuron A6 together with the glycosylases.
In summary, the classification of reactions based on physicochemical descriptors by and large reproduces the classification of enzymes by the EC code in a situation, as given here, where the identity of the reacting bond is considered in the EC classification. One might even think that the same reaction classification could have been obtained by considering the substructures at the reaction center. However, closer inspection of the two-dimensional map shows details that go beyond a classification by substructures, because in a Kohonen map the distance between neurons is also significant: smaller distances express higher similarity of reactions. As a case in point, let us discuss the distribution of the reactions that belong to subclass EC 3.1.c.d (red areas). The hydrolysis of thioesters (neuron A9) is well separated from all other reaction types of this subclass, which reflects the particularly high reactivity of this group of compounds. The hydrolysis of carboxylic acid esters (neuron C9) lies close to the hydrolysis of amides, resulting in a certain overlap with these reactions. The hydrolysis of phosphoric acid mono- and diesters in neurons G8 and G9 is well separated from the hydrolysis of carboxylic acid esters and thioesters, which reflects the quite different reactivity of these classes of compounds. In fact, it is quite surprising that the hydrolysis of phosphoric acid mono- and diesters is grouped together with that of thioesters and carboxylic acid esters in the same subclass EC 3.1.c.d. On mechanistic grounds, that is, by the type of bond broken, one would rather group the hydrolysis of phosphoric acid mono- and diesters with the hydrolysis of diphosphates and of nucleotide di- and triphosphates, which happen to be classified into subclass EC 3.6.c.d. In the Kohonen map, on the other hand, this close similarity of the hydrolysis of phosphoric acid mono- and diesters and of di- and triphosphates is well reproduced by the projection into neurons E9, F8, G8, and G9.

In order to investigate subclass EC 3.1.c.d further and to unravel finer details of the classification, we investigated this subclass separately from all other subclasses. This is followed by separate investigations of subclasses EC 3.2.c.d and EC 3.5.c.d.
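The role of map distance in the discussion above can be illustrated with a small sketch. It assumes that the letter of a neuron label indexes the row and the number indexes the column, that the map of Figure 5.6 is planar, and that the Chebyshev (king's-move) distance is an acceptable measure of neighborhood. None of these choices is specified in the text, and the helper names are hypothetical.

```python
def neuron_coords(label):
    """Convert a neuron label such as 'C9' into zero-based (row, column) indices."""
    row = ord(label[0].upper()) - ord("A")
    col = int(label[1:]) - 1
    return row, col


def grid_distance(a, b):
    """Chebyshev distance between two neurons on a planar grid (assumed metric)."""
    (r1, c1), (r2, c2) = neuron_coords(a), neuron_coords(b)
    return max(abs(r1 - r2), abs(c1 - c2))


# Neurons named in the text for the full hydrolase map.
phosphoric_ester = "G8"   # EC 3.1.3.d / EC 3.1.4.d
diphosphate = "F8"        # EC 3.6.1.d
carboxylic_ester = "C9"   # EC 3.1.1.d
thioester = "A9"          # EC 3.1.2.d

print(grid_distance(phosphoric_ester, diphosphate))       # 1
print(grid_distance(phosphoric_ester, carboxylic_ester))  # 4
print(grid_distance(phosphoric_ester, thioester))         # 6
```

Under these assumptions, the phosphoric ester neuron is adjacent to a neuron of subclass EC 3.6.c.d but several steps away from the carboxylic ester and thioester neurons of its own subclass, which is the pattern the text describes.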

5.4.2 EC 3.1.c.d

For this study, all reactions of EC subclass 3.1.c.d were explored. The 52 reactions, each with one reacting bond on the substrate side, were classified into a planar 4x6 Kohonen map. The coloring of the neurons is based on the sub-subclass c (EC 3.1.c.d), and the resulting map with the reaction centers indicated is shown in Figure 5.9.
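As a rough illustration of how reactions are projected into a planar 4x6 Kohonen map, the following sketch trains a small self-organizing map in plain Python. It is not the SONNIA implementation used in this work: the synthetic data, the number of descriptors (six), the linear decay schedules, and the Gaussian neighborhood are all assumptions made only for the example.

```python
import math
import random


def winning_neuron(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows=4, cols=6, epochs=30, lr0=0.5, sigma0=2.0, seed=0):
    """Train a planar (non-toroidal) Kohonen map; returns weights[row][col]."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for idx in order:
            x = data[idx]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
            sigma = sigma0 + (0.5 - sigma0) * frac  # neighborhood radius shrinks to 0.5
            wr, wc = winning_neuron(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - wr) ** 2 + (c - wc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighborhood
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


def neuron_label(row, col):
    """Label a neuron as in the figures, e.g. (0, 0) -> 'A1'."""
    return f"{chr(ord('A') + row)}{col + 1}"


# Synthetic stand-in for 52 reactions with 6 scaled descriptors each (hypothetical).
data_rng = random.Random(1)
data = [[data_rng.random() for _ in range(6)] for _ in range(52)]

som = train_som(data)
labels = [neuron_label(*winning_neuron(som, x)) for x in data]
print(labels[:5])
```

Each reaction ends up in the neuron whose weight vector is closest to its descriptor vector; with real reaction-center descriptors in place of the random numbers, these labels correspond to the neuron names used in the discussion.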

Figure 5.9 Projection of reactions of subclass EC 3.1.c.d (hydrolases acting on ester bonds; 52 reactions) into a two-dimensional Kohonen neural network. The coloring of the neurons is determined by the sub-subclass c. The neuron in which reactions of different sub-subclasses fall together is marked by a cross. The color of this neuron is assigned by the sub-subclass to which the majority of reactions belong.
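The coloring rule described in the caption (majority sub-subclass per neuron, with a cross on neurons that hold more than one sub-subclass) can be sketched as follows. The function name and the reaction counts per neuron are illustrative assumptions; only the sub-subclasses placed in neuron D3 are taken from the text. Ties, which the caption does not address, are resolved here in favor of the sub-subclass encountered first.

```python
from collections import Counter, defaultdict


def color_neurons(assignments):
    """Map each neuron to (majority sub-subclass, conflict flag).

    assignments: iterable of (neuron_label, sub_subclass) pairs,
    one pair per reaction projected into the map.
    """
    per_neuron = defaultdict(Counter)
    for neuron, sub_subclass in assignments:
        per_neuron[neuron][sub_subclass] += 1

    result = {}
    for neuron, counts in per_neuron.items():
        majority, _ = counts.most_common(1)[0]
        has_conflict = len(counts) > 1  # would be marked by a cross
        result[neuron] = (majority, has_conflict)
    return result


# Illustrative assignments; the counts are hypothetical.
assignments = [
    ("A4", "3.1.1"), ("A4", "3.1.1"),
    ("C6", "3.1.2"),
    ("D3", "3.1.4"), ("D3", "3.1.4"), ("D3", "3.1.4"),
    ("D3", "3.1.5"), ("D3", "3.1.6"),
]

for neuron, (majority, conflict) in sorted(color_neurons(assignments).items()):
    marker = " (x)" if conflict else ""
    print(f"{neuron}: EC {majority}.d{marker}")
```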

At first glance, the dataset is split mainly into four areas, formed by the sub-subclasses EC 3.1.1.d (carboxylic ester hydrolases; red area), EC 3.1.2.d (thioester hydrolases; orange area), EC 3.1.3.d (phosphoric monoester hydrolases; yellow area), and EC 3.1.4.d (phosphoric diester hydrolases; green area). The reactions catalyzed by the enzymes of these four sub-subclasses are completely separated and do not overlap in any neuron. Neuron D3 additionally holds reactions of two further sub-subclasses: EC 3.1.5.d and EC 3.1.6.d.

Neurons A4, A5, and A6 are filled by reactions of sub-subclass EC 3.1.1.d, in which a carboxylic acid ester bond is hydrolyzed. In most cases, the hydrolysis of a phospholipid or (phospho-)triglyceride, or the opening of a lactone, is catalyzed. Next, a well-separated cluster is formed by the reactions of sub-subclass EC 3.1.2.d, in which a thioester bond is cleaved, occupying neurons C6 and D6. The large difference in reactivity compared to the carboxylic acid ester hydrolases and phosphoric ester hydrolases supports this strong separation.

The entire left-hand side of the map, from column one to column three, is occupied by reactions hydrolyzing a phosphoric ester bond. They comprise the sub-subclasses EC 3.1.3.d (neurons A1, B2, B3, and C1) and EC 3.1.4.d (neurons C3, D1, and D3). The joint separation of these two sub-subclasses from sub-subclasses EC 3.1.1.d and EC 3.1.2.d reflects the high similarity of the phosphoric ester hydrolyses. Although the same type of bond, a P-O bond, is broken in both cases, the two sub-subclasses are nevertheless clearly separated by the network, as the bonds cleaved in phosphoric monoesters and diesters differ somewhat in their physicochemical properties and reaction rates. Furthermore, finer details within this cluster can be detected. In the reactions of sub-subclass EC 3.1.3.d that fall into neuron A1, a phosphate group is cleaved from a variety of substrates such as sugars, amino acids, or nucleotides. These reactions are clearly separated from those in neurons B2 and B3, in which the substrate carries a second phosphoric ester group. In the reactions that fall into neuron C1, a phosphate group is cleaved from a multiply phosphorylated inositol residue.

The reactions of sub-subclass EC 3.1.4.d are located in neurons C3, D1, and D3. In the reaction in neuron C3, a cyclic diphosphoester is broken. Neuron D1 holds reactions of phospholipases and phosphodiesterases. Two reactions of additional sub-subclasses are located in neuron D3: dGTPase (EC 3.1.5.1), hydrolyzing a triphosphate group from dGTP (structure b in Figure 5.9), and Sterylsulfatase (EC 3.1.6.2), hydrolyzing a sulfate group from a steryl substrate (structure a in Figure 5.9).
These two reactions are the only ones of their sub-subclasses in the dataset. The placement of the dGTPase reaction in a neuron carrying reactions that break phosphodiester bonds is clearly warranted, as here a phosphodiester bond is broken as well. The reaction of sterylsulfatase is apparently governed by physicochemical effects similar to those operating in phosphodiesters, which results in its projection into the same neuron.

In summary, as already described in section 5.4.1, the differences in reactivity between the reaction types grouped within this EC subclass are quite distinct, and this is reflected by their clear separation in the map of Figure 5.9. We believe that these distinct differences should be taken into account at a higher level of the classification scheme.
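The levels of the EC hierarchy referred to here (class, subclass, sub-subclass, serial number) can be made explicit by grouping EC numbers on their leading fields. The sketch below is only an illustration with hypothetical helper names; the EC numbers are ones mentioned in this chapter.

```python
from collections import defaultdict


def ec_prefix(ec_number, level):
    """Return the first `level` fields of an EC number, e.g. ('3.1.5.1', 2) -> '3.1'."""
    return ".".join(ec_number.split(".")[:level])


def group_by_level(ec_numbers, level):
    """Group EC numbers by their prefix at the given hierarchy level."""
    groups = defaultdict(list)
    for ec in ec_numbers:
        groups[ec_prefix(ec, level)].append(ec)
    return dict(groups)


# EC numbers mentioned in the text.
ec_numbers = ["3.1.5.1", "3.1.6.2", "3.5.3.4", "3.5.4.9", "3.7.1.2", "3.8.1.4"]

print(group_by_level(ec_numbers, 2))  # grouping at the subclass level
print(group_by_level(ec_numbers, 3))  # grouping at the sub-subclass level
```

At the second level, dGTPase (EC 3.1.5.1) and sterylsulfatase (EC 3.1.6.2) share the group 3.1; at the third level they fall into separate groups, which is the level at which the EC system currently distinguishes the reaction types discussed above.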

5.4.3 EC 3.2.c.d

For this study, 18 reactions of subclass EC 3.2.c.d, glycosylases, were classified into a planar 4x2 Kohonen neural network (Figure 5.10).

Figure 5.10 Projection of reactions of subclass EC 3.2.c.d (glycosylases; 18 reactions) into a two-dimensional Kohonen neural network. The coloring of the neurons is determined by the sub-subclass c. The color of the neurons is assigned by the sub-subclass to which the majority of reactions belong. The label indicates the type of bond broken by the reactions in the neurons.

The dataset contains reactions of two sub-subclasses, EC 3.2.1.d (glycosylases hydrolyzing O- and S-glycosyl compounds) and EC 3.2.2.d (glycosylases hydrolyzing N-glycosyl compounds). The two sub-subclasses are completely separated by the method based on physicochemical effects, without any conflicts. Thus, the conflict observed in the classification of the EC 3.b.c.d dataset is now resolved, which emphasizes that classifying subsets enables a more detailed classification.

The eleven reactions of sub-subclass EC 3.2.1.d are deposited into two neurons: ten reactions fall into neuron B4 and one into neuron A4. These reactions cover the hydrolysis of several di- or polysaccharides, breaking an O-glycosylic bond. Nearly three quarters of the map is occupied by the reactions of EC 3.2.2.d, hydrolyzing an N-glycosylic bond. The eight reactions are spread evenly over five neurons: A1, A2, B1, B2, and B3. In more detail, in neurons A1 and B1 the nicotinamide part is cleaved from NAD or NADP, respectively. Neurons B1 and B2 contain the hydrolysis of purine nucleotides, and neuron B3 that of a pyrimidine nucleotide.

This study shows that the investigation of subsets can reveal more details, as the separation of the two sub-subclasses could not be achieved in the EC 3.b.c.d study described in section 5.4.1.

5.4.4 EC 3.5.c.d

For this investigation, the dataset, containing 44 reactions of subclass EC 3.5.c.d (hydrolases, acting on carbon-nitrogen bonds, other than peptide bonds), was projected into a planar 4x6 Kohonen map which is shown in Figure 5.11.

Figure 5.11 Projection of the reactions of subclass EC 3.5.c.d (hydrolases acting on carbon-nitrogen bonds, other than peptide bonds; 44 reactions) into a two-dimensional Kohonen neural network. The coloring of the neurons is assigned by the sub-subclass c. Neurons in which reactions of different sub-subclasses collide are marked by a cross. In these neurons, the color is determined by the sub-subclass to which the majority of reactions belong. The dotted line marks the border between the area with reactions in which one bond is broken and the area with reactions in which two bonds are broken.

The dataset was restricted to reactions with one or two reacting bonds. At first glance, all four sub-subclasses of EC 3.5.c.d contained in the dataset are separated into different areas of the map. A clear separation can be seen between reactions in which two bonds are broken and those in which one bond is broken, each group comprising 22 reactions: all reactions to the left of column five in Figure 5.11 break two bonds, while the reactions to the right of column five break one bond. In the following analysis, we first inspect the reactions breaking two bonds and then concentrate on the reactions breaking one bond. The reaction centers of the reactions with two bonds broken are indicated for the various neurons in Figure 5.12, showing how the reactions are projected into the network.

Figure 5.12 Detail of Figure 5.11 (left-hand part). In this Figure, the contents of the neurons containing reactions breaking two bonds simultaneously are labeled by the reaction centers of the mapped reactions. For each neuron, the reaction center and its neighboring atoms and bonds are shown. Crossed bonds indicate a change in bond order; bonds doubly crossed indicate a bond broken. The color of the neurons is determined by the subclass to which the majority of reactions belong. The letter captions are explained in the text.

Four reactions of sub-subclass EC 3.5.1.d, hydrolases acting on carbon-nitrogen bonds in imides, are located in neurons A3 and A4. Two amide bonds are broken simultaneously in these imides, releasing carbon dioxide and ammonia. Three reactions of sub-subclass EC 3.5.2.d, hydrolases acting on carbon-nitrogen bonds in lactams, are deposited into neuron C4. Besides the amide bond of the lactam, another amide bond is broken in one reaction, and in the other two a phosphate ester bond of ATP is hydrolyzed simultaneously. Three reactions of sub-subclass EC 3.5.3.d, breaking bonds in amidines or guanidines, are located in neuron B4. A guanidinium group is cleaved from the substrate, releasing urea. In fact, only one bond is broken, and a second bond changes its bond order.

Twelve reactions of sub-subclass EC 3.5.4.d, hydrolases acting on cyclic amidines, are distributed over neurons A1, A2, B1, D1, and D4. In detail, the reactions located in neurons A1 and A2 occur at bonds attached to the carbon atom at position 6 of a purine ring system and perform either a deamination or a ring opening (structures a and b in Figure 5.12). The one reaction located in neuron B1 is catalyzed by Methenyltetrahydromethanopterin-cyclohydrolase (EC 3.5.4.27) and also opens a ring structure. The reactions of these three neurons are clearly separated from the reactions in neuron D1, in which bonds attached to the carbon atom at position 2 of a purine ring system are affected, again either by deamination or by ring opening (structures c and d in Figure 5.12). Evidently, the position in the ring leads to a strong separation of the reactions in the map. Neuron D4 holds deamination reactions of cytidine, deoxycytidine, cytosine, and deoxycytidine triphosphate. Again, the reactions in this neuron are strongly separated from the others of this sub-subclass, showing the different reactivity of a pyrimidine ring compared to a purine ring system.

Now let us turn toward the reactions of subclass EC 3.5.c.d breaking one bond. These reactions are far more interwoven than the reactions breaking two bonds and are deposited in neurons A6, B6, C6, and D6. The reaction centers for these reactions are indicated in Figure 5.13.

Figure 5.13 Detail of Figure 5.11 (right-hand part). In this Figure, the contents of the neurons containing reactions breaking one bond in the reaction process are labeled by the reaction centers of the mapped reactions. For each neuron, the reaction center and its neighboring atoms and bonds are shown. Crossed bonds indicate a change in bond order; bonds doubly crossed indicate a bond broken. The color of the neurons is determined by the subclass to which the majority of reactions belong. The letter captions are explained in the text.

The twelve reactions of sub-subclass EC 3.5.1.d, hydrolases breaking open-chain amides, are distributed over neurons A6, B6, and C6 (structures a, b, c in Figure 5.13). Neurons A6 and C6 also contain reactions of other sub-subclasses. The distribution of these reactions over the neurons is strongly correlated with the substructure: in neuron A6 a primary amide is always broken with release of ammonia, whereas in neuron B6 a secondary amide is always hydrolyzed. In the reactions in neuron C6, a formyl group is hydrolyzed from the substrate. This clearly reflects the influence of the environment of the reaction center on the reactivity of a bond.

Six reactions of sub-subclass EC 3.5.2.d, hydrolases breaking bonds in lactams, fall into neurons A6 (one reaction) and C6 (structures d, e in Figure 5.13). They are mixed with reactions of sub-subclass EC 3.5.1.d. This indicates that the distinction between open-chain and cyclic amides and amidines is not reproduced by the method presented here, as no direct impact on the reactivity of the amide bond is to be expected.

The two reactions of sub-subclass EC 3.5.3.d fall into neurons A6, Arginine deiminase, and D6, Allantoicase (structures f, g in Figure 5.13). For Arginine deiminase, the terminal carbon-nitrogen bond of the guanidinium group is broken. For Allantoicase, the molecule does carry an amidine group, but the broken bond is not located in this group (see structure g in Figure 5.13). These two reactions are indeed very different, and their classification into the same sub-subclass is not warranted in our opinion.

Finally, the two reactions of sub-subclass EC 3.5.4.d fall into neurons D6, Methenyl-THF-cyclohydrolase (EC 3.5.4.9) (structure h in Figure 5.13), and A6, Creatinine deiminase (EC 3.5.4.21) (structure i in Figure 5.13). In the reaction of Creatinine deiminase in neuron A6, a carbon-nitrogen bond attached to a five-membered ring is broken, whereas in the reaction of Methenyl-THF-cyclohydrolase, a carbon-nitrogen bond that is part of a five-membered ring is broken. Clearly, the physicochemical effects on the two reactions are different, leading to this clear separation. Here, too, the classification into the same sub-subclass is not warranted when the reactivity of the reacting bond is considered.

Looking at the reactions not from the standpoint of EC sub-subclasses but from their assignment to neurons, the following observations can be made. In neuron A6, all reactions break the terminal carbon-nitrogen bond of a primary amide or guanidinium group, releasing ammonia. This is true for all reactions but one (EC 3.5.2.7), in which a cyclic amide bond is broken. In the reactions located in neuron B6, an amide bond of a secondary amide is broken in all cases. In neuron C6, two reactions break an amide bond with release of a formyl group; in the remaining reactions, the amide bond of a lactam is broken. Finally, neuron D6 contains two reactions: that of Methenyl-THF-cyclohydrolase (EC 3.5.4.9), breaking a cyclic amidine with opening of a ring structure, and that of Allantoicase (EC 3.5.3.4), in which the broken bond lies outside the amide bond, as stated before.

In conclusion, this study gives a good example of how the character of a bond is strongly influenced by the atoms and bonds in its environment. The reactions catalyzed by enzymes of sub-subclass EC 3.5.4.d are strongly separated even though the same type of bond is broken. The separation is governed by the type of ring system to which the reacting bond is attached, purine or pyrimidine, or by the location of the bond in the purine ring system. Such differences in bond reactivity are hard to recognize by visual inspection of the substructure, and this result could not be achieved in the study of dataset EC 3.b.c.d; only the inspection of the subset reveals these finer details. Apart from the separation caused by the different number of broken bonds, the reactions of the sub-subclasses are indeed mapped into the same region of the map. This means that, despite the differences in the bonds broken, the similarity of the reaction centers is recognized by the network. This highlights the advantage of similarity perception by a self-organizing neural network: neurons in a certain part of the map reflect a certain type of similarity, and, on top of that, distances between the neurons in such an area reflect the degree of similarity.

5.5 Conclusions

These studies have shown that for the enzyme class EC 3.b.c.d, hydrolases, the bond properties at the reaction center produce a classification that overall compares well with the EC system. For this type of reaction, the result meets expectations, as here the EC classification system is also based to a large extent on the reaction mechanism. However, the method presented here perceives finer details, showing how the reacting bonds are influenced by the atoms and bonds in their neighborhood. In particular, the classification of reactions based on the physicochemical effects of the reacting bonds shows similarities and differences between the reactions that go beyond the phenomenological classification of the EC system, as shown in the detailed discussions of this publication. A classification on the basis of physicochemical bond properties implicitly takes account of the reaction mechanism and the events in the enzyme pocket and thus allows the perception of novel similarities between enzyme-catalyzed reactions and, by the same token, between enzymes. It should be emphasized that the similarity of reactions is encoded by the physicochemical effects.

Data analysis by a Kohonen neural network was chosen only because of the intuitive visualization of similarities in a map. Clearly, a variety of other data analysis methods could be used for perceiving such similarities; some of these methods would even allow a numerical value of similarity to be assessed. The method of classifying reactions by reaction center properties presented here can act as a valuable tool in the exploration of the metabolome and in drug research. It can reveal unknown capabilities of enzymes to catalyze certain reactions or define a family of enzymes capable of catalyzing a given reaction type.
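One simple way to obtain such a numerical similarity value, given here only as an illustration, is to compare the descriptor vectors of two reaction centers directly. The sketch assumes that the descriptors have already been scaled to comparable ranges; the descriptor values, the reaction names, and the mapping from distance to similarity are hypothetical and not taken from this work.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two descriptor vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("descriptor vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity(u, v):
    """Map a distance to the interval (0, 1]; 1.0 means identical vectors."""
    return 1.0 / (1.0 + euclidean(u, v))


# Hypothetical, already scaled descriptor vectors for three reaction centers.
reactions = {
    "ester_1": [0.20, 0.70, 0.10],
    "ester_2": [0.25, 0.65, 0.15],
    "phosphate_1": [0.90, 0.10, 0.60],
}

for (name_a, vec_a), (name_b, vec_b) in combinations(reactions.items(), 2):
    print(f"{name_a} vs {name_b}: {similarity(vec_a, vec_b):.3f}")
```

Unlike the position in a Kohonen map, such a pairwise value can be ranked or thresholded directly, at the cost of losing the two-dimensional overview.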

Acknowledgement We thank the ‘Bundesministerium fuer Bildung und Forschung’ (BMBF) for funding our research within the project ‘Bioinformatics for the Functional Analysis of Mammalian Genomes’ (BFAM), part of the ‘Nationales Genomforschungsnetz Deutschland’ (NGFN), projects 031U112D and 031U212D. We also thank Dr. Oliver Sacher, Molecular Networks GmbH, for his help with descriptor encoding and Dr. Thomas Kleinoeder and Dr. Lothar Terfloth, both of our research group, for their support in the use of the PETRA and SONNIA software.

References

[1] Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology on the Nomenclature and Classification of Enzymes by the Reactions they Catalyse [internet]. http://www.chem.qmul.ac.uk/iubmb/enzyme/

[2] Arita M. The metabolic world of Escherichia coli is not small. Proc Natl Acad Sci USA 2004; 101:1543-1547.

[3] Izrailev S, Farnum MA. Enzyme classification by ligand binding. Proteins 2004; 57:711-724.

[4] Dobson PD, Doig AJ. Predicting enzyme class from protein structure without alignments. J Mol Biol 2005; 345:187-199.

[5] Kotera M, Okuno Y, Hattori M, Goto S, Kanehisa M. Computational assignment of the EC numbers for genomic-scale analysis of enzymatic reactions. J Am Chem Soc 2004; 126:16487-16498.

[6] Zhang QY, Aires-de-Sousa J. Structure-based classification of chemical reactions without assignment of reaction centers. J Chem Inf Model 2005; 45:1775-1783.

[7] Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Gueldener U, Mannhaupt G, Muensterkoetter M, Mewes HW. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 2004; 32:5539-5545.

[8] Reitz M, Sacher O, Tarkhov A, Truembach D, Gasteiger J. Enabling the exploration of biochemical pathways. Org Biomol Chem 2004; 2:3226-3237.

[9] Michal G. Biochemical Pathways Wall Chart, third edition. Mannheim (Germany): Boehringer (now Roche); 1993, also on the internet at: http://www.expasy.org/tools/pathways/.

[10] Michal G. 1999. Biochemical Pathways. An Atlas of Biochemistry and Molecular Biology. Heidelberg, Germany: Spektrum Akademischer Verlag. 277 p.

[11] C@ROL, Erlangen: Molecular Networks GmbH, http://www.molecular-networks.com.

[12] Gasteiger J. Empirical Methods for the Calculation of Physicochemical Data of Organic Compounds In Physical Property Prediction in Organic Chemistry. In: Jochum C, Hicks MG, Sunkel J, editors. Heidelberg (Germany): Springer Verlag; 1988. p 119-138.

[13] Sacher O. 2001. Dissertation. Erlangen (Germany): University of Erlangen-Nuernberg.

[14] Gasteiger J, Marsili M. Iterative Partial Equalization of Orbital Electronegativity - A Rapid Access to Atomic Charges. Tetrahedron 1980; 36: 3219-3228.

[15] Gasteiger J, Saller H. Calculation of the Charge Distribution in Conjugated Systems by a Quantification of the Resonance Concept. Angew Chem Int Ed Engl 1985; 24: 687-689.

[16] Gasteiger J, Hutchings MG. Quantitative Models of Gas-Phase Proton Transfer Reactions Involving Alcohols, Ethers, and their Thio Analogs. Correlation Analyses Based on Residual Electronegativity and Effective Polarizability. J Am Chem Soc 1984; 106: 6489-6495.

[17] Froehlich A. 1993. Dissertation. Muenchen (Germany): Technical University Muenchen.

[18] Satoh H, Sacher O, Nakata T, Chen L, Gasteiger J, Funatsu K. Classification of Organic Reactions: Similarity of Reactions Based on Changes in the Electronic Features of Oxygen Atoms at the Reaction Sites. J Chem Inf Comput Sci 1998; 38: 210-219.

[19] Chen L, Gasteiger J. Reactions Classified by Neural Networks: Michael Additions, Friedel-Crafts Alkylations by Alkenes, and Related Reactions. Angew Chem Int Ed Engl 1996; 35: 763-765.

[20] SONNIA, Erlangen: Molecular Networks GmbH, http://www.molecular-networks.com.

[21] Zupan J, Gasteiger J. 1999. Neural Networks in Chemistry and Drug Design (Second Edition). Weinheim: Wiley-VCH. 380 p.

5.6 Further Conclusions

With the classification of metabolic reactions, it was shown how the assignment of reaction centers can help in building a classification scheme that is based exclusively on physicochemical effects. Such a classification has several advantages over a manual classification such as the EC code. First, its basis is reproducible and clear-cut, in contrast to the existing EC system, where EC classes have to be rearranged and reactions reassigned on a regular basis as new knowledge becomes available. Second, the classification method shown is highly adaptable to current needs: by choosing different sets of descriptors, the focus of the classification can be changed. In addition, the descriptors can be calculated very rapidly by empirical methods such as those used here. The classification shown here was performed with descriptors reflecting the requirements of enzymatic reactions. However, this kind of classification is not restricted to enzymatic reactions and can be extended to all kinds of organic reactions [1].

The dataset used for this study contained only reactions belonging to class EC 3, hydrolases. This choice allows a better comparison with the EC system, because this class is based mainly on the reaction mechanism and thus corresponds to the results obtained by the method presented here. For other classes, e.g. class EC 1, oxidoreductases, discrepancies can be expected, because there the EC system focuses on other classification criteria, namely the electron donor and acceptor compounds. However, datasets can be selected by many criteria. The C@ROL retrieval system, as shown in chapter 2, enables the selection of a variety of datasets from BioPath based on reaction criteria, e.g. all reactions in which a carbon-carbon bond is made. The study of datasets of this kind can serve as a valuable tool for understanding the reactivity of metabolic processes.
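The selection criterion mentioned above (all reactions in which a carbon-carbon bond is made) can be illustrated with a minimal Python sketch. It assumes that each reaction is available as two sets of bonds between atom-atom mapping numbers plus an element lookup. The data layout, the function names, and the two toy reactions are hypothetical; they are neither the BioPath schema nor the C@ROL query interface, and changes in bond order are ignored for simplicity.

```python
def bond(a, b):
    """A bond as an unordered pair of atom-atom mapping numbers."""
    return frozenset((a, b))


def reaction_center(reactant_bonds, product_bonds):
    """Return (broken, made): bonds present on only one side of the reaction."""
    return reactant_bonds - product_bonds, product_bonds - reactant_bonds


def makes_bond_between(reaction, elem_a, elem_b):
    """True if the reaction forms at least one bond between the two elements."""
    _, made = reaction_center(reaction["reactant_bonds"], reaction["product_bonds"])
    wanted = sorted((elem_a, elem_b))
    return any(sorted(reaction["elements"][i] for i in b) == wanted for b in made)


# Invented toy reactions (assumption: not taken from BioPath).
toy_reactions = {
    "hydrolysis_like": {
        "elements": {1: "C", 2: "O", 3: "C", 4: "O"},
        "reactant_bonds": {bond(1, 2), bond(2, 3)},
        "product_bonds": {bond(1, 4), bond(2, 3)},
    },
    "cc_coupling_like": {
        "elements": {1: "C", 2: "C", 3: "O"},
        "reactant_bonds": {bond(2, 3)},
        "product_bonds": {bond(1, 2), bond(2, 3)},
    },
}

# Select every toy reaction in which a carbon-carbon bond is made.
selected = [name for name, rxn in toy_reactions.items()
            if makes_bond_between(rxn, "C", "C")]
print(selected)  # ['cc_coupling_like']
```

Because the atoms carry the same mapping numbers on both sides, the reaction center falls out of a plain set difference; this is the reason the mapping information matters for this kind of query.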

References

[1] Sacher, O. 2001. Dissertation. Erlangen, Germany: University of Erlangen-Nuernberg.

6 Conclusions & Outlook

The results achieved in this work give an idea of the impact of chemoinformatics on modern biological research. To gain these benefits, the quality of the data source is of crucial importance. Only the marking of the reaction centers and the atom-atom mapping information, which are an outstanding feature of the BioPath database, allowed the application of the methods presented here. This information was entered manually in laborious work, an approach that is not feasible for very large datasets. However, attempts are now under way to carry out this task automatically by computer [1,2]. Applying these methods to large databases such as KEGG [3] will open a treasure trove of new information and support the highly active field of metabolic investigation [4]. On the basis of the applications presented here, we have demonstrated how this information on the reaction center can be used to perform comprehensive searches in metabolic reaction data and how it can be turned into new knowledge. Of course, the applications presented here can only give an idea of the enormous capabilities buried in these kinds of data.

With the search for enzyme inhibitors, we have shown how the reaction center information can be used to automatically generate intermediates of enzymatic reactions. These intermediates, optionally enriched by rapidly calculated physico-chemical properties, can be used successfully as a search query for new potential enzyme inhibitors without knowledge of the 3D structure of the enzyme. This allows exploration of the chemical space in a fast and inexpensive manner. Even compounds that have not been synthesized at all can be investigated using virtual compound libraries. The method presented shows that databases can be screened in this way using only information on the reaction itself. By using a reaction intermediate as the search query, a fast method is available that bypasses extensive studies on the enzyme structure. In addition, knowledge about the retrieved compounds allows conclusions to be drawn on the involvement of enzymes in the mechanisms of diseases. Conversely, predictions on the structural and electronic features operating in the binding pocket of the enzyme can be made by exploring the structural similarities of the retrieved molecules.

With the classification of metabolic reactions, we have demonstrated how the reaction center information contained in BioPath can be used to build a classification of metabolic reactions based solely on this information. By using physico-chemical properties calculated for the bonds that are part of the reaction center, we have shown that this classification corresponds to a large extent to the established EC classification system. Furthermore, it shows finer details and gives chemically sound results, as it takes implicit account of the reaction mechanism and goes beyond the phenomenological classification of the EC system. Beyond that, a classification like this can serve as a tool in many ways. For example, it can be used to search for enzymes that can catalyze a reaction if the original reaction is blocked or down-regulated. Alternatively, querying the system with a reaction can lead to other, similar reactions and allow the prediction of enzymes that are potentially able to catalyze the query reaction. This can be used, for example, to predict the metabolism of drugs and xenobiotics.
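Querying the classification with a reaction, as described above, amounts to a similarity search in descriptor space. The following Python sketch ranks stored reactions by Euclidean distance to a query vector. It is an illustration under stated assumptions only: the reaction names, the six-component vectors, and the choice of Euclidean distance are invented here and do not reproduce the descriptor values or the similarity measure used in this work.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def most_similar(query, library, k=2):
    """Return the names of the k library reactions closest to the query."""
    ranked = sorted(library, key=lambda name: euclidean(query, library[name]))
    return ranked[:k]


# Hypothetical library: one six-component descriptor vector per reaction,
# assumed to be scaled to a common range beforehand.
library = {
    "rxn_A": (0.10, 0.20, 0.30, 0.40, 0.50, 0.60),
    "rxn_B": (0.90, 0.80, 0.70, 0.60, 0.50, 0.40),
    "rxn_C": (0.15, 0.25, 0.30, 0.40, 0.50, 0.55),
}

query = (0.12, 0.20, 0.30, 0.40, 0.50, 0.60)
print(most_similar(query, library))  # ['rxn_A', 'rxn_C']
```

The enzymes annotated for the top-ranked reactions would then be the candidates for catalyzing the query reaction; the ranking itself says nothing about whether they actually do.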

References

[1] Apostolakis, J.; Sacher, O.; Körner, R.; Gasteiger, J. Automatic Determination of Reaction Center Information for Biochemical Reactions. submitted to Journal of the American Chemical Society 2007.

[2] Körner, R.; Apostolakis, J. A simple Theory for the Prediction of Reaction Center Information and Reaction Mappings. submitted to Bioinformatics 2006.

[3] Kanehisa, M.; Goto, S.; Hattori, M.; Aoki-Kinoshita, K. F.; Itoh, M.; Kawashima, S.; Katayama, T.; Araki, M.; Hirakawa, M. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006, 34, 354-357. http://www.genome.jp/kegg.

[4] Arita, M. The metabolic world of Escherichia coli is not small. Proc. Natl. Acad. Sci USA 2004, 101, 1543-1547. http://www.metabolome.jp.

7 Summary

This work shows how a database containing chemical structure and reaction data can be explored by methods of chemoinformatics to gain new knowledge for biological or pharmaceutical purposes. In particular, the storage of all compounds and reactions as connection tables, together with the annotation with atom-atom mapping numbers and the marking of reaction centers contained in the database presented here, allows these data to be accessed by chemoinformatics methods. The following results were achieved in this work:

1) A database of metabolic reactions (BioPath) was used as the starting point for the investigations presented here. The database is based on the ‘Biochemical Pathways’ posters, which until then were available only in printed form or as a scanned image on the internet, and on the accompanying atlas with extended information. All reactions, together with the available annotations, especially those on the corresponding enzymes, were stored in the database. Additionally, a) care was taken to enter all reactions in stoichiometrically correct form, b) all atoms on the reactant and product sides were marked with atom-atom mapping numbers to allow an exact correlation between these atoms, and c) all bonds that are part of the reaction center, i.e. those that are made or broken in the course of the reaction, were marked. The database is available in different versions, among others for MDL ISIS/Base, IBM DB2, and the C@ROL system developed by Molecular Networks. The C@ROL system allows data retrieval over the internet and the use of enhanced search methods such as substructure or reaction substructure searches. The BioPath database was presented in detail in a publication in the journal Organic and Biomolecular Chemistry. In a subsequent revision, the data quality and the user interface of the database were improved.

2) The information on the bonds broken and made in the BioPath database was used to generate intermediates of enzymatic reactions. By 3D superimposition of these intermediates onto several transition state analog inhibitors, we have shown that the reaction intermediate can be considered structurally quite similar to the transition state. This was investigated for the enzymes AMP deaminase, triosephosphate isomerase, and arginase II. These studies were performed in preparation for using such generated intermediates as templates for an automated screening for transition state analog inhibitors in large compound databases. The results of these investigations were published in the Journal of Chemical Information and Modeling. Moreover, these studies showed that a search based solely on the 3D structure, optionally enriched by constraints on physico-chemical properties, can deliver results comparable to a knowledge-based superimposition. Subsequently, a database screening was performed for inhibitors of the enzyme AMP deaminase using the intermediate of its reaction as the search template. It was shown that compounds with reported inhibitory activity against adenosine deaminase or AMP deaminase were enriched. Further, the retrieved compounds have reported activities that correlate with those of inhibitors of AMP deaminase.

3) The information on the reacting bonds contained in the BioPath database was used to develop a new classification of metabolic reactions. The classification is based on a selection of bond descriptors. For this purpose, six physico-chemical descriptors describing the reactivity of a bond were calculated for those bonds of the substrate that are involved in the reaction. The resulting vector can then serve as input for data analysis methods. Here, the visualization was performed with Kohonen neural networks. For the investigated dataset, consisting of reactions catalyzed by hydrolases, we have shown that the classification by this method delivers results comparable to the established EC nomenclature. Further, our classification can deliver finer details reflecting how the reacting bonds are influenced by the atoms and bonds in their neighborhood. The results of these investigations were submitted for publication to the journal Proteins: Structure, Function, and Bioinformatics.
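The mapping of six-component bond descriptor vectors onto a Kohonen network, summarized in item 3, can be sketched in a few lines of Python. This is a generic, minimal self-organizing map on a planar grid, written only to illustrate the principle; it is not the SONNIA implementation used in this work. The grid size, the learning-rate and neighborhood schedules, and the toy data are assumptions chosen for brevity.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, seed=0):
    """Train a small planar Kohonen map; returns a rows x cols grid of weights."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.5      # neighborhood radius shrinks
        for x in rng.sample(data, len(data)):    # present vectors in random order
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    dist2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-dist2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])  # pull neuron toward x
    return weights


# Invented toy data: six scaled descriptor values per reacting bond.
toy_data = [
    [0.10, 0.20, 0.10, 0.30, 0.20, 0.10],
    [0.15, 0.25, 0.10, 0.35, 0.20, 0.15],
    [0.12, 0.18, 0.12, 0.28, 0.22, 0.10],
    [0.90, 0.80, 0.90, 0.70, 0.80, 0.90],
    [0.85, 0.80, 0.95, 0.75, 0.80, 0.85],
    [0.88, 0.82, 0.90, 0.70, 0.78, 0.90],
]

weights = train_som(toy_data)
for vector in toy_data:
    print(best_matching_unit(weights, vector))
```

After training, each reaction is represented by the grid position of its best matching neuron, so reactions with similar descriptor vectors tend to fall into the same or neighboring neurons. Coloring the neurons by EC subclass would then give the kind of map that is compared with the EC nomenclature.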

8 Zusammenfassung (Summary)

This work describes how a database of chemical structure and reaction data can be analyzed by methods of chemoinformatics in order to gain new insights for biological and pharmaceutical questions. In particular, the storage of the structures and reactions in the form of connection tables, together with their annotation by atom-atom mapping numbers and by marking of the reaction centers, allows access by these methods. The following results were achieved in this work:

1) A database of metabolic reactions (BioPath) was used as the starting point for the applications shown here. The database is based on the "Biochemical Pathways" posters, which until then were available only in print or as a scanned image on the internet, and on the accompanying atlas with extended information. For this purpose, all reactions were stored in the database together with the available annotations, above all the corresponding enzyme information. In addition, a) care was taken to enter all reactions in stoichiometrically correct form, b) all atoms on the reactant and product sides were marked with atom-atom mapping numbers to allow an exact assignment between the atoms, and c) all bonds that are part of the reaction center, i.e. those that are broken or formed in the course of the reaction, were marked. The database is available in different versions, e.g. for MDL ISIS/Base, IBM DB2, and the C@ROL system of the company Molecular Networks. With the C@ROL system it is possible to query the database over the internet and to use enhanced search methods, e.g. substructure or reaction substructure searches. The BioPath database was presented in detail in a publication in the journal Organic and Biomolecular Chemistry. In a subsequent revision, the database was reworked with respect to the quality of the data and the user interface.

2) The information on the broken and formed bonds contained in the BioPath database was used to generate intermediates of reactions that proceed from an sp2-hybridized carbon atom via an sp3 state back to an sp2 hybrid. The intermediates generated in this way could be used successfully as the starting point for an automated search for transition state analogs of the enzyme catalyzing the respective reaction. This was illustrated for the enzymes AMP deaminase, triosephosphate isomerase, and arginase II. These examples showed that a search based on the 3D structure, optionally enriched by constraints on physico-chemical properties, delivers results of similar quality to a knowledge-based approach. The results of these investigations were published in the Journal of Chemical Information and Modeling. Furthermore, a database search for inhibitors of the enzyme AMP deaminase was carried out with this method. It was shown that compounds with the property of inhibiting the enzymes adenosine deaminase or AMP deaminase could be enriched. Some of the other activities annotated for the retrieved compounds are described in the literature as being related to AMP deaminase inhibitors.

3) The information on the reacting bonds in the BioPath database was used to develop a novel classification of enzymatic reactions. This classification is based on a selection of bond descriptors for those bonds in the substrate of an enzymatic reaction that take part in the reaction. For each of these bonds, six physico-chemical properties describing the reactivity of the respective bond were calculated. The resulting vector can then be used as input for various data analysis methods. The visualization of the results was performed here with self-organizing Kohonen maps. For the investigated dataset, consisting of reactions belonging to the class of hydrolases, it could be shown that the classification by this method delivers results comparable to the existing EC nomenclature. Furthermore, this classification revealed finer details that reflect how the reacting bonds are influenced by the neighboring atoms and bonds. A publication of these results was submitted to the journal Proteins: Structure, Function, and Bioinformatics.

Publications

• Reitz, M.; Gasteiger, J. Classification of Metabolic Reactions Based on Physicochemical Descriptors: Investigations on Hydrolases. submitted to PROTEINS: Structure, Function, and Bioinformatics 2007.

• Gasteiger, J.; Reitz, M.; Han, Y.; Sacher, O. Analyzing biochemical pathways using neural networks and genetic algorithms. Aust. J. Chem. 2006, 59, 854-858.

• Reitz, M.; von Homeyer, A.; Gasteiger, J. Query Generation to Search for Inhibitors of Enzymatic Reactions. J. Chem. Inf. Model. 2006, 46, 2333-2341.

• Reitz, M.; Sacher, O.; Tarkhov, A.; Trümbach, D.; Gasteiger, J. Enabling the exploration of biochemical pathways. Org. Biomol. Chem. 2004, 2, 3226-3237.

• von Homeyer, A.; Reitz, M. In Handbook of Chemoinformatics - From Data to Knowledge. Gasteiger, J., Ed.; Wiley-VCH: Weinheim, Germany, 2003, pp 756-789.

Curriculum Vitae

Name: Martin Johann Reitz
Date and place of birth: 10 January 1974 in Forchheim
Parents: Hermann and Lydia Reitz, née Kaul
Nationality: German

School education
1980 – 1984: Grundschule Igensdorf (primary school)
1984 – 1993: Ehrenbürg Gymnasium Forchheim

University education
10/1993 – 03/2000: Studies in biology (Diplom) at the Friedrich-Alexander-Universität Erlangen-Nürnberg. Diploma thesis at the Chair of Biochemistry with Prof. Dr. E. Schweizer: "Herstellung eines Hefe-Fettsäuresynthasekomplexes ohne Substratbindestellen" (Preparation of a yeast fatty acid synthase complex without substrate binding sites)

10/2000 – 10/2001: Postgraduate studies in computer science and business informatics at the Friedrich-Alexander-Universität Erlangen-Nürnberg. Final thesis at HEITEC AG, Erlangen: "Entwicklung von Softwarekomponenten für eine Internetanwendung auf Basis von Java-Servlet-Technologie" (Development of software components for an internet application based on Java Servlet technology)

since 12/2001: Doctoral studies in the research group of Prof. Dr. J. Gasteiger, Computer-Chemie-Centrum and Institut für Organische Chemie, Friedrich-Alexander-Universität Erlangen-Nürnberg
