Application of Genetic Algorithms to Structure Elucidation of Halogenated Alkanes Considering the Corresponding C NMR Spectra*

CROATICA CHEMICA ACTA CCACAA 77 (1–2) 213¿219 (2004) ISSN-0011-1643 CCA-2919 Original Scientific Paper Application of Genetic Algorithms to Structure Elucidation of Halogenated Alkanes Considering the Corresponding 13C NMR Spectra* Thomas Blenkers and Peter Zinn** Lehrstuhl für Analytische Chemie, Ruhr-Universität Bochum, D-44780 Bochum, Germany RECEIVED MAY 5, 2003; REVISED JULY 19, 2003; ACCEPTED AUGUST 20, 2003 A new approach for structure elucidation using genetic algorithms is introduced. In analogy to the genetic programming paradigm developed by Koza, the new concept supports genetic operations on hierarchically coded chemical line notations. The implementation of this concept consists of 5 steps. In the first step, a start population of chemical compounds is randomly generated. As the second step, physical properties of each compound of the population are pre- dicted. The third step is the comparison of each individual property with the observed property of an unknown compound, resulting in the calculation of the fitness value for each generated compound. Depending on the fitness values, the candidates for the next generation are selected by a spinning wheel procedure during the fourth step. In the last step, these candidates are rear- Key words ranged by genetic mutation and crossover to form the next generation. Steps 2 to 5 of the de- genetic algorithms scribed procedure are repeated until the spectrum of one candidate is almost equal to the spec- structure elucidation trum of the unknown compound within acceptable tolerances. The introduced concept was 13C NMR spectra verified for halogenated alkanes. INTRODUCTION structures directly from powder diffraction data using Since the first publications in 1990, the interest in appli- genetic algorithms was reviewed recently.5 An application cations of genetic algorithms in chemistry or in closely of genetic algorithms to structure elucidation in a more related sciences has dramatically increased. Today, there general way was given by Meiler.6,7 This approach re- are nearly 4000 papers published, with a yearly growth quires the measured 13C NMR spectrum of the unknown of more than 600 contributions. An overview of the compound and its experimentally determined molecular varying application fields of genetic algorithms in chem- gross formula. Depending on the molecular formula, the istry was given by Leardi.1 space for searching the corresponding constitution in- Different approaches have been used to apply gene- creases rapidly. Genetic algorithms and 13C NMR spec- tic algorithms in structure elucidation.2,3,4 The objects of trum prediction by neural networks are demonstrated as these investigations are the protein fold prediction and the tools to find the constitution corresponding to the mea- pharmacophore elucidation, etc. Solving molecular crystal sured spectrum. * Dedicated to Professor Nenad Trinajsti} on the occasion of his 65th birthday. ** Author to whom correspondence should be addressed. (E-mail: [email protected]) 214 T. BLENKERS AND P. ZINN The concept of structure elucidation in the present To allow only genetic modifications that result in correct paper also focuses on 13C NMR spectra and genetic al- chemical structures, a system of corresponding constraints gorithms. As first described in 1995,8 the present method and modification rules seems to be necessary. This com- abstains from the molecular formula and allows struc- plication slows down the genetic algorithm and decreases ture elucidation without a priori structural knowledge. practicability. A completely different way of coding the individuals of a genetic algorithm was introduced by J. Koza.11 GENERAL CONCEPT AND IMPLEMENTATION Koza’s Genetic Programming is a technique for finding STRATEGY the best mathematical formula in order to solve a given When including genetic algorithms into the structure problem. The principal idea is to code formulas in a elucidation procedure the most important difference tree-like hierarchical way and to modify them by genetic from the classical elucidation circuit is that, instead of operations from generation to generation until an opti- only one actual structure proposition, a whole popula- mal result is reached. Within a tree, sub-trees can be cut tion of possible chemical structures has to be considered. off and substituted by sub-trees from other trees in the Each time the population passes through the elucidation sense of a genetic crossover. Point mutations can be also circuit, the individual structures are modified by genetic performed easily. In the following chapter, we will dis- operations. Figure 1 shows the general concept of struc- cuss in detail the transformation of this coding to chemi- ture elucidation using genetic algorithms. cal structures. The starting point of the genetic algorithm is the ge- As the second step of the genetic structure elucida- neration of an initial population of chemical structures. tion, the prediction of a physical property of each indi- In generating an initial population the main problem is vidual chemical structure is to be performed. Physical to find a structure notation that allows genetic modifica- properties, e.g. spectral, diffraction, chromatographic, tions of the individual structures e.g. mutation and cross- thermodynamic or other data, can be used if a well de- over. Genetic algorithms can conveniently process binary fined quantitative structure property relationship (QSPR) coded individuals. These individuals are often represent- is known. Besides a single QSPR, the prediction of dif- ed as bit strings of fixed length or as the corresponding ferent properties or a combined QSPR is also of interest real numbers. On the other hand, typical computer-ori- and may improve the accuracy of the results of the ge- ented molecular codes9 are far from bit strings or real netic structure elucidation. Especially in cases when sin- number coding. The three most widespread structure no- gle QPR’s are of less predictive quality, a combination tations are the connection table, the different types of line may be successful. A set of property values correspond- notations, and the adjacency matrix.10 Transformation of ing to each individual structure of the actual population one of these representations into a bit string or a real is the result of this step. number is difficult and results in data structures unsuit- As the third step, the fitness value for each individual able for genetic operations. E.g., transformation of the has to be calculated, including the comparison between adjacency matrix to a bit string might succeed in linking the observed and the estimated property. Typical fitness the single lines or columns of the matrix one after the functions are error measures such as the root mean square, other. Genetic modifications of such bit strings could re- mean absolute error, etc. In the case of combined proper- sult in structures containing atoms with wrong valencies. ties, different fitness measures can be taken into account with respect to the reliability of the estimated properties. If the fitness value of an individual or the mean fitness of the complete population becomes better than the thre- shold value, the genetic algorithm stops; otherwise, the algorithm is continued with the next step. The fourth step is responsible for the selection of the individuals for the next generation. Depending on the fitness values, a random procedure decides which individuals or pairs of individuals are selected for mutation or crossover operations. A typical selection procedure is the spinning wheel method12 with circle sectors proportional to the fitness values of the individuals. In the fifth step, genetic operations applied to hierarchically notated chemical structures are performed. Con- Figure 1. Overview of the iterative structure elucidation strategy cerning the genetic crossover, randomly chosen sub-trees applying genetic algorithms. Instead of one chemical compound, of two individuals are exchanged. Concerning the gene- whole generations of substances are passing through the circuit. tic mutation, a randomly chosen sub-tree of a single in- Croat. Chem. Acta 77 (1–2) 213–219 (2004) APPLICATION OF GENETIC ALGORITHMS TO STRUCTURE ELUCIDATION 215 dividual is eliminated and substituted by a randomly generated new sub-tree to form an individual for the next (a) generation. After the fifth step, the first elucidation circuit is finished and a new generation of chemical structures has been built to pass the next circuit, and so on. Because the genetic operations as well as the generation of the start population have been verified as sym- bolic programming techniques instead of conventional numerical methods of the other steps, these operations (b) are described in more detail in the next two chapters. GENERATION OF HIERARCHICALLY CODED CHEMICAL STRUCTURES In a previous paper, we have introduced the basic list data types for performing list operations on chemical graphs.13 Among these data types, we have given an example of a tree – like hierarchical notation of a chemical Figure 2. Example of a hierarchical coded chemical line notation. structure. This hierarchical structure is based on a line Because it is not unique several notations are belonging to the notation that is a near SMILES14 implementation in the same molecule. A corresponding property prediction has to result programming language LISP.15 As an example, Figure 2 in the same value independent of the hierarchical notation of the gives the line notation of 1-bromo-2-chlorobutane. molecule. (a) Line notation of 1-bromo-2-chlorobutane. (b) Hier- archical representations of 1-bromo-2-chlorobutane. In order to generate a hierarchical representation need- ed in the genetic algorithm, a root atom has to be determined. Then, corresponding to the root atom, branches and dure if the actual atom belongs to the node set. The num- sub-branches are added, characterized by additional pa- ber and kind of the elements of the terminal and node sets rentheses. The hierarchical representation of the exam- are responsible for the magnitude of the generated trees.

Load more