<<

Journal of

Research article Open Access A constructive approach for discovering new drug leads: Using a kernel methodology for the inverse-QSAR problem William WL Wong* and Forbes J Burkowski

Address: The David R. Cheriton School of , University of Waterloo, Waterloo, Ontario N2L 3G1, Canada Email: William WL Wong* - [email protected]; Forbes J Burkowski - [email protected] * Corresponding author

Published: 28 April 2009 Received: 15 February 2009 Accepted: 28 April 2009 Journal of Cheminformatics 2009, 1:4 doi:10.1186/1758-2946-1-4 This article is available from: http://www.jcheminf.com/content/1/1/4 © 2009 Wong and Burkowski; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract Background: The inverse-QSAR problem seeks to find a new molecular descriptor from which one can recover the structure of a molecule that possess a desired activity or property. Surprisingly, there are very few papers providing solutions to this problem. It is a difficult problem because the molecular descriptors involved with the inverse-QSAR must adequately address the forward QSAR problem for a given biological activity if the subsequent recovery phase is to be meaningful. In addition, one should be able to construct a feasible molecule from such a descriptor. The difficulty of recovering the molecule from its descriptor is the major limitation of most inverse-QSAR methods. Results: In this paper, we describe the reversibility of our previously reported descriptor, the vector space model molecular descriptor (VSMMD) based on a vector space model that is suitable for kernel studies in QSAR modeling. Our inverse-QSAR approach can be described using five steps: (1) generate the VSMMD for the compounds in the training set; (2) map the VSMMD in the input space to the kernel feature space using an appropriate kernel function; (3) design or generate a new point in the kernel feature space using a kernel feature space algorithm; (4) map the feature space point back to the input space of descriptors using a pre-image approximation algorithm; (5) build the molecular structure template using our VSMMD molecule recovery algorithm. Conclusion: The empirical results reported in this paper show that our strategy of using kernel methodology for an inverse-Quantitative Structure-Activity Relationship is sufficiently powerful to find a meaningful solution for practical problems.

Background is sparse or not available, as is the case for The structural conformation and physicochemical proper- many membrane proteins, it becomes necessary to esti- ties of both the ligand and its receptor site determine the mate affinities using only the properties of the ligand. This level of binding affinity that is observed in such an inter- ligand-based prediction strategy is often used in applica- action. If the structural properties of the receptor site are tions such as of molecular databases in a known (for example, there is crystallographic data) then procedure. techniques involving approximations of potential func- tions can be applied to estimate or at least compare bind- In a more general setting we strive to establish the quanti- ing affinities of various ligands [1]. When this tative dependency between the molecular properties of a

Page 1 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

ligand and its binding affinity. To restate this goal using The key to an effective method lies in the use of a descrip- current terminology: we want to analyze the Quantitative tor that facilitates the reconstruction of the corresponding Structure-Activity Relationship (QSAR) of the ligand with molecular structure. Ideally, such a descriptor should be respect to this type of receptor. A common approach for a informative, have good correlative abilities in QSAR QSAR analysis is the use of a machine strategy applications, and most importantly, be computationally that processes sample data to learn a function that will efficient. A computationally efficient descriptor should predict binding affinities. The input of such a function is have a low degeneracy, that is, it should lead to a limited a descriptor: a vector of molecular properties [2] that char- number of solutions when a molecular recovery algo- acterize the ligand. These vector entries may be physico- rithm is applied. chemical properties, for example, molecular weight, surface area, orbital energies, or they may describe topo- Currently, kernel methods are popular tools in QSAR logical indices that encode features such as the properties modeling and are used to predict attributes such as activ- of individual atoms and bonds. Topological indices can ity towards a therapeutic target, ADMET properties be rapidly computed and have been validated by a variety (absorption, distribution, metabolism, excretion, and of experiments investigating the correlation of structure toxic effects), and adverse drug reactions. Various kernel and biological activities. methods based on different molecular representations have been proposed for QSAR modelling [18]. They The inverse-QSAR problem seeks to find a new molecular include the SMILES string kernel [19], graph kernels [20- descriptor from which one can recover the structure of a 22] and a kernel [23]. However, none of molecule that possess a desired activity or property. Sur- these kernel methods have been used for the inverse- prisingly, there are relatively few papers providing solu- QSAR problem. tions to this problem [3]. Lewis [4] has investigated automated strategies for working with fragment based In this paper, we investigate the reversibility of our previ- QSAR-driven transforms that are applied to known mole- ously reported descriptor, the vector space model molecu- cules with the objective of providing new and promising lar descriptor (VSMMD) [24]. VSMMD is based on a drug leads. Brown et al. [5] have studied the inverse QSPR vector space model that is suitable for kernel studies in problem using multi-objective optimization. They used QSAR modeling. Our approach to the inverse-QSAR prob- an interesting partial least squares (PLS) related algorithm lem consists of first deriving a new image point in the ker- in the design of novel chemical entities (NCEs) that satisfy nel feature space and then finding the corresponding pre- a given property range or objective. The physical proper- image descriptor in the input space. Then, we use a recov- ties considered in their paper were mean molecular polar- ery algorithm to generate a chemical structure template to izability and aqueous solubility. A problem with the be used in the specification of new drug candidates. Tem- opposite emphasis has also been investigated: Masek et al. plate formats will vary with respect to their specificity. [6] considered sharing chemical information that is Depending on the nature of the recovery process, a tem- encoded in such a way as to prevent recovery of the original plate may specify a unique molecule or a family of mole- molecule (a strategy used in the facilitation of prediction cules. In the latter case, molecules meeting the software while protecting intellectual property rights). specification could be obtained by means of high throughput screening. In the "Methods" section, we pro- In general, the inverse-QSAR problem is difficult because vide a detailed description of our inverse-QSAR approach the molecular descriptors involved with the inverse-QSAR using our VSMMD approach. In the "Results" section, we algorithm must adequately address the forward QSAR present the experimental results of our descriptors in the problem for a given biological activity if the subsequent vector space setting. recovery phase is to be meaningful. In addition, one should be able to construct a feasible molecule from such Methods a descriptor. The difficulty of recovering the molecule Our inverse-QSAR approach can be described in five steps. from its descriptor is the major limitation of most inverse- The first two steps perform a QSAR analysis. In the first QSAR methods. step, we generate a VSMMD for each compound in the training set. Then, in the second step, we use a kernel func- Most of the proposed techniques are stochastic in nature tion to map each VSMMD to a feature space typically used [7-9], however, a limited number of deterministic for classification. The third step is to design or to generate approaches have been developed including the approach a new point in the kernel feature space using a kernel fea- of Kier and Hall [10-13] based on a count of paths, and an ture space algorithm (e.g. the center of highly active com- approach based on signature descriptors (see Faulon et al. pounds). In the fourth step, we map this point from the [14-17]). feature space back to the input space using a pre-image approximation algorithm. In the last step, the molecular

Page 2 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

structure template will be built by our VSMMD molecule combinatorial explosion of components in the descriptor recovery algorithm. Figure 1 illustrates the overall process- at the expense of introducing some degeneracy. Since the ing. occurrence of triple bonds is rather infrequent, we con- tend that this low level of degeneracy is much more Vector Space Model Molecular Descriptor (VSMMD) acceptable than the drastic and unwanted increase in The vector space model molecular descriptor (VSMMD) component count. [24] can be categorized as belonging to the constitutional descriptors that provide feature counts related to the two The VSMMD strategy is based on the extraction of molec- dimensional structure of a molecule. Functionally, our ular fragments that are comprised of small sets of bonded descriptor bears a resemblance to various other descrip- atoms. The atom count for a fragment is at least two and tors such as reduced graphs (see Gillet [25]) and circular at most c, where c is some pre-specified value such as 2, 3, fragments (Glenn et al. [26]). One may also see our frag- or 4. ment oriented approach as somewhat reminiscent of the path count strategy of Kier and Hall [10-13]. In the setting To illustrate the processing of a molecule we describe the of inverse-QSAR, we used VSMMD because we were famil- steps that are taken in the processing of a molecule (atom iar with its capabilities and we had software applications count for a fragment limited to 2). Figure 3 shows one of to do the generation of the descriptors. The primary signif- the pyrrole compounds that is a member of the data set icance of the current study is not the formulation of a new used in reference [27]. and competitive descriptor but rather the development of strategies to facilitate the reverse engineering The algorithm goes through the following steps: that one needs to go from feature space point back to the specification of a new ligand. 1. The atoms and bonds are labelled as prescribed by Figure 2. The first step in constructing the VSMMD is to identify the physicochemical properties of each atom in a molecule. 2. The molecular descriptor is created by extracting Specifically, we affix labels to atoms and bonds as speci- from the molecule a complete set of small fragments. fied in Figure 2. It should be noted that triple bonds would also be labelled as "=". This helps to reduce the

OverallFigure 1concept for the VSMMD inverse-QSAR approach Overall concept for the VSMMD inverse-QSAR approach.

Page 3 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

LabelsFigure for 2 atoms and bonds in a molecule Labels for atoms and bonds in a molecule.

3. Frequency counts are evaluated for all the fragments the vector entries that have been observed to have fre- so that a multi-set or bag of fragments can be gener- quency counts equal to zero across all molecules under ated consideration.

When these steps are completed, the multi-set counts are In the VSM, a language dictionary can be defined using placed into a vector that has a position for each of the dif- some permanent predefined set of words. In our model, ferent possible fragments recorded in the dictionary. Note according to the encoding scheme, the dictionary will be that all molecules end up with descriptors of the same the collection of all possible atom type (AT) and bond length. If fragment size is limited to 2 atoms this vector type (BT) labelled graphs that arise from molecular frag- would have a dimensionality of 7*7*3 = 147. ments that are restricted to having atom counts of 2, 3, or 4. Table 1 shows the general format of our dictionary with VSM and VSMMD compared c limited to 4. The last entry of this table represents a four The motivation for the VSMMD descriptor comes from atom fragment in which a central atom is bonded to three the "bag-of-words" approach [28] that is based on the vec- other atoms. tor space model (VSM) used in . Roughly speaking, the atom fragments extracted in the In the most general case, a molecular descriptor is repre- VSMMD process corresponds to the document words and sented by a bag of fragments, each fragment correspond- phrases extracted in the VSM. The molecule is then analo- ing to an entry in the dictionary. gous to a text document and much of the analysis used in the bag-of-words strategy can be brought over to the Analysis based on VSMMD VSMMD setting. We now describe the notation and mathematical setting used in VSMMD. For each molecule i, a molecular descrip- In practice, we have found that the success of VSMMD is tor di can be generated. We represent the descriptor as a greatly enhanced by utilizing fragments that contain more column vector in an m dimensional space using the map- than two atoms, for example, c = 3 or c = 4. This adoption ping: of a higher level of structure corresponds to the incorpo- ration of phrase structures in the VSM. From a molecular =∈T m φφLi:d Li ( d ) ( qf (12 , d i ), qf ( , d i ), , qf ( mi , d )) R perspective, the larger fragment will incorporate informa- tion related to rotation around a single bond. As the value (1) of c increases there is a point of diminished returns due to where q(fl, di) is the frequency of the fragment fl in the the combinatorial explosion of fragment possibilities. To descriptor di. help reduce this "curse of dimensionality", we can remove

Page 4 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

ProcessingFigure 3 steps for the VSMMD Processing steps for the VSMMD.

Table 1: General format of the fragment dictionary (AT = Atom The use of the linear kernel φ L(·) in this last equation is Type, BT = Bond Type). deliberate since we want to view this mapping as the type of kernel function that is used in the bag-of-words strategy Type ID General Fragment Type Atom Count described by Shawe-Taylor and Cristianini [28]. Via this mapping, each molecular descriptor is taken over to an m- 1) AT BT AT 2 dimensional vector, where m is the size of the dictionary. 2) AT BT AT BT AT 3 3) A B A B A B A 4 Although m could be very large, the typical vector gener- T T T T T T T ated in this way is usually quite sparse (just as vectors in 4) AT BT AT BT AT 4 BT the VSM are sparse). AT Working with n molecules, we can apply the mapping repeatedly to generate a succession of column vectors: φ L(d1),φL(d2),...,φL(dn). Computation of the vector space

Page 5 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

kernel is done by calculating the fragment-descriptor ponent in Y, the vector of the output values. In the process matrix F with rows indexed by the fragments and columns of , we want to be able to generalize to indexed by the descriptors: previously unseen data points. In the case of binary classi- fication, given some new input x ∈ X, we want to predict ⎛ qf(, d ) qf (, d )⎞ ⎜ 11 1n ⎟ the corresponding y ∈ {+ 1, -1}. In other words, we want == Fd(()φφLLn1 ()) d⎜ ⎟. ⎜ ⎟ to choose y such that (x, y) is in some sense similar to the ⎝ qf(,)mmn d1 qf (,) d ⎠ training examples. In order to achieve this, we require a (2) similarity measures in X and in Y. Since y ∈ {+1, -1}, to find the similarity measure in Y is relatively easy. On the The entry at position (i, j) gives the frequency of fragment other hand, we require a function to measure the similar- f in molecule j. Subsequently, we can create the kernel i ity in X: matrix as: kX:,×→ X R = T (3) (6) KFF → (,xxii ) kxx (, ) corresponding to the vector space inner product satisfying, for all x, xi ∈ X: i = 1 .. n,

m = = kxx(,ii )φφ (),( x x ), (7) φφLi(),()dd L j∑ qfdqfd (,)(, zi z j ). (4) = z 1 where φ maps descriptors into an inner product feature In our forward-QSAR experiments [24], the nonlinear space FS. The similarity function k is called a kernel, and Gaussian kernel [28] φ is its feature map. The feature space FS is usually called the reproducing kernel Hilbert space (RKHS) associated − = σφLi(),()dd φ L j (5) with k. Figure 4 illustrates the overall mapping concept. kd(,ij d ) e gave the best results. It provides a rich hypothesis space By using a kernel function, the embedding in the Hilbert and is often used in machine learning studies. One param- space is actually performed implicitly, that is by specifying eter, σ, had to be evaluated through cross-validation. the inner product between each pair of points rather than by giving their coordinates explicitly. This approach has With the Gaussian vector space kernel, we can apply a ker- several advantages, the most important being the fact that nel-based method to generate a predictor of any one of often the inner product in the embedding space can be several biological activities, for example: ADMET proper- computed much more easily using a kernel rather than ties or affinity of ligands used as therapeutic agents. In our using the coordinates of the points themselves. previous studies [24], we gave empirical evidence to dem- onstrate that VSMMD can capture important chemical In machine learning, using kernels is a strategy for con- information allowing it to perform very well in both for- verting a linear classifier algorithm into a non-linear one ward-QSAR studies and the determination of binding by using a non-linear function to map the original obser- mode information. vations into a higher-dimensional feature space; this makes a linear classification in the new feature space The Reproducing Kernel Hilbert Space (RKHS) equivalent to a non-linear classification in the original Kernel-based learning work by embedding the input space. data into a Hilbert space, often called the feature space fol- lowed by a search for linear relations within this Hilbert For more details about the RKHS, readers can refer to [28]. space. Kernels are functions that can be used to formulate similarity comparisons. They provide a general framework Designing Molecules Using a Feature Space Algorithm to represent data, subject to certain mathematical condi- Suppose we have a set of n molecular descriptors S, desig- tions. Data are not represented individually by kernels. nated as S ={d , d ,...,d } where each d is in the input Instead, data are represented through a set of pair-wise 1 2 n i space X. Let us assume we are using a Gaussian vector comparisons. space kernel as defined in equation (5). Under this kernel, any point di ∈ X, is implicitly mapped to an image φ(di) in More formally, suppose we have n training data pairs the feature space FS. With this kernel mapping, we can n {}()xy, where x ∈ X. The vector space X is referred define the set φ(S) = {φ(d1), φ(d2),...,φ(dn)} ⊂ FS. ii i=1 i to as the input space and each yi is considered to be a com-

Page 6 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

TheFigure implicit 4 function maps points in the input space over to the feature space The implicit function φ maps points in the input space over to the feature space.

In this sub-section, we will evaluate various properties of the data set φ(S). We provide a set of elementary algo- n 1 rithms to do various calculations such as distance between φφ= ∑ ().d (10) sin two descriptor image points in the feature space. i=1 The norm of the centroid can be calculated using only the The feature space centroid derived from highly active compounds evaluations of the kernel on the inputs: The norm of φ(d) is given by: n n 2 11 ==2 = (8) φφφ==,(),()∑ φd ∑ φd φφ()dd () φφ (),() ddkdd (,). sssn in j i=1 j=1 A special case of the norm is the length of the line joining n n two images φ(d ) and φ(d ), which can be computed 1 1 2 = φφ(),()dd using: 2 ∑∑ ij n i=1 j=1 n n −=−2 − 1 φφ()ddij () φφφφ () dddd ijij (),() () = kd(, d ). 2 ∑∑ ij =− + n = = φφ(),()ddii2 φφ (),( d iddddjjj)(),()φφ i 1 j 1 =−+ kd(,)ii d2 kd (, i d j ) kd (, j d j ). (11) (9) Note that, the result is the average of the entries in the ker- nel matrix. The inner product between a descriptor image The norm described by (9) represents the distance point φ(d) and the centroid φs is given by: between two descriptor image points in the feature space. We define the centroid φs of the molecule data set S in the feature space as:

Page 7 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

sian kernel) is used, then a point in the feature space bears n n a nonlinear relationship to the point in the input space. 11 φφ(),dd== φ (),∑∑ φ (d ) φφ (),(dd ) The centroid in the feature space would rarely, if ever, map sin n i i==11i back to the centroid in the input space. We have chosen to n use the centroid in the feature space because there is more 1 = ∑ kdd(, ). assurance of generating a new point in the feature space n i i=1 that in some sense represents a point of reasonable inter- (12) polation in this high dimensional space. Picking an arbi- trary point in the feature space runs a higher risk of Using equation (9), we can calculate the distance between extrapolation which may be difficult to avoid especially when a small training set is spread over the higher dimen- φ(d) and the centroid φs in the feature space by: sional feature space in some rarefied manner. 2 2 φφ()ddd−= φ ()2 −2 φφφ (), + sssThe inverse in the input space is called the pre-image. We n n n 21 will discuss pertinent details in the pre-image problem =−kdd(,)∑ kdd (, ) +∑∑ kd ( ,d ). n i 2 i j subsection to follow. i=1 n i=1 j=1 (13) There are several studies that use a feature space centroid Recall that the kernel-based learning algorithms work by to generate new data. Kwok and his colleagues used a fea- embedding the data into the feature space, and searching ture space centroid to generate a new data point for hand- for a linear relationship within this feature space. With written digit recognition [29] and speech processing [30]. this linear relationship, it makes sense to derive a new In both applications, the pre-image has been shown to be descriptor image using the centroid point of the highly robust and meaningful. In the next subsection we describe active compound's image points which will share the gen- another strategy for the derivation of a new feature space eral properties of all highly active compounds. If the cen- point troid point can be mapped from the feature space back to the input space, we can obtain the descriptor of a new can- Minimum enclosing and maximum excluding hyperspheres didate molecule. Figure 5 illustrates this idea. It should be In the last subsection, we derived a new descriptor image stressed that when a nonlinear kernel (such as the Gaus- point using the centroid of feature space images derived

DerivingFigure 5 a new image in kernel feature space Deriving a new image in kernel feature space.

Page 8 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

from the highly active compounds. In this subsection, we 2 use the highly active compounds to derive two hyper- ad−≤φ() r2 for dG ∈ , spheres with the same center. The center of the hyper- ii i (14) −≥2 2 ∉ spheres is then mapped back from the feature space to the adφ()ii r2 for dG . input space to generate the descriptor of a new candidate molecule. Figure 6 illustrates this idea.

Suppose we can identify a subset G ⊆ S where G contains Following the development of Liu and Zheng's minimum the descriptors of molecules in the with enclosing and maximum excluding machine (MEMEM) the highest activity. We let |G| represent the number of [31], we want the inner hypersphere H as small as possi- descriptors in G. In an ideal situation, the feature space 1 images of G will be spherically separable from all the ble for a good description of the highly active class. In the other descriptor images mapped over from S. With this meantime, we want the outer hypersphere H2 as large as assumption, we can derive two hyperspheres, sharing the possible. In other words, we try to maximize the area same center a, such that the images of all descriptors between two hyperspheres H2 and H1. Note that the area derived from highly active molecules are enclosed by the between two hyperspheres H2 and H1 is proportional to inner hypersphere H1 and all the remaining images are excluded by the outer hypersphere H . Let r be the radius the quantity ()rr2 − 2 . Let Δrrr2 =−1 ()2 2 and 2 1 2 1 2 2 1 of the inner hypersphere and let r2 be the radius of the outer hypersphere. Consequently, we have:

MinimumFigure 6 enclosing and maximum excluding hyperspheres in the feature space Minimum enclosing and maximum excluding hyperspheres in the feature space.

Page 9 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

outside the outer hypersphere. The corresponding dual rrr2 =+1 ()2 2 , we can formulate the objective func- 2 1 2 problem of equation (19) obtained by [31] is:

2 2 tion to have r1 as small as possible and Δr as large as ⎧ n n n ⎫ possible by minimizing the quantity ⎪ ⎪ ⎨ − 1 ()+ ()⎬ maxη ∑∑∑ααijijyyk d i , d j αii yk d i , d i 2 −−2 22=−Δ 2 αβi , ⎪ ⎪ rrrr1 ()2 1 3 r, or equivalently ⎩ i=1 j==11i ⎭ n n subject to αηy =−=≥,,, αββ10 1 22 ∑∑ii i rr−Δ . (15) == 3 i 11i ≥≥ = and C α i 0 for in1, to Replacing the constant 1 by η, which controls the trade (20) 3 where β is the Lagrange multiplier associated with the off between the importance of the inner hypersphere and constraint Δr2 ≥ 0, and C is some constant to be deter- the outer hypersphere, the objective function will mined with a validation data set. The center a of the become: hyperspheres is then mapped back from the feature space to the input space to generate the descriptor of a new can- min {}ηrr22− Δ didate molecule. ar,,22Δ r −≤−φ 2 22Δ ∈ subject toad()ii r r for dG , The Pre-image Problem In RKHS subsection, we illustrated how a point in the − φ 2 ≥+22Δ ∉ and a ()drriifor dG , input space is mapped to the feature space via the implicit (16) function φ. In this subsection, we are interested in finding how a point in the feature space can be mapped back to Our analysis will require Lagrange multipliers αi and the input space. Formally, this is called the pre-image labels yi such that problem of reconstructing patterns from their representa- tion in feature space (see Figure 7). ydG=∈1 for , ii (17) =− ∉ Ψ ydGii1 for . Let be a point in the reproducing kernel Hilbert space (RKHS) FS. The pre-image of Ψ ∈ F is a point d* ∈ X (the The dual problem of equation (16) can be obtained [31] original input space). Formally, as: ∗ Ψ=φ().d (21) ⎧ ⎫ ⎪ n n n ⎪ ⎨ − 1 ()+ ()⎬ maxη ∑∑∑ααijijyyk d i , d j αii yk d i , d i The problem of finding the pre-image d* is equivalent to α i ⎪ ⎪ ⎩ i=1 j==11i ⎭ the problem of finding the inverse of φ defined in equa- n n tion (21): == ≥= subject to∑∑αηiiyin,, αi 101 and αi for to . i==11i ∗− d = φ 1().Ψ (22) (18) In practice, the data may not be separable in this fashion. However, the problem of finding φ-1(·) is typically an ill- By introducing slack variables ζ ≥ 0, equation (16) posed problem. A problem is said to be ill-posed if the becomes: solution is not unique, does not exist, or is not a continu- ous function of the data [32]. ⎪⎧ n ⎪⎫ 22−+Δ min ⎨ηζrrC∑ i ⎬ One possible way to overcome this problem is to look for 22Δ ar,, r ,ζ i ⎪ ⎪ ⎩ i=1 ⎭ ∗ ∗ dˆ an approximation of the pre-image such that φ()dˆ is 2 22 subject to ad− φ() ≤≤−rrΔ +ζ for dG ∈, ∗ i ii as close as possible to Ψ. Formally, we search for dˆ ∈ X, −≥+−∉2 22Δ adφζ()iii r r for dG , such that and Δr 2 ≥ 0. ˆ ∗ 2 (19) dd=−arg minφ ( )Ψ . (23) dX∈ F By using equation (19), we allow some negative samples inside the inner hypersphere and some positive samples

Page 10 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

TheFigure pre-image 7 problem The pre-image problem.

Typically, we will have Ψ defined as some linear combina- In this paper, we are going to follow Kwok and Tsang [28] tion of implicit mappings from the input space. Conse- and use their approach to approximate the pre-image. quently, expanding equation (23), we get: Their algorithm is based on the notion of distance con- straints. They assume that there exists a simple relation- 2 n ship between distances in the input space and distances in ˆ ∗ =− (24) dddarg minφαφ ( )∑ ii ( ) , feature space. Figure 8 illustrates these relationships. Sup- dX∈ = i 1 F pose we have derived the center a of the hyperspheres which can be rewritten as: using equation (19). It was shown that the center a of the hyperspheres is a linear combination of the training sam- ⎡ ⎤ n n n n = 1 ˆ ∗ =−+⎢ ⎥ ples [31] i.e.: ayd∑αφii() i . Recall that η is defined in dkddkddkddarg min ( , )2∑∑αααii ( , )∑ ij ( i , j ) . η = dX∈ ⎢ ⎥ i 1 ⎣ i==11i=1 j ⎦ equation (16). The norm of the center can be calculated (25) using only the evaluations of the kernel on the training Using equation (25), an inversion problem turns out to sample inputs: be an optimization problem. There are several algorithms that attempt to solve this optimization problem. Schölko- n n = 11 pf et al. [33] proposed an iterative fixed point algorithm aa,(),()∑∑αφii y d i αφjj y d j η ==η strategy. Kwok and Tsang [28] proposed another method i 11j (26) that exploits the correspondence between distances in the n n 1 input space and the feature space. Alternatively, a stand- = αα yykd(, d ). 2 ∑∑ ijij i j ard gradient optimization method can be used to find an η i=1 j=1 approximation of the pre-image [33]. Note that all these methods are only guaranteed to find a local optimum.

Page 11 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

KwokFigure and 8 Tsang pre-image strategy Kwok and Tsang pre-image strategy.

For each training sample di we can derive est neighbours of a in the feature space. Using (28), we 2 can convert these p closest neighbour distances in the fea- distF =− aφ() d representing the square of the dis- ai, i ture space to their corresponding input space distances. tance between the training sample image di and the center Let b be the vector representing these input space dis- a of the hyperspheres: tances with

2 T distF =− aφφφφ() d = a , a −2 a ,() d + (),() d d = ⎡ ⎤ (29) ai, iiii b ⎣ distaa,,12,,,. dist dist ap ,⎦ n 1 =−aa,(2 ∑αφ y d)),()φφφddd+ (),() Their corresponding descriptors in the input space are d , η jj j i ii 1 j=1 p n n n 1 2 d ,...,d ∈ ޒm, and the centroid is defined as dd= 1 ∑ . =−∑∑αα yykd(, d )∑α ykd (, d)(,).+ kd d 2 p p i 2 ijij i jη jj j iiii i=1 η i=1 j=1 j=1 Let D = [d , d ,...,d ] be an m × p matrix. To establish the (27) 1 2 p centroid at the origin, we let Using the Gaussian kernel as specified in equation (5) and an observation made by Kwok and Tsang [29], the corre- ⎛ 1 ⎞ sponding input space distance dist , between the center AI=−⎜ 11TT⎟ D , (30) a, i ⎝ p ⎠ and training sample di can be found using: where I is a p × p identity matrix, and 1 is a p dimensional 1 1 vector with each component equal to 1. dist=−log(1 − dist F ). (28) ai,,σ 2 ai Assuming matrix D is of rank q, we obtain the singular To speed up the algorithm (as observed by Kwok and value decomposition (SVD) of AT as: Tsang [29]), only the p closest training sample images of a will be considered. From (27), we can identify the p clos-

Page 12 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

Recovering the Molecule AUSVUZTT==, (31) In order to solve the recovery problem for chemical struc- tures, we have to investigate a way to derive a graph repre- where U = [u , u ,...,u ] is a m × q matrix with orthonormal ∗ 1 2 q senting the 2D structure of a molecule that has dˆ as its columns composed of the u 's, and Z = [z , z ,...,z ] is a q × i 1 2 p descriptor. There are several related studies that attempt to p matrix with the i-th column zi being the projection of di find such a graph, see for example Bakir et al. [34], who onto the uj vectors such that the squared distance of di to worked with a stochastic search algorithm. However, this 2 2 the origin is equal to ||zi|| . Letting b0 = [||z1|| , recovery problem is not well studied from a computa- ∗ tional viewpoint. Tatsuya and Fukagawa [35,36] proposed ||z ||2,...,||z ||2]T the pre-image dˆ of the center a can be 2 p a dynamic programming algorithm for inferring a chemi- obtained by: cal structure from a descriptor. However, the algorithms are not practical for a large data set. Previous studies ˆ ∗−=−1 1 T − + dUSVd().bb0 (32) focused on creating a real chemical structure, which 2 defined too many constraints on the problem due to the The Need for a Nonlinear Kernel of chemical structure. In our study, we do not The nonlinear implicit mapping provided by the kernel attempt to recover a real chemical structure; instead, we operation allows us to generate an inner product in the generate a chemical structure template with physiochem- feature space by a kernel function that has ical properties only. This simplifies the problem and arguments taken from the input space. More significantly, makes it practical for real data sets. when a nonlinear kernel is used, linear operations in the feature space correspond to nonlinear operations in the Reversible VSMMD input space. This is important because the nonlinear map- For illustration and without loss of generality, we will ping will involve various cross products of components assume that the atom count associated with the VSMMD within a vector of the input space. As a consequence, lin- is two, and we further assume all the aromatic rings are ear structures within the feature space correspond to non- replaced by "super atoms" containing all the rings' physi- linear or "warped" structures in the input space. cochemical properties. Figure 9 illustrates a simplified VSMMD using the same example as in Figure 3. To illustrate this, we ran a small experiment with the COX2 training set (described below in the Results sec- From Figure 9, we observe that the VSMMD model con- tion): As described earlier, the new feature space point a, tains only the physiochemical properties of the chemical generated by extracting the center of the enclosing hyper- structure. As a result, for the recovery problem, we do not sphere, was mapped back to the input space to get its pre- attempt to recover the entire chemical structure. Instead, ˆ ∗ we attempt to generate a chemical structure template with image d . We then formed the set of points Sfe in the physiochemical properties only. A chemical structure tem- input space (taken from the training set) that produced plate converted from the molecule shown in Figure 9 is the ten closest neighbours of a under the kernel mapping. illustrated in Figure 10.

This set Sfe was compared with the set Sin containing the 10 In the next subsection, we define the notion of a structure ˆ ∗ closest neighbours of d in the input space (these neigh- template and show how it can be derived. bours derived using the Euclidean metric). Because of the warping effect, pre-images of close neighbours in the fea- Forming the De Bruijn graph ture space are not necessarily the closest neighbours of the If we consider a molecule to be comprised of molecular ˆ ∗ fragments then it is clear that there is a hierarchical organ- pre-image d in the input space, in fact, the intersection ization of these fragments. A linear fragment with an atom of Sin and Sfe is only 3 descriptors. More significant: the count of three can be seen as containing two smaller frag- average affinity of molecules in Sin is 8.03 while the aver- ments each with an atom count of two and of course the age affinity of molecules in S is 8.73. This provides two fragments overlap in the central atom. If we restrict a fe fragment to have an atom count of two, then it will con- empirical evidence that the nonlinear mapping provided tain two elementary fragments, namely two atoms, each by the kernel function helps us to select input space labelled with their atom types. descriptors that are more significant when considering their corresponding affinities. Informally: The purpose of a De Bruijn graph is to provide a data structure that shows how small fragments combine

Page 13 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

SimplifiedFigure 9 VSMMD with an aromatic ring treated as a super atom Simplified VSMMD with an aromatic ring treated as a super atom.

to build larger fragments. Since we wish to handle ring fragment that has an atom count equal to fa - 1. In our sim- structures using the simplification of a "super atom", we plified case, fa = 2 and each vertex will represent an atom will abuse these concepts slightly and consider the frag- labelled with a physicochemical property. We then add a ments under discussion to be fragments within a template bi-directional edge from vertex a to vertex b if the frag- as described in the previous subsection. ments associated with these vertices are within a larger fragment with atom count equal to fa. Each edge is Suppose we are dealing with fragments that have an atom weighted with a value representing the number of times count designated as fa. The De Bruijn graph D is con- that this larger type of fragment occurs in the template. structed in the following way: We provide a vertex for each

Page 14 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

AnFigure example 10 of a chemical structure template An example of a chemical structure template. Note: We use "O" to denote O#O#O#O#O#O and 'OA' to denote O#O#O#O#A.

TheFigure De 11Bruijn graph D for the VSMMD shown in Figure 9 The De Bruijn graph D for the VSMMD shown in Figure 9. Note: We use 'O' to denote O#O#O#O#O#O and 'OA' to denote O#O#O#O#A.

Page 15 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

Although we have been referencing the template in linear time algorithm for its derivation [37]. The follow- describing the construction of D, it should be clear that it ing is the pseudo- of the Euler circuit algorithm. is possible to accomplish the generation of D by process- ing the descriptor that represents this template. Figure 11 EULER(q) illustrates the De Bruijn graph D generated from the VSMMD shown in Figure 9. 1 Path ←none

The De Bruijn graph can be expanded by replacing each 2 For each unmarked edge e leaving q edge carrying weight c with c unweighted edges, each with do the same direction as the original edge. Figure 12 illus- trates this expansion. Let M be the resulting unweighted 3 Mark(e) De Bruijn graph. 4 Path A ← EULER(opposite ver From the VSMMD, we know the exact number of vertices tex(e)) || Path that should appear in the chemical structure template. With this information, we can derive a chemical structure 5 Return Path template from the De Bruijn graph by finding an Euler cir- cuit of M. Each Euler circuit will represent the extraction of a unique chemical structure template from the De Bruijn Graph M. All possible Euler circuits Figure 13 shows a subset of all the Euler circuits that can An Euler circuit is a circuit on the graph such that each be generated. The circuit labelled with a '*' corresponds to edge is traversed exactly once. Each traversal of an edge the chemical structure template illustrated in Figure 10. corresponds to the consumption of one instance of a com- The total number of possible Euler circuits for the chemi- ponent in the VSMMD. The problem of finding an Eule- cal structure in Figure 9 is 2700. rian circuit of a graph is well known and there exists a

TheFigure Expanded 12 graph M The Expanded graph M. Note: We use 'O' to denote O#O#O#O#O#O and 'OA' to denote O#O#O#O#A.

Page 16 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

In order to generate all possible chemical structures asso- To simplify the statistical model, independence assump- ciated with the VSMMD, we have to find all Euler circuits. tions are made so that each edge depends only on the last Different chemical structure templates correspond to the t edges. Consequently, we have a Markov model that pro- different possible orderings when traversing outgoing vides an approximation of how the fragments, each edges of each vertex. This produces a factorial explosion labelled with physicochemical properties, are connected with respect to the number of outgoing edges of each ver- within the template. More precisely, our model predicts tex. Thus, finding all Euler circuits is not feasible. traversal of ei based on previously traversed edges ei-1, ei- 2,...,ei-t. Formally, this is described as: Probabilistic Euler paths To overcome this, we have developed an algorithm that N = generates Eulerian circuits by doing a guided walk of the PE() Pe (121312 )( Pe | e )( Pe | e , e )ii∏ Pe (iit | e−− , , e i 1 ). graph. During the walk we choose an outgoing edge in a it=+1 probabilistic fashion. The choice is dependent on statistics (34) that are gathered from the descriptors in the training set. If we could handle unlimited amounts of training data, To accomplish this, we have built a statistical model that the maximum likelihood estimate of P (ei|ei-t+1,...,ei-1) is used to estimate the probability of an Euler circuit. would be:

Let E denote a path of N edges, that is, E = e , e ,...,e . 1 2 N = ce(,it−−+eit11 ,,,) ei −ei Then, by the probability chain rule, we can obtain: Pe(|iit e−− , , e i1 ) . ce(,it−−+eit1 ,, ei−1)

N (35) = PE() Pe (111 )i∏ Pe (ii | e , , e− ). (33) where c(ei-t,...,ei-1, ei) is the number of times the edge i=2 sequence ei-t,...,ei-1, ei is seen in the training data.

To estimate the conditional probabilities P (ei|e1, e2,...,ei- As an example, consider Figure 14 with t = 1. In order to 1), we need training data consisting of a large number of determine the edge to be traversed next when at the node Euler circuits each corresponding to some particular labelled "R", we consult the associated probabilities: P(R - molecular template. One can obtain these conditional O | A - R) and P(R - A | A - R). We traverse the edge with probability distributions from the training data by keep- the largest probability first. ing statistics on the dependency between the next edge to traverse and the history of the previously traversed edges. A threshold h can also be used as a cut-off to limit the Seen as probabilities of traversal, we of course use normal- number of edges that the algorithm should examine in an ized values so that the probabilities of all possible "next- effort to sidestep the factorial explosion that can occur edges" sum to 1.0. without this limitation. With this understanding, we can

SomeFigure possible 13 Euler circuits Some possible Euler circuits.

Page 17 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

compute the overall probability of each Euler circuit using Markov model that we set up in the previous subsection), equation (34). is added to connect two Euler paths together.

With t = 1 and h = 2, a total of 102 Euler circuits are gen- Consider the pre-image example given in Figure 16(a), the erated. The six Euler circuits with highest probability are corresponding expanded De Bruijn Graph is given in Fig- shown in Figure 15. They corresponded to two unique ure 16(b). The all-possible Euler algorithm is called at chemical templates. Templates 1 and 3 are the exact tem- each vertex whose outgoing edges are not all marked. The plates derived from the chemical structure shown in Fig- highest probability Euler circuits for the disjoint De Bruijn ure 10. Graph are given in Figure 16(c). To determine the concate- nation location of the two Euler circuits, our algorithm There are several related research papers that attempt to considers all the possible connections between the two retrieve the order of elements that are part of a larger struc- disjoint parts of the graph. All the possible connections ture using Eulerian circuits. Cortes et al. [38] retrieve the are illustrated in Figure 16(d). These possible connections order of words in documents using an Eulerian circuit are evaluated using the same Markov model that we set up approach. Pevzner et al. [39] assembly DNA fragments before to calculate the Euler circuits. Among all the possi- using an Eulerian circuit. ble connections, the P(R-O|R-R) value gives the highest probability. Thus a bi-directional edge between node 'R' As mentioned in [29], in general, there was no exact pre- and node 'O', is added to connect the two disjointed parts image in the input space; The pre-image returned by the together. The final De Bruijn Graph, the concatenated algorithm was an approximation and so it was compro- Euler circuit and the corresponding graph template are mised by approximation errors. Because of these approxi- shown in Figure 16(e). mation errors, the following problems may exist: Results • The pre-image vector may consist of non-integer Data components. In our previous work [24], eight different data sets were used to test the ability of the VSMMD to predict biological • The pre-image vector may not form a fully connected activities. All these data sets contain real valued QSAR De Bruijn Graph. inhibitor data. The eight QSAR data sets are from Suther- land et al. [27]. We chose one data set from these eight Our solution to overcome the first problem is to round data sets to demonstrate the effectiveness of the recovery the components to obtain integer counts. To deal with the algorithm when applied to our VSMMD. case where the graph is not connected, the all-possible Euler circuits algorithm is called at each vertex whose out- The data set we chose contains 322 cyclooxygenase-2 going edges are not all marked. The resulting path is the (COX2) inhibitors collected by Seibert and colleagues concatenation of the paths returned by different calls to [40] and subsequently utilized in a QSAR study by Cha- the all-possible Euler algorithm. More precisely, a bidirec- vatte et al. [41] with each inhibitor having pIC50 values tional edge with the largest conditional probability based ranging from 5.5 to 8.9. We chose this data set because on previously traversed edges in the path, (using the same training samples in the COX2 data set were presented using diagrams of molecular structures. This allowed us to

AnFigure example 14 in edges traversal An example in edges traversal.

Page 18 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

SixFigure highest 15 probability Euler Circuits for VSMMD shown in Figure 9 and the corresponding chemical structure templates Six highest probability Euler Circuits for VSMMD shown in Figure 9 and the corresponding chemical structure templates. Note: We use 'O' to denote O#O#O#O#O#O and 'OA' to denote O#O#O#O#A.

compare our generated chemical structure templates with programmed in Java. As illustrated in Figure 3, descriptors the given molecules in the data set. It should also be noted were calculated and the kernel matrix K was generated in that, despite the similarity of molecules displayed in Fig- a few seconds for each complete data set. For the pre- ure 17, the entire set of 322 COX2 inhibitors is spread image algorithm and the feature space algorithm, we used over 20 different scaffolds and so the training set has a rea- MATLAB to perform the required calculations. For the sonable structural diversity. recovery phase, we implemented the Probabilistic Euler Paths algorithm in Java. The same inverse-QSAR procedure was applied to the remaining seven data sets; the closest matching molecule Verification of the Inverse Mapping – Test Result in the test set for the generated chemical template is To verify our proposed inverse approach, we picked one shown in the last subsection. molecule in the COX2 training set randomly as shown in Figure 18(a). We then generated the VSMMD for each of In our experiments, the data were separated into the same the compounds in the training set. The corresponding training and testing sets as specified by Sutherland et al. VSMMD for the chosen molecule is shown in Figure [27]. 18(b). Next, we implicitly mapped this VSMMD to the kernel feature space using a Gaussian kernel function as Implementation Details stated in equation (5). Instead of generating a new point To identify the physicochemical properties of each atom, in the feature space, we used the pre-image approximation we implemented our descriptor generation program with algorithm to compute the pre-image of this feature space help from the chemical development kit (CDK) [42,43] point. The corresponding pre-image is shown in Figure

Page 19 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

AFigure case where 16 the pre-image vector did not form a fully connected De Bruijn Graph A case where the pre-image vector did not form a fully connected De Bruijn Graph.

Page 20 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

TenFigure highest 17 active compounds in the COX-2 training set Ten highest active compounds in the COX-2 training set.

18(c). Finally, we applied our VSMMD recovery algorithm In the fourth step, we mapped the feature space point back to obtain a chemical template. With t = 1 and h = 3, we into the input space using the pre-image approximation generated the Euler circuit with the highest probability algorithm. In this case, we used the Kwok and Tsang algo- and the corresponding chemical structure templates are rithm [29] described in the pre-image problem subsection shown in Figure 18(d). From this result, we observed that to map the center of the minimum enclosing and maxi- our approach is able to generate a chemical template cor- mum excluding hypersphere back into the input space. responding to the original chosen molecule. Figure 19 illustrates the derived VSMMD.

Inverse-QSAR Test Results The last step concerned building the molecular structure Recall that our inverse-QSAR approach contains five steps. template using our VSMMD recovery algorithm described The first two steps are to perform QSAR analysis. In the in the recovery subsection. Since the center of the mini- first step, we generated the VSMMD for the compounds in mum enclosing and maximum excluding hypersphere the training set. Then, in the second step, we implicitly was derived from the ten highest active compounds in the mapped the VSMMD to the kernel feature space using an training set, we assumed that the new derived compounds appropriate kernel function for classification. The results should look similar to these ten compounds. With this of the forward QSAR can be found in our previous paper assumption the path probability was calculated. Setting t [24]. = 1 and h = 3, we generated two Euler circuits and the cor- responding chemical structure template is shown in Fig- The third step was to design or to generate a new point in ure 20. the kernel feature space using a kernel feature space algo- rithm. To demonstrate our approach, we formed a new Notes on Test Results point in the feature space by using the ten highest active In order to investigate whether the pre-image VSMMD is compounds in the training set. The center of the mini- reasonable, we performed a more detailed analysis of the mum enclosing and maximum excluding hypersphere COX-2 data set. In the pre-image VSMMD as shown in Fig- was obtained using equation (19). Figure 17 shows these ure 20, the cyclopentene ring can be found in one of the ten compounds. descriptors. From medicinal studies, we know

Page 21 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

                      

                     

                   

  

  

  

  



VerificationFigure 18 test result Verification test result. that cyclopentene derivatives are one of the first series of The molecule in Figure 21 has a pIC50 value of 8.52, and diaryl-substituted cycles that have been well known as it was one of the highest active molecules in the testing set. COX-2 inhibitors [44,45]. This empirical evidence dem- From this result, we demonstrated that our strategy was onstrates that our generated pre-image VSMMD is able to able to generate a high affinity molecule using only data capture important properties of the ten most active mole- in the training set. We were able to claim that the gener- cules. ated molecule was a high affinity molecule because it appeared as such in the testing set. In practice, the success When we performed a high throughput screening on the of the algorithm would have to be assessed by using a wet test set using the generated chemical structure template, lab procedure to determine the affinity of the generated the following molecule was identified as an exact match to molecule. the template.

Page 22 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

TheFigure pre-image 19 VSMMD of the center of the minimum enclosing and maximum excluding hyperspheres The pre-image VSMMD of the center of the minimum enclosing and maximum excluding hyperspheres.

The inverse-QSAR procedure was applied to all eight data molecular descriptors in the training set follow a proba- sets; the closest matching molecule in the test set for the bility distribution that is only determined by the interac- generated chemical template across 8 data sets is shown in tions between ligands and the binding site. In practise this Figure 22. A quantitative evaluation of the generated mol- does not happen. The selection of members in the train- ecules was performed by implicitly mapping the VSMMD ing set may involve a significant amount of due to of each molecule to the kernel feature space for regression human involvement in its creation: analysis. The regression results are also shown in Figure 22. • Selection of members of the training set may be restricted by rules that exclude molecules that are not Discussion "drug like". In kernel-based learning, the usual assumption is that the n • Since the training set involves molecules that have data pairs {}()xy, , in the training set, come from a ii i=1 been assessed for binding affinity, they had to be syn- source that provides these samples in an independent thesized and may be part of a suite of molecules for identically distributed (i.i.d.) fashion according to an which the synthesis was not overly complicated. unknown probability distribution P(x, y). Furthermore, • Furthermore, the molecular descriptors in the train- the test examples are assumed to come from the same dis- ing set may show various types of repetition, (for tribution [46]. In an ideal situation, the collection of

TwotureFigure templatesEuler 20 circuits with the highest probability for the pre-image VSMMD in Figure 19 and the corresponding chemical struc- Two Euler circuits with the highest probability for the pre-image VSMMD in Figure 19 and the corresponding chemical structure templates. Note: We use 'O' to denote O#O#O#O#O#O and 'R*' to denote the cyclopentene ring.

Page 23 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

MatchingFigure 21 molecule in the test set Matching molecule in the test set. Note: We use 'O' to denote O#O#O#O#O#O and 'R*' to denote the cyclopentene ring.

example, the repeated occurrence of some type of scaf- 2. the deployment of a pre-image algorithm to map fold). This may or may not be intended. the descriptor image points in the feature space back to the input space of the descriptors, and As a consequence of these issues, the learning algorithm will produce a predictor that is taking into account both a 3. the design of a probabilistic strategy to convert new biological process and the human activity intrinsic to the descriptors into meaningful chemical graph templates formation of the training set. More significantly, there is the demand that future test molecules come from the As reported in earlier papers, our modeling has produced same probability distribution. Statistical learning theory very effective algorithms to predict drug-binding affinities will guarantee certain generalization bounds, but only if and to predict multiple binding modes [24]. We have now these demands are met. In effect, the theory tells us that if extended our modeling approach to the development of test samples come from a source, such as a virtual screen- algorithms that derive new descriptors and then facilitate ing library that is not characterized by the same rules of the reverse engineering of such a descriptor. This is a very formation as the training set – then all bets are off. desirable capability for a molecular descriptor [3].

In the constructive approach that has been described in The most important aspect of our research is the presenta- this paper, it is clear that we are also limited by the infor- tion of strategies that actually generate the structure of a mation that is intrinsic to a training set. But beyond this, new drug candidate. This is substantially different from the strategy significantly differs from virtual screening. methodologies that depend on database screening to get Instead of trying to find a new molecule in a database that new drug candidates. While our approach can support should exhibit the same P(x,y) characteristics, we side step such an endeavour, it is not our primary goal. In fact, we this requirement (which may be difficult to guarantee) are quite concerned that database screening, done using a and we build a new drug candidate using only the infor- predictor derived from a statistical learning algorithm, is mation that is strictly contained in the training set itself. subject to procedural demands that may be difficult to maintain. We are referring to statistical learning theory Conclusion that guarantees the success of a predictor, but only when While molecular fragments have been used in research the test sample is drawn from a data source that has the studies for dealing with quantitative structure-activity same probability distribution as that characterizing the relationship problems, we have further evolved this strat- training set. egy to include a reverse engineering mechanism. In the applications of statistical learning to database These mechanisms include: screening, the predictor may be applied to test molecules that have very little relationship to the training data. In 1. the use of a kernel feature space algorithm to design these cases, the predictor is optimistically treated as if it or modify descriptor image points in a feature space actually incorporates an algorithm that has some firm and

Page 24 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

TheFigure closest 22 matching molecule in the test set for the generated chemical template across 8 data sets The closest matching molecule in the test set for the generated chemical template across 8 data sets. direct relationship to the biological context of the prob- the feature space, the reverse engineering just described lem. As mentioned by Good et al. [47], the conclusion of allows us to develop a template for a new drug candidate a QSAR analysis can be profoundly altered by how the test that is independent of issues related to probability distri- set was derived. Traditionally, this concern was usually bution constraints placed on test set molecules. addressed through the design of complicated and time consuming validation experiments [48] to ensure that the Competing interests predictor will not generate a misleading conclusion. In The authors declare that they have no competing interests. our approach, we have avoided such concerns. While the training set is still used to generate a new image point in

Page 25 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

Authors' contributions 20. M Fröhlich H, Wegner JK, Sieker F, Zell A: Optimal assignment kernels for attributed molecular graphs. Bonn, Germany WWLW, with the help and under the supervision of FJB, :225-232. developed and implemented all the inverse-QSAR meth- 21. Mahe P, Ueda N, Akutsu T, Perret JL, Vert JP: Graph kernels for ods presented in the paper, and ran the experiments. Both molecular structure-activity relationship analysis with sup- port vector machines. J Chem Inf Model 2005, 45:939-951. authors contributed in writing the initial manuscript, and 22. Ralaivola L, Swamidass SJ, Saigo H, Baldi P: Graph kernels for both approved the final manuscript. chemical . Neural Net 2005, 18:1093-1110. 23. Mahe P, Ralaivola L, Stoven V, Vert JP: The pharmacophore ker- nel for virtual screening with support vector machines. J References Chem Inf Model 2006, 46:2003-2014. 1. Sharp KA: Potential functions for virtual screening and ligand 24. Burkowski FJ, Wong WWL: Predicting multiple binding modes binding calculations: Some theoretical considerations. In Vir- in QSAR studies using a vector Space model molecular tual Screening in Drug Discovery Edited by: Alvarez J, Shoichet B. New descriptor in reproducing kernel Hilbert space. International York: Taylor & Francis; 2005:229-248. Journal of and 2009 in press. 2. Todeschini R, Consonni V: Handbook of molecular descriptors Wein- 25. Gillet VJ, Willett P, Bradshaw J: Similarity searching using heim: Wiley-VCH; 2000. reduced graphs. J Chem Inf Comput Sci 2003, 43:338-345. 3. Faulon JL, Brown W, Martin S: Reverse engineering chemical 26. Glenn RC, Bender A, Arnby CH, Carlsson L, Boyer S, Smith J: Circu- structures from molecular descriptors: how many solutions? lar fingerprints: Flexible molecular descriptors with applica- J Comput-Aided Mol Des 2005, 19:637-650. tions from to ADME. IDrugs 2006, 4. Lewis RA: A general method for exploiting QSAR models in 9:199-204. lead optimization. J Med Chem 2005, 48:1638-1648. 27. Sutherland JJ, O'Brien LA, Weaver DF: A comparison of methods 5. Brown N, McKay B, Gasteiger J: A novel workflow for the inverse for modeling quantitative structure-activity relationships. J QSPR problem using multi-objective optimization. J Comput- Med Chem 2004, 47:5541-5554. Aided Mol Des 2006, 20:333-341. 28. Shawe-Taylor J, Cristianini N: Kernel methods for pattern analysis Cam- 6. Masek BB, Shen L, Smith KM, Pearlman RS: Sharing chemical infor- bridge, UK: Cambridge University Press; 2004. mation without sharing chemical structure. J Chem Inf Model 29. Kwok JTY, Tsang IWH: The pre-image problem in kernel meth- 2008, 48:256-261. ods. IEEE Trans Neural Net 2004, 15:1517-1525. 7. Sheridan RP, Kearsley SK: Using a genetic algorithm to suggest 30. Mak B, Hsiao R, Ho S, Kwok JT: Embedded kernel eigen-voice combinatorial libraries. J Chem Inf Comput Sci 1995, 35:310-320. speaker adaptation and its implication to reference speaker 8. Venkatasubramanian V, Chan K, Caruthers JM: Evolutionary design weighting. IEEE Trans Speech Audio Proc 2006, 14:1267-1280. of molecules with desired properties using the genetic algo- 31. Liu Y, Zheng YF: Minimum enclosing and maximum excluding rithm. J Chem Inf Comput Sci 1995, 35:188-195. machine for pattern description and discrimination. 18th 9. Kvasnicka V, Pospichal J: Simulated annealing construction of International Conference on Pattern Recognition :129-132. molecular graphs with required properties. J Chem Inf Comput 32. Mika S, Schölkopf B, Smola JA, Müller K-R, Scholz M, Rätsch G: Ker- Sci 1996, 36:516-526. nel PCA and de-noising in feature spaces. Proceedings of the 10. Hall LH, Kier LB, Frazer JW: Design of molecules from quantita- 1998 conference on Advances in neural system II tive structure-activity relationship models .2. Derivation and :536-542. proof of information-transfer relating equations. J Chem Inf 33. Schölkopf B, Knirsch P, Smola AJ, Burges CJC: Fast approximation Comput Sci 1993, 33:148-152. of support vector kernel expansions, and an interpretation of 11. Hall LH, Dailey RS, Kier LB: Design of molecules from quantita- clustering as approximation in feature spaces. DAGM-Sympo- tive structure-activity relationship models .3. Role of higher- sium :125-132. order path counts: path 3. J Chem Inf Comput Sci 1993, 34. Bakir GH, Zien A, Tsuda K: Learning to find graph pre-images. 33:598-603. Lecture Notes in Computer Science 2004, 3175:253-261. 12. Kier LB, Hall LH, Frazer JW: Design of molecules from quantita- 35. Tatsuya A, Fukagawa D: Inferring a graph from path frequency. tive structure-activity relationship models .1. Information- Lecture Notes in Computer Science 2005, 3537:371-382. transfer between path and vertex degree counts. J Chem Inf 36. Tatsuya A, Fukagawa D: Inferring a chemical structure from a Comput Sci 1993, 33:143-147. feature vector based on frequency of labelled paths and 13. Skvortsova MI, Baskin II, Slovokho tova OL, Palyulin VA, Zefirov NS: small fragments. APBC 2007:165-174. Inverse problem in QSAR, QSPR studies for the case of top- 37. Robin JW: Introduction to Graph Theory Addison Wesley; 1996. ological indexes characterizing molecular shape (Kier 38. Cortes C, Mohri M, Weston J: A general regression technique indexes). J Chem Inf Comput Sci 1993, 33:630-634. for learning transductions. Proceedings of the 22nd international 14. Churchwell CJ, Rintoul MD, Martin S, Visco DP, Kotu A, Larson RS, conference on Machine learning; Bonn, Germany :153-160. Sillerud LO, Brown DC, Faulon JL: The signature molecular 39. Pevzner PA, Tang H, Waterman MS: An Eulerian path approach descriptor – 3. Inverse quantitative structure-activity rela- to DNA fragment assembly. Proc Natl Acad Sci USA 2001, tionship of ICAM-1 inhibitory peptides. J Mol Graphics Modell 98:9748-9753. 2004, 22:263-273. 40. Huang HC, Chamberlain TS, Seibert K, Koboldt CM, Isakson PC: Dia- 15. Faulon JL, Visco DP, Pophale RS: The signature molecular ryl indenes and benzofurans – novel classes of potent and descriptor. 1. Using extended valence sequences in QSAR selective cyclooxygenase-2 inhibitors. Bioorg Med Chem Lett and QSPR studies. J Chem Inf Comput Sci 2003, 43:707-720. 1995, 5:2377-2380. 16. Faulon JL, Churchwell CJ, Visco DP: The signature molecular 41. Chavatte P, Yous S, Marot C, Baurin N, Lesieur D: Three-dimen- descriptor. 2. Enumerating molecules from their extended sional quantitative structure-activity relationships of cyclo- valence sequences. J Chem Inf Comput Sci 2003, 43:721-734. oxygenase-2 (COX-2) inhibitors: A comparative molecular 17. Faulon JL, Collins MJ, Carr RD: The signature molecular descrip- field analysis. J Med Chem 2001, 44:3223-3230. tor. 4. Canonizing molecules using extended valence 42. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: sequences. J Chem Inf Comput Sci 2004, 44:427-436. The chemistry development kit (CDK): an open-source Java 18. Azencott CA, Ksikes A, Swamidass SJ, Chen JH, Ralaivola L, Baldi P: library for chemo- and . J Chem Inf Comput Sci One- to four-dimensional kernels for virtual screening and 2003, 43:493-500. the prediction of physical, chemical, and biological proper- 43. Steinbeck C, Hoppe C, Kuhn S, Floris M, Guha R, Willighagen EL: ties. J Chem Inf Model 2007, 47:965-974. Recent developments of the chemistry development kit 19. Swamidass SJ, Chen J, Bruand J, Phung P, Ralaivola L, Baldi P: Kernels (CDK) – an open-source java library for chemo- and bioinfor- for small molecules and the prediction of mutagenicity, tox- matics. Curr Pharm Des 2006, 12:2111-2120. icity and anti-cancer activity. Bioinformatics 2005, 21(supple- 44. Leval X, Delarge J, Somers F, Tullio P, Henrotin Y, Pirotte B, Dogne ment 1):i359-i368. J: Recent advances in inducible cyclooxygenase (COX-2) inhi- bition. Curr Med Chem 2000, 7:1041-1062.

Page 26 of 27 (page number not for citation purposes) Journal of Cheminformatics 2009, 1:4 http://www.jcheminf.com/content/1/1/4

45. Reitz DB, Li JJ, Norton MB, Reinhard EJ, Collins JT, Anderson GD, Gregory SA, Koboldt CM, Perkins WE: Selective cyclooxygenase inhibitors: Novel 1,2-diarylcyclopentenes are potent and orally active COX-2 inhibitors. J Med Chem 1994, 37:3878-3881. 46. Müller K-R, Mika S, Räts ch G, Tsuda K, Schölkopf B: An introduc- tion to kernel-based learning algorithms. IEEE Trans Neural Net 2001, 12:181-201. 47. Good AC, Hermsmeier MA: Measuring CAMD technique per- formance. 2. How "druglike" are drugs? Implications of Ran- dom test set selection exemplified using druglikeness classification models. J Chem Inf Model 2007, 47:110-114. 48. Good AC, Hermsmeier MA, Hindle SA: Measuring CAMD tech- nique performance: a virtual screening case study in the design of validation experiments. J Comput-Aided Mol Des 2005, 18:529-536.

Publish with ChemistryCentral and every scientist can read your work free of charge Open access provides opportunities to our colleagues in other parts of the globe, by allowing anyone to view the content free of charge. W. Jeffery Hurst, The Hershey Company. available free of charge to the entire scientific community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours you keep the copyright

Submit your manuscript here: http://www.chemistrycentral.com/manuscript/

Page 27 of 27 (page number not for citation purposes)