
News The Newsletter of the CDK Project Volume 3/1, March 2006

Contents of this issue:

• Editorial
• A Protocol for Descriptor QA
• Validation of the CDK Surface Area Routine
• Improving the CDK implementation of the XlogP Descriptor
• QA of the XlogP Descriptor
• Development Tools. 2. Documentation
• Literature
• iBabel
• An Applet Release of JChemPaint
• Frequently Asked Questions

Editorial

by Egon Willighagen

Welcome to the sixth issue of CDK News, the first issue of the third volume. This issue focuses on validation of the QSAR descriptors implemented in the CDK. Fechner and Guha propose a scheme to validate QSAR descriptors in the CDK, which Fechner and Grabowski use to validate the LogP descriptors in the CDK. Additionally, Hoppe analyzes the algorithm in the CDK used to calculate the LogP descriptor, and Guha studies the behavior of the TPSA descriptor. This issue also features two applications of the CDK: one article discusses the iBabel program developed by Swain, while a second discusses the JChemPaint applet. Finally, a tutorial explains what JavaDoc is and how it should be used in the CDK.

Very recently, a wiki for the CDK has gone live, at http://cdk.sf.net/wiki/. I would welcome all readers to discuss articles on this wiki. This might prove especially useful for articles that discuss source code. As the CDK code base is not static, and neither are other libraries, the source code examples might need to be updated now and then. The wiki is a good place to aggregate those updates.

As I explained previously, the CDK News now requires InChIs to be stated in the article each time a distinct (small) molecular structure is given. The InChIs are normally given in the bibliography, allowing authors to just use names or IDs in the article itself. To help authors, the CDK News stylesheet now contains a new inchi command which will also create a link to search for more information about the compound using Google.com. For example, a BibTeX entry may look like:

    @MISC{methane,
      title = "Methane",
      note = "\inchi{1/CH4/h1H4}"
    }

The next issue is scheduled for June/July 2006. Given the large number of papers in each issue, we will try to start releasing four issues a year, instead of one every four months as in the past. This will likely make the issues smaller, but, more importantly, reduce the time to publication. This also means that we can no longer promise that an article will be published in the next issue; the sooner you submit, the larger the chance that it will have gone through the review process in time. As always, submissions may include comments on current code, discuss certain algorithms in general, or just describe a piece of work related to the CDK library.

Egon Willighagen
Radboud University Nijmegen, The Netherlands
[email protected]

Front Page

The front page shows the JChemPaint applet in action on the NMRShiftDB website (http://www.nmrshiftdb.org/), as discussed in the article An Applet Release of JChemPaint, by Kuhn, elsewhere in this issue.

CDK News ISSN 1614-7553 Vol. 3/1, March 2006 3

A Protocol for Descriptor QA

We discuss a quality assurance (QA) protocol that has been established to validate the CDK descriptor implementations against corresponding implementations in commercially available packages. In particular, we compiled two datasets that we recommend for descriptor validation tasks.

by Uli Fechner and Rajarshi Guha

Currently, the CDK provides implementations of 52 descriptors. These are subdivided into two groups: 20 descriptors that calculate values for single atoms (org.openscience.cdk.qsar.descriptors.atomic), and 32 that provide descriptor values for whole molecules (org.openscience.cdk.qsar.descriptors.molecular). Though the descriptor implementations have unit tests associated with them, these are usually few in number. As a result, a comprehensive validation of the CDK descriptors has not been performed. This may not be very important for simple descriptors such as one that counts the number of atoms (AtomCountDescriptor) or computes the molecular weight (WeightDescriptor). However, for more complex descriptor classes that encompass several hundred lines of code and perform non-trivial tasks, validation is necessary for users to be able to rely on the CDK implementation. For example, the XlogP descriptor (XLogPDescriptor) comprises over 1,400 lines of code and carries out the recognition of nearly 100 different atom types.

To encourage confidence in the CDK descriptor implementations we decided to start a quality assurance (QA) project for validation purposes. We developed a protocol that lays out the general procedure for a descriptor QA task. Another article in this issue makes use of our descriptor QA protocol and validates the CDK XlogP descriptor.

Our first task was the compilation of a suitable dataset that can be used for all descriptor QAs. We downloaded the drug-like subset (subset 3, last updated on 03/03/2005, 2,066,905 compounds) of the ZINC database [1] in SMILES format. This subset includes all compounds of the ZINC database that do not violate any rule of the rule-of-five [2], i.e., compounds having a predicted logP value smaller than or equal to 5, a molecular weight of at most 500, not more than 5 hydrogen bond donors, and at most 10 hydrogen bond acceptors. The SMILES file was then converted to an SD file using the commercially available program Cliff (version 1.14) [3]. All hydrogens were stripped from the SD file and nitrogens were uniformly written in the penta style. Then, the topological pharmacophore-based atom-pair descriptor CATS was computed for all compounds [4]. We employed standard parameters for computation of the CATS descriptor: the considered topological distance of atom-pairs ranged from 0 to 9 bonds, and the value for each of the 15 possible pairs of potential pharmacophore points (PPPs) was divided by the summed occurrences of the two respective PPPs. Please refer to Ref. [5] for a more detailed explanation of the applied scaling scheme.

The CATS descriptors were then used to select a diverse subset of 1,100 compounds. A Java implementation of the MaxMin algorithm was used for this purpose [6]. Though our goal was to create a QA dataset containing 1,000 structures, we initially selected 1,100 structures so as to have a number of backup structures in case any of the following tasks were unable to deal with all structures.

After obtaining a diverse subset we modified it to yield two descriptor QA datasets that differed in terms of hydrogens (present and absent). Again, Cliff was used to add hydrogens.

Next, three-dimensional coordinates (one conformer per compound) were generated for the two datasets using Corina (version 3.20) [3]. Corina was unable to generate three-dimensional coordinates for six of the 1,100 structures. Visual inspection revealed that these structures contained a significant number of atoms that were not part of a ring system. As Corina starts conformer generation using ring templates and then minimizes non-ring atoms, it may fail on structures with many non-ring atoms. We removed the six structures without three-dimensional coordinates and 94 more from both datasets to finally yield two datasets of 1,000 structures. These datasets are publicly available in SDF format and are deposited in CDK’s CVS repository (module cdk-qa).

Validation of a CDK descriptor is then performed by computing this particular descriptor using the CDK and another program (the comparison software), such as MOE [7] or Dragon [8], for one of the two descriptor QA datasets. A detailed comparison between the descriptor values of CDK and the comparison software includes

• a plot of the CDK descriptor values versus the values obtained from the comparison software
• root mean square error (RMSE)
• the median, maximum and minimum differences
• the percentage of compounds for which the descriptor values differ significantly (e.g., by more than 10 percent)
• noticeably outlying compounds
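The numerical items in this list are straightforward to compute once the paired descriptor values are available. The following is a minimal sketch in Python, not part of the CDK or of the QA protocol itself: the function name compare_descriptor_values, its parameters, the treatment of zero reference values and the returned keys are all assumptions made for illustration. It also returns the slope and intercept of a least-squares line through the paired values, which can be inspected alongside the other statistics.

```python
import math
import statistics


def compare_descriptor_values(cdk_values, ref_values,
                              rel_threshold=0.10, n_outliers=5):
    """Summarise paired descriptor values (CDK vs. comparison software).

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    if len(cdk_values) != len(ref_values) or len(cdk_values) < 2:
        raise ValueError("need two equally long lists with at least two values")

    # Signed differences, CDK minus comparison software.
    diffs = [c - r for c, r in zip(cdk_values, ref_values)]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    def differs(c, r):
        # Relative deviation from the reference value; if the reference is
        # exactly zero the absolute difference is used instead (assumption).
        scale = abs(r) if r != 0 else 1.0
        return abs(c - r) / scale > rel_threshold

    n_significant = sum(1 for c, r in zip(cdk_values, ref_values)
                        if differs(c, r))

    # Ordinary least-squares line: cdk = slope * ref + intercept.
    mean_x = statistics.fmean(ref_values)
    mean_y = statistics.fmean(cdk_values)
    sxx = sum((x - mean_x) ** 2 for x in ref_values)
    if sxx == 0:
        raise ValueError("reference values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(ref_values, cdk_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Indices of the compounds with the largest absolute differences.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]),
                   reverse=True)

    return {
        "rmse": rmse,
        "median_diff": statistics.median(diffs),
        "max_diff": max(diffs),
        "min_diff": min(diffs),
        "percent_significant": 100.0 * n_significant / len(diffs),
        "slope": slope,
        "intercept": intercept,
        "outliers": order[:n_outliers],
    }


if __name__ == "__main__":
    # Toy values only; not taken from any QA dataset.
    print(compare_descriptor_values([1.0, 2.0, 3.0, 5.0],
                                    [1.0, 2.0, 3.0, 4.0]))
```

The 10 percent threshold is exposed as a parameter because what counts as a significant difference will likely depend on the descriptor being validated.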


In addition to the points listed above, it might be worthwhile to derive a linear regression between CDK descriptor values and those computed by the comparison software. Such a linear regression yields a straight line. Examination of its intercept and slope adds another aspect to the results of a descriptor validation task. An intercept different from zero denotes that the validated CDK descriptor exhibits systematically higher or lower values than the ones calculated by the comparison software. A slope different from 1.0 (corresponding to 45 degrees) indicates proportionality but inequality between values of a descriptor computed by the CDK and by the comparison software.

Scrutinizing compounds that yield noticeably different descriptor values may lead to the detection of possible causes. Mistakes in the CDK implementations may be revealed and fixed in the QA process. In addition, differences in descriptor values may arise due to aspects of the CDK not directly related to the descriptor implementation. For instance, a limitation in the aromaticity detection routine in the CDK would lead to an error in the perception of one or more TPSA atom environments and therefore to a wrong TPSA value.

We hope to show the strengths of the CDK descriptors, improve their quality and thus encourage their everyday use. We would very much welcome anyone who is interested in joining us; there are still quite a few descriptors left for QA!

Uli Fechner
Goethe-University Frankfurt, Germany
[email protected]

Rajarshi Guha
Pennsylvania State University
[email protected]

Bibliography

[1] John J. Irwin and Brian K. Shoichet. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. J. Chem. Inf. Model., 45:177–182, 2005.

[2] C.A. Lipinski et al. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug. Del. Rev., 23:3–25, 1997.

[3] Molecular Networks GmbH - Computerchemie. http://www.mol-net.com/, January 2006.

[4] G. Schneider et al. "Scaffold-Hopping" by topological pharmacophore search: A contribution to virtual screening. Angew. Chemie Int. Ed., 38:2894–2896, 1999.

[5] U. Fechner and G. Schneider. Optimization of a pharmacophore-based correlation vector descriptor. QSAR Comb. Sci., 23:19–22, 2004.

[6] M. Schmuker, A. Givehchi, and G. Schneider. Impact of different software implementations on the performance of the Maxmin method for diverse subset collection. Mol. Divers., 8:421–425, 2004.

[7] Molecular Operating Environment, Chemical Computing Group. http://www.chemcomp.com/, January 2006.

[8] Dragon, TALETE SRL. http://www.talete.mi.it/, January 2006.


Validation of the CDK Surface Area Routine

A comparison of the surface areas for two datasets using the numerical CDK routine and the analytical algorithm implemented in SAVOL.

by Rajarshi Guha

Introduction

Molecular surface areas and volumes play an important role in many modeling tasks. Applications include the analysis of solvent accessible pockets in proteins and the evaluation of molecular descriptors. There are a number of algorithms available to evaluate molecular surface areas [1, 2, 3, 4, 5]. These algorithms can be divided into two classes: numerical and analytical. The former generally approximates a molecular surface using a set of discrete points whereas the latter evaluates the molecular surface by using a collection of spheres which are defined using the centers, radii and arcs representing boundaries. In addition, tori are included to take cavities into account. The NumericalSurface class of the CDK library provides a numerical method to evaluate molecular surface areas for the whole molecule as well as atom-wise surface areas.

The CDK Implementation

The implementation of surface area calculations in the CDK is a reimplementation of the Double Cubic Lattice Method (DCLM) [6] and is based on the Python implementation of this method by McClusky in the MMTK [7, 8]. Very briefly, the algorithm starts by considering the points representing the corners of a unit icosahedron. These points are used to generate a new set of points, lying on the surface of the icosahedron, using a recursive approach. This procedure is termed tessellation. One could start from a tetrahedron or cube, which would result in a slightly faster running time. Alternatively one could use a dodecahedron as the starting point to obtain a closer approximation to the sphere. The selection of the icosahedron represents a balance between accuracy and speed. It should be noted that the McClusky implementation uses a simplified version of the tessellation algorithm described by Eisenhaber, leading to a slight loss of accuracy. As a result the CDK implementation utilizes an icosahedral starting point for the generation of the points on the unit sphere.

Once a sufficient number of recursive tessellations have been performed we have a set of points which represent (approximately) the surface of a unit sphere. These points are then translated and scaled and used to model the surface of the atoms in a molecule. For each atom, the accessible surface is determined using a probe of specified radius. The default probe radius in the CDK is 1.4Å (corresponding to a water molecule). In addition, though higher levels of tessellation result in a denser set of surface points for the unit sphere, this can exponentially increase the time required to perform the calculation. In general 4 levels of tessellation appear to result in sufficient accuracy in a reasonable amount of time.

The above discussion highlights the two main parameters of this algorithm: the number of times the tessellation is performed (termed the tessellation level) and the probe radius. The rest of this article describes how these parameters affect the accuracy of the algorithm when applied to real datasets.

Methodology & Datasets

In general one expects that analytical algorithms will be more accurate than numerical algorithms. Though this is not always the case, due to the various parameters that are used to control both numerical and analytical algorithms, we chose to compare the results of the CDK surface area routine to an analytical implementation. For this purpose we selected the SAVOL [9] program which is a component of the ADAPT software package. The only parameter required for the usage of SAVOL is the probe radius. We used the default value of 1.5Å. Another factor that can affect the comparison is the set of Van der Waals radii of the atoms. Prior to the calculations we checked the values of the radii used in SAVOL and in the CDK implementation to ensure that the values were consistent.

[Figure 1 panels: BP Dataset; ZINC Dataset. Axes: Molecular Weight (x) versus Frequency (y).]

Figure 1: Histograms of the molecular weight distribution of the datasets used in this study
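The tessellation and probe-based surface calculation described under The CDK Implementation can be illustrated with a short Python sketch. It is not the CDK code and it does not reproduce the lattice bookkeeping of the DCLM: it uses brute-force point counting, projects each edge midpoint onto the unit sphere, and weights all points equally. These choices, the function names, and the mapping of "level" to the number of subdivisions are assumptions made for illustration.

```python
import math


def _normalize(v):
    n = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    return (v[0] / n, v[1] / n, v[2] / n)


def icosahedron():
    """Vertices (on the unit sphere) and triangular faces of an icosahedron."""
    t = (1.0 + math.sqrt(5.0)) / 2.0
    raw = [(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
           (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
           (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return [_normalize(v) for v in raw], faces


def tessellate(level):
    """Unit-sphere points after `level` recursive subdivisions."""
    verts, faces = icosahedron()
    for _ in range(level):
        cache = {}  # shared edge -> index of its midpoint vertex

        def midpoint(i, j):
            key = (min(i, j), max(i, j))
            if key not in cache:
                a, b = verts[i], verts[j]
                mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2, (a[2] + b[2]) / 2)
                verts.append(_normalize(mid))  # project onto the sphere
                cache[key] = len(verts) - 1
            return cache[key]

        new_faces = []
        for a, b, c in faces:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
        faces = new_faces
    return verts


def accessible_surface_area(atoms, probe_radius=1.4, level=4):
    """atoms: list of ((x, y, z), vdw_radius). Returns (total, per_atom)."""
    sphere = tessellate(level)
    per_atom = []
    for i, (ci, ri) in enumerate(atoms):
        big_r = ri + probe_radius
        exposed = 0
        for p in sphere:
            # Translate and scale the unit-sphere point onto atom i.
            q = (ci[0] + big_r * p[0], ci[1] + big_r * p[1],
                 ci[2] + big_r * p[2])
            buried = False
            for j, (cj, rj) in enumerate(atoms):
                if j == i:
                    continue
                d2 = sum((q[k] - cj[k]) ** 2 for k in range(3))
                if d2 < (rj + probe_radius) ** 2:
                    buried = True
                    break
            if not buried:
                exposed += 1
        per_atom.append(4.0 * math.pi * big_r ** 2 * exposed / len(sphere))
    return sum(per_atom), per_atom


if __name__ == "__main__":
    for n in range(5):
        print(n, len(tessellate(n)))  # point count grows as 10 * 4**n + 2
```

With this subdivision scheme the number of points roughly quadruples with each level, which is one way to see why the running time rises so steeply with the tessellation level.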


[Figure 2 panels: Tessellation level = 3 (R² = 0.97); Tessellation level = 4 (R² = 0.98); Tessellation level = 5 (R² = 0.98). Axes: SAVOL SA (x) versus CDK SA (y).]

Figure 2: Plots of the SA calculated by SAVOL versus the numerical CDK algorithm for the boiling point dataset. The probe radius for the CDK algorithm was fixed at 1.5Å and the tessellation level was varied from 3 to 5.

[Figure 3 panels: Probe Radius = 1.0Å (R² = 0.976); 1.1Å (R² = 0.977); 1.2Å (R² = 0.98); 1.3Å (R² = 0.979); 1.4Å (R² = 0.981); 1.5Å (R² = 0.98). Axes: SAVOL SA (x) versus CDK SA (y).]

Figure 3: Plots of the SA calculated by SAVOL versus the numerical CDK algorithm for the boiling point dataset. The tessellation level for the CDK algorithm was fixed at 4 and the radius was varied from 1.0Å to 1.5Å.

We considered two datasets. The first dataset, termed the boiling point dataset, consisted of 277 molecules [10] with an average molecular weight of 115 (σ = 42.7). This dataset consisted mainly of substituted hydrocarbons ranging in size from C1 to C8. The second dataset consisted of 1000 molecules taken from the ZINC database. The ZINC ID codes for this dataset are available as supplementary information. The average molecular weight of this dataset was 249 (σ = 52.9). Fig. 1 displays the histogram of the molecular weights of these two datasets.


[Figure 4 panels: Tessellation level = 3 (R² = 0.97); Tessellation level = 4 (R² = 0.98); Tessellation level = 5 (R² = 0.98). Axes: SAVOL SA (x) versus CDK SA (y).]

Figure 4: Plots of the SA calculated by SAVOL versus the numerical CDK algorithm for the ZINC dataset. The probe radius for the CDK algorithm was fixed at 1.5Å and the tessellation level was varied from 3 to 5.

[Figure 5 panels: Probe Radius = 1.0Å (R² = 0.976); 1.1Å (R² = 0.977); 1.2Å (R² = 0.98); 1.3Å (R² = 0.979); 1.4Å (R² = 0.981); 1.5Å (R² = 0.98). Axes: SAVOL SA (x) versus CDK SA (y).]

Figure 5: Plots of the SA calculated by SAVOL versus the numerical CDK algorithm for the ZINC dataset. The tessellation level for the CDK algorithm was fixed at 4 and the radius was varied from 1.0Å to 1.5Å.
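The panel titles of Figures 2 to 5 quote R² values for the agreement between the two programs. The article does not state how these were obtained; a common choice, assumed here, is the squared Pearson correlation coefficient of the paired surface areas. The sketch below computes that quantity together with a mean signed deviation, which gives a quick indication of whether the numerical values lie above or below the analytical ones on average. Both function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences with >= 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


def mean_signed_deviation(numerical, analytical):
    """Average of (numerical - analytical); positive means overestimation."""
    if len(numerical) != len(analytical) or not numerical:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(a - b for a, b in zip(numerical, analytical)) / len(numerical)


if __name__ == "__main__":
    # Toy numbers only; not taken from the datasets in this article.
    analytical = [200.0, 300.0, 400.0]
    numerical = [210.0, 330.0, 420.0]
    print(r_squared(analytical, numerical))
    print(mean_signed_deviation(numerical, analytical))
```

A high R² alone does not rule out a systematic offset, which is why the signed deviation is reported separately.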

Results

One of the main differences between a numerical and an analytical surface area algorithm is that the former generally overestimates surface areas compared to the latter. This is especially so for the CDK implementation since, for a given molecule, the same number of points is used to represent the surface of every atom, irrespective of its size. In addition one would expect greater deviations for larger molecules.

Figures 2 and 3 summarize the results for the boiling point dataset. Figure 2 compares the total molecular surface areas obtained from SAVOL to those obtained from the CDK algorithm. In the graphs the probe radius for the CDK algorithm is fixed at 1.5Å and the tessellation level is varied from 3 to 5. It is clear that at a tessellation level of 3, the numerical surface areas exhibit a relatively wide spread. However, for tessellation levels of 4 and above there is no appreciable difference. We also examined the accuracy at tessellation levels of 6 and 7 and, though a small increase in accuracy was observed, the time required for the calculation was significantly longer.

Figure 3 compares the results of the numerical surface algorithm to the analytical algorithm for varying probe radii. The probe radius for SAVOL was fixed at 1.5Å and that for the numerical algorithm was varied from 1.0Å to 1.5Å. It is apparent that at lower radii the CDK algorithm underestimates the surface area of the bulk of the dataset whereas for higher radii it overestimates the surface areas. This is to be expected since a larger probe will cover a larger region of the surface, resulting in effectively larger surface areas. However, it is surprising that the results from the CDK algorithm appear to match those from SAVOL best when the CDK algorithm uses a probe radius of 1.4Å. In general it is seen that the most outlying points in Figs. 2 and 3 correspond to the largest molecules in the dataset.

Similar results were observed for the ZINC subset. Figs. 4 and 5 compare the numerical and analytical surface area algorithms for the ZINC subset. As noted previously, the average molecular weight for this dataset was larger than for the previous dataset. Furthermore, the bulk of the dataset was skewed to larger molecular weights as shown in Fig. 1. Thus one would expect that, in general, the CDK algorithm would exhibit a greater degree of overestimation of the surface areas for this dataset. This is highlighted in Fig. 4, which plots the surface areas calculated by the CDK versus those obtained from SAVOL at a probe radius of 1.5Å and varying tessellation levels. As before, increasing tessellation levels beyond 5 did not result in a significant increase in accuracy but did increase the time required to evaluate the surface areas. It is clear that the bulk of the dataset is overestimated by the CDK algorithm. However, the plot also highlights a number of molecules which are significantly underestimated by the CDK algorithm. These correspond to the ZINC IDs 185 [11], 500 [12], 891 [13], 1545 [14] and 2287 [15].

It is interesting to note that four of these (185, 891, 1545 and 2287) have the same steroid scaffold, as shown in Fig. 6. Though 500 does not share the same backbone, it does have a number of features in common with the other four, such as six-membered rings and a keto group. However, both of its rings are aromatic and are linked by an sp2 carbon. Thus the scaffold of 500 is planar whereas the scaffolds of the other four molecules are non-planar. Why these molecules are underestimated to such a degree by the CDK algorithm is unclear, though one possible explanation is that a pocket can be observed in the 3D space filling view of the structures. One would expect that pockets would not be very well modeled by the implementation of the numerical algorithm used in the CDK due to the fixed number of points per unit sphere. However, this does not fully explain why 500 is also severely underestimated, as such a pocket is not very clear for this structure.

Fig. 5 displays the results obtained from the CDK algorithm using different probe radii with the tessellation level fixed at 4. As before it appears that the results from the CDK algorithm exhibit a better correspondence to the results from SAVOL when the probe radius is lower than that used in SAVOL. Specifically, for the 1.5Å probe radius used in SAVOL, the CDK algorithm overestimates the surface areas, whereas this occurs to a lesser extent at lower radii. Clearly, a lower probe radius makes up for the inherent overestimation of the numerical algorithm. It can also be seen that the plots exhibit a high degree of correlation, though the outliers mentioned above are clearly visible.

[Figure 6: structure drawings labelled 185, 891, 1545, 2287 and 500]

Figure 6: The five structures from the ZINC subset [11, 12, 13, 14, 15] that were most underestimated by the CDK surface area algorithm.

Conclusions

This article has presented a comparison of the numerical surface area algorithm in the CDK to the analytical algorithm implemented in SAVOL. As expected, the numerical algorithm does overestimate the surface area, especially for larger

molecules. However, the correlation between the two approaches to surface area calculation is relatively high. This would suggest that the CDK algorithm is suitable for many purposes where absolute surface areas are not required (such as descriptor calculations). In summary, the CDK surface area algorithm is suitable for a variety of small molecule modeling purposes, though if larger molecules are considered, the surface area calculated by the CDK routine can be misleading.

Rajarshi Guha
Pennsylvania State University
[email protected]

Bibliography

[1] M.L. Connolly. Molecular Surfaces: A Review. Number 14 in Computational Chemistry. Network Science, April 1996. http://www.netsci.org/Science/Compchem/feature14.html.

[2] J.M. Blaney and J.S. Dixon. Distance Geometry in Molecular Modeling. In Reviews in Computational Chemistry, volume 5, pages 299–335. VCH, Weinheim, Germany, 1994.

[3] M. S. Chapman and M. Connolly. Molecular Surfaces: Calculations, Uses and Representations, volume F: Crystallography of Biological Macromolecules of International Tables for Crystallography, chapter 22.1.2. Kluwer, International Union of Crystallography, Chester, UK, July 2001.

[4] M. Gerstein and F. M. Richards. Protein Geometry: Distances, Areas, and Volumes, volume F: Crystallography of Biological Macromolecules of International Tables for Crystallography, chapter 22.1.1. Kluwer Academic, International Union of Crystallography, Chester, UK, July 2001.

[5] M.F. Sanner. Modelling and Applications of Molecular Surfaces. PhD thesis, University of Haute-Alsace, Mulhouse, France, 1992.

[6] F. Eisenhaber, P. Lijnzaad, P. Argos, C. Sander, and M. Scharf. The Double Cubic Lattice Method: Efficient Approaches to Numerical Integration of Surface Area and Volume and to Dot Surface Contouring of Molecular Assemblies. J. Comp. Chem., 16:273–284, 1995.

[7] K. Hinsen. The Molecular Modeling Toolkit: A New Approach to Molecular Simulations. J. Comp. Chem., 21:79–85, 2000.

[8] K. Hinsen. The Molecular Modeling Toolkit: A Case Study of a Large Scientific Application in Python. In Proc. of the 6th Intl. Python Conf., 1997. http://www.python.org/workshops/1997-10/proceedings/hinsen.html.

[9] R.S. Pearlman. Molecular Surface Areas and Volumes and Their Use in Structure-Activity Relationships. In S.H. Yalkowsky, A.A. Sinkula, and S.C. Valvani, editors, Physical Chemical Properties of Drugs. Marcel Dekker, New York, 1980.

[10] E.S. Goll and P.C. Jurs. Prediction of the Normal Boiling Points of Organic Compounds From Molecular Structures With a Computational Neural Network Model. J. Chem. Inf. Comput. Sci., 39:974–983, 1999.

[11] ZINC00000185. InChI=1/C21H30O4/c1-20-8-7-13(23)9-12(20)3-4-14-15-5-6-16(18(25)11-22)21(15,2)10-17(24)19(14)20/h9,14-17,19,22,24H,3-8,10-11H2,1-2H3/t14-,15+,16+,17-,19+,20-,21+/m1/s1.

[12] ZINC00000500. InChI=1/C15H14O3/c1-10-3-5-11(6-4-10)15(17)13-8-7-12(18-2)9-14(13)16/h3-9,16H,1-2H3.

[13] ZINC00000891. InChI=1/C21H32O3/c1-12(22)16-6-7-17-15-5-4-13-10-14(23)8-9-20(13,2)19(15)18(24)11-21(16,17)3/h13-17,19,23H,4-11H2,1-3H3/t13-,14-,15-,16-,17-,19+,20-,21+/m0/s1.

[14] ZINC00001545. InChI=1/C21H30O3/c1-13(22)21(24)11-8-18-16-5-4-14-12-15(23)6-9-19(14,2)17(16)7-10-20(18,21)3/h12,16-18,24H,4-11H2,1-3H3/t16-,17+,18+,19+,20+,21-/m1/s1.

[15] ZINC00002287. InChI=1/C21H30O3/c1-12(22)16-6-7-17-15-5-4-13-10-14(23)8-9-20(13,2)19(15)18(24)11-21(16,17)3/h10,15-19,24H,4-9,11H2,1-3H3/t15-,16-,17-,18-,19+,20-,21+/m0/s1.


Improving the CDK implementation of the XlogP Descriptor

In this article we present recent work that is aimed at the improvement of the CDK implementation of the XlogP descriptor. We present the results of a comparison of the CDK's XLogP descriptor implementation against the XLOGP program and against the logP value as calculated by MOE.

by Christian Hoppe

The octanol-water partition coefficient logP is an important property of drug molecules. The logP serves as a quantitative descriptor of molecular hydrophobicity. Hydrophobicity is related to drug absorption, bioavailability, hydrophobic drug-receptor interactions, metabolism of molecules and toxicity. Therefore logP is commonly used in QSAR and rational drug design experiments. Because the measurement of logP values is not straightforward, in silico methods for its prediction are of major interest. For the CDK we chose to implement the XlogP prediction which was developed by Wang et al. and is documented in two papers [4, 3]. XLOGP is free for academics, but it is not open source.

We downloaded their program XLOGP (Version 2.1) and their training set from ftp://ftp2.ipc.pku.edu.cn/pub/software/xlogp/. The training set consisted of 1854 structures in the Sybyl mol2 format. The conversion of the training set to MDL's SD format for the CDK was done with MOE. The implementation of the XlogP descriptor is part of the QSAR package and can be found under qsar.descriptors.molecular.XLogPDescriptor.

We validated the CDK implementation by calculating logP values for the training set with the CDK, the XLOGP program, and the logP descriptor of MOE [6].

Figure 1: The CDK XlogP descriptor versus the MOE logP descriptor calculated for 1854 structures.

The XLOGP program implements several correction factors, one of which - the parallel donor pair - is currently not implemented in the CDK version. Furthermore, in some cases the atom type descriptions in the paper were not sufficient to do an exact atom type classification, or differ from the ones of the XLOGP program. As part of this research the CDK implementation was changed at some points to follow the XLOGP program:

• Atom type 7 was assigned a value of -0.317
• Atom type 81 was assigned a value of -0.447
• Atom type 83 was assigned a value of 0.512
• hydrophobic carbon was changed to 1-3 relationship
• the atom type descriptor pi-system no longer considers P or S
• if an atom type belongs to a ring system, this ring system must have more than three members
• if an atom type belongs to an aromatic system, this aromatic system must have more than five members
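The atom-additive idea behind such contribution values can be illustrated with a minimal sketch: the estimate is the sum of each atom type's contribution multiplied by how often that type occurs in the molecule. This is not the CDK's XLogPDescriptor code. The class and method names are hypothetical, the atom type counts are invented for illustration, correction factors are left out, and only the three contribution values are taken from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AtomAdditiveSketch {

    /**
     * Sums contribution * occurrence count over all atom types found in a molecule.
     * Correction factors are deliberately left out of this sketch.
     */
    public static double sumContributions(Map<Integer, Double> contributions,
                                          Map<Integer, Integer> counts) {
        double logP = 0.0;
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            Double value = contributions.get(entry.getKey());
            if (value == null) {
                throw new IllegalArgumentException(
                        "No contribution known for atom type " + entry.getKey());
            }
            logP += value * entry.getValue();
        }
        return logP;
    }

    public static void main(String[] args) {
        // Contribution values for atom types 7, 81 and 83 as listed in the article.
        Map<Integer, Double> contributions = new LinkedHashMap<>();
        contributions.put(7, -0.317);
        contributions.put(81, -0.447);
        contributions.put(83, 0.512);

        // Hypothetical atom type counts for an imaginary molecule (assumption).
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        counts.put(7, 2);
        counts.put(83, 1);

        System.out.println(sumContributions(contributions, counts));
    }
}
```

Keeping the contribution table separate from the atom typing makes it easy to see the effect of changing a single value, which is the kind of adjustment the list above describes.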


Figure 2: The CDK XlogP descriptor versus the XLOGP v2.1 descriptor calculated for 1854 structures.
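Comparisons such as those plotted in Figures 1 and 2 rest on paired predictions for the same structures. The sketch below shows one way to summarise such pairs as an average and a maximum deviation. It assumes that "deviation" means the absolute difference between two predictions, which the article does not state explicitly; the class name, method names and sample values are hypothetical.

```java
public class DeviationSketch {

    /** Mean of |a[i] - b[i]| over all pairs (assumed definition of average deviation). */
    public static double averageAbsDeviation(double[] a, double[] b) {
        checkLengths(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum / a.length;
    }

    /** Largest |a[i] - b[i]| over all pairs. */
    public static double maxAbsDeviation(double[] a, double[] b) {
        checkLengths(a, b);
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    private static void checkLengths(double[] a, double[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("Need two non-empty arrays of equal length");
        }
    }

    public static void main(String[] args) {
        // Invented values standing in for two programs' logP predictions.
        double[] first = {1.0, 2.0, 3.0};
        double[] second = {1.5, 2.0, 2.0};
        System.out.println("average: " + averageAbsDeviation(first, second));
        System.out.println("maximum: " + maxAbsDeviation(first, second));
    }
}
```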

However, some problems remain. The correction factor for salicylic acid as described in the paper is not used in all cases by the XLOGP program. It is implemented and can be used in the CDK version. Moreover, the value for internal bonds in the paper is given as 0.429, but for the molecule no454 of the training set it is given as 0.643. Another problem is the inconsistent classification of the amide atom type by the XLOGP program (see Figure 3). In the CDK implementation amide has the highest priority.

Figure 3: XLOGP v2.1 amide classification. Drawn on the left is 1,3-dimethyluracil [4].

Below are the results of the comparison of the three different logP implementations for the training set comprising 1854 structures. Figure 1 and Figure 2 depict the distribution of the calculated values and the linear regression line:

                   Avg. Deviation   Max. Deviation
    CDK vs. XLOGP  0.078            1.31
    CDK vs. MOE    0.23             2.39
    MOE vs. XLOGP  0.21             2.39

The greatest differences between CDK's implementation of the XlogP and XLOGP occur due to different atom type perception (e.g. amide). Thus, the linear regression between the calculated values of CDK's XlogP and MOE's logP is more consistent than between the calculated values of CDK's XlogP and XLOGP. Unfortunately, it is not possible to tell if these inconsistencies in XLOGP are a feature, because the comparison between XLOGP and MOE's logP is a little better than CDK's XlogP compared to XLOGP. Therefore we would need to have experimental logP values, which are currently not available to us.

Dr. Christian Hoppe
Universität zu Köln, Germany
[email protected]

Bibliography

[1] R. Wang, Y. Gao, and L. Lai. Calculating partition coefficient by atom-additive method. Perspectives in Drug Discovery and Design, 19:47–66, 2000.

[2] R. Wang, Y. Fu, and L. Lai. A New Atom-Additive Method for Calculating Partition Coefficients. Journal of Chemical Information and Computer Sciences, 37:615–621, 1997.

[3] The Molecular Operating Environment 2004.03, Chemical Computing Group Inc. http://www.chemcomp.com/, March 2004.

[4] Uracil. InChI=1/C6H8N2O2/c1-7-4-3-5(9)8(2)6(7)10/h3-4H,1-2H3.


QA of the XlogP Descriptor

This study is related to the work of Christian Hoppe that is published in the same issue (Improving the CDK implementation of the XlogP descriptor). Christian focused on a comparison between the original XlogP implementation and its implementation in CDK as well as Java code improvements that resulted thereof. We, on the other hand, applied the descriptor quality assurance (QA) protocol to CDK's XlogP descriptor using two log P predictions implemented by comparison software. Our study revealed that all three implementations tested varied significantly in their computed log P values. However, different descriptor values could not be attributed to errors in the implementations, but arise due to differing underlying QSPR models.

by Uli Fechner and Kristina Grabowski

Introduction

The octanol/water partition coefficient log P is a quantitative measure of a compound's lipophilicity. A log P value can be determined by an experiment in the laboratory or by an in silico prediction method. An in silico prediction of log P is a typical application of a quantitative structure-property relationship (QSPR) model [1]. Most in silico prediction techniques are either atom-based (commonly abbreviated by alogP) or fragment-based (clogP). Basically, the atom- and fragment-based approaches sum up lipophilicity contributions of individual atoms and fragments, respectively. The core of these methods is a suitable definition of a set of chemically distinct atom types or fragments. Log P contributions are then assigned to particular members of such a set by establishing a QSPR model using experimentally determined log P values. Additionally, some of these approaches apply so-called correction factors to take into account factors such as internal hydrogen bonds. A plethora of publications and software pinpoints the significance of in silico log P prediction; Ref [2] provides a comprehensive comparison.

We validated the CDK implementation of XlogP [3, 4] against two other log P prediction methods. The first one was the SlogP method developed by Wildman and Crippen [5] and implemented in MOE [6]. SlogP is an atom-based approach. As a second log P prediction approach we employed the ALOGPS program (version 2.1) [7] that is available on the Virtual Computational Chemistry Laboratory (VCCL) website (http://vcclab.org/) [8]. The ALOGPS program encodes compounds by their number of hydrogen and non-hydrogen atoms and 73 E-state indices. An E-state index - an abbreviation of electrotopological state index - of an atom is a numerical value that encodes information about both its electronic interaction and its topological environment [9]. An Associative Neural Network (ASNN) - a combination of k-nearest neighbor and artificial neural networks - is then employed to compute log P values.

Methods

Both descriptor QA datasets, i.e. the one with hydrogen atoms and the one without, were used. The two datasets are described in detail in another article in this issue (A Protocol for Descriptor QA).

The CDK XlogP descriptor has two parameters: the first one specifies whether aromaticity should be detected prior to calculation (default is false), the other one indicates if the salicylic acid correction factor should be applied (default is false). We set both parameters to true. As the XlogP descriptor only works with AtomContainers that comprise explicit hydrogens, we added them for the QA dataset without hydrogens:

    IValencyChecker checker =
        new ValencyHybridChecker();
    HydrogenAdder adder =
        new HydrogenAdder(checker);
    adder.addExplicitHydrogensToSatisfyValency(molecule);

Calculation of SlogP with MOE was straightforward as no parameters had to be adjusted. The ALOGPS program accepts only SMILES as an input format. Both datasets were converted from SDF to SMILES using the program sdf2smiles that is distributed with the ALOGPS program. Results for both QA datasets were identical for all three programs. Therefore, the following discussion is equally valid for each of the two datasets.

Results and Discussion

Both MOE and CDK came up with results for all compounds in the dataset. The ALOGPS program was unable to calculate log P values for 70 of the 1,000 compounds. Hence, a comparison between values yielded by ALOGPS and another program only considers the 930 compounds that could be processed by ALOGPS. Moreover, ALOGPS issued warnings for 502 compounds; such a warning indicates that the calculated value is not reliable.

Below are selected descriptive statistics of the log P value comparison. The RMSEs of 1.30 (CDK vs. SlogP), 1.37 (SlogP vs. ALOGPS), and 1.85 (CDK vs. ALOGPS) clearly demonstrate that all


three implementations compute rather different log P values. This is also pointed out by the unpleasantly high values for the 3rd quartile (1.40, 1.41, and 1.94). Maximum differences between two prediction approaches even range from 5.59 to 7.90.

Figure 1: Plot of A) the CDK descriptor values versus the values of ALOGPS (0.57), B) the CDK descriptor values versus the values of SlogP (0.78), and C) the SlogP values versus the values of ALOGPS (0.71). The numbers in braces indicate the respective correlation coefficient. The first bisecting line is shown in red.

                  CDK vs.   CDK vs.   SlogP vs.
                  ALOGPS    SlogP     ALOGPS
    RMSE          1.85      1.30      1.37
    1st Quartile  0.43      0.35      0.32
    Median        1.03      0.74      0.70
    3rd Quartile  1.94      1.41      1.40
    Maximum       7.90      5.59      5.88
    Minimum       0.00      0.00      0.00

The comparison of the CDK implementation of XlogP with MOE's SlogP performs very similarly to that of SlogP with ALOGPS. However, the comparison between CDK and ALOGPS shows significantly greater deviations of their log P predictions.

Figure 1 depicts three plots, each comparing descriptor values of two prediction approaches. The plots also clearly show that different prediction techniques yield different log P values. Plot A) (CDK vs. ALOGPS) exhibits the most obvious scattering around the bisecting line, thereby supporting the statistics from Table 1. A closer look at plots A) and C) reveals that ALOGPS seems to end up with smaller log P values than CDK and SlogP. However, a consistent and clear trend cannot be deduced.

The CDK descriptor QA protocol suggests looking at the ten most outlying compounds. As the three log P prediction methods are backed by different QSAR models, scrutinizing the outliers may allow these models to be compared but does not permit an assessment of the quality of the CDK XlogP implementation. Such an assessment can only be provided by a comparison of the CDK XlogP descriptor against the reference XlogP implementation of Wang et al. [4]. This aspect is discussed by another article in this issue titled "Improving the CDK implementation of the XlogP Descriptor".

Conclusion

Every QSAR model can only be as reliable as its underlying dataset. It was previously shown that the diversity of the 1,853 compounds that make up the XlogP dataset is limited [10]. This constraint may restrict the overall log P prediction quality of the XlogP approach and thus its implementation in CDK.

This study pointed out that the three employed log P prediction approaches led to significantly different values for the descriptor QA datasets. Nevertheless, it has also been demonstrated that the deviation of the CDK XlogP values from the SlogP and ALOGPS predictions is comparable to the mutual deviation of the SlogP and ALOGPS results.

Uli Fechner and Kristina Grabowski
Goethe-University Frankfurt, Germany
[email protected]
[email protected]

Bibliography

[1] A. R. Katritzky, V. S. Lobanov, and M. Karelson. QSPR: the correlation and quantitative prediction of chemical and physical properties from structure. Chem. Soc. Rev., 24:279–287, 1995.

[2] A. R. Katritzky et al. Structurally Diverse Quantitative Structure-Property Relationship Correlations of Technologically Relevant Physical Properties. J. Chem. Inf. Comput. Sci., 40:1–18, 2001.


[3] R. Wang, Y. Fu, and L. Lai. A New Atom-Additive Method for Calculating Partition Coefficients. J. Chem. Inf. Comput. Sci., 37:615–621, 1997.

[4] R. Wang, Y. Gao, and L. Lai. Calculating partition coefficient by atom-additive method. Perspectives in Drug Discovery and Design, 19:47–66, 2000.

[5] S. A. Wildman and G. M. Crippen. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci., 39:868–873, 1999.

[6] Molecular Operating Environment 2005.06, Chemical Computing Group. http://www.chemcomp.com, January 2006.

[7] I. V. Tetko and V. Y. Tanchuk. Application of Associative Neural Networks for Prediction of Lipophilicity in ALOGPS 2.1 Program. J. Chem. Inf. Comput. Sci., 42:1136–1145, 2002.

[8] I. V. Tetko et al. Virtual computational chemistry laboratory - design and description. J. Comput. Aided Mol. Des., 19:453–463, 2005.

[9] L. B. Kier and L. H. Hall. An Electrotopological State Index for Atoms in Molecules. Pharmaceutical Res., 7:801–807, 1990.

[10] I. V. Tetko, V. Y. Tanchuk, and A. E. P. Villa. Prediction of n-Octanol/Water Partition Coefficients from PHYSPROP Database Using Artificial Neural Networks and E-state Indices. J. Chem. Inf. Comput. Sci., 41:1407–1421, 2001.

Development Tools. 2. Java Documentation

This is the second article in a series describing development tools used to develop and maintain the CDK. JavaDoc is a well known utility for documenting source code, but few know the full power of the system, such as the ability to define dependencies between modules.

by Egon Willighagen

JavaDoc

JavaDoc creates an HTML view of the application programming interface (API) for available classes in a program or library. For example, the CDK's API is available as HTML created with JavaDoc too (http://cdk.sf.net/api/). JavaDoc allows documentation to be embedded in the Java source code, providing a nice development tool.

HyperText

JavaDoc recognizes documentation in Java source code as a special kind of comment. A normal comment is wrapped between /* and */. By adding an extra asterisk to the starting tag, JavaDoc is created:

    /**
     * This method calculates the molecular mass.
     *
     * @param molecule IMolecule for which the
     *                 mass is calculated
     */
    public double calculateMolMass(
        IMolecule molecule) {
        // do something
    }

It's important to realize that the JavaDoc documentation in the source code is expected to be formatted as HTML. This means, for example, that special characters must be escaped! The open source GNU Java Compiler (gcj) [2] even fails on non-ASCII characters, reporting that they are in conflict with the Java language specification. It also means that HTML elements can be used for markup.

But the use of JavaDoc in the CDK is not limited to these features. It also makes extensive use of JavaDoc's taglet and doclet technologies.

JavaDoc Tags

JavaDoc tags are used to document the specifics of the API, like the return value and the method parameters, as shown by the @param tag in the previous example. A more comprehensive example, taken from AtomContainer (the HTML output of this is given in Figure 1):

    /**
     * Returns the atom parity for the given IAtom.
     * If no parity is associated with the given
     * IAtom, it returns null.
     *
     * @param atom Atom for which the parity must
     *             be returned
     * @return The IAtomParity for the given
     *         Atom, or null if that IAtom does
     *         not have an associated
     *         IAtomParity
     * @see #addAtomParity
     */
    public IAtomParity getAtomParity(IAtom atom) {
        return (IAtomParity)atomParities.get(atom);
    }

Figure 1: HTML output of the JavaDoc of the AtomContainer.getAtomParity() method.

An overview of often used tags that come with JavaDoc is given below. A full overview is available from [1].

@param For describing method parameters

@return For describing the returned value for a method

@exception For describing the exceptions thrown by the method

@see For pointing the reader to related methods or classes

@deprecated For marking the method or class as deprecated, optionally providing details describing what method or class should now be used

For class descriptions the following are useful too:

@author For stating who wrote the source code for this class

@since For describing since which version this method is available (Not used in CDK.)

It is interesting to note that @created is not amongst these tags, even though there is at least one integrated development environment that thinks it is a valid tag. The tag @cdk.created should be used instead. And this brings us to our next topic: taglets.

Taglets

Taglets are classes used within the JavaDoc architecture to convert tags to readable content to be output by JavaDoc itself. JavaDoc recognizes classes as taglets when they implement the Taglet interface from JavaDoc. An example is illustrative. For example, 'CDKBugTaglet.java' in the 'doc/javadoc/source' directory [3] looks like:

    public class CDKBugTaglet implements Taglet {

        private static final String NAME = "cdk.bug";

        public String getName() { return NAME; }

        public boolean inType() { return true; }

        public String toString(Tag tag) {
            return "This class is affected " +
                   "by these bug(s): " +
                   tag.text() + "\n";
        }
    }

The getName() method defines the string used to identify the tag; @cdk.bug in this case. JavaDoc requires custom tags to contain a dot, and CDK adopted the custom to start tags with cdk.. The inType() method tells JavaDoc that this tag is applicable to classes and interfaces. The CDKBugTaglet class contains several other administrative methods that are not shown.

The @cdk.bug taglet allows one to mark classes and interfaces with known bugs. It also ensures that in the output HTML, the bug is directly linked to the SourceForge bug database for the CDK. A number of other tags are defined in the CDK:

@cdk.dictref For citing dictionaries, such as the QSAR Descriptor dictionary [4]

@cdk.cite For citing literature; codes match the list in 'doc/refs/cheminf.bibx'

@cdk.set For adding this class to a set. For example, used for defining all IO classes

@cdk.keyword For indexing the CDK classes on the website

@cdk.require For defining a required Java version

@cdk.depends For defining which third party libraries this class requires at run time

@cdk.builddepends For defining which third party libraries this class requires at build time

@cdk.todo For adding information for the users on what remains to be done with the source code, e.g. adding some missing feature


[...] are of special interest. Information provided by the first two tags is used in the CDK build process, and the last tag defines sets of classes with similar function in the CDK library. For example, sets are defined for QSAR descriptors and for IO classes.

The CDK build process uses a modularized approach. That is, the library is divided into a number of modules with clear interdependencies, which are defined by metadata information in the 'src/META-INF' directory. The @cdk.module tag in the JavaDoc of a class indicates which module it belongs to. The following example shows how to assign the Atom class to the data module.

    /**
     * Represents the idea of a chemical atom.
     *
     * @cdk.module data
     *
     * @author steinbeck
     * @cdk.created 2000-10-02
     * @cdk.keyword atom
     */
    public class Atom extends AtomType implements IAtom {

By extracting this information from the source [...] Figure 2 for an overview of available module). Note [...] The extraction of this information is done, again, [...]

Doclets

[...] used by JavaDoc to create output. The default doclet [...] that output DocBook XML too. Other doclets exist that, for example, check the quality of the JavaDoc in Java source files, which I will discuss in a later article.

A doclet is a class that implements the Doclet interface from the com.sun.tools.doclets [...]. The CDK has two implementations of this interface: MakeCDKSetFilesDoclet and MakeJavaFilesFilesDoclet, which can be found in 'doc/javadoc/source'. They are used in the 'build.xml' to create the 'src/*.javafiles', 'src/*.classes' and [...]

    name="net.sf.cdk.tools.MakeJavaFilesFilesDoclet"

Instead of creating HTML files, these doclets create plain text files for all the CDK modules, with each CDK class on its own line. The 'src/*.javafiles' files are used in the build process to ensure that the compiler compiles only the java files listed in these files; the Sun java compiler, for example, resolves missing dependencies and will include these in the compile, which makes dependencies unclear. This is done by first creating a copy source dir containing only the files to be compiled (${build.src}). Note that it includes all classes listed in the ${module}.javafiles, and excludes all classes that do not fulfil some requirement as defined with @cdk.require: [...]

Once this copy of the listed source files is done it compiles these files, again applying the same includes and excludes restrictions, though ignored by Sun's java compiler:

    name="${src}/java1.5.javafiles"
    name="${src}/ant1.6.javafiles" unless="hasAnt16"/>
    name="${src}/r-project.javafiles" unless="rispresent"/>
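The role of the per-module '.javafiles' lists can be sketched in plain Java: read a list with one source file per line and keep only those sources that the list names. This is an illustration of the idea, not the CDK's Ant build logic. The class name, method names and file paths are hypothetical, and the one-entry-per-line format is assumed from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class JavaFilesListSketch {

    /** Parses the text of a list file: one entry per line, blank lines ignored. */
    public static Set<String> parseList(String text) {
        Set<String> listed = new LinkedHashSet<>();
        for (String line : text.split("\r?\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                listed.add(trimmed);
            }
        }
        return listed;
    }

    /** Keeps only those source files that the module list names, in their original order. */
    public static List<String> selectForCompile(List<String> allSources, Set<String> listed) {
        List<String> selected = new ArrayList<>();
        for (String source : allSources) {
            if (listed.contains(source)) {
                selected.add(source);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical content of a module's list file (paths are assumptions).
        String dataModuleList = "org/example/Atom.java\norg/example/Bond.java\n";
        List<String> allSources = Arrays.asList(
                "org/example/Atom.java",
                "org/example/io/Reader.java",
                "org/example/Bond.java");
        // Only the files named in the list are handed to the compiler.
        System.out.println(selectForCompile(allSources, parseList(dataModuleList)));
    }
}
```

Copying only the selected files into a separate directory before compiling, as the article describes, keeps a compiler that resolves missing dependencies on its own from silently pulling in classes of other modules.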


    Figure 2: UML diagram of the CDK modules and their dependencies.

This gives a flexible but clean way to compile CDK modules. The release manager can now safely release the CDK knowing how modules depend on each other, giving users the flexibility to select those bits of the CDK that they need, and leave out those which they do not need.

For an ever growing library, with applets as target applications, modularization is of high importance. In my blog I have recently shown a nice application of this technology, when including Jumbo5 which requires Java 1.5 [5]. An overview of most CDK modules and their dependencies at the time of writing is given in Fig. 2.

Conclusion

I have not covered quality assurance of CDK's JavaDocs and will discuss that in a future article. What I hope to have shown is that JavaDoc is a rather versatile system, giving the CDK developers a lot of possibilities to manage the development process in more detail.

Egon Willighagen
[email protected]

Bibliography

[1] Sun Microsystems. Javadoc Tool. http://java.sun.com/j2se/1.5.0/docs/guide/javadoc/index.html, 2004.

[2] The GCC Team. The GNU Compiler for the Java Programming Language. http://gcc.gnu.org/java/, 2005.

[3] The CDK Team. CDKBugTaglet.java. http://cvs.sf.net/viewcvs.py/cdk/cdk/doc/javadoc/source/CDKBugTaglet.java, 2005.

[4] C. Steinbeck, C. Hoppe, S. Kuhn, M. Floris, R. Guha, and E.L. Willighagen. Recent Developments of the Chemistry Development Kit (CDK) - An Open-Source Java Library for Chemo- and Bioinformatics. Curr. Pharm. Des., in press, 2006.

[5] E.L. Willighagen. Jumbo 5.0 and the CDK. http://chem-bla-ics.blogspot.com/2005/12/jumbo-50-and-cdk.html, 2005.


Literature

“Literature” is a recurrent column describing recently published articles that are directly or indirectly related to CDK.

by Egon Willighagen

This issue of “Literature” describes four articles. The first article proposes the use of bar codes for chemical identification. Two articles discuss the use of CML and enzyme reaction mechanisms, and the last one gives an overview of the use of open-source software in drug design.

Bar Codes

Karthikeyan et al. published an article in which they propose a SMILES based 2D bar code representation for use in inventory management [1]. The advantage here is that chemicals used in laboratories are identified by their 2D bar code, which can be translated into a connection table, instead of by their name. Using the InChI as replacement for the SMILES is likely to be trivial. The article cites the CDK as one library that can parse and generate the SMILES representation.

Reactions in CML

CMLReact is an extension of the Chemical Markup Language (CML) that allows markup of reaction mechanisms [2]. The CDK and JChemPaint can read and write this XML format, and JChemPaint played a small part in the development of the MACiE database, a database for mechanisms, annotation and classification of enzyme reactions, available online at http://www-mitchell.ch.cam.ac.uk/macie/ [3].

Drug Design

Geldenhuys et al. review the use of open-source software applications in drug discovery, approaching it from a bench chemist's point of view [4]. Because they discuss not only open-source software but also free, closed source programs in one go, it is not always clear whether their conclusions apply to the free tools or to the open source programs. For example, they discuss the advantages and problems of open-source software, and mention the often lacking user-friendly GUI and the lack of literature to validate the program. While the former is likely true for many open-source chemoinformatics programs, the validity of the second might be arguable.

Egon Willighagen
Radboud University Nijmegen, The Netherlands
[email protected]

Bibliography

[1] M. Karthikeyan and Andreas Bender. Encoding and decoding graphical chemical structures as two-dimensional (PDF417) barcodes. J Chem Inf Model, 45(3):572–580, 2005.

[2] Gemma L Holliday, Peter Murray-Rust, and Henry S Rzepa. Chemical Markup, XML, and the World Wide Web. 6. CMLReact, an XML Vocabulary for Chemical Reactions. J Chem Inf Model, 46(1):145–157, 2006.

[3] Gemma L Holliday, Gail J Bartlett, Daniel E Almonacid, Noel M O'Boyle, Peter Murray-Rust, Janet M Thornton, and John B O Mitchell. MACiE: a database of enzyme reaction mechanisms. Bioinformatics, 21(23):4315–4316, Dec 2005.

[4] W.J. Geldenhuys, K.E. Gaasch, M. Watson, D.D. Allen, and C.J. Van der Schyf. Optimizing the use of open-source software applications in drug discovery. Drug. Disc. Today, 11:127–132, 2006.


COMMUNICATION

iBabel

iBabel is an Applescript Studio application that provides a front-end for a variety of tools.

by Chris Swain

iBabel is an OS/X GUI to a variety of chemoinformatics tools built using Applescript Studio, a free application for building interfaces to scripts and commandline tools, whether it be PERL, shell script, a C-program or even applescript. It also allows the incorporation of web elements such as web pages, Java applets or images (http://www.apple.com/applescript/studio/). To date the chemoinformatics tools include file conversion, SMARTS searching, list manipulation and overlaying using OpenBabel, a 2D viewer using JChemPaint, and a 3D molecule viewer using Jmol, binaries for which are now included in the iBabel application. As an alternative, Marvin can be used for both 2D and 3D display.

The application and associated files can be downloaded from http://sourceforge.net/projects/openbabel/. The expanded downloaded file contains an application called iBabel. Simply create a folder on Macintosh HD called Public and a folder within Public called Structures (if you forget, iBabel should create them for you the first time it runs). The iBabel application can be in the applications folder. Marvin is available from Chemaxon (http://www.chemaxon.com/marvin/do-download.html); they also have a number of useful applications and toolkits that are free for academics.

You should end up with this folder structure:

The folder Structures is where temporary structures are stored for visualisation, and needs to be emptied occasionally.

The iBabel application can be moved to your applications folder. Double click on it to open the application.

File conversion

The “Convert” tab is a GUI for OpenBabel allowing file conversion for a wide variety of formats including CML. On all tabs the “Use Terminal” check box is available if you want to see the script run.

To convert a file, click the input button and choose the input file, select the input and output file types and click "convert". The dropdown menus allow the user to change the explicit hydrogens. By default all molecules are converted but a subset can be selected. The default location for the output is the desktop; the user can change this by simply typing the desired destination into the output file text box.

Substructure search

iBabel also provides a search tab, where you can run searches using SMARTS based queries. The Classes and Groups dropdown menus contain a variety of canned SMARTS queries. You can of course simply type in a SMARTS query or use Marvin as the editor to generate the SMARTS string. Simply fill the SMILES/SMARTS string box, choose an option from the dropdown menu, then click add (this allows the user to concatenate options).


At the moment only the Marvin applet can be used for editing. The results of a search can be viewed in the Viewer tab, but you must first have imported the structures there. The select or unselect dropdown menu allows you to change the selection in the Viewer.

To use the “Viewer” tab, first select a file with the “input” button, then click “import”. The table will be populated with a list of the molecules present in the selected file. This table can be sorted by clicking on the column headers. Clicking on a molecule name will display a 2D or 3D structure, depending on which of the radio buttons is selected. (Note that the display is limited to the abilities of the applets.)

The Viewer tab allows the user to browse through a multi-molecule file, then sort and select compounds and export a subset. The structures can be displayed as either 2D or 3D structures using a Java applet (JChemPaint, Jmol, or Marvin). “Name list” exports a text list of the selected compounds, “count” gives the total number of selected compounds, “invert” inverts the selection, and “none” and “all” unselect or select all compounds. You can import more than one structure file and then export selections from one or more of the imported files using the multi or single buttons. (If you have only imported one file, the single file export is faster.)

Other tools

The tools menu gives access to other features: property calculation and superimposition onto a template based on SMARTS matching. More tools will be added as they become available.

Acknowledgements

This work is only possible due to the outstanding efforts of others, in particular:

• OpenBabel (http://openbabel.sourceforge.net/) (by Geoff Hutchison)
• Jmol (http://jmol.sourceforge.net/)
• JChemPaint (http://jchempaint.sourceforge.net)
• Applescript Studio mailing list (http://lists.apple.com/mailman/listinfo/applescript-studio)
• Marvin (http://www.chemaxon.com/)

Chris Swain
[email protected]


    COMMUNICATION An Applet Release of JChemPaint

This communication describes the applet release of JChemPaint, its advantages and usage.

by Stefan Kuhn

The two-dimensional structure editor JChemPaint has been under development for some years. Over this period it has gone through many changes, e.g. swapping the underlying library to the CDK and the full integration into the CDK development. One goal had always been to develop a Java applet based on JChemPaint, which should fulfill the following requirements:

• small size
• fast operation
• compatible with a wide range of Java VMs, browsers and operating systems
• provide all functions needed in online chemistry applications

Status of the Applet

The release has been named 2.2.1. The source of JChemPaint and the applet continues to be developed as part of the CDK. The download consists of a zip file, which contains 35 jar files plus HTML example code. The jar files form the binaries for the applet. Their size is 6.2 MB, but typically not all of them are needed at startup; additional code is fetched on request. The Editor as well as the Viewer can start with jars of less than 1 MB in size. Over a fixed link internet connection this means that the applet starts in around three seconds, depending on conditions. For comparison: Marvin (http://www.chemaxon.com/marvin/) is similar to JChemPaint in size, whereas JME (http://www.molinspiration.com/jme/) is only 37 KB. Once the applet is loaded, speed is no longer influenced by file size, and Java VMs also cache the code. In operation, JChemPaint generally proves to be quite fast.

The current release is not signed, which implies it cannot perform file system or clipboard operations. For workarounds see the section "Usage of the JCP Applet". Interested groups could also get the jars signed by some trusted organisation. The applet is compiled for the Java 1.3 platform, therefore it should run on all Java 1.3 or later (including 1.5) releases. It will not run on earlier Java VMs, notably not on the Microsoft VM built into older Internet Explorer versions. We tested the applet with a range of current browsers (including Mozilla, IE and Konqueror) and found no problems.

An installation of the Editor applet is available at http://www.chemistry-development-kit.org. A static installation is at http://almost.cubic.uni-koeln.de/jcp-applet. The applet has been used in NMRShiftDB (http://www.nmrshiftdb.org) since release 1.2, including both the Editor and Viewer functionalities.

Usage of the JCP Applet

The prerequisite for linking the JCP applet into a web page is to have all the jars from the binary distribution in a web server directory. The viewer is then embedded with HTML code that uses the attributes and parameters described in the following.

Figure 1: Screenshot of NMRShiftDB, using the JChemPaint viewer applet.


The archive attribute must point to jchempaint-applet-core.jar, relative to the directory containing the HTML page. All jars should be placed into this directory, even if they are not mentioned in the HTML code. The param load tells the applet which structure file it should load for display. The path to the file must either be relative to the HTML directory or a complete URL (the file must be publicly available). When detachable is set to true (default: false), the user can detach the applet by double-clicking on it and then resize it. The other parameters are self-explanatory. Fig. 1 shows a screenshot of NMRShiftDB with the detached viewer applet.

JavaScript can also be used to interact dynamically with the applet. In Fig. 1 you can see one atom marked in red. This is done dynamically when the user hovers over the table on the right. The code for this would be document.Viewer.selectAtom(6). Viewer is the name parameter in the applet tag, and 6 is the number of the atom to mark (atom numbers start with 0).

The Editor applet can be included with the same code as the Viewer, except for the class name; in the example it is given name="Editor", width="600" and height="500".

The Editor applet offers more possibilities for runtime manipulation. Firstly, you can set the structure via JavaScript as a molfile. This can be achieved by document.Editor.setMolFile(thisBox.value), where Editor is the name of the applet and thisBox could be, for example, a text box. Another possibility is to upload a file to the server as part of an HTTP POST request, read this file on the server and display it. Both ways, you avoid the restrictions of the sandbox.

Another restriction stemming from the sandbox architecture of the VM is that copy/paste (clipboard) functions cannot be used on the client side. Thus, the user can display the SMILES of a molecule via Report->Generate SMILES but will be unable to copy it into the buffer (it seems this works on some VMs). This might lead to the situation that the user draws a (complicated) structure, but cannot get it out of the applet, since he can neither save nor copy it. Therefore, we recommend that code like the following also be included next to the applet, together with a link labelled "Show editor content as mol file" that calls it:

    function showmol(mol) {
        [...] " + mol + " [...] );
    }

    Figure 2: Screenshot of http://www.chemistry-development-kit.org, using the JCP editor applet.


Via this link, the user can view the MOL file in a separate window, from which he can copy the text. Currently, the applet cannot display the help texts internally. You can download a help distribution from http://sourceforge.net, put the files on your home page and link to the ‘jcp.html’ file with a help link next to the applet.

Often, the applet may be used to enter molecular data as part of a web page input. To get the structure to the server, the getMolFile() function can be used. For this, put a form in your web page and trigger a JavaScript function on submit. This function writes the content to a text field in the form, which can be read on the server. The code looks like this:

    document.MolForm.MolTxt.value =
        document.JcpEditor.getMolFile()

In Fig. 2 the Editor applet is shown. The content has been displayed via the "show editor content as mol file" link. You can also see a "Help" link plus upload possibilities.

Stefan Kuhn
Cologne University Bioinformatics Center
[email protected]
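The form field filled by getMolFile() in the JChemPaint article above reaches the server as plain molfile text. As a rough illustration of what server-side code might do with it first, the Java sketch below reads the atom and bond counts from the counts line. It assumes the applet returns a V2000 molfile (the article does not state the version). The class and method names are hypothetical and are not part of JChemPaint or the CDK.

```java
/**
 * Hypothetical server-side helper: reads the atom and bond counts from
 * molfile text such as the value posted in the MolTxt form field.
 * Assumes the V2000 layout, where the fourth line is the counts line.
 */
final class MolfileCountsSketch {

    /** Returns {atomCount, bondCount} parsed from the counts line. */
    static int[] parseCounts(String molfile) {
        // Keep empty lines: the header block may contain blank lines.
        String[] lines = molfile.split("\r?\n", -1);
        if (lines.length < 4) {
            throw new IllegalArgumentException("molfile has no counts line");
        }
        String counts = lines[3];
        if (counts.length() < 6) {
            throw new IllegalArgumentException("counts line is too short");
        }
        // Columns 1-3 hold the atom count, columns 4-6 the bond count.
        int atoms = Integer.parseInt(counts.substring(0, 3).trim());
        int bonds = Integer.parseInt(counts.substring(3, 6).trim());
        return new int[] {atoms, bonds};
    }

    public static void main(String[] args) {
        // Minimal two-atom example (ethane without hydrogens).
        String molfile = String.join("\n",
            "ethane",
            "  sketch",
            "",
            "  2  1  0  0  0  0  0  0  0  0999 V2000",
            "    0.0000    0.0000    0.0000 C   0  0",
            "    1.5000    0.0000    0.0000 C   0  0",
            "  1  2  1  0",
            "M  END");
        int[] counts = parseCounts(molfile);
        System.out.println(counts[0] + " atoms, " + counts[1] + " bonds");
    }
}
```

A check like this only confirms that the posted text looks like a molfile; real applications would hand the text to a proper reader for full parsing.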

Frequently Asked Questions

"Frequently Asked Questions" is a series in the CDK newsletter. It is compiled from selected questions and answers taken from the CDK user and developer mailing lists. Additionally, the author considers questions that are of general interest for CDK users. All credits for the expert answers go to the helpful developers and users who are contributing to the mailing lists.

by Uli Fechner

I read an SD file where molecules have a title. How do I access the title of a molecule in CDK?

The title of a molecule can be retrieved by calling:

    molecule.getProperty(CDKConstants.TITLE);

To set the title to name use

    molecule.setProperty(
        CDKConstants.TITLE, "name"
    );

I instantiated a new carbon atom with Atom atom = new Atom("C"). However, the mass number and exact mass are zero. Are these atom properties not set automatically?

No, if a new atom is created as shown above its properties are not set automatically. You can configure the Atom after instantiation by means of the org.openscience.cdk.config.IsotopeFactory:

    IsotopeFactory.getInstance().
        configure(atom);

There is also a method that facilitates isotope configurations for each atom of an AtomContainer:

    IsotopeFactory.getInstance().
        configureAtoms(atomContainer);

I just generated the API documentation using ant -f javadoc.xml. No errors are reported and it is generated, yet there appear a lot of warnings.

A closer look at the warnings reveals that they are related to missing Java3D [1] and JOELib [2] installations. These warnings can be safely ignored. The generated API documentation is complete and usable. The only disadvantage is that HTML links that refer to the API documentation of Java3D and JOELib are missing.

There are a lot of calls related to logging in the CDK classes. Even though a cdk.log file is created every time I run my program that makes use of CDK, this very file is always empty. Do I have to enable logging explicitly?

Yes, logging is disabled by default and has to be explicitly enabled. In an existing application it is enabled by starting Java with an option that sets the system property cdk.debugging to true:

    java -Dcdk.debugging=true -jar jarname.jar

You also have the possibility to output the debugging info to standard out:

    java -Dcdk.debugging=true
        -Dcdk.debug.stdout=true
        -jar jarname.jar

If you want to enable logging for a custom application you need to make sure to configure the org.openscience.cdk.tools.LoggingTool:

    LoggingTool logger = new LoggingTool();
    logger.configureLog4J();

The LoggingTool has to be configured only once for a custom application; these settings are then used by all classes that call its logging methods. Refer to the API documentation of this class to learn more about it.

I added a new atom to an AtomContainer that already contained some atoms. Then, I created a new bond between the new atom and another atom of the AtomContainer. If I display this modified AtomContainer using paintMolecule(AtomContainer container, java.awt.Graphics2D graphics) of org.openscience.cdk.renderer.SimpleRenderer2D it looks really weird.

If you add a new Atom to an AtomContainer its coordinates are not generated automatically. Neither is this done by the SimpleRenderer2D. You have to generate new coordinates for the whole AtomContainer with the aid of org.openscience.cdk.layout.StructureDiagramGenerator:

    StructureDiagramGenerator sdg =
        new StructureDiagramGenerator(molecule);
    sdg.generateCoordinates();
    Molecule moleculeWithCoordinates =
        sdg.getMolecule();

Uli Fechner
Goethe-University Frankfurt, Germany
[email protected]

Bibliography

[1] Java3D. https://java3d.dev.java.net/, Jan. 2006.

[2] JOELib. http://joelib.sourceforge.net/, Jan. 2006.
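The logging answer in the FAQ above relies on Java system properties passed with -D on the command line. As a minimal sketch of that mechanism only, the following Java class reads the two property names mentioned in the answer. How the CDK LoggingTool itself evaluates them is not shown in the article, so the class, its method names and the returned descriptions are hypothetical.

```java
/**
 * Illustrates how -Dname=value options become visible to a program.
 * This is not CDK code; it only reads the property names from the FAQ.
 */
final class DebugFlagSketch {

    /** True when the JVM was started with -Dcdk.debugging=true. */
    static boolean debuggingRequested() {
        // Boolean.getBoolean returns true only if the system property
        // exists and equals "true" (ignoring case).
        return Boolean.getBoolean("cdk.debugging");
    }

    /** True when -Dcdk.debug.stdout=true was given as well. */
    static boolean stdoutRequested() {
        return Boolean.getBoolean("cdk.debug.stdout");
    }

    /** Summarises the flags; the wording is an assumption for this sketch. */
    static String describe() {
        if (!debuggingRequested()) {
            return "logging off";
        }
        return stdoutRequested() ? "logging to standard out" : "logging to file";
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Started without any -D option the sketch reports that logging is off, which mirrors the default described in the answer.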

Editors-in-Chief:
Egon Willighagen [email protected] and Christoph Steinbeck [email protected]

Editorial Board:
Andreas Bender, Christoph Steinbeck, Egon Willighagen, Noel O’Boyle, Rajarshi Guha, Rich Apodaca and Uli Fechner.

CDK News is a publication of the Chemistry Development Kit (CDK) project. All articles are copyrighted with GNU’s FDL by the respective authors. Submissions can be sent to the Editors-in-Chief.

CDK Project web pages:
http://cdk.sourceforge.net/
http://www.chemistry-development-kit.org/

    CDK News ISSN 1614-7553