Contents of This Issue: QA of the Xlogp Descriptor

CDK News: The Newsletter of the CDK Project
Volume 3/1, March 2006, ISSN 1614-7553

Contents of this issue:

Editorial
A Protocol for Descriptor QA
Validation of the CDK Surface Area Routine
Improving the CDK implementation of the XlogP Descriptor
QA of the XlogP Descriptor
Development Tools. 2. Java Documentation
Literature
iBabel
An Applet Release of JChemPaint
Frequently Asked Questions

Editorial

by Egon Willighagen

Welcome to the sixth issue of CDK News, the first issue of the third volume. This issue focuses on validation of the QSAR descriptors implemented in the CDK. Fechner and Guha propose a scheme to validate QSAR descriptors in the CDK, which Fechner and Grabowski use to validate the LogP descriptors in the CDK. Additionally, Hoppe analyzes the algorithm in the CDK used to calculate the LogP descriptor, and Guha studies the behavior of the TPSA descriptor. This issue also features two applications of the CDK: one article discusses the iBabel program developed by Swain, while a second discusses the JChemPaint applet. Finally, a tutorial explains what JavaDoc is and how it should be used in the CDK.

Very recently, a wiki for the CDK has gone live at http://cdk.sf.net/wiki/. I would welcome all readers to discuss articles on this wiki. This might prove especially useful for articles that discuss source code. As the CDK code base is not static, and neither are other libraries, source code examples might need to be updated now and then. The wiki is a good place to aggregate those updates.

As I explained previously, the CDK News now requires InChIs to be stated in the article each time a distinct (small) molecular structure is given. The InChIs are normally given in the bibliography, allowing authors to just use names or IDs in the article itself. To help authors, the CDK News stylesheet now contains a new inchi command, which will also create a link to search for more information about the compound using Google.com. For example, a BibTeX entry may look like:

    @MISC{methane,
      title = "Methane",
      note = "\inchi{1/CH4/h1H4}"
    }

The next issue is scheduled for June/July 2006. Given the large number of papers in each issue, we will try to start releasing four issues a year, instead of every four months as in the past. This will likely make the issues smaller, but, more importantly, reduce the time to publication. This also means that we can no longer promise that an article will be published in the next issue; the sooner you submit, the larger the chance it has gone through the review process. As always, submissions may include comments on current code, discuss certain algorithms in general, or just describe a piece of work related to the CDK library.

Egon Willighagen
Radboud University Nijmegen, The Netherlands
[email protected]

Front Page

The front page shows the JChemPaint applet in action on the NMRShiftDB website (http://www.nmrshiftdb.org/), as discussed in the An Applet Release of JChemPaint article, by Kuhn, elsewhere in this issue.

A Protocol for Descriptor QA

We discuss a quality assurance (QA) protocol that has been established to validate the CDK descriptor implementations against corresponding implementations in commercially available packages. In particular, we compiled two datasets that we recommend for descriptor validation tasks.

by Uli Fechner and Rajarshi Guha

Currently, the CDK provides implementations of 52 descriptors. These are subdivided into two groups: 20 descriptors that calculate values for single atoms (org.openscience.cdk.qsar.descriptors.atomic), and 32 that provide descriptor values for whole molecules (org.openscience.cdk.qsar.descriptors.molecular). Though the descriptor implementations have unit tests associated with them, these are usually few in number. As a result, a comprehensive validation of the CDK descriptors has not been performed. This may not be very important for simplistic descriptors such as one that counts the number of atoms (AtomCountDescriptor) or computes the molecular weight (WeightDescriptor). However, for more complex descriptor classes that encompass several hundred lines of code and perform non-trivial tasks, validation is necessary for users to be able to rely on the CDK implementation. For example, the XlogP descriptor (XLogPDescriptor) comprises over 1,400 lines of code and carries out the recognition of nearly 100 different atom types.

To encourage confidence in the CDK descriptor implementations we decided to start a quality assurance (QA) project for validation purposes. We developed a protocol that lays out the general procedure for a descriptor QA task. Another article in this issue makes use of our descriptor QA protocol and validates the CDK XlogP descriptor.

Our first task was the compilation of a suitable dataset that can be used for all descriptor QAs. We downloaded the drug-like subset (subset 3, last updated on 03/03/2005, 2,066,905 compounds) of the ZINC database [1] in SMILES format. This subset includes all compounds of the ZINC database that do not violate any rule of the rule-of-five [2], i.e., compounds having a predicted logP value smaller than or equal to 5, a molecular weight of at most 500, not more than 5 hydrogen bond donors, and at most 10 hydrogen bond acceptors. The SMILES file was then converted to an SD file using the commercially available program Cliff (version 1.14) [3]. All hydrogens were stripped from the SD file and nitrogens were uniformly written in the penta style. Then, the topological pharmacophore-based atom-pair descriptor CATS was computed for all compounds [4]. We employed standard parameters for computation of the CATS descriptor: the considered topological distance of atom-pairs ranged from 0 to 9 bonds, and each of the 15 possible pairs of potential pharmacophore points (PPPs) was divided by the added occurrences of the two respective PPPs. Please refer to Ref. [5] for a more detailed explanation of the applied scaling scheme.

The CATS descriptors were then used to carry out selection of a diverse subset of 1,100 compounds. A Java implementation of the MaxMin algorithm was used for this purpose [6]. Though our goal was to create a QA dataset containing 1,000 structures, we initially selected 1,100 structures so as to have a number of backup structures in case any of the following tasks were unable to successfully deal with all structures.

After obtaining a diverse subset we modified it to yield two descriptor QA datasets that differed in terms of hydrogens (present and absent). Again, Cliff was used to add hydrogens.

Next, three-dimensional coordinates (one conformer per compound) were generated for the two datasets using Corina (version 3.20) [3]. Corina was unable to generate three-dimensional coordinates for six of the 1,100 structures. Visual inspection revealed that these structures contained a significant number of atoms that were not part of a ring system. As Corina starts conformer generation using ring templates and then minimizes non-ring atoms, it may fail on structures with a lot of non-ring atoms. We removed the six structures without three-dimensional coordinates and 94 more from both datasets to finally yield two datasets of 1,000 structures. These datasets are publicly available in SDF format and are deposited in CDK's CVS repository at SourceForge (module cdk-qa).

Validation of a CDK descriptor is then performed by computing this particular descriptor for one of the two descriptor QA datasets, using both the CDK and a second program (the comparison software) such as MOE [6] or Dragon [8]. A detailed comparison between the descriptor values of CDK and the comparison software includes

• a plot of the CDK descriptor values versus the values obtained from the comparison software
• root mean square error (RMSE)
• the median, maximum and minimum differences
• the percentage of compounds for which the descriptor values differ significantly (e.g., by more than 10 percent)
• noticeably outlying compounds

In addition to the points listed above, it might be worthwhile to derive a linear regression between CDK descriptor values and those computed by the comparison software. Such a linear regression yields a straight line. Examination of its intercept and slope adds another aspect to the results of a descriptor validation task. An intercept different from zero denotes that the validated CDK descriptor exhibits systematically higher or lower values than the ones calculated by the comparison software. A slope different from 1.0 (corresponding to 45 degrees) indicates that the values computed by CDK and by the comparison software are proportional but not equal.

Scrutinizing compounds that yield noticeably different descriptor values may lead to the detection of possible causes. Mistakes in the CDK implementations may be revealed and fixed in the QA process. In addition, differences in descriptor values may arise due to aspects of the CDK not directly related to the descriptor implementation. For instance, a limitation in the aromaticity detection routine in the CDK would lead to an error in the perception of one or more TPSA atom environments

[email protected]

Bibliography

[1] John J. Irwin and Brian K. Shoichet. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. J. Chem. Inf. Model., 45:177–182, 2005.

[2] C.A. Lipinski et al. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug. Del. Rev., 23:3–25, 1997.

[3] Molecular Networks GmbH - Computerchemie. http://www.mol-net.com/, January 2006.

[4] G. Schneider et al. "Scaffold-Hopping" by topological pharmacophore search: A contribution to virtual screening. Angew. Chemie Int. Ed., 38:2894–2896, 1999.

[5] G.

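The comparison statistics named in the protocol (RMSE, median, maximum and minimum differences, the share of compounds that differ significantly, and the intercept and slope of a linear regression) can be sketched in a few lines of Python. This is only an illustration and not part of the CDK or of the authors' QA tooling: the function name compare_descriptor_values, the parameter rel_threshold, and the reading of "differ significantly" as a deviation above 10 percent of the comparison value are assumptions made for the example, and the numbers in the usage section are invented.

```python
import math
import statistics


def compare_descriptor_values(cdk_values, reference_values, rel_threshold=0.10):
    """Summarise agreement between two equally long lists of descriptor values.

    cdk_values       -- values computed with the CDK
    reference_values -- values from the comparison software, in the same order
    rel_threshold    -- hypothetical cut-off for a "significant" difference
    """
    if len(cdk_values) != len(reference_values) or len(cdk_values) < 2:
        raise ValueError("need two lists of equal length with at least two values")

    # Signed differences, CDK minus comparison software.
    diffs = [c - r for c, r in zip(cdk_values, reference_values)]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    # Assumption: a difference is significant when its magnitude exceeds
    # rel_threshold times the magnitude of the comparison value.
    n_significant = sum(
        1 for d, r in zip(diffs, reference_values) if abs(d) > rel_threshold * abs(r)
    )

    # Ordinary least-squares fit: cdk = intercept + slope * reference.
    mean_ref = statistics.fmean(reference_values)
    mean_cdk = statistics.fmean(cdk_values)
    sxx = sum((r - mean_ref) ** 2 for r in reference_values)
    if sxx == 0:
        raise ValueError("comparison values are all identical; slope is undefined")
    sxy = sum(
        (r - mean_ref) * (c - mean_cdk)
        for r, c in zip(reference_values, cdk_values)
    )
    slope = sxy / sxx
    intercept = mean_cdk - slope * mean_ref

    return {
        "rmse": rmse,
        "median_diff": statistics.median(diffs),
        "max_diff": max(diffs),
        "min_diff": min(diffs),
        "percent_significant": 100.0 * n_significant / len(diffs),
        "slope": slope,
        "intercept": intercept,
    }


if __name__ == "__main__":
    # Invented values for four compounds, for illustration only.
    reference = [1.0, 2.0, 3.0, 4.0]
    cdk = [1.5, 2.5, 3.5, 4.5]
    for name, value in compare_descriptor_values(cdk, reference).items():
        print(f"{name}: {value:.3f}")
```

In this sketch a constant offset between the two programs shows up as a non-zero intercept with a slope of 1.0, while a purely proportional deviation shows up as a slope different from 1.0 with an intercept of zero, which mirrors the interpretation given in the article.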