The Chemistry Development Kit (CDK) V2.0: Atom Typing, Depiction, Molecular Formulas, and Substructure Searching Egon L
Total Page:16
File Type:pdf, Size:1020Kb
Willighagen et al. J Cheminform (2017) 9:33 DOI 10.1186/s13321-017-0220-4 SOFTWARE Open Access The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching Egon L. Willighagen1* , John W. Mayfeld2 , Jonathan Alvarsson3 , Arvid Berg3, Lars Carlsson4 , Nina Jeliazkova5 , Stefan Kuhn6 , Tomáš Pluskal7 , Miquel Rojas‑Chertó8 , Ola Spjuth3 , Gilleain Torrance9 , Chris T. Evelo1 , Rajarshi Guha10 and Christoph Steinbeck11 Abstract Background: The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, provid‑ ing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug dis‑ covery, metabolomics, and toxicology. Over the last 10 years, the code base has grown signifcantly, however, result‑ ing in many complex interdependencies among components and poor performance of many algorithms. Results: We report improvements to the CDK v2.0 since the v1.2 release series, specifcally addressing the increased functional complexity and poor performance. We frst summarize the addition of new functionality, such atom typing and molecular formula handling, and improvement to existing functionality that has led to signifcantly better perfor‑ mance for substructure searching, molecular fngerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism. Conclusions: This paper highlights our continued eforts to provide a community driven, open source cheminfor‑ matics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer reviewed publishing platform for scientifc computing software. Keywords: Java, Cheminformatics, Bioinformatics, Metabolomics, Depiction Background Obelisk, a movement promoting Open Data, Open Source, Te open source cheminformatics community has made and Open Standards in chemistry [1, 2]. Te CDK provid- signifcant steps forward recently [1] as evidenced by the ing data structures to represent chemical concepts along growing number of tools and underlying toolkits, along with methods to manipulate such structures and perform with the usage of these software components in a variety computations on them. Previously documented CDK ver- of applications. Te Chemistry Development Kit (CDK) sions have been widely adopted [3, 4]. Use of the CDK is one of the tools developed under the aegis of the Blue ranges from inclusion of CDK functionality in wrapper platforms such as Cinfony [5], incorporation within the R environment (rcdk [6]), and as plugins for Taverna [7], *Correspondence: [email protected] KNIME [8], Cytoscape (ChemViz2 [9]), and for Microsoft 1 Department of Bioinformatics ‑ BiGCaT, NUTRIM, Maastricht University, 6200 MD Maastricht, The Netherlands Excel (LICSS [10]). In contrast to scenarios that have made Full list of author information is available at the end of the article CDK functionality available in larger systems, a number of © The Author(s) 2017. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Willighagen et al. J Cheminform (2017) 9:33 Page 2 of 19 projects have employed the CDK as a general cheminfor- improved and formalized the development model for the matics toolkit. Examples include jCompoundMapper [11], project using unit testing, code review and guidelines for ScafoldHunter [12, 13], OMG [14], PaDEL [15], Chem- handling version control. Finally we report on the availa- Des [16], ReactPRED [17], SMSD [18–20], WhichCyp [21], bility of binary distributions of the library, allowing users MetaPrint2D [22], MetFrag [23], and the IUPHAR/BPS to include specifc modules (and their dependencies) of Guide to Pharmacology [24], BRENDA [25] and QSAR the CDK in their own projects (as opposed to developers DataBank [26] databases. A number of such tools were ini- who work on the CDK library itself). tially developed using older versions of the CDK and are updated to new releases as they are made available. Exam- New APIs and improved implementations ples include Bioclipse [27, 28] and AMBIT [29–31]. Te We here outline various new and improved APIs in the CDK has also played a role in a number of chemical stud- CDK library since the two previous publications in 2003 ies, such as fnding the maximally bridging rings in chemi- and 2006 [3, 4]. cal structures [32], prediction of organic reactions [33], and bioactivities of compounds [34]. Atom typing While the CDK has purported to be a general purpose Atom type perception is core cheminformatics function- cheminformatics toolkit, older versions were designed by ality: the atom types describe chemical features of atoms, a community with specifc applications in mind, primary such as the number of neighbors, possible formal charges, among them being structure elucidation. In addition, an (approximate) hybridization, electron distribution over implicit goal of previous versions was to have the CDK serve orbitals and so on. However, previous versions of the CDK as an educational resource to enable students of chemin- implemented atom type perception as part of diferent formatics to understand the underlying algorithms. Tis algorithms, resulting in duplicated and sometimes diver- resulted in certain functionalities, such as molecular fnger- gent typing schemes. As a result it was cumbersome to add printing [35, 36], receiving more attention than others, such new atom types and implement support for new charged as stereochemistry. Te outcome was signifcant variance in and radical species in a consistent manner. performance and features throughout the toolkit. Tis CDK version has a new, centralized atom typ- Te growth of open source software over the last 10 ing framework, removing the perception of atom types years is evidence of the ability of communities of devel- from various algorithms. Tis allows for a consistent and opers to develop systems and processes that lead to high extensive typing scheme, that can be also be tested inde- quality software systems for long term use. Te CDK is pendently of other code. Te new code defnes the atom no diferent. Te adoption of automatic build systems types using a list that specifes for each type the element and quality control methodologies such as unit test- symbol, hybridization, formal charge, number of lone ing, automated source code validation, and peer review pairs, and an enumeration of the bond orders (see Fig. 1). by fellow developers have greatly improved the stability Tis list of properties captures the information needed of the library. While it has slowed development some- for the various algorithms in the CDK. For example, what, it has allowed for cleaning up interdependencies hybridization information can be used in certain aroma- between modules of functionality, and importantly, has ticity models (see later), and the lone pair information is improved the scalability of the development model. Tis needed for resonance structure calculation needed, for π has resulted in signifcant new functionality in core appli- example, for Gasteiger -charges. cation programming interfaces (APIs) while maintaining the quality of code depending on those core APIs. Examples of new features supported by the improved development model include InChI functionality [37], greatly improved ring detection algorithms [38], improve- ments to the core atom type perception module that now covers a much more comprehensive set of elements, charge states and radical species than previous versions, a more comprehensive fngerprinting API, new depiction functionality, and many speed and stability improvements. Implementation and results Tis section describes the specifcs of new APIs and improvements to pre-existing methods that are avail- 3 able in the latest CDK. We then discuss how we have Fig. 1 Atom type information specifed for a sp ‑hybridized carbon Willighagen et al. J Cheminform (2017) 9:33 Page 3 of 19 A reference implementation, CDKAtomType- Matcher, has been written in such a way that per- ceives these atom types, and validates the perception automatically against the properties defned by the ontology. Tis class handles a variety of types of miss- ing information, as commonly resulting from various (fle) formats; for example, it can handle undefned hydrogen counts and undefned double bond positions if hybridization information is provided instead. Tat makes the perception code fexible but also more com- plex. Alternative algorithms for atom typing have not been explored. Tis reference implementation can be