Mining Collections of Compounds with Screening Assistant 2
Total Page:16
File Type:pdf, Size:1020Kb
Le Guilloux et al. Journal of Cheminformatics 2012, 4:20 http://www.jcheminf.com/content/4/1/20 SOFTWARE Open Access Mining collections of compounds with Screening Assistant 2 Vincent Le Guilloux1*, Alban Arrault2, Lionel Colliandre1,Stephane´ Bourg3, Philippe Vayer2 and Luc Morin-Allory1* Abstract Background: High-throughput screening assays have become the starting point of many drug discovery programs for large pharmaceutical companies as well as academic organisations. Despite the increasing throughput of screening technologies, the almost infinite chemical space remains out of reach, calling for tools dedicated to the analysis and selection of the compound collections intended to be screened. Results: We present Screening Assistant 2 (SA2), an open-source JAVA software dedicated to the storage and analysis of small to very large chemical libraries. SA2 stores unique molecules in a MySQL database, and encapsulates several chemoinformatics methods, among which: providers management, interactive visualisation, scaffold analysis, diverse subset creation, descriptors calculation, sub-structure / SMART search, similarity search and filtering. We illustrate the use of SA2 by analysing the composition of a database of 15 million compounds collected from 73 providers, in terms of scaffolds, frameworks, and undesired properties as defined by recently proposed HTS SMARTS filters. We also show how the software can be used to create diverse libraries based on existing ones. Conclusions: Screening Assistant 2 is a user-friendly, open-source software that can be used to manage collections of compounds and perform simple to advanced chemoinformatics analyses. Its modular design and growing documentation facilitate the addition of new functionalities, calling for contributions from the community. The software can be downloaded at http://sa2.sourceforge.net/. Keywords: Chemical libraries, Molecular diversity, DRCS Background larger screening libraries [5-9]. Chemical space is indeed Exploring biology through the activity of small molecules known to be almost infinite, and selecting the appropri- is an established paradigm used in drug research for sev- ate regions to explore for the problem at hand remains a eral decades now [1,2]. Today, a state of the art drug challenging task. discovery program often begins with screening campaigns What we call library design and analysis actually aims to aiming at the identification of novel biologically active increase the likelihood of screening collections to contain molecules. In the recent years, the rise of High Through- potentially active compounds, while ensuring that any of put Screening (HTS), combinatorial chemistry and the these represent an acceptable starting point for lead opti- availability of large compound collections has led to a dra- misation. In terms of biological activity, the concept of matic increase in the size of screening libraries, for both molecular diversity has proven useful to design libraries private companies and public organisations [3,4]. Yet, containing diverse chemotypes and hence increase the despite these constantly increasing capabilities, various ratio of hits [10,11]. In terms of resource optimisation, the authors have stressed the need to design better instead of main difficulty lies in the removal of those nuisance com- pounds that are unlikely to be developed into effective *Correspondence: [email protected]; drugs, especially using biochemical assays. For instance, [email protected] reactives, warheads, promiscuous aggregating inhibitors, 1Institut de Chimie Organique et Analytique (ICOA), Universited’Orl´ eans,´ UMR CNRS 7311 B.P. 6759, rue de Chartres, 45067 Orleans´ Cedex 2, France or more simply non-drug-like compounds are usually fil- Full list of author information is available at the end of the article tered out to avoid a waste of resources [12,13]. Most © 2012 Le Guilloux et al.; licensee Chemistry Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Le Guilloux et al. Journal of Cheminformatics 2012, 4:20 Page 2 of 16 http://www.jcheminf.com/content/4/1/20 of these undesired properties are usually represented chemoinformatics functionalities in a modular and exten- by either simple rule-based flags derived from physico- sible workbench. In the context of the present work, chemical properties (e.g. the Lipinski Rule of 5 [14]), or Bioclipse can be used to perform simple routine tasks structural features encoded using the SMARTS notation such as chemical file reading and visualisation, descriptor [15]. calculation, and SMART matching. More closely related The domain of chemoinformatics provides a plethora to this work, AMBIT XT [29,30] is a chemoinformatics of methods that can be used to tackle various aspects data management software, which consists in a MySQL of library analysis. Despite the growing diversity of avail- database and a set of functional modules, allowing a vari- able modeling and chemoinformatics tools, there are ety of queries, data mining and predictive model building few software tools specifically dedicated to the man- and application. Although not specifically dedicated to the agement and the analysis of screening libraries. One of management of screening libraries, it contains a set of the main reasons for this no doubt is the specificity chemoinformatics and data-mining facilities that make it of each screening platform (e.g. plates format, auto- usable for analysing collections of compounds. The soft- matic collection of results), which requires more specific ware was recently enhanced by providing a set of OpenTox developments to collect the results and associate them API [31] compliant REST web service interfaces to most with tested molecules. Typically, screening collections of its functionalities, hence promoting collaborative devel- are stored internally using in-house, usually web-based opment and data sharing [32]. More specific tools are proprietary software programmes, some of which have clearly beyong the scope of this work, but it is worth been described in the literature [16-18]. Other general- mentioning the Scaffold Hunter [33], which allows one to purpose, proprietary packages also propose methods to display a chemical library in the form of an interactive handle chemical libraries. In particular, the Instant JChem scaffold tree, or the SARANEA package [34], which allows application [19] from Chemaxon encapsulates various one to derive similarity graphs that can be used to perform chemoinformatics functionalities, and makes it possible to structure-activity Relationship analysis. store libraries in a database environment. Another exam- This article presents Screening Assistant 2 (SA2), an ple in this category is the CACTVS toolkit [20], an exten- open-source desktop software for chemical library man- sible distributed client/server system for the computa- agement. SA2 stores unique chemical structures and tion, management, analysis and visualisation of chemical properties in a MySQL database, and allows a variety information. of advanced chemoinformatics analyses and datamining In the open-source literature, there is a growing range queries to be performed. It has been designed to handle of tools that can deal with chemical libraries, each of small to very large (up to millions of molecules) collec- them having its own specific applicability domain, rang- tions, and to integrate external sparse data in a flexible ing from general purpose to highly specific software. way (e.g. molecular descriptors, biological activities...). Workflow solutions allow the automation of numerous Besides various search and visualisation capabilities, SA2 recurrent tasks, such as data reading / writing, filtering can also be used to manage the provenance of stored or visualisation, and some of them integrate chemistry compounds, and to create new subsets of molecules functionalities. KNIME [21] in particular, is distributed using many different methods, e.g. filtering, merging or with a set of chemistry nodes available for the Chemistry diversity. SA2 was developed for facilitating the analysis Development Kit (CDK) [22,23] and other chemoinfor- of screening libraries, and to regroup multiple ways of matics packages such as RDKit [24] and Indigo [25]. mining these collections using chemoinformatics meth- Various other advanced features have been encapsulated ods. Broadly speaking, it can also be used to quickly in KNIME nodes, and it is thus possible to e.g. compute and interactively analyse the content of any chemical molecular descriptors, perform substructure searches, or dataset. extract scaffolds. Recently, the CDK was also integrated in the Taverna workflow solution [26,27] through a set Implementation of more than 160 different workers handling chemoin- SA2 is open-source software, which means complete formatics tasks. The combination of chemoinformatics transparency and scalability in terms of algorithm imple- functionalities with data-mining methods typically avail- mentation. A first version of the software was available able in workflow solutions makes it possible to use more on sourceforge and on the web-site of our laboratory advanced strategies to analyse the content of chemical [35,36]. This new version has been re-designed from libraries. scratch, keeping most of the concepts and features