View and Approval by a Chemoinformatician

Martin et al. Journal of Cheminformatics 2012, 4:11
http://www.jcheminf.com/content/4/1/11

METHODOLOGY (Open Access)

Building an R&D chemical registration system

Elyette Martin1*, Aurélien Monge1, Jacques-Antoine Duret1, Federico Gualandi2, Manuel C Peitsch1 and Pavel Pospisil1

Abstract

Small molecule chemistry is of central importance to a number of R&D companies in diverse areas such as the pharmaceutical, nutraceutical, food flavoring, and cosmeceutical industries. In order to store and manage thousands of chemical compounds in such an environment, we have built a state-of-the-art master chemical database with unique structure identifiers. Here, we present the concept and methodology we used to build the system that we call the Unique Compound Database (UCD). In the UCD, each molecule is registered only once (uniqueness), structures with alternative representations are entered in a uniform way (normalization), and the chemical structure drawings are recognizable to chemists and to a cartridge. In brief, structural molecules are entered as neutral entities which can be associated with a salt. The salts are listed in a dictionary and bound to the molecule with the appropriate stoichiometric coefficient in an entity called “substance”. The substances are associated with batches. Once a molecule is registered, some properties (e.g., ADMET prediction, IUPAC name, chemical properties) are calculated automatically. The UCD has both automated and manual data controls. Moreover, the UCD concept enables the management of user errors in the structure entry by reassigning or archiving the batches. It also allows updating of the records to include newly discovered properties of individual structures. As our research spans a wide variety of scientific fields, the database enables registration of mixtures of compounds, enantiomers, tautomers, and compounds with unknown stereochemistries.
Keywords: Chemical registration system, Chemical database, Chemical cartridge, Molecule import

* Correspondence: [email protected]
1 Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
Full list of author information is available at the end of the article

© 2012 Martin et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

General introduction

Small molecule chemistry is of central importance to a number of R&D companies in diverse areas, such as the pharmaceutical, nutraceutical, food flavoring, and cosmeceutical industries. These institutions all face similar problems, such as how to register and store information regarding small molecules in their corporate collections. The registration of compounds becomes even more complicated when two or more compounds have to be registered together as a mixture that has a particular mixture-specific property. Generally, people working on such projects have to find answers to the same questions, namely, which technology to use, what type of data needs to be stored, how to manage physical samples of molecules, how to define the uniqueness of chemical structures, and how to make sure that the chemical structures entered by the chemists are drawn correctly. Surprisingly, this topic is rarely covered by scientific publications, and few insights can be gained from chemoinformatics books [1-5]. This is partly because it is a very technical challenge that is met by developers of the registration systems and partly because it is a rapidly evolving field.

In this paper, we show how a chemical registration system, which we call the Unique Compound Database (UCD), can be built and implemented at the corporate level of companies working with chemicals. Some elements of this system have been presented at two congresses [6-8]. Here, we provide examples from our experience of building a chemical registration system at Philip Morris International, Inc. (PMI).

Without a common chemistry registration platform, chemical information is generally retained in a variety of locations, as illustrated in Figure 1. Lists are kept at a team or even at a scientist level in diverse formats, such as Excel files or Isis Base [9]. Transferring the data from several locations to a single registration system poses a challenge, because much of the information related to the molecule's prior registration is incomplete (e.g., molecular data, scientists working with the molecule, name of the project) and there is often no molecular structure available for a molecule, just the name.

Figure 1 The challenge of connecting different data formats, activities, and scientific approaches without a designated data registry.

Quality of chemistry representation

There is a wide range of chemical cartridges available on the market using different underlying database technologies (see Table 1). Chemical cartridges are basically database plug-ins that give chemical handling functionalities to the database. Cartridges usually differ in their performance, searching mechanism (exact match, substructure, similarity, etc.), and the ability to store large quantities of structures in the most compact way. While these are critical elements, we believe that there is another element that is usually underestimated: the concordance of the input system used by the end user (sketcher) and the underlying cartridge. These usually share the same chemical representation library, but full alignment between them is not always certain. The UCD overcomes this problem by ensuring that molecules are drawn and stored in the exact same manner.

Table 1 Examples of chemical cartridges

Name; web source | Database type | Available for free
Accelrys Direct; http://www.accelrys.com/products/informatics | Oracle | No
Accelrys Accord; http://www.accelrys.com/products/informatics | Oracle | No
CambridgeSoft Oracle Cartridge; http://www.cambridgesoft.com/solutions/details/?fid=186 | Oracle | No
ChemAxon JChem; http://www.chemaxon.com/jchem/intro | Oracle | No
IDBS Activity Base; http://www.idbs.com/products-and-services/activitybase-suite | Oracle | No
GGA Software Services Bingo; http://ggasoftware.com/opensource/bingo | Oracle and SQL Server | Yes
MolSoft MolCart; http://www.molsoft.com/molcart.html | MySQL | No
MyChem; http://mychem.sourceforge.net | MySQL | Yes
Pgchem tigress; http://pgfoundry.org/projects/pgchem | PostgreSQL | Yes
Orchem; http://orchem.sourceforge.net | Oracle | Yes

Challenge of registering diverse data

The source of chemical compounds to be stored in a chemical registration system depends greatly on the business of the company. For instance, in pharmaceutical companies, compounds are synthesized internally or purchased from external suppliers and can represent millions of individually identified chemical structures. In the flavor and fragrance industry, compounds are often natural products extracted from plants.

For the tobacco industry in general, and for PMI R&D in particular, chemical constituent sources are relatively limited in comparison with traditional pharmaceutical inventories. Approximately 8400 compounds have been identified from tobacco plants and tobacco smoke [10]. However, we made sure that our system is as universal as possible and can hold millions of compounds as well. The challenge posed by the implementation of a registration system in this context is not the number of compounds to be registered, but rather the wide range of different chemistries represented (peptides, natural products, sugars, and complex products resulting from tobacco combustion). As such, a critical step in the project was to identify what we needed to register and how to ensure that the quality criteria were met effectively.

Design of the UCD

Our intention was to build a registration system of the compounds we use, not to create an inventory; therefore, the physical location [...]

[...] system must, therefore, be able to verify the uniqueness of the molecule regardless of whether it is in the form of a salt or a hydrate. The canonical representation is generated for each structure by the system, and the different tautomers of the same molecule must have the same canonical representation.

A temporary staging area (submission area) was planned to enable the storage of new molecule entries until they are validated by an expert before final registration in the system. In addition, three options were created to search by chemical structure: exact search, substructure search, and similarity search.

It was also important to define how batches should be managed by the person in charge of validating the data entered in the system. Batch reassignment was defined as an important feature of the system; batches can be reassigned from one molecule to another molecule by the registrar. This functionality also had to be available for a batch assigned to a molecule with "no structure", for when the structure of the molecule is finally identified. To ensure that no information is lost during the process, an audit trail of the modification must be kept. In the same way, it should be possible for the person in charge of the system to archive batches, but not to delete them.
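The batch-handling rules described above (a staging area before validation, one record per unique structure, batch reassignment with an audit trail, and archiving instead of deletion) can be illustrated with a minimal in-memory sketch. This is not the UCD implementation: the class and method names, the `UCD-000001` identifier format, and the use of a plain string as the canonical key are all assumptions made for illustration. In a real system the canonical key would come from the chemical cartridge.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Batch:
    batch_id: str
    molecule_id: str
    archived: bool = False  # batches are flagged, never deleted


@dataclass
class Registry:
    """Hypothetical in-memory registry; all names here are illustrative."""
    molecules: dict = field(default_factory=dict)   # canonical key -> molecule id
    staging: list = field(default_factory=list)     # entries awaiting validation
    batches: dict = field(default_factory=dict)     # batch id -> Batch
    audit_trail: list = field(default_factory=list)

    def _log(self, action, **details):
        # Every modification is recorded so that no information is lost.
        entry = {"when": datetime.now(timezone.utc), "action": action, **details}
        self.audit_trail.append(entry)

    def submit(self, canonical_key):
        """Place a new entry in the staging area; nothing is registered yet."""
        self.staging.append(canonical_key)

    def validate(self, canonical_key):
        """Registrar approval: register the structure only once, return its id."""
        self.staging.remove(canonical_key)
        if canonical_key not in self.molecules:
            molecule_id = f"UCD-{len(self.molecules) + 1:06d}"  # assumed format
            self.molecules[canonical_key] = molecule_id
            self._log("register", molecule_id=molecule_id)
        return self.molecules[canonical_key]

    def add_batch(self, batch_id, molecule_id):
        self.batches[batch_id] = Batch(batch_id, molecule_id)
        self._log("add_batch", batch_id=batch_id, molecule_id=molecule_id)

    def reassign_batch(self, batch_id, new_molecule_id):
        """Move a batch to another molecule and record old and new owners."""
        batch = self.batches[batch_id]
        old_molecule_id = batch.molecule_id
        batch.molecule_id = new_molecule_id
        self._log("reassign", batch_id=batch_id,
                  old=old_molecule_id, new=new_molecule_id)

    def archive_batch(self, batch_id):
        """Archive a batch; the record itself is kept."""
        self.batches[batch_id].archived = True
        self._log("archive", batch_id=batch_id)


# Usage: a batch first attached to a "no structure" entry is reassigned
# once the structure is identified. The keys are placeholder strings.
registry = Registry()
registry.submit("NO_STRUCTURE:sample-17")
registry.submit("CN1CCCC1c1cccnc1")
placeholder = registry.validate("NO_STRUCTURE:sample-17")
identified = registry.validate("CN1CCCC1c1cccnc1")
registry.add_batch("B-001", placeholder)
registry.reassign_batch("B-001", identified)
registry.archive_batch("B-001")

# Submitting the same canonical key again returns the existing id.
registry.submit("CN1CCCC1c1cccnc1")
duplicate = registry.validate("CN1CCCC1c1cccnc1")
```

The sketch deliberately offers no delete method, so a batch can only be archived, and the audit trail is append-only. Salt handling, tautomer normalization, and structure searching are left to the cartridge and are not modeled here.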
