
Martin et al. Journal of Cheminformatics 2012, 4:11 http://www.jcheminf.com/content/4/1/11

METHODOLOGY Open Access

Building an R&D chemical registration system

Elyette Martin1*, Aurélien Monge1, Jacques-Antoine Duret1, Federico Gualandi2, Manuel C Peitsch1 and Pavel Pospisil1

Abstract
Small molecule chemistry is of central importance to a number of R&D companies in diverse areas such as the pharmaceutical, nutraceutical, food flavoring, and cosmeceutical industries. In order to store and manage thousands of chemical compounds in such an environment, we have built a state-of-the-art master chemical database with unique structure identifiers. Here, we present the concept and methodology we used to build the system that we call the Unique Compound Database (UCD). In the UCD, each molecule is registered only once (uniqueness), structures with alternative representations are entered in a uniform way (normalization), and the drawings are recognizable to chemists and to a chemical cartridge. In brief, structures are entered as neutral entities which can be associated with a salt. The salts are listed in a dictionary and bound to the molecule with the appropriate stoichiometric coefficient in an entity called “substance”. The substances are associated with batches. Once a molecule is registered, some properties (e.g., ADMET prediction, IUPAC name, chemical properties) are calculated automatically. The UCD has both automated and manual data controls. Moreover, the UCD concept enables the management of user errors in the structure entry by reassigning or archiving the batches. It also allows updating of the records to include newly discovered properties of individual structures. As our research spans a wide variety of scientific fields, the database enables registration of mixtures of compounds, enantiomers, tautomers, and compounds with unknown stereochemistry.

Keywords: Chemical registration system, Chemical database, Chemical cartridge, Molecule import

* Correspondence: [email protected]
1 Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
Full list of author information is available at the end of the article

© 2012 Martin et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
General introduction
Small molecule chemistry is of central importance to a number of R&D companies in diverse areas, such as the pharmaceutical, nutraceutical, food flavoring, and cosmeceutical industries. These institutions all face similar problems, such as how to register and store information regarding small molecules in their corporate collections. The registration of compounds becomes even more complicated when two or more compounds have to be registered together as a mixture that has a particular mixture-specific property. Generally, people working on such projects have to find answers to the same questions, namely, which technology to use, what type of data need to be stored, how to manage physical samples of molecules, how to define the uniqueness of chemical structures, and how to make sure that the chemical structures entered by the chemists are drawn correctly. Surprisingly, this topic is rarely covered by scientific publications, and few insights can be gained from chemoinformatics books [1-5]. This is partly because it is a very technical challenge that is met by developers of the registration systems and partly because it is a rapidly evolving field.

In this paper we show how a chemical registration system, which we call the Unique Compound Database (UCD), can be built and implemented at the corporate level of companies working with chemicals. Some elements of this system have been presented at two congresses [6-8]. Here, we provide examples from our experience of building a chemical registration system at Philip Morris International, Inc. (PMI).

Without a common chemistry registration platform, chemical information is generally retained in a variety of locations, as illustrated in Figure 1. Lists are kept at a team or even at a scientist level in diverse formats, such as Excel files or Isis Base [9]. Transferring the data from several locations to a single registration system poses a challenge, because much of the information related to


the molecule’s prior registration is incomplete (e.g., molecular data, scientists working with the molecule, name of the project) and there is often no molecular structure available for a molecule, just the name.

Figure 1 The challenge of connecting different data formats, activities, and scientific approaches without a designated data registry.

Quality of chemistry representation
There is a wide range of chemical cartridges available on the market using different underlying database technologies (see Table 1). Chemical cartridges are basically database plug-ins that give chemical handling functionalities to the database. Cartridges usually differ in their performance, searching mechanism (exact match, substructure, similarity, etc.), and the ability to store large quantities of structures in the most compact way. While these are critical elements, we believe that there is another element that is usually underestimated: the concordance of the input system used by the end user (sketcher) and the underlying cartridge. These usually share the same chemical representation library, but full alignment between them is not always certain. The UCD overcomes this problem by ensuring that molecules are drawn and stored in the exact same manner.

Table 1 Examples of chemical cartridges
Name; web source | Database type | Available for free
Accelrys Direct; http://www.accelrys.com/products/informatics | Oracle | No
Accelrys Accord; http://www.accelrys.com/products/informatics | Oracle | No
CambridgeSoft Oracle Cartridge; http://www.cambridgesoft.com/solutions/details/?fid=186 | Oracle | No
ChemAxon JChem; http://www.chemaxon.com/jchem/intro | Oracle | No
IDBS Activity Base; http://www.idbs.com/products-and-services/activitybase-suite | Oracle | No
GGA Software Services Bingo; http://ggasoftware.com/opensource/bingo | Oracle and SQL Server | Yes
MolSoft MolCart; http://www.molsoft.com/molcart.html | MySQL | No
MyChem; http://mychem.sourceforge.net | MySQL | Yes
Pgchem tigress; http://pgfoundry.org/projects/pgchem | PostgreSQL | Yes
Orchem; http://orchem.sourceforge.net | Oracle | Yes

Challenge of registering diverse data
The source of chemical compounds to be stored in a chemical registration system depends greatly on the

business of the company. For instance, in pharmaceutical companies, compounds are synthesized internally or purchased from external suppliers and can represent millions of individually identified chemical structures. In the flavor and fragrance industry, compounds are often natural products extracted from plants.

For the tobacco industry in general, and for PMI R&D in particular, chemical constituent sources are relatively limited in comparison with traditional pharmaceutical inventories. Approximately 8400 compounds have been identified from tobacco plants and tobacco smoke [10]. However, we made sure that our system is as universal as possible and can hold millions of compounds as well. The challenge posed by the implementation of a registration system in this context is not the number of compounds to be registered, but rather the wide range of different chemistries represented (peptides, natural products, sugars, and complex products resulting from tobacco combustion). As such, a critical step in the project was to identify what we need to register and how to ensure that the quality criteria were met effectively.

Design of the UCD
Our intention was to build a registration system of the compounds we use, not to create an inventory; therefore, the physical location and quantity of the chemical compounds were not considered. A project team of three chemoinformaticians supported by a project manager was responsible for defining needs and user requirements. Most of the requirements were related to chemical structure representation and storage; the starting point was that it must be possible to create and modify structures in the platform and to record physical samples attached to their related compounds.

For the purpose of data uniformity and uniqueness, the chemoinformaticians specified that the structures entered in the system have to be standardized prior to registration, and that uniqueness should be defined at the level of neutral molecules. In consequence, when a new batch is submitted, if a duplicate of the corresponding compound is found in the database, the system must create a new batch for the compound. It is also important for users to have the possibility to register compounds for which structures are not known. Such cases are annotated as “No Structure”, meaning that no particular chemical structure is defined. This is useful, for example, for analytical chemists who might encounter the same peak in several gas chromatography and liquid chromatography mass spectra, without being able to identify the compound. For salts, the neutral form of the molecule is drawn and associated with the appropriate counter ions and ratio. In the same manner, hydrates are not drawn, but are associated with the chemical structure. The system must, therefore, be able to verify the uniqueness of the molecule regardless of whether it is in the form of a salt or a hydrate. The canonical representation is generated for each structure by the system, and the different tautomers of the same molecule must have the same canonical representation.

A temporary staging area (submission area) was planned to enable the storage of new molecule entries until they are validated by an expert before final registration in the system. In addition, three options were created to search by chemical structure: exact search, substructure search, and similarity search.

It was also important to define how batches should be managed by the person in charge of validating the data entered in the system. Batch reassignment was defined as an important feature of the system; batches can be reassigned from one molecule to another molecule by the registrar. This functionality also had to be available for the batch assigned to a molecule with “No Structure”, for when the structure of the molecule is finally identified. To ensure that no information is lost during the process, an audit trail of the modification must be kept. In the same way, it should be possible for the person in charge of the system to archive batches, but not to delete them.

Technical requirements to make the system compatible with our technical infrastructure were also defined. For example, the system should be compatible with Oracle 11g [11], Windows Server 2003 [12], VMware ESXi [13] and Citrix [14]. In terms of performance, the system should be able to support 300 users through a centralized architecture, but five queries at the same time were considered the maximum. It was also imperative that data be available to external tools, and for this purpose a direct database access was requested in order to implement Extract Transform Load (ETL) processes [15].

The result of this process was a document containing 147 user requirements describing the platform.

Concept of the UCD
Data organization: Concept of molecule, substance, and batch
The three main entities in the UCD are molecule, substance, and batch. Definitions of these terms can be slightly different from one chemical system to another. In the UCD, the definitions are the following:

• Molecule: a neutral form of the chemical structure without any charge, counter-ion, or hydrate. If a molecule is charged, the system changes it to its neutral equivalent and records its salt form at the substance level. An exception was made for substances containing quaternary ammonium cations. The particularity of the quaternary ammonium cations is that they are permanently charged, independent of the pH of their solution. In this case, the system does not neutralize the molecule.
• Substance: a molecule (neutral or charged) with its counter-ion or hydrate.
• Batch: an occurrence of the compound in the company, generally a physical sample, identified from mass spectrometry, or a reference in the relevant literature.

In the UCD, one molecule is linked to several substances and one substance can have several batches (see Figure 2). This allows users from different teams to enter batches (e.g., different laboratory procedures) of the same substance of the same molecule. This molecule can be a new one or one already registered in the UCD. One batch can contain only one substance, which in turn can contain only one molecule; however, one molecule can be represented by several different salts (substances) and each substance can have several batch entries. An example is presented in Figure 2. Batches are stored with experimentally relevant information entered manually by submitters. At the molecule level, a single molecule is represented by a unique, property-independent code generated by the registration system. The substance codes are the same as the molecule code, but a distinct letter is added for each distinct salt. Batches have their own unique codes. Batch codes are generated incrementally upon the submission process of the registration. Batch codes also serve as the Submission ID to submitters and registrars.

Each level requires specific information that is either entered manually by the scientists during the submission step of the registration or calculated automatically by the system. For example, the information related to the project, scientist, laboratory notebook reference, and analytical results is stored at the batch level (manual entry). Information related to the chemical substance, i.e., IUPAC name, codes (InChI and SMILES), and physicochemical properties such as molecular weight and logP, is stored at the substance level. Analogously, the same kind of information is stored at the molecule level for the neutral chemical structure.

The IUPAC names, SMILES, and InChI codes generated by the UCD can be canonical and/or can only partly represent the stereochemistry. When a molecule contains more than one group of relative stereocenters, the chemical line notations SMILES and InChI are not sufficient to correctly represent the stereochemistry. For example, the ‘either’ bond (wavy line) pointed to a stereocenter cannot be encoded in InChI or SMILES codes. To represent the stereochemistry as precisely as possible we used the Symyx enhanced stereochemistry labeling (see section Use of enhanced stereochemistry). For such cases registered in the UCD, it is the detailed 2D drawing of the structure with enhanced labeling that assures the uniqueness of the molecule. In order to avoid ambiguity and be sure we are working with a single unique structure, the UCD generates a unique UCD code.

[Figure 2 diagram content]
Molecule = neutral unique chemical entity. Every molecule is assigned a unique company code, e.g., UCD01234567.
Substance = Molecule + Salt. Substance ID, e.g., UCD01234567-A, UCD01234567-B, UCD01234567-C.
Batch = Substance physically present or project-relevant as cited in literature. Batch ID, e.g., BC000002152, BC000000122, BC000000121, BC000008641, BC000000154, BC000000158, BC000000560, BC000000176, BC000003174, BC000000320, BC000002504, BC000002598, BC000000894.

Figure 2 Hierarchy of molecule, substance, and batch entities in the UCD. An example of the hierarchy is shown for 2-cyanoacetic acid. Three substances are associated with the molecule: two salts and one neutral substance. Each substance gets a coded letter after the molecule code (e.g., A in UCD01234567-A). Several batches are registered for each substance. Batch codes are generated incrementally upon registration and also serve as submission ID. Batches can be re-assigned to other molecules.
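The molecule, substance, and batch hierarchy of Figure 2 can be illustrated with a small in-memory sketch. This is not the UCD implementation: the class names, the `structure_key` argument (standing in for whatever canonical representation the cartridge produces for the neutral structure), and the sequential numbering of molecule codes are assumptions made for illustration. Only the code shapes (UCD plus eight digits, a letter suffix per substance, BC plus nine digits) follow the examples in the figure.

```python
from dataclasses import dataclass, field
from itertools import count
from string import ascii_uppercase


@dataclass
class Batch:
    code: str            # e.g. "BC000000121"
    substance_code: str  # a batch belongs to exactly one substance


@dataclass
class Substance:
    code: str            # molecule code plus a letter, e.g. "UCD01234567-A"
    salt: str            # hypothetical salt-dictionary key; "" means the neutral substance
    batches: list = field(default_factory=list)


@dataclass
class Molecule:
    code: str            # e.g. "UCD01234567"
    substances: dict = field(default_factory=dict)  # salt key -> Substance


class Registry:
    """Toy registry: uniqueness is decided on the neutral structure key only."""

    def __init__(self):
        self._molecules = {}        # structure_key -> Molecule
        self._mol_seq = count(1)    # assumption: molecule codes are sequential
        self._batch_seq = count(1)  # batch codes are generated incrementally

    def register_batch(self, structure_key: str, salt: str = "") -> Batch:
        mol = self._molecules.get(structure_key)
        if mol is None:  # first time this neutral structure is seen
            mol = Molecule(code=f"UCD{next(self._mol_seq):08d}")
            self._molecules[structure_key] = mol
        sub = mol.substances.get(salt)
        if sub is None:  # new salt form: next free letter (this sketch stops at Z)
            letter = ascii_uppercase[len(mol.substances)]
            sub = Substance(code=f"{mol.code}-{letter}", salt=salt)
            mol.substances[salt] = sub
        batch = Batch(code=f"BC{next(self._batch_seq):09d}", substance_code=sub.code)
        sub.batches.append(batch)
        return batch
```

Registering the same structure key twice with different salts yields two substances under one molecule code, and a duplicate submission only adds a new batch, which mirrors the rule that a duplicate compound results in a new batch rather than a new molecule.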

In the UCD, SMILES and InChI are still generated because these popular chemical line notations are useful to do some queries in external databases and the unique UCD code is an internal company code. However, it should be noted that considerable progress to guarantee uniqueness of the structural description using line notations has been made, as can be seen in two recent publications [16,17], introducing and discussing yaInChI and CTISMILES codes, respectively.

In addition, the UCD allows the registration of mixtures of enantiomers. Because mixtures of the same enantiomers in different ratios (e.g., 50:50 and 30:70) may have different physico-chemical properties, the UCD generates a unique code for each ratio. Importantly, even if different enantiomers or mixtures of enantiomers of the same molecule have different UCD codes, a search using the structure of the molecule will retrieve all the entries.

Dealing with chemical structure
Chemical cartridges
The core technical functionality of the registration system is to handle chemical structures. Basically, this means that each chemical structure must be stored as a unique representation, and that the drawing of the structure must include all structural information, such as stereochemistry, so that users have the possibility to search by exact structure, substructure, or similarity.

Although it is possible to generate unique codes and do searches using classical chemoinformatics tools such as Pipeline Pilot [9] or Chemistry Development Kit [18], we believe the most suitable approach is to use a chemical cartridge as the central core of the registration system. Chemical cartridges have the advantage of offering very good performance (for the search of compounds), because chemical structures are indexed in the database. In order to match our design and concept of the UCD, we chose the Accelrys Direct (formerly Symyx Direct) chemical cartridge (see Table 1).

Use of enhanced stereochemistry
One of the main difficulties with chemical registration systems is the representation of uncertainties of stereoconfigurations and mixtures of stereoisomers. A common challenge in chemical structure registration is to represent stereogenic centers precisely, even when the absolute configuration is not known. We solved this issue by using a leading system of stereo centers description, i.e., Accelrys enhanced stereochemical representation (V3000 format), which uses embedded labels in the structure to allow precise configuration of the molecule for each possibility.

For example, the configuration of the two centers of 4-chloropentan-2-ol in the mixture of stereoisomers can be known or partially known. Embedded stereochemical labeling allows us to represent the relative stereoconfiguration of stereogenic centers. As seen in Figure 3, the six stereoisomers and stereoisomeric mixtures would be registered as six different molecules in the UCD, thus removing the uncertainty of known or unknown stereoconfiguration.

Rules for drawing molecules
Because some molecules can be drawn in more than one way, it is important to define chemical drawing rules that disallow some representations, thus ensuring seamless translation of the structure to the chemical cartridge. In order to be correctly understood by the chemical cartridge, the structure representation must be done in a non-perspective way (see a in Figure 4). When a structure contains bridging atoms, the bonds that are attached to the bridging atoms should not be marked as stereo bonds; instead, explicit hydrogens should be used (see b in Figure 4). The drawing of cis-trans isomerism is also important. Figure 5 shows some examples of the type of drawings that are allowed (a) and not allowed (b).

Finally, sugars can be represented in different ways, but only some of them are fully interpretable by the chemical cartridge. For linear sugars, the preferred drawing uses the line-angle structure rather than the Fischer projection [19] (see a in Figure 6). In a Fischer projection, the horizontal lines represent bonds coming out towards the observer and vertical lines represent bonds going away from the observer behind the plane of the paper. Cyclic structures of monosaccharides are represented using the Haworth projection [20] with a non-perspective drawing (see b in Figure 6). Haworth projections are easy to translate into acceptable drawings. Bonds above the plane of the carbon ring are marked “Up”, and bonds beneath the plane of the carbon ring are marked “Down”. Hydrogen atoms, which are explicit in Haworth projections, are implicit in structures that are drawn for registration.

Standardization and validation of compound representation
Drawing rules provide the necessary help for scientists to represent chemical structures in a manner which is understood correctly by the chemical cartridge. However, because all structures must be checked automatically before they are entered in the database, some standardization rules must be automatically applied. We chose for this purpose Accelrys Cheshire (formerly Symyx Cheshire [9]), a chemistry-oriented scripting platform. With this tool it is possible to apply corporate standards to check and adjust chemical structures and neutralized structures. Some examples of standardization rules and error checks are presented in Figure 7.

Figure 3 Different cases of stereochemistry managed by Accelrys Draw and Accelrys Direct chemical cartridge. (a) Single stereoisomer, absolute configuration known. (b) Single unknown enantiomer, relative configuration of the two stereogenic centers known. (c) Mixture of four stereoisomers. (d) Mixture of two stereoisomers with the same relative configuration. (e) One of four possible stereoisomers. (f) Nothing is known about the configuration of the two stereogenic centers.
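To make the enhanced stereochemistry labels more concrete: in a V3000 molfile they are stored as collections named `MDLV30/STEABS` (absolute), `MDLV30/STERACn` (the "&n" or AND groups), and `MDLV30/STERELn` (the "orn" or OR groups), each listing an atom count followed by atom indices. The following sketch, which is not part of the UCD, reads those collection lines from a molblock. The function name and the atom indices in the usage note are hypothetical, and continuation lines (ending in "-") are not handled.

```python
import re

# Collection names used by the V3000 format for enhanced stereochemistry.
_PATTERN = re.compile(
    r"MDLV30/(STEABS|STERAC|STEREL)(\d*)\s+ATOMS=\((\d+)((?:\s+\d+)*)\)"
)
_LABEL = {"STEABS": "abs", "STERAC": "&", "STEREL": "or"}


def stereo_groups(molblock: str) -> dict:
    """Map each enhanced-stereo label (abs, &1, or1, ...) to its atom indices."""
    groups = {}
    for line in molblock.splitlines():
        match = _PATTERN.search(line)
        if not match:
            continue
        kind, number, n_atoms, rest = match.groups()
        atoms = [int(a) for a in rest.split()]
        if len(atoms) != int(n_atoms):  # the first number is the declared count
            raise ValueError(f"atom count mismatch in: {line.strip()}")
        groups[_LABEL[kind] + number] = atoms
    return groups
```

For a fragment containing `M  V30 MDLV30/STERAC1 ATOMS=(1 2)` and `M  V30 MDLV30/STERAC2 ATOMS=(1 4)`, the function returns one "&1" group and one "&2" group, that is, two independently racemic centers, which corresponds to a mixture of four stereoisomers.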

Figure 4 Representation of molecule. To be correctly understood by the chemical cartridge, the drawing must be non-perspective (a), and stereo bonds in rings must be avoided (b).

In some cases, structures can be standardized or checked automatically. In addition to an automatic check of the structure drawing by Accelrys Cheshire, the good drawing rules and validation by an expert established in the company help ensure that only correct structures are stored in the database. An example of the validation of chemical structures with Accelrys Cheshire in our platform is presented in Figure 8.

Registration workflow
When submitting a molecule to be registered in the UCD system, the submitters have to follow a registration workflow (Figure 9). Chemists wanting to add a new batch to the UCD should consider whether the chemical substance fulfills the criteria for inclusion in the database (see criteria table in Figure 9). If the chemical structure should be in the UCD, chemists must complete a form with the structure drawing. In cases where the exact structure of the compound is not known, the user can use the “No Structure” function of Symyx Draw to fill in the form. This form also contains various fields for data associated with the batch (e.g., common name, source, internal identifier). When the user validates the drawing of the molecular structure, a normalization process is automatically executed, and the normalized structure is also displayed in the form (see example in Figure 8). The user must then decide whether or not to approve this normalization. When the form is complete, the user must click the Submit button to copy all the data into a transitory area that we call the submission area. This area contains the compounds that have been submitted by users, but not yet validated. An automated control ensures that a molecule can be submitted only if all mandatory data are present and all the field validations are passed (e.g., the experimental molecular weight field accepts only numbers). Once a molecule is submitted, there is a manual review and approval by a chemoinformatician. Approved molecules (and all related information) are copied into the registration area, which is the final database. When a submitted molecule is not approved, the chemist who submitted this molecule has the option to modify and resubmit it. The user roles for this registration workflow are described in the next section (Roles). This two-stage quality control process ensures the accuracy of the data.

In addition, once a molecule is in the registration area, we can manage the potential user errors in the structure entry by reassigning or archiving the batches. The system also allows the records to be updated to include newly discovered properties of individual structures. This batch reassignment process will be explained in section Batch reassignment.
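The two-stage workflow (automated form control, then manual review) can be sketched as a small state machine. This is an illustrative model under stated assumptions, not the Isentris-based implementation: the field names in `MANDATORY`, the `experimental_mw` key, and the class names are hypothetical, and the sketch moves an approved entry between two lists where the text speaks of copying it into the registration area.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    SUBMITTED = "submitted"    # waiting in the submission area
    REGISTERED = "registered"  # approved and placed in the registration area
    REJECTED = "rejected"      # returned to the submitter for correction


MANDATORY = ("structure", "source", "submitter")  # hypothetical field names


@dataclass(eq=False)  # identity comparison: two identical forms stay distinct
class Submission:
    data: dict
    status: Status = Status.SUBMITTED


def validate(data: dict) -> list:
    """Automated control: return a list of problems, empty if the form is valid."""
    problems = [f"missing {name}" for name in MANDATORY if not data.get(name)]
    weight = data.get("experimental_mw")
    if weight is not None:
        try:
            float(weight)  # the field accepts only numbers
        except (TypeError, ValueError):
            problems.append("experimental_mw must be a number")
    return problems


@dataclass
class Workflow:
    submission_area: list = field(default_factory=list)
    registration_area: list = field(default_factory=list)

    def submit(self, data: dict) -> Submission:
        problems = validate(data)
        if problems:
            raise ValueError("; ".join(problems))
        entry = Submission(dict(data))
        self.submission_area.append(entry)
        return entry

    def review(self, entry: Submission, approved: bool) -> None:
        """Manual control, performed by a registrar."""
        if entry.status is not Status.SUBMITTED:
            raise ValueError("only entries awaiting review can be reviewed")
        if approved:
            entry.status = Status.REGISTERED
            self.submission_area.remove(entry)
            self.registration_area.append(entry)
        else:
            entry.status = Status.REJECTED  # stays in the submission area

    def resubmit(self, entry: Submission, data: dict) -> None:
        if entry.status is not Status.REJECTED:
            raise ValueError("only rejected entries can be resubmitted")
        problems = validate(data)
        if problems:
            raise ValueError("; ".join(problems))
        entry.data = dict(data)
        entry.status = Status.SUBMITTED
```

A form with `"structure": "No Structure"` passes the mandatory-field check in this sketch, reflecting that compounds without a known structure can still be submitted.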

Figure 5 Allowed (a) and not allowed (b) drawings of cis-trans isomerism.

Data validation
Validation of data registered in the chemical registration system is imperative to ensure the reliability of the content. In our case, the two staging areas allow scientists to be part of the registration process and the registrar to check discrepancies and errors. However, because we decided that not all scientists should be able to create new records, we defined three roles.

Roles
Authentication to access the database is managed by Lightweight Directory Access Protocol (LDAP) accounts with three groups corresponding to the three roles. Depending on the group that the user belongs to, different permissions can be assigned (Figure 10).

• The role of Viewer includes all R&D users that have access to the UCD. Viewers can search and view registered data via a web interface (except for some “restricted fields” reserved for specific teams).
• The role of Submitter is reserved for scientists who create new information. Submitters have the same privileges as Viewers, but can also access a submission form to insert new records in the database.
• The role of Registrar is restricted to chemoinformaticians. Registrars have the same

Figure 6 Allowed representation of sugars. (a) Open-chain form of D-glucose: the line-angle structure is the allowed drawing, in preference to the Fischer projection. (b) Cyclic form of α-D-glucose: the allowed drawing is based on the Haworth projection; the perspective drawing is not allowed.

Figure 7 Examples of standardization rules and error check. Panels a-c: standardization; panels d-e: flagged as erroneous; panels f-g: technical limitations. (a-c) Some standardization rules are applied automatically by Accelrys Cheshire. (a) Nitrile group can be automatically redrawn linearly. (b) For a stereocenter it is not necessary to have two double up bonds and/or two double down bonds. (c) Up and/or Down bonds must be oriented to the stereocenter. (d-e) Some compounds cannot be corrected, but can be detected as erroneous. (d) In this case it is not possible to determine the stereochemistry. (e) Valence is not correct. In some cases it is not possible to correct or detect the error automatically. (f-g) Some drawings meet technical limitations. (f) The configuration of the stereobond cannot be determined. (g) The configuration of the stereocenter cannot be determined.
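As an illustration of the kind of error check shown in panel (e) of Figure 7, the sketch below flags atoms whose bond orders plus hydrogens do not match a usual valence. It is a simplified stand-in and does not use or reproduce Accelrys Cheshire: the input representation (element symbols, bond tuples, hydrogen counts) and the short valence table for neutral atoms are assumptions made for this example.

```python
# Usual valences for neutral atoms; an illustrative subset, not a complete table.
ALLOWED_VALENCES = {"H": {1}, "C": {4}, "N": {3}, "O": {2}, "Cl": {1}, "Br": {1}}


def valence_errors(atoms, bonds, hydrogens):
    """Return (index, symbol, valence) for every atom with an unusual valence.

    atoms:     list of element symbols, e.g. ["C", "C", "N"]
    bonds:     list of (i, j, order) tuples using 0-based atom indices
    hydrogens: number of hydrogens attached to each atom, same order as atoms
    """
    used = list(hydrogens)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    errors = []
    for index, (symbol, valence) in enumerate(zip(atoms, used)):
        allowed = ALLOWED_VALENCES.get(symbol)
        if allowed is not None and valence not in allowed:
            errors.append((index, symbol, valence))
    return errors
```

Elements missing from the table are skipped rather than flagged, so the check only reports atoms it has a rule for; a production check would also need to account for formal charges.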

privileges as Submitters, but are also responsible for checking and validating data in the submission area (source, project ID, etc.) and giving approval for registering an entry.

As each user also belongs to a team group, when data associated with a molecule from a team are business sensitive, the UCD allows the information to be entered in a specific restricted field (e.g., study activity) that can be viewed by this team only.

Batch reassignment
A batch can be reassigned to a different substance if the user realizes that the structure was entered incorrectly or determines stereogenic centers during later experiments. If a substance no longer has a batch, the

Figure 8 Example of a standardization of the structure in the interface of the registration system of our company. The nitrile group is drawn linearly and the acid group is put in the neutral form (Molecule Data). The output of the normalization script (Normalized Molecule) is presented to the user, who can accept the changes prior to the submission of the molecule. Salt can be selected at the Substance Data level.

Figure 9 UCD Registration workflow. Tasks to be followed by Submitters (left part) and Registrars (right part). Criteria table defines which compound is to be registered.

In addition, the system automatically calculates the UCD theoretical molecular mass of both molecule and sub- Read stance. The system neutralizes the charge of the 1. Viewer molecule by adding one hydrogen to this molecule and Submission Registration the submitter selects the corresponding counter ion Area Area from the predefined dictionary of salts (for an example, Add new salt of the carboxylic acid in Figure 11) for substance. As record the charged form of the molecule is not represented in 2. Submitter the database, the molecular mass of the salt is decreased Validate new by the theoretical molecular mass of one hydrogen to record automatically calculate the theoretical molecular mass of the substance (Figure 11).

Figure 10 Three user roles and two staging areas of the database.

substance is archived as inactive. If after this process a molecule no longer has a substance, the molecule is archived. There is no ‘delete’ process; all new entries are assigned chronologically with new codes. This procedure allows for correction of errors without losing any information related to the archiving process.

Automatic calculations
In a chemical database, it is useful to have names and certain properties (such as ADMET) calculated automatically. Even though our system generates corporate compound IDs (UCD codes) upon the registration of each entry, it is important that chemical names and other identifiers, such as IUPAC names or CAS numbers, be recorded to provide links to molecules in external databases. The ACD/Labs Name Batch tool is used to automatically and accurately generate most names from the molecular structure according to the IUPAC guidelines. However, naming structures with enhanced stereochemistry is an issue: see the example of (2R,4S)-4-chloropentan-2-ol in Figure 3, for which IUPAC names with such a detailed description of stereochemistry cannot be generated. Accelrys Pipeline Pilot is used to predict ADMET properties and to calculate lead-like and Lipinski indicators. ACD/Labs PhysChem is used to automatically calculate water solubility values for the molecules.

Implementation of the platform
Implementation approach
The implementation of chemical registration systems is known to be a lengthy and difficult process. It was critical from the start that the project be delivered in the shortest timeframe possible and that the implementation team be built to ensure maximum efficiency. As mentioned previously, the team comprised three chemoinformaticians and one project manager, with support from the former Symyx consulting team. This team was empowered by management to make all decisions regarding the chemical representation and standardization rules, the data structure, and the technical implementation strategy. A production system was available four months after the initial kick-off meeting.

Database
The database is hosted on Oracle 11g with the Symyx Direct 6.3 cartridge. The major effort in the design phase of the project was to conceive the database model (Figure 12). As described above, data are inserted in the database in two steps (submission and registration area). These two areas are clearly separated in the data model. The submission part of the schema is a buffer area containing all the information entered by the user. The registration tables contain the validated information. Molecule, substance, and batch tables reflect the organization of the chemical data presented before. Properties are stored in dedicated tables for each level of information. The information related to security and additional properties is stored in separate tables (green and white in Figure 12).
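The separation between the submission and registration areas can be pictured with a toy schema. The sketch below uses Python's built-in `sqlite3` module instead of Oracle with a chemistry cartridge, and every table name, column name, and the `structure_key` text column (a SMILES-like placeholder standing in for the cartridge's structure comparison) is a hypothetical simplification, not the schema of Figure 12.

```python
import sqlite3

# Toy schema; all names are hypothetical and far simpler than Figure 12.
SCHEMA = """
CREATE TABLE submission (            -- buffer area: raw user input
    id INTEGER PRIMARY KEY,
    structure_key TEXT NOT NULL,     -- stand-in for a normalized structure
    salt TEXT,                       -- picked from a salt dictionary, may be NULL
    batch_ref TEXT NOT NULL,
    validated INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE molecule (              -- registration area: one row per structure
    id INTEGER PRIMARY KEY,
    structure_key TEXT NOT NULL UNIQUE
);
CREATE TABLE substance (             -- molecule plus optional salt
    id INTEGER PRIMARY KEY,
    molecule_id INTEGER NOT NULL REFERENCES molecule(id),
    salt TEXT NOT NULL DEFAULT '',
    UNIQUE (molecule_id, salt)
);
CREATE TABLE batch (
    id INTEGER PRIMARY KEY,
    substance_id INTEGER NOT NULL REFERENCES substance(id),
    batch_ref TEXT NOT NULL
);
"""


def register(conn, submission_id):
    """Registrar step: copy one submission into the registration tables."""
    key, salt, batch_ref = conn.execute(
        "SELECT structure_key, COALESCE(salt, ''), batch_ref "
        "FROM submission WHERE id = ?",
        (submission_id,),
    ).fetchone()
    # INSERT OR IGNORE keeps an existing row, so each structure is stored once.
    conn.execute("INSERT OR IGNORE INTO molecule (structure_key) VALUES (?)", (key,))
    mol_id = conn.execute(
        "SELECT id FROM molecule WHERE structure_key = ?", (key,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO substance (molecule_id, salt) VALUES (?, ?)",
        (mol_id, salt),
    )
    sub_id = conn.execute(
        "SELECT id FROM substance WHERE molecule_id = ? AND salt = ?",
        (mol_id, salt),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO batch (substance_id, batch_ref) VALUES (?, ?)",
        (sub_id, batch_ref),
    )
    conn.execute("UPDATE submission SET validated = 1 WHERE id = ?", (submission_id,))
    conn.commit()


def demo():
    """Submit three batches of one structure (two forms) and register them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO submission (structure_key, salt, batch_ref) VALUES (?, ?, ?)",
        [
            ("CC(=O)O", None, "B-001"),
            ("CC(=O)O", "sodium", "B-002"),
            ("CC(=O)O", None, "B-003"),
        ],
    )
    for (sid,) in conn.execute("SELECT id FROM submission").fetchall():
        register(conn, sid)
    counts = tuple(
        conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in ("molecule", "substance", "batch")
    )
    conn.close()
    return counts
```

Calling `demo()` registers three batches under two substances that share a single molecule row, which is the uniqueness property the molecule, substance, and batch levels are meant to provide. A real system would compare structures through the chemistry cartridge rather than by string equality.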

Figure 11 The process of calculation of the final theoretical mass of the substances. Normally, the molecule form is hydrogenated by the system, but the final mass is calculated correctly by the sum of the values of the acid and counter ion minus one hydrogen.

Figure 12 Database model of the Unique Compound Database. The database is composed of tables related to the submission area (blue), registration area (yellow), security (green), and additional properties (white).

Software
Several programs were used to build the UCD platform. As previously mentioned, data are stored in an Oracle 11g database using Accelrys Symyx Direct 6.3. Pipeline Pilot 8.0 is used to compute overnight, for any new molecule entry, the physicochemical properties (e.g., molecular weight, logP) and IUPAC names (generated by ACD/Labs Name). The Symyx module Isentris was used to develop the main part of the platform (Figure 13). The Isentris platform offers a visual editor to build data sources on top of Oracle in order to access the data without SQL code. Two data sources were built for the project, one for the submitter and one for the registrar. Then a desktop application was built using the Isentris form designer. The application is executed using Isentris Client. It is used by submitters to enter new molecules and by the registrar to validate the data.

The number of submitters is limited in our company, but many people were interested in consulting the data. We wanted a more flexible solution for the people who are only interested in viewing the data. We therefore chose to develop a web interface using Oracle Application Express 4.1 (Apex) to visualize the data in the UCD. Apex is a tool integrated by default in Oracle 11g and dedicated to building web interfaces for Oracle databases in a very efficient way. The web interface is available to a large number of users in our company.

Figure 13 Software architecture of the UCD. The database part is based on Oracle and Accelrys Direct (formerly Symyx Direct). The software to input data is a desktop application based on the Isentris platform. The visualization interface was developed using Oracle Application Express and is accessible with a web browser. Pipeline Pilot is used as an ETL to compute physicochemical properties and chemical names overnight.

Data migration
Once the UCD was ready, a critical step was the migration of all the existing data. As is often the case, our scientists had the names of the molecules and some information about them (e.g., CAS number, experimental molecular weight, origin of the compound) stored locally in diverse file formats such as Excel or text. The following section describes the challenges involved in transforming names into structures and enabling automatic import of molecules into the UCD.

Transformation of names into structures
In building the UCD without any existing infrastructure, we had two major sources of compounds available: 1) names of molecules as listed in the literature and 2) internal working compounds, which are often referred to by their IUPAC names, common names (e.g., harmane), or CAS numbers.

The names do not always follow the IUPAC recommendations; therefore, transforming the names into structures and importing them into the UCD was challenging. The issue was addressed using the standard software module ‘ACD/Name to Structure Batch’ from ACD/Labs [21]. This software generates accurate structures for entire libraries of compound names. As illustrated by the numbers in Figure 14, even though this software can transform a large number of names into structures, a considerable amount of time must be spent by the submitters and registrars correcting the structural representation of molecules.

Of the 7372 molecule names entered, 70% of the structures were generated correctly; 7% of the structures were generated with warning messages, which required manual curation; and 23% of the structures were not generated because the names were not recognized by the software (Figure 14). For this subset of molecules there was no easy or automatic way to obtain a structure; therefore, the chemists had to check and correct each name and its structure manually. To obtain a structure for these molecules, we conducted searches in PubChem [22], ChemSpider [23], and Google [24]. For some molecules, an associated CAS number was available, which allowed us to obtain the structure using SciFinder [25], a tool for exploring the CAS databases.

Figure 14 Success rates for converting chemical names to structure. In this example around 70% of the names were transformed correctly into structures, 7% of the generated structures were ambiguous, and 23% of the names were not recognized by the software.

Clearly, during the construction of our UCD, structure conversion took considerable time (on the order of months for one chemoinformatician), and the effort and time required should not be underestimated.

Automated migration workflow
Once the structures were available, all the compounds had to be imported into the UCD. To avoid doing this import manually molecule by molecule, which would be extremely time-consuming, a Pipeline Pilot protocol was developed to automate the importation (Figure 15).

This protocol requires an input file in SDF (structure-data file) format. In this input file the structure of the molecules is described by a Connection Table in V3000 format (as explained in section Use of enhanced stereochemistry). The SDF format was chosen because of

Figure 15 Accelrys Pipeline Pilot protocol to automatically insert an entire library of compounds.
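The input to such an import protocol is an SD file, in which each record is a molfile block followed by data items of the form `> <FIELD>` and a `$$$$` terminator. The sketch below, in plain Python, appends data items to an existing molfile block and reads them back. It is only an illustration of the file layout: the function names and the field names (`PROJECT_NAME`, `INTERNAL_ID`, `SCIENTIST`) are hypothetical, the molfile block is a placeholder string rather than a real V3000 connection table, and the Pipeline Pilot protocol itself is not reproduced.

```python
def format_sdf_record(molblock, data):
    """Return one SD file record: molfile block, data items, '$$$$' terminator."""
    lines = [molblock.rstrip("\n")]
    for field, value in data.items():
        lines.append(f"> <{field}>")
        lines.append(str(value))
        lines.append("")  # a blank line ends each data item
    lines.append("$$$$")
    return "\n".join(lines) + "\n"


def parse_sdf_data(record):
    """Extract the data items of one SD file record into a dict."""
    data = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line:
            # Field name sits between the angle brackets of the header line.
            field = line[line.index("<") + 1 : line.rindex(">")]
            i += 1
            values = []
            while i < len(lines) and lines[i] != "" and lines[i] != "$$$$":
                values.append(lines[i])
                i += 1
            data[field] = "\n".join(values)
        i += 1
    return data


# Placeholder molfile block; a real file would hold a V3000 connection table here.
record = format_sdf_record(
    "(molfile block with V3000 connection table goes here)\nM  END",
    {"PROJECT_NAME": "demo", "INTERNAL_ID": "X-42", "SCIENTIST": "A. Chemist"},
)
fields = parse_sdf_data(record)
```

Keeping the metadata in the same record as the structure means a single file can drive the whole import, without a separate spreadsheet that has to be matched back to each molecule.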

its ability to include associated data. The input file contains information about the molecule structure and all the data associated with this molecule (e.g., project name, internal identifier, scientist who works on this compound).

In order to ensure that structures are normalized according to the same rules as defined for the manual import, we developed a Java component for Pipeline Pilot which uses the Java API of Accelrys Cheshire to normalize chemical structures.

Discussion and conclusion
At PMI R&D, we have built, in a very short timeframe, a chemical registration system called the Unique Compound Database (UCD), which manages the registration process in an efficient and non-redundant manner. In order to register data efficiently and accurately, the UCD has the flexibility to register molecules with unknown structures or mixtures of compounds and at the same time can be used to register known structures with a precisely defined stereochemical configuration. This level of detail ensures the uniqueness of chemical records. Pre-defined standardization rules, drawing rules, automated normalization, and enhanced stereochemistry labeling decrease the chance of erroneous or ambiguous registry. Moreover, the system decreases the name-to-structure ambiguity by using only drawn structures (with the enhanced stereochemistry when known) and by generating names only after the registration process is completed.

The reliability of the database and the accuracy of the registration process are enhanced by the two-stage area. The Submitter or bench chemist takes ownership of the records and registers the records “from the bench”. This process is assisted by the automated standardization rules and automatic structure check. The Registrar reviews the submitted molecules and validates the structures before registration. The two-stage area system also allows the Registrar to detect potential software issues that the Submitter might have encountered.

Concerning the molecule-to-salt and salt-to-batch associations, our model prefers registering molecules as neutral entities, where predefined salts are listed in a dictionary and selected by the user at the substance level. Salts are standardized and do not have to be drawn. Batches are then assigned to the substance entry. We believe this process provides a higher level of molecule description and easier traceability of different entries. Furthermore, batches can be re-assigned or archived, thus providing the company a way to deal with new changes to the structures (i.e., structure elucidation) and to log such changes. We have also observed that actual data import into the newly built platform is the most time-consuming and the most challenging step.

Finally, we believe that the UCD concept is an efficient and progressive way to accurately register and describe all structures at the corporate level. The database is modular and flexible. It allows us to link the accurately described molecules to other databases. Thus, uniqueness of molecular description in the UCD provides the robust foundation of the company chemical space. In consequence, compounds can be moved to complex knowledge bases and data can be mined for biological activities, modes of action, and therapeutic outcomes.

Future development of the UCD platform will include the integration of spectroscopic information linked to the different entities stored in the system. We are also planning to move the entire system to a web-based architecture using the latest sketcher technologies, which will increase the ability of the bench scientist to easily register any new substance. In addition, we will be integrating the chemical standardization rules within the database itself to simplify the maintenance process.

Competing interests
The software that was used as the basis of the database development (from Accelrys) was selected using an independent PMI review process. Subsequently, FG from Accelrys was employed to customize the development of the platform.

Authors’ contributions
MCP conceived the project. JD and PP jointly designed the concept and managed the project. AM and FG did the development phase of the project. EM and AM provided input to the design, constructed the registration platform, and are the main authors of this manuscript. EM populated the database. All authors reviewed and approved the final manuscript.

Acknowledgements
The authors express their gratitude to Peter Hliva for developing a Java component for Pipeline Pilot which uses the Java API of Accelrys Cheshire and to Lynda Conroy for editing the manuscript.

Author details
1Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland. 2Accelrys, http://accelrys.com/.

Received: 13 February 2012 Accepted: 11 May 2012 Published: 31 May 2012

References
1. Chemical Structure Information Systems: Interfaces, Communication, and Standards, ACS Symposium Series 400. Washington, DC: American Chemical Society; 1989.
2. Buntrock RE: Chemical registries–in the fourth decade of service. J Chem Inf Comput Sci 2001, 41:259–263.
3. Gobbi A, Funeriu S, Ioannou J, Wang J, Lee M-L, Palmer C, Bamford B, Hewitt R: Process-driven information management system at a biotech company: concept and implementation. J Chem Inf Comput Sci 2004, 44:964–975.
4. O'Donnell TJ: Design and use of relational databases in chemistry. Boca Raton, London, New York: CRC Press; 2009.
5. Weisgerber DW: Chemical abstracts service chemical registry system: history, scope, and impacts. J Am Soc Inf Sci 1997, 48:349–360.

6. Martin E, Monge A, Duret J, Pospisil P: Building an R&D chemical registration system. In Ninth International Conference on Chemical Structures (ICCS): June 5–9. Noordwijkerhout: Poster P-13; 2011.
7. Martin E, Monge A, Duret J, Peitsch M, Pospisil P: Building an R&D chemical registration system. In 43rd IUPAC World Chemistry Congress: July 31-August 5. San Juan: TPC200-Poster Session I; 2011.
8. Martin E, Duret J, Monge A, Knorr A, Stueber M, Stratmann A, Arndt D, Peitsch M, Pospisil P: Building a corporate R&D chemical registration system that links structures to analytical spectra and biological activities. In 43rd IUPAC World Chemistry Congress: July 31-August 5. San Juan: IAC102-General Oral Session III; 2011.
9. Accelrys Web-page. [http://accelrys.com/].
10. Rodgman A, Perfetti TA: The chemical components of tobacco and tobacco smoke. Boca Raton, London, New York: CRC Press; 2008.
11. Oracle. [http://www.oracle.com/].
12. Microsoft Windows Server. [http://www.microsoft.com/windowsserver].
13. VMware. [http://www.vmware.com/].
14. Citrix. [http://citrix.com/].
15. Extract Transform Load. [http://www.etltool.com/what-is-etl.htm].
16. Cho YS, No KT, Cho KH: yaInChI: Modified InChI string scheme for line notation of chemical structures. SAR QSAR Environ Res 2012, 23:237–255.
17. Gobbi A, Lee M-L: Handling of Tautomerism and Stereochemistry in Compound Registration. J Chem Inf Model 2011, 52:285–292.
18. Chemistry Development Kit. [http://sourceforge.net/projects/cdk/].
19. McMurry J: Essentials of general, organic, and biological chemistry. Englewood Cliffs, NJ: Prentice Hall; 1989.
20. Haworth projection. [http://goldbook.iupac.org/H02749.html].
21. Advanced Chemistry Development, Inc., Toronto, ON, Canada. [http://acdlabs.com/].
22. PubChem. [http://pubchem.ncbi.nlm.nih.gov/].
23. ChemSpider. [http://www.chemspider.com/].
24. Google. [http://www.google.com/].
25. SciFinder. [https://scifinder.cas.org].

doi:10.1186/1758-2946-4-11 Cite this article as: Martin et al.: Building an R&D chemical registration system. Journal of Cheminformatics 2012 4:11.
