An Algorithm to Identify Functional Groups in Organic Molecules Peter Ertl*

Ertl J Cheminform (2017) 9:36 DOI 10.1186/s13321-017-0225-z METHODOLOGY Open Access An algorithm to identify functional groups in organic molecules Peter Ertl* Abstract Background: The concept of functional groups forms a basis of organic chemistry, medicinal chemistry, toxicity assessment, spectroscopy and also chemical nomenclature. All current software systems to identify functional groups are based on a predefned list of substructures. We are not aware of any program that can identify all functional groups in a molecule automatically. The algorithm presented in this article is an attempt to solve this scientifc challenge. Results: An algorithm to identify functional groups in a molecule based on iterative marching through its atoms is described. The procedure is illustrated by extracting functional groups from the bioactive portion of the ChEMBL database, resulting in identifcation of 3080 unique functional groups. Conclusions: A new algorithm to identify all functional groups in organic molecules is presented. The algorithm is relatively simple and full details with examples are provided, therefore implementation in any cheminformatics toolkit should be relatively easy. The new method allows the analysis of functional groups in large chemical databases in a way that was not possible using previous approaches. Keywords: Functional group, Chemical functionality, Organic chemistry, Medicinal chemistry Background type of publications is work by Bobach et al. describing Te concept of functional groups (FGs)—sets of con- a rule-based defnition of chemical classes to classify nected atoms that determine properties and reactivity of compounds into classes [3] or the ClassyFire software parent molecule, forms a cornerstone of organic chem- [4] developed in the Wishart’s group allowing chemists istry, medicinal chemistry, toxicity assessment, spectros- to perform large-scale automated chemical classifcation copy and, last but not least, also chemical nomenclature. based on a structure-based chemical taxonomy consist- Te study of common FGs forms substantial part of ing of over 4800 categories. basic organic chemistry curriculum. Numerous scien- Various substructure features are often used in chem- tifc papers and books focus on properties and reactiv- informatics in connection with machine learning to ity of various FGs. A well known example is the classical develop models to predict biological activity or proper- book series “Chemistry of functional groups” describ- ties of molecules [5]. In this approach the substructure ing various classes of organic molecules [1] consisting of descriptors are generated by extracting groups of atoms over 100 volumes. Tere is, however, surprisingly little from a molecule using a predefned algorithm. Exam- attention paid to the study of functional groups from the ples of such descriptors are linear or atom centered frag- cheminformatics point of view. Te majority of theoreti- ments, topological torsions, pharmacophoric triplets and cal studies are utilizing FGs as a basis of chemical ontolo- many others. Although such fragment descriptors are gies, where FGs are “keys” that are used to hierarchically very useful, they do not provide description of functional classify molecules into categories [2]. An example of this groups. Te fragments are generally strongly overlapping and are generated for all parts of a molecule without con- sidering their potential chemical role. *Correspondence: [email protected] Novartis Institutes for BioMedical Research, 4056 Basel, Switzerland © The Author(s) 2017. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Ertl J Cheminform (2017) 9:36 Page 2 of 7 One of the first software tools to identify FGs was • all atoms in oxirane, aziridine and thiirane rings the checkmol program written by Haider [6] that was (such rings are traditionally considered to be func- able to identify about 200 FGs. Recently an extended tional groups due to their high reactivity). version of the program containing 583 manually curated functionalities encoded as SMARTS was 3. merge all connected marked atoms to a single FG published [7]. This list includes also numerous het- 4. extract FGs also with connected unmarked carbon erocyclic rings and general structural patterns (i.e. atoms, these carbon atoms are not part of the FG 5-membered aromatic ring with 1 heteroatom). These itself, but form its environment. substructure features are used to develop QSAR models for prediction of toxicity and various molecular Te algorithm described above iterates only through physicochemical properties. The well-known ZINC non-aromatic atoms. Aromatic heteroatoms are collected database and related web-based software suite [8] as single atoms, not as part of a larger system. Tey are stores about 500, so called, chemical patterns, that extended to a larger FG only when there is an aliphatic speed-up substructure searches and allow estimation functionality connected (for example an acyl group con- of molecule reactivity. The patterns include PAINS fil- nected to a pyrrole nitrogen). Heteroatoms in heterocy- ters [9] that identify frequent hitters interfering with cles are traditionally not considered to be “classical” FGs biochemical screens as well as some other substruc- by themselves but simply to be part of the whole hetero- tures. Another widely used set of substructures used cyclic ring. Te rationale for such treatment is enormous to identify potentially reactive or promiscuous mol- diversity of heterocyclic systems. For example in our ecule has been defined by Eli Lilly scientists based on previous study [12] nearly 600,000 diferent heterocycles their experience with internal screening campaigns consisting of 1–3 fused 5- and 6- membered rings were [10]. Recently a set of generic chemical functionalities enumerated. called ToxPrint chemotypes that describe molecule After marking all atoms that are part of FGs as substructure and reaction features and atom and bond described above, the identifed FGs are extracted properties was defined within the ToxPrint program together also with their environment—i.e. connected [11]. The main goal of the tool is to use these features carbon atoms, when the type of carbon (aliphatic or aro- in toxicity modelling. matic) is also preserved. We are not aware of any software system able to iden- We do not claim that this algorithm provides an ulti- tify FGs that is not based on manually curated set of sub- mate defnition of FGs. Every medicinal chemist has structure features, but instead automatically identifes probably a slightly diferent understanding about what all functional groups in a molecule. Te algorithm pre- a FG is. In particular the defnition of activated sp3 car- sented in this article is an attempt to solve this scientifc bons may create some discussion. In the present algo- challenge. rithm we restricted our defnition only to classical acetal, thioacetal or aminal centers (i.e. sp3 carbons having at Methods least 2 oxygens, sulfurs or nitrogens as neighbors) and Identifcation and extraction of functional groups did not consider other similar systems, i.e. alpha-substi- Te majority of FGs contain heteroatoms. Terefore our tuted carbonyls or carbons connected to S=O or similar approach is based on processing heteroatoms and their bonds. During the program development phase various environment with the addition of some other functionali- such options have been tested, and this “strict” defni- ties, like multiple carbon–carbon bonds. tion provided the most satisfactory results. Extension of Te algorithm is outlined below: FGs also to alpha-substituted carbonyls (i.e. heteroatom or halogen in alpha position to carbonyl) and similar sys- 1. mark all heteroatoms in a molecule, including halo- tems more than triple the number of FGs identifed, gen- gens erating many large and rare FGs. Since our major interest 2. mark also the following carbon atoms: was in comparing various molecular datasets and not in reactivity estimation we implemented this strict defni- • atoms connected by non-aromatic double or triple tion of acetal carbons. To assess the possible reactivity bond to any heteroatom of molecules, various substructures flters are available, • atoms in nonaromatic carbon–carbon double or as for example already mentioned PAINS [9] or Eli Lilly triple bonds rules [10]. • acetal carbons, i.e. sp3 carbons connected to two or To illustrate better the algorithm some examples of more oxygens, nitrogens or sulfurs; these O, N or S FGs identifed for few simple molecules are shown in atoms must have only single bonds Fig. 1. Ertl J Cheminform (2017) 9:36 Page 3 of 7 Fig. 1 Example of functional groups identifed. Groups are color coded according to their type Fig. 2 Various forms of the urea functionality difering in the environment patterns. The numbers in the corner indicate the number of molecules in ChEMBL in which this particular group is present and the percentage Ertl J Cheminform (2017) 9:36 Page 4 of 7 Generalization of functional

An Algorithm to Identify Functional Groups in Organic Molecules Peter Ertl*

Fluorescent Aminal Linked Porous Organic Polymer for Reversible Iodine Capture and Sensing Muhammad A

Synthesis and Structural Studies of a New Class of Quaternary Ammonium

An Indexing and Classification System for Earth-Science Data Bases

Physicochemical Properties of Organic Medicinal Agents

Synthesis of Monocyclic Diaziridines and Their Fused Derivatives

Biosynthesis of Natural Products Containing Β-Amino Acids

Synthesis and Characterization of Functionalized Poly(Arylene Ether Sulfone)S Using Click Chemistry

Derivatives of the Triaminoguanidinium Ion, 6. Aminal-Forming Reactions with Aldehydes and Ketones [9]

Synthesis and Characterization of Covalent Adaptable Networks

Ethyl Methyl Sulfone-Based Electrolytes for Lithium Ion Battery Applications

Photoredox Α‑Vinylation of Α‑Amino Acids and N‑Aryl Amines Adam Noble and David W

Genx Chemicals”