Flexible Heuristic Algorithm for Automatic Molecule Fragmentation: Application to the UNIFAC Group Contribution Model Simon Müller*

Müller J Cheminform (2019) 11:57
https://doi.org/10.1186/s13321-019-0382-3
Journal of Cheminformatics, RESEARCH ARTICLE, Open Access

Abstract
A priori calculation of thermophysical properties and predictive thermodynamic models can be very helpful for developing new industrial processes. Group contribution methods link the target property to contributions based on chemical groups or other molecular subunits of a given molecule. However, the fragmentation of the molecule into its subunits is usually done manually, impeding the fast testing and development of new group contribution methods based on large databases of molecules. The aim of this work is to develop strategies to overcome the challenges that arise when attempting to fragment molecules automatically while keeping the definition of the groups as simple as possible. Furthermore, these strategies are implemented in two fragmentation algorithms. The first algorithm finds only one solution, while the second algorithm finds all possible fragmentations. Both algorithms are tested by fragmenting a database of 20,000 molecules for use with the group contribution model Universal Quasichemical Functional Group Activity Coefficients (UNIFAC). Comparison of the results with a reference database shows that both algorithms are capable of successfully fragmenting all the molecules automatically. Furthermore, when applying them to a larger database, it is shown that the newly developed algorithms are capable of fragmenting structures previously thought not possible to fragment.

Keywords: Molecule fragmentation, Cheminformatics, RDKit, Property prediction, Group contribution method, UNIFAC, Incrementation

*Correspondence: [email protected]
Institute of Thermal Separation Processes, Hamburg University of Technology, Eißendorfer Straße 38, 21073 Hamburg, Germany

© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Introduction
Cheminformatics is a growing field due to increasing computational capabilities and improvements in the accuracy achieved by its predictions. The chemical space is vast, and the number of molecules that can be produced with new and, in some cases, even automated synthesis routes is increasing. However, before investing resources into synthesizing and characterizing molecules, a predictive approach for their properties would help narrow down the possible candidates. In addition, for the application of thermodynamic models or the a priori calculation of thermophysical properties, predictive methods can be helpful and in some cases even necessary. These methods, which relate properties to the molecular structure, are usually named QSPR methods (Quantitative Structure Property Relationship).

One subgroup of these models is the group contribution method. The idea behind this method is to divide the value of a property of the complete molecule into contributions based on the chemical groups or other molecular subunits. Group contribution models have been successfully applied to a wide variety of properties including density [1, 2], critical properties [3–5], enthalpy of vaporization [6], normal boiling points [7, 8], water–octanol partition coefficients [9–11], infinite dilution activity coefficients [12] and many more. For Gibbs excess energy models [13–15] and equations of state [16–19], they also provide an approach that allows the application range to be widened relatively easily to molecules composed of the same chemical groups.

However, in the development and application of these models, a manual mapping of the groups has to be performed in most cases. This can hinder the fast development and testing of different possible group combinations, especially for larger numbers of molecules.

Jochelson [20], in 1968, already described a simple automatic routine for substructure counting. Most of the research since [21–28] has focused more on describing algorithms for substructure search, ring perception and aromaticity perception. In a recent paper, Ertl [29] proposed a new algorithm for automatic chemical group definition based on a large database. Fortunately, most of the current cheminformatics toolkits already include search and perception features, making it possible to create new, advanced fragmentation algorithms that focus on other problems.

One of the free tools offered online for structure analysis is Checkmol [28, 30]. It is an open-source program for finding a defined set of functional groups within a molecular structure. However, it only checks whether a group is present, without counting its occurrences. Przemieniecki [31] developed an implementation of UNIFAC with automatic group fragmentation by means of a non-standardized way of specifying the fragmentation scheme. Some other free web services that allow a complete automatic fragmentation of molecules also exist, including the ones from the companies DDBST GmbH [32] and Xemistry GmbH [33]. In the first case, fragmentation is limited to the schemes supported by the webpage. In the second case, it is possible to provide custom fragmentation rules, allowing for fragmentation using different schemes. However, the terms of use only allow manual use of the website, without the ability to use the results in commercial applications. Furthermore, knowing how the algorithm works would make it possible to debug it, find errors and improve it.

Tools that implement group contribution models like Octopus [34], thermo [35] or UManSysProp [36] would largely benefit from an improved, flexible automated fragmentation algorithm that is based on standardized ways to define the fragmentation scheme and that can handle complex molecules.

The goal of this work is to provide flexible algorithms that only need a simple fragmentation scheme based on the SMARTS language [37], which is easy to use for the rapid development and testing of group contribution methods on larger datasets.

Challenges of automatic fragmentation
Several challenges, such as non-unique group assignment, incomplete group assignment and the composition of the fragmentation scheme itself, can arise when developing an automatic fragmentation algorithm. These are discussed in more detail in this section. The examples described are based on the fragmentation scheme from Table 1.

Non-unique group assignment
For the assignment of the groups, several solutions might be possible. The order in which the different groups are searched has an influence. For example, an ACOH group (hydroxyl bound to an aromatic carbon atom) can be recognized as such or fragmented into an aromatic carbon (AC) and a hydroxyl (OH) group. Furthermore, depending on the order in which the non-overlapping fragmentation is performed on the molecular structure, different results might be obtained. For example, if a molecule is fragmented starting from left to right (Fig. 1a), the result can differ from the one obtained if the molecule is fragmented from right to left (Fig. 1b).

In these cases, the algorithm must either deliver the correct fragmentation as a first solution or find all solutions and then specify how to choose the correct one.

Incomplete group assignment
This case occurs when it is not possible to assign one or more atoms to a specific group. In some cases, the order in which the groups are searched can also lead to this situation. For example, in Fig. 2, if the AC groups (aromatic carbon) are searched first, the remaining chlorine atom cannot be assigned to any other functional group from the fragmentation scheme. In other cases, there will be molecules with atoms or functional groups that are simply not defined in the fragmentation scheme. However, in most cases where the fragmentation is possible, this issue can be avoided if the algorithm specifies the order in which the functional groups are searched.

The fragmentation scheme
Defining the fragmentation scheme is decidedly important for the accuracy of the algorithm. If the groups were defined to target very specific functional groups or to avoid overlapping with other groups, this would minimize non-unique or incomplete group assignments. A lot of time and testing can be invested in developing highly specific patterns for any given group contribution method, such as those already developed for UNIFAC by Salmina et al. [38]. However, if the algorithm includes a way to prioritize the groups from the fragmentation scheme, in most cases the groups do not have to be highly specific, so that more time can be spent on developing different fragmentation schemes instead of refining one specific scheme.

Strategies to overcome the challenges
To overcome the challenges described in the section "Challenges of automatic fragmentation", three features were implemented in this work:

Table 1 Fragmentation scheme developed in this work for the published UNIFAC groups and the respective pattern descriptors used for sorting (column headers: Group information, Descriptors, Number)
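The order dependence described under "Non-unique group assignment" and "Incomplete group assignment" can be illustrated with a minimal sketch. This is not the algorithm developed in this work. It is a plain greedy, non-overlapping assignment written only to show why the search order matters. The function name `greedy_fragment`, the atom numbering, the reduced group lists and the hard-coded match lists are all assumptions made for illustration. In practice the matches would come from a SMARTS substructure search in a cheminformatics toolkit.

```python
from typing import Dict, List, Sequence, Set, Tuple

Match = Tuple[int, ...]  # atom indices covered by one occurrence of a group


def greedy_fragment(
    n_atoms: int,
    matches: Dict[str, List[Match]],
    order: Sequence[str],
) -> Tuple[Dict[str, int], List[int]]:
    """Assign atoms to groups in the given search order without overlaps.

    Returns the group counts and the sorted list of atoms left unassigned.
    """
    assigned: Set[int] = set()
    counts: Dict[str, int] = {}
    for group in order:
        for match in matches.get(group, []):
            # Accept a match only if none of its atoms is already taken.
            if assigned.isdisjoint(match):
                assigned.update(match)
                counts[group] = counts.get(group, 0) + 1
    unassigned = sorted(set(range(n_atoms)) - assigned)
    return counts, unassigned


# Hypothetical phenol: atoms 0-5 are ring carbons (atom 0 bears the OH),
# atom 6 is the oxygen. Match lists are hard-coded assumptions.
phenol = {
    "ACH": [(1,), (2,), (3,), (4,), (5,)],
    "AC": [(0,)],
    "OH": [(6,)],
    "ACOH": [(0, 6)],
}
phenol_small_first = greedy_fragment(7, phenol, ["ACH", "AC", "OH", "ACOH"])
phenol_large_first = greedy_fragment(7, phenol, ["ACOH", "ACH", "AC", "OH"])

# Hypothetical chlorobenzene with a reduced scheme that has no stand-alone
# chlorine group: atoms 0-5 are ring carbons, atom 6 is the chlorine.
chlorobenzene = {
    "ACH": [(1,), (2,), (3,), (4,), (5,)],
    "AC": [(0,)],
    "ACCl": [(0, 6)],
}
chloro_ac_first = greedy_fragment(7, chlorobenzene, ["ACH", "AC", "ACCl"])
chloro_accl_first = greedy_fragment(7, chlorobenzene, ["ACCl", "ACH", "AC"])

print(phenol_small_first, phenol_large_first)
print(chloro_ac_first, chloro_accl_first)
```

With these inputs, searching the small groups first splits phenol into five ACH, one AC and one OH, whereas searching ACOH first yields one ACOH and five ACH. Both assignments are complete, which is the non-unique case. For the chlorobenzene example, searching AC first leaves the chlorine atom unassigned, whereas searching ACCl first assigns every atom, which is the incomplete case.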

