
Predictive Models for Kinetic Parameters of Cycloaddition Reactions

Marta Glavatskikh, Timur Madzhidov, Dragos Horvath, Ramil Nugmanov, Timur Gimadiev, Daria Malakhova, Gilles Marcou, Alexandre Varnek

To cite this version:

Marta Glavatskikh, Timur Madzhidov, Dragos Horvath, Ramil Nugmanov, Timur Gimadiev, et al. Predictive Models for Kinetic Parameters of Cycloaddition Reactions. Molecular Informatics, Wiley-VCH, 2018, 38 (1-2), pp.1800077. 10.1002/minf.201800077. hal-02346844

HAL Id: hal-02346844 https://hal.archives-ouvertes.fr/hal-02346844 Submitted on 5 Feb 2021


Predictive models for kinetic parameters of cycloaddition reactions

Marta Glavatskikh[a,b], Timur Madzhidov[b], Dragos Horvath[a], Ramil Nugmanov[b], Timur Gimadiev[a,b], Daria Malakhova[b], Gilles Marcou[a] and Alexandre Varnek*[a]

[a] Laboratoire de Chémoinformatique, UMR 7140 CNRS, Université de Strasbourg, 1, rue Blaise Pascal, 67000 Strasbourg, France
[b] Laboratory of Chemoinformatics and Molecular Modeling, Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya str. 18, Kazan, Russia

Abstract

This paper reports SVR (Support Vector Regression) and GTM (Generative Topographic Mapping) modeling of three kinetic properties of cycloaddition reactions: rate constant (logk), activation energy (Ea) and pre-exponential factor (logA). A data set of 1849 reactions, comprising (4+2), (3+2) and (2+2) cycloadditions (CA), was studied in different solvents and at different temperatures. The reactions were encoded by ISIDA fragment descriptors generated for the Condensed Graph of Reaction (CGR). For a given reaction, a CGR condenses the structures of all reactants and products into a single molecular graph, described both by conventional chemical bonds and by “dynamical” bonds characterizing chemical transformations. Different scenarios of logk assessment were explored: direct modeling, application of the Arrhenius equation and temperature-scaled GTM landscapes. The logk models with optimal cross-validated statistics (Q2 = 0.78-0.94, RMSE = 0.45-0.86) were challenged to predict rates for an external test set of 200 reactions, comprising both reactions that were not present in the training set and training set transformations performed under different reaction conditions. The models are freely available on our web-server: http://cimm.kpfu.ru/models.

Keywords: cycloaddition reactions, QSPR, Condensed Graph of Reaction, Generative Topographic Mapping

1. Introduction

Chemical reactions are rather complex objects for structure-property modeling because their yield, thermodynamic and kinetic parameters depend not only on the structure of the reactants but, to a great extent, on the experimental conditions. Therefore, this information must be encoded in the molecular descriptor vector that serves as input to machine-learning methods. This paper is devoted to the development of predictive models for three kinetic properties of cycloaddition reactions: the logarithm of the rate constant (logk), the activation energy (Ea) and the logarithm of the pre-exponential factor (logA).

Cycloaddition (CA), a classic reaction in organic chemistry, is an important tool to produce compounds of cyclic architecture with high regio- and stereoselectivity. In early works, frontier molecular orbitals (FMO) [1-5] calculated by quantum mechanics methods for small congeneric series were used to interpret thermodynamic and kinetic properties of cycloaddition. An effort has also been made to build linear correlations with experimentally measured physicochemical parameters of the reagents [6-7]. This, however, largely limits the application of such correlations to already studied reactions. Classical QSPR modeling of chemical reactions was performed on homogeneous series, either varying the reagents under fixed reaction conditions (solvent, temperature) or, inversely, varying the conditions to study a reaction between given reactants [8-11]. In these studies, topological indices [12], quantum-chemical [8-9, 13-14] or mixed [15-17] descriptors of the reagents were used. For a non-exhaustive overview of studies in the field of chemical reactivity, one can refer to the papers by Warr [18] and Baskin et al. [19], or to earlier works [10-11, 20].

In spite of significant progress in the field, few attempts have been made towards a more general QSPR treatment of chemical reactions. Thus, Marcou et al. [21] reported classification models predicting experimental conditions (solvent type, catalyst) of the Michael addition. Predictive models for the rate constants of SN2 [22-23] and E2 [24-25] reactions, explicitly accounting for both the structure of reactants and products and the experimental conditions, have also been reported in our earlier publications. In this work, we employ the approach suggested in [22-24] to build predictive models for three kinetic properties of cycloaddition reactions: logk, Ea and logA. The transformation of reactants into products was encoded by a Condensed Graph of Reaction (CGR) [26-27], which represents a given reaction as a single pseudomolecule (Fig. 1). CGR-based descriptors thus explicitly characterize the reaction center together with its immediate neighborhood. In our previous works, the CGR-based description has been successfully used to predict rate constants of SN2 and E2 reactions [24-25], tautomerization equilibrium constants [28] and optimal reaction conditions [27].

Here, the CGR-based descriptors and the reaction condition descriptors were concatenated, thus forming a descriptor vector for each reaction. These vectors were used as input for model building with the Support Vector Regression (SVR) [29] and Generative Topographic Mapping (GTM) [30] machine-learning methods. GTM, a probabilistic extension of self-organizing maps, is an efficient method for the visualization, analysis and modeling of chemical properties [31-32] or biological activities [33-35], and can be used for both classification and regression tasks. The algorithm converts the N-dimensional data into a 2D map that can be ‘colored’ by different reference properties. Here, the same 2D map built for the ensemble of cycloaddition reactions was used for the analysis of the logk, Ea and logA parameters.

Figure 1. Reaction of cyclization and the corresponding CGR of the transformation. Formed bonds are denoted with circle and the broken ones are crossed.
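The idea of a CGR can be illustrated with a minimal sketch. It is not the CGR Designer implementation: the bond-table representation, the function names and the toy reaction (2-methylbutadiene plus ethylene, with hypothetical atom-map numbers) are assumptions made purely for illustration. Each CGR edge stores the bond order before and after the reaction, and an edge is "dynamic" when the two differ.

```python
# Toy CGR: bonds are keyed by a sorted pair of atom-map numbers.
# Bond order 0 means "no bond". All names here are hypothetical.

def condensed_graph(reactant_bonds, product_bonds):
    """Merge reactant and product bond tables into one CGR edge table.

    Each edge maps to (order_before, order_after).
    """
    cgr = {}
    for pair in set(reactant_bonds) | set(product_bonds):
        cgr[pair] = (reactant_bonds.get(pair, 0), product_bonds.get(pair, 0))
    return cgr


def dynamic_bonds(cgr):
    """Keep only the edges whose bond order changes during the reaction."""
    return {pair: orders for pair, orders in cgr.items() if orders[0] != orders[1]}


# Diene atoms 1-4 (methyl carbon 7 on atom 2), dienophile atoms 5-6.
reactants = {(1, 2): 2, (2, 3): 1, (3, 4): 2, (5, 6): 2, (2, 7): 1}
products = {(1, 2): 1, (2, 3): 2, (3, 4): 1, (5, 6): 1,
            (1, 6): 1, (4, 5): 1, (2, 7): 1}

cgr = condensed_graph(reactants, products)
changed = dynamic_bonds(cgr)
```

In this toy case the two newly formed sigma bonds appear as edges (0, 1), while the bond to the methyl substituent is a conventional edge (1, 1) and is therefore absent from `changed`.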

2. Computational procedure

The modeling workflow is shown in Figure 2. A database of 1849 cycloaddition reactions, for which the logk, Ea and logA values have been measured in different solvents and at different temperatures, was collected from the literature (see 2.1). The data set was divided into training and test sets, the latter composed of 200 reactions picked randomly from the data set. The remaining 1649 reactions constitute the training set, which was used to build three individual models, for logk, Ea and logA prediction, respectively. The best SVR and GTM models obtained were further challenged with the external prediction of logk on the test set of 200 reactions.

Figure 2. Workflow for the modeling of the rate constant (logk), activation energy (Ea) and pre-exponential coefficient (logA) of cycloaddition reactions.

2.1 Data preparation

An initial set of 2551 cycloaddition reactions with available experimental values of the rate constant (logk), 1356 reactions with activation energies (Ea, in kJ/mol) and 1237 with pre-exponential factor (logA) values was collected manually from the PhD theses of Prof. Konovalov’s group at Kazan Federal University, published in 1970-1990. The 2551 reactions contained 806 unique structural transformations. The data set comprised about 85% Diels-Alder (4+2) cycloadditions, about 8% (3+2) dipolar cyclizations and 7% (2+2) cycloadditions. The reactions were carried out in C6H5CH3, CH3COOC2H5, C6H5OCH3, C6H5NO2, CH3CN, C2H5OH, C6H5Cl, THF, CH2Cl2, DMSO, CHCl3, 1,4-dioxane, C5H11OH, C6H5Br, C6H6, CH3COCH3, C6H12 and CH2ClCH2Cl, at temperatures ranging from 273 to 423 K. The dienes occurring most frequently in the Diels-Alder series are condensed aromatic cycles, cyclopentadienones, cyclopentadienes and benzofurans. Among dienophiles, the prevailing structures are maleic acid derivatives, cyanoethylenes, Ph-substituted and benzoquinones.

The data set (reagents and products) underwent a customary standardization protocol using ChemAxon’s Standardizer utility [36]: basic aromatization, isotope removal, and transformation of the NO2-, NO-, N3-, RRSO2- and CN- groups. Cyclization may lead to different stereoisomers, and the available data displayed several examples of logk differences associated with the competing processes, all within 0.01 to 1.0 log units. Based on our previous work [24], we considered such differences too small to be taken into account, and comparable to the interlaboratory errors affecting logk measurements. Thus, for stereospecific reactions, logk was taken as the mean of the reported experimental values. No other duplicates were found in the data set. The 1849 reactions kept after cleaning are characterized by 1849 values of logk, 1356 values of Ea (kJ/mol) and 1237 values of logA, the latter being a strict subset of the 1356 reactions with reported Ea values. The prepared data were divided into training and test sets.

External test set. The test set consisted of 200 reactions randomly picked from the initial data set, to serve as a challenge for logk prediction. It included 57 structurally new transformations that are not present in the training set. The others are reactions seen at the training stage, for which the challenge consists in predicting their parameters under different conditions. Note, furthermore, that this test set includes 26 reactions reported at two different temperatures. The training set hence encompasses the remaining 1649 reactions.

The distribution of the three properties in the training set is given in Figure 3.

Figure 3. Distribution of rate constant, activation energy and pre-exponential factor values in the training set of 1649 cycloaddition reactions.
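The curation steps described above (averaging logk over duplicate records, then drawing a random external test set) can be sketched as follows. This is an illustrative sketch only: the record layout, the grouping key (transformation, solvent, temperature) and all function names are assumptions, and the toy values are made up rather than taken from the data set.

```python
import random
from collections import defaultdict
from statistics import mean


def average_duplicates(records):
    """Average logk over records sharing transformation, solvent and temperature.

    The grouping key is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["transformation"], rec["solvent"], rec["T"])].append(rec["logk"])
    return [
        {"transformation": t, "solvent": s, "T": temp, "logk": mean(values)}
        for (t, s, temp), values in groups.items()
    ]


def random_split(records, n_test, seed=0):
    """Return (training set, test set) with n_test randomly chosen test records."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def new_transformations(train, test):
    """Transformations present in the test set but never seen in training."""
    seen = {rec["transformation"] for rec in train}
    return {rec["transformation"] for rec in test} - seen


# Toy records with made-up values.
records = [
    {"transformation": "r1", "solvent": "toluene", "T": 298.0, "logk": -3.0},
    {"transformation": "r1", "solvent": "toluene", "T": 298.0, "logk": -3.5},
    {"transformation": "r1", "solvent": "toluene", "T": 313.0, "logk": -2.6},
    {"transformation": "r2", "solvent": "benzene", "T": 298.0, "logk": -1.0},
]
merged = average_duplicates(records)
train, test = random_split(merged, n_test=1)
```

A helper such as `new_transformations` makes it possible to separate test reactions that are structurally new from those that only differ in conditions, which is the distinction drawn for the external test set.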

2.2 Descriptors

The reactions were first rendered as Condensed Graphs of Reaction (CGR), created from the reaction RDF files using the in-house CGR Designer tool and stored in a modified SDF format. These files were processed directly by the in-house ISIDA Fragmentor software to generate fragment descriptors. The length of the monitored fragments varied from 2 to 14 for sequences and from 2 to 6 for atom-centered fragments. The following options were also used at choice: charges on atoms (Formal Charge), accounting for the terminal atoms of a fragment exclusively (Atom Pairs), and exploring all possible paths instead of shortest paths only (DoAllWays). An important option regulating the overall number of generated CGR fragments is the exclusive inclusion of ‘dynamic bonds’. When toggled on, this option produces only fragments that contain bonds formed or broken during the reaction (local fragments) and omits ‘generic’ fragments not associated with the reaction center; it can thus be used to describe the local environment of the reaction center exclusively. For this project, CGR fragmentation generated all possible fragments, local and nonlocal. Overall, 728 descriptor sets were generated for the preliminary SVR scanning.

Descriptors of the reaction conditions. Descriptors of reaction conditions included solvent descriptors and temperature. The solvent descriptors considered in this study describe polarity, polarizability, H-acidity and basicity: the Catalán SPP [37], SA [38] and SB [39] constants; the Kamlet–Taft α [40], β [41] and π* [42] constants; four functions of the dielectric constant ε, namely the Born function $f_B = \frac{\varepsilon - 1}{\varepsilon}$, the Kirkwood function $f_K = \frac{\varepsilon - 1}{2\varepsilon + 1}$, $f_1 = \frac{\varepsilon - 1}{\varepsilon + 1}$ and $f_2 = \frac{\varepsilon - 1}{\varepsilon + 2}$; and three functions of the refractive index $n_D^{20}$ (denoted n in the following formulae for the sake of simplicity): $g_1 = \frac{n^2 - 1}{n^2 + 2}$, $g_2 = \frac{n^2 - 1}{2n^2 + 1}$ and $h = \frac{(n^2 - 1)(\varepsilon - 1)}{(2n^2 + 1)(2\varepsilon + 1)}$. The final element of the condition descriptor vector is the reciprocal of the reaction temperature, 1/T, in K⁻¹.
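As an illustration of how the 14 condition descriptors could be assembled, the sketch below computes the seven functions of ε and n and appends the tabulated solvent constants and 1/T. The formulas are the usual Born, Kirkwood and related expressions and are assumed here to match those used in the study; the function names and the numerical inputs (ε = 4.0, n = 1.5, zero placeholders for the Catalán and Kamlet–Taft constants) are purely illustrative.

```python
def solvent_functions(eps, n):
    """Functions of the dielectric constant (eps) and refractive index (n).

    Assumed forms: Born, Kirkwood, f1, f2, g1, g2 and the mixed term h.
    """
    n2 = n * n
    return {
        "f_B": (eps - 1) / eps,
        "f_K": (eps - 1) / (2 * eps + 1),
        "f_1": (eps - 1) / (eps + 1),
        "f_2": (eps - 1) / (eps + 2),
        "g_1": (n2 - 1) / (n2 + 2),
        "g_2": (n2 - 1) / (2 * n2 + 1),
        "h": (n2 - 1) * (eps - 1) / ((2 * n2 + 1) * (2 * eps + 1)),
    }


def condition_vector(catalan, kamlet_taft, eps, n, temperature_k):
    """Concatenate 3 Catalan + 3 Kamlet-Taft constants, 7 functions and 1/T."""
    funcs = solvent_functions(eps, n)
    return list(catalan) + list(kamlet_taft) + list(funcs.values()) + [1.0 / temperature_k]


# Placeholder constants (zeros) and round illustrative values for eps and n.
funcs = solvent_functions(4.0, 1.5)
vector = condition_vector((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 4.0, 1.5, 298.0)
```

The resulting vector has 14 elements, which is consistent with the number of reaction condition descriptors quoted for the selected descriptor space.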

2.3 Building and validation of the models

SVR modeling. SVR models were built and validated using the ε-SVR algorithm implemented in the libSVM package [43]. The modeling was performed using the evolutionary SVR optimizer [44], which performs both descriptor space selection and optimization of the operational parameters of the SVR method (epsilon, kernel type, cost, gamma). The procedure, applied to logk as the modeled property, generated a total of 6000 models. The descriptor set producing the individual SVR model of maximal robustness (estimated by the five-fold cross-validated Q2 value) was based on atom-centered fragments of length 2-3. This descriptor space, encompassing 688 descriptors (674 fragment descriptors and 14 reaction condition descriptors), was further used for SVR and GTM model building and external validation.

GTM modeling. GTM constructs a 2D map reproducing the data distribution of the initial D-dimensional space defined by the descriptors. The map is framed by K nodes forming a perfect square grid. The nodes correspond to normal probability distributions (NPD) centered on the GTM manifold, a flexible 2D envelope embedded into the D-dimensional data cloud. The assignment (mapping) of the nodes to the manifold points is defined by a mapping function $y_d(x; W)$ set up with the help of M radial basis functions (RBFs):

$$y_d(x; W) = \sum_{m=1}^{M} W_{md} \exp\left(-\frac{\lVert x - x_m \rVert^2}{2\sigma^2}\right) \qquad (1)$$

where d goes from 1 to D, W is the $M \times D$ weight matrix connecting the Gaussian RBFs and the data space points, $x_m$ is the center of the m-th RBF in the 2D space, and σ is the width of the RBFs. The nodes framing the GTM map thus correspond to the centers of RBFs positioned on the manifold so as to best approximate the data distribution. The RBFs determine the probabilities of an object being mapped to a certain node. Moreover, each data object has a non-zero probability over all the nodes. Consequently, an object can be characterized by these probabilities, called responsibilities $R_{kn}$, which represent the degree of association between the object n (in the initial D-dimensional data space) and the node k. Details of the GTM algorithm are given in our publications [31, 33-35] and references therein.

GTM property landscapes. The distribution of a property over the chemical space can be visualized as a GTM property landscape, in which the mean property value $\hat{A}_k$ of each node $x_k$ of the grid defined in the latent 2D space is computed as:

$$\hat{A}_k = \frac{\sum_{n=1}^{N} A_n R_{kn}}{\sum_{n=1}^{N} R_{kn}} \qquad (2)$$

where $A_n$ is the property value of the n-th molecule, $R_{kn}$ are the responsibilities of the molecules to be located in the node $x_k$, and N is the number of molecules in the data set. If a property landscape is rendered in 2D, the value $\hat{A}_k$ of the k-th node determines its color, while the transparency depends on its occupancy (more exactly, on the cumulated responsibilities): the more molecules are located near a certain node, the more opaque the color.

The GTM models were built using the evolutionary optimizer [44], which supports the choice of the operational GTM parameters (the number of RBF kernels, the number of grid points, the width factor of the radial basis functions and the regularization coefficient). The descriptor space based on atom-centered fragments of length 2-3, chosen by the preliminary SVR scanning, was considered. The genetic optimization procedure included 3000 generations. Each generated GTM manifold is used as the “support” for all three property landscapes (logk, Ea and logA), which serve as predictive models and undergo five-fold cross-validation. The fitness score of the current GTM manifold (the objective function to be maximized) is the mean value of the scores achieved by the three property landscapes. The most robust GTM model, in the above-mentioned sense of maximal fitness score, was further used for the prediction and visualization of logk, Ea and logA.
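How responsibilities, a property landscape (eq. 2) and a landscape-based prediction fit together can be sketched in a few lines. This is a minimal illustration, not the authors' GTM code: it assumes the standard GTM formulation with equal node priors and a common inverse variance `beta`, takes the squared distances between an object and the node images in data space as given, and predicts a new object's property as the responsibility-weighted mean of the node values, which is one common choice that the text does not spell out. All function names are hypothetical.

```python
import math


def responsibilities(dist2_to_nodes, beta):
    """Posterior probability of each node for one object.

    dist2_to_nodes: squared distances from the object to each node image.
    beta: inverse variance shared by all node Gaussians (assumption).
    """
    d_min = min(dist2_to_nodes)  # shift exponents for numerical stability
    weights = [math.exp(-0.5 * beta * (d - d_min)) for d in dist2_to_nodes]
    total = sum(weights)
    return [w / total for w in weights]


def property_landscape(resp, values):
    """Eq. 2: responsibility-weighted mean property value per node.

    resp[n][k] is the responsibility of node k for object n.
    """
    landscape = []
    for k in range(len(resp[0])):
        numerator = sum(r[k] * a for r, a in zip(resp, values))
        denominator = sum(r[k] for r in resp)
        landscape.append(numerator / denominator if denominator > 0 else float("nan"))
    return landscape


def predict(resp_new, landscape):
    """Assumed prediction rule: weighted mean of the node property values."""
    return sum(r * a for r, a in zip(resp_new, landscape))


# Three toy objects on a two-node map, with made-up property values.
resp = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
values = [2.0, 4.0, 6.0]
landscape = property_landscape(resp, values)
prediction = predict(responsibilities([0.0, 0.0], beta=1.0), landscape)
```

Because the same responsibilities are reused for every property, one manifold can support several landscapes, which mirrors the use of a single map for logk, Ea and logA.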

Validation of the models. The predictive performance of the GTM and SVR models was estimated by the root mean squared error (RMSE) and the determination coefficient, calculated either in five-fold cross-validation repeated 10 times after data reshuffling (10x5-CV; Q2) or on the external test set (R2):

$$Q^2\,(R^2) = 1 - \frac{\sum_{i=1}^{n} \left(Y_{exp,i} - Y_{pred,i}\right)^2}{\sum_{i=1}^{n} \left(Y_{exp,i} - \langle Y \rangle_{exp}\right)^2} \qquad (3)$$

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} \left(Y_{exp,i} - Y_{pred,i}\right)^2}{n}} \qquad (4)$$

Here $Y_{exp,i}$ and $Y_{pred,i}$ are, respectively, the experimental and predicted values of logk, n is the number of data points, and $\langle Y \rangle_{exp}$ is the mean of the experimental values.

Applicability Domain of the models. The “Fragment Control” [45] procedure was chosen to delimit the general Applicability Domain (AD) [46] of the approach, which corresponds to the bounding box encompassing the training set in fragment descriptor space. For each fragment associated with an element of the descriptor vector, a candidate molecule must contain neither more occurrences than the richest representative of the training set nor fewer than the poorest one. The presence of fragments never seen among the training compounds therefore counts as an AD violation. Since the SVR and GTM models are based on the same fragment descriptor space, their ADs are identical.
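The two statistics of eqs. 3 and 4 and the bounding-box test behind Fragment Control can be written compactly. The sketch below is an illustration under simple assumptions: descriptor vectors are plain lists of fragment counts, and the function names are hypothetical rather than taken from the ISIDA tools.

```python
import math


def q2(y_exp, y_pred):
    """Determination coefficient of eq. 3 (Q2 in cross-validation, R2 externally)."""
    mean_exp = sum(y_exp) / len(y_exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot


def rmse(y_exp, y_pred):
    """Root mean squared error of eq. 4."""
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(y_exp, y_pred)) / len(y_exp))


def fragment_bounds(training_vectors):
    """Per-fragment minimum and maximum counts over the training set."""
    columns = list(zip(*training_vectors))
    return [min(c) for c in columns], [max(c) for c in columns]


def in_domain(vector, bounds):
    """True if every fragment count lies inside the training bounding box."""
    lows, highs = bounds
    return all(lo <= v <= hi for v, lo, hi in zip(vector, lows, highs))


# Toy fragment-count vectors (two fragments, two training objects).
bounds = fragment_bounds([[0, 1], [2, 1]])
```

A fragment absent from the whole training set has bounds (0, 0), so any candidate containing it fails the test, which reproduces the rule that unseen fragments count as AD violations.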

2.4 Different scenarios of logk assessment for the test set reactions

The logk values for the external set of 200 reactions were obtained using three different approaches: direct assessment, Arrhenius-based assessment and temperature-scaled GTM landscapes. The first corresponds to the application of the SVR or GTM predictive models for logk to the test set reactions. In the Arrhenius-based assessment, the logk values are calculated with the Arrhenius equation (eq. 5) from the Ea and logA values predicted by the corresponding SVR or GTM models. In the Ea and logA modeling, the temperature descriptor was excluded.

$$\log k = \log A - \frac{E_a}{RT} \qquad (5)$$

The rate constant may be very sensitive to temperature. This effect might not be properly accounted for by GTM-based models, where temperature is just one of hundreds of dimensions controlling the manifold geometry. Therefore, instead of using temperature as a descriptor, a temperature-scaled approach is proposed. It implies the construction of a series of logk GTM landscapes, each corresponding to a specific temperature range. Scaling of logk measured at temperature $T_1$ to temperature $T_2$ can be performed according to the van’t Hoff equation:

$$\log\frac{k_2}{k_1} = \frac{T_2 - T_1}{10}\,\log\gamma$$

The temperature coefficient $\log\gamma$ was computed as the average of the $\log\gamma_i$ (i = 1, 2, … n) values calculated for n series of reactions, each studied at several temperatures. Overall, 358 reactions in the training set were measured at 2-6 different temperatures. The smallest temperature difference in a series was 10 K and the largest was 110 K. The temperature coefficient $\log\gamma$ varied from 0.08 to 0.59; its average value, 0.26, was used in our calculations.

The entire temperature range (273-423 K) was divided into 15 subranges of 10 K each, e.g. from 273 K to 283 K. For each subrange, a logk landscape was constructed from all 1649 reactions, with each logk value recalculated to the median temperature of the subrange (e.g., 278 K for the range 273-283 K) using the van’t Hoff equation. Notice that the logk error related to a temperature deviation of 5 K (0.15-0.3 logk units) is far smaller than the related RMSE values (Table 2). At the validation stage, the program selects the logk landscape corresponding to the temperature of the external reaction. In summary, fifteen temperature-scaled logk landscapes were calculated from experimental logk values using the van’t Hoff relationship with $\log\gamma = 0.26$ and the median temperature of each subrange. These landscapes were used for the validation of GTM-based models on reactions from the external test set.
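The two temperature treatments can be made concrete with a short sketch. Several points are assumptions rather than statements from the text: logk and logA are taken to be decimal logarithms, so the Arrhenius term is divided by ln 10 (eq. 5 as printed shows no such factor); Ea is in kJ/mol; and a temperature on a subrange boundary is assigned to the upper subrange. The function names are hypothetical.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def logk_arrhenius(log_a, ea_kj_mol, temperature_k):
    """Arrhenius estimate, assuming decimal logs: logk = logA - Ea / (ln10 * R * T)."""
    return log_a - ea_kj_mol / (math.log(10) * R_KJ * temperature_k)


def scale_logk(logk_t1, t1, t2, log_gamma=0.26):
    """Van't Hoff rule: shift logk by log_gamma for every 10 K of difference."""
    return logk_t1 + (t2 - t1) / 10.0 * log_gamma


def subrange_median(temperature_k, t_min=273.0, width=10.0, n_ranges=15):
    """Median temperature of the 10 K subrange that contains temperature_k."""
    index = int((temperature_k - t_min) // width)
    index = max(0, min(n_ranges - 1, index))  # clamp to the 15 available subranges
    return t_min + width * index + width / 2.0


# A made-up measurement at 300 K, rescaled to the median of its own subrange.
target = subrange_median(300.0)
rescaled = scale_logk(-3.0, 300.0, target)
```

With the default coefficient of 0.26, a shift of 5 K changes logk by 0.13, so selecting the landscape by subrange introduces only a small rescaling error for a reaction measured inside that subrange.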

3. Results and discussions

3.1 GTM visualization of the training set

The Generative Topographic Map in Fig. 4 shows well-separated zones populated by the (4+2), (3+2) and (2+2) cycloaddition reactions. The ratio of the sizes of these zones corresponds to the class occurrences in the data set: the major class (4+2) occupies the largest area, followed by the (3+2) zone, which in turn is slightly bigger than the zone populated by (2+2) reactions. The distinct separation of the cycloaddition classes on the map indicates that the chosen descriptor space is adequate, i.e. able to discriminate correctly between reactions of different types.

Figure 4. GTM class landscape displaying separation of three different types of cycloaddition in the training set of 1649 reactions.

Another GTM visualization technique is the property landscape, which characterizes the distribution of a given property over the training set projection. Figure 5 shows GTM property landscapes for logk, Ea and logA, where the gradient from blue to red corresponds to the property varying from low to high values (each property having its own range). Simultaneous analysis of all three landscapes helps to understand the decomposition of logk into the Arrhenius equation parameters logA and Ea (eq. 5). According to collision theory, A is the frequency of collisions in the “correct orientation”. Thus, logA values might be related to the steric interaction of the reactants in a pre-reaction complex. The activation barrier Ea accounts for electronic and steric effects in the transition state.

Figure 5. GTM property landscapes for the rate constant (logk), activation energy (Ea) and the pre-exponential coefficient (logA) for the training set of 1649 reactions of cycloaddition. The encircled areas correspond to different combinations of logA and Ea contributions into logk.

Typically, high reaction rates correspond to large logA and small Ea (zone A on the maps), whereas low logk values result from a combination of small logA and large Ea (zone B). However, in some cases low activation barriers do not compensate for a very low logA (zone C) or, on the contrary, a large logA is offset by high activation barriers (zone D). Intuitively, example C in Table 1 is a clear case of a sterically crowded transition state (low logA), not compensated by the otherwise favorable electronic interactions between an electron-depleted diene and a neutral dienophile. Example D, by contrast, features a less favorable interaction between an electron-depleted diene and an electron-depleted dienophile, without involving any particular steric crowding.

Table 1. Examples of reactions projected into the zones corresponding to Figure 5.

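Zone   Example of reaction                logk (solvent, temperature)
A      [reaction scheme not recovered]    2.1 (toluene, 320 K)
B      [reaction scheme not recovered]    -4.7 (benzene, 403 K)
C      [reaction scheme not recovered]    -4.2 (dichloroethane, 313 K)
D      [reaction scheme not recovered]    -3.55 (benzene, 403 K)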

3.2 Cross validation of the SVR and GTM models

The five-fold cross-validation results for the three properties, logk, Ea (kJ/mol) and logA, for the SVR and GTM models are shown in Table 2. The SVR modeling involved the construction of three separate models, one for each property, whereas for the GTM the properties were assessed by three property landscapes based on the same manifold (see 2.3). The SVR models outperform the GTM model, which is unsurprising given that the molecular descriptor set used throughout this work was the one best suited for SVR modeling. However, the overall statistics for the GTM are reasonable for all three properties.

Table 2. Predictive performance of the SVR models, built separately for each property, and of a single GTM model, common to all three properties, in 5-fold cross-validation on the training set of 1649 cycloaddition reactions. The descriptor space is based on atom-centered fragments of length 2 to 3.

Property   SVR Q2   SVR RMSE   GTM Q2   GTM RMSE
logk       0.94     0.45       0.78     0.86
Ea         0.92     3.65       0.80     5.92
logA       0.89     0.36       0.62     0.67

3.3 External validation of SVR and GTM models

The external test set has been predicted in three different ways (see 2.4):

- (i) direct logk prediction with SVR and GTM,
- (ii) prediction of logk using the Arrhenius equation (eq. 5) and the predicted values of Ea and logA,
- (iii) prediction of logk using temperature-scaled GTM landscapes.

The results are given in Table 3 and Figure 6. As in cross-validation, the SVR models perform better than the GTM ones. The models with direct logk assessment lead to better results than those involving the Arrhenius equation, which might be explained both by the smaller training sets for logA and Ea modeling and by the fact that some test set reactions are dissimilar to those in the training set. Indeed, the “Fragment Control” [45] applicability domain discards 36 out of 200 reactions, which leads to a significant improvement of the models’ performance (Table 3).

Table 3. Validation of four different methods of logk assessment on the test set of 200 cycloaddition reactions. 164 reactions are retained by the Fragment Control applicability domain.

Method of logk assessment   SVR R2   SVR RMSE   GTM R2   GTM RMSE
Direct                      0.92     0.50       0.74     0.90
Direct with AD              0.96     0.35       0.84     0.74
Arrhenius-based             0.72     0.93       0.51     1.23
Arrhenius-based with AD     0.93     0.51       0.83     0.79
Temperature-scaled          -        -          0.76     0.88

Figure 6. Predicted vs experimental logk values obtained with direct logk prediction (left panel, “Direct logk assessment”) or with Arrhenius-based recalculation (right panel, “Arrhenius-based logk assessment”) for the external test set of 200 cycloaddition reactions. The statistics are given in Table 3.

Reaction rate assessment with temperature-scaled logk landscapes shows a minor improvement over the direct GTM logk prediction (Table 3). This trend was confirmed in logk predictions for the specific subset of 13 reactions measured at two different temperatures (see Fig. 7).

Figure 7. Comparison of the predictive accuracy of the direct GTM logk calculation (left panel, “GTM-based direct logk assessment”: R2 = 0.90, RMSE = 0.48) and of the GTM temperature-scaled landscapes (right panel, “GTM-based temperature-scaled approach”: R2 = 0.94, RMSE = 0.3) for the subset of 26 cycloaddition reactions measured at different temperatures.

Subset of new transformations. Since the external test set was chosen randomly from the initial data set, it includes several reactions present in the training set but measured under different conditions. Therefore, the predictive performance for novel transformations and for known transformations under different conditions can be assessed separately. Among the 200 reactions of the external test set, only 57 transformations did not occur in the training set; 8 of them fell outside the AD. Table 4 shows that only direct logk assessment with SVR models leads to reasonable agreement between predicted and experimental values for the 49 reactions retained by the AD. Indeed, the model performs similarly to cross-validation (R2 = 0.87, RMSE = 0.59 log units). On the other hand, the Arrhenius-based SVR assessment and the GTM-based models yield poor predictions, with R2 < 0.5.

Table 4. Comparison of the direct, Arrhenius-based and temperature-scaled logk assessment for the subset of 57 unique structural transformations. Fragment Control applicability domain retained 49 reactions.

Method of logk prediction   SVR R2   SVR RMSE   GTM R2   GTM RMSE
Direct                      0.72     0.80       0.38     1.21
Direct (AD)                 0.87     0.59       0.60     1.01
Arrhenius-based             0.33     1.26       -0.48    1.87
Temperature-scaled GTM      -        -          0.40     1.15

Figure 8. Performance of SVR and GTM models for the subset of unique structural transformations (57 reactions) obtained with the direct logk prediction method (panel title: “Direct logk calculation”). The statistics are given in Table 4.

4. Conclusion

In this paper we report the first SVR and GTM models of kinetic parameters (reaction rate, pre-exponential factor in the Arrhenius equation and activation energy) built on a large and structurally diverse data set of cycloaddition reactions. A reasonable accuracy of reaction rate assessment was achieved both in cross-validation and on the external test set (RMSE = 0.45 and 0.35 log units, respectively). Different reaction classes are well separated on the 2-dimensional Generative Topographic Map. The GTM-based property landscapes provide insight into the reaction chemical space and make it possible to identify different types of reactions with respect to the interplay between the Arrhenius parameters logA and Ea. The results demonstrate the efficiency of the CGR-based fragment descriptors for chemical structures and a proper choice of the parameters encoding experimental conditions. The SVR models for the rate constant, activation energy and pre-exponential coefficient are freely available on our web-server: http://cimm.kpfu.ru/models.

Acknowledgement. This study was supported by the Russian Science Foundation, grant No 14-43-00024. MG thanks the French Embassy in Russia for the PhD fellowship.

References

[1] S. D. Kahn, C. F. Pau, L. E. Overman, W. J. Hehre, J. Am. Chem. Soc. 1986, 108, 7381-7396.
[2] S. Kahn, C. Pau, L. Overman, W. J. Hehre, J. Am. Chem. Soc. 1986, 108, 7381-7396.
[3] J. Chandrasekhar, S. Shariffskul, W. L. Jorgensen, The Journal of Physical Chemistry B 2002, 106, 8078-8085.
[4] O. Acevedo, W. L. Jorgensen, Journal of Chemical Theory and Computation 2007, 3, 1412-1419.
[5] O. Acevedo, W. L. Jorgensen, J. D. Evanseck, Journal of Chemical Theory and Computation 2007, 3, 132-138.
[6] V. D. Kiselev, A. N. Ustiugov, I. P. Breus, A. I. Konovalov, Doklady Akademii Nauk SSSR 1977, 234, 1089-1092.
[7] V. D. Kiselev, A. I. Konovalov, J. Phys. Org. Chem. 2009, 22, 466-483.
[8] G. Szabo, L. Nyulaszi, Struct. Chem. 2017, 28, 333-343.
[9] X. Long, J. Niu, Chemosphere 2007, 67, 2028-2034.
[10] T. C. Ho, Catalysis Reviews 2008, 50, 287-378.
[11] A. Fernández-Ramos, J. A. Miller, S. J. Klippenstein, D. G. Truhlar, Chemical Reviews 2006, 106, 4518-4584.
[12] A. A. Toropov, V. O. Kudyshkin, N. L. Voropaeva, I. N. Ruban, S. S. Rashidova, J. Struct. Chem. 2004, 45, 945-950.
[13] X. Yu, B. Yi, X. Wang, European Polymer Journal 2008, 44, 3997-4001.
[14] J. Gasteiger, U. Hondelmann, P. Röse, W. Witzenbichler, Journal of the Chemical Society, Perkin Transactions 2 1995, 193-204.
[15] X. H. Jin, S. Peldszus, P. M. Huck, Chemosphere 2015, 138, 1-9.
[16] D. A. Latino, J. Aires-de-Sousa, J. Chem. Inf. Model. 2009, 49, 1839-1846.
[17] J. A. Morrill, J. H. Biggs, C. N. Bowman, J. W. Stansbury, Journal of Molecular Graphics and Modelling 2011, 29, 763-772.
[18] W. A. Warr, Mol. Inform. 2014, 33, 469-476.
[19] I. I. Baskin, T. I. Madzhidov, I. S. Antipin, A. A. Varnek, Russian Chemical Reviews 2017, 86, 1127.
[20] Y. E. Zevatskii, D. V. Samoilov, Russ. J. Org. Chem. 2007, 43, 483-500.
[21] G. Marcou, J. Aires de Sousa, D. A. Latino, A. de Luca, D. Horvath, V. Rietsch, A. Varnek, J. Chem. Inf. Model. 2015, 55, 239-250.
[22] A. Kravtsov, P. Karpov, I. Baskin, V. Palyulin, N. Zefirov, in Doklady Chemistry, Vol. 440, Springer, 2011, pp. 299-301.
[23] T. Madzhidov, P. Polishchuk, R. Nugmanov, A. Bodrov, A. Lin, I. Baskin, A. Varnek, I. Antipin, Russ. J. Org. Chem. 2014, 50, 459-463.
[24] T. I. Madzhidov, A. V. Bodrov, T. R. Gimadiev, R. I. Nugmanov, I. S. Antipin, A. A. Varnek, J. Struct. Chem. 2015, 56, 1227-1234.
[25] P. Polishchuk, T. Madzhidov, T. Gimadiev, A. Bodrov, R. Nugmanov, A. Varnek, J. Comput.-Aided Mol. Des. 2017, 31, 829-839.
[26] F. Hoonakker, N. Lachiche, A. Varnek, A. Wagner, Int. J. Artif. Intell. Tools 2011, 20, 253-270.
[27] T. I. Madzhidov, P. G. Polishchuk, R. I. Nugmanov, A. V. Bodrov, A. I. Lin, I. I. Baskin, A. A. Varnek, I. S. Antipin, Russ. J. Org. Chem. 2014, 50, 459-463.
[28] T. R. Gimadiev, T. I. Madzhidov, R. I. Nugmanov, I. I. Baskin, I. S. Antipin, A. Varnek, J. Comput.-Aided Mol. Des. 2018.
[29] H. Drucker, C. J. Burges, L. Kaufman, A. J. Smola, V. Vapnik, in Advances in Neural Information Processing Systems, 1997, pp. 155-161.
[30] C. M. Bishop, M. Svensen, C. K. I. Williams, Neural Comput. 1998, 10, 215-234.
[31] H. A. Gaspar, G. Marcou, D. Horvath, A. Arault, S. Lozano, P. Vayer, A. Varnek, J. Chem. Inf. Model. 2013, 53, 3318-3325.
[32] N. Kireeva, S. L. Kuznetsov, A. Y. Tsivadze, Ind. Eng. Chem. Res. 2012, 51, 14337-14343.
[33] H. A. Gaspar, I. I. Baskin, G. Marcou, D. Horvath, A. Varnek, Mol. Inform. 2015, 34, 348-356.
[34] P. Sidorov, H. Gaspar, G. Marcou, A. Varnek, D. Horvath, J. Comput.-Aided Mol. Des. 2015, 29, 1087-1108.
[35] T. R. Gimadiev, T. I. Madzhidov, G. Marcou, A. Varnek, BioNanoScience 2016, 6, 464-472.
[36] Standardizer, 6.1.5, ChemAxon (http://www.chemaxon.com), 2013.
[37] J. Catalán, V. López, P. Pérez, R. Martin-Villamil, J.-G. Rodríguez, Liebigs Annalen 1995, 1995, 241-252.
[38] J. Catalán, C. Díaz, Liebigs Annalen 1997, 1997, 1941-1949.
[39] J. Catalán, C. Díaz, V. López, P. Pérez, J.-L. G. De Paz, J. G. Rodríguez, Liebigs Annalen 1996, 1996, 1785-1794.
[40] R. W. Taft, M. J. Kamlet, J. Am. Chem. Soc. 1976, 98, 2886-2894.
[41] M. J. Kamlet, R. W. Taft, J. Am. Chem. Soc. 1976, 98, 377-383.
[42] M. J. Kamlet, J. L. Abboud, R. W. Taft, J. Am. Chem. Soc. 1977, 99, 6027-6038.
[43] C.-C. Chang, C.-J. Lin, ACM Trans. Intell. Syst. Technol. 2011, 2, 1-27.
[44] D. Horvath, J. Brown, G. Marcou, A. Varnek, Challenges 2014, 5, 450.
[45] A. Varnek, D. Fourches, D. Horvath, O. Klimchuk, C. Gaudin, P. Vayer, V. Solov'ev, F. Hoonakker, I. V. Tetko, G. Marcou, Curr. Comput.-Aided Drug Des. 2008, 4, 191-198.
[46] M. Mathea, W. Klingspohn, K. Baumann, Mol. Inform. 2016, 35, 160-180.