MVE385: Project course in mathematical and statistical modelling

Modeling Quantitative Structure-Activity Relationships for G-protein Coupled Receptor Ligands

Project report

Students

Adrià Amell Tosas, Richard Martin, Sebastian Oleszko
[email protected], [email protected], [email protected]

Partners

Peder Svensson, Mattias Sundén, Fredrik Wallner, Erik Lorentzen

Modeling Quantitative Structure-Activity Relationships for G-protein Coupled Receptor Ligands

2020-12-18

Background

IRLAB Therapeutics is a biotech company engaged in the discovery and development of novel pharmaceuticals to treat disorders of the brain, currently focusing on Parkinson's disease. The research is based on a so-called phenotypic screening approach, which means that the effects of new chemical compounds are evaluated on a system level, to ensure that both direct and indirect effects on different neurotransmitters and brain pathways are captured. Evaluation of potential target receptor interactions is performed in silico as well as in different in vitro systems. In the design of new compounds, and in understanding how different structural elements of the compounds affect biological effects, quantitative structure-activity relationship (QSAR) modelling is a key component. Our current drug discovery projects focus mainly on G-protein coupled receptors (GPCRs), the main class of molecular targets for CNS pharmaceuticals.

Project description

In this project, students will access large datasets covering chemical descriptors of a series of compounds, combined with data on receptor interactions. The focus will be on GPCRs of the monoamine family, but other target proteins will also be included. Chemical descriptors, including e.g. physico-chemical property estimates, shape, lipophilicity, ionization states, molecular graphs and fingerprints, are obtained from public sources and generated in-house at IRLAB. The task is to find statistical QSAR models that describe how biological activity, in this case receptor affinities, relates to chemical properties of the ligands. Such models can be used to guide the design of novel compounds. Linear, principal component-based models as well as non-linear methods will be investigated.

Key points to consider in the project will be:

- Properties of the chemical descriptor space
- Data distribution – transforms?
- Different model types: linear (e.g. PLS, MR) and non-linear (e.g. SVM, neural networks, random forest)
- Choice of dependent variables for the models – some relations to the independent variables (chemical descriptors) will be common to several dependent variables (receptor affinities), while some are unique to specific Y variables. Depending on the statistical modelling approach, separate models for each dependent variable of interest, or a multiple Y block.
- Diagnostics – how to assess the predictive capability of models

IRLAB will provide chemical descriptors and biological activity data for one or more series of compounds of interest for the pharmacological modulation of GPCRs, focusing on monoamine targets. Smartr will provide supervision regarding statistical models.

1 Abstract
G protein-coupled receptors (GPCRs) are the main class of molecular targets for central nervous system pharmaceuticals. Dopamine receptors are GPCRs that are activated by the neurotransmitter dopamine; they are central players in brain function and are involved in, for example, motor control. Quantitative structure-activity relationship (QSAR) models are theoretical models that relate a quantitative description of chemical structure to a physical property or a biological activity. They are key in the design of new drugs and in the understanding of how different structural elements of the compounds affect biological activity.

This project presents a number of QSAR models that can help describe the relationships between series of organic molecules and the dopamine receptors D2 or D3 by predicting the corresponding inhibitory constant Ki. The molecules of study and Ki values were obtained from ChEMBL, a public chemical database of manually curated bioactive molecules with drug-like properties, while the descriptors were generated computationally. Linear-based models, a genetic algorithm and tree-ensemble methods are motivated and evaluated on the constructed dataset. The tree-ensemble methods performed best, closely followed by the genetic algorithm. Finally, recommendations for QSAR modelling in general are discussed.

Contents

1 Abstract
2 Introduction
3 Construction of the data set to model
  3.1 Selection of compounds and inhibitory constants
  3.2 Activity cliffs
4 Modelling
  4.1 Motivation for methods
    4.1.1 Linear-based models
    4.1.2 Selection by a Genetic algorithm
    4.1.3 Ensemble methods
  4.2 Implementation details
    4.2.1 Linear-based models
    4.2.2 Variable selection
    4.2.3 Genetic algorithm
    4.2.4 Ensemble methods
5 Results
  5.1 Variable selection with linear-based models
  5.2 Evaluating the linear-based models
  5.3 Subset and model refinement with Genetic algorithm
  5.4 Ensemble models
6 Discussion
  6.1 Data discussions
  6.2 Applicability domain
  6.3 Modelling conclusions
    6.3.1 Computational limitations
    6.3.2 Feature subsets and importance
    6.3.3 Model interpretability
    6.3.4 Comparison of model candidates
  6.4 Future recommendations
7 Acknowledgements
A Result tables
  A.1 Variable selection chosen descriptors
  A.2 Specifications of the linear-based models
  A.3 Final descriptor sets of the genetic algorithm
B Ensemble Parameter Tuning
C Code
  C.1 Variable selection scripts
  C.2 Linear-based model scripts
  C.3 Genetic algorithm script

2 Introduction
Human cells are constantly communicating with each other and the surrounding environment. This requires a molecular mechanism for the transmission of information over the cell plasma membrane. G protein-coupled receptors (GPCRs) are proteins located at the cell plasma membrane that provide this molecular mechanism, transferring signals upon binding of a ligand. This makes them the main class of molecular targets for central nervous system pharmaceuticals.

Dopamine receptors are GPCRs. They are activated by the neurotransmitter dopamine, are central players in brain function and are involved in, for example, motor control. This project covered the construction and modelling of a data set of chemical compounds whose targets are the dopamine receptors D2 or D3, also referred to as D2R and D3R, respectively, given that IRLAB Therapeutics, one of the partners, currently focuses on Parkinson's disease. The purpose of the modelling is to predict an activity, in particular the inhibitory constant Ki, for different chemical compounds with dopamine receptors D2 and D3 as targets. This is usually done as a stage in drug discovery projects, where the properties of the compounds and their relation to receptor interactions are a key component. The kind of compound-receptor models described in this report are called quantitative structure-activity relationship (QSAR) models [1], and are important in the development of new pharmaceuticals.

In this report, Section 3 presents the data source and the process followed to construct the data set of compounds and activities to model. Section 4 motivates and presents the methods employed, which are regression models: linear models, a genetic algorithm and tree-based ensemble methods. Modelling results are found in Section 5. Finally, a discussion on the construction of the data set, the applicability domain and interpretability of the models, as well as future recommendations, is found in Section 6. Appendix B explains technical details of the implementations and the hyperparameter tuning of the models. All the code and data sets used are provided as supplementary material.

3 Construction of the data set to model
In the initial stage of the project, the dataset of chemical compounds was collected, analyzed and processed in order to learn about its properties, make it suitable for modelling and help decide on appropriate model choices. This section describes the resources used to obtain the data set on which the modelling is based, as well as the steps performed in compiling and cleaning the data.

3.1 Selection of compounds and inhibitory constants
ChEMBL is a large, open-access, manually curated database of bioactive molecules with drug-like properties, maintained by the European Bioinformatics Institute of the European Molecular Biology Laboratory. The information in the database about small molecules and their biological activity is extracted from medicinal chemistry journals and integrated with data on approved drugs and clinical development candidates, as well as with bioactivity data from other databases, allowing users to benefit from an even larger body of interaction data [2].

This database can be accessed through a web user interface at https://www.ebi.ac.uk/chembl/, web services or a number of download formats. This project used a local PostgreSQL copy of ChEMBL release 27, the latest release at the time, to access the data. The targets of interest are the dopamine receptors D2 and D3 for the Homo sapiens organism, which were identified with the ChEMBL IDs CHEMBL217 and CHEMBL234, respectively.

All the compounds having activities for these target IDs were queried. Each record contains the compound ID and a standardised activity value, type, units and relation type, i.e. whether the given value is equal to the measured value or a bound of a range, as well as the molecule canonical SMILES (a description of the molecule as an ASCII string) and metadata such as an identifier for the assay, which can be linked to the data source. A variety of activity types were observed; some can be easily identified, such as IC50 (half maximal inhibitory concentration), while others may require reviewing the data sources to be understood. In this context, we found the types pKi, Delta pKi, logKi, KiH, Ki(app), Ratio Ki, KiL, Log Ki, and Ki that can relate directly to the inhibitory constant Ki. We filtered out every type other than Ki because (i) it is not exactly clear how some types relate to Ki, (ii) the base of the logarithm in logKi or Log Ki is not specified, and (iii) all the pKi values are NaN.
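As an illustration of this querying step, the following is a minimal sketch against a local ChEMBL PostgreSQL instance. The table and column names follow the public ChEMBL schema, but the database name and the exact join path are assumptions and may need adjusting for a particular installation.

    import psycopg2

    QUERY = """
    SELECT md.chembl_id,
           cs.canonical_smiles,
           act.standard_type,
           act.standard_relation,
           act.standard_value,
           act.standard_units
    FROM activities act
    JOIN assays a               ON act.assay_id = a.assay_id
    JOIN target_dictionary td   ON a.tid = td.tid
    JOIN molecule_dictionary md ON act.molregno = md.molregno
    JOIN compound_structures cs ON md.molregno = cs.molregno
    WHERE td.chembl_id = %s;  -- 'CHEMBL217' (D2R) or 'CHEMBL234' (D3R)
    """

    with psycopg2.connect(dbname="chembl_27") as conn:  # assumed database name
        with conn.cursor() as cur:
            cur.execute(QUERY, ("CHEMBL217",))
            records = cur.fetchall()

    # Keep only records whose activity type is exactly 'Ki', as described above.
    ki_records = [r for r in records if r[2] == "Ki"]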

Any compound whose weight was not between 120 and 620 Da, as well as any salt, was removed; the motivation was to keep only drug-like compounds. A two-dimensional set of 207 molecular descriptors for these compounds was computed using the Molecular Operating Environment 2019.01 from Chemical Computing Group [3], where a Merck molecular force field charge model was used to compute the charge descriptors.
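The filtering itself can be approximated with RDKit, which is used elsewhere in the project. The descriptors were computed with MOE, so this sketch only reproduces the weight and salt filters, and the multi-fragment test for salts is an assumption about how salts were identified.

    from rdkit import Chem
    from rdkit.Chem import Descriptors

    def keep_compound(smiles: str) -> bool:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return False                    # unparsable SMILES
        if "." in smiles:
            return False                    # multiple fragments: likely a salt
        return 120.0 <= Descriptors.MolWt(mol) <= 620.0

    # smiles_list is assumed to hold the canonical SMILES from the query above
    filtered = [s for s in smiles_list if keep_compound(s)]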

The computed molecular descriptors were combined with the obtained Ki values, resulting in an initial data set. This data set had to be curated for accurate modelling [4]. It is not trivial to identify and exclude records with inconsistencies or systematic errors. Any duplicated record (completely identical records, including Ki values) was removed, and it was ensured that all Ki values were non-negative. The manual curation of the ChEMBL database was assumed to minimise any other inconsistency, and no further analysis was done in this regard.

In order to prepare the data for regression models, records without an '=' relation were filtered out. This relation type establishes that the provided Ki value is equal to the measured value and not a bound of a range, which would be indicated by a relation type such as '>=' and could be useful in classification models. It is recommended that the compounds in the data set are unique, because otherwise the models may have artificially skewed predictivity [5, 6]. Some compounds were observed to be duplicated at this stage (Figure 1). A handful of cases were manually examined, revealing that the Ki values for the same compound can be anywhere from very close to very different in relative terms.


Figure 1: Frequency of compound duplicates for each receptor before averaging pKi values. Unique compounds for D2R (D3R): 5886 (4224), where 1110 (446) have at least one duplicate.

It was decided to assign a single Ki value to each duplicated compound, as long as the spread of its Ki values was reasonable. To quantify the spread, the relative standard deviation (RSD),

\[ \mathrm{RSD} = \frac{\sigma}{\mu} , \tag{3.1} \]

was used. The spread was computed on the pKi values, defined as

\[ \mathrm{p}K_i = -\log_{10} K_i , \quad K_i \text{ in molar units.} \tag{3.2} \]

All compounds with a pKi spread of less than 10% were kept, and the corresponding pKi values were averaged to produce a single value. This defined the final size of each data set (Figure 2). The use of pKi instead of Ki is motivated later. In this step it was necessary to drop any metadata. Other approaches, such as selecting one value at random, could be used to preserve the metadata were it to be used in any downstream analysis, but this was not in the project scope.
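A minimal sketch of this deduplication step, assuming a pandas DataFrame `df` with one row per (compound_id, Ki) record and Ki reported in nM; the column names and unit conversion are illustrative, and in practice ChEMBL's units column should drive the conversion.

    import numpy as np
    import pandas as pd

    df["pKi"] = -np.log10(df["Ki"] * 1e-9)    # nM -> molar, then eq. (3.2)

    def rsd(x: pd.Series) -> float:
        # RSD of eq. (3.1); a single record has zero spread by definition.
        return 0.0 if len(x) < 2 else x.std() / x.mean()

    spread = df.groupby("compound_id")["pKi"].transform(rsd)
    kept = df[spread < 0.10]                   # 10% spread threshold
    averaged = kept.groupby("compound_id", as_index=False)["pKi"].mean()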

[Figure 2: bar chart of record counts for D2R and D3R at each filtering stage: querying all compounds with D2R/D3R as targets; keeping type = Ki; removing duplicated rows and ensuring Ki >= 0; keeping relation = '='; compounds remaining after computing the descriptors; and averaged pKi values (train + test).]

Figure 2: Number of Ki observations in each data set. The resulting data set to model has 4938/549 and 3604/401 train/test observations for D2R and D3R, respectively, with 207 descriptors.

Finally, the data was divided into a 90%-10% split for training and testing purposes, respectively, defining the final data set to model for both D2R and D3R. The distributions of the pKi values were similar in the train and test sets (Figure 3). Note that the Ki distribution, in contrast, is exponential-like and presents potential outliers; using pKi not only makes the modelling more tractable but also softens the effect of any possible outlier. Moreover, since many descriptors are known to depend on the molecular weight, the molecular weight distribution was also visualised, and the test/train split obtained presented similar distributions for this descriptor as well.
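A sketch of the split, assuming `X` holds the descriptor matrix and `y` the pKi values; the fixed random_state is an assumption made here for reproducibility.

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, random_state=42
    )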


Figure 3: Distributions of the pKi values in the train and test sets for D2R and D3R.

3.2 Activity cliffs
Similar molecules do not necessarily have similar activities; in fact, evidence supports this statement [7]. An activity cliff is observed when two similar compounds differ largely in their activity. Perfectly valid data points in cliff regions may appear to be outliers.

While activity cliffs can be of interest, they can also make the modelling process more complicated. To begin with, the region surrounding a cliff needs to contain many compounds. Moreover, activity cliffs can be difficult to detect because (i) compounds can be mislabelled, (ii) the measure of similarity is not unique, (iii) the difference in activity values can be measured in different ways, and (iv) there can be different interpretations of how to connect the similarity and the difference in activity values for a pair of compounds.

Prior to beginning any computational modelling of a data set, all activity cliffs should be detected, verified and treated [5]. It has been claimed that fewer activity cliffs are detected when the activity of interest is Ki [8]; therefore, few activity cliffs were expected, especially after pruning the data set in the last cleaning step.

The Structure-Activity Landscape Index (SALI) [9] is defined in this project for two compounds A and B as

\[ \mathrm{SALI} = \frac{\lvert \mathrm{p}K_i(A) - \mathrm{p}K_i(B) \rvert}{1 - \mathrm{sim}(A, B)} \tag{3.3} \]

where sim is a measure of similarity between A and B. In [9] it is argued that the choice of similarity metric does not impact SALI negatively. On the other hand, selecting the descriptors describing the molecules is crucial: compounds that are nearest neighbours in one space need not be so in another. Here the Tanimoto similarity metric was evaluated on a number of molecular fingerprints: ECFP4, ECFP6, FCFP4, FCFP6 and the RDKit fingerprint, all as binary vectors of length 2048 to ensure enough sparsity, as well as the 166 public MACCS keys. These fingerprints were computed using RDKit [10].
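A sketch of the similarity computation for one pair of compounds with RDKit; the SMILES strings and pKi values are purely illustrative, and ECFP4 is approximated here by a Morgan fingerprint of radius 2.

    from rdkit import Chem
    from rdkit.Chem import AllChem, MACCSkeys, DataStructs

    mol_a = Chem.MolFromSmiles("CCN(CC)CCNC(=O)c1ccc(N)cc1")    # illustrative
    mol_b = Chem.MolFromSmiles("CCN(CC)CCNC(=O)c1ccc(N)cc1OC")  # illustrative

    fp_a = AllChem.GetMorganFingerprintAsBitVect(mol_a, 2, nBits=2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(mol_b, 2, nBits=2048)
    # Alternatives used in the report: Chem.RDKFingerprint(mol, fpSize=2048)
    # and MACCSkeys.GenMACCSKeys(mol) for the 166 public MACCS keys.

    sim = DataStructs.TanimotoSimilarity(fp_a, fp_b)

    pki_a, pki_b = 7.2, 8.9                   # illustrative activity values
    sali = abs(pki_a - pki_b) / (1.0 - sim)   # eq. (3.3); diverges if sim == 1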

A SALI matrix contains at position (i, j) the SALI for compounds i and j. The columns are sorted by increasing pKi, and the rows by decreasing pKi. Visualising these matrices as heatmaps, where the upper-left pixel is the first column and row, should reveal an activity cliff as a bright pixel—depending on the colour map—around the anti-diagonal.

Given the dimensions of the data set, it is infeasible to visually detect single bright pixels in the 5487 × 5487 (D2R) and 4005 × 4005 (D3R) SALI matrices as in [9], but a rough idea of the structure-activity landscape can still be obtained. Compounds with similar structures present similar activities, and the more dissimilar they are, the more this difference is highlighted (Figure 4).

In order to assess the presence of activity cliffs, the entries between the main anti-diagonal and the fifth anti-diagonal, that is, up to five positions of difference in the ranking of activity values, were studied, and any SALI above a threshold was set to 1. Fifteen compounds were activity cliff candidates for D2R when evaluated with different fingerprints, but only two were agreed on as activity cliff candidates by more than one fingerprint for D3R. These compounds presented extremal activity values and it was decided to keep them, mainly because they were found upon revision of the code before submitting this report and all the modelling had already included them, but also because it is reasonable to expect that extremal values will not be modelled accurately.

It is worth mentioning that several pairs of compounds presented infinite SALI values, which was also found upon code revision before submission. One reason, though not necessarily the only one, is that different compounds presented the same fingerprint, producing a numerical zero in the denominator. An exhaustive study of this cause was not carried out, as it was considered out of the scope of the project given the time constraints, but it could have been another source of activity cliffs.

Figure 4: Heatmaps of the SALI matrices for the ECFP4, ECFP6 and 166 public MACCS keys fingerprints computed with (3.3). The FCFP4, FCFP6 and RDKit fingerprint matrices (not shown) present similar images. The D2R matrix is 5487 × 5487 and the D3R matrix 4005 × 4005, so it is infeasible to visualise all pixels. Nevertheless, they provide a qualitative understanding of the structure-activity landscape.

4 Modelling
The purpose of a statistical model can be predictive, explanatory or both. In drug discovery and QSAR modelling, it is typically beneficial to find out what properties of the investigated compounds are relevant for the response in the target. Therefore, interpretable models are preferred over black-box models. This limits the modelling options and puts restrictions on the data, since interpretable models often come with various assumptions about the data. It is also preferable to have a small model when possible, since this adds to the interpretability and makes the model easier to use.

The modelling process can be divided into three stages: variable selection, linear-based models and ensemble models. The first stage aimed to reduce the set of descriptors and find the relevant ones. The second stage aimed to find a model as simple and interpretable as possible. Finally, the third stage aimed to produce a model with as high a predictive capability as possible, independently of the first two stages.

4.1 Motivation for methods
The pre-processed dataset showed that a large portion of the descriptors presented distributions resembling a normal distribution. Furthermore, there were groups of correlated descriptors present, as illustrated in Figure 5. This could cause stability problems for traditional models based on linear regression.

The pruned dataset consisted of 207 descriptors, which is too many to interpret easily. For D2R, five of those 207 showed no variance over the compounds chosen and were therefore dropped, since they would not be useful; for D3R, four were dropped on the same grounds. In order to filter out as many descriptors as possible without losing predictive power, some variable selection techniques were applied. Initially, the set of descriptors was split into three different sets: one including all descriptors, called alldata; one including only continuous descriptors and discrete descriptors that could be approximated as continuous, called cont; and a third including only discrete and binary descriptors, called cat. Since predictive modelling is much simpler with a one-dimensional response, the two targets D2 and D3 were treated separately throughout the whole project. The data distributions for the two targets were very similar, and therefore only data and models regarding D2 were treated initially, since the D3 case could be assumed to have very similar properties. For the final models, the D3 case was also included.

Figure 5: Heatmap of the pairwise linear correlations among the descriptors. The horizontal and vertical yellow and blue lines indicate highly correlated descriptors.

4.1.1 Linear-based models
Partial Least Squares (PLS) [11] uses a projection similar to Principal Component Analysis, but one that also includes the target, to reduce the dimensionality before regressing the data. It is not sensitive to multicollinear data and was therefore tried as one of the techniques.

Support Vector Regression (SVR) [12] is another method; it selects a subset of support vectors with which to model the target. With a linear kernel it can perform variable selection in this way, but it can be sensitive to multicollinearity. Using a non-linear kernel in the regression alleviates this sensitivity; however, it does not yield linear regression coefficients in the way the linear kernel does, and the model is therefore less interpretable.

The Lasso [13] and Group Lasso [13] models use L1-regularisation on the linear regression, that is, a penalty on the absolute value of each descriptor coefficient, which effectively cancels out variables deemed irrelevant. The difference between them is that Group Lasso considers whole groups of variables instead of single variables, which can be very useful for categorical data. There are 51 discrete descriptors, binary descriptors included, in the dataset, so Group Lasso had the potential to be useful. Both models are likely sensitive to multicollinearity, but if the truly significant descriptors are sparse they can work very well for variable selection. Whether this assumption is fulfilled is not trivial to establish, especially without chemical knowledge; the easiest way to find out is to try the models and analyse the results. Elastic Net (EN) models [13] are much like the Lasso, but with an additional quadratic penalty term on the regression coefficients. This makes them more flexible than Lasso regression, but they do not perform explicit variable selection.

4.1.2 Selection by a Genetic algorithm
The simple models described above may come up with subsets of important descriptors, but may not agree in all situations because of their different strengths and weaknesses. Simply trying out all possible subsets of the 202 descriptors is not feasible, but by using the subsets acquired by the method of Section 4.2.2 as starting conditions, a genetic algorithm can combine and evolve them into a better subset. The method is inspired by Darwinian evolution combined with DNA replication and operates on a population of descriptor subsets; a thorough introduction is given in [14]. The advantages of this method are that no assumptions are made on the model or the descriptors, and that it generalises well. In particular, the genetic algorithm can be used together with any model of choice, and will always perform variable selection. It is by design guaranteed to find solutions at least as good as the best input set, and very likely much better ones. The downside is the computational load, since the model should be re-tuned over its parameter space as often as possible, which amounts to hundreds of times over a reasonable number of iterations. In addition, the genetic algorithm itself has a number of hyperparameters that need tuning.

4.1.3 Ensemble methods
Tree-based ensemble methods have previously been successfully applied to QSAR modelling [15], primarily in the form of Random Forest [16] and Gradient Boosting [17]. Both methods work by constructing several decision trees and combining the results to obtain a final model. The main difference lies in the dependence between trees: Random Forest has independent trees that all contribute to the final decision, while boosting makes each tree depend on the previous one, with the objective of improving predictions in a stepwise process.

Traditionally, Random Forest has long been one of the most common methods in QSAR modelling due to its good predictivity, few adjustable parameters and ease of use [17]. However, it has been shown that XGBoost [18], an implementation of Gradient Boosting, achieves higher predictive ability while also having important advantages such as training speed [17]. One possible limitation of XGBoost is its number of adjustable parameters; nevertheless, in QSAR modelling the method has been shown not to be very sensitive to changes in parameters [17]. Thus, hyperparameter tuning for a single dataset/domain, as in the studied case, is straightforward and not too computationally demanding.

Additionally, XGBoost uses a sparsity-aware approach to making decision splits [17]. This can be very beneficial in QSAR modelling, especially when using molecular fingerprints as features in the model. To evaluate this approach to QSAR modelling, fingerprints were constructed for each compound in the dataset using the RDKit fingerprint [10] and the method was evaluated on the expanded dataset.

A measure of feature importance is built into both methods. The decision trees make splits based on improving predictions; thus, the features most often picked for splits should be the most important and descriptive of the data. There is, however, a difference in how the two methods handle the importance of correlated features. Due to the random and independent nature of Random Forest, correlated features have similar chances of being picked, hence splitting the importance between them. In boosting, once a feature is picked, a correlated feature should not be picked over independent ones.

4.2 Implementation details
The models were implemented in Python using the scikit-learn framework for machine learning and statistical modelling [19]. Parameter tuning was needed for all models, since each had at least one hyperparameter. All model training was done with 5-fold cross-validation on the training set and evaluated on the test set, both described in Section 3.1. The adjusted R2 score was chosen as the evaluation metric for the parameter searches, while the final models were evaluated on the test set using the R2 and RMSE measures. When training models on continuous data, standard-normal scaling was applied to the descriptor values. For all modelling, the pKi values defined in Section 3.1 were used as the response variable.
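As a sketch of this evaluation setup: scikit-learn has no built-in adjusted R2, so a custom scorer can be passed to the parameter search. The model, grid values and variable names below are illustrative.

    import numpy as np
    from sklearn.metrics import r2_score
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVR

    def adjusted_r2(estimator, X, y):
        # Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1).
        n, p = X.shape
        r2 = r2_score(y, estimator.predict(X))
        return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

    param_grid = {"C": [0.5, 1.0, 1.5], "gamma": [0.01, 0.02, 0.03]}
    search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                          scoring=adjusted_r2, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)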

4.2.1 Linear-based models
The training and evaluation of the linear-based models was based on the principles outlined in Section 4.2. For each dataset that a model was trained and evaluated on, an analysis of possible outliers was done, by considering the standardised Pearson residuals and a normal Q-Q plot, to see whether any data points should be removed for that model and dataset. Each model was then evaluated by its R2 and RMSE scores, along with a visual representation of its regression. The models and other tools were taken from scikit-learn [19]. The code for all evaluation scripts is presented in Appendix C.2. A brief explanation of each model follows, where X is the input matrix of descriptor values, y is the target output and β are the model weights trained in the fitting process.

Lasso
The Lasso model tries to minimise the function

\[ \frac{1}{2 n_{\mathrm{samples}}} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 . \tag{4.1} \]

Lasso has one hyperparameter, α, which controls the strength of the regularisation. It is a real number, most likely between 0 and 1.

Elastic Net
The Elastic Net is the Lasso with an extra squared penalty term. The objective function to minimise is

\[ \frac{1}{2 n_{\mathrm{samples}}} \lVert y - X\beta \rVert_2^2 + \alpha \, \ell_{1\mathrm{ratio}} \lVert \beta \rVert_1 + \frac{\alpha (1 - \ell_{1\mathrm{ratio}})}{2} \lVert \beta \rVert_2^2 . \tag{4.2} \]

It therefore has an additional hyperparameter ℓ1ratio that regulates the balance between ℓ1 and ℓ2 penalisation: ℓ1ratio = 0 means only the ℓ2 penalty (also called Ridge regression), while ℓ1ratio = 1 is equivalent to the Lasso.

Group Lasso
The Group Lasso model is a modification of the Lasso and its objective is given by

\[ \frac{1}{2} \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_{\mathrm{group}} \sum_{k=1}^{K} \lVert B_k \rVert_2 \tag{4.3} \]

where B_k holds the coefficients for group k. Its hyperparameters are λ1 for the ℓ1-regularisation and λgroup for the groupwise regularisation. It was implemented via the FISTA algorithm [20]. Both λ1 and λgroup are positive real numbers, between 0 and 1 in the majority of cases.

Partial Least Squares
PLS produces n orthogonal directions

\[ z_n = \sum_{j=1}^{p} \hat{\varphi}_{nj} \, x_j^{(n-1)}, \quad \text{with the projections } \hat{\varphi}_{nj} = \left\langle x_j^{(n-1)}, y \right\rangle , \tag{4.4} \]

and then derives the regression coefficients as

\[ \hat{\beta}_n = \langle z_n, y \rangle / \langle z_n, z_n \rangle . \tag{4.5} \]

For each such component k, PLS works by maximising

\[ \mathrm{corr}(X_k u, y_k v) \cdot \mathrm{std}(X_k u) \cdot \mathrm{std}(y_k v), \quad \text{s.t. } |u| = 1 , \tag{4.6} \]

with respect to the weights u and v, where corr and std denote the correlation and standard deviation. The number of components n is therefore the only hyperparameter to tune, and its range is an integer between 1 and the total number of variables. It was implemented using the PLS2 algorithm [21].

Support Vector Regression
SVR tries to find a hyperplane that separates the data as well as possible. The optimisation problem is defined as

\[
\begin{aligned}
\min_{w, b, \zeta, \zeta^*} \quad & \tfrac{1}{2} w^T w + C \sum_{i=1}^{n} \left( \zeta_i + \zeta_i^* \right) \\
\text{subject to} \quad & y_i - w^T \phi(x_i) - b \le \varepsilon + \zeta_i \\
& w^T \phi(x_i) + b - y_i \le \varepsilon + \zeta_i^* \\
& \zeta_i, \zeta_i^* \ge 0, \quad i = 1, \dots, n .
\end{aligned} \tag{4.7}
\]

C is the (inverse) strength of the regularisation. ε is the tube around the hyperplane within which no penalty is given. ζi and ζi* are the penalties for points lying on the upper or lower side of the ε-tube. φ(x) is a mapping such that the kernel K(xj, xk) = ⟨φ(xj), φ(xk)⟩. The linear implementation uses the linear kernel K_linear with φ(x) = x; this kernel was used in the variable selection process because it produces regression coefficients β that make a certain variable selection method possible. For the final evaluation of the SVR models several kernels were tried, but eventually the RBF kernel was used. Denoting the support vectors that define the hyperplane by x′, the RBF kernel is given by

\[ K_{\mathrm{RBF}}(x, x') = \exp\left( -\gamma \lVert x - x' \rVert^2 \right) \tag{4.8} \]

and has its own hyperparameter γ, which controls the influence of single training samples. C and γ are real numbers between 0 and ∞; ε is a small number ≥ 0.

4.2.2 Variable selection
Four models were used for variable selection: Lasso, Group Lasso, Linear SVR and PLS. These models were fit according to the scheme described in Section 4.2. All implementations were taken from scikit-learn [19], except Group Lasso, which scikit-learn does not provide; an independent implementation was used instead [22].

Both Lasso and Group Lasso perform variable selection automatically, and the chosen descriptors were extracted directly from the trained models. The variable selection with PLS and SVR was done in a similar way but with a slight difference: since these models do not explicitly assign zero values to coefficients, the choice of descriptors had to be made manually.
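For the automatic case, the extraction amounts to reading the nonzero coefficients of the fitted model; a minimal sketch, with an illustrative α and assumed scaled training data:

    import numpy as np
    from sklearn.linear_model import Lasso

    lasso = Lasso(alpha=0.01)                 # illustrative regularisation
    lasso.fit(X_train_scaled, y_train)

    selected = np.flatnonzero(lasso.coef_)    # indices of nonzero coefficients
    chosen_names = [feature_names[i] for i in selected]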

In the SVR model, after the model was fit with the best parameters found, the resulting model coefficients were sorted in order of descending absolute value, and this order was taken as the descriptor significance. The model was then re-fit using different numbers of descriptors: first only the most significant descriptor, then the two most significant, and so on, until the model with the full number of descriptors had been fit. For each of these fits the adjusted R2 was taken as the performance measure, and finally the descriptor set that maximised the adjusted R2 was chosen as the best set. This method is commonly referred to as the Recursive Feature Elimination algorithm [23].
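scikit-learn ships a ready-made version of this idea in sklearn.feature_selection.RFE, sketched below as a stand-in for the manual loop. Note one difference: the procedure above ranks once and grows a prefix, whereas RFE re-ranks after each elimination. The settings are illustrative.

    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVR

    svr = LinearSVR(C=1.0, max_iter=10000)             # illustrative settings
    selector = RFE(svr, n_features_to_select=63, step=1)
    selector.fit(X_train_scaled, y_train)

    chosen_mask = selector.support_    # boolean mask of kept descriptors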

The PLS model selection was done by first tuning a model over its parameter space using cross-validation on the training set and then considering the loading weights of that model. For each descriptor i and PLS component k in the weight matrix, the corresponding loading weights w_{i,k} were summed over the components to form the descriptor weights w_i. Let w denote the vector of descriptor weights. The median of w and its interquartile range IQR(w) were computed, and a threshold T_w for deciding whether a descriptor should be included or excluded was determined as

\[ T_w = \frac{\operatorname{median}(w)}{\mathrm{IQR}(w)} \tag{4.9} \]

so as to include all descriptors with |w_i| ≥ T_w. This method was proposed by Shao et al. [24].
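A sketch of this rule with scikit-learn's PLSRegression, whose x_weights_ attribute holds the loading weights w_{i,k}; the number of components is illustrative, and the absolute value of the median is taken here, which the description above leaves implicit:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    pls = PLSRegression(n_components=10)        # n tuned by cross-validation
    pls.fit(X_train_scaled, y_train)

    w = pls.x_weights_.sum(axis=1)              # sum over components -> w_i
    iqr = np.percentile(w, 75) - np.percentile(w, 25)
    threshold = abs(np.median(w)) / iqr         # eq. (4.9)
    selected = np.abs(w) >= threshold           # descriptors to keep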

The code for these procedures can be found in Appendix C.1.

4.2.3 Genetic algorithm
The concept of the genetic algorithm explained in Section 4.1 was implemented in Python from scratch. The initial population is formed by taking the subsets acquired by the linear-based models in Table 2 and adding a number of randomly chosen subsets. These subsets are represented in the algorithm as binary "chromosomes", all of length 202, where a 1 at an index indicates that the descriptor corresponding to that index is active.

An iterative process then starts, where each iteration is called a generation and consists of the following steps. First, all subsets (chromosomes) in the population are evaluated by scoring a model fit to each set; in order to punish large models, the adjusted R2 is taken as the fitness. Based on these fitness scores, a new population is drawn with replacement from the existing one, with probability proportional to the fitness score of each chromosome. The new population is then subjected to sexual replication through a crossover operator that, with a certain probability, cuts pairs of chromosomes in two at a random index and pastes the ends together with each other. Finally, each chromosome is mutated randomly by flipping its bits with a small probability. The loop is then complete and the new population is evaluated. In each iteration, the best chromosome found so far is kept, so that the evolution never goes backwards. This procedure continues until the best model fit stops improving, and the corresponding descriptor set is taken as optimal for the current model.

The genetic algorithm has three hyperparameters: population size N, crossover probability pcross and mutation rate rmut. A larger population size results in a more diverse solution space and prevents “inbreeding”, which in turn helps prevent premature convergence. A low crossover probability decreases the rate at which subsets spread across the population and also helps prevent premature convergence to suboptimal solutions. The mutation rate is a constant that is used to form the probability of mutating each descriptor in a subset. It introduces new descriptor sets to the population.

Apart from these genetic parameters, the algorithm also has a predictive model with its own parameters. The model was chosen before the start of the algorithm and was re-tuned with respect to its parameters according to the following scheme: the model is initially tuned to the best subset in the population of the first generation; from then on, every tenth generation the algorithm checks whether the currently best subset has changed and, if so, re-tunes the model to it. The algorithm thus alternates between tuning the descriptor subset to the model and tuning the model to the descriptor subset. The hyperparameter table for the model consisted of an adaptive grid: if the tuning process had previously chosen a marginal value for any parameter, that parameter's value range was shifted logarithmically to centre the previously chosen value in the new range. That way the tuning process only has to search a small space around the previously optimal point. The code for the algorithm is presented in Appendix C.3; a compact sketch follows.
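The sketch below compresses the generation loop into a single function. It assumes a callable fitness(mask) that fits the chosen model on the masked descriptors and returns its adjusted R2; the model re-tuning every tenth generation is omitted for brevity.

    import numpy as np

    rng = np.random.default_rng(0)
    N, L = 50, 202                 # population size, chromosome length
    P_CROSS, R_MUT = 0.8, 1.0      # crossover probability, mutation rate constant

    def evolve(fitness, generations=100):
        population = rng.random((N, L)) < 0.5        # random initial chromosomes
        best_mask, best_fit = None, -np.inf
        for _ in range(generations):
            scores = np.array([fitness(mask) for mask in population])
            if scores.max() > best_fit:
                best_fit = scores.max()
                best_mask = population[scores.argmax()].copy()
            # Selection: draw with replacement, proportionally to fitness
            # (shifted to be positive, since adjusted R^2 can be negative).
            probs = scores - scores.min() + 1e-9
            population = population[rng.choice(N, size=N, p=probs / probs.sum())]
            # Crossover: cut a pair at a random index and swap the tails.
            for i in range(0, N - 1, 2):
                if rng.random() < P_CROSS:
                    cut = rng.integers(1, L)
                    tail = population[i, cut:].copy()
                    population[i, cut:] = population[i + 1, cut:]
                    population[i + 1, cut:] = tail
            # Mutation: flip each bit with probability R_MUT / L.
            population ^= rng.random((N, L)) < (R_MUT / L)
            population[0] = best_mask    # elitism: evolution never goes backwards
        return best_mask, best_fit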

4.2.4 Ensemble methods
Random Forest regression was implemented with scikit-learn [19], and the scikit-learn wrapper interface for XGBoost [18] was used. This allowed full usage of scikit-learn's evaluation functionality, such as cross-validation and scoring functions.

Three sets of descriptors were used with these methods: the first was the set of molecular descriptors; the second, RDKit fingerprints generated with the Python library RDKit [10]; and the third, a combined dataset of both the molecular and fingerprint descriptors. The fingerprints were generated with the default settings, resulting in a size of 2048 bits.

Parameter tuning of the ensemble methods was performed with the Python library Hyperopt [25]. The tuning process was split into three segments, tuning the tree-related parameters, the regularisation parameters and the learning rate separately; the algorithm is explained in further detail in Appendix B, and a sketch is given after Table 1. For XGBoost, this resulted in a significant improvement (∼10%) over the default settings. However, the tuning method did not result in any improvement for Random Forest. Previous research on QSAR modelling has concluded that Random Forest does not overfit, so increasing the number of estimators only penalises computation time [16]. It was also found that this was the only significant tuning parameter [16]; therefore, the number of estimators was set to a high value (n_estimators = 500). The parameters for both methods are shown in Table 1.

Table 1: Chosen parameters for the ensemble methods. The parameters for XGBoost were tuned following the algorithm in Appendix B. Random Forest used default settings except an increase in n_estimators.

Parameter            Random Forest   XGBoost
max_depth            ∞               23
min_child_weight^a   -               30
gamma^b              0               0.13
colsample_bytree^b   1.0             0.58
lambda^a             -               0.55
alpha^a              -               1.4
n_estimators         500             417
learning_rate^a      -               0.025

^a XGBoost only. ^b XGBoost name; the corresponding Random Forest parameter has a different name.
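As referenced above, here is a sketch of one tuning segment with Hyperopt's TPE optimiser; the search ranges, the fixed values and max_evals are illustrative, and only the tree-related stage is shown.

    from hyperopt import fmin, tpe, hp, Trials
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBRegressor

    space = {
        "max_depth": hp.quniform("max_depth", 3, 25, 1),
        "min_child_weight": hp.quniform("min_child_weight", 1, 50, 1),
    }

    def objective(params):
        model = XGBRegressor(
            max_depth=int(params["max_depth"]),
            min_child_weight=int(params["min_child_weight"]),
            n_estimators=400, learning_rate=0.05,   # fixed during this stage
        )
        score = cross_val_score(model, X_train, y_train,
                                scoring="r2", cv=5).mean()
        return -score                                # fmin minimises

    best = fmin(objective, space, algo=tpe.suggest,
                max_evals=50, trials=Trials())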

5 Results

5.1 Variable selection with linear-based models
The variable selection results using the simpler models for the D2 target are presented in Table 2. Note that most scores are too low to be satisfactory, while the selected subsets tend to be large. For the full set of variables chosen by each model and dataset, see Appendix A.1. The models do not agree fully on which descriptors to select, but there is some correlation. Figure 6 illustrates this by plotting the number of descriptors that the models agree on to a certain degree (50%, 75% and 100%). This agreement score was used to extract new subsets of descriptors, of high agreement score, with which to continue the analyses. Table 3 lists the descriptors that have 100% agreement among the model selection processes over the three datasets. The continuous dataset proved to be the easiest of the three for most models to handle.

Table 2: Results from the variable selection method using linear-based models for the D2 target. The models and scores are defined in Section 4.2, while the datasets are defined in Section 4.1. The size is the number of descriptors chosen by the model according to the processes outlined in Section 4.2.

Model          Dataset  Test R2  Test RMSE  Size
Group Lasso    alldata   0.006    1.005      74
Group Lasso    cat      -0.017    1.017      42
Group Lasso    cont     -0.075    1.046      49
Lasso          alldata   0.396    0.961      86
Lasso          cat       0.272    1.007      33
Lasso          cont      0.387    0.965      77
Linear SVR     alldata   0.405    0.958      91
Linear SVR     cat       0.280    1.004      24
Linear SVR     cont      0.399    0.960      63
PLS            alldata   0.385    0.960      24
PLS            cat       0.275    1.005      55
PLS            cont      0.388    0.964      24

Figure 6: Histogram over how much the D2 models agree on selecting descriptors on the three datasets. The x-axis shows the level of agreement (50%, 75% and 100%) and the y-axis shows the number of descriptors for that level.

5.2 Evaluating the linear-based models
SVR, PLS, Lasso, Group Lasso and EN models, as defined in Section 4.1, were fit on different subsets of the descriptors as chosen by the different variable selection techniques in Table 2. The full list of descriptors can be found in Appendix A.1. All models and subsets were chosen somewhat subjectively. The models were cross-validated as described in Section 4.2 and evaluated on the test set by their R2 and RMSE scores. The results for the D2 target are shown in Table 4.

The results in Table 4 show that many models and subsets are either bad in themselves or bad in combination. The SVR model with radial basis function (RBF) kernel seems to be the best choice in most cases. The (cont, linear_svr) subset has 63 descriptors, and the SVR model reaches R2 = 0.524 on that set.

Table 3: The names of the descriptors, with the size of each descriptor set in parentheses, that the linear-based models include with 100% agreement.

Dataset: alldata (41)
Weight-mol, apol, a_donacc, BCUT_PEOE_1, BCUT_PEOE_2, BCUT_SLOGP_3, BCUT_SMR_1, b_ar, b_count, b_rotN, chi0v_C, chi0_C, chi1, GCUT_PEOE_0, GCUT_PEOE_3, GCUT_SLOGP_1, GCUT_SLOGP_2, GCUT_SMR_1, h_log_dbo, h_pKb, Kier1, KierFlex, logP(o/w), opr_brigid, PC-, PEOE_VSA_FPPOS, petitjean, petitjeanSC, Q_PC+, Q_PC-, Q_RPC-, Q_VSA_FHYD, Q_VSA_FPNEG, Q_VSA_FPOL, Q_VSA_FPOS, RPC-, SMR_VSA0, VAdjMa, VDistMa, vsa_hyd, Weight

Dataset: cont (56)
Weight-mol, apol, a_acc, a_count, a_heavy, a_nC, a_nH, BCUT_PEOE_1, BCUT_PEOE_2, b_count, b_heavy, chi0, chi0v, chi0v_C, chi0_C, chi1, chi1v, chi1v_C, chi1_C, diameter, GCUT_SLOGP_2, h_logS, h_mr, Kier1, KierA1, KierA2, KierA3, PC+, PEOE_VSA+3, PEOE_VSA_POS, Q_PC+, Q_RPC+, Q_RPC-, Q_VSA_FHYD, Q_VSA_FPNEG, Q_VSA_FPOL, Q_VSA_HYD, Q_VSA_POL, Q_VSA_POS, RPC+, RPC-, SlogP_VSA7, SlogP_VSA8, SMR, SMR_VSA0, SMR_VSA5, VAdjEq, VAdjMa, VDistMa, vdw_area, vdw_vol, vsa_other, vsa_pol, Weight, weinerPath, zagreb

Dataset: cat (11)
a_aro, a_count, a_don, a_heavy, a_nI, a_nS, b_count, b_double, lip_acc, lip_druglike, opr_violation

Table 4: Results from evaluating the linear-based models on different subsets of the descriptors for D2. The Subset column denotes the model used to produce the subset in cases where the value is a model name and it denotes an agreement level used to produce the subset in cases where the value is a rational number. The parameter sets were found by cross-validation over a subjectively chosen parameter space. SVR with RBF-kernel stands out as the best choice regardless of subset.

Model          Dataset  Subset      Subset size  Test R2  Test RMSE
SVR 1          alldata  None        202           0.529   0.693
SVR 2          alldata  pls         24            0.38    0.795
PLS 1          alldata  None        202           0.278   0.858
Lasso 1        alldata  None        202           0.252   0.873
PLS 2          alldata  pls         24            0.109   0.953
SVR 3          alldata  0.75        20            0.139   0.937
Elastic net 1  alldata  pls         24           -0.007   1.013
Lasso 2        cat      None        51            0.117   0.948
Group Lasso 1  cat      gl          42           -0.075   1.046
SVR 4          cont     linear_svr  63            0.524   0.696
SVR 5          cont     lasso       77            0.461   0.741
SVR 6          cont     None        180           0.512   0.705
SVR 7          cont     0.75        21            0.172   0.919
SVR 8          cont     1           6             0.271   0.861
PLS 3          cont     None        180           0.27    0.862
Lasso 3        cont     None        180           0.241   0.88
Elastic net 2  cont     None        180           0.237   0.881
Elastic net 3  cont     0.75        21            0.05    0.983

These model fitting procedures were not repeated for the D3 case, because tests showed that the differences in how the models perform on the two data sets are very small. The results of this section therefore serve as guidelines for the D3 case as well.

5.3 Subset and model refinement with Genetic algorithm
The genetic algorithm finds a subset of descriptors based on a model that is fit and tuned to that subset, and therefore also finds a (locally) optimal model. The best such subsets and their respective models found for the D2 and D3 cases are presented in Table 5. The best model in terms of R2 and RMSE was the SVR with RBF kernel. Since this non-linear kernel maps to an infinite-dimensional space, it is not possible to obtain a feature importance as with the linear kernel. For the full set of selected descriptors for each model, refer to Appendix A.1. These results show that variable selection with a genetic algorithm works better than the previous methods of Section 5.1. They also show that the SVR with RBF kernel performs better when applied to the GA subsets than by itself. PLS was also tried as a model for the GA, but it did not perform well. An unexpected result was the large descriptor subsets chosen by the algorithm. This could either be a preference of the SVR model, which may need many descriptors to function well, or it could be that a large number of descriptors actually do explain the receptor affinity well.

Table 5: The best results found from applying the Genetic algorithm to the problem of subset selection and model tuning with the SVR model with RBF kernel. The two tables correspond to D2 and D3 models respectively. The full sets of chosen descriptors can be found in Appendix A.1.

D2
Model name:             SVR 9
Model parameters:       kernel = RBF, C = 1.5, ε = 0.05, γ = 0.017
Genetic parameters:     N = 50, p_cross = 0.8, r_mut = 1, generations = 100
Number of descriptors:  166
Test R2:                0.579
Test RMSE:              0.654

D3
Model name:             SVR 10
Model parameters:       kernel = RBF, C = 1.5, ε = 0.1, γ = 0.011
Genetic parameters:     N = 50, p_cross = 0.8, r_mut = 1, generations = 100
Number of descriptors:  170
Test R2:                0.614
Test RMSE:              0.738

5.4 Ensemble models
The ensemble models were found to have the best predictive performance of all the evaluated models. Table 6 shows the evaluation of the ensemble models on the test set; three sets of descriptors were used, see Section 4.2 for details. XGBoost was found to be superior to Random Forest for all datasets, with an R2 roughly 5% higher. Additionally, the evaluated XGBoost model was more than three times faster than Random Forest, as shown in Table 7. XGBoost applied to the molecular descriptors alone is observed to reach approximately the same level of prediction as Random Forest using all descriptors; a consequence is that XGBoost can effectively reach a similar level of prediction as Random Forest in approximately 6.4% of the computation time.

16 Table 6: Results of the ensemble models on the three descriptor sets; MD: Molecular Descriptor, FP: Fingerprint, CB: Combined. The results indicate that ensemble models make better predictions using fingerprints.

             Random Forest       XGBoost
             R2       RMSE       R2       RMSE
Target = D2
  MD         0.579    0.655      0.624    0.619
  FP         0.616    0.625      0.643    0.603
  CB         0.630    0.614      0.669    0.580
Target = D3
  MD         0.611    0.741      0.643    0.710
  FP         0.638    0.715      0.680    0.673
  CB         0.658    0.695      0.690    0.662

Table 7: Time to fit to training data and predict on the test set for the D2 target. Both models were parameter optimised as presented in Table 1 and XGBoost was found to be more than 3x faster than Random Forest.

Dataset                  Random Forest   XGBoost
Molecular Descriptors    50.5 s          11.6 s
Fingerprint Descriptors  133.5 s         39.5 s
Both Sets                180.7 s         55.5 s

Table 8: Results on all descriptors (CB). The cross-validation (CV) results are means over the folds.

             Random Forest       XGBoost
             R2       RMSE       R2       RMSE
Target = D2
  CV Train   0.942    0.244      0.957    0.210
  CV Test    0.612    0.632      0.649    0.600
  Test       0.630    0.614      0.669    0.580
Target = D3
  CV Train   0.947    0.274      0.962    0.231
  CV Test    0.672    0.681      0.704    0.647
  Test       0.658    0.695      0.690    0.662

Table 9: Feature importance results for the ensemble models. The set I is an ordered set of feature importances from highest to lowest. The topmost part of the table gives the number of features whose cumulative importance adds up to (i) 50% and (ii) 95% of the total importance, together with the largest single importance. The second part lists the 10 most important features for each model.

                               Random Forest   XGBoost
#features: Σ_{i∈I} i < 0.50    33              47
#features: Σ_{i∈I} i < 0.95    141             167
max_{i∈I} i                    0.044           0.049

Random Forest: GCUT_PEOE_0, GCUT_PEOE_1, h_pstates, rsynth, SMR_VSA4, BCUT_SLOGP_0, SlogP_VSA4, PEOE_VSA+5, BCUT_SMR_0, balabanJ

XGBoost: PEOE_VSA+5, vsa_don, SlogP_VSA4, h_log_dbo, GCUT_PEOE_0, a_nN, SMR_VSA4, a_nS, opr_nring, GCUT_PEOE_1

Comparing the cross-validation results in Table 8 with the evaluation on the external test set, we can see that both models generalised well. It was also hypothesised, and confirmed, that Random Forest does not overfit when the number of trees is increased; this is shown by the plot in Appendix B.

Feature importances were produced by the ensemble models, and the most important results are presented in Table 9. The table shows the minimum number of features that make up 50% and 95% of the total importance, the largest importance of a single feature, and the ten most important features for each model. These results show that both models use most of the available descriptors to some extent in the decision-making process; for this reason, the observed feature importances do not give any obvious insight into the descriptors. However, although many descriptors are needed to reach the best fit, some are observed to be more common among the top ten most important features.
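A sketch of how Table 9 style numbers can be derived, assuming `model` is one of the fitted regressors and `feature_names` lists the descriptors in column order:

    import numpy as np

    imp = model.feature_importances_
    order = np.argsort(imp)[::-1]                 # highest importance first
    cumulative = np.cumsum(imp[order]) / imp.sum()

    n50 = int(np.searchsorted(cumulative, 0.50)) + 1   # features covering 50%
    n95 = int(np.searchsorted(cumulative, 0.95)) + 1   # features covering 95%
    top10 = [feature_names[i] for i in order[:10]]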

6 Discussion

6.1 Data discussions
It may be of interest to limit the data set to data from certain sources, or based on other metadata such as the date of publication. Users of the ChEMBL database sometimes use a time-split test set to validate their models, that is, assigning compounds tested in later phases of a study to the test set [15]. However, since this project combines data from different sources and assays, a time-split data set was not considered appropriate. Moreover, with the approach taken of aggregating the data by averaging pKi values, any metadata had to be dropped.

The threshold that determines whether the spread of pKi values for one compound is acceptable has a significant impact on the number of compounds removed; it can be considered somewhat arbitrary and dependent on the field of application. Nevertheless, 10% appears to be a commonly used value in informal discussions, and it retains a good number of compounds to model. Filtering out compounds based on their pKi spread can remove perfectly valid compounds. One reason a spread can be large is that the stereoisomers of a compound produce different inhibitory constants. To the best of our knowledge, ChEMBL does not record which stereoisomer each value corresponds to. Were this information available, compounds could be re-labelled by their stereoisomers, and modelling would require three-dimensional descriptors, as one- and two-dimensional descriptors cannot differentiate between stereoisomers.

This project assessed the presence of activity cliffs based on a proposal from [9], but without performing any detailed analysis of them. The main reason was that an error was found, shortly before submission, in the code that assessed the presence of activity cliffs. Nonetheless, after the correction a similar activity landscape was obtained, and it was decided not to re-run the modelling, both because the results were not expected to differ significantly and because of time constraints.

A more in-depth analysis of the data could have been performed if time had allowed, for example analysing why some compounds with different ChEMBL IDs are identical based on fingerprints, or analysing intra- and inter-assay variability in activity values, in order to obtain a data set of higher quality. On the other hand, other descriptors, such as three-dimensional descriptors, could be incorporated and could provide potentially significant information on the activity values.

6.2 Applicability domain
It is advisable that a QSAR model is accompanied by an applicability domain (AD). The applicability domain of a model relates the descriptor space on which it was trained to the space to which it should be applied, in order to provide some prediction reliability. Unsurprisingly, a defined domain of applicability is stated in the Organisation for Economic Co-operation and Development (OECD) principles for the validity of a QSAR model for regulatory purposes [26].

It has been claimed that the QSAR field preceded the general field of machine learning in defining applicability domains [15]. There are two main approaches to defining an AD: defining the chemical space for which predictions are reliable, or estimating a prediction uncertainty. While some QSAR methods, such as Gaussian processes, already provide a prediction uncertainty [27], the definition of an AD seems to be a very active area of research. Recent publications by Berenger and Yamanishi [28] and Liu et al. [29] briefly summarise different approaches, cite several reviews on applicability domains, and provide new AD definitions themselves.

A simple AD requires that, in order to predict the activity of a new compound, its descriptors lie in the convex hull of the descriptors from the training set. This can be computationally challenging, and can be approximated by requiring that its descriptors lie within the ranges of the training set descriptors, creating a hypercube. A more elaborate method is to create this hypercube using the principal components, reducing its dimension, but the number of components to use depends on a user-chosen value.
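A minimal sketch of the hypercube variant, assuming `X_train` and `X_test` are descriptor matrices:

    import numpy as np

    lo = X_train.min(axis=0)        # per-descriptor training ranges
    hi = X_train.max(axis=0)

    def in_domain(x: np.ndarray) -> bool:
        # Inside the domain if every descriptor lies within the training range.
        return bool(np.all((x >= lo) & (x <= hi)))

    inside = np.array([in_domain(x) for x in X_test])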

In general, distance-based methods compute the distance from a compound to some compounds in the data set, and often depend on a user-chosen threshold to categorise compounds as in or out of the AD. Probably the most advanced techniques to estimate an AD are probability density distribution methods [28].

It can be reasoned that distance-based methods can be outperformed by uncertainty estimations, as the prediction performance is more likely to deteriorate gradually towards the edges of the chemical descriptor space [29]. One approach to determining some sort of uncertainty is to build an error model, as in the work of Sheridan [27, 30, 31]. There, an error model using a random forest is built to assign a new molecule the mean error of the training-set molecules in the same region, for a set of defined AD metrics, such as the fingerprint similarity to the nearest or five nearest compounds in the training set, the predicted activity value, or the standard deviation of the prediction among the random forest trees.

Some work has been published that aims to quantify uncertainty in random forests in a more elaborate fashion than just the prediction standard deviation among the trees. Wager, Hastie and Efron [32] and Mentch and Hooker [33] have published work in estimating confidence intervals for random forests, while Zhang et al. [34] propose a method to determine prediction intervals for random forests. Prediction intervals can also be constructed with quantile regression forests [35], a generalisation of random forests.

In this project, the work presented in [36] to determine an applicability domain was applied to the data sets generated. This method consists of first classifying compounds as well predicted (the negative class) or badly predicted (the positive class) from a regression, and then classifying them with a method that provides some measure of confidence in the predicted class or a probability of class membership. The initial class assignment is somewhat arbitrary; the original work reasons that the minority class, the positive one, should have a prevalence of 20-30%. That work also reasons that the same descriptors used in the regression model have to be employed in the classification.

This method was applied to the Random Forest and XGBoost regression models described in Section 4.1.3. Since they use all the computed descriptors, a random forest classifier was chosen as suitable, both to reduce the effect of correlated descriptors and because it provides a class membership probability. The results showed the desired behaviour on the train set without tuning, both in the confusion matrix and in the distributions of class membership probabilities. However, when tuning the classifier in order to generalise, maximising either the true positive rate or the F1 score, aiming to reduce false positives and taking the class imbalance into account, the results not only degraded significantly for the train set but were also unsatisfactory for the test set.

Further work could evaluate the feasibility of different methods for determining the applicability domain of the generated models, and apply those found suitable. This could reveal that different models have different applicability domains, especially if the AD provides an estimate of the prediction uncertainty. In particular, the aforementioned work by Sheridan [27] could be a starting point for the tree-based models presented here, given the similarities.

6.3 Modelling conclusions

6.3.1 Computational limitations

Properly evaluating statistical models on datasets of this size takes time and resources. For example, the genetic algorithm of Section 4.2 fits the same model, without any tuning, for all N descriptor subsets in the population. Tuning is instead done only on the model with the highest fitness, and only every tenth generation (if the current best model has changed), over a parameter grid of size 3 × 3 × 3 using 5-fold cross-validation. With the SVR model it took around 7 hours to run 80 generations with a population of 40 descriptor sets on a relatively slow home computer. That amounts to 3200 model fits and evaluations in the fitness computation step, plus an additional 3 × 3 × 3 × 5 = 135 fits every tenth generation, for a total of 4280 model fits and evaluations. The algorithm would almost certainly benefit from tuning with cross-validation for each of the N subsets. Furthermore, it would have been appropriate to run up to 200 generations with a slightly larger population, and to fine-tune the model over a higher-resolution parameter grid. This was not possible on home computers, but doing it on high-performance computers would most likely lead to better results. The conclusion is that some of the results presented in this report are not optimal, although the methods may very well be. An indication of the potential of the GA is the training progress plot in Figure 7, which suggests that the fitness could improve with more generations and tuning of the genetic parameters. It is reasonable to expect a steeper fitness curve if the model tuning had been done more often and at higher resolution.

Figure 7: Training progress of the genetic algorithm with an SVR model on the D2 dataset. Even though the increase in fitness becomes smaller with more generations, some improvement could probably still be gained by tuning the genetic parameters and running more iterations.
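As a rough sketch of the tuning step described above, the following uses scikit-learn's GridSearchCV; the grid values, X, y and best_subset are illustrative assumptions, not the project's actual settings:

```python
# A 3 x 3 x 3 grid searched with 5-fold cross-validation
# (27 candidates x 5 folds = 135 fits, matching the count above).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "C": [0.1, 1.0, 100.0],
    "epsilon": [0.01, 0.1, 1.0],
    "gamma": [0.001, 0.01, 0.1],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
# search.fit(X[:, best_subset], y)  # best_subset: the fittest GA individual
```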

Additionally, computational limitations were the primary reason why XGBoost was not tuned to each descriptor set and target individually. With 5-fold cross-validation, a single evaluation of one parameter combination could take up to 10 minutes if the learning rate was low enough. Since 8 parameters are tuned over relatively large ranges, this implies a very large number of evaluations, even when making smart choices using an optimisation tool such as Hyperopt. It was, however, concluded that the set of parameters presented in Table 1 performs well even on data it was not specifically tuned for. The adequate performance of an XGBoost model with roughly tuned parameters is also supported by previous research in the area [17]. Thus, the model could perform better if tuned individually to each situation it is evaluated on, but the improvements are minimal compared to the increase in resources needed. A related point is the computational advantage of XGBoost compared to Random Forest: once the parameters are tuned, XGBoost trains and predicts significantly faster than Random Forest, making it superior in this sense. It should, however, be noted that Random Forest did not require as many estimators as were used; performance increased only marginally beyond a rather small ensemble, and the model was reasonably accurate with very few estimators (fewer than 100). Hence, Random Forest can also be sped up significantly at the expense of some predictive performance. The conclusion nevertheless remains that XGBoost is the superior method of the two in terms of the computational cost of training and predicting.

6.3.2 Feature subsets and importance

Comparing the most important features of the ensemble methods with the subset selected by the genetic algorithm, the two approaches largely agree on how many descriptors explain the models. For the D2 receptor target, the genetic algorithm chooses 166 features, and the same number of features contributes 94.6% of the feature importance in the XGBoost model. Comparing the specific features chosen, an agreement of 84% is found, i.e. 139 common features.
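The comparison can be expressed compactly as below; `importances` (a descriptor-to-importance mapping) and `ga_features` are hypothetical inputs, and the sketch is one way to reproduce the numbers above, not the project's code:

```python
# Take the top XGBoost features covering ~95% of total importance and
# intersect them with the GA-selected subset.
import numpy as np

def feature_agreement(importances, ga_features, coverage=0.946):
    ranked = sorted(importances, key=importances.get, reverse=True)
    weights = np.array([importances[name] for name in ranked])
    cum = np.cumsum(weights) / weights.sum()
    top = set(ranked[: int(np.searchsorted(cum, coverage)) + 1])
    return len(top & ga_features) / len(ga_features)  # e.g. 139/166 ~ 0.84
```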

A possible explanation is that the descriptors omitted from either of the models are correlated with included ones. The dependence structure of the features in the descriptor set would be an interesting area of further study, but it was beyond the scope of this report.

A final note on the choice of descriptor sets concerns the ensemble models that used fingerprints as features. Including fingerprints seemed to increase performance for both Random Forest and XGBoost. Both the fingerprints and the molecular descriptors are computed from the SMILES string, but these results suggest that the fingerprints encode more information about the molecule and are therefore better for predicting molecular behaviour. Recent research [37] has shown promising results in bypassing the step of computing descriptors from the SMILES, letting a neural network perform this step implicitly. This is an interesting area of study which could show that some information is lost in the manual creation of descriptors.

6.3.3 Model interpretability

RBF kernel and support vector machines

The RBF kernel used with SVR proved to be the best choice among the linear-based models. Linear SVR works by finding the hyperplane in descriptor space that best fits the data within an ε-insensitive tube. When the RBF kernel is applied, that hyperplane is instead found in a non-linearly transformed kernel space. The RBF kernel, as defined in Equation (4.8), is a Gaussian function of the Euclidean distance between data points and the support vectors. Since SVR chooses the support vectors that define this fit, the RBF-SVR can loosely be interpreted as finding Gaussian-shaped groups of points and separating between those groups. This argument is not strictly rigorous, but it can be used to get a sense of how the descriptor data is clustered. The downside of the RBF kernel is that no variable importance can be extracted in a reasonable way, because the kernel maps the input data from descriptor space to an infinite-dimensional kernel space where the coefficient interpretation is lost.
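For reference, a direct implementation of the kernel, assuming Equation (4.8) takes the standard form K(x, x') = exp(−γ‖x − x'‖²):

```python
# Standard RBF kernel between two descriptor vectors; gamma is illustrative.
import numpy as np

def rbf_kernel(x, x_prime, gamma=0.001):
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))
```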

Tree-based ensemble methods

Both Random Forest and XGBoost create ensembles of decision trees to make predictions. In principle, the models are fully interpretable, as each prediction can be traced through the tree structures and every decision can be specified. In practice, however, the large number of trees (∼500) means the decisions cannot easily be visualised and analysed, which limits model explainability. The primary explainability tool therefore becomes the feature importances, which can easily be extracted from the ensemble tree structures.
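Both implementations used here expose these importances directly (scikit-learn's RandomForestRegressor and xgboost's XGBRegressor provide feature_importances_ after fitting); a small ranking helper might look as follows, with `model` and `descriptor_names` as hypothetical inputs:

```python
# Rank the most important descriptors of a fitted ensemble model.
import numpy as np

def top_features(model, descriptor_names, n=20):
    order = np.argsort(model.feature_importances_)[::-1][:n]
    return [(descriptor_names[i], model.feature_importances_[i]) for i in order]
```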

6.3.4 Comparison of model candidates

Figure 8 presents a sample of the most interesting models found during this project. The position of each model name on the grid indicates how many descriptors it considered important (model complexity, the x-axis) and its final evaluation score (R2, the y-axis). This gives an idea of how complexity and performance relate across the different models. The model specifications can be found in Table 4.

Figure 8: Each point on the grid indicates a model with a certain complexity and a certain evaluation score, as given by the x- and y-axes respectively. Refer to Table 8 for details on the RF and XGB models; note that the 95% level of the number of important descriptors is plotted for these. Refer to Table 4 for details on the other models.

6.4 Future recommendations

A number of articles discussing best practices in QSAR modelling are useful to review before any study [5, 6, 15]. Moreover, the OECD guidance document on the validation of QSAR models [26] can be very helpful: it provides principles for QSAR modelling for regulatory purposes that can be considered in any QSAR modelling, guidance on each of these principles, and a checklist to fulfil them. Arguably, this project does not satisfy one of the principles, a well-defined applicability domain, which was not studied in detail due to time limitations; the discussion in Section 6.2 can be a starting point for future studies.

QSAR modelling should be led by the scientific question the study aims to answer. In general, a model that both predicts reliably and is interpretable can be difficult to obtain. Interpretable models often use empirical descriptors and tend to be simpler and linear, whereas models that include computational descriptors and are harder to interpret mechanistically tend to be more reliable predictors [38]. Consequently, prior to modelling it can be useful to determine what is prioritised, always bearing in mind that correlation does not imply causation, a common cause of disappointment in QSAR modelling [7, 39]. This decision can help determine the data sources and descriptors to consider before any analysis, in particular if experimental conditions affect the measurements and therefore the predictions.

Using a Random Forest as a benchmark model can be useful, given that it is a well-known and unambiguous method, is extensively discussed in the QSAR literature, and performs intrinsic variable selection. Nonetheless, XGBoost, which is also tree-based, is an alternative benchmark by the reasoning in Section 4.1.3 and given the good results presented in Section 5.4. It is worth noting that public model implementations are usually open source, which allows inspection of the implementation if required. These particular models can also be used for classification tasks, an approach not considered in this project since the activity values could not easily be assigned to classes.

7 Acknowledgements

We want to thank Peder Svensson and Fredrik Wallner from IRLAB Therapeutics and Mattias Sundén and Erik Lorentzen from Smartr for offering this project, for the interesting discussions over video calls, and for their availability. The project not only matched our expectations but also helped us gain new knowledge and discover QSAR modelling. We are very happy to have been chosen for this project and we hope that our findings are useful.

References

[1] Arkadiusz Z. Dudek, Tomasz Arodz and Jorge Gálvez. ‘Computational Methods in Developing Quantitative Structure-Activity Relationships (QSAR): A Review’. In: Combinatorial Chemistry & High Throughput Screening 9 (2006), pp. 213–228.
[2] David Mendez et al. ‘ChEMBL: towards direct deposition of bioassay data’. In: Nucleic Acids Research 47.D1 (Nov. 2018), pp. D930–D940. issn: 0305-1048. doi: 10.1093/nar/gky1075.
[3] Chemical Computing Group. Molecular Operating Environment. Version 2019.01. url: https://www.chemcomp.com/.
[4] Denis Fourches, Eugene Muratov and Alexander Tropsha. ‘Curation of chemogenomics data’. In: Nature Chemical Biology 11.8 (2015), p. 535. doi: 10.1038/nchembio.1881.
[5] Denis Fourches, Eugene Muratov and Alexander Tropsha. ‘Trust, but Verify II: A Practical Guide to Chemogenomics Data Curation’. In: Journal of Chemical Information and Modeling 56.7 (2016), pp. 1243–1252. issn: 15205142. doi: 10.1021/acs.jcim.6b00129.
[6] Artem Cherkasov et al. ‘QSAR modeling: Where have you been? Where are you going to?’ In: Journal of Medicinal Chemistry 57.12 (2014), pp. 4977–5010. issn: 15204804. doi: 10.1021/jm4004285.
[7] Gerald M. Maggiora. ‘On Outliers and Activity Cliffs — Why QSAR Often Disappoints’. In: Journal of Chemical Information and Modeling 46.4 (2006). PMID: 16859285, p. 1535. doi: 10.1021/ci060117s.
[8] Dagmar Stumpfe et al. ‘Recent Progress in Understanding Activity Cliffs and Their Utility in Medicinal Chemistry’. In: Journal of Medicinal Chemistry 57.1 (2014). PMID: 23981118, pp. 18–28. doi: 10.1021/jm401120g.
[9] Rajarshi Guha and John H. Van Drie. ‘Structure-Activity Landscape Index: Identifying and Quantifying Activity Cliffs’. In: Journal of Chemical Information and Modeling 48.3 (2008). PMID: 18303878, pp. 646–658. doi: 10.1021/ci7004093.
[10] Open-Source Cheminformatics. RDKit. Version 2020.09.1. url: https://www.rdkit.org/.
[11] Tahir Mehmood, Solve Sæbø and Kristian Hovde Liland. ‘Comparison of variable selection methods in partial least squares regression’. In: Journal of Chemometrics 34 (2020). doi: 10.1002/cem.3226.
[12] Alex J. Smola and Bernhard Schölkopf. ‘A tutorial on support vector regression’. In: Statistics and Computing 14 (2004), pp. 199–222. doi: 10.1023/B:STCO.0000035301.49549.88.
[13] Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning. Springer, 2009. doi: 10.1007/b94608.
[14] M. Wahde. Biologically Inspired Optimization Methods. WIT Press, 2008.
[15] Eugene N. Muratov et al. ‘QSAR without borders’. In: Chem. Soc. Rev. 49.11 (2020), pp. 3525–3564. doi: 10.1039/D0CS00098A.
[16] Vladimir Svetnik et al. ‘Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling’. In: Journal of Chemical Information and Computer Sciences 43.6 (2003). PMID: 14632445, pp. 1947–1958. doi: 10.1021/ci034160g.
[17] Robert P. Sheridan et al. ‘Extreme Gradient Boosting as a Method for Quantitative Structure-Activity Relationships’. In: Journal of Chemical Information and Modeling 56.12 (2016). PMID: 27958738, pp. 2353–2360. doi: 10.1021/acs.jcim.6b00591.
[18] Tianqi Chen and Carlos Guestrin. ‘XGBoost: A Scalable Tree Boosting System’. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16. San Francisco, California, USA: Association for Computing Machinery, 2016, pp. 785–794. isbn: 9781450342322. doi: 10.1145/2939672.2939785.
[19] Scikit-learn documentation on supervised learning. Dec. 2020. url: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning.
[20] Mathematical background of Group Lasso. Dec. 2020. url: https://group-lasso.readthedocs.io/en/latest/maths.html.
[21] Jacob A. Wegelin. ‘A survey of Partial Least Squares (PLS) methods, with emphasis on the two-block case’. In: Technical report 371 (2000).
[22] Python implementation of Group Lasso. Dec. 2020. url: https://github.com/yngvem/group-lasso.
[23] I. Guyon et al. ‘Gene selection for cancer classification using support vector machines’. In: Machine Learning 46 (2002), pp. 389–422.
[24] R. Shao et al. ‘Wavelets and nonlinear principal components analysis for process monitoring’. In: Control Engineering Practice 7 (1999), pp. 856–879. doi: 10.1016/S0967-0661(99)00039-8.
[25] James Bergstra et al. ‘Hyperopt: A Python library for model selection and hyperparameter optimization’. In: Computational Science & Discovery 8 (July 2015), p. 014008. doi: 10.1088/1749-4699/8/1/014008.
[26] OECD. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models. 2014, p. 154. doi: 10.1787/9789264085442-en.
[27] Robert P. Sheridan. ‘The Relative Importance of Domain Applicability Metrics for Estimating Prediction Errors in QSAR Varies with Training Set Diversity’. In: Journal of Chemical Information and Modeling 55.6 (2015), pp. 1098–1107. issn: 15205142. doi: 10.1021/acs.jcim.5b00110.
[28] Francois Berenger and Yoshihiro Yamanishi. ‘A Distance-Based Boolean Applicability Domain for Classification of High Throughput Screening Data’. In: Journal of Chemical Information and Modeling 59.1 (2019), pp. 463–476. issn: 15205142. doi: 10.1021/acs.jcim.8b00499.
[29] Ruifeng Liu et al. ‘General Approach to Estimate Error Bars for Quantitative Structure-Activity Relationship Predictions of Molecular Activity’. In: Journal of Chemical Information and Modeling 58.8 (2018), pp. 1561–1575. issn: 15205142. doi: 10.1021/acs.jcim.8b00114.
[30] Robert P. Sheridan. ‘Three useful dimensions for domain applicability in QSAR models using random forest’. In: Journal of Chemical Information and Modeling 52.3 (2012), pp. 814–823. issn: 15499596. doi: 10.1021/ci300004n.
[31] Robert P. Sheridan. ‘Using random forest to model the domain applicability of another random forest model’. In: Journal of Chemical Information and Modeling 53.11 (2013), pp. 2837–2850. issn: 15499596. doi: 10.1021/ci400482e.
[32] Stefan Wager, Trevor Hastie and Bradley Efron. ‘Confidence intervals for random forests: The jackknife and the infinitesimal jackknife’. In: Journal of Machine Learning Research 15 (2014), pp. 1625–1651. issn: 15337928. url: https://jmlr.org/papers/v15/wager14a.html.
[33] Lucas Mentch and Giles Hooker. ‘Quantifying uncertainty in random forests via confidence intervals and hypothesis tests’. In: Journal of Machine Learning Research 17 (2016), pp. 1–41. issn: 15337928. arXiv: 1404.6473. url: https://jmlr.org/papers/v17/14-168.html.
[34] Haozhe Zhang et al. ‘Random Forest Prediction Intervals’. In: The American Statistician 74.4 (2020), pp. 392–406. issn: 15372731. doi: 10.1080/00031305.2019.1585288.
[35] Nicolai Meinshausen. ‘Quantile Regression Forests’. In: Journal of Machine Learning Research 7 (2006), pp. 983–999. url: https://www.jmlr.org/papers/v7/meinshausen06a.html.
[36] Rajarshi Guha and Peter C. Jurs. ‘Determining the validity of a QSAR model - A classification approach’. In: Journal of Chemical Information and Modeling 45.1 (2005), pp. 65–73. issn: 15499596. doi: 10.1021/ci0497511.
[37] Suman K. Chakravarti and Sai Radha Mani Alla. ‘Descriptor Free QSAR Modeling Using Deep Learning With Long Short-Term Memory Neural Networks’. In: Frontiers in Artificial Intelligence 2 (2019), p. 17. issn: 2624-8212. doi: 10.3389/frai.2019.00017. url: https://www.frontiersin.org/article/10.3389/frai.2019.00017.
[38] Toshio Fujita and David A. Winkler. ‘Understanding the Roles of the Two QSARs’. In: Journal of Chemical Information and Modeling 56.2 (2016). PMID: 26754147, pp. 269–274. doi: 10.1021/acs.jcim.5b00229.
[39] Stephen R. Johnson. ‘The Trouble with QSAR (or How I Learned To Stop Worrying and Embrace Fallacy)’. In: Journal of Chemical Information and Modeling 48.1 (2008). PMID: 18161959, pp. 25–26. doi: 10.1021/ci700332k.

A Result tables

A.1 Variable selection chosen descriptors

See the separate CSV files named varselect_results_v3_<target>_<dataset>.csv for the complete lists of chosen descriptors and their variable agreement scores, where <target> can be D2 or D3 and <dataset> can be alldata, cont or cat.

A.2 Specifications of the linear-based models

More details on the models presented in Table 4 are given in Table 10.

Table 10: Results from evaluating the linear-based models on different subsets of the descriptors. The Subset column denotes the model used to produce the subset when the value is a model name, and the agreement level used to produce the subset when the value is a rational number. The parameter sets were found by cross-validation over a subjectively chosen parameter space.

Model          Dataset  Subset      Test R2  Test RMSE  Best parameters
SVR 1          alldata  None         0.529    0.693     C: 100.0, ϵ: 0.1, γ: 0.001, kernel: rbf
SVR 2          alldata  pls          0.38     0.795     C: 1.0, ϵ: 0.1, γ: 0.1, kernel: rbf
PLS 1          alldata  None         0.278    0.858     n_components: 100
Lasso 1        alldata  None         0.252    0.873     α: 0.005
PLS 2          alldata  pls          0.109    0.953     n_components: 16
SVR 3          alldata  0.75         0.177    0.916     C: 0.215, ϵ: 0, kernel: rbf
Elastic net 1  alldata  pls         -0.007    1.013     α: 1, l1_ratio: 0.5
Lasso 2        cat      None         0.117    0.948     α: 0.005
Group Lasso 1  cat      gl           0.079    0.968     group_reg: 0.005, l1_reg: 0
SVR 4          cont     linear_svr   0.524    0.696     C: 1.0, ϵ: 0.1, γ: 0.1, kernel: rbf
SVR 5          cont     lasso        0.461    0.741     C: 100.0, ϵ: 0.01, γ: 0.001, kernel: rbf
SVR 6          cont     None         0.512    0.705     C: 100.0, ϵ: 0.1, γ: 0.001, kernel: rbf
SVR 7          cont     0.75         0.406    0.778     C: 0.01, ϵ: 0
PLS 3          cont     None         0.27     0.862     n_components: 22
Lasso 3        cont     None         0.241    0.88      α: 0.005
Elastic net 2  cont     None         0.237    0.881     α: 0.1, l1_ratio: 0
Elastic net 3  cont     0.75         0.196    0.905     α: 0.1, l1_ratio: 0

A.3 Final descriptor sets of the genetic algorithm

Tables 11 and 12 list the names of the descriptors used in the final generation of the genetic algorithm for the results presented in Table 5.

B Ensemble Parameter Tuning

Parameters for the ensemble methods were tuned following the procedure specified in this section. The most relevant parameters were identified and split into three categories: tree-related parameters (max_depth, min_child_weight, min_samples_split, gamma, colsample_bytree), regularisation parameters (lambda, alpha), and learning parameters (n_estimators, learning_rate). Each category of parameters was optimised separately, in the order presented. Finally, the optimisation of the tree-related parameters was rerun to produce the final results.

The Python library Hyperopt [25] was used for the optimisation step, specifically the function fmin, which minimises an objective function over a given parameter space. A search space for the function was defined as in Table 13, and the objective function was the mean RMSE of a 5-fold cross-validation. The table also shows the optimal parameters found by the algorithm.
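The following is a minimal sketch of this setup, on synthetic stand-in data and with only three of the eight tuned parameters; the search ranges mirror Table 13, but the full space and evaluation budget used in the project were larger:

```python
# Hyperopt fmin minimising the mean 5-fold cross-validation RMSE of XGBoost.
import numpy as np
from hyperopt import fmin, hp, tpe
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 50)), rng.normal(size=200)  # stand-in data

space = {
    "max_depth": hp.quniform("max_depth", 4, 30, 1),
    "gamma": hp.quniform("gamma", 0, 0.5, 0.01),
    "colsample_bytree": hp.quniform("colsample_bytree", 0.5, 1, 0.01),
}

def objective(params):
    model = XGBRegressor(max_depth=int(params["max_depth"]),
                         gamma=params["gamma"],
                         colsample_bytree=params["colsample_bytree"])
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()  # mean RMSE over the 5 folds, to be minimised

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
```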

Table 11: The descriptors chosen in the final iteration of the GA for the D2 dataset.

D2: Weight-mol, apol, ast_fraglike, ast_violation, ast_violation_ext, a_acc, a_aro, a_base, a_count, a_don, a_donacc, a_heavy, a_ICM, a_nB, a_nBr, a_nC, a_nCl, a_nF, a_nH, a_nN, a_nO, a_nS, balabanJ, BCUT_PEOE_1, BCUT_PEOE_2, BCUT_PEOE_3, BCUT_SLOGP_0, BCUT_SLOGP_1, BCUT_SLOGP_3, BCUT_SMR_0, BCUT_SMR_1, BCUT_SMR_2, BCUT_SMR_3, bpol, b_1rotN, b_1rotR, b_ar, b_count, b_double, b_heavy, b_max1len, b_rotN, b_rotR, b_single, b_triple, chi0, chi0v, chi0_C, chi1_C, chiral, chiral_u, density, diameter, GCUT_PEOE_0, GCUT_PEOE_1, GCUT_PEOE_2, GCUT_PEOE_3, GCUT_SLOGP_0, GCUT_SLOGP_1, GCUT_SLOGP_2, GCUT_SLOGP_3, GCUT_SMR_0, GCUT_SMR_1, GCUT_SMR_2, GCUT_SMR_3, h_ema, h_emd_C, h_logP, h_logS, h_log_dbo, h_mr, h_pavgQ, h_pKa, h_pKb, h_pstates, h_pstrain, Kier1, Kier2, Kier3, KierA1, KierA3, lip_don, lip_druglike, lip_violation, logP(o/w), logS, mr, opr_brigid, opr_leadlike, opr_nring, opr_nrot, opr_violation, PC+, PEOE_PC+, PEOE_PC-, PEOE_RPC+, PEOE_VSA+0, PEOE_VSA+1, PEOE_VSA+2, PEOE_VSA+3, PEOE_VSA+4, PEOE_VSA+5, PEOE_VSA+6, PEOE_VSA-0, PEOE_VSA-1, PEOE_VSA-2, PEOE_VSA-3, PEOE_VSA-4, PEOE_VSA-5, PEOE_VSA-6, PEOE_VSA_FHYD, PEOE_VSA_FNEG, PEOE_VSA_FPNEG, PEOE_VSA_FPOS, PEOE_VSA_FPPOS, PEOE_VSA_NEG, PEOE_VSA_POL, PEOE_VSA_POS, PEOE_VSA_PPOS, petitjean, petitjeanSC, Q_PC+, Q_PC-, Q_RPC+, Q_VSA_FHYD, Q_VSA_FNEG, Q_VSA_FPNEG, Q_VSA_FPOL, Q_VSA_FPOS, Q_VSA_FPPOS, Q_VSA_PNEG, radius, reactive, rings, rsynth, SlogP, SlogP_VSA0, SlogP_VSA2, SlogP_VSA3, SlogP_VSA4, SlogP_VSA5, SlogP_VSA6, SlogP_VSA7, SlogP_VSA8, SlogP_VSA9, SMR, SMR_VSA0, SMR_VSA1, SMR_VSA2, SMR_VSA3, SMR_VSA4, SMR_VSA5, SMR_VSA6, SMR_VSA7, TPSA, VAdjEq, VDistEq, VDistMa, vdw_area, vsa_acc, vsa_don, vsa_other, vsa_pol, weinerPath, weinerPol, zagreb

Table 12: The descriptors chosen in the final iteration of the GA for the D3 dataset.

D3: ast_fraglike_ext, ast_violation, ast_violation_ext, a_acc, a_base, a_don, a_donacc, a_heavy, a_hyd, a_IC, a_ICM, a_nBr, a_nCl, a_nF, a_nH, a_nI, a_nN, a_nO, a_nP, a_nS, balabanJ, BCUT_PEOE_0, BCUT_PEOE_1, BCUT_PEOE_2, BCUT_PEOE_3, BCUT_SLOGP_0, BCUT_SLOGP_1, BCUT_SLOGP_2, BCUT_SLOGP_3, BCUT_SMR_0, BCUT_SMR_1, BCUT_SMR_3, bpol, b_1rotR, b_ar, b_count, b_double, b_max1len, b_rotN, b_rotR, b_triple, chi0, chi0v, chi0v_C, chi1, chi1v, chi1v_C, chi1_C, chiral, chiral_u, density, FCharge, GCUT_PEOE_0, GCUT_PEOE_1, GCUT_PEOE_2, GCUT_PEOE_3, GCUT_SLOGP_0, GCUT_SLOGP_1, GCUT_SLOGP_2, GCUT_SLOGP_3, GCUT_SMR_0, GCUT_SMR_1, GCUT_SMR_2, GCUT_SMR_3, h_emd, h_emd_C, h_logP, h_logS, h_log_pbo, h_mr, h_pavgQ, h_pKa, h_pKb, h_pstates, h_pstrain, Kier1, Kier3, KierA3, lip_acc, lip_don, lip_violation, mr, mutagenic, opr_brigid, opr_leadlike, opr_nring, opr_nrot, opr_violation, PC+, PC-, PEOE_PC+, PEOE_PC-, PEOE_RPC+, PEOE_RPC-, PEOE_VSA+0, PEOE_VSA+1, PEOE_VSA+2, PEOE_VSA+3, PEOE_VSA+4, PEOE_VSA+5, PEOE_VSA+6, PEOE_VSA-0, PEOE_VSA-1, PEOE_VSA-2, PEOE_VSA-4, PEOE_VSA-5, PEOE_VSA-6, PEOE_VSA_FHYD, PEOE_VSA_FNEG, PEOE_VSA_FPNEG, PEOE_VSA_FPOL, PEOE_VSA_FPOS, PEOE_VSA_FPPOS, PEOE_VSA_HYD, PEOE_VSA_NEG, PEOE_VSA_PNEG, PEOE_VSA_POS, PEOE_VSA_PPOS, petitjean, petitjeanSC, Q_PC+, Q_PC-, Q_RPC+, Q_VSA_FHYD, Q_VSA_FNEG, Q_VSA_FPOL, Q_VSA_FPPOS, Q_VSA_HYD, Q_VSA_NEG, Q_VSA_PNEG, Q_VSA_POS, Q_VSA_PPOS, radius, reactive, rings, RPC+, rsynth, SlogP, SlogP_VSA0, SlogP_VSA1, SlogP_VSA2, SlogP_VSA3, SlogP_VSA4, SlogP_VSA5, SlogP_VSA6, SlogP_VSA7, SlogP_VSA8, SlogP_VSA9, SMR_VSA0, SMR_VSA1, SMR_VSA2, SMR_VSA3, SMR_VSA4, SMR_VSA5, SMR_VSA6, SMR_VSA7, TPSA, VAdjEq, VAdjMa, VDistEq, VDistMa, vdw_area, vdw_vol, vsa_acc, vsa_don, vsa_hyd, vsa_other, vsa_pol, Weight, zagreb

The dataset used in the optimisation consisted of all descriptors (molecular descriptors and fingerprints) for the target D2.

Table 13: Optimal parameters chosen by the tuning algorithm. The range and step indicate the parameter space searched by Hyperopt. Note that this was not the final set of parameters used for Random Forest (see the discussion of Table 14 below).

Parameter             Random Forest  XGBoost  Range         Step
max_depth             17             23       [4, 30]       1
min_child_weight (a)  -              30       [1, 30]       1
min_samples_split (b) 5              -        [2, 30]       1
gamma (c)             0              0.13     [0, 0.5]      0.01
colsample_bytree (c)  0.61           0.58     [0.5, 1]      0.01
lambda (a)            -              0.55     [0, 2]        0.05
alpha (a)             -              1.4      [0, 2]        0.05
n_estimators          422            417      [100, 500]    1
learning_rate (a)     -              0.025    [0.005, 0.2]  0.005

(a) For XGBoost only. (b) For Random Forest only. (c) Known under a different name in Random Forest.

Each optimised method was evaluated against its default settings as part of the tuning procedure. The 5-fold cross-validation results on the set of all descriptors are presented in Table 14; the metrics presented are the means of the scores on the test folds. It is observed that Random Forest barely benefits from parameter optimisation. This is consistent with previous studies, where the authors note that only the number of trees impacts the predictive performance of the method [16]. Thus, default settings with n_estimators = 500 were later used, resulting in similar cross-validation scores. To confirm that this choice did not lead to overfitting, the plot in Figure 9 was generated; it confirms that Random Forest does not seem to overfit with an increasing number of estimators.
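A plot like Figure 9 can be generated by refitting the forest for a grid of ensemble sizes and tracking train and test scores; the sketch below uses synthetic stand-in data, not the project's descriptors:

```python
# Track train/test R2 of a Random Forest for a grid of ensemble sizes.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=50, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n in (20, 50, 100, 200, 500):
    rf = RandomForestRegressor(n_estimators=n, random_state=0).fit(X_train, y_train)
    print(n, rf.score(X_train, y_train), rf.score(X_test, y_test))
```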

XGBoost benefited more from tuning, in the range of 5–10%. It was noted that much of the benefit came from reducing the learning rate and increasing the number of estimators.

Table 14: Mean scores on the test folds in 5-fold cross-validation. The dataset consisted of the molecular descriptors.

             Random Forest        XGBoost
             R2       RMSE        R2       RMSE
Target = D2
  Default    0.607    0.636       0.610    0.630
  Tuned      0.612    0.631       0.669    0.580
Target = D3
  Default    0.668    0.685       0.642    0.711
  Tuned      0.674    0.679       0.690    0.662

This is of course a trade-off between predictive accuracy and computational cost, so the parameter choice is not trivial. We recommend taking this into consideration when choosing parameters for XGBoost. The other parameters can, however, reduce computation time by considering fewer features or building smaller trees.

Figure 9: Random Forest regression with an increasing number of trees n. The plot shows that, in the range of 20 to 500 trees, the model does not overfit to the training data.

C Code

C.1 Variable selection scripts

See the separate file variable_selection.py.

C.2 Linear-based model scripts

See the separate file linear_models.py.

C.3 Genetic algorithm script

See the separate file genetic_algorithm.py.
