Open-Source QSAR Models for pKa Prediction Using Multiple Machine Learning Approaches
Mansouri et al. J Cheminform (2019) 11:60
https://doi.org/10.1186/s13321-019-0384-1
Journal of Cheminformatics

RESEARCH ARTICLE, Open Access

Open-source QSAR models for pKa prediction using multiple machine learning approaches

Kamel Mansouri1*†, Neal F. Cariello1†, Alexandru Korotcov2, Valery Tkachenko2, Chris M. Grulke3, Catherine S. Sprankle1, David Allen1, Warren M. Casey4, Nicole C. Kleinstreuer4 and Antony J. Williams3

Abstract

Background: The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and ability to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicity properties. Multiple proprietary software packages exist for the prediction of pKa, but to the best of our knowledge no free and open-source programs exist for this purpose. Using a freely available data set and three machine learning approaches, we developed open-source models for pKa prediction.

Methods: The experimental strongest acidic and strongest basic pKa values in water for 7912 chemicals were obtained from DataWarrior, a freely available software package. Chemical structures were curated and standardized for quantitative structure–activity relationship (QSAR) modeling using KNIME, and a subset comprising 79% of the initial set was used for modeling. To evaluate different approaches to modeling, several datasets were constructed based on different processing of chemical structures with acidic and/or basic pKas. Continuous molecular descriptors, binary fingerprints, and fragment counts were generated using PaDEL, and pKa prediction models were created using three machine learning methods, (1) support vector machines (SVM) combined with k-nearest neighbors (kNN), (2) extreme gradient boosting (XGB) and (3) deep neural networks (DNN).
Results: The three methods delivered comparable performances on the training and test sets with a root-mean-squared error (RMSE) around 1.5 and a coefficient of determination (R2) around 0.80. Two commercial pKa predictors from ACD/Labs and ChemAxon were used to benchmark the three best models developed in this work, and performance of our models compared favorably to the commercial products.

Conclusions: This work provides multiple QSAR models to predict the strongest acidic and strongest basic pKas of chemicals, built using publicly available data, and provided as free and open-source software on GitHub.

Keywords: pKa prediction, QSAR, DataWarrior, Machine learning, Chemical 2D descriptors, Chemical fingerprints, PaDEL

*Correspondence: [email protected]; [email protected]
†Kamel Mansouri and Neal F. Cariello contributed equally to this work
1 Integrated Laboratory Systems, Inc., P.O. Box 13501, Research Triangle Park, NC 27709, USA
Full list of author information is available at the end of the article

© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Introduction

The acid dissociation constant (also called the protonation or ionization constant) Ka is an equilibrium constant defined as the ratio of the protonated and the deprotonated form of a compound. Ka is usually represented as pKa = −log10 Ka [1]. The pKa of a chemical strongly influences its pharmacokinetic and biochemical properties. pKa reflects the ionization state of a chemical, which in turn affects lipophilicity, solubility, protein binding, and ability to cross the plasma membrane and the blood–brain barrier.

The contributions of physicochemical parameters, including pKa, to environmental fate, transport, and distribution are well-recognized [2–5]. Chemicals with no charge at a physiological pH will cross the plasma membrane more easily than charged molecules and will therefore have greater potential for pharmacological or toxicological activity. Thus, pKa affects absorption, distribution, metabolism, excretion, and toxicity properties and is considered one of the five most important parameters in drug discovery [6, 7].

pKa is also an important parameter for physiologically based pharmacokinetic (PK) modeling and in vitro to in vivo extrapolation. Approaches such as those described by Wetmore et al. [8] are producing data sets that characterize metabolism and excretion for hundreds of chemicals. These data sets provide input for high-throughput methods for calculating the apparent volume of distribution at steady state and tissue-specific PK distribution coefficients [9] that will allow for the rapid construction of PK models. These, in turn, will provide context for both biomonitoring data and high-throughput toxicity screening studies.

Distribution of a chemical in an octanol/water mixture (described by the constants logKow or logP) is affected by the ionizable groups present in the chemical and is pH-dependent. logD is the distribution coefficient that takes into account the pH. This constant is therefore used to estimate the different relative concentrations of the ionized and non-ionized forms of a chemical at a given pH. Together, pKa and logP can be used to predict logD values [10]. This pH-dependent prediction is important to consider when attempting to predict absorption. For example, pH varies widely through the body from about 1.5 in the lower portion of the stomach to about 8.5 in the duodenum. Ionization characteristics of a chemical across this pH range therefore influence absorption in different locations in the body. The ability to predict logP and pKa and utilize these parameters to predict logD can therefore be of value for a number of applications, including drug design. The development of computational models to predict such physicochemical properties is clearly of value, quantitative structure–activity relationship (QSAR) models being one such approach.

Quantitative structure activity/property relationships (QSAR/QSPR) models for hydrophobicity were first developed in the 1960s [11]. The conceptual basis of QSARs is the congenericity principle, which is the assumption that structurally similar compounds will have similar properties. While QSAR approaches have been used for decades, their accuracy is highly dependent on data quality and quantity [12, 13]. Multiple commercial software vendors have developed systems for QSAR-based physicochemical parameter estimation, such as BioByte, ACD/Labs, Simulations Plus, ChemAxon and many others [14–17].

Different machine learning algorithms and variable selection techniques have been used in combination with molecular descriptors and binary fingerprints to develop QSAR models for physicochemical and toxicological properties. The advent of open data, open source, and open standards in the scientific community resulted in a plethora of web-based sites for sourcing data and performing real-time predictions. Examples include OCHEM, QSARDB, ChemBench and others [18–21].

As environmental scientists and modelers supporting U.S. government projects, our interest is in the development of free and open-source data and algorithms that are provided to the scientific community in such a way that more data can be incorporated, and additional models can be developed, consistent with government directives [22, 23]. Full transparency may also increase regulatory acceptance and confidence in modeling predictions.

pKa prediction is challenging because a single chemical can have multiple ionization sites. An examination of approximately 600 drugs showed that about 70% contain a single ionization site, with 45% of the compounds having a single basic ionization site and 24% having a single acidic site [24]. QSAR/QSPR methods generally perform better at predicting single endpoints. Consequently, many pKa models are restricted to small chemical spaces such as anilines, phenols, benzoic acids, primary amines, etc. [25, 26].

In addition, the paucity of large, freely available, high-quality, experimentally derived pKa datasets hinders the development of open-source and open data models. Indeed, both the quality of chemical structures and the associated experimental data are of concern due to their potential effects on the robustness of QSAR/QSPR models and the accuracy of their predictions [13, 27].

Several companies have developed algorithms to predict the pKa of individual ionization sites; several programs also predict multiple ionization sites for a single chemical [28]. However, to the best of our knowledge, there are no free, open-source, and open data models for predicting pKa for heterogeneous chemical classes. Liao and Nicklaus compared nine programs that predict pKa using a validation data set of 197 pharmaceuticals that included acetaminophen, aspirin, aspartame,

chemicals were not given. The collected chemical structures were analyzed for diversity using Toxprint chemotypes [33]. The enrichment graph (available in Additional file 2) shows the high diversity of the functional groups