MASARYKOVA UNIVERZITA PRˇ I´RODOVEˇ DECKA´ FAKULTA NA´ RODNI´ CENTRUM PRO VY´ZKUM BIOMOLEKUL

Diplomova´pra´ce

BRNO 2013 STANISLAV GEIDL MASARYKOVA UNIVERZITA PRˇ I´RODOVEˇ DECKA´ FAKULTA NA´ RODNI´ CENTRUM PRO VY´ZKUM BIOMOLEKUL

Predikce hodnot pKa na za´kladeˇ EEM atomovy´ch na´boju˚

Diplomova´pra´ce Stanislav Geidl

Vedoucı´pra´ce: prof. RNDr. Jaroslav Kocˇa, DrSc. Konzultant: RNDr. Radka Svobodova´Varˇekova´, Ph.D.

Brno 2013 Bibliograficky´za´znam

Autor: Bc. Stanislav Geidl Prˇı´rodoveˇdecka´fakulta, Masarykova univerzita Na´rodnı´centrum pro vy´zkum biomolekul

Na´zev pra´ce: Predikce hodnot pKa na za´kladeˇEEM atomovy´ch na´boju˚

Studijnı´program: Biochemie

Studijnı´obor: Chemoinformatika a bioinformatika

Vedoucı´pra´ce: prof. RNDr. Jaroslav Kocˇa, DrSc.

Akademicky´rok: 2012/2013

Pocˇet stran: xi + 78

Klı´cˇova´slova: disociacˇnı´konstanta, pKa, QSPR, vı´cerozmeˇrna´linea´rnı´ regrese, atomove´ na´boje, kvantova´ mechanika, EEM, fenoly, karboxylove´kyseliny Bibliographic Entry

Author: Bc. Stanislav Geidl Faculty of Science, Masaryk University National Centre for Biomolecular Research

Title of Thesis: Predicting pKa values from EEM atomic charges

Degree Programme: Biochemistry

Field of Study: Chemoinformatics and Bioinformatics

Supervisor: prof. RNDr. Jaroslav Kocˇa, DrSc.

Academic Year: 2012/2013

Number of Pages: xi + 78

Keywords: dissociation constant, pKa, QSPR, multilinear regres- sion, partial atomic charge, quantum mechanics, EEM, phenols, carboxilic acids Abstrakt

Disociacˇnı´ konstanta pKa je velmi du˚lezˇitou vlastnostı´ molekuly, a proto je vy´voj spolehlivy´ch a rychly´ch metod pro predikci pKa jednou z klı´cˇovy´ch oblastı´ vy´zkumu. Tato pra´ce zjisˇt’uje, jestli lze predikovat pKa pomocı´QSPR modelu˚ vyuzˇı´vajı´cı´ch empiricke´atomove´na´boje, konkre´tneˇna´boje vypocˇı´tane´metodou EEM (Electronegativity Equalization Method). V ra´mci pra´ce jsme nejdrˇı´ve shro- ma´zˇdili 18 sad EEM parametru˚vytvorˇeny´ch pro 8 ru˚zny´ch kvantoveˇmechan- icky´ch (QM) na´bojovy´ch sche´mat. Pote´jsme si prˇipravili tre´ninkovou sadu 74 molekul substituovany´ch fenolu˚. Da´le jsme pro kazˇdou molekulu vytvorˇili jejı´ disociovanou formu tak, zˇe jsme odstranili fenolovy´vodı´k. Pro vsˇechny molekuly v tre´ninkove´sadeˇjsme pak vypocˇı´tali EEM na´boje pomocı´18 sad EEM parametru˚ a QM na´boje pomocı´8 zmı´neˇny´ch na´bojovy´ch sche´mat. Pro kazˇdy´typ QM a EEM na´boju˚jsme vytvorˇili jeden QSPR model, vyuzˇı´vajı´cı´na´boje z nedisociovane´ molekuly (QSPR model se trˇemi deskriptory), a jeden QSPR model, vyuzˇı´vajı´cı´ na´boje z disociovane´i nedisociovane´molekuly (QSPR model s peˇti deskriptory). Pote´jsme vypocˇı´tali krite´ria kvality modelu˚a zhodnotili vsˇechny vytvorˇene´QSPR modely. QSPR modely vyuzˇı´vajı´cı´EEM na´boje se uka´zaly jako vhodna´metodika 2 2 pro predikci pKa (63% teˇchto modelu˚ma´ R > 0.9 a nejlepsˇı´model ma´ R = 0.924). Jak bylo ocˇekva´no, QM QSPR modely poskytovaly prˇesneˇjsˇı´predikci pKa nezˇEEM QSPR modely, rozdı´l v prˇesnosti modelu˚vsˇak nebyl prˇı´lisˇvy´razny´. Navı´c velkou vy´hodou EEM QSPR modelu˚je, zˇe jejich deskriptory (tj., EEM na´boje) mohou by´t vypocˇı´ta´ny vy´razneˇrychleji nezˇQM na´boje. Kromeˇtoho jsme zjistili, zˇe EEM QSPR modely nejsou tak silneˇovlivneˇny vy´beˇrem metodiky vy´pocˇtu na´boju˚jako QM QSPR modely. Robustnost EEM QSPR modelu˚byla na´sledneˇpotvrzena cross- validacı´. Aplikovatelnost EEM QSPR modelu˚na dalsˇı´trˇı´dy chemicky´ch la´tek jsme uka´zali pomocı´prˇı´padove´studie, zameˇrˇene´na karboxylove´kyseliny. Souhrnneˇ mu˚zˇeme rˇı´ci, zˇe EEM QSPR modely poskytujı´rychlou a prˇesnou metodiku pro predikci pKa, a mohou by´t tedy pouzˇity naprˇ. v ra´mci virtua´lnı´ho screeningu. Abstract

The acid dissociation constant pKa is a very important molecular property, and there is a strong interest in the development of reliable and fast methods for pKa prediction. We have evaluated the pKa prediction capabilities of QSPR models based on empirical atomic charges calculated by the Electronegativity Equaliza- tion Method (EEM). Specifically, we collected 18 EEM parameter sets created for 8 different quantum mechanical (QM) charge calculation schemes. Afterwards, we prepared a training set of 74 substituted phenols. Additionally, for each molecule we generated its dissociated form by removing the phenolic hydrogen. For all the molecules in the training set, we then calculated EEM charges using the 18 parameter sets, and the QM charges using the 8 above mentioned charge calcu- lation schemes. For each type of QM and EEM charges, we created one QSPR model employing charges from the non-dissociated molecules (three descriptor QSPR models), and one QSPR model based on charges from both dissociated and non-dissociated molecules (QSPR models with five descriptors). Afterwards, we calculated the quality criteria and evaluated all the QSPR models obtained. We found that QSPR models employing the EEM charges proved as a good approach 2 for the prediction of pKa (63% of these models had R > 0.9, while the best 2 had R = 0.924). As expected, QM QSPR models provided more accurate pKa predictions than the EEM QSPR models but the differences were not significant. Furthermore, a big advantage of the EEM QSPR models is that their descriptors (i.e., EEM atomic charges) can be calculated markedly faster than the QM charge descriptors. Moreover, we found that the EEM QSPR models are not so strongly influenced by the selection of the charge calculation approach as the QM QSPR models. The robustness of the EEM QSPR models was subsequently confirmed by cross-validation. The applicability of EEM QSPR models for other chemical classes was illustrated by a case study focused on carboxylic acids. In summary, EEM QSPR models constitute a fast and accurate pKa prediction approach that can be used in assigning charges to compounds in virtual screening.

Podeˇkova´nı´

Ra´d bych na tomto mı´steˇ podeˇkoval sve´mu sˇkoliteli prof. Kocˇovi za jeho podporu a pomoc, bez ktere´by tato pra´ce nemohla vzniknout. Da´le bych ra´d podeˇkoval sve´odborne´konzultance Radce Svobobove´za vesˇkerou jejı´pomoc. V neposlednı´ rˇadeˇ bych chteˇl take´ podeˇkovat babicˇce a otci za jejich podporu prˇedevsˇı´m prˇi psanı´pra´ce.

Prohla´sˇenı´

Prohlasˇuji, zˇe jsem svoji diplomovou pra´ci vypracoval samostatneˇs vyuzˇitı´m informacˇnı´ch zdroju˚, ktere´jsou v pra´ci citova´ny.

Brno 14. kveˇtna 2013 ...... Stanislav Geidl Contents

I Introduction1

II Theory4

1. Acid dissociation constant...... 5 1.1 Definition...... 5 1.2 Methods for pKa prediction...... 6

2. Atomic charges...... 8 2.1 QM approaches for charge calculation...... 8 2.1.1 Quantum mechanics...... 9 2.1.2 Charge distribution schemes...... 11 2.2 Empirical approaches for charge calculation...... 12 2.2.1 Electronegativity equalization method...... 13

3. Quantitative Structure-Property Relationship (QSPR)...... 15 3.1 Descriptors...... 15 3.2 Parametrization of models...... 16 3.3 Validation of models...... 18 3.3.1 Cross-validation...... 19

III Methods 20

4. Tools...... 21 4.1 NCI database...... 21 4.2 Physprop...... 21 4.3 ...... 21 4.4 AIMAll...... 22 4.5 EEM solver...... 22 4.6 QSPR designer...... 22 4.7 R...... 22

– ix – IV Results and discussion 23

5. Preparation of input data...... 24 5.1 Data sets...... 24 5.1.1 Data set for phenols...... 24 5.1.2 Data set for carboxylic acids...... 25 5.2 pKa values...... 25 5.3 Charge calculation...... 25 5.3.1 EEM parameter sets...... 25 5.3.2 QSPR model parameterization...... 26

6. Building QSPR models for phenols...... 27 6.1 Selection of descriptors...... 27 6.2 Models...... 27 6.2.1 Prediction of pKa using EEM charges...... 28 6.2.2 Improvement of EEM QSPR models by removing outliers.... 33 6.2.3 Improvement of EEM QSPR models by adding descriptors... 33 6.2.4 Comparison of EEM and QM QSPR models...... 34 6.2.5 Influence of theory level and basis set...... 34 6.2.6 Influence of population analysis...... 34 6.2.7 Influence of the EEM parameter set...... 35

7. Validation QSPR models...... 36 7.1 Cross-validation...... 36 7.2 Comparison with previous work...... 36

8. Case study for carboxylic acids...... 40 8.1 Selection of descriptors...... 40 8.2 Models...... 40

9. Publication outputs...... 42 9.1 QM models for phenols...... 42 9.2 EEM models for phenols...... 42 9.3 My contribution...... 43

V Conclusion 44

VI Appendices 47

A. Content of included CD...... 48

B. List of abbreviations...... 49 C. List of molecules...... 51 Part I

Introduction

– 1 – Introduction

The acid dissociation constant pKa is an important molecular property, and its values are of interest in pharmaceutical, chemical, biological and environmen- tal research. The pKa values have found application in many areas, such as the evaluation and optimization of candidate drug molecules [1–3], ADME profil- ing [4,5], pharmacokinetics [6], understanding protein-ligand interactions [7,8], etc.. Moreover, the key physicochemical properties like lipophilicity, solubility, and permeability are all pKa dependent. For these reasons, pKa values are impor- tant for virtual screening. Therefore, both the research community and pharma- ceutical companies are interested in the development of reliable and above all fast methods for pKa prediction. Several approaches for pKa prediction have been developed [8–11]. However, the prediction of pKa values still remains a challenge. A very promising approach for pKa prediction is to use QSPR models which employ partial atomic charges as descriptors [12–15]. The partial atomic charges cannot be determined experimentally, and they are also not quantum mechanical observables. For this reason, the rules for determin- ing partial atomic charges depend on their application (e.g. molecular mechanics energy, pKa etc.), and many different methods have been developed for their cal- culation. The charge calculation methods can be divided into two main groups, namely quantum mechanical (QM) approaches and empirical approaches. The quantum mechanical approaches first calculate a molecular wave func- tion by a combination of some theory level and basis set, and then partition this wave function among the atoms [16]. Empirical approaches determine par- tial atomic charges without calculating a quantum mechanical wave function. A well-known empirical charge calculation method is EEM (Electronegativity Equal- ization Method) [17]. This approach is based on the 3D structure of the molecule, and it employs empirical parameters derived from QM charges. The accuracy of EEM is comparable to the accuracy of the QM charge calculation approach from which the EEM parameters were derived. Additionally, EEM is very fast, as its computational complexity is θ(N3), where N is the number of atoms in the molecule. The prediction of pKa using QSPR models which employ QM atomic charges was described in several studies [12–15]. The studies analyzed the precision of this approach and compared the quality of QSPR models based on different QM charge calculation schemes. All these studies show that QM charges are successful

– 2 – Introduction 3 descriptors for pKa prediction, as the QSPR models based on QM atomic charges are able to calculate pKa with high accuracy. The weak point of QM charges is that their calculation is very slow, as the computational complexity is at least θ(B4), where B is the number of basis functions used during the QM calculation. There- fore, B is greater than the number of electrons in the molecule. For this reason, pKa prediction by QSPR models based on QM charges cannot be applied in vir- tual screening, as it is not feasible to compute QM atomic charges for hundreds of thousands of compounds in a reasonable time. This issue can be avoided if empirical charges are used instead of QM charges. Since the conformation of the atoms surrounding the dissociating hydrogens strongly influences the dissocia- tion process and therefore pKa, it is important that the atomic charges used as descriptors correctly account for the effect of the conformation. Employing EEM charges as descriptors of QSPR models is a promising solution for pKa predic- tion during virtual screening, because EEM charges consider the influence of the molecule’s conformation on the atomic charges, whiletheir calculation remains very inexpensive. Therefore, the present thesis focuses on pKa prediction using QSPR models which employ EEM charges as descriptors. The particular goals of the work are:

• Preparation of a training set containing phenol molecules.

• Retrieval of published EEM parameter sets from literature.

• Calculation of EEM atomic charges using all available EEM parameter sets.

• Calculation of corresponding QM charges.

• Creation of a QSPR model for each set of EEM and QM atomic charges, and prediction of pKa using all obtained QSPR models.

• Discussion of the accuracy of the predicted pKa values. Part II

Theory

– 4 – 1

Acid dissociation constant

The acid dissociation constant is a physicochemical property which describes the strength of an acid in solution. This property plays an important role in pharmaceutical, chemical, biological and environmental research. In particular, it can be used in the evaluation and optimization of candidate drug molecules [1–3], ADME profiling [4,5], pharmacokinetics [6], understanding protein-ligand interactions [7,8], etc..

1.1 Definition

The acid dissociation constant [18, 19](Ka) is the equilibrium constant for the dissociation of an acid. According to the Brønstred–Lowry theory [18, 19], an acid is a compound that can donate a proton, while a base is a compound that can receive a proton:

− + HA (acid) + H2O (base) A + H3O (1.1)

This equilibrium can be described using equilibrium concentrations, or using the change in standard Gibbs energy for the reaction:

[A−] K = [H O+] (1.2) a [HA] 3

−∆G◦ Ka = e RT (1.3) Because acid dissociation constants span many orders of magnitude, it is con- venient to report them as their decimal logarithm:

pKa = log10 Ka (1.4)

Examples of pKa values for common molecules are given in Table 1.1.

– 5 – Chapter 1. Acid dissociation constant 6

◦ 1 Table 1.1: Table of common acids and the values of their pKa at 25 C .

Name Molecular formula pKa Name Molecular formula pKa

Hydrochloric acid HCl -6.3 Formic acid HCO2H 3.74

Nitrous acid HNO2 3.35 Benzoic acid C6H5CO2H 4.19

Hydrofluoric acid HF 3.46 Acetic acid CH3CO2H 4.74

Hydrocyanic acid HCN 9.31 Phenol C6H5OH 9.99

1.2 Methods for pKa prediction

Several approaches for pKa prediction have been developed [8–11]. Common methods include:

LFER (Linear Free Energy Relationships) method [21, 22]: This is one of the first methods developed, and it is based on the fact that there is a linear relation between free energy and pKa. It uses the Hammett and Taft equations, and it is still used in the programs ACD/pKa [23], EPIK [24] and SPARC [25].

Database methods [26, 27]: These methods use databases of known pKa values. The pKa value of a compound of interest is inferred based on the pKas of similar molecules in the available databases. For this prediction, techniques such as interpolation or triangulation can be used. In order to cover all chemical space, it is necessary to have large databases.

Decision tree methods [27, 28]: This is a method which groups molecules into different categories according to various properties. The prediction has two steps: first, the unknown molecule is assigned to some group, and then the property of interest (in this case pKa) is computed based on the other molecules in this group. For example, the pKa of the molecule of interest can be calculated as an average of the pKas of the rest of the molecules in the group.

Ab initio quantum mechanical calculations [29, 30]: Such methods calculate the change in Gibbs energy for the reaction, and then use equation 1.3 for cal- culating pKa. However, there is no universal approach, and a specific con- figuration of the calculation needs to be setup for every type of molecule. Moreover, this prediction is very time-consuming and the calculation of the solvation energy is still complicated. On the other hand, this method can be very accurate if provided with the correct parameters. This method is imple- mented as a module of [31], a quantum chemical software package provided by Schro¨dinger.

1Source: Inorganic chemistry[18], except phenol, for this was used Physprop [20] Chapter 1. Acid dissociation constant 7

QSPR2methods [32–34]: These methods construct most commonly linear, but also nonlinear models for predicting a molecular property. The QSPR models describe the relations between the property of interest, and various indicators calculated based on the structure of the molecule. These indicators are referred to as descriptors (more in chapter 3). pKa correlates well with polarizability, HOMO energy [35], proton-transfer energy [35], partial atomic charges [12–15], the electrostatic potential of the molecule [36] etc..

However, pKa values remain one of the most challenging physicochemical properties to predict. A promising approach for pKa prediction is to use QSPR models which employ partial atomic charges as descriptors [12–15].

2Quantitave Structure-Property Relationship 2

Atomic charges

Partial atomic charges [37] are conceptual quantities assigned to every atom in a molecule to describe the electron density around any given atom. Partial charges result due to an asymmetric distribution of electrons in chemical bonds. When a neutral atom bonds chemically to a second neutral atom, but which is more electronegative (able to attract electrons), the electrons of the first atom tend to spend more time in the vicinity of the second atom. This leads to less electron density around the nucleus of the first (less electronegative) atom, creating a partial positive charge on this atom, while the second nucleus will be surrounded by more electron density, and thus the second atom will be assigned a partial negative charge. The molecular electron density can be determined by experiment or calcula- tion, as it is related to many chemical properties, such as molecular dipole [16], electrostatic potential [16], chemical shift [16] in NMR (Nuclear magnetic reso- nance spectroscopy), X-ray diffraction [16] or EPR [16] (Electron paramagnetic resonance) etc.. The of the molecular electron density and electro- static potential is given in Figure 2.1. Since there is no unique (universally valid) way to define the boundaries of an atom which is part of a molecule, there is no unique way to measure the amount of electron density which belongs to each atom, and subsequently calculate its partial charge. Many different approaches have been developed to calculate partial atomic charges.

2.1 QM approaches for charge calculation

The most common approach for charge calculation is to apply quantum me- chanics (QM) [16, 37], and subsequently employ a selected charge distribution scheme [37]. A disadvantage of this method is that it is very time-consuming. Its time complexity is at least θ(B4), where B is the number of basis functions used during the QM calculation. Therefore, B is greater than the number of electrons in the molecule.

– 8 – Chapter 2. Atomic charges 9

Figure 2.1: Examples of electron density visualization using electrostatic potential (left) and electron density map from X-ray (right).

2.1.1 Quantum mechanics

Systems having the size of a particle, an atom or a molecule (i.e., the microworld) cannot be described by classical Newtonian mechanics. The reason is Newtonian mechanics cannot handle special aspects of the microworld, specifically: quanti- zation of microparticle energy, Heisenberg’s uncertainty principle, wave-particle duality etc.. Therefore, the quantum theory [16, 37, 38] was developed to characterize the microworld. Quantum mechanics [16, 37, 38] provides a computational approach which describes molecular systems using the quantum theory. Quantum theory is based on the following principles [16]: Each physical system can be represented using a Hilbert space3. Each state of a system in a Hilbert space is fully described by a wavefunction ψ, which is a vector of this space. The wavefunction is the function ψ : R3 → R, whose domain is made up of coordinates of particles in a space. The range of ψ are real numbers whose squares describe the probability of occurrence of a particle in some point of space. The wavefunction can be obtained by solving the Schro¨dinger equation:

Hψ = Eψ (2.1) where E is the energy and H is the Hamiltonian operator. The Schro¨dinger equa- tion is a second-order differential equation, whose solutions are pairs (ψ, E) ful- filling this equation.

3A Hilbert space is a full unitary vector space. Chapter 2. Atomic charges 10

An exact (analytical) solution of the Schro¨dinger equation is possible only for the hydrogen atom (a system with two particles). For the molecule, it is necessary to introduce approximations into this equation. Two types of approximation can be used – an approximation of the Hamiltonian operator and an approximation of the wavefunction. Well-known approximations of the Hamiltonian are the Born-Oppenheimer approximation, effective Hamiltonian approximation etc. A set of approximations used together as an approach for solving of the Schro¨dinger equation is called a theory level. Several theory levels were published, the most common are: Hartree-Fock (HF) method [16, 37, 38]: In this method, the Schro¨dinger equa- tion is approximated via the system of Hartree-Fock equations:

Fiψi = εiψi (2.2) where Fi is the Fock operator of the electron i and represents an effective Hamil- tonian. ψi is the wavefunction of the electron i and εi is the Lagrangian multiplier of the electron i. The Hartree-Fock method is based on the following iterative process: First, a set of input solutions ψi of the Hartree-Fock equations is obtained. Afterwards, the Fock operators are calculated from these values. The Hartree-Fock equations with these Fock operators are solved and the new set of ψi values is obtained. These ψi values are used in the next iteration. HF method this way step by step refines the wavefunctions that correspond to lower and lower total energies. The process is repeated, until the energy values remain unchanged (self-consistent). Density functional theory (DFT) [37, 38]: Theory levels based on DFT oper- ate with electrons in terms of the electronic density distribution. DFT replaces the problem of solving the many electron Schro¨dinger equation by the problem of find- ing sufficiently accurate approximations to a universal functional of the electronic density. Particular examples of DFT theory levels are BLYP (named according its authors Becke, Lee, Yang and Parr), its extension B3LYP (BLYP combined with HF) and BP86 (gradient-corrected functional). The wavefunction is approximated by its representation using a basis set – a set of simple mathematical functions (i.e., Gaussian functions, Slater functions etc.). A lot of basis sets were published [39] and they differ in the type and number of basis functions. Examples of well-known basis sets are the following: STO-3G: This basis set approximates the wave function of each orbital by a set of three Gaussians4 and it is the simplest possible basis set. An analogous case is STO-nG. 6-31G: This notation means that each valence orbital5 is expressed by two sets of Gaussians (one with 3 Gaussians and the second with one Gaussian), while the other orbitals are expressed by one set of six Gaussians. Analogous cases are 3-21G, 6-311G, etc.. 6-31G*: The basis set is an extension of 6-31G, which has one more Gaussian

2 4Gaussian type functions - functions having the form e−α·r 5An orbital in an upper layer of an electronic shell. Chapter 2. Atomic charges 11

(called polarization function) to all atoms other than hydrogens (analogously all basis sets with *).

2.1.2 Charge distribution schemes

QM based charge calculation approaches can be divided in two main groups [37]: calculated directly from the wave function and calculated from the wave func- tion and based on some physical observation. The first group uses some orbital-based scheme (population analysis, PA) for calculating the atomic charge from the wave function. The first scheme was proposed by Mulliken [40, 41], and this population analysis now bears his name. Within the Mulliken Population Analysis (MPA), the atomic charge for each atom is computed from the electron density in the single AO (atomic orbital) basis functions, together with half of the electron density shared with other atoms. Mulliken partial charges prove to be very sensitive to basis-set size, and thus a comparison among MPA charges calculated at different theory levels is impossible. Moreover, Mulliken charges have a tendency to become unphysically large as basis sets become larger. Another population analysis is Lo¨wdin PA [42]. This scheme requires the tranformation of AO basis functions into an orthonormal set of basis functions using a symmetric orthogonalization scheme. Then, it follows the same procedure as MPA. These partial atomic charges have higher basis-set stability than MPA. Nevertheless, even Lo¨wdin charges can become unstable with large basis sets. Natural population analysis (NPA) [43] is very similar to Lo¨wdin PA, but this PA uses more complicated orthogonalization. The electron density around each atom is initially rendered as compact as possible, and futher diagonalization is carried out to preserve the shape of the strongly occupied atomic orbitals to as large an extent as possible. Atomic charges produced by NPA effectively converge to a stable value with increasing basis-set size. AIM (atoms-in-molecules) [44] belongs to the second type of QM approaches, where the calculation is based also on some physical observation. This approach was developed by Bader and his co-workers, and it is based on the idea that electron density measured by X-ray can be used directly in the calculation of partial charges. In particular, an atomic volume is defined as that region of space including the nucleus, that lies within all zero-flux surfaces surrounding the nucleus (as can be seen in Figure 2.2). Then partial atomic charges are defined as the nuclear charge minus the total number of electrons residing within the atomic basin. In the Hirshfeld charge analysis [45], the total electron density is divided between the different atoms of a system according to a scheme which preserves the proportional contribution of each atom in its neutral state to the molecular electron density. One last group of QM-based charge calculation approaches is made up of MK [46] (Merz-Kollman, CHELP) and CHELPG [47] schemes based on fitting the partial charges to the molecular ESP (electrostatic potential). There is only a small Chapter 2. Atomic charges 12

Figure 2.2: Electron density gradient paths in a plane containing atoms of the HCN molecule. The solid lines are the intersections of the zero-flux surface with the plane. The large black points are the bond critical points.

difference between these two methods. Such atomic charges have been found useful especially in modeling short to long range molecule-molecule interactions in molecular mechanics simulations [48].

2.2 Empirical approaches for charge calcula- tion

Empirical approaches determine partial atomic charges without calculating a quantum mechanical wave function for the given molecule. Therefore they are markedly faster than QM approaches. One of the first empirical approaches devel- oped, CHARGE [49], performs a breakdown of the charge transmission by polar atoms into one-bond, two-bond, and three-bond additive contributions. Most of the other empirical approaches have been derived on the basis of the elec- tronegativity equalization principle. One group of these empirical approaches invoke the Laplacian matrix formalism, and result in a redistribution of elec- tronegativity. Such methods are PEOE (partial equalization of orbital electroneg- ativity) [50], GDAC (geometry-dependent atomic charge) [51], KCM (Kirchhoff charge model) [52], DENR (dynamic electronegativity relaxation) [53] or TSEF (topologically symmetric energy function) [53]. The second group of approaches use full equalization of orbital electronegativity, and such approaches are, for example, EEM (electronegativity equalization method) [17], QEq (charge equili- bration) [54] or SQE (split charge equilibration) [55]. The empirical atomic charge calculation approaches can also be divided into ’topological’ and ’geometrical’. Topological charges are calculated using the 2D structure of the molecule, and they are conformationally independent (i.e., CHARGE, PEOE, KCM, DENR, and Chapter 2. Atomic charges 13

TSEF). Geometrical charges are computed from the 3D structure of the molecule and they consider the influence of conformation (i.e., GDAC, EEM, Qeq, and SQE).

2.2.1 Electronegativity equalization method

EEM is very popular, and despite the fact that it was developed more than twenty years ago, new parameterizations [17, 56–62] and modifications [59, 63, 64] of EEM are still under development. The EEM method is a geometrical empirical charge calculation approach, and therefore EEM charges consider the influence of the molecule’s conformation on the atomic charges. For this reason, EEM charges are often used as chemoinfor- matics descriptors [65] and they are also promissing inputs for QSPR models. EEM is based on three principles: 1. Sanderson’s electronegativity equalization principle. According to this principle, the efective electronegativity of each atom in the molecule is con- sidered to be equal to the molecular electronegativity χ¯:

χ1 = χ2 = ... = χN = χ¯ (2.3)

where χx is the efective electronegativity of the atom x. 2. The principle of the charge balance. The total charge Q of the molecule is equal to the sum of all atomic charges:

∑ qx = Q (2.4) x=1

where qx is the charge of the atom x. 3. The principle of the charge dependent electronegativity. This principle is the definition of the atomic electronegativity, and states that the electroneg- ativity of each atom can be expressed as a function of its charge:

N qy χx = Ax + Bxqx + κ ∑ (2.5) y=1 x6=y Rxy

where Rxy is the distance between atoms x and y, and the coeficients Ax, Bx and κ are empirical parameters. The previously described principles can be used to derivate a system of equa- tions of the form

 κ κ  B1 ··· −1     R1,2 R1,N q1 −A1 κ κ  B2 ··· −1  −  R2,1 R2,N   q2   A2     .   .   ......  ·  .  =  .  (2.6)  . . . . .   .   .  κ κ      ··· B −1   qN   −AN   RN,1 RN,2 N  1 1 ··· 1 0 χ¯ Q Chapter 2. Atomic charges 14

This system of equations contains N + 1 equations with N + 1 unknowns: q1, q2,..., qN and χ¯. The values of parameters Ai, Bi and κ can be obtained from literature [17, 56–62] or by a new parametrization. The accuracy of EEM is comparable to the accuracy of the QM charge calcu- lation approach for which it was parameterized. Additionally, EEM is very fast, as its computational complexity is θ(N3), where N is the number of atoms in the molecule. 3

QSPR

QSPR (Quantitative Structure-Property Relationship) [66, 67] models describe the relation between a property of interest and a structure, respectively, between the property and descriptors, which are indicators calculated from the structure. In other words, QSPR is a multidimensional function which quantifies the relation- ship between the physicochemical property of interest and the characteristics of molecular structure. The main application of the QSPR models is the prediction of various properties. Nowadays, a huge amount of experimental and calculated data is available about the structure of small organic molecules and big biomolecules. Advanced computational methods and large clusters or clouds of high performance comput- ers allow to obtain huge sets of descriptors [66, 67]. Therefore, the QSPR models provide a fast and effective approach to estimate physicochemical properties of compounds. The best feature of these models is that they can be used for pre- dicting sophisticated properties which can be obtained only by performing very demanding experiments or complicated calculations. The QSPR models are de- rived from the Quantitative Structure-Activity Relationship (QSAR) models [66], which are focused on estimating one specific property of a molecule – its biological activity. QSPR models are very often used in virtual screening, e.g. for designing new drugs [4].

3.1 Descriptors

Molecular descriptors [65, 66] are numerical values that are calculated from a struc- ture of a molecule. They are frequently used in chemoinformatics, bioinformatics, chemistry, pharmaceutical research and quality control. The reason is, these values characterize the properties of molecules. Therefore they provide a straightforward approach on how to represent these properties in an algorithmically processable way.

– 15 – Chapter 3. Quantitative Structure-Property Relationship (QSPR) 16

Thousands of molecular descriptors have been developed for various applica- tions [65, 66]. The descriptors are frequently divided into 1D, 2D, or 3D descriptors, depending on the dimensionality of the molecular structure from which they were calculated [65, 66]. 1D descriptors. These descriptors are based on 1D structure of a molecule (i.e., the type of atoms which the molecule contains). An example of a 1D descriptor is a number of carbons, molecular weight, etc. 2D descriptors. The 2D descriptors incorporate the topology of the molecule. Such descriptors can be obtained using counting (e.g., count of some type of bonds) or graph algoritms (e.g., number of rings), or using some empirical approach (e.g., Gasteiger-Marsilli charges, logP, etc.). 3D descriptors. These descriptors include information about a location of atoms in 3D space. For their calculation one can use some measurement (e.g., length of some specific bond, distance between two atoms, some angles or dihedral angles, etc.), QM calculations (e.g., atomic charges, energy, etc.), or empirical calculations (e.g., EEM charges, etc.). A very important application of the descriptors is, they are inputs of QSPR models.

3.2 Parametrization of models

Before using QSPR to predict the value of a physicochemical property of interest, it is necessary to build a QSPR model. For creating the classical QSPR model, two conditions must be met. First, the physicochemical property of the molecule must be a function of its structure, and therefore also a function of the descriptors. Second, the relationship between the property and the descriptors must be linear. The first condition is generally valid, although not all effects that are relevant for the molecular properties are directly related to the structure. The second premise clearly represents an approximation. Many structure-property relationships can be very well approximated by linear modeling, while others require the application of non-linear methods such as neural network (NN) simulations [68]. A QSPR model is a multidimensional function described by the equation:

P = c1D1 + c2D2 + ··· + cnDn + c (3.1) where P is the molecular property in linear relation with various descriptors Dx, with coefficients cx weighting their relative importance. Descriptors are often chosen based on knowledge or intuition. After the selection of the descriptors it is necessary to calculate the relevant coefficients. The coefficients are also called parameters, and the process of their calculation is denoted as the parameterization of the QSPR model. The parame- ters are calculated using training sets containing different molecules with known experimental values of the analyzed property. The multiple linear regression (MLR) [69] is a popular method to obtain applicable coefficients. Chapter 3. Quantitative Structure-Property Relationship (QSPR) 17

Table 3.1: Examples of descriptors classified according to dimensionality, demon- strated on the saccharin molecule.

Structure Representation of the molecule Descriptors

Number of nitrogen atoms 1D C H NO S 7 5 3 Molecular weight

Number of double and triple bonds

Number of rings 2D Gasteiger-Marsilli charges

logP

Distance between sulphur and nitrogen

Angles or dihedral angles 3D QM partial atomic charges

EEM charges Chapter 3. Quantitative Structure-Property Relationship (QSPR) 18 3.3 Validation of models

After a model has been obtained, it needs to be validated before its use for predic- tion, and the quality of the prediction needs to be evaluated. For this purpose of evaluating the quality of the model, the following indicators can be used: Pearson correlation coefficient (R) [69, 70] is widely used in science as a mea- sure of the strength of the linear dependence between two variables. It’s defined as the covariance of the two variables divided by the product of their standard deviations: s N calc calc exp exp  ∑ (P − P¯ ) · (Px − P¯x ) R = x=1 x x (3.2) N calc ¯calc 2 N exp ¯exp 2 ∑x=1(Px − Px ) · ∑x=1(Px − Px ) calc exp where Px is the calculated value of the property, and Px is the experimental ¯calc ¯exp calc exp value. Px , Px are average values of Px , respectively Px . The squared Pearson correlation coefficient, also called coefficient of determi- nation (R2) [69, 70], describes how well the regression line fits the data. If the regression line fits the data well, the value is close to 1, while if the fit is bad, the value is close to 0. Root mean square error (RMSE) [69, 70] is the normalized sum of squared errors, and provides us with an independent measure of the model’s reliability. RMSE can be calculated the following way: s epx ∑N (Pcalc − P )2 RMSE = x=1 x x (3.3) N

With respect to pKa prediction, the accuracy of a model is considered excellent if RMSE is smaller than about 0.5, and respectively sufficient if RMSE is smaller than about 0.8 to 1. Average absolute error (∆¯ ) [69] is a quantity used to measure how close the predictions are to the possible outcomes. It is given by:

epx ∑N |Pcalc − P | ∆¯ = x=1 x x (3.4) N Standard deviation of the estimation (s) [69] evaluates the extent of the vari- ation of the predicted value from the expected value. A low standard deviation indicates that the data points tend to be very close to the expected value; high standard deviation indicates that the data points are spread out over a large range of values. s N · RMSE2 s = (3.5) N − k − 1 where N is the number of molecules in the set, and k is the number of descriptors in the model. Fisher’s statistics of the regression (F) [69, 70] is a statistical test in which the probability distribution is under the null hypothesis, and is meant to check Chapter 3. Quantitative Structure-Property Relationship (QSPR) 19 whether there is indeed a relationship between the sets of predicted values and the set of expected values. R2(N − k − 1) F = (3.6) k(1 − R2) This quality measurement (validation) can be performed on the training set, i.e., the set of molecules, corresponding descriptors and values for the property of interest, which have been used for creating the QSPR model. The validation step is used to obtain information about the suitability of the model, descriptors or potential outliers. However, if a model appears to be highly accurate in predicting correct values for the training set, it doesn’t necessarily mean that this model will be as accurate when used for predictions of the same property for molecules outside the training set. From this reason, it is necessary to perform a validation also on a testing set, i.e., an additional set of molecules (not included in the training set), along with the corresponding descriptors and experimentally determined values of the property of interest. Alternatively, it is possible do a cross-validation.

3.3.1 Cross-validation

Cross-validation [71, 72] is a validation technique used to analyze if and how the results from the validation step could be generalized to an independent data set. It can give an estimate regarding how accurately a predictive model will perform in practice. One round of cross-validation involves dividing a sample data set into several similar subsets. Then, the QSPR model is parameterized and validated on one subset (training set), and subsequent validations are done for all other subsets (validation sets). Multiple rounds of cross-validation are performed using different combinations of subsets, and the validation results are averaged over the rounds. k-fold cross-validation [71, 72] is a common cross-validation method where the original sample is randomly partitioned into k, equal sized, subsamples. One single subsample is chosen as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples being used exactly once as validation data. The k results from the folds are then averaged (or otherwise combined) to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold or 5-fold cross-validation is commonly used, but in general k remains an unfixed parameter. Part III

Methods

– 20 – 4

Tools

4.1 NCI database

NCI database [73] is a database of molecules collected within the Developmental Therapeutics Program (DTP) of the National Cancer Institute. This database contains 260,071 structures in SDF format, with their 3D structure calculated by the program Corina [74], unique SMILES calculated by CACTVS according to Daylight’s original (1989) canonicalization rules, the IUPAC/NIST InChI chemical identifier, IUPAC names generated with ACD/Name Batch, and additional helpful properties.

4.2 Physprop

The Physical Properties Database (PHYSPROP) [20] contains chemical structures, names, and physical properties (melting point, boiling point, water solubility, octanol-water partition coefficient, vapor pressure, pKa, Henry’s Law Constant, and atmospheric OH rate constant) for over 41,000 chemicals. Physical properties are collected from experiment, or they were extrapolated or estimated.

4.3 Gaussian

Gaussian [75] is a software package for electronic structure modeling. Starting from the fundamental laws of quantum mechanics, Gaussian 09 predicts the en- ergies, molecular structures, vibrational frequencies and molecular properties of molecules and reactions. Gaussian can be used for predicting spectra (NMR, op- tical or hyperfine spectra) or the calculation of thermochemistry, photochemistry or solvent effects.

– 21 – Chapter 4. Tools 22 4.4 AIMAll

AIMAll [76] is a software package for performing quantitative and visual QTAIM (Quantum Theory of Atoms in Molecules) analyses of molecular systems - starting from molecular wavefunction data.

4.5 EEM solver

EEM solver [77] is a software tool for the calculation of empirical atomic charges using the EEM approach. The program is able to read EEM parameters from an input file and therefore it can be used with any EEM parameter set. The program includes a serial version applicable for calculation of charges in organic molecules, and a parallel version which is able to calculate charges for proteins.

4.6 QSPR designer

This sophisticated web application [78] is dedicated for QSPR modeling. It is easy to use, and thus allows for a quick design and validation of QSPR models. This software was used for selecting descriptors and validation data.

4.7R

R[79] is a language and a software environment for statistical calculation and visualization. R provides a wide variety of statistical (linear and nonlinear mod- elling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible. R was used for the calculation of all QSPR models and their cross-validation. Part IV

Results and discussion

– 23 – 5

Preparation of input data

5.1 Data sets

5.1.1 Data set for phenols

There are two main ways to create a QSPR model for a feature to be predicted. The first is to create as general a model as possible, with the risk that the accuracy of such a model may not be high. The second approach is to develop more models, each of them being dedicated to a certain class of compounds. Here we took the second approach, following a similar methodology as in previous studies [12–15]. Specifically, we focus on substituted phenols, because they are the most common test molecules employed in the evaluation of novel pKa prediction approaches [12– 15, 80–82]. Our data set contains the 3D structures of 74 distinct phenol molecules. This data set is of high structural diversity and it covers molecules with pKa val- ues from 0.38 to 11.1. The molecules were obtained from the NCI Open Database Compounds [73] and their 3D structures were generated by CORINA 2.6 [74], without any further geometry optimization. Our data set is a subset of the phenol data set used in our previous work related to pKa prediction from QM atomic charges [15]. The subset is made up of phenols which contain only C, O, N and H, and none of the molecules contain triple bonds. This limitation is necessary, because the EEM parameters of all 18 studied EEM parameter sets are available only for such molecules (see Table 5.1). For each phenol molecule from our data set, we also prepared the structure of the dissociated form, where the hydrogen is missing from the phenolic OH group. This dissociated molecule was created by removing the hydrogen from the original structure without subsequent geometry optimization. The list of the molecules, including their NCS numbers, CAS num- bers and experimental pKa values, can be found in the AppendixC (Table C.1). The SDF files with the 3D structures of molecules and their dissociated forms are on the attached CD.

– 24 – Chapter 5. Preparation of input data 25

5.1.2 Data set for carboxylic acids

An aspect which is very important for the applicability of the pKa prediction approach is its transferability to other chemical classes. In this work, we provide a case study showing the performance of the approach on carboxylic acids, which are also very common testing molecules for pKa prediction approaches [12, 33, 34, 83]. The data set contains 71 distinct molecules of carboxylic acids and their dissociated forms. The 3D structures of these molecules were obtained in the same way as for the phenols. The list of the molecules, including their NCS numbers, CAS numbers and experimental pKa values can be found in the AppendixC (Table C.2). The SDF files with the 3D structures of the molecules and their dissociated forms are included on the CD.

5.2p Ka values

The experimental pKa values were taken from the Physprop database [20].

5.3 Charge calculation

We calculated QM atomic charges for all the combinations of QM theory level, basis set and population analysis for which we have EEM parameters (see Ta- ble 5.1). Specifically, atomic charges were calculated for these eight QM ap- proaches: HF/STO-3G/MPA, HF/6-31G*/MK, and B3LYP/6-31G* with MPA, NPA, Hirshfeld, MK, CHELPG and AIM). The QM charge calculations were car- ried out using Gaussian09 [75]. In the case of AIM population analysis, the output from Gaussian09 was further processed by the software package AIMAll [76]. The EEM charges were calculated by the program EEM SOLVER [77] using each of the 18 EEM parameter sets.

5.3.1 EEM parameter sets

In our study, we used all EEM parameters published till now. Specifically, we found 18 different EEM parameters sets, published in 8 different articles [17, 56–62]. The parameters cover two QM theory levels (HF and B3LYP), two basis sets (STO- 3G and 6-31G*) and six population analyses (MPA, NPA, Hirshfeld, MK, CHELPG, AIM). Unfortunately, only some combinations of QM theory levels, basis sets and population analyses are available. On the other hand, more parameter sets were published for some combinations (i.e., 6 parameter sets for HF/STO-3G/MPA). All the parameter sets include parameters for C, O, N and H. Some sets include also parameters for S, P, halogens and metals. Most of the sets do not include parameters for C and N bonded by triple bond. Summary information about all these parameter sets is given in Table 5.1. Chapter 5. Preparation of input data 26

Table 5.1: Summary information about the parameter sets used in the present study.

QM theory level PA Parameter set Published by Year of Elements + basis set name publication included HF/STO-3G MPA Svob2007 cbeg2 Svobodova et al. [56] 2007 C, O, N, H, S Svob2007 cmet2 Svobodova et al. [56] 2007 C, O, N, H, S, Fe, Zn Svob2007 chal2 Svobodova et al. [56] 2007 C, O, N, H, S, Br, Cl, I Svob2007 hm2 Svobodova et al. [56] 2007 C, O, N, H, S, F, Cl, Br, I, Fe, Zn Baek1991 Baekelandt et al. [57] 1991 C, O, N, H, P Al, Si Mort1986 Mortier et al. [17] 1986 C, O, N, H HF/6-31G* MK Jir2008 hf Jirouskova et al. [58] 2008 C, O, N, H, S, F, Cl, Br, I, Zn B3LYP/6-31G* MPA Chaves2006 Chaves et al. [59] 2006 C, O, N, H, F Bult2002 mul Bultinck et al. [60] 2002 C, O, N, H, F NPA Ouy2009 Ouyang et al. [61] 2009 C, O, N, H, F Ouy2009 elem Ouyang et al. [61] 2009 C, O, N, H, F Ouy2009 elemF Ouyang et al. [61] 2009 C, O, N, H, F Bult2002 npa Bultinck et al. [60] 2002 C, O, N, H, F Hir. Bult2002 hir Bultinck et al. [60] 2002 C, O, N, H, F MK Jir2008 mk Jirouskova et al. [58] 2008 C, O, N, H, S, F, Cl, Br, I, Zn Bult2002 mk Bultinck et al. [60] 2002 C, O, N, H, F CHELPG Bult2002 che Bultinck et al. [60] 2002 C, O, N, H, F AIM Bult2004 aim Bultinck et al. [62] 2004 C, O, N, H, F

5.3.2 QSPR model parameterization

The parameterization of the QSPR models was done by multiple linear regression (MLR). 6

Building QSPR models for phenols

6.1 Selection of descriptors

Our descriptors were atomic charges. We analyzed two types of QSPR models, namely QSPR models with three descriptors (3d QSPR models) and QSPR models with five descriptors (5d QSPR models). The 3d QSPR models used those descriptors which proved to be the most relevant for pKa prediction in our previous study [15]. Therefore these descriptors were the atomic charge of the hydrogen atom from the phenolic OH group (qH), the charge on the oxygen atom from the phenolic OH group (qO), and the charge on the carbon atom binding the phenolic OH group (qC1). These descriptors were used to establish the QSPR models by the general equation:

pKa = pH · qH + pO · qO + pC1 · qC1 + p (6.1)

where pH, pO, pC1 and p are parameters of the QSPR model (i.e., constants derived by multiple linear regression). The 5d QSPR models employ the above mentioned descriptors qH, qO and qC1 and additionally also the charge on the − phenoxide O from the dissociated molecule (qOD), and the charge on the carbon atom binding this oxygen (qC1D). Using the charges from the dissociated molecules for pKa prediction was inspired by the work of Dixon et al. [33]. The equation of the 5d QSPR models is therefore:

0 0 0 0 0 0 pKa = pH · qH + pO · qO + pC1 · qC1 + pOD · qOD + pC1D · qC1D + p (6.2) 0 0 0 0 0 0 where pH, pO, pC1, pOD, pC1D and p are parameters of the QSPR model. 6.2 Models

We prepared one 3d QSPR model and one 5d QSPR model using atomic charges calculated by each of the above mentioned 18 EEM parameter sets. These models

– 27 – Chapter 6. Building QSPR models for phenols 28 are denoted 3d or 5d EEM QSPR models. Additionally, we created one 3d and one 5d QSPR model using atomic charges calculated by each of the corresponding 8 QM charge calculation approaches (denoted as 3d or 5d QM QSPR models). The data set of 74 phenol molecules was used for the parameterization of the QSPR models, and the obtained models were validated for all molecules in the data set. The parameterization of the 3d EEM QSPR models showed that several mo- lecules in the data set perform as outliers. For this reason, we created also EEM QSPR models without outliers (i.e., EEM QSPR models for which parameterization was done using a data set that excluded the previously observed outliers). These models are denoted 3d EEM QSPR WO models. We classified as outliers 10% of the molecules from our data set, which had the highest Cook’s square distance [84, 85]. Therefore the 3d EEM QSPR WO models were parameterized using 67 molecules, and their validation was also done on the data set excluding outliers. The quality of the QSPR models, i.e. the correlation between experimental pKa and the pKa calculated by each model, was evaluated using the squared Pearson correlation coefficient (R2), root mean square error (RMSE), and average absolute pKa error (∆), while the statistical criteria were the standard deviation of the estimation (s) and Fisher’s statistics of the regression (F). Table 6.1 contains the quality criteria (R2, RMSE, ∆) and statistical criteria (s and F) for all the QSPR models analyzed. All these models are statistically significant at p = 0.01. Since our data sets contained 74 and 67 molecules, respectively, the appropriate F value to consider was that for 60 samples. Thus, the 3d QSPR models are statistically significant (at p = 0.01) when F > 4.126 and the 5d QSPR models when F > 3.339. Table 6.2 summarizes the R2 of all QSPR models for ease of visual comparison, and Tables 6.3 and 6.4 provide a comparison of the models from specific points of view. The parameters of the QSPR models are summarized on the CD. The most relevant graphs of correlation between experimental and calculated pKa are visualized in Figure 6.1.

6.2.1 Prediction of pKa using EEM charges The key question we wanted to answer is whether EEM QSPR models are applica- ble for pKa prediction. For this purpose we selected a set of phenol molecules and generated QSPR models which used EEM atomic charges as descriptors. We then evaluated the accuracy of these models by comparing the predicted pKa values with the experimental ones. The results (see Tables 6.1, 6.2 and 6.3) clearly show that QSPR models based on EEM charges are indeed able to predict the pKa of phe- nols with very good accuracy. Namely, 63% of the EEM QSPR models evaluated 2 2 in this study were able to predict pKa with R > 0.9. The average R for all 54 EEM QSPR models considered was 0.9, while the best EEM QSPR model reached R2 = 0.924. Our findings thus suggest that EEM atomic charges may prove as efficient QSPR descriptors for pKa prediction. The only drawback of EEM is that EEM pa- rameters are currently not available for some types of atoms. Nevertheless, EEM parameterization is still a topic of research, therefore more general parameter sets are being developed. Chapter 6. Building QSPR models for phenols 29

Figure 6.1: Correlation graphs. Graphs showing the correlation between experi- mental and calculated pKa for selected QSPR models.

HF/STO-3G/MPA/3d B3LYP/6-31G*/MPA/3d B3LYP/6-31G*/NPA/3d B3LYP/6-31G*/MK/3d 12 12 12 12

10 10 10 10

8 8 8 8

6 6 6 6

calculated pKa 4 calculated pKa 4 calculated pKa 4 calculated pKa 4

2 2 2 2

0 0 0 0 0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 experimental pKa experimental pKa experimental pKa experimental pKa HF/STO-3G/MPA/5d B3LYP/6-31G*/MPA/5d B3LYP/6-31G*/NPA/5d B3LYP/6-31G*/MK/5d 12 12 12 12

10 10 10 10

8 8 8 8

6 6 6 6

calculated pKa 4 calculated pKa 4 calculated pKa 4 calculated pKa 4

2 2 2 2

0 0 0 0 0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 experimental pKa experimental pKa experimental pKa experimental pKa Svob2007_chal2/3d Chaves2006/3d Bult2002_npa/3d Bult2002_mk/3d 12 12 12 12

10 10 10 10

8 8 8 8

6 6 6 6

calculated pKa 4 calculated pKa 4 calculated pKa 4 calculated pKa 4

2 2 2 2

0 0 0 0 0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 experimental pKa experimental pKa experimental pKa experimental pKa Svob2007_chal2/3d WO Chaves2006/3d WO Bult2002_npa/3d WO Bult2002_mk/3d WO 12 12 12 12

10 10 10 10

8 8 8 8

6 6 6 6

calculated pKa 4 calculated pKa 4 calculated pKa 4 calculated pKa 4

2 2 2 2

0 0 0 0 0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 experimental pKa experimental pKa experimental pKa experimental pKa Svob2007_chal2/5d Chaves2006/5d Bult2002_npa/5d Bult2002_mk/5d 12 12 12 12

10 10 10 10

8 8 8 8

6 6 6 6

calculated pKa 4 calculated pKa 4 calculated pKa 4 calculated pKa 4

2 2 2 2

0 0 0 0 0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 0 2 4 6 8 10 12 experimental pKa experimental pKa experimental pKa experimental pKa Chapter 6. Building QSPR models for phenols 30

Table 6.1: Quality criteria and statistical criteria for all the QSPR models analyzed in the present study and focused on phenols.

QM theory level PA EEM parameter QSPR model R2 RMSE ∆ s F + basis set set name HF/STO-3G MPA - 3d QM 0.9515 0.490 0.388 0.504 458 - 5d QM 0.9657 0.412 0.310 0.430 358 Svob2007 cbeg2 3d EEM 0.8671 0.812 0.571 0.835 152 3d EEM WO 0.9239 0.482 0.382 0.497 255 5d EEM 0.9179 0.638 0.481 0.666 152 Svob2007 cmet2 3d EEM 0.8663 0.814 0.577 0.837 151 3d EEM WO 0.9239 0.482 0.386 0.497 255 5d EEM 0.9189 0.634 0.476 0.661 154 Svob2007 chal2 3d EEM 0.8737 0.792 0.554 0.814 161 3d EEM WO 0.9127 0.483 0.387 0.498 220 5d EEM 0.9203 0.629 0.473 0.656 157 Svob2007 hm2 3d EEM 0.8671 0.812 0.578 0.835 152 3d EEM WO 0.9241 0.481 0.387 0.496 256 5d EEM 0.9179 0.638 0.478 0.666 152 Baek1991 3d EEM 0.9099 0.669 0.531 0.688 236 3d EEM WO 0.9166 0.531 0.423 0.548 231 5d EEM 0.9195 0.632 0.493 0.659 155 Mort1986 3d EEM 0.8860 0.752 0.577 0.773 181 3d EEM WO 0.9151 0.520 0.405 0.536 226 5d EEM 0.9142 0.652 0.524 0.680 145 HF/6-31G* MK - 3d QM 0.8405 0.890 0.727 0.915 123 - 5d QM 0.8865 0.750 0.641 0.782 106 Jir2008 hf 3d EEM 0.8612 0.830 0.582 0.853 145 3d EEM WO 0.9182 0.500 0.394 0.516 236 5d EEM 0.9154 0.648 0.488 0.676 147 B3LYP/6-31G* MPA - 3d QM 0.9671 0.404 0.317 0.415 686 - 5d QM 0.9724 0.370 0.274 0.386 479 Chaves2006 3d EEM 0.891 0.735 0.570 0.756 191 3d EEM WO 0.9198 0.505 0.398 0.521 241 5d EEM 0.9192 0.633 0.489 0.660 155 Bult2002 mul 3d EEM 0.8876 0.747 0.589 0.768 184 3d EEM WO 0.9151 0.520 0.416 0.536 226 5d EEM 0.9158 0.646 0.504 0.674 148 Chapter 6. Building QSPR models for phenols 31

Table 6.1: Summary information about the parameter sets used in the present study (continued).

QM theory level PA EEM parameter QSPR model R2 RMSE ∆ s F + basis set set name B3LYP/6-31G* NPA - 3d QM 0.9590 0.451 0.349 0.464 546 - 5d QM 0.9680 0.399 0.295 0.416 411 Ouy2009 3d EEM 0.8731 0.793 0.541 0.815 161 3d EEM WO 0.9043 0.505 0.379 0.521 198 5d EEM 0.9094 0.670 0.503 0.699 137 Ouy2009 elem 3d EEM 0.8727 0.795 0.546 0.817 160 3d EEM WO 0.9113 0.487 0.382 0.502 216 5d EEM 0.9132 0.656 0.495 0.684 143 Ouy2009 elemF 3d EEM 0.8848 0.756 0.519 0.777 179 3d EEM WO 0.9012 0.512 0.386 0.528 192 5d EEM 0.8866 0.750 0.520 0.782 106 Bult2002 npa 3d EEM 0.9044 0.689 0.532 0.708 221 3d EEM WO 0.9098 0.523 0.405 0.539 212 5d EEM 0.9180 0.638 0.488 0.666 152 Hir. - 3d QM 0.9042 0.689 0.503 0.708 220 - 5d QM 0.9477 0.509 0.356 0.531 246 Bult2002 hir 3d EEM 0.8415 0.887 0.636 0.912 124 3d EEM WO 0.8838 0.579 0.414 0.597 160 5d EEM 0.9050 0.687 0.522 0.717 130 MK - 3d QM 0.8447 0.878 0.705 0.903 127 - 5d QM 0.8960 0.718 0.594 0.749 117 Jir2008 dft 3d EEM 0.8696 0.804 0.555 0.827 156 3d EEM WO 0.9224 0.487 0.371 0.502 250 5d EEM 0.9148 0.650 0.489 0.678 146 Bult2002 mk 3d EEM 0.8639 0.822 0.610 0.845 148 3d EEM WO 0.9053 0.519 0.384 0.535 201 5d EEM 0.9131 0.657 0.508 0.685 143 Chel. - 3d QM 0.8528 0.854 0.712 0.878 135 - 5d QM 0.9087 0.673 0.552 0.702 135 Bult2002 che 3d EEM 0.8695 0.805 0.597 0.828 155 3d EEM WO 0.8863 0.588 0.436 0.606 164 5d EEM 0.9057 0.684 0.540 0.714 131 AIM - 3d QM 0.9609 0.440 0.332 0.452 573 - 5d QM 0.9677 0.400 0.285 0.417 407 Bult2004 aim 3d EEM 0.8646 0.819 0.619 0.842 149 3d EEM WO 0.8972 0.590 0.438 0.608 183 5d EEM 0.9017 0.698 0.571 0.728 125 Chapter 6. Building QSPR models for phenols 32

2 Table 6.2: R for the correlation between calculated and experimental pKa.

QM theory level PA EEM parameter R2 of QSPR model + basis set set name 3d EEM 3d EEM WO 5d EEM 3d QM 5d QM HF/STO-3G MPA Svob2007 cbeg2 0.8671 0.9239 0.9179 0.9515 0.9657 Svob2007 cmet2 0.8663 0.9239 0.9189 Svob2007 chal2 0.8737 0.9127 0.9203 Svob2007 hm2 0.8671 0.9241 0.9179 Baek1991 0.9099 0.9166 0.9195 Mort1986 0.8860 0.9151 0.9142 HF/6-31G* MK Jir2008 hf 0.8696 0.9182 0.9154 0.8405 0.8865 B3LYP/6-31G* MPA Chaves2006 0.8910 0.9198 0.9192 0.9671 0.9724 Bult2002 mul 0.8876 0.9151 0.9158 NPA Ouy2009 0.8731 0.9043 0.9094 0.9590 0.9680 Ouy2009 elem 0.8727 0.9113 0.9132 Ouy2009 elemF 0.8848 0.9012 0.8866 Bult2002 npa 0.9044 0.9098 0.9180 Hir. Bult2002 hir 0.8415 0.8838 0.9050 0.9042 0.9477 MK Jir2008 mk 0.8696 0.9224 0.9148 0.8447 0.8960 Bult2002 mk 0.8639 0.9053 0.9131 Chel. Bult2002 che 0.8695 0.8863 0.9057 0.8528 0.9087 AIM Bult2004 aim 0.8646 0.8972 0.9017 0.9609 0.9677

Legend excellent very good good satisfactory acceptable weak R2 0.95 – 0.97 0.92 – 0.95 0.91 – 0.92 0.9 – 0.91 0.85 – 0.9 0.8 – 0.85

2 Table 6.3: Average R between experimental and predicted pKa for all QSPR models of a certain type and percentages of QSPR models whose R2 values are in a certain interval.

QSPR model 3d EEM 3d EEM WO 5d EEM 3d QM 5d QM Average R2 0.876 0.911 0.913 0.929 0.951 Interval of R2 R2 > 0.9 11% 83% 94% 78% 83% 0.9 ≥ R2 > 0.85 83% 17% 6% 6% 17% 0.85 ≥ R2 > 0.8 6% 0% 0% 17% 0% QSPR model EEM based models QM based models Average R2 0.900 0.940 Interval of R2 R2 > 0.9 63% 81% 0.9 ≥ R2 > 0.85 35% 13% 0.85 ≥ R2 > 0.8 2% 6% Chapter 6. Building QSPR models for phenols 33

2 Table 6.4: Average R between experimental and predicted pKa for all QSPR models using atomic charges calculated by a specific combination of theory level and basis set, or by a specific population analysis.

QSPR model 3d EEM 3d EEM WO 5d EEM 3d QM 5d QM Theory level HF/STO-3G 0.878 0.919 0.918 0.952 0.966 and basis set a B3LYP/6-31G* 0.889 0.917 0.918 0.967 0.972 Population MPA 0.889 0.917 0.918 0.967 0.972 analysis b NPA 0.884 0.907 0.907 0.959 0.968 Hirshfeld 0.842 0.884 0.905 0.904 0.948 MK 0.867 0.914 0.914 0.845 0.896 CHELPG 0.870 0.886 0.906 0.853 0.909 AIM 0.865 0.897 0.902 0.961 0.968 a Only QSPR models employing MPA were included in this analysis. b Only QSPR models using B3LYP/6-31G* were included in this analysis.

6.2.2 Improvement of EEM QSPR models by remov- ing outliers

The quality of 3d EEM QSPR models can be markedly increased by removing the outliers. In this case, the models have average R2 = 0.911 and 83% of them have R2 > 0.9. The disadvantage of these models is that they are not able to cover the complete data set (i.e., 10% of molecules must be excluded as outliers). On the other hand, the outliers are similar for all EEM QSPR models. For ex- ample, while 16 molecules from our data set are outliers for at least one parameter set, 10 out of these 16 molecules are outliers for five or more parameter sets. From the chemical point of view, most of the outliers contain one or more nitro groups. This may be related to reported lower accuracy of EEM for these groups [60]. In general one limitation of the 3d EEM QSPR models is that they are very sensitive to the quality of EEM charges. Therefore, if the EEM charges are inaccurate for certain compounds or class of compounds, the 3d QSPR models based on these EEM charges will have lower performance for these compounds or class of com- pounds. In addition, a lower experimental accuracy of these pKa values may also be a reason for low performance in some cases. A table containing information about outlier molecules is given on the CD.

6.2.3 Improvement of EEM QSPR models by adding descriptors

Our first EEM QSPR models contained three descriptors (3d), namely atomic charges originating from the non-dissociated molecule. Nonetheless, in our study we found that using two additional charge descriptors from the dissoci- ated molecule can markedly improve the predictive power of the EEM QSPR models. Tables 6.1, 6.2 and 6.3 show that these new 5d EEM QSPR models provide Chapter 6. Building QSPR models for phenols 34 better pKa prediction than their corresponding 3d EEM QSPR models. Specifi- cally, adding the descriptors derived from the dissociated molecules increased the average R2 value for the EEM QSPR models from 0.876 to 0.913.

6.2.4 Comparison of EEM and QM QSPR models

Another important question is how accurate the EEM QSPR models are in com- parison with QM QSPR models. Tables 6.1 and 6.2 show that QM QSPR models provide, in most cases, more precise predictions. This is confirmed also by the average R2 values from Table 6.3. This is not surprising, since EEM is an empirical method which just mimics the QM approach for which it was parameterized. An interesting fact is that the differences in accuracy between QM QSPR models and EEM QSPR models are not substantial. For example, 5d EEM QSPR models have average R2 = 0.913, while 5d QM QSPR models have average R2 = 0.951. We also note that adding more descriptors to a QM QSPR model brings less improvement than adding more descriptors to an EEM QSPR model.

6.2.5 Influence of theory level and basis set

EEM parameters are available only for a relatively small number of theory levels (HF and B3LYP) and basis sets (STO-3G and 6-31G*). Therefore we can not perform such a deep analysis of theory level and basis set influence on pKa prediction capability from EEM atomic charges, as was done for QM QSPR models by Gross et al. [13] or Svobodova et al. [15]. We can only compare the models employing HF/STO-3G and B3LYP/6-31G* charges, as these are the only combinations for which EEM parameters are available for the same population analysis (MPA). Therefore we can study only the influence of the combination of theory level / basis set, and not the isolated influence of the theory level or basis set. Our analysis revealed that B3LYP/6-31G* charges provide slightly more accurate QM QSPR models than HF/STO-3G charges (see Table 6.4). This is in agreement with our previous findings [15], and it can be explained by the fact that 6-31G* is a more robust basis set than STO-3G. However, the difference is not marked in the case of EEM QSPR models.

6.2.6 Influence of population analysis

Eleven EEM parameter sets were published for B3LYP/6-31G* with six different population analyses (see Table 6.1). Therefore it is straightforward to analyze the influence of the population analysis on the predictive power of the resulting QSPR models (see Table 6.4). We found that MPA and NPA provide the best QM models, while MK and CHELPG (PAs based on fitting the atomic charges to the molecular electrostatic potential) provide weak QM models. Our results are in agreement with previous studies [13, 15]. QM QSPR models based on AIM predict pKa with accuracy comparable to MPA and NPA. In the case of EEM QSPR models, we did Chapter 6. Building QSPR models for phenols 35 indeed find that MPA provided the best models, but most of the other population analyses gave comparable results. This confirms previous observations that the EEM approach is not able to faithfully mimic MK charges [86]. On the other hand, this drawback of EEM allowed the EEM QSPR models employing MK charges to predict pKa more accurately than the corresponding QM QSPR models.

6.2.7 Influence of the EEM parameter set

Two or more EEM parameter sets are available in literature for four combinations of theory level, basis set and population analysis (see Table 5.1). We found that the quality of EEM QSPR models employing the same types of charges slightly varies when using EEM parameters coming from different studies (see Tables 6.1 and 6.2). Even EEM parameters from the same study, but obtained by different approaches, lead to QSPR models of slightly different quality. In any case, these differences are minimal. 7

Validation QSPR models

7.1 Cross-validation

Our results show that 5d EEM QSPR models provide a fast and accurate approach for pKa prediction. Nonetheless, the robustness of these models should be proved. Therefore, all the 5d EEM QSPR models (i.e., 18 models) were tested by cross- validation. For comparison, also the cross-validation of all 5d QM QSPR models (i.e., 8 models) was done. The k-fold cross-validation procedure was used [71, 72], where k = 5. Specifically, the set of phenol molecules was divided into five parts (each contained 20% of the molecules). The division was done randomly, and included stratification by pKa value. Afterwards, five cross-validation steps were performed. In the first step, the first part was selected as a test set, and the remaining four parts were taken together as the training set. The test and training sets for the other steps were prepared in a similar manner, by subsequently considering one part as a test set, while the remaining parts served as a training set. For each step, the QSPR model was parameterized on the training set. Afterwards, the pKa values of the respective test molecules were calculated via this model, and compared with experimental pKa values. The results are summarized on the CD, while the cross-validation results for the best and the worst performing 5d EEM QSPR models are shown in Table 7.1. The cross-validation showed that the models are stable and the values of R2 and RMSE are similar for the test set, the training set and the complete set. The robustness of EEM QSPR models and QM QSPR models is comparable.

7.2 Comparison with previous work

QM QSPR models for pKa prediction in phenols, similar to those presented in this paper (i.e., employing similar charges) were previously published by Gross and Seybold [13], Kreye and Seybold [14] and Svobodova and Geidl [15]. Table 7.2 shows a comparison between these models and the models developed in this

– 36 – Chapter 7. Validation QSPR models 37

Table 7.1: Comparison of the quality criteria and statistical criteria for the training set, test set and complete set for some selected charge calculation approaches.

5d EEM QSPR model employing Svob2007 chal2 EEM parameters: Complete set: R2 RMSE s F Number of molecules 0.920 0.629 0.647 269 74 Cross-validation: Cross- Training set Test set validation Number of Number of step R2 RMSE s F molecules R2 RMSE s F molecules 1 0.9283 0.5211 0.5498 137 59 0.9202 1.0754 1.3884 21 15 2 0.9210 0.6538 0.6899 124 59 0.9029 0.5394 0.6963 17 15 3 0.9191 0.6442 0.6796 120 59 0.9275 0.5823 0.7517 23 15 4 0.9207 0.6244 0.6588 123 59 0.9271 0.6878 0.8880 23 15 5 0.9274 0.6302 0.6643 138 60 0.9008 0.6678 0.8834 15 14

5d EEM QSPR model employing Ouy2009 elemF EEM parameters: Complete set: R2 RMSE s F Number of molecules 0.8866 0.7501 0.7825 106 74 Cross-validation: Cross- Training set Test set validation Number of Number of step R2 RMSE s F molecules R2 RMSE s F molecules 1 0.8936 0.6349 0.6698 89 59 0.8704 1.2857 1.6598 12 15 2 0.8953 0.7526 0.7940 91 59 0.8018 0.7802 1.0072 7 15 3 0.8908 0.7481 0.7893 86 59 0.8647 0.7983 1.0306 12 15 4 0.8821 0.7614 0.8033 79 59 0.9154 0.7481 0.9658 19 15 5 0.8956 0.7557 0.7966 93 60 0.8089 0.8396 1.1107 7 14 Chapter 7. Validation QSPR models 38 study. Our work is the first which presents QSPR models for pKa prediction based on EEM charges. Therefore, we can not provide a comparison between EEM QSPR models, but we can compare against QSPR models based on QM charges only. It is seen therein that our 3d QM QSPR models show markedly higher R2 and F values than the models published by Gross and Seybold and Kreye and Seybold (even if some of these models employ higher basis sets) and comparable R2 and F values as models published by Svobodova and Geidl. Moreover, our 5d QM QSPR models outperform the models from Svobodova and Geidl. Our best EEM QSPR models (i.e., 5d EEM QSPR models) provide even better results than QM QSPR models from Gross and Seybold and Kreye and Seybold. These EEM QSPR models are not as accurate as the QM QSPR models published by Svobodova and Geidlor those developed in this work, but the loss of accuracy is not too high (R2 values are still > 0.91). Chapter 7. Validation QSPR models 39 a a b b e c d Source This work This work This work This work This work This work This work This work This work Gross and Seybold [ 13 ] Gross and Seybold [ 13 ] Gross and Seybold [ 13 ] Gross and Seybold [ 13 ] Gross and Seybold [ 13 ] Gross and Seybold [ 13 ] Kreye and Seybold [ 14 ] Kreye and Seybold [ 14 ] Kreye and Seybold [ 14 ] Kreye and Seybold [ 14 ] Svobodova and Geidl [ 15 ] Svobodova and Geidl [ 15 ] Svobodova and Geidl [ 15 ] Svobodova and Geidl [ 15 ] Svobodova and Geidl [ 15 ] Svobodova and Geidl [ 15 ] EEM parameter set Chaves2006. d 15 15 15 15 19 19 74 74 74 19 19 74 74 74 19 19 74 74 74 124 124 124 124 124 124 molecules Number of 9 F 48 38 95 38 38 173 134 986 545 705 261 179 144 605 936 685 822 265 185 168 126 201 250 1013 s 1.300 1.500 0.970 1.000 0.252 0.283 0.440 0.435 0.464 0.410 0.656 0.248 0.274 0.556 0.450 0.415 0.380 0.651 0.682 0.467 0.941 0.978 0.902 0.739 0.669 2 R 0.789 0.731 0.880 0.865 0.911 0.887 0.961 0.962 0.959 0.968 0.918 0.913 0.894 0.938 0.959 0.967 0.972 0.919 0.344 0.692 0.822 0.808 0.845 0.896 0.915 EEM parameter set Bult2002 npa. c , , , , 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 D D D D D D C C C C C C C C C C C C C C C 1 1 1 1 1 1 q q q q q q q q q q q q q q q C C C C C C , , , , , , , , , , , , , , , − − − q q q q q q H H H O O O O O O O O O O O O O O O O O O O O , , , , , , OH OH q q q q q q q q q q q q q q q q q q q q q q q q q , , , , , , , , , , , , , , , OD OD OD OD OD OD H H H H H H H H H H H H H H H q q q q q q q q q q q q q q q q q q q q q Descriptors 6-31G* 6-31G* 6-31G* 6-31G* 6-31G* 6-31G* 6-31G* 6-31G* 6-31G* 6-31G* 6-31G* 6-31G* 6-311G 6-311G 6-311G 6-31+G* 6-31+G* Basis set 6-311G** 6-311G** With solvent model SM8. 6-311G(d,p) 6-311G(d,p) 6-311G(d,p) 6-311G(d,p) 6-311G(d,p) 6-311G(d,p) b PA MK MK MK MK MK MK MK NPA NPA NPA NPA NPA NPA NPA NPA NPA NPA NPA MPA MPA MPA MPA MPA MPA MPA level B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP B3LYP Theory QM QM QM EEM EEM EEM With solvent model SM5.4. EEM parameter set Jir2008 mk. Method a e Table 7.2: Comparison between the performance of the QSPR models developed here, and previously developed models. 8

Case study for carboxylic acids

8.1 Selection of descriptors

The descriptors were again atomic charges and, similarly as for phenols, two types of QSPR models were developed and evaluated. Specifically, QSPR models with four descriptors (4d QSPR models) and QSPR models with seven descriptors (7d QSPR models). The 4d QSPR models used similar descriptors as the 3d models for phenols - the atomic charge of the hydrogen atom from the COOH group (qH), the charge on the hydrogen bound oxygen atom from the COOH group (qO), and the charge on the carbon atom binding the COOH group (qC1). Additionally, also the charge of the second carboxyl oxygen (qO2) is included. These 4d QSPR models are represented by the equation:

pKa = pH · qH + pO · qO + pO2 · qO2 + pC1 · qC1 + p (8.1)

where pH, pO, pO2, pC1 and p are parameters of the QSPR model. The 7d QSPR models employ also charges from the dissociated forms, namely the charge on the carboxyl oxygens (qOD, qO2D) and the charge on the carboxylic carbon atom (qC1D). The equation of the 7d QSPR models is therefore:

0 0 0 0 pKa = pH · qH + pO · qO+pO2 · qO2 + pC1 · qC1+ 0 0 0 0 (8.2) pOD · qOD + pO2D · qO2D + pC1D · qC1D + p 0 0 0 0 0 0 0 0 where pH, pO, pO2, pC1, pOD, pO2D, pC1D and p are parameters of the QSPR model.

8.2 Models

We have shown that QSPR models based on EEM atomic charges can be used for predicting pKa in phenols. In order to evaluate the general applicability of this

– 40 – Chapter 8. Case study for carboxylic acids 41

Table 8.1: Correlation between calculated and experimental pKa for carboxylic acids.

QM theory level PA EEM parameter R2 of QSPR model + basis set set name 7d EEM 7d QM HF/STO-3G MPA Svob2007 cbeg2 0.8831 0.9327 Svob2007 cmet2 0.8810 Svob2007 chal2 0.8822 Svob2007 hm2 0.8793 Baek1991 0.9211 Mort1986 0.9176 B3LYP/6-31G* MPA Chaves2006 0.9238 0.9059 Bult2002 mul 0.9248 NPA Ouy2009 0.8825 0.9169 Ouy2009 elem 0.8777 Ouy2009 elemF 0.8478 Bult2002 npa 0.9094

approach for pKa prediction, we tested the performance of such models for car- boxylic acids. This case study is done for the charge schemes found to provide the best QM and EEM QSPR models in the case of phenols. Specifically, QM charges calculated by HF/STO-3G/MPA, B3LYP/6-31G*/MPA and B3LYP/6-31G*/NPA, and EEM charges calculated using the corresponding EEM parameters. Because 5d QSPR models provide the most accurate prediction for phenols, the case study is focused on their analogue for carboxylic acids, i.e., 7d QSPR models. Squared Pearson correlation coefficients of the analysed QSPR models are summarized in Table 8.1, and all the quality and statistical criteria can be found on the CD. The results show that 7d EEM QSPR models are able to predict the pKa of carboxylic acids with very good accuracy. Namely, 5 out of 12 analysed 7d EEM QSPR 2 models were able to predict pKa with R > 0.9, while the best EEM QSPR model reached R2 = 0.925. Therefore, we concluded that EEM QSPR models are indeed applicable also for carboxylic acids. Again QM QSPR models perform better than EEM QSPR models, but the differences are not substantial. 9

Publication outputs

9.1 QM models for phenols

I presented the results as a poster in the conference 9th Discussions in Structural Molecular Biology:

• S. Geidl, R. Svobodova´Varˇekova´, O. Skrˇehota, M. Kudera, C.-M. Ionescu, D. Sehnal, T. Bouchal and J. Kocˇa. Predicting pKa values of substituted phenols from atomic charges. In 9th Discussions in Structural Molecular Biology, Nove´ Hrady, Czech Republic, 2011.

Afterwards, we published an article in the Journal of Chemical Information and Modeling (impact 4.675):

• R. Svobodova´Varˇekova´, S. Geidl, C.-M. Ionescu, O. Skrˇehota, M. Kudera, D. Sehnal, T. Bouchal, R. A. Abagyan, H. J. Huber and J. Kocˇa. Predicting pKa values of substituted phenols from atomic charges: Comparison of different quantum mechanical methods and charge distribution schemes. Journal of Chemical Information and Modeling, 51(8):1795–1806, 2011.

9.2 EEM models for phenols

I presented preliminary results at the 3rd Strasbourg Summer School on Chemoin- formatics.

• S. Geidl, R. Svobodova´Varˇekova´, C.-M. Ionescu, O. Skrˇehota, T. Bouchal, M. Kudera, D. Sehnal, R. A. Abagyan and J. Kocˇa.Predicting pKa values of substituted phenols by QSPR models which employ EEM atomic charges. 3rd Strasbourg Summer School on Chemoinformatics - Strasbourg, France, 2012.

The final results were published in the Journal of Chemoinformatics (impact 3.42):

– 42 – Chapter 9. Publication outputs 43

• R. Svobodova´Varˇekova´, S. Geidl, C.-M. Ionescu, O. Skrˇehota, T. Bouchal, D. Sehnal, R. A. Abagyan and J. Kocˇa. Predicting pKa values from EEM atomic charges. Journal of Chemoinformatics, 5:18, 2013.

Note: In this article, Dr. R. Svobodova´Varˇekova´and I share first authorship.

9.3 My contribution

The concept of the study originated from Prof. Kocˇa and was reviewed and ex- tended by Prof. Abagyan. The design was put together by me and Dr. Svobodova´ Varˇekova´and reviewed by Prof. Kocˇa and Prof. Abagyan. I and Dr. Ionescu collected and prepared the input data. I, Dr. Skrˇehota, Dr. Sehnal and Bc. Bouchal performed the acquisition and post-processing of data. The data were analyzed and interpreted by me, Dr. Svobodova´Varˇekova´, Dr. Ionescu and Prof. Kocˇa. The poster presentations at the conferences were given by me. The article for Journal of Cheminformatics was written by me and Dr. Svobodova´Varˇekova´in cooperation with Prof. Kocˇa and Dr. Ionescu, and reviewed by all co-authors. Part V

Conclusion

– 44 – Conclusion

The acid dissociation constant pKa is a very important molecular property, and there is a strong interest in the development of reliable and fast methods for pKa prediction. We have evaluated the pKa prediction capabilities of QSPR models based on empirical atomic charges calculated by the Electronegativity Equaliza- tion Method (EEM). Specifically, we collected 18 EEM parameter sets created for 8 different quantum mechanical (QM) charge calculation schemes. Afterwards, we prepared a training set of 74 substituted phenols. Additionally, for each molecule we generated its dissociated form by removing the phenolic hydrogen. For all the molecules in the training set, we then calculated EEM charges using the 18 parameter sets, and the QM charges using the 8 above mentioned charge calcu- lation schemes. For each type of QM and EEM charges, we created one QSPR model employing charges from the non-dissociated molecules (three descriptor QSPR models), and one QSPR model based on charges from both dissociated and non-dissociated molecules (QSPR models with five descriptors). Afterwards, we calculated the quality criteria and evaluated all the QSPR models obtained. We found that the QSPR models employing EEM charges can be a suitable approach for pKa prediction. From our 54 EEM QSPR models focused on phenols, 2 63% show a correlation of R > 0.9 between the experimental and predicted pKa. The most successful type of these EEM QSPR models employed 5 descriptors, namely the atomic charge of the hydrogen atom from the phenolic OH group, the charge on the oxygen atom from the phenolic OH group, the charge on the carbon atom binding the phenolic OH group, the charge on the oxygen from the phenoxide O− from the dissociated molecule, and the charge on the carbon atom binding this oxygen. Specifically, 94% of these models have R2 > 0.9, and the best one has R2 = 0.920. In general, including charge descriptors from dissociated molecules, which wasintroduced in our work, always increases the quality of a QSPR model. The only drawback of EEM QSPR models is that the EEM parameters are currently not available for all types of atoms. Therefore the EEM parameter sets need to be expanded to larger sets of molecules and further improved. As expected, the QM QSPR models provided more accurate pKa predictions than the EEM QSPR models. Nevertheless, these differences are not substantial. Furthermore, a big advantage of EEM QSPR models is that one can calculate the EEM charges markedly faster than the QM charges. Moreover, the EEM QSPR models are not so strongly influenced by the charge calculation approach as the QM QSPR models are. Specifically, the QM QSPR models which use atomic

– 45 – Conclusion 46 charges obtained from calculations with higher basis set perform better, while the EEM QSPR models do not show such marked differences. Similarly, the quality of QM QSPR models depends a lot on population analysis, but EEM QSPR models are not influenced so much. Namely, QM QSPR models which use atomic charges calculated from MPA, NPA and Hirshfeld PA performed very well, while MK provides only weak models. In the case of EEM QSPR models, MPA performs also the best, but all other PAs (including MK) provide accurate results as well. The source of the EEM parameters also did not affect the quality of the EEM QSPR models significantly. The robustness of EEM QSPR models was successfully confirmed by cross- validation. Specifically, the accuracy of pKa prediction for the test, training and complete set were comparable. The applicability of EEM QSPR models for other chemical classes was tested in a case study focused on carboxylic acids. This case study showed that EEM QSPR models are indeed applicable for pKa prediction also forcarboxylic acids. Namely, 5 from 12 of these models were able to predict 2 2 pKa with R > 0.9, while the best EEM QSPR model reached R = 0.925. Therefore, EEM QSPR models constitute a very promising approach for the prediction of pKa. Their main advantages are that they are accurate, and can predict pKa values very quickly, since the atomic charge descriptors used in the QSPR model can be obtained much faster by EEM than by QM. Additionally, the quality of EEM QSPR models is less dependent on the type of atomic charges used (theory level, basis set, population analysis) than in the case of QM QSPR models. Accordingly, EEM QSPR models constitute a pKa prediction approach which is very suitable for virtual screening. The results of the work were presented in the 3rd Strasbourg Summer School on Chemoinformatics and published in Journal of Cheminformatics. Part VI

Appendices

– 47 – A

Content of included CD

• Manuscript of the published research article

• Data related to the article and this thesis:

Table S1a The list of the phenol molecules, including their names, NCS numbers, CAS numbers and experimental pKa values. Table S1b The list of the carboxylic acid molecules, including their names, NCS numbers, CAS numbers and experimental pKa values. Table S2 The parameters of all the QSPR models for phenols. Table S3 The information about outlier molecules for phenols. Table S4 The table of cross-validation results for phenols. Table S5 The quality and statistical criteria of QSPR models for carboxylic acids. Table S6 The table containing charge descriptors for all charge calculation approaches and predicted pKa values for all QSPR models (for phenols). Molecules The SDF files with the structures of the molecules and also their dissociated forms.

• Folder diploma thesis/ containing this thesis with latex sources

– 48 – B

List of abbreviations

3d 3 descriptors

4d 4 descriptors

5d 5 descriptors

7d 7 descriptors

ADME Absorption, Distribution, Metabolism and Excretion

AIM Atoms in Molecules

ANN Artificial Neural Networks

B3LYP Becke, three-parameter, Lee-Yang-Parr

BLYP Becke, Lee-Yang-Parr

DENR Dynamic Electronegativity Relaxation

DFT Density Functional Theory

EEM Electronegativity Equalization Method

EPR Electron Paramagnetic Resonance

GDAC Geometry-Dependent Atomic Charge

HF Hartree-Fock

KCM Kirchhoff Charge Model

LFER Linear Free Energy Relationships

MK Merz-Singh-Kollman

MLR Multiple Linear Regression

– 49 – Appendices 50

MPA Mulliken Population Analysis

NPA Natural Population Analysis

NMR Nuclear Magnetic Resonance

PA Population Analysis

PEOE Partial Equalization of Orbital Electronegativity

QEq Charge Equilibration

QM Quantum Mechanical

QSAR Quantitative Structure-Activity Relationship

QSPR Quantitative Structure-Property Relationship

RMSE Root Mean Square Error

SQE Split Charge Equilibration

TSEF Topologically Symmetric Energy Function

WO Without Outliers C

List of molecules

Table C.1: List of phenols.

NSC CAS number pKa

O O

N NSC 881 97-51-8 O 5.51

OH

O NSC 1317 100-02-7 N OH 7.15 O

O O

N N NSC 1532 51-28-5 O O 4.09

HO

OH

NSC 1548 90-43-7 9.97

– 51 – Appendices 52

NSC CAS number pKa

HO CH3 NSC 1549 95-65-8 10.4

CH3

OH

O NSC 1551 554-84-7 8.36 N

O

OH

O NSC 1552 88-75-5 7.23 N

O

OH

NSC 1809 94-71-3 O 10.1

H3C

NSC 1858 92-69-3 HO 9.55

CH3 NSC 1888 99-89-8 OH 10.2

CH3

CH3

HO NSC 2082 534-52-1 4.31 O O N N

O O Appendices 53

NSC CAS number pKa

OH CH3

NSC 2123 576-26-1 10.6 H3C

NSC 2127 123-08-0 HO 7.61 O

H OO

NSC 2150 148-53-8 7.91 O

CH3

OH

CH3 NSC 2209 618-45-1 10.2

CH3

OH

CH3 NSC 2440 121-71-1 9.25

O

CH3 NSC 2599 95-87-4 10.4

H3C OH

O

N O NSC 2880 731-92-0 3.85 OH

N O O Appendices 54

NSC CAS number pKa OH

NH2 NSC 3115 65-45-2 8.37

O

OH

O NSC 3142 700-38-9 7.41 H3C N

O

HO N

NSC 3177 1689-82-3 N 8.2

HO O NSC 3504 100-83-4 8.98

NSC 3696 106-44-5 H3C OH 10.3

O NSC 3698 99-93-4 OH 8.05

CH3

OH

NSC 3814 90-01-7 9.84

OH Appendices 55

NSC CAS number pKa OH

NSC 3815 90-05-1 9.98 O

CH3

OH NSC 3829 105-67-9 10.6

H3C CH3

OH CH3 NSC 3991 103-90-2 9.38 H O N

H3C OOH NSC 4875 621-34-1 9.65

CH3 NSC 4960 150-76-5 OOH 10.1

H3C NSC 4965 80-46-6 OH 10.4 H3C CH3

CH3

H3C O NSC 4969 93-51-6 10.3

OH Appendices 56

NSC CAS number pKa OH

CH3 NSC 5103 88-69-7 10.5

CH3

CH3 OH CH3

NSC 5105 2078-54-8 H3C CH3 11.1

H3C CH3

NSC 5296 697-82-5 10.7 CH3

OH

H3C CH3

NSC 5353 527-60-6 10.9 OH

CH3

O

H3C N NSC 5387 119-33-5 O 7.6

OH

O OH O

N N NSC 6215 573-56-8 O O 3.97

CH3

H3C O NSC 6769 97-54-1 9.88

OH Appendices 57

NSC CAS number pKa

O OH

N O NSC 7739 131-89-5 4.52

N O O

H3C OH NSC 8768 108-39-4 10.1

OH H3C NSC 8873 620-17-7 9.9

H3C CH3 NSC 8885 698-71-5 10.1

OH

CH3

H2C O NSC 8895 97-53-0 10.2

OH

H3C CH3

NSC 9268 108-68-9 10.2

OH

H3C NSC 9885 622-62-8 OOH 10.1 Appendices 58

NSC CAS number pKa OH

NSC 10112 90-00-6 10.2

CH3

OH

CH3 NSC 11215 89-83-8 10.6 H3C

CH3

OH

O NSC 12987 5428-54-6 8.59 H3C N

O

OH

O NSC 14881 87-17-2 7.4 NH

CH3

O NSC 15351 121-33-5 O 7.4

OH

OH

NSC 17588 580-51-8 9.64

OOH H3C NSC 21735 150-19-6 9.65 Appendices 59

NSC CAS number pKa

CH3

NSC 23076 95-48-7 10.3

OH

CH3

NSC 33870 609-93-8 4.23 O O N N

O OH O

NSC 36808 108-95-2 HO 9.99

O O

N N O O NSC 36947 88-89-1 0.38 OH

N O O

HO CH3 NSC 38776 496-78-6 10.6

H3C CH3

H3C

NSC 41205 2042-14-0 O 8.62 N OH

O

OH

NSC 60291 3180-9-4 H3C 10.6 Appendices 60

NSC CAS number pKa

HO OH NSC 60735 620-24-6 9.83

CH3 OH

NSC 62011 526-75-0 10.5 H3C

CH3 NSC 62012 123-07-9 OH 10

OH

NSC 65646 644-35-9 10.5

H3C

H3C NSC 65647 645-56-7 OH 10.3

O O H3C CH3 NSC 70955 500-99-2 9.34

OH

OH

NSC 82996 621-59-0 O 8.89

CH3 O Appendices 61

NSC CAS number pKa

OH

H3C NSC 85475 4383-7-7 10.2

OH

OH

O O

NSC 90441 329-71-5 N N 5.21

O O

O

O N NSC 95810 885-82-5 6.73 HO

O

OH N O NSC 156083 66-56-8 O 4.96 N

O

Chiral CH3 OH O

H3C N O NSC 202753 88-85-7 4.62

N O O

NSC 227926 623-05-2 HO 9.82 OH

2NH NSC 249188 51-67-2 OH 9.77 Appendices 62

Table C.2: List of carboxylic acids.

NSC CAS number pKa

O

NSC 166 79-14-1 HO 3.83 OH

O NSC 295 1821-12-1 4.76 OH

OH OH

NSC 836 595-46-0 O O 3.15

H3C CH3

O

NSC 1112 1759-53-1 4.83

OH

O NSC 2159 5292-21-7 4.8 OH

O

NSC 2192 111-14-8 H3C 4.8 OH

O

O NSC 2752 110-17-8 HO 3.03

OH Appendices 63

NSC CAS number pKa

O

O NSC 3716 123-76-2 H3C 4.64

OH

Chiral OH OH NSC 3806 300-85-6 4.41

H3C O

CH3

O NSC 4126 646-07-1 H3C 4.84

OH

O H3C

O NSC 4255 1877-75-4 O 3.21

OH

CH3 O

NSC 4505 594-61-6 HO 3.61

CH3 OH

O NSC 4765 79-10-7 4.26 H2C OH

O NSC 5024 124-07-2 4.89 H3C OH Appendices 64

NSC CAS number pKa

O NSC 5025 334-48-5 4.9 H3C OH

O NSC 5026 143-07-7 5.3 H3C OH

O

NSC 5028 544-63-8 H3C OH 4.9

H3C O OH

O NSC 5165 1878-85-9 O 3.23

CH3 OH

O NSC 5293 1878-49-5 O 3.23

O

N

NSC 5398 104-03-0 O O 3.85

OH

Chiral OH

O NSC 6135 81-25-4 HO 4.98 C C H3 H3 OH CH3 OH Appendices 65

NSC CAS number pKa

Chiral H3C OH NSC 7304 116-53-0 4.81

CH3 O

CH2 O

NSC 7393 79-41-4 4.65

CH3 OH

O

O NSC 7622 124-04-9 HO 4.44

OH

Chiral OH

NSC 7925 90-64-2 O 3.41

OH

OOH NSC 8124 141-82-2 2.85 HO O

O NSC 8266 142-62-1 4.88 H3C OH

Chiral OH

NSC 8406 97-61-0 H3C 4.79 O

CH3 Appendices 66

NSC CAS number pKa

O NSC 8415 107-92-6 4.82

H3C OH

O OH

NSC 8742 117-34-0 3.94

O NSC 8751 107-93-7 4.17

H3C OH

H3C O NSC 8758 88-09-5 4.71

H3C OH

Chiral H3C NSC 8881 149-57-5 OH 4.7

H3C O

H3C OH NSC 8999 80-59-1 4.96

CH3 O

OOH NSC 9238 110-94-1 4.34 HO O Appendices 67

NSC CAS number pKa

OH

NSC 9272 501-52-0 O 4.66

O

NSC 15772 86-87-3 OH 4.23

CH3

O O NSC 16682 306-08-1 4.41 OH OH

O OH NSC 19493 123-99-9 4.55 HO O

OH OH

NSC 25201 516-05-2 O O 3.12

CH3

O

O NSC 25949 110-15-6 HO 4.21

OH

O

O NSC 25952 505-48-6 HO 4.52 OH Appendices 68

NSC CAS number pKa

O H3C OH NSC 27799 104-01-8 4.36 O

OH OH

O NSC 30092 6324-11-04 O 3.02

O OH NSC 30112 111-16-0 4.51 HO O

O O

NSC 30279 77-92-9 OH OH OH 2.79 HO

O

OH

H3C O NSC 35913 1643-15-8 O 3.2

O O N OH

O NSC 37409 1878-87-1 O 2.9

H3C OOH NSC 38176 940-64-7 3.21

O Appendices 69

NSC CAS number pKa

OH

OO NSC 38177 2088-24-6 H3C O 3.14

OH

H3C NSC 61983 1185-39-3 O 5.02

H3C CH3

O O

NSC 62774 144-62-7 1.25

OH OH

CH3 O

NSC 62780 79-31-2 4.84

CH3 OH

CH3 OH NSC 62783 503-74-2 4.77

H3C O

O

NSC 62787 112-05-0 H3C 4.95 OH

CH3 O

NSC 65449 75-98-9 H3C 5.03

CH3 OH Appendices 70

NSC CAS number pKa

O

NSC 65595 622-47-9 OH 4.37 H3C

OH

NSC 65637 2270-20-4 O 4.88

Chiral OH

NSC 67957 617-31-2 H3C 3.59 O

OH

O OH NSC 73191 102-32-9 4.25 OH OH

CH3

O NSC 93819 99-66-1 4.6 OH

CH3

O

N O NSC 93911 1877-73-2 O 3.97 OH

HO O

NSC 132953 64-19-7 4.76

CH3 Appendices 71

NSC CAS number pKa

O

N OOH

NSC 166278 1798-11-4 O 2.89

O

O OH

N O NSC 193418 1878-88-2 O O 2.95

Chiral CH3

O NSC 256857 15687-27-1 CH3 4.91 OH H3C

NSC 281243 60-33-3 O 4.77 OH

CH3

Chiral OH O NSC 367919 598-82-3 3.86

CH3 OH

Chiral OH

O NSC 401846 515-30-0 3.53

OH H3C

O

NSC 406833 109-52-4 H3C 4.84 OH Bibliography

[1] Y. Ishihama, M. Nakamura, T. Miwa, T. Kajima, and N. Asakawa. A rapid method for pKa determination of drugs using pressure-assisted capillary electrophoresis with photodiode array detection in drug discovery. J. Pharm. Sci., 91(4):933–942, 2002.

[2] S. Babic,´ A. J. M. Horvat, D. M. Pavlovic,´ and M. Kasˇtelan-Macan. Determi- nation of pKa values of active pharmaceutical ingredients. TrAC, 26(11):1043– 1061, 2007.

[3] D. Manallack. The pKa distribution of drugs: Application to drug discovery. Perspect. Med. Chem., 1:25–38, 2007.

[4] H. Wan and J. Ulander. High-throughput pKa screening and prediction amenable for ADME profiling. Expert Opin. Drug Metab. Toxicol., 2(1):139– 155, 2006.

[5] G. Cruciani, F. Milletti, L. Storchi, G. Sforna, and L. Goracci. In silico pKa prediction and ADME profiling. Chem. Biodivers., 6(11):1812–1821, 2009.

[6] J. Comer and K. Tam. Pharmacokinetic Optimization in Drug Research: Biological, Physicochemical, and Computational Strategies. Verlag Helvetica Chimica Acta, Postfach, CH-8042 Zu¨rich, Switzerland, 2001.

[7] G. Klebe. Recent developments in structure-based drug design. J. Mol. Med., 78:269–281, 2000.

[8] A. C. Lee and G. M. Crippen. Predicting pKa. J. Chem. Inf. Model., 49:2013– 2033, 2009.

[9] M. Rupp, R. Ko¨rner, and I. V. Tetko. Predicting the pKa of small molecules. Comb. Chem. High Throughput Screen., 14(5):307–327, 2010.

[10] R. Fraczkiewicz. In Silico Prediction of Ionization, volume 5. Elsevier, Oxford, England, 2006.

[11] J. Ho and M. Coote. A universal approach for continuum solvent pKa calcu- lations: Are we there yet? Theor. Chim. Acta, 125(1–2):3–21, 2010.

– 72 – Bibliography 73

[12] M. J. Citra. Estimating the pKa of phenols, carboxylic acids and alcohols from semi-empirical quantum chemical methods. Chemosphere, 1:191–206, 1999.

[13] K. C. Gross, P. G. Seybold, and C. M. Hadad. Comparison of different atomic charge schemes for predicting pKa variations in substituted anilines and phe- nols. Int. J. Quantum Chem., 90:445–458, 2002.

[14] W. C. Kreye and P. G. Seybold. Correlations between quantum chemical indices and the pKas of a diverse set of organic phenols. Int. J. Quantum Chem., 109:3679–3684, 2009.

[15] R. Svobodova´ Varˇekova´, S. Geidl, C-M Ionescu, O. Skrˇehota, M. Kudera, D. Sehnal, T. Bouchal, R. Abagyan, H. J. Huber, and J. Kocˇa. Predicting pKa values of substituted phenols from atomic charges: Comparison of different quantum mechanical methods and charge distribution schemes. J. Chem. Inf. Model., 51(8):1795–1806, 2011.

[16] P. Atkins and J. D. Paula. Atkins’ Physical Chemistry. Oxford University Press, 9th edition, 2010.

[17] W. J. Mortier, S. K. Ghosh, and S. Shankar. Electronegativity equalization method for the calculation of atomic charges in molecules. J. Am. Chem. Soc., 108:4315–4320, 1986.

[18] C. E. Housecroft and A. G. Sharpe. Inorgnic chemistry. Pearson Education, Essex, England, second edition, 2005.

[19] P. Atkins, T. Overton, J. Rourke, M. Weller, F. Armstrong, and M. Hagerman. Shiver and Atkins’ Inorganic Chemistry. W. H. Freeman and Company, New York, United States, fifth edition, 2010.

[20] P. Howard and W. Meylan. Physical/chemical property database (PHYSPROP). Syracuse Research Corporation, Environmental Science Cen- ter, North Syracuse NY, 1999.

[21] J. Clark and D. D. Perrin. Prediction of the strengths of organic bases. Q. ReV. Chem. Soc., 18:295–320, 1964.

[22] D. D. Perrin, B. Dempsey, and E. P. Serjeant. pKa prediction for organic acids and bases. Chapman and Hall: New York, 1981.

[23] Acd/pka. Advanced Chemistry Development Inc.: 110 Yonge St., 14th Floor, Toronto, Ontario, Canada M5C 1T4.

[24] J. Shelley, A. Cholleti, L. L. Frye, J. R. Greenwood, M. Uchimaya, and Epik. A software program for pKa prediction and protonation state generation for drug-like molecules. J. Comput. – Aided Mol. Des., 21:681–691, 2007. Bibliography 74

[25] S. H. Hilal and S. W. Karickhoff. A rigorous test for SPARC’s chemical reactiity models: Estimation of more than 4300 ionization pKas. Quant. Struct. Act. Relat., 14:348–355, 1995.

[26] R. Sayle. Physiological ionization and pKa prediction. http://www.daylight.com/meetings/emug00/Sayle/pkapredict.html.

[27] P. E. Blower and K. P. Cross. Decision tree methods in pharmaceutical re- search. Curr. Top. Med. Chem., 6:31–39, 2006.

[28] A. C. Lee, J.-Y. Yu, and G. M. Crippen. pKa prediction of monoprotic small molecules the SMARTS way. J. Chem. Inf. Model., 48:2042–2053, 2008.

[29] M. D. Liptak, K. C. Gross, P. G. Seybold, S. Feldgus, and G. Shields. Absolute pKa determinations for substituted phenols. J. Am. Chem. Soc., 124:6421–6427, 2002.

[30] A. M. Toth, M. D. Liptak, D. L. Phillips, and G. C. Shields. Accurate relative pKa calculations for carboxylic acids using complete basis set and Gaussian- n models combined with continuum solvation methods. J. Chem. Phys., 114:4595–4606, 2001.

[31] Jaguar, version 4.2, schro¨dinger, inc., new york, ny, 2010.

[32] S. Jelfs, P. Ertl, and P. Selzer. Estimation of pKa for druglike compounds using semiempirical and information-based descriptors. J. Chem. Inf. Model., 47:450–459, 2007.

[33] S. L. Dixon and P. C. Jurs. Estimation of pKa for organic oxyacids using calculated atomic charges. J. Comput. Chem., 14:1460–1467, 1993.

[34] J. Zhang, T. Kleino¨der, and J. Gasteiger. Prediction of pKa values for aliphatic carboxylic acids and alcohols with empirical atomic charge descriptors. J. Chem. Inf. Model., 46:2256–2266, 2006.

[35] K. C. Gross and P. G. Seybold. Substituent effects on the physical properties and pKa of phenol. Int. J. Quantum Chem., 85:569–579, 2001. [36] S. Liu and L. G. Pedersen. Estimation of molecular acidity via electrostatic potential at the nucleus and valence natural atomic orbitals. J. Phys. Chem. A, 113:3648–3655, 2009.

[37] Ch. J. Cramer. Essential of : Theories and Models. John Wiley & Sons, second edition, 2004.

[38] I. N. Levine. . Prentice Hall, 6th edition, 2008.

[39] E. R. Davidson and D. Feller. Basis set selection for molecular calculations. Chem. Rev., 86(4):681–696, 1986. Bibliography 75

[40] R. S. Mulliken. Electronic structures of molecules xi. electroaffinity, molecular orbitals and dipole moments. J. Chem. Phys., 3(9):573–585, 1935.

[41] R. S. Mulliken. Criteria for construction of good self-consistent-field molec- ular orbital wave functions, and significance of lcao-mo population analysis. J. Chem. Phys., 36(12):3428–3439, 1962.

[42] P. O. Lowdin. On the non-orthogonality problem connected with the use of atomic wave functions in the theory of molecules and crystals. J. Chem. Phys., 18(3):365–375, 1950.

[43] A. E. Reed, R. B. Weinstock, and F. Weinhold. Natural-population analysis. J. Chem. Phys., 83(2):735–746, 1985.

[44] R. F. W. Bader, A. Larouche, C. Gatti, M. T. Carroll, P. J. Macdougall, and K. B. Wiberg. Properties of atoms in molecules – dipole-moments and transferabil- ity of properties. J. Chem. Phys., 87(2):1142–1152, 1987.

[45] F. L. Hirshfeld. Bonded-atom fragments for describing molecular charge- densities. Theor. Chim. Acta, 44(2):129–138, 1977.

[46] B. H. Besler, K. M. Merz, and P. A. Kollman. Atomic charges derived from semiempirical methods. J. Comput. Chem., 11(4):431–439, 1990.

[47] C. M. Breneman and K. B. Wiberg. Determining atom-centered monopoles from molecular electrostatic potentials - the need for high sampling density in formamide conformational-analysis. J. Comput. Chem., 11(3):361–373, 1990.

[48] D. E. Williams. Representation of the molecular electrostatic potential by atomic multipole and bond dipole models. J. Comput. Chem., 9(7):745–763, 1988.

[49] R. J. Abraham, L. Griffiths, and P. Loftus. Approaches to charge calculations in molecular mechanics. J. Comput. Chem., 3(3):407–416, 1982.

[50] J. Gasteiger and M. Marsili. Iterative partial equalization of orbital electroneg- ativity - a rapid access to atomic charges. Tetrahedron, 36(22):3219–3228, 1980.

[51] K. H. Cho, Y. K. Kang, K. T. No, and H. A. Scheraga. A fast method for calculating geometry-dependent net atomic charges for polypeptides. J. Phys. Chem. B, 105(17):3624–3634, 2001.

[52] A. A. Oliferenko, S. A. Pisarev, V. A. Palyulin, and N. S. Zefirov. Atomic charges via electronegativity equalization: Generalizations and perspectives. Adv. Quantum Chem., 51:139–156, 2006.

[53] D. A. Shulga, A. A. Oliferenko S. A. Pisarev, V. A. Palyulin, and N. S. Zefirov. Parameterization of empirical schemes of partial atomic charge calculation for reproducing the molecular electrostatic potential. Dokl. Chem., 419:57–61, 2008. Bibliography 76

[54] A. K. Rappe and W. A. Goddard. Charge equilibration for molecular- dynamics simulations. J. Phys. Chem., 95(8):3358–3363, 1991. [55] R. A. Nistor, J. G. Polihronov, M. H. Muser, and N. J. Mosey. A generalization of the charge equilibration method for nonmetallic materials. J. Chem. Phys., 125(9):094108–094118, 2006. [56] R. Svobodova´Varˇekova´, Z. Jirousˇkova´, J. Vaneˇk, S. Suchomel, and J. Kocˇa. Electronegativity equalization method: Parameterization and validation for large sets of organic, organohalogene and organometal molecule. Int. J. Mol. Sci., 8:572–582, 2007. [57] B. G. Baekelandt, W. J. Mortier, J. L. Lievens, and R. A. Schoonheydt. Probing the reactivity of different sites within a molecule or solid by direct com- putation of molecular sensitivities via an extension of the electronegativity equalization method. J. Am. Chem. Soc., 113(18):6730–6734, 1991. [58] Z. Jirousˇkova´, R. Svobodova´Varˇekova´, J. Vaneˇk, and J. Kocˇa. Electroneg- ativity equalization method: Parameterization and validation for organic molecules using the Merz–Kollman–Singh charge distribution scheme. J. Comput. Chem., 30:1174–1178, 2009. [59] J. Chaves, J. M. Barroso, P. Bultinck, and R. Carbo-Dorca. Toward an alter- native hardness kernel matrix structure in the electronegativity equalization method (EEM). J. Chem. Inf. Model., 46(4):1657–1665, 2006. [60] P. Bultinck, W. Langenaeker, P. Lahorte, F. De Proft, P. Geerlings, M. Waro- quier, and J. P. Tollenaere. The electronegativity equalization method i: Parametrization and validation for atomic charge calculations. J. Phys. Chem. A, 106(34):7887–7894, AUG 29 2002. [61] Y. Ouyang, F. Ye, and Y. Liang. A modified electronegativity equalization method for fast and accurate calculation of atomic charges in large biological molecules. Phys. Chem. Chem. Phys., 11:6082–6089, 2009. [62] P. Bultinck, R. Vanholme, P. L. A. Popelier, F. De Proft, and P. Geerlings. High- speed calculation of AIM charges through the electronegativity equalization method. J. Phys. Chem. A, 108(46):10359–10366, 2004. [63] Z. Z. Yang and C. S. Wang. Atom-bond electronegativity equalization method. 1. Calculation of the charge distribution in large molecules. J. Phys. Chem. A, 101:6315–6321, 1997. [64] G. Menegon, M. Loos, and H. Chaimovich. Parameterization of the elec- tronegativity equalization method based on the charge model 1. J. Phys. Chem. A, 106:9078–9084, 2002. [65] R. Todoschini and V. Consonni. Molecular Descriptors for Chemoinformatics (Methods and Principles in Medicinal Chemistry). Wiley-VCH, second edition, 2009. Bibliography 77

[66] A.R. Leach and V.J. Gillet. An Intorduction to Chemoinformatics. Springer, 2007.

[67] J. Gasteiger and T. Engel, editors. Chemoinformatics: A Textbook. John Wiley & Sons, 2003.

[68] G. Espinosa, D. Yaffe, Y. Cohen, A. Arenas, and F. Giralt. Neural network based quantitative structural property relations (QSPRs) for predicting boil- ing points of aliphatic hydrocarbons. J. Chem. Inf. Comp. Sci., 40(3):859–879, 2000.

[69] T. Urdan. Statistics in Plain English. Psychology Press, second edition, 2005.

[70] J. Verzani. Using R for Introductory Statistics. Chapman and Hall/CRC Press, Florida, United States, 2005.

[71] S. Lemm, B. Blankertz, T. Dickhaus, and K.-R. Mu¨ller. Introduction to machine learning for brain imaging. NeuroImage, 56(2):387–399, 2011.

[72] Organisation for Economic Co-operation and Development. Guid- ance document on the validation of (quantitative)structure- activity relationships [(Q)SAR] models. [Online] Available from http://search.oecd.org/officialdocuments/displaydocumentpdf/?cote= env/jm/mono(2007)2&doclanguage=en, last check on April 6, 2013, 2007.

[73] NCI Open Database Compounds. Retrieved from http://cactus.nci.nih.gov/ on August 10, 2010.

[74] J. Sadowski and J. Gasteiger. From atoms and bonds to three-dimensional atomic coordinates: Automatic model builders. Chem. ReV., 93:2567–2581, 1993.

[75] M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, J. A. Montgomery, Jr., T. Vreven, K. N. Kudin, J. C. Burant, J. M. Millam, S. S. Iyengar, J. Tomasi, V. Barone, B. Mennucci, M. Cossi, G. Scal- mani, N. Rega, G. A. Petersson, H. Nakatsuji, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y.Honda, O. Kitao, H. Nakai, M. Klene, X. Li, J. E. Knox, H. P. Hratchian, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, P. Y. Ayala, K. Morokuma, G. A. Voth, P. Salvador, J. J. Dannenberg, V. G. Zakrzewski, S. Dapprich, A. D. Daniels, M. C. Strain, O. Farkas, D. K. Malick, A. D. Rabuck, K. Raghavachari, J. B. Foresman, J. V. Ortiz, Q. Cui, A. G. Baboul, S. Clifford, J. Cioslowski, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. L. Martin, D. J. Fox, T. Keith, M. A. Al-Laham, C. Y. Peng, A. Nanayakkara, M. Challacombe, P. M. W. Gill, B. Johnson, W. Chen, M. W. Wong, C. Gonzalez, and J. A. Pople. Gaussian 09, Revision E.01. Gaussian, Inc.: Wallingford, CT, 2004.

[76] T. A. Keith. Aimall, version 11.12.19. TK Gristmill Software, Overland Park KS, USA, 2011 (aim.tkgristmill.com). Bibliography 78

[77] R. Svobodova´ Varˇekova´ and J. Kocˇa. Optimized and parallelized imple- mentation of the electronegativity equalization method and the atom-bond electronegativity equalization method. J. Comput. Chem., 3:396–405, 2006.

[78] O. Skrˇehota, R. Svobodova´Varˇekova´, S. Geidl, M. Kudera, D. Sehnal, C.-M. Ionescu, and J. Kocˇa. QSPR designer – a program to design and evaluate QSPR models. case study on pKa prediction. J. Cheminf., 3 (Suppl 1):P16, 2011.

[79] R Core Team. R: A Language and Environment for Statistical Computing.R Foundation for Statistical Computing, Vienna, Austria, 2013.

[80] A. Habibi-Yangjeh, M. Danandeh-Jenagharad, and M. Nooshyar. Application of artificial neural networks for predicting the aqueous acidity of various phenols using QSAR. J. Mol. Model., 12:338–347, 2006.

[81] T. Hanai, K. Koizumi, T. Kinoshita, R. Arora, and F. Ahmed. Prediction of pKa values of phenolic and nitrogen-containing compounds by computational chemical analysis compared to those measured by liquid chromatography. J. Chromatogr. A, 762:55–61, 1997.

[82] B. G. Tehan, E. J. Lloyd, M. G. Wong, W. R. Pitt, J. G. Montana, D. T. Manallack, and E. Gancia. Estimation of pKa using semiempirical molecular orbital methods. Part 1: Application to phenols and carboxylic acids. Quant. Struct.- Act. Relat., 21:457–472, 2002.

[83] R. Gieleciak and J. Polanski. Modeling robust QSAR. 2. Iterative variable elimination schemes for CoMSA: Application for modeling benzoic acid pKa values. J. Chem. Inf. Model., 47:547–556, 2007.

[84] R. D. Cook. Detection of Influential Observation in Linear Regression. Tech- nometrics, 19(1):15–18, 1977.

[85] R. D. Cook. Influential observations in linear regression. J. Am. Statist. Assoc., 74(365):169–174, 1979.

[86] P. Bultinck, W. Langenaeker, P. Lahorte, F. De Proft, P. Geerlings, C. Van Alsenoy, and J. P. Tollenaere. The electronegativity equalization method ii: Applicability of different atomic charge schemes. J. Phys. Chem. A, 106(34):7895–7901, 2002.