3D-QSAR and Physical Property Modeling Using Quantum-Mechanically Derived Molecular Surface Properties

A Dissertation

Kendall Byler 2007

Presented to the Faculties of Natural Sciences of the Friedrich-Alexander-Universität Erlangen-Nürnberg for the award of the doctoral degree

submitted by

Kendall Grant Byler

from Huntsville

Approved as a dissertation by the Faculties of Natural Sciences of the Friedrich-Alexander-Universität Erlangen-Nürnberg.

Date of the oral examination: 11.05.2007
Chair of the doctoral committee: Prof. Dr. E. Bänsch
First examiner: Prof. Dr. T. Clark
Second examiner: Prof. Dr. P. Gmeiner

Acknowledgements

I would like to thank those but for whom this work would not have been possible. The first of these is Professor Dr. Tim Clark, who provided the opportunity and the guidance in my study of computational chemistry. And thanks go to the members of the Clark group who helped me in my endeavors: Dr. Nico van Eikema Hommes, Dr. Harald Lanig, Dr. Ralph Puchta, Dr. Matthias Hennemann, Matthias Brüstle, Anselm Horn, Dr. Olaf Othersen, Dr. Gudrun Schürer, Dr. Tatyana Shubina, Florian Haberl, Kirsten Höhfeld, Catalin Rusu, Jr-Hung Lin, Hakan Kayi, and Sergio Sanchez. And also to members of the Gasteiger group for their assistance: Dr. Simon Spycher, Prof. Dr. Fernando da Costa, Dimitar Hristozov, Dr. Christof Schwab, and Dr. Thomas Engel, and of course Adrian Jung of the Kirsch group. Thanks also to the Pfizer Corporation for their financial support of this research. I would thank my family: my parents, Paul and Carol Byler, my sister, Ashley, my grandparents, Henry and Martha Snoddy, Elza and Emma Byler, and my beautiful wife, Anastasia. And I would thank the friends everywhere that stayed friends despite the separations of time and distance.

Contents

1 Introduction
  1.1 Drug Discovery
  1.2 Property Modeling
  1.3 A Quantum-Mechanical, Molecular Orbital Approach

2 Surface-Integral QSPR Models: Local Energy Properties
  2.1 Introduction
    2.1.1 Local Molecular Properties
    2.1.2 Surface-Integral Models
  2.2 Methods
  2.3 Results
    2.3.1 Octanol/Water Partition Coefficient
    2.3.2 Free Energy of Solvation
      2.3.2.1 Free Energy of Solvation in Octanol
      2.3.2.2 Free Energy of Solvation in Water
    2.3.3 Acid Dissociation Constant
    2.3.4 Boiling Point
    2.3.5 Glass Transition Temperature
    2.3.6 Aqueous Solubility
  2.4 Discussion
  2.5 Conclusions

3 Support Vector Classification of Phospholipidosis-Inducing Drugs
  3.1 Introduction
    3.1.1 Phospholipidosis
    3.1.2 Phospholipidosis Models
    3.1.3 Surface Autocorrelations
    3.1.4 Statistical Methods
      3.1.4.1 Support Vector Machines
      3.1.4.2 Multivariate Adaptive Regression Splines
  3.2 Methods
  3.3 Results
    3.3.1 Support Vector Machines
    3.3.2 Multivariate Adaptive Regression Splines Using Autocorrelation Indices
  3.4 Discussion
  3.5 Conclusions

4 3D-QSAR Using Local Properties
  4.1 Introduction
    4.1.1 Comparative Molecular Field Analysis
    4.1.2 Partial Least Squares Regression
    4.1.3 Local Properties
  4.2 Computational Methods
  4.3 Results and Discussion
    4.3.1 Serotonin Receptor Agonists/Antagonists
    4.3.2 Adrenergic Receptor Agonists/Antagonists
    4.3.3 Dopamine D4 Antagonists
    4.3.4 Avian Influenza Neuraminidase Inhibitors
    4.3.5 Mutagenic Tertiary Amides
    4.3.6 The Effect of Grid Orientation on Predictivity
  4.4 Conclusions

5 Conclusions and Outlook
  5.1 Conclusions
  5.2 Outlook

6 Summary

7 Zusammenfassung

Appendix A

Appendix B

References

Chapter 1

Introduction

1.1 Drug Discovery

It has been estimated [1] that, out of a pool of millions of compounds screened, 10,000 reach the animal testing phase, which will then likely produce ten drug candidates for human clinical trials, of which only one will reach the market. The process may also require 15 years and 750,000 U.S. dollars. Drug candidates that fail late in the testing process never produce a return for the company that has invested so much time and money in them. Pharmaceutical companies must offset these losses by recouping the expenditure from the several successfully tested drugs they produce. To minimize the potential loss from focusing on compounds that will never result in a marketable drug, much preliminary research and testing are done.

The rational drug-design approach [2] to this problem begins by identifying a molecular target involved in a pathophysiological process and characterizing its structure and function; the search for a lead compound then begins. This is usually achieved by means of an array of in vitro screens for biological activity. Large groups of compounds may be evaluated simultaneously in this way, and the procedure is referred to as high-throughput screening (HTS). Once a lead compound is discovered, it may also be found to have undesirable properties such as high toxicity, poor bioavailability, or poor pharmacokinetics. Libraries of compounds with modifications to the general structure of the lead compound may be synthesized in an effort to modulate the desirable and undesirable effects. Structure-activity relationships (SARs) observed during the study of the combinatorial library may point to a common chemical substructure, the pharmacophore, that produces the pharmacological effect. The medicinal chemist can then make various modifications to the pharmacophore in order to improve its properties. Kubinyi [3] describes the drug-design process in terms of a design cycle wherein a lead compound is optimized iteratively in an evolutionary manner [4] (Figure 1.1).

Figure 1.1 The drug design cycle from Kubinyi's lectures on drug design [4]. [Diagram labels: Biological Concept; Computer-aided design (protein crystallography, NMR, 3D databases, de novo design); Lead Structures; Structure-activity relationships, QSAR, molecular modeling; Series design, synthesis design; Syntheses; Biological Testing; Candidates for further development; Investigational New Drug; New Drug.]

However, all of this takes a great deal of time, and the questions of clinical development and the lengthy drug-approval process have yet to be addressed. Thus, to improve the efficiency of the high-throughput screen further, chemists use molecular-modeling schemes that calculate properties from chemical structure to aid in the screening process. These virtual-screening methods include molecular-dynamics simulations, protein-ligand docking, protein-protein docking, membrane simulation, similarity searching of pharmacophore databases, and quantitative structure-activity relationships (QSARs). These tools allow pharmaceutical companies to screen out compounds that possess too many undesirable characteristics before investing time in producing, chemically analyzing, and testing them. Of great interest is the elucidation of a set of chemical and physical properties that modulate the relationship between chemical structure and pharmacological activity and that could be used to predict activity based solely on chemical structure.

1.2 Property Modeling

Property modeling for the purpose of prediction has taken many approaches. One of the first examinations of electronic effects on activity was Hammett's linear free-energy relationship [5] of substituent effects on benzoic acid hydrolysis reaction rates. He generated a series of substituent constants from a plot of the effect on reaction rate, which could then be used to predict the substituent effect on other reaction rates. Hansch suggested [6] a similar relationship between lipophilicity and biological activity. Unless a drug is actively transported across the cell membrane, it must passively diffuse through the membrane [7], which is composed of a lipid bilayer. Thus, the lipophilicity of a compound must have a corresponding effect on the drug's ability to enter the cell and produce the pharmacological effect and, indeed, this correlation between lipophilicity and biological activity had been observed as early as the late nineteenth century [8]. Since direct measurement of the solubility of compounds in cellular membranes is difficult at best, Hansch [6] approximated lipophilicity by the ratio of a compound's solubility in n-octanol to that in water, defined [9] by

$$P = \frac{[\text{compound}]_{oct}}{[\text{compound}]_{aq}\,(1-\alpha)} \tag{1.1}$$

where α is the degree to which the compound dissociates in water, as calculated from its ionization constant, so that (1 − α) is the fraction that remains undissociated. As some compounds are ionizable, making them appear more soluble in water, solubility measurements in water are often performed in an aqueous buffer and taken over a pH range (logD). Substituent constants similar to those of Hammett were used to calculate logP and logD based solely on the chemical structure. More recently, atom/fragment-based methods were developed [10] for the prediction of logP and logD. A more or less Gaussian distribution of logP values correlating to drug potency (log 1/C), with a peak value of approximately 2, has been observed [11].

Lipinski made the observation [12] that a compound's oral absorption and distribution seem to depend on certain structural characteristics. This is commonly referred to as the Rule of Five, which states that a compound with two or more of the following characteristics will be poorly absorbed and distributed in the body. These are:

• A molecular weight > 500 amu.
• A logP > 5.
• More than 5 hydrogen-bond donors (sum of –OH and –NH groups).
• More than 10 hydrogen-bond acceptors (sum of N and O atoms).
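The four criteria above translate directly into a small check. The following is a minimal sketch, assuming a hypothetical `Compound` record whose field names are chosen here for illustration; the thresholds and the "two or more" rule are those stated in the text, and the example values are made up rather than measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Compound:
    """Hypothetical record; the field names are illustrative only."""
    name: str
    mol_weight: float       # molecular weight in amu
    log_p: float            # octanol/water partition coefficient (log units)
    h_bond_donors: int      # number of -OH and -NH groups
    h_bond_acceptors: int   # number of N and O atoms


def rule_of_five_violations(c: Compound) -> List[str]:
    """Return the Rule of Five criteria that the compound exceeds."""
    violations = []
    if c.mol_weight > 500:
        violations.append("molecular weight > 500 amu")
    if c.log_p > 5:
        violations.append("logP > 5")
    if c.h_bond_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if c.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations


def predicted_poorly_absorbed(c: Compound) -> bool:
    """Two or more violations flag the compound, as stated in the text."""
    return len(rule_of_five_violations(c)) >= 2


if __name__ == "__main__":
    # Made-up values for illustration, not measured data.
    example = Compound("example", 620.0, 5.8, 2, 7)
    print(rule_of_five_violations(example))
    print(predicted_poorly_absorbed(example))
```

Returning the list of violated criteria, rather than only a flag, keeps the reason for a rejection visible when such a filter is applied to a large library.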

Drugs that passively diffuse across the cell membrane tend to follow this rule, while those that are actively transported do not depend on the same criteria of lipophilicity/hydrophilicity and are exceptions. More recently [4], the observation was made that the absorption of drug-like molecules is regularly distributed along these properties, bounded on one side by the rule-of-five values. The general implication of this simple rule is that the amount of property space that needs to be sampled in order to derive physicochemical or pharmacological properties is small. It is necessary only to discover the particular set of molecular descriptors that adequately describes the set of properties to be predicted. The trend in property prediction thus seems to be toward a reduced-space approach, which can account for complex interactions by relatively simple terms.

1.3 A Quantum-Mechanical, Molecular Orbital Approach

Most modeling methods employ as much theory as is practicable given the system to be studied. For example, molecular dynamics may be used to model proteins and protein-ligand interactions in solution, but the complexity of the system requires the use of classical mechanics with a reduced set of non-bonded interactions and a simplified representation of the solvent molecules. Other approximations are made so that the simulation can be completed in a reasonable period of time. Although the system as a whole may be well represented, this approach often leaves interactions near a particular site of interest poorly described [13,14]. This has led to the development of hybrid quantum mechanical/molecular mechanical (QM/MM) methods [15,16], which employ quantum mechanical calculations in the regions where a higher level of theory is required, while the bulk of the system is represented by force-field calculations. These regions are usually those where ligands interact with protein residues in a binding pocket, and quantum mechanical methods describe electrostatic intermolecular interactions better than atomic-monopole-based force-field techniques [17].

Since the point of contact for all drugs lies inevitably with the molecular surface of both the drug and the drug target, a descriptive model of the molecular surface is needed. The nature of this surface is electronic, and quantum mechanical methods are those that describe the electronic structure of the molecule. Quantum mechanical calculations take into account the behavior of electrons in molecular orbitals rather than localized atomic orbitals, whereas force-field techniques must inevitably rely on atomic constants parameterized to heats of formation. Local properties such as the molecular electrostatic potential (MEP) have been used to describe strong non-covalent interactions that are based primarily on charge. The MEP has been projected onto molecular isodensity surfaces by Murray and Politzer [18-23] to calculate descriptors for physical property prediction. Recently, additional local properties were described [24,25] to complement the MEP and provide a more complete description of the local electronic environment at the molecular surface. Local properties such as polarizability [24], ionization potential [24,26], electron affinity [24], electronegativity [27-29], and hardness [29] can readily be calculated by quantum-mechanical methods. Dispersion forces, which dominate in the case of nonpolar molecules, may be described by calculating the local molecular polarizability [30]. The tertiary structure of proteins and the stability of biological membranes depend fundamentally on these dispersion interactions between nonpolar regions [31-33].

Figure 1.2 Surface-integral electrostatic potential surface for paracetamol.

The use of surface-integral models [33] (SIMs) to predict physical properties by integrating a function of one or more local properties over the molecular surface has been demonstrated in the literature [31,32,34]. In addition to predicting physical properties, surface-integral QSAR models may be constructed from local properties to predict biological activities such as enzyme inhibition constants ($K_i$) and protein-ligand binding constants ($K_d$). These activities, used as local properties, may then be mapped onto the molecular surface to expose regions that are significant to the observed activity. In this way, the portions of a drug's molecular surface important to the binding and activation of its target may be examined as functions of both local electronic properties and local activities, concomitant with the property/activity predictions of the virtual high-throughput screen.

Chapter 2

Surface-Integral QSPR Models: Local Energy Properties

2.1 Introduction

The tools used for quantitative structure-activity relationships (QSAR), quantitative structure-property relationships (QSPR), protein-ligand docking, and scoring functions, among others in the cheminformatics toolbox, generally apply an atom-based approach. In an attempt to move from this atom-based scheme to a quantum-mechanical, surface-based approach [13,14,17], a local-properties method has been developed to define properties and interactions at the molecular surface. These local properties are used in statistical models for the prediction of physical properties and biological activities in terms of Coulomb, exchange-repulsion, dispersion, and donor-acceptor interactions. The following describes the local-property/surface-integral approach implemented by the CEPOS InSilico program Parasurf '06 [35], used here to produce QSAR/QSPR models for the octanol-water partition coefficient (logP), the free energy of solvation in water ($\Delta G_{solv}$(H2O)), the free energy of solvation in n-octanol ($\Delta G_{solv}$(oct.)), the acid dissociation constant ($pK_a$) for nitrogenous compounds, the boiling point ($T_b$) for organic compounds, the glass transition temperature ($T_g$) for organic polymers, and water solubility (logS).

2.1.1 Local Molecular Properties

The electrostatic potential at the molecular surface has been examined widely [22,36] as a descriptor of the electronic environment of the molecular surface and has been used to describe the noncovalent interactions possible for a given structure. Murray and Politzer have used the molecular electrostatic potential (MEP, V) and statistical measures derived from it to calculate pharmacological properties by a general interaction properties method [18-23,37]. Tripos' SYBYL [36] uses a calculation of the MEP in comparative molecular-field analyses. Additional local properties defined at the molecular surface have recently been examined as predictors of two-electron donor-acceptor interactions in order to describe intermolecular electronic interactions more completely [24,25].

The molecular electrostatic potential V(r) is defined as the energy of interaction between the molecule and a unit positive point charge placed at a point r on the molecular surface and is described by the equation

$$V(\mathbf{r}) = \sum_{i=1}^{n} \frac{Z_i}{\left\lVert \mathbf{R}_i - \mathbf{r} \right\rVert} - \int \frac{\rho(\mathbf{r}')\,d\mathbf{r}'}{\left\lVert \mathbf{r}' - \mathbf{r} \right\rVert} \tag{2.1}$$

where n is the number of atoms in the molecule, ρ(r) is the electron-density function for the molecule, and $Z_i$ is the nuclear charge of atom i at $\mathbf{R}_i$.

The local ionization potential $IE_L(\mathbf{r})$ [26] is a density-weighted Koopmans' ionization potential [38] at a point r at the surface that describes the tendency of a molecule to interact with electron acceptors (electrophilic reactivity) and is defined by

$$IE_L(\mathbf{r}) = \frac{-\sum_{i=1}^{HOMO} \rho_i(\mathbf{r})\,\varepsilon_i}{\sum_{i=1}^{HOMO} \rho_i(\mathbf{r})} \tag{2.2}$$

where $\rho_i(\mathbf{r})$ is the electron density at r due to molecular orbital i and $\varepsilon_i$ is its eigenvalue.

The local electron affinity $EA_L$ is defined in an analogous Koopmans' formulation using the virtual orbitals and describes the tendency of a molecule to interact with electron donors. It is defined by:

$$EA_L(\mathbf{r}) = \frac{-\sum_{i=LUMO}^{norbs} \rho_i(\mathbf{r})\,\varepsilon_i}{\sum_{i=LUMO}^{norbs} \rho_i(\mathbf{r})} \tag{2.3}$$

The local hardness $\eta_L$ [29] and the local Mulliken electronegativity $\chi_L$ [27] are derived from the two previous properties [24] by:

$$\eta_L = \frac{IE_L - EA_L}{2} \tag{2.4}$$

$$\chi_L = \frac{IE_L + EA_L}{2} \tag{2.5}$$

and represent additional local properties that are readily interpretable in chemical terms.

The local polarizability $\alpha_L$ is an occupation-weighted sum of the orbital polarizabilities over atomic orbitals using Rivail's variational technique [39-43], in which the contribution of each atomic orbital is determined by the electron density of the individual atomic orbital at point r. It is defined by:

$$\alpha_L(\mathbf{r}) = \frac{\sum_{j=1}^{norbs} \rho_j^1(\mathbf{r})\, q_j\, \bar{\alpha}_j}{\sum_{j=1}^{norbs} \rho_j^1(\mathbf{r})\, q_j} \tag{2.6}$$

where $q_j$ is the Coulson occupation, $\bar{\alpha}_j$ is the isotropic polarizability for atomic orbital j, and the density $\rho_j^1$ is defined as the electron density at r due to an exactly singly occupied atomic orbital j. The five local properties used in the following regression models have been shown [25] to be essentially orthogonal, with $\eta_L$ correlating weakly with the local ionization potential.
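The definitions in Eqs. 2.2 to 2.6 are all density-weighted averages and can be sketched in a few lines. The following is an illustration only, not the Parasurf implementation; it assumes that the orbital densities at a single surface point and the orbital eigenvalues are already available as arrays ordered from lowest to highest energy, and the function and argument names are hypothetical.

```python
import numpy as np


def local_energy_properties(rho, eps, n_occ):
    """Local IE, EA, hardness and electronegativity at one surface point.

    rho   : densities rho_i(r) of all molecular orbitals at the point
    eps   : orbital eigenvalues, in the same order as rho
    n_occ : number of occupied orbitals (orbital n_occ is the HOMO)
    """
    rho = np.asarray(rho, dtype=float)
    eps = np.asarray(eps, dtype=float)
    occ, virt = slice(0, n_occ), slice(n_occ, None)
    ie = -np.sum(rho[occ] * eps[occ]) / np.sum(rho[occ])       # Eq. 2.2
    ea = -np.sum(rho[virt] * eps[virt]) / np.sum(rho[virt])    # Eq. 2.3
    eta = (ie - ea) / 2.0                                      # Eq. 2.4
    chi = (ie + ea) / 2.0                                      # Eq. 2.5
    return ie, ea, eta, chi


def local_polarizability(rho1, q, alpha_bar):
    """Occupation-weighted atomic-orbital polarizability, Eq. 2.6.

    rho1      : densities at the point of singly occupied atomic orbitals
    q         : Coulson occupations of the atomic orbitals
    alpha_bar : isotropic polarizabilities of the atomic orbitals
    """
    w = np.asarray(rho1, dtype=float) * np.asarray(q, dtype=float)
    return float(np.sum(w * np.asarray(alpha_bar, dtype=float)) / np.sum(w))


if __name__ == "__main__":
    # Made-up numbers: two occupied and two virtual orbitals.
    print(local_energy_properties([1, 1, 2, 2], [-12, -10, 1, 3], n_occ=2))
    print(local_polarizability([1, 3], [2, 1], [1.0, 2.0]))
```

Because each property is a ratio of sums over the same orbital densities, evaluating all of them at a surface point costs little more than evaluating one.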

2.1.2 Surface-Integral Models

The surface-integral models are defined by the general expression:

$$P = \sum_{i=1}^{ntri} f\left(V^i, IE_L^i, EA_L^i, \alpha_L^i, \eta_L^i\right) \cdot A^i \tag{2.7}$$

where P is the modeled property, f is a nonlinear function of the five local properties, and the summation runs over all ntri triangles that make up the molecular surface. The individual surface properties are taken at the center of each triangle, denoted by the superscript i, with an associated area $A^i$. The function f is determined by multiple regression using pre-calculated sums of component terms as listed in Appendix A, Table A1. The local properties may be fitted to an isodensity or spherical-harmonic surface [39]. When a spherical-harmonic approach is used, the surfaces, as well as the local properties, are fit to a spherical-harmonic expansion of radial distances,

$$r_{\alpha,\beta} = \sum_{l=0}^{N} \sum_{m=-l}^{l} c_l^m\, N_{lm}\, P_l^m(\cos\alpha)\, \cos m\beta \tag{2.8}$$

where $c_l^m$ are the expansion coefficients, $P_l^m(\cos\alpha)$ are Legendre functions, $N_{lm}$ are normalization factors, and l and m are integers ($-l \le m \le l$). The number of harmonics to be used depends on the application. In general, the higher the order of l, the more tightly the surface is fitted to the molecular framework. Spherical-harmonic fitting may only be used with a shrink-wrap surface because the surface properties must be single-valued along any radial vector extending outward from the center.

The local properties are calculated for each of a set of triangles fitted to the surface of the molecule. This set of tesserae may be integrated over the entire surface in order to derive quantitative structure-activity and structure-property models. In this way the local properties, and the properties and activities derived from them, can be mapped to the molecular surface and visualized using molecular visualization software such as GEISHA [44] or Pymol [45] (see Table 2.2).

In addition to the QSAR/QSPR models that may be derived from the surface-integral approach, descriptors based on various statistical features of the local property surfaces may also be used. A set of 40 molecular descriptors derived from the local surface properties is generated by Parasurf for use in statistical models. Models generated using Murray-Politzer-type [18,19,22] statistical descriptors use the general formula:

$$P = f\left(D_1, \ldots, D_{40}\right) \tag{2.9}$$

These statistical descriptors are described in the following table:

Table 2.1 Parasurf '06 statistical descriptor set.

| Description | Descriptor |
|---|---|
| Dipole moment | $\mu$ |
| Dipolar density | $\mu_D$ |
| Molecular electronic polarizability | $\alpha$ |
| Molecular weight | MW |
| Globularity | Glob |
| Molecular surface area | $A$ |
| Molecular volume | Vol |
| Most positive MEP | $V_{max}$ |
| Most negative MEP | $V_{min}$ |
| Mean of positive MEP values | $\bar{V}_+$ |
| Mean of negative MEP values | $\bar{V}_-$ |
| Mean of all MEP values | $\bar{V}$ |
| Range of MEP values | $\Delta V$ |
| Total variance of positive MEP | $\sigma^2_+$ |
| Total variance of negative MEP | $\sigma^2_-$ |
| Total variance in MEP | $\sigma^2_{tot}$ |
| MEP balance parameter | $\nu_{MEP}$ |
| Product of MEP balance and variance | $\sigma^2_{tot}\,\nu_{MEP}$ |
| Maximum ionization potential value | $IE_L^{max}$ |
| Minimum ionization potential value | $IE_L^{min}$ |
| Mean ionization potential value | $\overline{IE}_L = \frac{1}{N}\sum_{i=1}^{N} IE_L^i$ |
| Range of ionization potential | $\Delta IE_L = IE_L^{max} - IE_L^{min}$ |
| Total variance in ionization potential | $\sigma^2_{IE} = \frac{1}{N}\sum_{i=1}^{N}\left[IE_L^i - \overline{IE}_L\right]^2$ |
| Maximum electron affinity value | $EA_L^{max}$ |
| Minimum electron affinity value | $EA_L^{min}$ |
| Mean of positive electron affinity values | $\overline{EA}_{L+} = \frac{1}{N^+}\sum_{i=1}^{N^+} EA_{L+}^i$ |
| Mean of negative electron affinity values | $\overline{EA}_{L-} = \frac{1}{N^-}\sum_{i=1}^{N^-} EA_{L-}^i$ |
| Mean of electron affinity values | $\overline{EA}_L = \frac{1}{N}\sum_{i=1}^{N} EA_L^i$ |
| Range of electron affinity | $\Delta EA_L = EA_L^{max} - EA_L^{min}$ |
| Variance in positive electron affinity | $\sigma^2_{EA+} = \frac{1}{m}\sum_{i=1}^{m}\left[EA_i^+ - \overline{EA}^+\right]^2$ |
| Variance in negative electron affinity | $\sigma^2_{EA-} = \frac{1}{n}\sum_{i=1}^{n}\left[EA_i^- - \overline{EA}^-\right]^2$ |
| Sum of positive and negative variances for EA | $\sigma^2_{EA_{tot}} = \sigma^2_{EA+} + \sigma^2_{EA-}$ |
| EA balance parameter | $\nu_{EA} = \frac{\sigma^2_{EA+} \cdot \sigma^2_{EA-}}{\left[\sigma^2_{EA}\right]^2}$ |
| Fraction of surface with positive EA | $\delta A^+_{EA}$ |
| Mean electronegativity value | $\bar{\chi}_L$ |
| Maximum local polarizability value | $\alpha_L^{max}$ |
| Minimum local polarizability value | $\alpha_L^{min}$ |
| Mean local polarizability value | $\bar{\alpha}_L$ |
| Range of local polarizability | $\Delta\alpha_L$ |
| Variance in local polarizability | $\sigma^2_\alpha$ |
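As an illustration of how the MEP rows of Table 2.1 could be evaluated, the sketch below computes them from a list of MEP values sampled at surface points. It rests on two assumptions that the table does not state for the MEP: that the positive and negative variances and the balance parameter follow the same formulas given explicitly for the electron affinity, and that the total variance is the sum of the positive and negative variances. The function name and the dictionary keys are hypothetical.

```python
import numpy as np


def mep_statistics(v):
    """Murray-Politzer-type statistics of MEP values sampled on a surface.

    Assumes at least one strictly positive and one strictly negative value;
    points with V exactly zero are counted only in the overall mean.
    """
    v = np.asarray(v, dtype=float)
    pos, neg = v[v > 0], v[v < 0]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need both positive and negative MEP values")
    var_pos = float(np.mean((pos - pos.mean()) ** 2))
    var_neg = float(np.mean((neg - neg.mean()) ** 2))
    var_tot = var_pos + var_neg            # assumed: sum of the two parts
    # Balance parameter, by analogy with the EA formula in Table 2.1.
    nu = var_pos * var_neg / var_tot ** 2 if var_tot > 0 else 0.0
    return {
        "V_max": float(v.max()),
        "V_min": float(v.min()),
        "mean_pos": float(pos.mean()),
        "mean_neg": float(neg.mean()),
        "mean_all": float(v.mean()),
        "range": float(v.max() - v.min()),
        "var_pos": var_pos,
        "var_neg": var_neg,
        "var_tot": var_tot,
        "nu": nu,
        "var_tot_nu": var_tot * nu,
    }


if __name__ == "__main__":
    # Made-up MEP samples, not Parasurf output.
    for key, value in mep_statistics([1.0, 3.0, -2.0, -4.0]).items():
        print(key, value)
```

With these definitions the balance parameter reaches its maximum of 0.25 when the positive and negative variances are equal, which is why it is read as a measure of how evenly the surface is divided between the two kinds of region.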

Yet other property models use spherical-harmonic hybridization coefficients as terms in the multiple regression with the same general form. The set of spherical-harmonic terms consists of 100 hybridization coefficients $H_l$: 16 shape hybrids and 21 each of V, $IE_L$, $EA_L$, and $\alpha_L$ hybrids. These are defined by:

$$H_l = \sum_{m=-l}^{l} \left(c_l^m\right)^2 \tag{2.10}$$

When a molecular shape or a local property is fitted to the spherical-harmonic expansion, the shape or property may be described by the hybridization coefficients in an analogous fashion to the linear combination of atomic orbitals (LCAO).
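Equation 2.10 reduces the 2l + 1 expansion coefficients of each order l to a single number. A minimal sketch follows, assuming the coefficients are supplied as one sequence per order, with entry l holding $c_l^m$ for m = -l, ..., l; this storage layout and the function name are assumptions made for illustration, not the Parasurf file format.

```python
import numpy as np


def hybridization_coefficients(coeffs):
    """Return H_l = sum over m of (c_l^m)**2 for each order l (Eq. 2.10).

    coeffs : sequence in which element l holds the 2*l + 1 coefficients
             c_l^m for m = -l, ..., l
    """
    hybrids = []
    for l, c_l in enumerate(coeffs):
        c_l = np.asarray(c_l, dtype=float)
        if c_l.size != 2 * l + 1:
            raise ValueError(f"order {l} needs {2 * l + 1} coefficients")
        hybrids.append(float(np.sum(c_l ** 2)))
    return hybrids


if __name__ == "__main__":
    # Made-up coefficients for orders l = 0, 1, 2.
    example = [[2.0], [1.0, 0.0, -1.0], [0.0, 0.0, 3.0, 0.0, 0.0]]
    print(hybridization_coefficients(example))
```

Summing the squares over m discards the orientation-dependent detail within each order and keeps only its overall weight, which is what allows a fixed-length set of hybrids to serve as regression terms.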

Figure 2.1 Molecular electrostatic potential surface for N-(3-acetylphenyl)-acetamide.

Quantitative structure-property models for several physical properties have been derived using surface-integral methods [31,32], including logP as a measure of hydrophobicity [33,46-48] and solvation free energies. Several surface-integral QSPR models employing the aforementioned local properties are presented here. This treatment is well suited to dealing with the donor-acceptor and dispersion interactions between molecular surfaces that play a significant role in solvation by non-polar solvents [49] and in protein-ligand binding to non-polar residues.
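The surface-integral expression of Eq. 2.7 amounts to an area-weighted sum over the surface triangles. The sketch below assumes that the five local properties at each triangle center and the triangle areas are already available as arrays; the integrand `toy_f` is a hypothetical placeholder and is not one of the fitted models reported in this chapter.

```python
import numpy as np


def surface_integral(props, areas, f):
    """Evaluate Eq. 2.7: sum over triangles of f(local properties) * area.

    props : (ntri, 5) array, one row per triangle center, with columns in
            the order V, IE_L, EA_L, alpha_L, eta_L
    areas : (ntri,) array of triangle areas
    f     : callable taking the five local properties of one triangle
    """
    props = np.asarray(props, dtype=float)
    areas = np.asarray(areas, dtype=float)
    if props.ndim != 2 or props.shape[1] != 5 or props.shape[0] != areas.shape[0]:
        raise ValueError("props must be (ntri, 5) and areas (ntri,)")
    values = np.array([f(*row) for row in props])
    return float(np.sum(values * areas))


def toy_f(V, IE, EA, alpha, eta):
    """Hypothetical integrand used only to exercise the summation."""
    return 0.1 * V ** 2 + eta


if __name__ == "__main__":
    # Two made-up triangles.
    props = [[1.0, 0.0, 0.0, 0.0, 2.0],
             [2.0, 0.0, 0.0, 0.0, 1.0]]
    areas = [1.0, 0.5]
    print(surface_integral(props, areas, toy_f))
```

When f is a linear combination of fixed power and product terms, the area-weighted sum of each term can be computed once per molecule and used as a regression descriptor, which matches the use of pre-calculated sums of component terms described for Eq. 2.7.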

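Table 2.2 Local property surfaces for paracetamol calculated with Parasurf. [Six surface images; the two values printed with each image are listed below.]

| Local property | Lower value | Upper value |
|---|---|---|
| Electron Affinity | -147 | 0 |
| Electronegativity | 93 | 329 |
| Hardness | 200 | 432 |
| Ionization Potential | 320 | 761 |
| Molecular Electrostatic Potential | -5424 | 1907 |
| Molecular Polarizability | 159 | 346 |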

2.2 Methods

Structures for the data sets assembled from the literature were converted from 2D structures to 3D MDL SD files using Molecular Networks' CORINA [50,51]. The molecular geometries were then optimized with the AM1 Hamiltonian [52] using VAMP 9.0 [53]. In cases where the addition of d-orbitals improved the overall structure, the AM1* Hamiltonian [54] was used for optimization, followed by a single-point AM1 calculation in order to retrieve essential polarizability data. The five local surface properties were calculated for each structure by Parasurf '06 [35] for either an electron isodensity surface or a spherical-harmonic-fitted surface, generated by either a marching-cube or a shrink-wrap algorithm, as indicated. Molecular electrostatic potentials were calculated using the zero-differential-overlap-based atomic multipole technique, and the local ionization energy, electron affinity, and polarizability were calculated as described previously [40-43,55,56].

Multiple regression models were generated with Tsar 3.3 [57] using functions of powers and products of the five local properties. One hundred fifty nonlinear product and power terms of the local properties were generated by script and used as descriptors in a multiple regression routine (Appendix A, Table A1). The multiple regressions were performed with the Leave Out Groups of Three cross-validation method, using an F-to-enter value of 4.0 and an F-to-leave value of 3.9, and excluding variables with a cross-correlation greater than 0.9. A Leave One Out method is often used with the multiple regression routine, yielding predictive $r^2_{cv}$ values that may be very close to the corresponding $r^2$ values. However, the authors of Tsar consider it a better measure of predictivity in the case of stepwise regression to leave out what amounts to a third of the data, to be predicted by the remaining two-thirds (Tsar reference guide). Individual terms used in the multiple regressions that cross-correlated with R > 0.86 were excluded from the regression.
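The cross-validation and cross-correlation steps described above can be illustrated with ordinary least squares. The sketch below is not the Tsar procedure: it omits the stepwise F-to-enter and F-to-leave selection, assigns compounds to the three groups by position, and assumes the common definition of the cross-validated coefficient as one minus the ratio of the predictive residual sum of squares to the total sum of squares. All function names are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients for y = X b + intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply coefficients from fit_ols to new descriptor rows."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def r2_cv_groups(X, y, n_groups=3):
    """Leave-groups-out r2cv: each group is predicted by the other groups."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.arange(len(y))
    press = 0.0
    for g in range(n_groups):
        test = idx % n_groups == g          # every n_groups-th compound
        coef = fit_ols(X[~test], y[~test])
        press += np.sum((y[test] - predict(coef, X[test])) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def drop_cross_correlated(X, threshold=0.9):
    """Indices of columns kept after removing later columns whose absolute
    correlation with an already kept column exceeds the threshold."""
    X = np.asarray(X, dtype=float)
    r = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(r[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep


if __name__ == "__main__":
    # Made-up descriptors and an exactly linear target.
    X = np.array([[0, 1], [1, 0], [2, 3], [3, 1], [4, 4],
                  [5, 2], [6, 0], [7, 5], [8, 1]], dtype=float)
    y = 2.0 * X[:, 0] - X[:, 1] + 1.0
    print(r2_cv_groups(X, y))
```

Leaving out a third of the data at a time gives each fit less information than leave-one-out does, so the resulting coefficient is typically the more conservative estimate of predictivity.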
The surface-integral models obtained using the marching-cube method of generating isodensity surfaces were fitted at an isodensity value of 0.008 e/Å³ (corresponding approximately to a van der Waals surface). For the regression models using the shrink-wrap method of generating surfaces, including those models employing spherical-harmonic coefficients as regression terms, the local properties were fit to spherical harmonics at an isodensity value of 0.0002 e/Å³, which, for a spherical-harmonic fit, is approximately the van der Waals surface. The set of spherical-harmonic terms consists of 100 hybridization coefficients: 16 shape hybrids and 21 each of V, $IE_L$, $EA_L$, and local polarizability hybrids. These were also used as terms to generate linear regression models.

In this chapter, the statistical measure of the fit of the regression models to the surface data, the regression coefficient $r^2$, is presented below the plots of experimental and calculated physical property values. Measures of the predictive capacity of the models are expressed as $r^2_{cv}$, the cross-validated regression coefficient, the mean unsigned error (MUE), and the root-mean-square error (RMSD) of the predictions.
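The fit and error measures named above can be computed from paired experimental and calculated values as in the following sketch. It assumes that $r^2$ is the squared Pearson correlation between the two series and that MUE and RMSD are the mean absolute and root-mean-square deviations; the function name is hypothetical and the numbers in the example are made up.

```python
import numpy as np


def fit_statistics(y_exp, y_calc):
    """Return (r2, MUE, RMSD) for paired experimental/calculated values."""
    y_exp = np.asarray(y_exp, dtype=float)
    y_calc = np.asarray(y_calc, dtype=float)
    resid = y_calc - y_exp
    r2 = float(np.corrcoef(y_exp, y_calc)[0, 1] ** 2)   # squared Pearson r
    mue = float(np.mean(np.abs(resid)))                 # mean unsigned error
    rmsd = float(np.sqrt(np.mean(resid ** 2)))          # root-mean-square error
    return r2, mue, rmsd


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(fit_statistics([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5]))
```

Reporting MUE and RMSD together is informative because the RMSD weights large deviations more heavily, so a gap between the two points to a few poorly predicted compounds.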

2.3 Results

2.3.1 Octanol/Water Partition Coefficient

The n-octanol/water partition coefficient (here, logP) data set consists of 168 structures assembled from the literature [58-60], with values ranging from -3.64 to 8.23 logP units (Appendix A, Table A2). The surface-integral model for logP, derived by multiple regression using the set of 150 property terms (calculated on the marching-cube surface) as starting variables and neutral structures (including zwitterionic amino acids), yielded an 8-term regression equation and represents the best model to date. The regression equation is given by:

5 33 -6 2 −8 fP(log )()rr=×⋅+×⋅− 1.6967 10 ⎣⎦⎡⎤V() 4.6367 10 ⎣⎦⎡⎤V() r0.25768 ⋅ ⎣⎡α L ()r⎦⎤ 5 2 −−13 2 6 −×⋅⋅5.2448 10 ⎣⎦⎡⎤VIE()rrLL () +4.4222 ×⋅⋅ 10 ⎣⎡αη()rrL ()⎦⎤ −16 2 +×⋅⋅7.7213 10 ⎣⎦⎡⎤VIE()rrrLL () ⋅η () (2.11) 3 −6 2 −×⋅⋅1.5978 10 ⎣⎦⎡⎤VEA()rrrLL () ⋅α () 5 −10 2 +×⋅⋅1.4233 10 ⎣⎦⎡⎤VEA()rrL () ⋅α L ()r + 0.1784

Figure 2.2 Experimental and calculated values of logP for the test set: N=168, MUE=0.227, RMSD=0.500, $r^2$=0.797, $r^2_{cv}$=0.685. [Scatter plot: calculated logP_OW versus experimental logP_OW.]

In a prior model, the amino acids phenylalanine and tryptophan were represented in their uncharged forms, which resulted in a 7-term regression equation:

-6 23−−8 10 3 fP(log )()rr=×⋅+×⋅−×⋅ 6.2390 10 ⎡⎤⎣⎦V() 2.7378 10 ⎡⎤⎣⎦V() r2.2779 10 ⎣⎡IEL ()r⎦⎤ 3 3 −−752 −×⋅⋅5.0736 10 ⎣⎦⎡⎤EALL()rrαα () −×⋅⋅1.6563 10 ⎣⎦⎡⎤LL()rrη () 3 −11 2 (2.12) −×⋅⋅2.0941 10 ⎣⎦⎡⎤VIEEA()rrLL () ⋅ () r 3 −24 2 −×⋅⋅⋅8.5026 10 ⎣⎦⎡⎤VIE()rrrLL ()η () +0.3042

As can be seen by comparing Figures 2.2 and 2.3, the regression statistics and the clustering of points improve slightly with the use of the zwitterionic forms of these amino acids.

Figure 2.3 Neutral logP set with non-zwitterionic amino acids: N=168, $r^2$=0.782, $r^2_{cv}$=0.656, MUE=0.238, RMSD=0.516. [Scatter plot: calculated logP_OW versus experimental logP_OW.]

Another regression model was generated for the same set using ionized structures for those ionized >50% at pH=7.0 as calculated by pKa, giving the 10-term equation:

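$$
\begin{aligned}
f(\log P)(\mathbf{r}) ={}& -8.3660\times10^{-5}\,\left[V(\mathbf{r})\right]^{3/2} + 5.4673\times10^{-7}\,\left[V(\mathbf{r})\right]^{5/2} + 2.2713\times10^{-8}\,\left[V(\mathbf{r})\right]^{3} \\
& + 4.4369\times10^{-3}\, IE_L(\mathbf{r}) - 5.2747\times10^{-3}\, EA_L(\mathbf{r}) \\
& - 6.0514\times10^{-14}\,\left[V(\mathbf{r})\cdot IE_L(\mathbf{r})\right]^{5/2} - 4.4293\times10^{-18}\,\left[IE_L(\mathbf{r})\cdot\eta_L(\mathbf{r})\right]^{3} \\
& + 1.4686\times10^{-5}\,\left[EA_L(\mathbf{r})\cdot\alpha_L(\mathbf{r})\right]^{2} + 9.7195\times10^{-7}\, EA_L(\mathbf{r})\cdot\eta_L(\mathbf{r}) \\
& + 1.2682\times10^{-8}\, V(\mathbf{r})\cdot IE_L(\mathbf{r})\cdot EA_L(\mathbf{r}) + 0.02807
\end{aligned} \tag{2.13}
$$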

Figure 2.4 Surface-integral model for logP using compounds charged by pKa at pH=7: $r^2$=0.729, $r^2_{cv}$=0.145, MUE=0.252, RMSD=0.576. [Scatter plot: calculated logP_OW versus experimental logP_OW.]

Trifluopromazine is an outlier in this model, and the $r^2_{cv}$ statistic improves significantly with its removal, as do the MUE and RMSD. The resulting model predicted poorly, however, exhibiting a negative value for $r^2_{cv}$, so the number of cross-validation sets was reduced from ten (standard for these models) to six in order to generate a model with better statistics. This gives the 9-term equation:

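$$
\begin{aligned}
f(\log P)(\mathbf{r}) ={}& -1.0814\times10^{-4}\,\left[V(\mathbf{r})\right]^{3/2} + 2.2696\times10^{-7}\,\left[V(\mathbf{r})\right]^{5/2} \\
& + 9.0847\times10^{-9}\,\left[V(\mathbf{r})\right]^{3} + 7.0265\times10^{-3}\, IE_L(\mathbf{r}) - 1.5322\times10^{-2}\, EA_L(\mathbf{r}) \\
& + 4.0126\times10^{-5}\,\left[EA_L(\mathbf{r})\right]^{3/2} + 1.1769\times10^{-6}\,\left[V(\mathbf{r})\cdot\alpha_L(\mathbf{r})\right]^{5/2} \\
& - 7.1156\times10^{-18}\,\left[IE_L(\mathbf{r})\cdot\eta_L(\mathbf{r})\right]^{3} + 1.1316\times10^{-8}\, V(\mathbf{r})\cdot IE_L(\mathbf{r})\cdot EA_L(\mathbf{r}) - 0.08667
\end{aligned} \tag{2.14}
$$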

Figure 2.5 "Charged" logP model with outlier removed: $r^2$=0.735, $r^2_{cv}$=0.573, MUE=0.151, RMSD=0.437. [Scatter plot: calculated logP_OW versus experimental logP_OW.]

This greatly improves the predictivity of the model, while improving the regression statistic only slightly. The use of charged structures diminished the predictivity of the models, and it was decided that their inclusion in these QSAR/QSPR models was not useful. Because the local properties are calculated in the gas phase, where no solvent shielding can occur, the impact of ionization on target values as derived from the regression models might be exaggerated.

A model using the 40 statistical descriptors as starting variables yielded a 10-term equation:

+ fP(log )()r =− 0.5690 ⋅+μρμα 43.14 ⋅ ( ) + 0.1467 ⋅− 0.1577 ⋅MEP − (2.15) +⋅−⋅10.45ν MEP 0.0130 max()IEL +⋅−⋅ 0.1397EA 0.0342 χ +⋅7.056 min(αα ) −⋅+ 14.84 38.72

[Plot: calculated vs. experimental logPOW]

Figure 2.6 Linear regression model for logP using statistical descriptors: N = 168, MUE = 0.775, RMSD = 0.996, r² = 0.743, r²cv = 0.635.

It is evident that, although the regression statistics are comparable to the best nonlinear model, the predictive capacity of this model is not quite as good, with a root mean square error of nearly 1 logP unit.

The regression model for logP using spherical-harmonic hybridization coefficients is comprised of 11 terms:

14 1 3 fP(log )()r =⋅+⋅−⋅−⋅ 0.8771 HRR 1.389 H 0.0459 H MEPM 0.0584 HEP 4 9 10 11 −⋅−⋅+⋅+⋅0.1117HHHMEP 0.2729MEP 0.0404IEL 0.1722 HEAL (2.16) 21 1 +⋅−6.808HHEAL 8.842 ⋅−α 5.786

[Plot: calculated vs. experimental logPOW]

Figure 2.7 Regression model for logP using spherical-harmonic hybridization coefficients: N = 168, MUE = 0.756, RMSD = 0.966, r² = 0.759, r²cv = 0.516.

The surface-integral model using a spherical harmonic-fitted surface gives the 4-term equation:

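$$
\begin{aligned}
f(\log P)(\mathbf{r}) ={}& -5.668\times10^{-3}\,V(\mathbf{r}) + 2.345\times10^{-1}\,[\alpha_L(\mathbf{r})]^{3/2} - 2.154\times10^{-1}\,[\alpha_L(\mathbf{r})]^{5/2} \\
&+ 9.453\times10^{-17}\,[IE_L(\mathbf{r})\cdot EA_L(\mathbf{r})]^{3}
\end{aligned}
\tag{2.17}
$$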

Overall, the model with the best regression coefficient and best predictivity in terms of r²cv, MUE, and RMSD was the surface-integral model that used the local properties from the marching-cube surface and included the amino acids in their zwitterionic form. There was not much variation among the several models in the r² fit of the surface properties, although the RMS error varied by nearly half a logP unit.

[Plot: calculated vs. experimental logPOW]

Figure 2.8 Surface-integral model for logP using spherical-harmonic-fitted surface: N = 162, MUE = 0.770, RMSD = 0.963, r² = 0.745, r²cv = 0.662.

2.3.2 Free Energy of Solvation

2.3.2.1 Free Energy of Solvation in Octanol

The surface-integral model for the free energy of solvation in n-octanol (ΔGsolv.(oct.)) was generated using the 165 compounds in Table A3 of Appendix A, taken from Ehresmann, et al.34. The resulting regression equation comprises 17 terms:


3 fG(Δ()oct )rrr = 1.3705 ×⋅ 10-2 V −3.9476 ×⋅ 10-4 ⎡⎤V 2 −6.8874 ×⋅ 10-2 α r solv () () ⎣⎦() L ()

-16 3 -3 +×⋅⋅5.5092 10 ⎣⎦⎡⎤VIE()rrLL () +×⋅⋅1.0796 10 VEA()r () r

-4 -13 3 +1.1937 ×⋅⋅+ 10 ⎣⎦⎡⎤VEA()rrLL () 1.1179 ×⋅⋅ 10 ⎣⎦⎡⎤VEA()rr () 3 3 +×⋅⋅3.5384 10-4 ⎡⎤V rrα 2 −×5.2971 10-9 ⋅⋅IE rrα ⎣⎦()L () ⎣⎦⎡⎤LL() () -17 33-21 −×⋅⋅1.6949 10 ⎣⎦⎡⎤IELL()rrη () +3.9527 ×⋅⋅⋅ 10 ⎣⎡V()r IELL () r EA () r⎦⎤ (2.18) -5 -8 −×⋅⋅3.6011 10 ⎣⎦⎣⎡⎤⎡VIE()rrrLL () ⋅αη () +4.7541 ×⋅⋅ 10 VIE()rrrLL () ⋅ ()⎦⎤ 3 2.8234 10-12 ⎡⎤VIErrr2 +×⋅⋅⎣⎦()LL () ⋅η () -4 −×⋅⋅2.7129 10 ⎣⎦⎡⎤VEA()rrrLL () ⋅α () 5 6.5137 10-17 ⎡⎤VEArrr2 −×⋅⋅⎣⎦()LL () ⋅η () 5 +×⋅⋅2.0072 10-16 ⎡⎤VEArrrr ⋅⋅αη 2 +0.3585 ⎣⎦()LLL () () ()

[Plot: calculated vs. experimental ΔGsolv(octanol) (kcal·mol-1)]

Figure 2.9 Experimental and calculated free energies of solvation in octanol for the training set: N = 165, MUE = 0.569, RMSD = 0.713, r² = 0.914, r²cv = 0.798.

In order to examine the effect of optimal conformation in solution, the structures from the same data set were geometry-optimized (AM1) using the conductor-like screening model61 (COSMO) of Klamt and Schüürmann with a bulk dielectric constant (EPS) of 10. The surface-integral model using these structures yields an 11-term equation:

−−31 fG(Δ=×⋅−×⋅solv ()oct )(rrr) 5.0064 10 IEL ( ) 5.5198 10 α L ( )

−−132 +×⋅3.6463 10 ⎣⎦⎡⎤α LL()rr +2.7085 ×⋅⋅ 10 VEA() ()r 3 +×⋅⋅2.3985 10−−57⎡⎤VEArr −7.3706 ×⋅⋅ 10 ⎡VEArr⎤ 2 ⎣⎦()LL () ⎣ () ()⎦ −−13 3317 (2.19) +×⋅⋅1.1133 10 ⎣⎡VEA()rrLL ()⎦⎣⎤⎡ −×⋅⋅1.8658 10 IE()rrηL ()⎦⎤ −5 −×1.2607 10 ⋅⋅⎣⎦⎡⎤VIE()rrrLL () ⋅α () −8 +×⋅⋅⋅1.8956 10 ⎣⎦⎡⎤VIE()rrrLL ()η () 5 +×⋅⋅1.2349 10−16 ⎡⎤VEArrrr ⋅⋅αη 2 +0.0834 ⎣⎦()LLL () () ()

[Plot: calculated vs. experimental ΔGsolv(octanol) (kcal·mol-1)]

Figure 2.10 Surface-integral model for free energy of solvation in octanol using the COSMO-optimized training set: N = 165, MUE = 0.648, RMSD = 0.841, r² = 0.875, r²cv = 0.816.

The use of the COSMO-optimized structures reduces the predictivity of the surface-integral model in terms of the mean unsigned error and RMS error, as seen in Figure 2.10.


A regression model using spherical-harmonic hybrid coefficients yields an 18-term equation:

11 4 5 fG(Δ=solv ( oct))(r) −⋅+⋅−⋅−1.003HHHHR 0.058MEP 0.071MEP 0.143⋅MEP 711181 −⋅−0.156H MEP 0.557 ⋅−⋅−HHMEP 1.670MEP 0.024 ⋅HIEL 2467 −⋅+⋅+⋅+⋅0.013HHHHIEL 0.010IEL 0.015IEL 0.018 IEL (2.20) 11 17 12 16 −⋅+⋅+⋅−0.042HHHIEL 0.127IEL 0.169EAL 0.262⋅ H EAL 13 −⋅+⋅+4.943HHαα 8.343 53.187

[Plot: calculated vs. experimental ΔGsolv(octanol) (kcal·mol-1)]

Figure 2.11 Experimental and calculated free energies of solvation in octanol using hybrid coefficients: N = 165, MUE = 0.636, RMSD = 0.813, r² = 0.889, r²cv = 0.704.

This model has similar statistics to the surface-integral model using the marching-cube surface, but the RMS error increases by 0.1 kcal·mol-1. The surface-integral model using a spherical-harmonic-fitted surface yields the 12-term equation:


−2 2 fG(Δ=solv ()oct )()rr−⋅+×⋅ 0.1315ααL () 8.01 10 ⎣⎦⎡⎤L () r

−−14 3 5 −×⋅2.195 10 ⎣⎦⎡⎤VIE()rr ⋅LL () +8.852 ×⋅⋅ 10 VEA()r () r 5 5 −×⋅3.266 10−−15 IE rr ⋅ηα2 +4.793 ×⋅ 10 6 ⎡ EA rr ⋅ ⎤ 2 ⎣⎦⎡⎤LL() () ⎣ LL() ()⎦ +×⋅⋅2.059 10−8 VIE()rrr () ⋅η () LL −4 −×⋅⋅1.413 10 VEA()rrrLL () ⋅α () (2.21) −11 3 +×7.322 10 ⋅⋅⎣⎦⎡⎤VEA()rrrLL () ⋅α () −24 3 +×⋅7.020 10 ⎣⎦⎡⎤IELLL()rrr ⋅ EA () ⋅η ()

−5 +×⋅9.707 10 VEA()rrrr ⋅LLL () ⋅αη () ⋅ ()

3 −×⋅⋅1.792 10−9 ⎡⎤VEArrrr ⋅αη ⋅2 +3.126 ⎣⎦()LLL () () ()

[Plot: calculated vs. experimental ΔGsolv(octanol) (kcal·mol-1)]

Figure 2.12 Surface-integral model for free energy of solvation in octanol using spherical-harmonic surface: N = 165, MUE = 0.719, RMSD = 0.924, r² = 0.865, r²cv = 0.729.

Here again, the regression statistics are not quite as good for the surface-integral model employing a spherical-harmonic surface as compared with the case of the marching-cube surface. There is also an accompanying increase in RMS error of ~0.2 kcal·mol-1.


2.3.2.2 Free Energy of Solvation in Water

The data set presented in Table A4 of Appendix A for the free energy of solvation in water (ΔGsolv.(H2O)) was assembled from 384 compounds in the literature62-64. The regression equation for the free energy of hydration surface-integral model comprises 21 terms:

-2 -5 2 fG(Δ=×⋅−×⋅solv ()HO2 )()rrr 2.0192 10 V() 4.3422 10 ⎣⎦⎡⎤V()

-9 3 −×6.7859 10 ⎣⎦⎡⎤IELL()rr +0.3486 ⋅α () 5 3 +×⋅8.9213 10-7 ⎡⎤η rr2 −×⋅⋅1.2382 10-16 VIEr ⎣⎦LL() ⎣⎦⎡⎤() () 5 −×⋅⋅2.2924 10-2 VVrrαα −×⋅⋅3.1783 10-6 ⎡⎤rr2 ⎣⎦⎡⎤()LL () ⎣⎦() () -15 3 +×⋅⋅1.4154 10 ⎣⎦⎡⎤IELL()r EA ()r 5 5 −×⋅⋅5.0232 10-7 IE rrαα2 −5.7349 ×⋅⋅ 10-5 ⎡ EA rr⎤ 2 ⎣⎦⎡⎤LL() () ⎣ LL() ()⎦ -15 3 −×⋅⋅2.5633 10 ⎣⎦⎡⎤EALL()rrη () 3 −×⋅⋅⋅1.1298 10-22 ⎡⎤VIEEA()rr () () r ⎣⎦LL (2.22) -10 2 +×⋅⋅⋅8.3577 10 ⎣⎦⎡⎤VIE()rrrLL ()α () -8 +×⋅⋅⋅3.3400 10 ⎣⎦⎡⎤VIE()rrrLL ()η ()

-3 −×⋅⋅5.7742 10 VEA()rrrLL () ⋅α ()

-5 +×⋅⋅1.5065 10 ⎣⎦⎡⎤IELLL()rrr EA () ⋅α () 5 +×⋅⋅1.5836 10-11 ⎡⎤IErrr EA ⋅α 2 ⎣⎦LLL() () ()

-14 3 −×⋅⋅2.4093 10 ⎣⎦⎡⎤IELLL()rrr EA () ⋅α () 3 −×⋅⋅7.9139 10-12 ⎡⎤IErrr EA ⋅η 2 ⎣⎦LLL() () ()

-20 3 +×⋅⋅1.2174 10 ⎣⎦⎡⎤VEA()rrrrLLL () ⋅⋅αη () () −0.2167

[Plot: calculated vs. experimental ΔGsolv(H2O) (kcal·mol-1), axes from -100.0 to 10.0]

Figure 2.13 Experimental and calculated free energies of solvation in water for the training set given in Table A4: N = 384, MUE = 0.727, RMSD = 1.503, r² = 0.983, r²cv = 0.825.

As can be seen in Figure 2.13, the predictivity suffers somewhat from the inclusion of the charged species, resulting in a lever effect on the regression such that the whole set of structures cannot be fitted with the same robustness as either the charged or uncharged portions. Using only the neutral compounds (N=362) from the data set (Appendix A, Table A4, rows 1-362) in a surface-integral model results in a 17-term equation:


−8 3 fG(Δ=solv ()HO2 )()rrr−×⋅⋅−⋅ 8.385 10⎣⎦⎡⎤V() 0.0194IELL() 1.232 α ()r 2 3 +⋅1.032 α rr −×⋅7.411 10−7 ⎡⎤VEA ⋅r2 ⎣⎦⎡⎤LL() ⎣⎦() () −−13 325 +×⋅⋅3.829 10 ⎣⎦⎡⎤VEA()rrLL () −×⋅⋅2.527 10 ⎣⎡V()rrα ()⎦⎤ −−71335 +×⋅⋅1.329 10 ⎣⎦⎡⎤VV()rrαηLL () +5.618 ×⋅ 10 ⎣⎡()rr ⋅ ()⎦⎤ −17 3 −×4.288 10 ⋅⋅⎣⎦⎡⎤IELL()rrη () −7 −×⋅⋅1.828 10 ⎣⎦⎡⎤VIEEA()rrLL () ⋅ () r 5 (2.23) −×⋅7.725 10−18 ⎡⎤VIEEArr ⋅ ⋅ r2 ⎣⎦()LL () () −6 −×⋅⋅9.213 10 ⎣⎦⎡⎤VIE()rrrLL () ⋅α () 3 +×⋅3.365 10−12 ⎡⎤VIErrr ⋅ ⋅η 2 ⎣⎦()LL () () 3 +×⋅1.550 10−6 ⎡⎤VEArrr ⋅ ⋅α 2 ⎣⎦()LL () () 3 −×⋅2.357 10−12 ⎡⎤VEA()rrr ⋅ () ⋅α () ⎣⎦LL −7 +×9.339 10 ⋅⋅⎣⎦⎡⎤VEA()rrrrLLL () ⋅⋅αη () () −0.5539

[Plot: calculated vs. experimental ΔGsolv(H2O) (kcal·mol-1)]

Figure 2.14 Experimental and calculated free energies of solvation in water for the uncharged components of the training set given in Table A4: N = 362, MUE = 0.789, RMSD = 1.031, r² = 0.891, r²cv = 0.845.
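The lever effect noted for the full training set (Figure 2.13) can be illustrated numerically. The following is a minimal sketch on synthetic data, not the thesis data: 362 "neutral" values spread over roughly -10 to 5 kcal·mol-1 and 22 "charged" values placed far away at -100 to -50 kcal·mol-1, both predicted with the same random error. The value ranges, the noise level, and the use of the squared Pearson correlation as r² are all assumptions made for illustration.

```python
import numpy as np


def r2_and_rmsd(y_exp, y_calc):
    """Squared Pearson correlation and root-mean-square deviation."""
    y_exp = np.asarray(y_exp, dtype=float)
    y_calc = np.asarray(y_calc, dtype=float)
    rmsd = float(np.sqrt(np.mean((y_exp - y_calc) ** 2)))
    r2 = float(np.corrcoef(y_exp, y_calc)[0, 1] ** 2)
    return r2, rmsd


rng = np.random.default_rng(1)
noise_sd = 1.0  # identical prediction error for both groups (assumed)

# Hypothetical experimental values: a compact neutral group and a distant charged group.
neutral_exp = rng.uniform(-10.0, 5.0, size=362)
charged_exp = rng.uniform(-100.0, -50.0, size=22)
neutral_calc = neutral_exp + rng.normal(0.0, noise_sd, size=362)
charged_calc = charged_exp + rng.normal(0.0, noise_sd, size=22)

r2_neutral, rmsd_neutral = r2_and_rmsd(neutral_exp, neutral_calc)
r2_all, rmsd_all = r2_and_rmsd(
    np.concatenate([neutral_exp, charged_exp]),
    np.concatenate([neutral_calc, charged_calc]),
)
print(f"neutral only: r2={r2_neutral:.3f}, RMSD={rmsd_neutral:.3f}")
print(f"all points:   r2={r2_all:.3f}, RMSD={rmsd_all:.3f}")
```

With equal errors in both groups, the combined set still shows the higher r², because the distant points widen the range of the target values. This is why r² alone can make a model spanning charged and neutral species look better than its error statistics suggest.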


This gives a model with similar regression statistics, but with an RMS error improved by about 0.5 kcal·mol-1. When this data set, minus two outliers, was geometry-optimized with the COSMO solvation model (EPS=80.0), the result was the 13-term equation:

2 fG(Δ=⋅−⋅+⋅solv ()HO2 )()rrrr 0.0165 IEL () 0.8593ααL () 0.4845 ⎡⎤⎣⎦L () 3 +×⋅8.100 10−−47VEArr ⋅ −4.357 ×⋅ 10 ⎡VEArr ⋅ ⎤ 2 ()LL () ⎣ () ()⎦

−17 3 +⋅⋅0.0152 ⎣⎦⎡⎤VI()rrαηLL () −×⋅⋅4.485 10 ⎣⎡E()rrL () ⎦⎤ −5 −×⋅⋅5.086 10 ⎣⎦⎡⎤VIE()rrrLL () ⋅α ()

−8 +×⋅4.071 10 ⎣⎦⎡⎤VIE()rrr ⋅LL () ⋅η () (2.24) −5 +9.077×⋅ 10 ⎣⎦⎡⎤VEA()rrr ⋅LL () ⋅α () −24 +×⋅9.522 10 ⎣⎦⎡⎤IELLL()rrr ⋅ EA () ⋅η () −16 3 +×⋅3.271 10 ⎣⎦⎡⎤IELLL()rrr ⋅ EA () ⋅η () 5 +×⋅5.327 10−17 ⎡⎤VEArrrr ⋅ ⋅αη ⋅2 −0.9146 ⎣⎦()LLL () () ()

[Plot: calculated vs. experimental ΔGsolv(H2O) (kcal·mol-1)]

Figure 2.15 Experimental and calculated free energies of solvation in water for the uncharged components using the COSMO solvation model (EPS=80.0): N = 360, MUE = 0.922, RMSD = 1.139, r² = 0.862, r²cv = 0.805.


Thus, the use of structures geometry-optimized with the COSMO model again reduces the predictivity, here by 0.1 kcal·mol-1 in RMS error. The best regression model using spherical-harmonic hybridization coefficients yielded a 21-term equation with poor regression statistics and is not presented here (MUE=1.47, RMSD=2.14). The surface-integral model using a spherical-harmonic-fitted surface is defined by the 18-term equation:

−−4623 fG(Δ=×⋅+×⋅solv ()HO2 )()rr 1.545 10 ⎣⎦⎡⎤V() 8.236 10 ⎣⎦⎡⎤V() r 5 +×⋅5.710 10−−37IE rr −5.879 ×⋅ 10 ⎡⎤EA 2 LL() ⎣⎦() −13 3 −⋅0.247α LL()rr −×⋅ 1.651 10 ⎣⎦⎡⎤VIE() ⋅ ()r 3 +×⋅⋅1.054 10−−46VEArr −×⋅1.043 10 ⎡⎤VEArr ⋅ 2 ()LL () ⎣⎦() () 3 5 4.746 10−12 VEArr5.11310×⋅−5 ⎡V rr ⋅α ⎤ 2 +×⋅⎣⎦⎡⎤() ⋅L () + ⎣ ()L ()⎦ 3 5 −×⋅⋅7.387 10−−61⎡⎤VI()rrαη () −6.237 ×⋅ 10 5 ⎡E()rr ⋅ ()⎤2 ⎣⎦LL ⎣L⎦(2.25) −6 3 −×⋅1.639 10 ⎣⎦⎡⎤EALL()rr ⋅α () 3 3.699 10−11 ⎡⎤VIErrrη 2 −×⋅⎣⎦() ⋅LL () ⋅ () −4 −×⋅⋅3.319 10 VEA()rrrLL () ⋅α () 3 5.071 10−6 ⎡⎤VEArrrα 2 +×⋅⎣⎦() ⋅LL () ⋅ () 3 +×⋅1.036 10−10 ⎡⎤VEArrr ⋅ ⋅α ⎣⎦()LL () () −17 3 −×⋅2.508 10 ⎣⎦⎡⎤VEA()rrrr ⋅LLL () ⋅αη () ⋅ () +1.003

[Plot: calculated vs. experimental ΔGsolv(H2O) (kcal·mol-1)]

Figure 2.16 Surface-integral model for free energy of solvation in water for the uncharged structures and using the spherical-harmonic-fitted surface: N = 360, MUE = 0.929, RMSD = 1.18, r² = 0.842, r²cv = 0.781.

Again, the spherical-harmonic-fitted surface manages to affect the local property space such that predictivity is decreased (~0.2 kcal·mol-1) and the model with the best predictivity was the surface-integral model using the marching-cube surface. With this appearing to be a trend, the surface-integral models presented hereafter are comprised only of the marching-cube surface-fitted local properties.

2.3.3 Acid Dissociation Constant

A surface-integral model was generated for pKa using the data set in Table A5 of Appendix A, consisting of 268 nitrogenous compounds taken from the article by Tehan, et al.65 on pKa estimation. The set comprises primary and secondary amines, anilines, and pyridines. The regression equation has 23 terms:


3 −−−2342 fpK()()a rrr=−×⋅6.979 10V ( ) + 6.469 ×⋅ 10V ( ) + 3.829 ×⋅⎡⎤ 10⎣V (r ) ⎦ 5 −−632 −×⋅⎡⎤+×⋅4.278 10⎣⎦VI (rr ) 8.326 10EL ( ) 3 5 −6 2 −×⋅1.552 10[]IELL (rr ) − 4.6127 ⋅ []α ( )

−−15 3 3 +×⋅⋅3.0124 10[]VIE (rr )LL ( ) +×⋅⋅ 7.818 10VEA (r ) ( r ) 35 −7 22−11 −×⋅⋅7.530 10⎡⎤+×⋅⎣⎦V (r ) EALL(rr ) 9.208 10⎡⋅ ⎣⎦V ( ) EA (r ) ⎤ −−13 3 2 +×⋅⋅2.379 10[]VEA (rr )LL ( ) −×⋅⋅ 4.549 10 []V (rr )α ( ) 3 −−372 3 +×⋅⎡⋅1.522 10⎣⎦VEA (rr )LL ( ) ⎤−×⋅⋅ 5.233 10[]V (rr )α ( ) 5 (2.26) −7 2 +×⋅9.698 10[]IELL (rr ) ⋅α ( ) −5 +×⋅⋅9.348 10[]VIE (rrr )LL ( ) ⋅α ( ) −8 −4.13110×⋅[]VIE ()rrr ⋅LL () ⋅η () 5 −19 2 +×⋅⎡⋅2.53410⎣⎦VIE ()rrrLL () ⋅η () ⎤ −23 3 −×⋅⋅7.872 10[]VIE (rrr )LL ( ) ⋅η ( )

−12 3 +×⋅⋅6.73210[]VEA ()rrrLL () ⋅α ()

−13 2 −×⋅⋅9.21210[]VEA ()rrrrLLL () ⋅αη () ⋅ ()

−19 3 −×⋅⋅5.94710[]VEA ()rrrrLLL () ⋅αη () ⋅ () + 4.9512

[Plot: calculated vs. experimental pKa]

Figure 2.17 Experimental and calculated pKa values for the training set: N = 268, MUE = 1.03, RMSD = 1.339, r² = 0.841, r²cv = 0.767.


The authors performed separate regressions for each class of nitrogenous compound (i.e. amines, anilines, pyridines, etc.) and report regression statistics for each class. These values range from the low values of r² = 0.55, r²cv = 0.54 for nitrogenous heterocycles (N=150) to high values of r² = 0.94, r²cv = 0.94 for a combined set of anilines and amines (N=132). The reported regression equations comprise a constant and a single term (electrophilic superdelocalizability)65.

Using the standard statistical descriptor output of Parasurf gave a model with four terms:

−−21+ −2 f ()()pKa r =×⋅−×⋅−×⋅4.599 10MEP 1.281 10MEP 3.065 10 IEL (2.27) −1 − −×⋅+1.200 10EAL 12.451

[Plot: calculated vs. experimental pKa]

Figure 2.18 Experimental and calculated pKa values using statistical descriptors: N = 268, MUE = 1.32, RMSD = 1.678, r² = 0.769, r²cv = 0.736.

The regression statistics for the surface-integral model are only slightly better than those obtained for the statistical descriptor model above and, considering the need for only four linear terms (versus 23 nonlinear terms), the statistical descriptor model may lend itself more easily to physical interpretation. Its major drawback is an increase in RMS error of 0.34 pKa units.

2.3.4 Boiling Point

The boiling point data set, consisting of 1642 compounds, was taken from Syracuse Research Corporation’s PHYSPROP database66. The surface-integral model for this data set, using the marching-cube surface, has 17 terms:

−−253 − 5 fT(b )()rr=×⋅−×⋅8.018 10V() 4.143 10 ⎣⎦⎡⎤V() r +×⋅0.786 10 α L ()r

−−252 −×⋅2.195 10 VEA()rr ⋅LL () +3.922 ×⋅ 10 ⎣⎡IE()rr ⋅α L ()⎦⎤

−8 3 −×⋅7.287 10 ⎣⎦⎡⎤IELL()rr ⋅α () −19 3 −×⋅4.727 10 ⎣⎦⎡⎤VIEEA()rr ⋅LL () ⋅ () r −4 +×⋅⋅3.316 10 VIEEA()rrLL () ⋅ () r

−8 2 −×⋅⋅2.515 10 ⎣⎡VIE()rrrLL () ⋅α ()⎦⎤ −7 −×⋅⋅7.114 10 VIE()rrrLL () ⋅η () 3 +×⋅2.367 10−10 ⎡⎤VIErrr ⋅ ⋅η 2 (2.28) ⎣⎦()LL () () −20 3 +×⋅3.463 10 ⎣⎦⎡⎤VIE()rrr ⋅LL () ⋅η () 5 −×⋅1.071 10−7 ⎡⎤VEArrr ⋅ ⋅α 2 ⎣⎦()LL () () 5 −×⋅6.296 10−19 ⎡⎤IErrr ⋅ EA ⋅α 2 ⎣⎦LLL() () () 5 −12 2 +×⋅1.032 10 ⎣⎦⎡⎤IELLL()rrr ⋅αη () ⋅ () 5 −×⋅9.686 10−11 ⎡⎤EA rr ⋅αη ⋅r 2 ⎣⎦LL() () L () −10 2 +×⋅1.165 10 ⎣⎦⎡⎤VEA()rrrr ⋅LLL () ⋅αη () ⋅ () −65.41

[Plot: calculated vs. experimental Tb (°C)]

Figure 2.19 Surface-integral model for boiling point using the marching-cube surface: N = 1642, MUE = 22.2, RMSD = 33.9, r² = 0.740, r²cv = 0.574.

When 19 outliers are removed from the data set, the resulting regression model possesses 19 terms:

3 fT rrr=⋅0.235VV +⋅−×⋅ 0.106 1.023 10−2 ⎡⎤V r2 − IE r (b )() () () ⎣⎦() L ()

2 −3 +⋅6.926ααLL()rr − 5.028 ⋅⎣⎦⎡⎤ () −×⋅⋅1.529 10 VEA() rL ()r 3 3 +×⋅6.297 10−−51⎡⎤VEArr ⋅2 −×⋅1.126 10 0VEArr ⋅ ⎣⎦()LL () ⎣⎡ () ()⎦⎤ 5 5 +×⋅2.656 10−−10 ⎡⎤VIrr ⋅ηη2 +×⋅1.144 10 13 ⎡ Err ⋅ ⎤ 2 ⎣⎦()LL () ⎣ ()L ()⎦ 3 5 −−662 +×8.485 10 ⋅⋅+×⋅⋅⎣⎦⎡⎤EALL()rrαα () 3.106 10 ⎣⎡LL()rrη ()⎦⎤

−3 −×⋅2.101 10 VIEEA()rr ⋅LL () ⋅ () r (2.29)

−7 −×⋅⋅9.788 10 VIE()rrrLL () ⋅η ()

−20 3 +×⋅3.008 10 ⎣⎦⎡⎤VIE()rrr ⋅LL () ⋅η () −3 +×⋅⋅3.397 10 VIE()rrrLL () ⋅α () 3 −×⋅1.313 10−4 ⎡⎤VIErrr ⋅ ⋅α 2 ⎣⎦()LL () ()

−16 3 +×⋅1.131 10 ⎡⎣⎦VEA()rrrr ⋅LLL () ⋅αη () ⋅ () −⎤ 87.97

[Plot: calculated vs. experimental Tb (°C)]

Figure 2.20 Surface-integral model for boiling point with outliers removed: N = 1623, MUE = 26.0, RMSD = 35.1, r² = 0.752, r²cv = 0.728.

Both regression statistics are improved, but somewhat ironically, the prediction error increases with the removal of the outliers: MUE:+3.8 and RMSD: +1.2 degrees Celsius. The linear regression model using spherical-harmonic hybrid coefficients for the boiling point data set has 29 terms:

13467 fT(bRRRR )(r) =⋅+⋅−⋅−⋅+⋅ 47.52HHHHH 13.05 8.54 35.30 31.27 R 25 7 8 +⋅1.39HHHMEP +⋅ 1.61MEP +⋅ 1.88MEP + 2.86 ⋅ HMEP 13 15 19 1 +⋅16.39HHHHMEP +⋅ 19.35MEP −⋅ 17.13MEP +⋅ 0.44 IEL 24678 (2.30) +⋅+⋅+⋅−⋅−⋅0.35HHHHHIEL 0.26IEL 0.19IEL 0.46IEL 0.27 IEL 910171 2 −⋅−⋅−⋅−⋅0.87HHHHIEL 0.50IEL 1.57IEL 0.49EAL −⋅ 0.49 HEAL 7915 17 −⋅1.03H EAL − 2.10 ⋅HHEAL + 3.08 ⋅EAL + 61.41 ⋅− Hα 378.5 ⋅Hα −1143

[Plot: calculated vs. experimental Tb (°C)]

Figure 2.21 Regression model for boiling point using spherical-harmonic hybrid coefficients: MUE = 24.6, RMSD = 34.6, r² = 0.779, r²cv = 0.742.

When the set of 40 statistical descriptors is used, the regression model for the boiling point data set has 16 terms:

f (TMbD )()r =⋅−⋅+⋅−⋅+⋅ 23.19μμ 7523 11.45 α 0.3103W 1.163 Vmax 2min +⋅−⋅−⋅−⋅12.65VV++ 8.179 0.7882σ 0.2502 IEL (2.31) max −⋅−⋅+⋅Δ+⋅0.9949EALL 1.251 EA − 0.7031EA L 1.285 χL max 2 −⋅+⋅+⋅−127.9αασLL 449.5 757.2α 699.2

[Plot: calculated vs. experimental Tb (°C)]

Figure 2.22 Regression model for boiling point using statistical descriptors: MUE = 25.3, RMSD = 36.8, r² = 0.750, r²cv = 0.733.

With an average RMS error of 35.1°C for the set of property models, none of the individual models predicts well enough to be used in a practical application. Rather, they serve to highlight the limitations of the method and confirm the intuitive notion that, at or near the boiling point, where a fraction of the molecules is entering the gas phase, the collective set of local properties ceases to describe well the interactions between molecules in terms of molecular surface properties.

2.3.5 Glass Transition Temperature

Glass transition temperature (Tg) is the temperature at which amorphous materials change from a somewhat crystalline phase to a liquid phase and is used as a measure of the thermal failure limit for organic light-emitting diodes67-69 (OLEDs). The surface-integral model using the marching-cube surface for the glass transition temperature was generated from a set of 73 OLED materials in Table A6 of Appendix A, assembled from the literature70. The resulting regression equation has 4 terms:

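$$
\begin{aligned}
f(T_g)(\mathbf{r}) ={}& -2.608\times10^{-4}\,[V(\mathbf{r})]^{5/2} - 2.336\times10^{-20}\,[V(\mathbf{r})\cdot IE_L(\mathbf{r})\cdot\eta_L(\mathbf{r})]^{3} \\
&+ 3.215\times10^{-15}\,[IE_L(\mathbf{r})\cdot\alpha_L(\mathbf{r})\cdot\eta_L(\mathbf{r})]^{3} \\
&- 1.078\times10^{-7}\,[EA_L(\mathbf{r})\cdot\alpha_L(\mathbf{r})\cdot\eta_L(\mathbf{r})]^{3/2} + 255.98
\end{aligned}
\tag{2.32}
$$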

[Plot: calculated vs. experimental Tg (°C)]

Figure 2.23 Experimental and calculated glass transition temperatures for the training set: N = 73, MUE = 16.8, RMSD = 22.5, r² = 0.690, r²cv = 0.582.

Using lower F statistic values (for individual terms to enter and to leave the regression equation) in the multiple regression results in an equation with more terms and a slightly improved r² value, but also yields a much less predictive model (r²cv approaches zero). The COSMO-optimized data set (using a bulk dielectric constant value of 10.0 for n-octanol) yields a model with 12 terms:


3 fT rr=−0.286 ⋅V − 5.091 × 10−−23 ⋅V r + 9.633 × 10 ⋅⎡⎤Vr2 ()g () () () ⎣⎦() 3 5 −−63 2 +×⋅6.251 10 ⎣⎦⎡⎤VE()rr −3.745 ×⋅ 10 ALL() +5.833 ⋅⎣⎡α ()r ⎦⎤ 3 7.761 10−−42VEArr2.385 10 ⎡⎤Vrr2 −×⋅⋅()LL () − ×⋅⎣⎦() ⋅α () −5 3 (2.33) −×⋅⋅3.135 10 ⎣⎦⎡⎤V ()rrα L () −20 3 +×⋅5.192 10 ⎣⎦⎡⎤VIEEA()rr ⋅LL () ⋅ () r −13 3 +2.52110×⋅⎣⎦⎡⎤VIE()rrr ⋅LL () ⋅α () −10 3 +×⋅1.843 10 ⎣⎦⎡⎤VEA()rrr ⋅LL () ⋅α () +252.25

[Plot: calculated vs. experimental Tg (°C)]

Figure 2.24 Surface-integral model for glass transition temperature using COSMO-optimized structures: MUE = 15.3, RMSD = 18.7, r² = 0.779, r²cv = 0.491.

The predictivity of this model is comparable to the previous one, with an improvement in RMS error of 3.8°C (from 22.5 to 18.7°C, Figures 2.23 and 2.24). The same data set (not using COSMO-optimized structures) was used to generate a regression model using the statistical descriptors, which yielded a 9-term equation:


max f ()TVg ()r =⋅−⋅−⋅+⋅−⋅1.253α 6.112+− 7.938V 4.544V 1.221 IEL min max 2 + (2.34) +⋅+0.334IELLE 6.220 ⋅ EA + 0.125 ⋅−⋅σδA− 6160 AEA +840.6

[Plot: calculated vs. experimental Tg (°C)]

Figure 2.25 Regression model for glass transition temperature using statistical descriptors: MUE = 12.7, RMSD = 15.7, r² = 0.844, r²cv = 0.521.

This model possesses better regression statistics, with a significantly improved prediction error, compared with that of the surface-integral model. The regression model using the spherical-harmonic hybrid coefficients yields an equation with only two terms that predicts very poorly (MUE = 22.2, RMSD = 28.5):

$$
f(T_g)(\mathbf{r}) = 144.05\cdot H_{R}^{16} + 2.982\cdot H_{MEP}^{4} + 285.1
\tag{2.35}
$$

[Plot: calculated vs. experimental Tg (°C)]

Figure 2.26 Regression model for glass transition temperature using hybridization coefficients: MUE = 22.2, RMSD = 28.5, r² = 0.501, r²cv = 0.224.

The best-predicting property model here uses the statistical descriptor set and has an RMS of 15.7°C, which represents roughly 10% of the range of the Tg values in the data set, which is rather large (the best boiling point model predicts within ~6% of its range). But here again, the local properties are being used to predict a phase change – the point at which the forces dictating the arrangement of molecules cease to apply in the same manner.

2.3.6 Aqueous Solubility

The aqueous solubility data set in Table 1.6 of Appendix A is a small subset of 589 compounds taken from The University of Arizona’s AQUASOL database71. Given that the solubility values were reported in some 100 different units, all values were converted to standard molarity units (moles/liter, M) and the base-10 logarithm was taken as the target value (logS).
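The conversion to logS can be sketched as follows. This is a minimal illustration only: the handful of units covered, the conversion table, and the function name are hypothetical and do not reproduce the actual set of units handled for the AQUASOL data.

```python
import math

# Conversion factors to grams per litre for a few mass-based units
# (hypothetical subset chosen for illustration).
GRAMS_PER_LITRE = {
    "g/L": 1.0,
    "mg/L": 1.0e-3,
    "ug/L": 1.0e-6,
    "g/100mL": 10.0,
}

# Conversion factors to moles per litre for molar units.
MOLES_PER_LITRE = {
    "mol/L": 1.0,
    "mmol/L": 1.0e-3,
}


def log_s(value, unit, molar_mass):
    """Return log10 of the solubility in mol/L.

    value      -- reported solubility in the given unit
    unit       -- one of the keys of the two tables above
    molar_mass -- molar mass of the compound in g/mol
    """
    if unit in MOLES_PER_LITRE:
        molarity = value * MOLES_PER_LITRE[unit]
    elif unit in GRAMS_PER_LITRE:
        molarity = value * GRAMS_PER_LITRE[unit] / molar_mass
    else:
        raise ValueError(f"unsupported unit: {unit}")
    return math.log10(molarity)


# A compound of molar mass 100 g/mol reported at 1 mg/L and at 0.01 mmol/L.
print(log_s(1.0, "mg/L", 100.0))
print(log_s(0.01, "mmol/L", 100.0))
```

Both example calls describe the same concentration, so they give the same logS; mass-based units need the molar mass of each compound, while molar units do not.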


The regression equation for the surface-integral model derived using the marching-cube surface consists of 11 terms:

3 −−352 fS(log )()rr=×⋅+×⋅ 2.086 10V() 3.768 10 ⎣⎦⎡⎤V() r 3 −4 2 −×⋅1.803 10 EALL()rr −0.338 ⋅⎣⎦⎡⎤α () 5 3 2 −14 +⋅0.391 ⎣⎡α LL()rr⎦⎣⎤⎡ +3.643 ×⋅ 10 VEA() ⋅ ()r⎦⎤ −×⋅⋅5.224 10−3 V ()rrα () L (2.36) 5 −×⋅4.336 10−7 ⎡⎤V rr ⋅α 2 ⎣⎦()L () −19 3 +×⋅9.093 10 ⎣⎦⎡⎤IELL()rr ⋅η () −5 +×⋅⋅1.257 10 VIE()rrL ()⋅α L ()r −8 −×⋅⋅1.478 10 VIE()rrrLL () ⋅η () −0.8576

[Plot: calculated vs. experimental logS (H2O)]

Figure 2.27 Surface-integral model for logS using the marching-cube surface: N = 589, MUE = 0.844, RMSD = 1.14, r² = 0.578, r²cv = 0.411.


The regression model using the spherical-harmonic hybrid coefficients yielded an 18-term equation: fS()()logrrr=− 4.206 × 10−−11 ⋅V() −3.416 × 10 ⋅V()

3 −−112 3 −×⋅2.993 10 ⎣⎦⎡⎤VV()rr +×⋅6.651 10 ⎣⎦⎡⎤() −−22 +×⋅2.272 10 ααLL()rr +2.797 ×⋅ 10 () 3 −×⋅1.447 10−−11η rr +×⋅⋅8.008 10 ⎡⎤VIEr2 LL() ⎣⎦() () −−122 +×⋅⋅7.902 10 ⎣⎦⎡⎤VIE()rrLL () +×⋅⋅1.113 10 V()rrα ()

−2 −2 3 −×⋅2.251 10 V ()r ⋅−×⋅⋅ηηLL()rr4.405 10 ⎣⎡V () ()r⎦⎤ (2.37) 5 +×⋅1.044 10−2 ⎡⎤IErr ⋅ EA 2 ⎣⎦LL() () −2 2 +×⋅3.205 10 ⎣⎦⎡⎤IELL()rr ⋅α () −1 2 −×⋅1.564 10 ⎣⎦⎡⎤EALL()rr ⋅α () 3 −⋅1.920 EA rr ⋅ηη +2.457 ⋅⎡⎤EA rr ⋅ 2 LL() () ⎣⎦LL() () 3 +⋅⋅⋅59.728 ⎣⎦⎡⎤VIEEA()rrLL () () r +2.055

[Plot: calculated vs. experimental logS (H2O)]

Figure 2.28 Regression model for logS using spherical-harmonic hybrid coefficients: N = 589, MUE = 0.960, RMSD = 1.27, r² = 0.469, r²cv = 0.333.


The regression model using the statistical descriptors has 14 terms:

3 3 fSlogrr=− 1.135 × 10−−13 ⋅⎡⎤V 2 −6.618 × 10 ⋅ IEr2 ( )() ⎣⎦() ⎣⎡ L ()⎦⎤ 2 5 −−222 +×⋅4.049 10 ⎣⎦⎡⎤IELL()rr −2.219 ×⋅ 10 ⎣⎦⎡⎤IE () −−432 −×⋅2.853 10 ⎣⎦⎡⎤EALL()rr +5.296 ×⋅ 10 α () 2 −×⋅−1.709 10−−12ηη()rr 4.315 ×⋅ 10 ⎡⎤() LL⎣⎦ (2.38) 5 3 +×⋅1.342 10−2 η rr2 +⋅12.75 ⎡VIE ⋅r⎤ 2 ⎣⎦⎡⎤LL() ⎣ () ()⎦ 2 +⋅3.718 ⎣⎦⎡⎤VIE()rr ⋅LL () −9.269 ⋅⋅ VEA () r () r 3 2 +⋅1.949 ⎡⎤VEArr ⋅2 +43.79 ⋅VEA rr ⋅ ⎣⎦()LL () ⎣⎡ () ()⎦⎤ −6.998

[Plot: calculated vs. experimental logS (H2O)]

Figure 2.29 Regression model for logS using statistical descriptors: N = 589, MUE = 0.791, RMSD = 1.07, r² = 0.624, r²cv = 0.528.

As in the case of glass transition temperature, the statistical descriptor set again gives the best regression model in terms of RMS error (~1 logS unit). The regression statistics, however, are not encouraging. In comparison to the commercially available product ACD/Labs’ Solubility DB72, the RMS error of our best model is only as good as the upper end of the range of RMS errors reported for the prediction of a set of test compounds73,74.


2.4 Discussion

As described in Ehresmann, et al.34, the construction of the surface-integral models involves two approximations: that the target properties may be treated using a sum of local surface values and that gas-phase electron densities from semiempirical calculations can be used to represent properties that, in bulk, depend on the presence of a polar medium. The free energy of solvation models themselves give reliable estimates, although they are not as accurate as the most reliable methods available63,75. It should be noted that this surface-integral approach relies on the gas-phase electron densities and optimized structures from semiempirical calculations and can only include solute polarization implicitly by the local polarizability for the molecular surface. It was also found that the use of COSMO-optimized structures by this method results in models of slightly lower predictive power. In order to evaluate the predictivity of our model, the logP data set was first predicted using KOWWIN 1.67, included in the U.S. Environmental Protection Agency’s property estimation package, EPISUITE 3.11. KOWWIN is a Windows implementation of Syracuse Research Corporation’s LogKow10, which is an atom/fragment-based method for estimating logP that was trained with 2,410 compounds, using 175 fragment groups (r2 = 0.98). The statistics reported for a 13,058 compound validation set are as follows: a standard deviation of 0.436, an MUE of 0.316 and an r2 of 0.95. The SD files for our logP training set were converted to their corresponding SMILES strings and run as a batch job. This test set yielded a mean signed error of 0.396, a mean unsigned error (MUE) of 1.172, and a root mean square (RMS) error of 2.046. As a second trial and a rough measure of the comparative prediction accuracy of the two methods, the logP values of a small set of 17 recognizable biological structures taken from Exploring QSAR59 were predicted using both methods (Table 2.3). 
The KOWWIN model yielded a mean unsigned error of 1.12 and an RMS error of 1.61. Our local property model performed slightly worse, with a mean unsigned error of 1.40 and an RMS error of 1.82. From previous regressions, a tendency of the models to predict poorly for logP values at or below zero logP units was observed and attributed to an under-representation by the data set of compounds much more soluble in water than in n-octanol (since compounds with logP values at the other end of the solubility spectrum also presented a similar prediction error).

Table 2.3 Results of KOWWIN and Parasurf logP predictions.

No. Compound Exp. KOWWIN Parasurf

 1   estradiol                   4.01    1.55    3.02
 2   imipramine                  4.80    5.01    4.78
 3   pentazocine                 3.31    5.03    4.65
 4   rifampin                    1.32    2.08    2.15
 5   vincristine                 2.57    3.11    3.83
 6   digitoxin                   1.68    2.04    3.61
 7   terfenadine                 3.22    7.62    7.87
 8   sufentanil                  3.24    3.62    3.22
 9   colchicine                  1.3     1.86    1.64
10   tetracycline               -1.44   -0.18    0.57
11   hexetidine                  2.00    5.26    5.30
12   Δ9-tetrahydrocannabinol     6.97    7.6     5.84
13   yohimbine                   2.73    2.11    2.86
14   quinine                     2.64    3.29    4.52
15   acyclovir                  -1.56   -2.41    0.20
16   diazepam                    2.99    2.70    3.83
17   codeine                     1.14    1.28    2.54
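The error statistics quoted for this comparison follow directly from the entries of Table 2.3. The sketch below recomputes the mean unsigned error and RMS error for both methods from the tabulated values; the variable and function names are illustrative only.

```python
import math

# (experimental, KOWWIN, Parasurf) logP values from Table 2.3, rows 1-17.
TABLE_2_3 = [
    (4.01, 1.55, 3.02), (4.80, 5.01, 4.78), (3.31, 5.03, 4.65),
    (1.32, 2.08, 2.15), (2.57, 3.11, 3.83), (1.68, 2.04, 3.61),
    (3.22, 7.62, 7.87), (3.24, 3.62, 3.22), (1.30, 1.86, 1.64),
    (-1.44, -0.18, 0.57), (2.00, 5.26, 5.30), (6.97, 7.60, 5.84),
    (2.73, 2.11, 2.86), (2.64, 3.29, 4.52), (-1.56, -2.41, 0.20),
    (2.99, 2.70, 3.83), (1.14, 1.28, 2.54),
]


def mue_and_rms(experimental, predicted):
    """Mean unsigned error and root-mean-square error of the predictions."""
    errors = [p - e for e, p in zip(experimental, predicted)]
    mue = sum(abs(err) for err in errors) / len(errors)
    rms = math.sqrt(sum(err * err for err in errors) / len(errors))
    return mue, rms


exp_values = [row[0] for row in TABLE_2_3]
kowwin_mue, kowwin_rms = mue_and_rms(exp_values, [row[1] for row in TABLE_2_3])
parasurf_mue, parasurf_rms = mue_and_rms(exp_values, [row[2] for row in TABLE_2_3])

print(f"KOWWIN:   MUE={kowwin_mue:.2f}, RMS={kowwin_rms:.2f}")
print(f"Parasurf: MUE={parasurf_mue:.2f}, RMS={parasurf_rms:.2f}")
```

The largest single contribution to both RMS errors comes from terfenadine (row 7), which both methods overestimate by more than four logP units.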

The principal disadvantage of the local-property/surface-integral method lies in treating small regions of the molecular surface as having well-defined local values in the units of the property. The projected property value is not a true measure of the actual property value at a given point, but rather is an index of it, describing the variations in the property over the whole of the surface. This abstraction also applies to the local properties themselves. It has been noted24 that EAL does not, in fact, represent a real electron affinity, even within the definition of Koopmans’ theorem, but rather is a local indicator of electron-accepting regions on the molecular surface, that is, regions that are likely to be the site of nucleophilic attack. The local electronegativity also does not correspond to a real electronegativity based on chemical potential37. Another disadvantage of the method is the lack of an obvious physical interpretation of the regression equation. The surface-integral models are nonlinear relationships between a physical property and a set of surface properties (or indices) that require an understanding of the terms within their own context or a method of relating local property variations to chemical structural features.

Figure 2.30 Surface-integral-modeled logP surface of decylsulfonic acid.

These drawbacks notwithstanding, the models themselves approach the predictive ability of accepted, commonly used prediction methods, with the added benefit of allowing the researcher to visualize the physical property surface for a given structure, as well as use the surface-mapped physical property as a local property in itself. The quality of prediction of actual physical property values at the point of using surface-integral-derived physical properties would be dubious at best. Rather, the use of these “extended” properties would only be appropriate in the capacity of a surface descriptor in a statistical analysis or a classification scheme.


2.5 Conclusions

The quantitative structure-property models presented here represent a shift towards a completely wave-function, molecular-orbital-based approach to QSAR/QSPR prediction using surface-integral models. One particularly attractive feature of the surface-integral technique is the ability to use the predicted property (or activity) as a local property itself. The property in question is defined as a local property and projected onto an isodensity surface for visualization or for further statistical analysis. Thus, not only are the surface-integral models QSPR/QSAR prediction methods, but they are also indicators of the molecular surface features that contribute to the particular property. With further investigation into the physical meanings of surface-integral models in terms of the local properties of which they are comprised, any physical property or biological activity that is a function of molecular surface interactions can be predicted and visualized by this method.


Chapter 3

Support Vector Classification of Phospholipidosis-Inducing Drugs

3.1 Introduction

3.1.1 Phospholipidosis

Drug-induced phospholipidosis is a physiopathological condition characterized by the appearance of microscopic subcellular structures, called lamellar bodies or lysosomal inclusion bodies, that contain primarily large deposits of undegraded phospholipids. The lysosomal bodies aggregate inside the cells of the lungs, liver, kidneys, corneas, and brain, and their presence often coincides with adverse clinical effects such as inflammatory reactions and fibrosis, although the relationship is as yet unexplained. Indeed, the onset of phospholipidosis may or may not be associated with a presentation of adverse symptoms76. It is nevertheless well-documented that drug-induced phospholipidosis does affect cellular function by impairing lysosomal protein degradation, membrane fusion, and pino- and endo-cytosis77,78. For this reason it is desirable for pharmaceutical companies to develop a screen for phospholipidosis induction.


The most common feature of the drugs that induce phospholipidosis is that they are both cationic and amphiphilic: they have a positively-charged water-soluble portion and a hydrophobic portion. Referred to, therefore, as cationic amphiphilic drugs (CAD’s), these compounds are found accumulated inside the lysosomal bodies along with aggregated phospholipids. It is thought that the CAD’s pass into the lysosomal compartment where the pH is low and become trapped. By virtue of being weakly basic, they are protonated so that they cannot pass back through the phospholipid bilayer79. This may also be a defense mechanism of the cell to protect itself against exogenous xenobiotics and metabolites. While the molecular mechanism of phospholipidosis induction is not presently known, there are two basic hypotheses. The first involves the CAD’s binding directly to phospholipids, resulting in indigestible complexes that are stored in the lysosomal lamellar bodies79. The other hypothesis takes note of the concomitant inhibition of phospholipase activity in the lamellar bodies. The CAD’s can inhibit phospholipases either by binding directly to them, or if the concentration of the CAD’s becomes high enough, they may effectively raise the pH such that phospholipase function becomes impaired, resulting in an accumulation of phospholipids that cannot be degraded78. Working from either of these hypotheses it should be possible to elucidate some molecular surface features common among the drugs that promote the induction of phospholipidosis in order to establish the relationship between biological activity and structure (or surface properties). The challenge comes in that, while most drugs that induce phospholipidosis are cationic and amphiphilic, not all cationic amphiphilic compounds induce phospholipidosis. Thus, the most defining characteristics of the drug class may not be useful in classifying its activity. 
In fact, these properties may completely overshadow other aspects of the molecular surface more pertinent to the function of CAD’s as phospholipase inhibitors or as sites of phospholipid complexation/aggregation. Given that the octanol-water partition coefficient (P, or in this case logP), itself an index of hydrophobicity32, has been shown to be additive in terms of the solvent-accessible surface80, local properties such as the ones described in the previous chapter (Equations 2.1 – 2.6) that were used in the logP surface-integral model should be particularly useful in predicting phospholipidosis induction provided there is a sufficiently rigorous statistical method available to classify compounds based on their molecular surface property values.


3.1.2 Phospholipidosis Models

Our recent work with phospholipidosis prediction was performed using a 144-compound data set of structures with assay results for phospholipidosis induction as determined by transmission electron microscopy81, which was provided by Anne Tilloy-Ellul (Pfizer Global R&D, Amboise Laboratories, France) and Marcel de Groot (Pfizer Global R&D, Sandwich Laboratories, UK). This data set, shown in Table 3.1, was divided in half, with one portion to be used as a training set and the other as a test set, by sorting the provided set by the Parasurf-calculated free energies of solvation in octanol (ΔGsolv(oct.)) and placing every other compound into either the training set or the test set. Thus, the training set of 72 structures consists of 44 positives and 28 negatives, and the test set of 72 compounds consists of 42 positives and 30 negatives. The two compounds carbon tetrachloride and valproic acid were duplicates in the complete data set and are in both the training set and test set. The basic approach undertaken to classify these data according to the likelihood of inducing phospholipidosis in assay involved the use of two statistical methods, each applied to two types of descriptor sets of local properties. The first set of property descriptors consisted of the statistical descriptors described in the previous chapter and shown in Table 2.2. The second set consisted of surface autocorrelation indices, which had recently been implemented in Parasurf ’06 and will be described briefly in the following section.
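The alternating split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `alternating_split`, the compound labels, and the solvation energies are hypothetical placeholders, and the choice of which alternate goes to the training set is arbitrary here, since the text does not specify it.

```python
def alternating_split(records, key):
    """Sort records by `key`, then deal them alternately into two halves."""
    ordered = sorted(records, key=key)
    # Even positions go to one set, odd positions to the other (assumed order).
    return ordered[0::2], ordered[1::2]


# Hypothetical (name, free energy of solvation in octanol) pairs.
compounds = [
    ("cmpd_a", -3.1), ("cmpd_b", -7.4), ("cmpd_c", -5.0),
    ("cmpd_d", -1.2), ("cmpd_e", -6.3), ("cmpd_f", -4.8),
]
training, test = alternating_split(compounds, key=lambda c: c[1])
```

Dealing from a sorted list in this way gives both halves a similar spread of the sorting property, which is presumably the motivation for the procedure.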

Table 3.1 The phospholipidosis training set (l) and test set (r).

Training Set                         Inducer   Test Set              Inducer
17-a-ethynylestradiol                Negative  3-methylcholanthrene  Negative
1-amino-4-octylpiperazine            Positive  AC-3579               Positive
1-chloro-10,11-dehydroamitriptyline  Positive  acetaminophen         Negative
1-chloroamitriptyline                Positive  amikacin              Positive
6-hydroxydopamine                    Positive  amiodarone            Positive
abacavir                             Negative  amitriptyline         Positive
amineptine                           Negative  anticoman             Negative
amodiaquine                          Positive  aricept               Negative
azaserine                            Negative  AY-25329              Negative
azithromycin                         Positive  AY-9944               Positive
bilirubin                            Positive  bicalutamide          Negative
brompheniramine                      Positive  boxidine              Positive
caffeine                             Negative  bupropion             Negative
carbamazepine                        Negative  carbon tetrachloride  Negative
carbon tetrachloride                 Negative  ceftazidime           Negative
ceftazidime                          Negative  cephaloridine         Positive
chloroquine                          Positive  chlorcyclizine        Positive
chlorpromazine                       Positive  chloroform            Negative
chlortetracycline                    Negative  chlorphentermine      Positive
ciprofibrate                         Negative  citalopram            Positive
clociguanil                          Positive  clindamycin           Positive
clofibrate                           Negative  clomipramine          Positive
colchicine                           Negative  clozapine             Positive
cyclizine                            Positive  cocaine               Positive
                                     Negative  coralgil              Positive
desipramine                          Positive  dantrolene            Negative
(d-)H-4,4-bis-diethylaminoethoxy-diethylphenylethane  Positive  demeclocycline  Negative
dibucaine                            Negative  desferal              Negative
erythromycin                         Positive  dibekacin             Positive
etoposide                            Negative  diclofenac            Negative
felbamate                            Negative  diflunisal            Negative
fenofibrate                          Negative  di-isobutamide        Positive
fluoxetine                           Positive  doxapram              Negative
galactosamine                        Negative  doxycycline           Negative
gemfibrozil                          Negative  emetine               Positive
gentamicin                           Positive  ethyl fluclozepate    Positive
hydroxyzine                          Positive  famotidine            Negative
hypoglycin-A                         Negative  fenfluramine          Positive
iprindole                            Positive  flutamide             Negative
ketoconazole                         Positive  homochlorcyclizine    Positive
lysergide                            Positive  hydrazine             Negative
methotrexate                         Negative  hydroxyurea           Negative
methyldopa                           Negative  IA3                   Positive
norchlorcyclizine                    Positive  imipramine            Positive
nortriptyline                        Positive  indoramin             Positive
phenacetin                           Positive  (l)-ethionine         Negative
pheniramine                          Positive  maprotiline           Positive
phenobarbital                        Negative  meclizine             Positive
phentermine                          Positive  mesoridazine          Positive
physostigmine                        Negative  metformin             Negative
piroxicam                            Negative  methadone             Negative
quinine                              Positive  methapyrilene         Negative
R-800                                Positive  mianserin             Positive
RMI10393                             Positive  netilmicin            Positive
SC-45864                             Probable  noxiptiline           Positive
stilbamidine                         Positive  paraquat              Positive
sulindac                             Negative  perhexiline           Positive
suramin                              Positive  procaine              Negative
tamoxifen                            Positive  promethazine          Positive
temozolomide                         Negative  propranolol           Positive
tetracaine                           Positive  quinacrine            Positive
thioacetamide                        Negative  quinidine             Positive
tilorone                             Positive  rolitetracycline      Negative
tobramycin                           Positive  SDZ_200-125           Positive
tocainide                            Positive  SKF-14336-D           Positive
trimeprazine                         Positive  stavudine             Negative
trimethoprim sulfamethoxazole        Positive  tacrine               Negative
trospectomycin sulfate               Positive  trifluperazine        Positive
valproic_acid                        Negative  triparanol            Positive
WY-14643                             Negative  tunicamycin           Positive
zidovudine                           Negative  valproic_acid         Negative
zileuton                             Negative  zimelidine            Positive


3.1.3 Surface Autocorrelations

Surface autocorrelations are cross-correlations of various surface property values that have been shifted by some distance on the molecular surface. Introduced by Gasteiger82,83 as descriptors for use in molecular binding studies, they are used to discover periodic patterns or fundamental harmonics that may not be apparent by inspection. There are six general autocorrelation functions implemented in Parasurf ’06 that describe local molecular electrostatic potential (3 functions), shape (1 function), local ionization potential (1 function), and local electron affinity (1 function) cross-correlations. These are used in the general vector equation:

$$A(R) = \frac{1}{n_{tri}} \sum_{i=1}^{n_{tri}} \sum_{j=i+1}^{n_{tri}} \omega_{ij}\, e^{-\sigma (R - r_{ij})^{2}} \qquad (3.1)$$

where rij is the distance between surface points and ωij is one of the four autocorrelation functions. Each autocorrelation vector has 128 elements, starting at a radius of 2.5 Å with increments of 0.06 Å. The three MEP functions that describe the three possible sets of cross products are defined in the following table.

Table 3.2 Molecular electrostatic functions used in surface autocorrelations.

VPP   Plus-plus MEP autocorrelation     ωij = Vi × Vj   where (Vi > 0 and Vj > 0)
                                        ωij = 0.0       where (Vi < 0 or Vj < 0)
VMM   Minus-minus MEP autocorrelation   ωij = Vi × Vj   where (Vi < 0 and Vj < 0)
VPM   Plus-minus MEP autocorrelation    ωij = −Vi × Vj  where (Vi × Vj < 0)
                                        ωij = 0.0       where (Vi × Vj > 0)


Similarity indices may be calculated for data sets by comparison with a reference structure by

$$S = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \min\left(A_1(R_i),\, A_2(R_i)\right)}{A_1(R_i) + A_2(R_i)} \qquad (3.2)$$

$$\text{for } \left[ \sum_{i=1}^{N} A_1(R_i) + A_2(R_i) \right] > 0$$

where A1(Ri) is the value of the autocorrelation function for molecule 1 at a distance Ri, A2(Ri) is the corresponding value for molecule 2, and N is the number of points within the defined range of R for which the sum is non-zero. The similarity indices are calculated for the full range of each of the autocorrelation functions, as well as for each of the four quarters of the distance range for each of the functions.
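A small Python sketch may help make Equations 3.1 and 3.2 concrete. It is not the Parasurf implementation: the function names, the three-point toy surface, and the value σ = 1.0 are assumptions chosen for illustration (the text does not give σ), and individual surface points stand in for the surface triangles. Only the plus-plus MEP weight from Table 3.2 is shown.

```python
import math


def omega_plus_plus(v_i, v_j):
    """Plus-plus MEP weight (Table 3.2): V_i * V_j if both are positive, else 0."""
    return v_i * v_j if (v_i > 0 and v_j > 0) else 0.0


def autocorrelation(points, omega, radii, sigma):
    """Equation 3.1 for a list of (xyz, property value) surface points."""
    n = len(points)
    vector = []
    for big_r in radii:
        total = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                r_ij = math.dist(points[i][0], points[j][0])
                weight = omega(points[i][1], points[j][1])
                total += weight * math.exp(-sigma * (big_r - r_ij) ** 2)
        vector.append(total / n)
    return vector


def similarity(a1, a2):
    """Equation 3.2, averaged over the distances where A1 + A2 is positive."""
    terms = [2.0 * min(x, y) / (x + y) for x, y in zip(a1, a2) if x + y > 0]
    return sum(terms) / len(terms) if terms else 0.0


# 128 elements starting at 2.5 Angstrom in 0.06 Angstrom steps, as in the text.
radii = [2.5 + 0.06 * k for k in range(128)]
# Hypothetical surface points: (coordinates, MEP value).
toy_surface = [((0.0, 0.0, 0.0), 1.0), ((3.0, 0.0, 0.0), 2.0), ((0.0, 4.0, 0.0), -1.0)]
a_pp = autocorrelation(toy_surface, omega_plus_plus, radii, sigma=1.0)
```

With these definitions a vector compared with itself gives a similarity of 1, and a vector compared with an all-zero vector gives 0, which matches the intended range of the index.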

3.1.4 Statistical Methods

The principal statistical method used to predict phospholipidosis induction from molecular surface descriptors was the Support Vector Machine84-86 (SVM), with a multivariate adaptive regression splines87,88 (MARS) method used to compare the prediction accuracy of the best SVM models. In the case of a small difference in prediction accuracy between the training set and the test set for the support vector machines and a relatively large difference in prediction accuracy for the regression splines models, over-fitting of the data by the SVM’s would be assumed.

3.1.4.1 Support Vector Machines

The Support Vector Machine (SVM) is a machine-learning technique for classification that involves a non-linear mapping of data into a high-dimensional feature space, then using structural risk minimization to find a separating hyperplane with the largest margin between the transformed data. These learning machines have been shown to classify with an accuracy at least as good as the various neural net methods85.

Figure 3.1 Linear maximum-margin hyperplane with circled support vectors. (Labels in the figure: hyperplanes H1 and H2, normal vector w, offset −b/|w| from the origin, and the margin.)

The method for solving the maximum-margin hyperplane (Figure 3.1) problem involves a minimization of the Lagrangian formulation

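$$L_P \equiv \frac{1}{2}\lVert \mathbf{w} \rVert^{2} - \sum_{i=1}^{l} \alpha_i y_i \left( \mathbf{x}_i \cdot \mathbf{w} + b \right) + \sum_{i=1}^{l} \alpha_i \qquad (3.3)$$

where $\mathbf{w}$ is a vector normal to the hyperplane, $|b| / \lVert \mathbf{w} \rVert$ is the perpendicular distance from the hyperplane to the origin, and $\alpha_i$ are Lagrange multipliers with $i = 1, \ldots, l$, one for each input vector, subject to the constraint $y_i(\mathbf{x}_i \cdot \mathbf{w} + b) - 1 \geq 0,\ \forall i$. In the convex quadratic programming solution of the maximum and the minimum $L_P$, the training data are mapped into dot product feature space (represented here by the vector pair, $\mathbf{x}_i$ and $\mathbf{x}_j$) after requiring that the gradient of $L_P$ be subject to:

$$\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i \qquad (3.4)$$

and

$$\sum_{i} \alpha_i y_i = 0 \qquad (3.5)$$

in the following “dual formulation” equation:

$$L_D = \sum_{i} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j \qquad (3.6)$$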


In the nonlinear case this is a difficult solution until one employs the “kernel trick”84 to express the dot products and the mappings Φ into the Hilbert space:

$$\Phi : \Re^{d} \to H \qquad (3.7)$$

in terms of some kernel function of the form

$$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j) \qquad (3.8)$$

By this method, wherein the dot products are defined in the new space as a single function, it becomes unnecessary to determine Φ explicitly. This is especially useful in the case of the commonly used radial basis function:

$$K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}} \quad (\text{for } \gamma > 0) \qquad (3.9)$$

which renders H infinite-dimensional. In C-SVC, or C-support vector classification86, the construction of the optimal hyperplane involves the minimization of the functional

$$\min_{\mathbf{w},\, b,\, \xi} \left\{ \frac{1}{2} \mathbf{w}^{T}\mathbf{w} + C \sum_{i=1}^{l} \xi_i \right\} \qquad (3.10)$$

subject to $y_i\left(\mathbf{w}^{T}\phi(\mathbf{x}_i) + b\right) \geq 1 - \xi_i$ and $\xi_i \geq 0$, where w again represents the hyperplane vector, b is a bias term, and the summation term represents the sum of deviations, ξi, of training errors; minimizing the functional minimizes this sum while maximizing the margin for the correctly classified vectors. If the training data can be separated without errors, then the constructed hyperplane corresponds to the optimal margin hyperplane86. By varying the value of the C term in the expression, one can vary the trade-off between the complexity of the decision surface and the frequency of error, in effect “tuning” the SVM’s ability to generalize. Another algorithm for constructing the hyperplane is the ν-SVC method89, which uses a parameter, ν, to set an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. This method is formulated as

$$\min_{\mathbf{w},\, b,\, \xi,\, \rho} \left\{ \frac{1}{2} \mathbf{w}^{T}\mathbf{w} - \nu\rho + \frac{1}{l} \sum_{i=1}^{l} \xi_i \right\} \qquad (3.11)$$

subject to $y_i\left(\mathbf{w}^{T}\phi(\mathbf{x}_i) + b\right) \geq \rho - \xi_i$ and $\xi_i \geq 0$, $\rho \geq 0$, where l is the number of input vectors, b is a bias term, and φ(xi) is the feature space mapping. The expectation value of the probability of error is bounded by the ratio of the expectation value of the number of support vectors to the number of training vectors, expressed as:

$$\Pr(\text{error}) \leq \frac{n_{sv}}{n_{tv}} \qquad (3.12)$$

where nsv is the number of support vectors and ntv is the number of training vectors. So, if the optimal hyperplane can be constructed from a small number of support vectors relative to the training set size, then the ability of the SVM to generalize will be high. This is true even when the feature space is infinite-dimensional, since the complexity of the learning algorithm does not depend on the dimensionality of the feature space, but on the number of support vectors86.
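The role of the kernel in the dual formulation can be illustrated with a short Python sketch. The code below evaluates the radial basis function of Equation 3.9 and the dual objective of Equation 3.6 with the dot product replaced by the kernel, as in Equation 3.8. The function names and the two-point toy data are assumptions for illustration only, and no optimization over the multipliers is attempted; libsvm solves that problem internally.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Equation 3.9: exp(-gamma * ||x_i - x_j||^2), with gamma > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def dual_objective(alphas, labels, vectors, gamma):
    """Equation 3.6 with x_i . x_j replaced by K(x_i, x_j)."""
    linear_term = sum(alphas)
    quadratic_term = 0.0
    for a_i, y_i, x_i in zip(alphas, labels, vectors):
        for a_j, y_j, x_j in zip(alphas, labels, vectors):
            quadratic_term += a_i * a_j * y_i * y_j * rbf_kernel(x_i, x_j, gamma)
    return linear_term - 0.5 * quadratic_term


# Hypothetical two-point problem: one vector per class, unit multipliers.
vectors = [(0.0,), (1.0,)]
labels = [+1, -1]
alphas = [1.0, 1.0]
value = dual_objective(alphas, labels, vectors, gamma=1.0)  # equals 1 + exp(-1)
```

Because only kernel values enter the objective, the mapping Φ never has to be written down, which is the point made in the text about the infinite-dimensional feature space.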

3.1.4.2 Multivariate Adaptive Regression Splines

The second classification method applied to the surface property data was the multivariate adaptive regression splines technique87, which generates a prediction model by selecting a weighted sum of basis functions from the set of basis functions spanning all descriptor values, then adding basis functions to the predictive equation on a least squares goodness-of-fit criterion, according to the general equation:

$$f(\mathbf{x}) = \sum_{m=0}^{M} \alpha_m B_m(\mathbf{x}) \qquad (3.13)$$

where $B_m(\mathbf{x})$ represents the set of basis functions and $\alpha_m$ is a real number indicating the contribution of the basis function:

$$B_m(\mathbf{x}) = \prod_{k=1}^{K_m} b_{km}\left(x_{v(k,m)}\right) \qquad (3.14)$$

where $v(k,m)$ is an index of the factor used as the argument of $b_{km}$. The basis functions here are categorical two-sided truncated linear functions used to approximate the response to predictor variance:

$$b^{+}_{km}\left(x_{v(k,m)}\right) = I\left(x_{v(k,m)} \in A_{km}\right) \qquad (3.15)$$

$$b^{-}_{km}\left(x_{v(k,m)}\right) = I\left(x_{v(k,m)} \notin A_{km}\right) \qquad (3.16)$$

with $I(x_{v(k,m)})$ representing an indicator function having a value of one if true and zero if false. $A_{km}$ is a subset of the possible values of $x_{v(k,m)}$.

Both of the multivariate analysis techniques described (support vector machines and multivariate adaptive regression splines) are readily applied to problems of high dimensionality, where the number of predictors is large and the variance within the predictors tends to add noise to underlying correlations.
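The structure of Equations 3.13 to 3.16 can be sketched in Python as a weighted sum of products of indicator functions. The helper names, the example subsets, and the coefficients below are hypothetical, and the sketch only evaluates a given model; it does not perform the forward selection of basis functions. Treating a term with no factors as the constant 1 (an intercept) is an assumption of this sketch.

```python
def indicator_basis(subset, positive=True):
    """b+ (Equation 3.15) if positive, b- (Equation 3.16) otherwise."""
    def basis(value):
        inside = value in subset
        return 1.0 if inside == positive else 0.0
    return basis


def mars_predict(x, terms):
    """Equation 3.13: terms is a list of (alpha_m, [(variable index, basis), ...])."""
    total = 0.0
    for alpha, factors in terms:
        product = 1.0  # an empty factor list acts as a constant term
        for variable_index, basis in factors:
            product *= basis(x[variable_index])  # Equation 3.14
        total += alpha * product
    return total


# Hypothetical model with an intercept and two basis-function products.
terms = [
    (0.5, []),
    (2.0, [(0, indicator_basis({"a", "b"}))]),
    (-1.0, [(0, indicator_basis({"a"}, positive=False)), (1, indicator_basis({1}))]),
]
print(mars_predict(("a", 1), terms))  # prints 2.5
print(mars_predict(("c", 1), terms))  # prints -0.5
```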

3.2 Methods

The Pfizer set of 144 canonical SMILES90 (Table 3.1) was converted to 3D structures with CORINA50,51 and then geometry-optimized in the gas phase using the AM1 Hamiltonian in VAMP 9.0. The molecular surface properties were calculated and mapped with Parasurf ’06, using the marching-cube algorithm to fit to an isodensity surface of $8.0 \times 10^{-3}$ e·Å$^{-1}$. Parasurf’s 40 standard statistical descriptors (Chapter 2, Table 2.2) were augmented with three physical property descriptors calculated from multiple regression models for logP, the free energy of solvation in octanol ΔGsolv.(oct.), and the free energy of solvation in water ΔGsolv.(H2O), as described in the chapter on surface-integral models. These 43 descriptors were then used to train the C-support vector classification (C-SVC) and ν-support vector classification (ν-SVC) machines using Chang and Lin’s libsvm91. The data set was linearly scaled from −1 to +1 to prevent any single descriptor from dominating the training, and a radial basis function (RBF) kernel was used with adjustable parameters, C and γ, determined by the cross-validation and grid-search routine included with libsvm. The grid search is an automated trial of (C,γ) pairs that runs until a best cross-validation accuracy is reached. The γ parameter is the multiplier in the RBF exponential (Equation 3.9) and the C parameter corresponds, in the libsvm authors’ formulation for the soft margin hyperplane minimization expression, to an upper bound applied to the sum of training errors (Equation 3.10).
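The two preprocessing ideas in this paragraph, linear scaling of each descriptor to the range −1 to +1 and a grid of (C, γ) trial pairs, can be sketched as follows. This is a plain-Python illustration and not the libsvm tooling itself: the function names are hypothetical, the powers-of-two spacing and the exponent ranges are assumptions, and mapping a constant descriptor to 0.0 is a choice made here to avoid division by zero.

```python
def scale_column(values, lo=-1.0, hi=1.0):
    """Linearly map one descriptor column onto [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [0.0 for _ in values]  # assumed handling of a constant descriptor
    return [lo + (hi - lo) * (v - v_min) / (v_max - v_min) for v in values]


def parameter_grid(c_exponents, gamma_exponents):
    """All (C, gamma) pairs on a grid of powers of two (assumed spacing)."""
    return [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in gamma_exponents]


scaled = scale_column([2.0, 4.0, 6.0])             # [-1.0, 0.0, 1.0]
grid = parameter_grid(range(-1, 2), range(-2, 0))  # 3 x 2 = 6 trial pairs
```

Each pair on the grid would then be scored by cross-validation on the training set, and the pair with the best cross-validation accuracy retained.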

Figure 3.2 General processing pathway from SMILES strings to phospholipidosis prediction: CORINA (3D structures), VAMP 9.0 (geometry optimization), ParaSurf ’06 (surface descriptors), libsvm (support vector machine).

The University of Minnesota’s XTAL package92 was used for the training of multivariate adaptive regression splines. Piecewise-linear splines were used with a varying maximum number of basis functions to be used and the number of interactions set to the number of predictors in each case. A Leave One Out cross-validation scheme was used with the training parameters varied individually until a minimum RMS error was achieved.

3.3 Results

In the following section, predictions of the SVM and MARS models are presented in the form of confusion matrices, with the actual value of positive or negative with respect to the induction of phospholipidosis in rows with italicized text labels and the predicted values in columns with plus (+) and minus (−) symbols as labels.


3.3.1 Support Vector Machines

The support vector classification using the standard set of statistical descriptors with spherical-harmonic fitting yielded a model with a prediction accuracy of 47% for negatives and 95% for positives; an overall accuracy of 75%. This SVM (ν-SVC, ν=0.5) uses a radial basis function (γ=0.016) and possesses 43 support vectors.

            −     +
Negative   14    16
Positive    2    40

Figure 3.3 Confusion matrix for test set using spherical-harmonic fitting.
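The percentages quoted for each model follow directly from its confusion matrix. As a worked check, the short Python sketch below recomputes the per-class and overall accuracies from the counts in Figure 3.3, assuming the layout stated at the start of this section (rows are actual classes, columns are predicted classes). The function name is hypothetical.

```python
def class_accuracies(matrix):
    """matrix = [[neg as −, neg as +], [pos as −, pos as +]], rows = actual class."""
    (neg_as_neg, neg_as_pos), (pos_as_neg, pos_as_pos) = matrix
    negative_accuracy = neg_as_neg / (neg_as_neg + neg_as_pos)
    positive_accuracy = pos_as_pos / (pos_as_neg + pos_as_pos)
    total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos
    overall_accuracy = (neg_as_neg + pos_as_pos) / total
    return negative_accuracy, positive_accuracy, overall_accuracy


figure_3_3 = [[14, 16], [2, 40]]
neg_acc, pos_acc, overall = class_accuracies(figure_3_3)
print(f"negatives {neg_acc:.0%}, positives {pos_acc:.0%}, overall {overall:.0%}")
# prints: negatives 47%, positives 95%, overall 75%
```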

A support vector machine was trained with the marching-cube surface-fitted properties, which yielded a model with a prediction accuracy of 43% for negatives and 90% for positives; an overall accuracy of 71%. The SVM (ν-SVC, ν=0.4) uses a radial basis function (γ=0.016) and possesses 42 support vectors.

            −     +
Negative   13    17
Positive    4    38

Figure 3.4 Confusion matrix for the test set using marching-cube fitting.

An analysis of the correlation between the Parasurf statistical descriptors and the target classification was performed, and the ten most significant descriptors for the marching-cube-fitted surface properties were used, along with calculated logP, as an enriched predictor set (Table 3.3) to train SVM’s. Three of these descriptors were from a set of newly added Parasurf statistical descriptors, describing the skewness and kurtosis of the distribution of values for each of the local properties. The best ν-SVM (ν=0.500) with a radial basis function (γ=0.091) had 41 support vectors and an overall prediction accuracy of 75% for the test set (57% for negatives, 88% for positives).

            −     +
Negative   17    13
Positive    5    37

Figure 3.5 Confusion matrix for the test set using descriptors that are highly-correlated with the target values.

Table 3.3 Enriched descriptor set: 11 descriptors.

 1  logP
 2  Mol. Vol.
 3  Mean (−) MEP
 4  MEP (−) variance
 5  MEP kurtosis
 6  Mean IEL
 7  IEL variance
 8  IEL kurtosis
 9  Mean χL
10  χL variance
11  χL skewness

Given that the compounds that have been observed to induce phospholipidosis are generally cationic amphiphilic structures, an additional correlation analysis was performed using pKa and logP values calculated from surface-integral models. These two surface-integral properties proved to correlate well and were added to a list of the 18 most significant descriptors (Table 3.4) and used to train a support vector machine (C-SVC, C=0.8; radial basis function, γ=0.048). The resulting machine consisted of 49 support vectors and yielded a prediction accuracy of 53% for negatives and 98% for positives, or 79% overall.


            −     +
Negative   16    14
Positive    1    41

Figure 3.6 Confusion matrix for the test set using additional descriptors, pKa and logP.

Table 3.4 Enriched descriptor set: 20 descriptors.

 1  pKa
 2  logP
 3  Dipolar density, μD
 4  Molecular polarizability, α
 5  Mol. Wt.
 6  Globularity
 7  Mol. surface area
 8  Mol. volume
 9  Mean (+) MEP
10  Mean (−) MEP
11  MEP (−) variance
12  MEP total variance
13  MEP kurtosis
14  Mean IEL
15  IEL variance
16  IEL skewness
17  IEL kurtosis
18  Mean χL
19  χL skewness
20  ηL skewness
21  αL skewness

Six ν-SVM’s were trained using sets of 128 each of autocorrelation vectors for shape, molecular electrostatic potential, electron affinity, and ionization potential. The SVM using electron affinity autocorrelation vectors consisted of 46 support vectors (ν=0.400; RBF, γ=0.008) and yielded a prediction accuracy of 37% for negatives and 74% for positives, or 58% overall.


            −     +
Negative   11    19
Positive   11    31

Figure 3.7 Confusion matrix for the test set using EAL autocorrelation vectors.

The SVM using ionization potential autocorrelation vectors consisted of 25 support vectors (ν=0.200; RBF, γ=0.008) and yielded a prediction accuracy of 57% for negatives and 76% for positives, or 68% overall.

            −     +
Negative   17    13
Positive   10    32

Figure 3.8 Confusion matrix for the test set using IEL autocorrelation vectors.

The SVM using shape autocorrelation vectors consisted of 40 support vectors (ν=0.300; RBF, γ=0.008) and yielded a prediction accuracy of 37% for negatives and 83% for positives, or 64% overall.

            −     +
Negative   11    19
Positive    7    35

Figure 3.9 Confusion matrix for the test set using shape autocorrelation vectors.

The SVM using molecular electrostatic potential autocorrelation vectors (minus-minus) consisted of 42 support vectors (ν=0.200; RBF, γ=0.008) and yielded a prediction accuracy of 47% for negatives and 88% for positives, or 71% overall.


            −     +
Negative   14    16
Positive    5    37

Figure 3.10 Confusion matrix for the test set using VMM autocorrelation vectors.

The SVM using molecular electrostatic potential autocorrelation vectors (plus-minus) consisted of 44 support vectors (ν=0.200; RBF, γ=0.008) and yielded a prediction accuracy of 37% for negatives and 76% for positives, or 60% overall.

            −     +
Negative   11    19
Positive   10    32

Figure 3.11 Confusion matrix for the test set using VPM autocorrelation vectors.

The SVM using molecular electrostatic potential autocorrelation vectors (plus-plus) consisted of 40 support vectors (ν=0.200; RBF, γ=0.008) and yielded a prediction accuracy of 43% for negatives and 90% for positives, or 71% overall.

            −     +
Negative   13    17
Positive    4    38

Figure 3.12 Confusion matrix for test set using VPP autocorrelation vectors.

The sets of autocorrelation vectors were also combined (truncated to 28 increments each) and used to train a ν-SVM (ν=0.500; RBF, γ=0.006) with 45 support vectors. The prediction accuracy was 53% for negatives and 95% for positives for an overall accuracy of 78%.


            −     +
Negative   16    14
Positive    2    40

Figure 3.13 Confusion matrix for test set using all autocorrelation vectors.

The ν-SVM’s were trained with a 10-fold cross-validation scheme with no misclassification penalty bias (misclassification of negatives is treated as equivalent to misclassification of positives in the training). As such, the support vector machines tended toward far more false negatives than false positives.

3.3.2 Multivariate Adaptive Regression Splines

Using Autocorrelation Indices

Using the local properties mapped onto the marching cube surface for each structure, 66 autocorrelation similarity indices described in Section 3.1.3 for all compounds in both sets were calculated using the surface of valproic acid (test set; negative) as a reference. The final model consisted of 12 basis functions and had a generalized cross-validation error of 0.219. The prediction accuracy for the training set was 97%, predicting one false positive and one false negative (Figure 3.14). When the regression splines equations were applied to the test set, a prediction accuracy of 68%, with 20 false positives and 3 false negatives was found (Figure 3.15).

            −     +
Negative   27     1
Positive    1    42

Figure 3.14 Confusion matrix for the training set.


            −     +
Negative   10     3
Positive   20    39

Figure 3.15 Confusion matrix for the test set.

The same procedure was applied using the standard set of Parasurf ’06 statistical descriptors. The training set yielded a model that consists of 9 basis functions and had a generalized cross-validation error of 0.351. The prediction accuracy for the training set was 90%, predicting 3 false positives and 4 false negatives (Figure 3.16). When the regression equations were applied to the test set, a prediction accuracy of 72% was found, with 15 false positives and 5 false negatives (Figure 3.17).

            −     +
Negative   25     4
Positive    3    40

Figure 3.16 Confusion matrix for the training set.

            −     +
Negative   15     5
Positive   15    37

Figure 3.17 Confusion matrix for the test set.

Thus, the MARS models are slightly less accurate in their predictions as compared to the support vector machines, but they also predict many more false positives (while the SVM’s predict many more false negatives).


3.4 Discussion

It seems clear from the comparison with the MARS prediction accuracies (of approximately 70%) that there is not an obvious condition of over-fitting of the training data in the case of the SVM’s, with an averaged prediction accuracy of 75.6%. The C-SVC machine (RBF; C=0.8; γ=0.048) with the largest feature space margin (ρ=2.405) with 49 support vectors was able to classify 57 of the 72 compounds in the test set correctly (79% accuracy). It is generally useful to note the size of the feature space margin, ρ, as an indicator of the relative ability of the SVM to generalize in the prediction of new data, but as this value is in the units of the n-dimensional transformed feature space for a particular SVM, it cannot be used as a standard measure. More useful is a direct comparison with the predictive capacity of another multivariate analysis technique, such as the MARS analyses presented here. That the two techniques give similar predictive accuracies suggests that the best models will generalize as well as the training data will allow. Among the different SVM trainings using the various descriptor sets, there were 16 cases where molecules were consistently predicted correctly by all trained machines and several cases where molecules were predicted incorrectly. The most consistently well-predicted members of the test set are shown below in Table 3.5 in bold italics, while the most misclassified compound, ceftazidime, is underlined.

Table 3.5 Test set misclassifications among trained support vector machines.

Compound                Number of Misclassifications
3-Methylcholanthrene    2
AC-3579                 1
Acetaminophen           2
Amikacin                1
Amiodarone              1
Amitriptyline           1
Anticoman               3
Aricept                 3
AY-25329                2
AY-9944                 0
Bicalutamide            2
Boxidine                1
Bupropion               1
Carbon_tetrachloride    3
Ceftazidime             4
Cephaloridine           1
Chlorcyclizine          2
Chloroform              3
Chlorphentermine        1
Citalopram              1
Clindamycin             0
Clomipramine            1
Clozapine               3
Cocaine                 2
Coralgil                1
Dantrolene              1
Demeclocycline          0
Desferal                3
Dibekacin               1
Diclofenac              2
Diflunisal              2
Di-isobutamide          0
Doxapram                3
Doxycycline             1
Emetine                 0
Ethyl_fluclozepate      2
Famotidine              2
Fenfluramine            2
Flutamide               0
Homochlorcyclizine      2
Hydrazine               2
Hydroxyurea             1
IA3                     1
Imipramine              0
Indoramin               2
L-Ethionine             2
Maprotiline             1
Meclizine               0
Mesoridazine            1
Metformin               0
Methadone               3
Methapyrilene           3
Mianserin               2
Netilmicin              0
Noxiptiline             1
Paraquat                2
Perhexiline             1
Procaine                1
Promethazine            2
Propranolol             1
Quinacrine              1
Quinidine               1
Rolitetracycline        0
SDZ_200-125             3
SKF-14336-D             0
Stavudine               0
Tacrine                 0
Trifluperazine          2
Triparanol              0
Tunicamycin             1
Valproic_acid           3
Zimelidine              0


In general, structures that have a negative assay result for phospholipidosis induction are under-represented in the data set and the multivariate adaptive regression splines that have been applied to the surface property data have proved to predict more false positives than false negatives, while the support vector machines predict in an opposite manner. In both cases there is a tendency of the multivariate methods to bias their predictions toward the correct classification of primarily positives or primarily negatives, with the border between them remaining rather unresolved. Overall, the best combinations of surface predictors and multivariate methods give a prediction accuracy of 75-79% for this data set. As more research is published on the prediction of phospholipidosis, larger data sets will be available for use in the construction of computational models and thus, the models themselves will improve. In a previous experiment examining the effect of charge state on the predictive capacity of the SVM’s, the structures in the training set were ionized according to their charge state (ionized > 50%) in solution at physiological pH (7.4) and used to train several SVM’s. The structures in the test set were also ionized by this criterion and used to test the predictive capacity of the trained machines. The resulting SVM’s proved excellent in predicting the charge state of the molecules, but very unreliable in predicting phospholipidosis induction (~50% overall accuracy). As a result, all structures were used in their neutral forms. It would seem that, while the charge state of a given molecule may represent its true state in solution, the effect on the surface descriptors is to diminish the impact of the weaker, non-electrostatic components such as molecular polarizability in the subsequent classification schemes. 
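The ionization criterion used in that experiment (more than 50% ionized at pH 7.4) can be illustrated with the Henderson–Hasselbalch relation. The sketch below assumes a single ionizable group with a known pKa; the function names and the example pKa values are hypothetical, and this is not necessarily the procedure that was used to assign charge states in this work.

```python
def fraction_ionized(pka, ph, basic=True):
    """Henderson-Hasselbalch fraction in the charged form for one ionizable group.

    For a base the charged form is the protonated species; for an acid it is
    the deprotonated species.
    """
    exponent = (ph - pka) if basic else (pka - ph)
    return 1.0 / (1.0 + 10.0 ** exponent)


def is_ionized(pka, ph=7.4, basic=True, threshold=0.5):
    """Apply the 'ionized > 50%' criterion at the given pH."""
    return fraction_ionized(pka, ph, basic) > threshold


# Hypothetical examples: an amine with pKa 9.4 and a carboxylic acid with pKa 4.0.
amine_is_charged = is_ionized(9.4, basic=True)   # True: about 99% protonated
acid_is_charged = is_ionized(4.0, basic=False)   # True: mostly deprotonated
```

Under this relation a base counts as ionized at pH 7.4 when its pKa exceeds 7.4, and an acid when its pKa is below 7.4.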
The work of Tomizawa, et al.93, drawing on earlier work by Ploemen94 and Fischer95, describes the use of two predictors, net molecular charge (NC) (based on the relative charge distribution of molecules in solution at a specified pH of 4.0 from a calculation of pKa) and ClogP, in the prediction of phospholipidosis-inducing potential (PLIP), giving a PLIP risk rating to each compound in their combined set of 63 compounds. The reported prediction accuracy is 98%, with only one misclassified compound in their validation set. This simple and efficient method seems highly predictive, but it is little more than a set of rules in the manner of Lipinski’s Rule of Five96 and, as the authors indicate, its accuracy is wholly dependent on the degree of accuracy of the atom/fragment-based calculations of pKa and ClogP. And aside from predicting that cationic amphiphilic species are, in fact, cationic and amphiphilic, the method does not explore or allow for the exploration of the underlying relationships between the CAD and its environment, in terms of the close-contact regions with the surfaces of intra-lysosomal phospholipids or phospholipases.

Thus, in terms of application to efficient high-throughput virtual screening, the more lightweight, less computationally-intensive methods are the more desirable, which, in this case are the methods of Tomizawa, Ploemen, and Fischer, with whatever their actual prediction accuracies might be. In the case of our local property/SVM technique, what is lost in a marginally greater computational cost is made up for in the accumulation of local property information that may be used to ascribe electronic surface interactions to actual processes involved in inducing phospholipidosis. The main drawback here, again, is the present lack of interpretability of the local properties as a collection of statistical measures. However, insofar as the properties of pKa and logP, themselves, may be accurately predicted by local property surface-integral models (Chapter 2), it seems clear that local surface properties must play a significant role in the interplay of forces governing the initiation of phospholipid aggregation within the lysosome.

3.5 Conclusions

This study demonstrated the use of local surface properties in a support vector machine methodology to predict phospholipidosis induction given a set of molecular surfaces as described by statistical measures of the local properties. It is interesting to note that the support vector machine trained with the additional pKa and logP descriptors calculated from surface-integral models was the most accurately predictive in terms of classification by local property descriptor. This suggests not only the importance of these particular properties to the process of phospholipidosis induction, but, as these values are themselves calculated from the same pool of local surface properties, the importance of the local properties in describing the range of surface interactions involved in the process associated with phospholipidosis.


Chapter 4

Three-Dimensional Quantitative Structure-Activity Relationships Using Local Properties

4.1 Introduction

4.1.1 Comparative Molecular Field Analysis

Comparative Molecular Field Analysis97, or CoMFA, is a 3D-QSAR method developed by the group of Richard D. Cramer, III that involves modeling the relationship between ligands and receptors in terms of steric and electrostatic interactions. This is done by aligning a set of molecular structures that have an associated activity value (logK, inhibitory concentration, etc.). A three-dimensional grid is generated around the aligned molecules and probe “atoms” are placed at each point in the grid. The steric and electrostatic potentials that arise from proximity with the atoms in the aligned molecules are recorded and used as independent variables in a partial least squares regression, with the activity values as dependent variables. A Leave-One-Out cross-validation scheme is used in Tripos’ SYBYL36 implementation of CoMFA to estimate the predictive capacity of the model in terms of q2, the cross-validated r2 of the model. A three-dimensional contour map is then plotted that relates regions of steric and electrostatic potential to activity. Colored figures in the space around the aligned molecules indicate regions that relate positively and negatively to activity (Figure 4.1).

Figure 4.1 A representation of a SYBYL CoMFA analysis of coumarin substrates as inhibitors of cytochrome P450 2A598.

The most common method of molecular alignment is by substructure. The alignment algorithms use a reference fragment as a template for aligning all other structures, as in SYBYL. Cepos InSilico’s Parafit aligns structures using a set of spherical harmonic functions that are produced by Parasurf to generate a molecular surface. Local properties that are mapped onto the surface, i.e. onto the spherical harmonic functions, can then be used as a template for alignment by common electronic properties such that the set of molecules need not have a common substructure. In addition to the alignment by overlaying the spatial positions of spherical harmonics, alignment may also be conducted by similarity of fitted local electronic properties, such as electronegativity.


The measure of the predictive capacity of a CoMFA model, according to the SYBYL manual, is found in the statistical measures r2, the coefficient of determination of the fitted model, and q2, the “predictive r2”. The latter is the measure of the fit of the cross-validated predictions which, in the case of the standard CoMFA and the method employed here, is a full Leave-One-Out cross-validation scheme, with each case left out in turn and predicted by the rest of the data set. The value of r2 should always be greater than 0.6 (a good model should have an r2 > 0.9) and the value of q2 falls into one of three categories:

• q2 > 0.6: The model is fairly good.
• 0.4 < q2 < 0.6: The model is questionable.
• q2 < 0.4: The model is poor.
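As an illustration of how these statistics can be obtained from a list of cross-validated predictions, a minimal sketch follows. It assumes one common definition of q2, namely one minus the ratio of the predictive residual sum of squares (PRESS) to the total sum of squares about the mean; the function names are hypothetical, and the handling of the boundary values 0.4 and 0.6 (which the categories above leave open) is an arbitrary choice.

```python
def q_squared(observed, predicted):
    """Cross-validated r2, taken here as 1 - PRESS / SS_total (an assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - press / ss_total


def rate_model(q2):
    """Apply the three q2 bands; exact boundary values are treated as 'questionable'."""
    if q2 > 0.6:
        return "fairly good"
    if q2 >= 0.4:
        return "questionable"
    return "poor"
```

Under this definition a model that predicts no better than the mean of the observed values gives q2 = 0, and q2 becomes negative when the cross-validated predictions are worse than the mean.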

In addition, a minimum number of vector components (described in the following section) should be used that improves r2 by at least 5%. Typically, the number of components in a given model should be no greater than seven or eight. In general, the lower the number of components, the more straightforward the relationship between the probe parameters and observed activity.

4.1.2 Partial Least Squares Regression

Representing the large number of steric and electrostatic potential values determined for each of the many grid points (in some cases, thousands of values) in a meaningful way becomes difficult for typical statistical analytical methods. It is a problem of how to select the important predictors from such a large set of data. In instances of QSAR/QSPR modeling where multiple regression analyses result in poorly-predicting models due to cases of over-fitting of the data or where large numbers of factors cannot be avoided due to the nature of the experiment, a statistical method very similar to principal component analysis, called Partial Least Squares (PLS) analysis99, can be used to extract latent factors in the data that account for the variation in the target values. Introduced by Wold100 and co-workers around 1979, and referred to as the Projection to Latent Structures in statistics texts101, the general method involves transforming the matrices of the factors and the target values into new vector spaces such that the relationship between successive pairs of scores is as high as possible. Directions in transformed factor space that associate with the greatest variance in the responses that are also biased toward directions in response space that result in accurate predictions are used as a means of indirectly modeling the target values. The extracted factors depend on all input variables, with each factor contributing successively less to the predictivity of the model. Thus, while there is no data reduction involved in the process itself, only a certain number of factors, or vector components, (usually determined by some measure of residual variance) are used in the final model. The most common method of determining the maximum number of vector components to be used is to cross-validate with an increasing number of components until the cross-validated prediction error reaches a minimum.
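The extraction of successive latent factors and their use in Leave-One-Out prediction can be sketched for the single-response case as follows. This is a NIPALS-style PLS1 written in plain Python for illustration only; it is not the SIMPLS program used later in this chapter, it centers but does not scale the data, and all function names are hypothetical.

```python
def pls1_fit(X, y, n_components):
    """Fit a NIPALS-style PLS1 model on column-centered data (illustrative sketch)."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # residual factors
    f = [v - y_mean for v in y]                                # residual targets
    components = []
    for _ in range(n_components):
        # Weight vector: direction in factor space with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        if tt < 1e-24:
            break
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # loadings
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate both blocks so the next factor explains what is left.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict one case by replaying the score/deflation steps of the fit."""
    x_mean, y_mean, components = model
    e = [x[j] - x_mean[j] for j in range(len(x))]
    y_hat = y_mean
    for w, p, q in components:
        t = sum(e[j] * w[j] for j in range(len(e)))
        y_hat += q * t
        e = [e[j] - t * p[j] for j in range(len(e))]
    return y_hat


def loo_predictions(X, y, n_components):
    """Leave each case out in turn and predict it from the remaining cases."""
    preds = []
    for i in range(len(X)):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_components)
        preds.append(pls1_predict(model, X[i]))
    return preds
```

Repeating `loo_predictions` for one, two, three, and more components and comparing the resulting prediction errors is one way to locate the minimum described above.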

4.1.3 Local Properties

It has been argued102 that steric and electrostatic fields do not present a complete picture of drug-receptor interactions, so more recently other 3D-QSAR methods have been developed that take additional physicochemical properties into account. One such method is known as Comparative Molecular Similarity Index Analysis103 (CoMSIA) and follows the same general CoMFA methodology, but using atomic probes for local hydrophobicity, hydrogen bond donors and acceptors, as well as for steric and electrostatic potential contributions. Another major difference lies in the use of a Gaussian distance function applied to grid values such that there are no dramatic property changes from grid point to grid point and the use of similarity indices between structures for each property used as factors in PLS analysis. The indices are calculated by:

$$S^{q}(j) = -\sum_{i=1}^{n} \omega_{k} \cdot \omega_{i} \cdot e^{-\alpha r_{iq}^{2}} \tag{4.1}$$

where, for molecule $j$, $\omega_{i}$ is the target property value of atom $i$, $\omega_{k}$ is the local property value at grid point $q$ for a probe atom (charge +1, radius 1Å, hydrophobicity +1, H-bond donor index +1, and H-bond acceptor index +1), $r_{iq}$ is the distance between grid point $q$ and atom $i$, and $\alpha$ is an attenuation factor. The models that result from CoMSIA are generally more predictive in terms of q2 than their CoMFA counterparts and have the ability to model the binding surfaces of the ligand-substrate complex more accurately in terms that are familiar to the biochemist.
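For concreteness, the index of Equation 4.1 can be evaluated at a single grid point as in the following sketch. The function name, the representation of atoms as (position, property value) pairs, and the default attenuation factor of 0.3 are assumptions made for illustration and are not taken from the text.

```python
import math


def similarity_index(grid_point, atoms, probe_value=1.0, alpha=0.3):
    """Evaluate -sum_i w_k * w_i * exp(-alpha * r_iq**2) at one grid point.

    atoms: iterable of ((x, y, z), w_i) pairs for one molecule (hypothetical layout).
    probe_value: property value w_k of the probe atom (+1 for each property).
    alpha: attenuation factor; 0.3 is only an illustrative default.
    """
    total = 0.0
    for position, w_i in atoms:
        r_squared = sum((g - a) ** 2 for g, a in zip(grid_point, position))
        total += probe_value * w_i * math.exp(-alpha * r_squared)
    return -total
```

Because the Gaussian term decays smoothly with distance, the index changes gradually from one grid point to the next, in contrast to potentials that rise steeply near the atoms.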

In the interest of expanding the descriptive vocabulary of 3D-QSAR, a method has been developed that uses local properties to model the electronic interactions of drug binding surfaces using a methodology analogous to CoMFA. The approach described below begins with the standard steric and electrostatic descriptors of CoMFA, in the form of the local electron density and the local molecular electrostatic potential, respectively. To these are added the local properties of electron affinity, ionization potential, electronegativity, hardness, and polarizability. The result is an augmented molecular field analysis that is interpreted in a 3D-graphical manner identical to that of CoMFA, but with additional property fields that may be used alone or in combination to reveal important intermolecular interactions not elucidated by shape and charge fields alone.

Figure 4.2 A set of aligned structures in a CoMFA grid.


4.2 Computational Methods

The structures in the following data sets were aligned using SYBYL 7.0 via conversion to Tripos mol2 format, followed by conversion back into MDL SD format. Semi-empirical MO calculations with the AM1 Hamiltonian were performed on each using VAMP 9.0 to calculate charges and orbital information, with or without geometry optimization as indicated. A three-dimensional grid with a point spacing of 2 Ångstroms and a 4 Ångstrom border was generated by script and was used in the calculation of seven local properties: electron density δe, electron affinity EAL, electronegativity χL, hardness ηL, ionization potential IEL, electrostatic potential V, and polarizability αL at each grid point using Parasurf ’07104 (Figure 4.2).
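The grid-generation script itself is not reproduced here. A minimal sketch of one way to build such a grid, assuming a rectangular box around the atomic coordinates of the aligned set with the border added on every side, might look as follows; the function name and the choice to start the lattice at the lower corner of the box are assumptions.

```python
import math


def make_grid(atom_coords, spacing=2.0, border=4.0):
    """Return lattice points covering the bounding box of the atoms plus a border."""
    axes = []
    for dim in range(3):
        lo = min(c[dim] for c in atom_coords) - border
        hi = max(c[dim] for c in atom_coords) + border
        steps = math.ceil((hi - lo) / spacing)  # enough steps to reach or pass hi
        axes.append([lo + k * spacing for k in range(steps + 1)])
    return [(x, y, z) for x in axes[0] for y in axes[1] for z in axes[2]]
```

Halving the spacing in such a routine roughly doubles the number of points along each axis, so the total number of grid points grows by about a factor of eight.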

Figure 4.3 Representation of the local-property/activity CoMFA field for EAL.

Points interior to the molecular surface were removed from the grid by using a “generalized” van der Waals radius of 1.16 Ångstroms in order to ensure that property values that bear no direct relation to surface activity would not appear in the PLS analyses. This atomic radius, which is slightly smaller than the van der Waals radius for the hydrogen atom (1.20Å), was chosen in order to leave enough surface electron density to use the local electron density as a steric parameter in the 3D-QSAR analysis. The local property values at the grid points were then used as independent variables in separate partial least squares regressions, using associated physical property values as target values. The partial least squares analyses were performed with an in-house program using the SIMPLS105 algorithm and a full cross-validation scheme (i.e. each case is excluded and predicted in turn), re-centering and re-scaling the included data for each run. The PLS regressions were carried out initially to ten vector components in order to determine the maximum number of components to be included in the final model, using the cross-validated standard error of prediction (SEP) of the model to choose the appropriate number of components (as in SYBYL). In the cases where PLS analyses were performed using single local properties, the property data were normalized by the mean. The regression coefficients for the final model were then used to generate a three-dimensional representation of the property space with colored spheres using Pymol45. Those grid points with a positive relationship to the particular target property are color-coded red, while those with a negative relationship to the target property are color-coded blue. In addition to color-coding, the size of the grid spheres, determined by the standardized magnitude of the regression coefficients, represents the magnitude of the relationship between the local property at that point and the target value (see Figure 4.3).
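The removal of interior points can be illustrated with a short filter, given here as a sketch rather than the script actually used. It assumes that a point counts as interior when it lies within the generalized radius of any atom center; the function name and the treatment of points lying exactly at the radius (removed) are arbitrary choices.

```python
import math


def remove_interior_points(grid_points, atom_coords, radius=1.16):
    """Keep only grid points farther than `radius` (in Å) from every atom center."""
    return [
        point for point in grid_points
        if all(math.dist(point, atom) > radius for atom in atom_coords)
    ]
```

With a single radius for all atoms the test reduces to one distance comparison per atom and grid point, which keeps the filter inexpensive even for several thousand points.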

4.3 Results and Discussion

4.3.1 Serotonin Receptor Agonists/Antagonists

The common 5-HT1A and α1-adrenergic agonist/antagonist data set consists of 23 thienopyrimidinone structures in Tripos mol2 format that had been optimized previously106 with MM3107. The structures in Table 4.1 were aligned to structure 23 and converted to MDL SD format. Single-point AM1 calculations were used to calculate the charges and orbital information needed by Parasurf ’07 for the grid (4Å border, 2Å spacing) points surrounding the aligned molecules. The pIC50 values for 1) 5-HT1A receptor binding and 2) α1-adrenergic receptor binding for each compound were used as target values in PLS analyses, where IC50 is the concentration of ligand that causes 50% dissociation of [3H]-8-hydroxy-2-(di-N-propylamino)tetralin from the 5-HT1A receptor or 50% dissociation of [3H]-prazosin from the α1 receptor in binding assays.
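The conversion from a measured IC50 to the pIC50 scale used as the target value is the usual negative decadic logarithm. A minimal sketch, assuming the concentration is expressed in mol/L (the function name is hypothetical):

```python
import math


def p_ic50(ic50_molar):
    """Return pIC50 = -log10(IC50), with IC50 assumed to be given in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return -math.log10(ic50_molar)
```

On this scale a tenfold decrease in IC50 raises pIC50 by one unit, so larger target values correspond to more potent ligands.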

The PLS analysis of the aligned data set using all local properties yielded a q2 of 0.761 with one vector component, an overall SEP of 0.793, and a model r2 of 0.870. The cross-validated predictions are presented below in Figure 4.4. The results of the PLS analyses using individual local properties are presented in Table 4.2.

Table 4.1 Thienopyrimidinone 5-HT1A and α1 agonists/antagonists.

[Structure diagrams: general scaffold with substituents R1-R4; structures 21 and 22]

Structure  R1     R2     R3      R4            5-HT1A pIC50   α1 pIC50
1          Me     Me     H       2-Cl-Ph       6.34           6.79
2          Me     Me     H       3-Cl-Ph       6.01           6.52
3          Me     Me     H       2-OMe-Ph      7.62           7.40
4          Me     Me     H       1-Naphthyl    6.45           6.05
5          Me     Me     H       2-Pyrimidyl   6.65           5.96
6          -(CH2)4-      H       2-Cl-Ph       6.03           6.78
7          -(CH2)4-      H       2-OMe-Ph      7.23           7.42
8          -(CH2)4-      H       1-Naphthyl    6.43           6.35
9          -(CH2)4-      H       2-Cl-Ph       6.30           5.74
10         H      Ph     H       2-OMethenyl   6.41           6.65
11         H      Ph     H       1-Naphthyl    5.70           5.61
12         -(CH=CH)-     H       2-OMe-Ph      7.34           7.04
13         H      H      NH2     2-OMe-Ph      8.92           8.54
14         -(CH2)4-      Me      2-OMe-Ph      8.15           7.19
15         -(CH2)4-      NH2     2-OMe-Ph      8.89           7.41
16         Me     Me     NH2     Ph            8.48           7.37
17         Me     Me     Me      2-OMe-Ph      8.52           7.57
18         Me     Me     NH-Ph   2-OMe-Ph      6.30           7.49
19         Me     Me     Me      2-Pyrimidyl   7.19           5.69
20         Me     Me     NH2     2-Pyrimidyl   8.17           6.30
21         −      −      −       −             9.10           7.44
22         −      −      −       −             9.30           8.40
23         Me     Me     NH2     2-OMe-Ph      9.52           8.14

[Plot: calculated pIC50 vs. experimental pIC50]

Figure 4.4 Cross-validation predictions vs. actual values of 5-HT1A receptor binding pIC50.

Table 4.2 Partial least squares regression results for the 5-HT1A data set.

δe EAL χL ηL IEL V αL ALL

r2 0.682911 0.759950 0.974536 0.818111 0.936360 0.613866 0.780818 0.869525

q2 0.22754 0.66781 0.82511 0.698682 0.750238 0.509428 0.699553 0.761065

SEP 1.245238 0.913798 0.690154 0.874884 0.81127 1.059668 0.874447 0.792926

Components 1 1 2 1 2 1 1 1

With two exceptions, none of the PLS regressions required more than two vector components to return q2 values greater than 0.6. The exceptions (Table 4.2) are electron density (δe) and molecular electrostatic potential (V), the terms analogous to the two standard CoMFA parameters, neither of which reaches a q2 of 0.6. Judging from the q2 value (0.825) for the regression using only the local electronegativity, a 3D-QSAR model (Figure 4.5) using this single local property is sufficient to predict the serotonin inhibitory concentration for the data set.

[Plot: calculated pIC50 vs. experimental pIC50]

Figure 4.5 Cross-validation predictions vs. actual values of 5-HT1A receptor binding pIC50 using only local electronegativity.

Figure 4.6 Local-electronegativity/activity field for the aligned 5-HT1A data set.


In Figure 4.6, a strong positive relationship with activity is observed near the distal end of aligned thienopyrimidinone nitrogenous substituents, while a larger region of negative activity resides near the distal nitrogen of the aligned piperazine rings.

4.3.2 Adrenergic Receptor Agonists/Antagonists

The PLS analysis of the aligned data set using all local properties yielded a q2 of 0.700 with four vector components (Table 4.3), an overall SEP of 0.602, and a model r2 of 0.980. The cross-validated predictions are presented below in Figure 4.7 and the results of the PLS analyses using individual local properties are presented in Table 4.3.

[Plot: calculated pIC50 vs. experimental pIC50]

Figure 4.7 Cross-validation predictions vs. actual values of α1-adrenergic receptor binding pIC50.

Table 4.3 Partial least squares regression results for the α1-adrenergic receptor data set.

δe EAL χL ηL IEL V αL ALL

r2 0.837437 0.992341 0.964925 0.976671 0.949911 0.480773 0.656486 0.980002

q2 0.300313 0.676767 0.765624 0.635388 0.677426 0.296878 0.519698 0.700124

SEP 0.817689 0.633365 0.545112 0.659089 0.621647 0.820678 0.724712 0.60231

Components 2 7 3 4 3 2 1 4


Here again, with a q2 of 0.766, the local electronegativity model predicts slightly better than the combined local property model and the two standard CoMFA parameters are the poorest-performing of the local properties. Additionally, more vector components were required to construct each of the α1 receptor models than were required for the 5-HT1A models. The plot of experimental versus calculated pIC50 is presented below in Figure 4.8.

[Plot: calculated pIC50 vs. experimental pIC50]

Figure 4.8 Cross-validation predictions vs. actual values of α1-adrenergic receptor binding pIC50 using only local electronegativity.

Figure 4.9 Local electronegativity/activity field for the aligned α1-receptor data set.


As in the case of the 5-HT1A data, a positive response in local electronegativity near the distal end of the thienopyrimidinone rings was observed for the α1 receptor data (Figure 4.9). There are several negative response regions, however, that describe a rather complicated response in local electronegativity, primarily on the thienopyrimidinone end of the structures. The property field regions of positive response common to both sets of data indicate a relationship between electronegative (nitrogenous) substituents on the thienopyrimidinone ring and inhibitory activity, while the regions of negative impact are different for the two data sets.

4.3.3 Dopamine D4 Antagonists

The D4 dopamine antagonist data set consists of 29 MDL SD piperazine structures that had been optimized previously108 with MM3. The structures were converted to mol2 format, then aligned to a central substructure using SYBYL 7.0 and converted back to SD format. Single-point AM1 calculations using VAMP 9.0 were used to calculate charges and orbital information that were used as input for Parasurf ‘07, which calculated the local properties for each point in a three-dimensional grid (4Å border, 2Å spacing) surrounding the aligned molecules. The pKi (the negative logarithm of the inhibition constant) values for the dopamine D4 receptor binding for each compound were used as the target values in PLS analyses.

Table 4.4 Piperazine dopamine D4 receptor antagonists.

[Structure diagrams: scaffold for structures 1-16 and 26-29 and scaffold for structures 17-25, showing ring positions 4-7 and substituents R1 and R2]

Structure  R1             R2          exp. pKi  calc. pKi
1          p-Cl-Ph        H           8.64      7.935
2          Ph             H           7.78      7.379
3          p-I-Ph         H           8.52      8.047
4          p-F-Ph         H           7.70      7.834
5          Me             H           5.14      7.028
6          Et             H           4.62      5.911
7          p-Cl-Ph        4-Me        7.30      7.522
8          p-Cl-Ph        7-I         8.30      9.439
9          p-Cl-Ph        7-Me        8.57      7.921
10         p-Cl-Ph        7-ethinyl   8.91      7.905
11         cyclohexyl     H           5.35      6.698
12         m-Cl-Ph        H           8.41      7.661
13         p-Cl-Ph        4,5-benzo   5.74      7.513
14         p-Cl-Ph        6,7-benzo   6.85      8.194
15         m-Cl-Ph        4,5-benzo   6.10      6.532
16         p-Cl-Ph        6,7-benzo   6.66      7.297
17         (drawn)        −           7.58      7.805
18         (drawn)        −           8.02      8.426
19         (drawn)        −           7.41      8.982
20         (drawn)        −           8.24      8.834
21         (drawn)        −           7.74      7.822
22         (drawn)        −           8.66      7.64
23         (drawn)        −           6.60      7.705
24         (drawn)        −           9.21      7.581
25         (drawn)        −           7.80      7.101
26         p-ethinyl-Ph   H           8.36      8.036
27         m,p-Cl-Ph      H           8.25      7.708
28         m-CF3-Ph       H           8.72      7.646
29         H              CH2OH       7.71      7.935

(drawn): R1 for structures 17-25 is given as a structure drawing in the original table.


The PLS regression model using all local properties yields a q2 value of 0.623 with three vector components and an overall SEP of 0.960, and a model r2 of 0.906. A plot of the cross-validated predictions is presented in Figure 4.11. The results of the PLS analyses using individual local properties are presented in Table 4.5.

Table 4.5 Partial least squares regression results for the D4 receptor data set.

δe EAL χL ηL IEL V αL ALL

r2 0.886454 0.533339 0.900134 0.922590 0.900603 0.930679 0.375814 0.905785

q2 0.67420 0.27462 0.626792 0.566274 0.616129 0.778059 0.098352 0.623449

SEP 0.94303 1.188695 0.955174 1.012352 0.972363 0.784165 1.235116 0.959651

Components 3 1 3 4 3 7 1 3

Figure 4.10 Molecular electrostatic potential field for the aligned D4 receptor set.

With this data set, the electrostatic potential (Figure 4.10) and electron density regressions yield the most predictive models, with local electronegativity and local ionization potential also providing significant contributions. It is, therefore, to be expected that a standard CoMFA would produce a comparable model and, indeed, the reported q2 for the standard analysis with this data set was 0.739, with an SEP of 0.734 using seven vector components108.

[Plot: calculated pKi vs. experimental pKi]

Figure 4.11 Cross-validation predictions vs. actual values of dopamine D4 receptor binding pKi.

The original article described the use of an all-orientation109 sampling of CoMFA property space to return the best possible q2 value, which may over-estimate the relationship between the observed activity and the combined steric and electrostatic parameters.

4.3.4 Avian Influenza Neuraminidase Inhibitors

A subset of 21 2D structures and accompanying pIC50 values (Table 4.6) were taken from a larger set of 126 avian influenza neuraminidase inhibitors110. These were converted to 3D MDL SD files using Molecular Networks’ CORINA and subsequently geometry-optimized with AM1 with VAMP 9.0. The structures were aligned with SYBYL 7.0 and the set of Parasurf local properties were calculated for a grid (4Å border, 2Å spacing) surrounding the structures.


Table 4.6 Avian influenza neuraminidase inhibitors.

[Structure diagram: scaffold bearing COOH, R1O, R2, and NHR3 groups]

Structure  R1             R2            R3     exp. pIC50  calc. pIC50
1          CHEt2          NH2           COMe   9.00        7.15
2          C3H7           NH2           COMe   6.89        6.44
3          CH2CH2CF3      NH2           COMe   6.65        5.59
4          CH2CH=CH2      NH2           COMe   5.66        6.66
5          Me             NH2           COMe   5.43        6.74
6          C2H5           NH2           COMe   5.70        7.08
7          C4H9           NH2           COMe   6.52        6.44
8          C5H11          NH2           COMe   6.70        6.86
9          C6H13          NH2           COMe   6.82        6.61
10         C7H15          NH2           COMe   6.57        6.93
11         C8H17          NH2           COMe   6.74        6.81
12         C9H19          NH2           COMe   6.68        6.40
13         C10H21         NH2           COMe   6.22        6.39
14         CH2CHMe2       NH2           COMe   6.70        5.56
15         CH(Me)Et (S)   NH2           COMe   8.05        7.17
16         CH2C6H5        NH2           COMe   6.21        7.06
17         H              NHC(=NH)NH2   COMe   7.00        7.75
18         C3H7           NHC(=NH)NH2   COMe   8.70        8.30
19         C4H9           NHC(=NH)NH2   COMe   8.52        8.43
20         CH(Me)Et       NHC(=NH)NH2   COMe   9.30        9.86
21         C3H7           NH2           COMe   5.82        6.76

The PLS regression model using all local properties yields a q2 value of 0.678 with four vector components and an overall SEP of 0.847, and a model r2 of 0.965. As can be seen in Table 4.7, all of the local properties contribute significantly to the predictivity of the model, with the exception of the local electron density and the local molecular electrostatic potential. The regressions of local electron affinity and local molecular polarizability are the best predictors of activity in this case, with q2 values greater than 0.7.


Either of these local properties, alone, should be adequate in predicting the observed activity and indicating the dependence of the activity on EAL or αL.

Table 4.7 Partial least squares regression results for the neuraminidase inhibitor data set.

δe EAL χL ηL IEL V αL ALL

r2 0.711406 0.953357 0.963215 0.986630 0.982036 0.606607 0.913741 0.965417

q2 0.559258 0.729133 0.680904 0.690092 0.681546 0.463248 0.733392 0.677654

SEP 0.937534 0.858793 0.838464 0.825561 0.836767 1.008261 0.782692 0.847242

Components 2 4 4 6 5 1 3 4

[Plot: calculated pIC50 vs. experimental pIC50]

Figure 4.12 Cross-validation predictions vs. actual values of neuraminidase inhibition pIC50.

The regions of both positive and negative response for the local molecular polarizability field are situated very near the central ring on the same side of the ring, suggesting a somewhat complex positive response to polarizable R1 moieties.


Figure 4.13 Local molecular polarizability field for the aligned neuraminidase inhibitor set.

4.3.5 Mutagenic Tertiary Amides

A set of 49 N-acyloxy-N-alkoxyamide structures111,112 that possess a fully sp3-hybridized central amide nitrogen and accompanying mutagenicity values (log[mutagenicity] at a concentration of 1 μmol/plate in Salmonella typhimurium TA100) are presented in Table 4.8. These compounds have been shown to react with N7 of guanine in the major groove of DNA through an SN2 mechanism involving the displacement of the N-acyloxy group. Somewhat counter-intuitively, however, the less reactive the compound, the more mutagenic it is113. The 2D structures were converted to 3D MDL SD files using CORINA and subsequently geometry-optimized with AM1. The structures were aligned using SYBYL 7.0 and the set of Parasurf local properties calculated for a grid (4Å border, 2Å spacing) surrounding the structures. The PLS regression model using all local properties yields a q2 value of 0.694 with four vector components, an overall SEP of 0.266, and a model r2 of 0.932 (Table 4.9). The results of the local property regressions are presented in Table 4.9 below and the cross-validated predictions of the model are presented in Figure 4.14.


Table 4.8 The mutagenic tertiary amides data set.

[Structure diagram: N-acyloxy-N-alkoxyamide scaffold with substituents R1, R2, and R3]

Structure  R1               R2             R3
1          Me               φ              p-Br-φ-CH2-
2          2,6-diMe-φ-      φ              Bu
3          3,5-diMe-φ-      φ              Bu
4          Me               3,5-diMe-φ-    Bu
5          Me               Me             Bu
6          Me               p-Br-φ-        Bu
7          Me               φ              Bu
8          Me               p-Cl-φ-        Bu
9          Me               p-Me-φ-        Bu
10         Me               m-NO2-φ-       Bu
11         Me               p-NO2-φ-       Bu
12         Me               p-φ-φ-         Bu
13         Me               p-t-Bu-φ-      Bu
14         adamantanyl      φ              Bu
15         Bu               φ              Bu
16         φ                adamantanyl    Bu
17         φ                φ              Bu
18         φ                i-Pr           Bu
19         φ                t-Bu-CH2-      Bu
20         φ                Et             Bu
21         i-Pr             φ              Bu
22         t-Bu-CH2-        φ              Bu
23         t-Bu             φ              Bu
24         Me               Me             φ-CH2-
25         Me               φ              φ-CH2-
26         φ                φ              φ-CH2-
27         φ                p-t-Bu-φ-      φ-CH2-
28         p-benzaldehyde   φ              φ-CH2-
29         p-Cl-φ-          φ              φ-CH2-
30         p-cyano-φ-       φ              φ-CH2-
31         p-Me-φ-          φ              φ-CH2-
32         p-MeO-φ-         φ              φ-CH2-
33         m-MeO-φ-         φ              φ-CH2-
34         m-NO2-φ-         φ              φ-CH2-
35         p-NO2-φ-         φ              φ-CH2-
36         p-φ-φ-           φ              φ-CH2-
37         p-t-Bu-φ-        φ              φ-CH2-
38         Me               φ              p-Cl-φ-CH2-
39         Me               φ              Et
40         Me               φ              i-Pr
41         Me               φ              p-Me-φ-CH2-
42         Me               φ              p-MeO-φ-CH2-
43         Me               φ              n-octanol
44         Me               φ              p-φ-φ-CH2-
45         Me               φ              p-φ-O-φ-CH2-
46         Me               φ              Pr
47         Me               φ              p-t-Bu-φ-CH2-
48         Me               p-t-Bu-φ-      p-t-Bu-φ-CH2-
49         φ                φ              p-t-Bu-φ-CH2-

Table 4.9 Partial least squares regression results for the mutagenic tertiary amides data set.

δe EAL χL ηL IEL V αL ALL

r2 0.790466 0.846869 0.726179 0.708745 0.717700 0.365975 0.531608 0.932495

q2 0.560957 0.621656 0.636329 0.615513 0.627104 0.322679 0.287702 0.693979

SEP 0.304009 0.291694 0.28339 0.289702 0.286228 0.350909 0.395195 0.266125

Components 3 4 1 1 1 1 1 4

Here again the standard CoMFA steric and electrostatic parameters would be expected to produce a less predictive model, since local electron density and local MEP give lower q2 values and larger SEP values than most of the other local properties. The q2 for local polarizability is also very low with this data set. In one of the original papers describing these compounds112, it is indicated that steric factors affect mutagenicity in two respects. The first of these concerns the ability of the amides to enter the major groove of DNA in such a way that a stable transition state geometry can be achieved and the second involves an observed decrease in SN2 reactivity with increased bulk near the amide nitrogen. In our analysis, the major responses to activity were found in the local properties EAL, χL, ηL, and IEL, with the local electronegativity field indicating a relationship between mutagenicity and the electron-withdrawing character of para-substituents on the alkoxy phenyl ring moieties, as seen in Figure 4.15, where the large red sphere is situated above the para-position (with respect to the central nitrogen) of the benzoxy ring(s).

[Plot: calculated vs. experimental values]

Figure 4.14 Cross-validation predictions vs. actual values for the mutagenic amides data set.

Figure 4.15 Local electronegativity field for the aligned mutagenic amides set.


The local electron affinity field shown in Figure 4.16, which also contributes strongly to predictivity, indicates a strong positive relationship with activity on the “puckered” side of the central nitrogen and a strong negative relationship with activity on the opposite side.

Figure 4.16 Local electron affinity field for the aligned mutagenic amides set.

4.3.6 The Effect of Grid Orientation on Predictivity

A relationship has recently been observed109 between the relative orientation of the grid surrounding the aligned structures and the predictive q2. An effect of the grid spacing on the predictivity of CoMFA models has also been reported114. It is now well documented that CoMFA analyses give a range of q2 values as the grid spacing changes or the grid is re-oriented around the aligned structures115. Tropsha, et al. report that q2 may vary by as much as 0.5 q2 units between grid re-orientations, which is tantamount to the difference between a predictive model and a non-predictive model. Wang, et al.109 have presented a grid orientation routine that samples all possible translations and rotations to return the best possible CoMFA q2 value. It would seem that this approach defeats the purpose of using a validation scheme to estimate the predictivity of the model, since it takes advantage of a defect in the method to give the best possible statistic. Böhm, et al.114 have attributed the dependence of q2 on grid orientation to the steepness of the calculated steric and electrostatic potentials at the lattice points with a typical 2 Å spacing and to the use of arbitrary cutoff values. If the grid spacing and orientation lead to discontinuous values that bias the partial least squares model toward lower estimated predictivity (low q2 values), then might not these same discontinuities also lead to an overestimation of predictivity? A more reasonable use of an all-orientation method might take the median q2 value from the distribution of values, in a manner similar to the method used by Kroemer, et al.116 in their examination of the effect of cross-validation techniques on predictivity.

In an effort to offset the effect of grid orientation, the use of grids with 1 Å spacing has been reported108,116, but has not generally been adopted, possibly due to the significant additional computational expense. For instance, a 1 Å-spaced grid with the same dimensions as a 2 Å-spaced grid has roughly eight times as many points, resulting in about eight times the number of independent variables in the PLS analysis, which can easily be ~10,000 in number with the use of several property descriptors. Additionally, the use of the smaller grid spacing may not result in an improvement in q2. Indeed, neither the analyses using the local properties presented here, nor those in Kroemer’s investigation116, exhibit an improvement in q2 with the use of the smaller grid spacing.

To investigate the response of q2 to rotation of the grid for the Parasurf-generated local properties, two data sets were each placed in a set of grids rotated in steps of 15° from −90° to +90° about a common Z-axis. Partial least squares analyses were performed for each local property, and for the combined set of properties, at each grid rotation.
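
The rotation protocol described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used with Parasurf or the CoMFA software: the function names, the 20 Å cubic box, and the decision to rotate the lattice points rather than the molecules are all hypothetical choices made for the example.

```python
import math

def rotate_about_z(points, angle_deg):
    """Rotate (x, y, z) points about the Z-axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]

def make_grid(extent, spacing):
    """Cubic lattice spanning [-extent, +extent] along each axis (in Å)."""
    n = int(round(2 * extent / spacing)) + 1
    axis = [-extent + i * spacing for i in range(n)]
    return [(x, y, z) for x in axis for y in axis for z in axis]

# Rotation angles from the text: -90 to +90 degrees in 15 degree steps.
angles = list(range(-90, 91, 15))

# Hypothetical 20 Å box, sampled at the two spacings discussed in the text.
coarse = make_grid(extent=10.0, spacing=2.0)
fine = make_grid(extent=10.0, spacing=1.0)

# One rotated copy of the 2 Å lattice per angle; a PLS analysis would be
# run on the field values sampled at each rotated lattice.
rotated_grids = {angle: rotate_about_z(coarse, angle) for angle in angles}

print(len(angles), len(coarse), len(fine))
```

For this particular box the 1 Å lattice has 9261 points against 1331 for the 2 Å lattice, a factor of about 7 that approaches 8 as the box grows.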


Table 4.10 Isoquinoline influenza neuraminidase inhibitors data set.

[Structure diagram of the parent compound with substituent X]

Structure  X            log 1/C
1          4-NO2        2.90
2          4-Br         2.77
3          4-CN         2.84
4          4-Cl         2.81
5          4-F          2.63
6          H            2.58
7          4-CH3        2.68
8          4-OCH3       2.62
9          4-OH         2.24
10         4-OC2H5      2.65
11         4-OC3H7      2.79
12         4-OC4H9      2.78
13         4-C(CH3)3    3.15
14         3-CH3        2.78
15         3-F          2.67
16         3-Cl         2.82

A set of 16 aligned influenza neuraminidase inhibitors117 (Table 4.10) exhibited a very poor correlation to activity (log 1/C), with an average q2 value of −0.06593 for the combined set of local property descriptors. The values of q2 for all of the individual local property PLS analyses were observed to vary greatly among the rotated grids. The electron density models, which gave the best scores overall, yielded a maximum q2 value of 0.682 and a minimum value of −0.012, a range of 0.694 q2 units. The other local properties varied similarly, as shown in Figure 4.17. Local ionization potential (IEL) and local electronegativity (ENEG) periodically exhibit some measure of correlation to activity, while local polarizability (POL) consistently contributes nothing to the predictivity of the model.

The previously described 5-HT1A data set, which had exhibited good q2 statistics, was also evaluated for grid orientation dependency. The results of the PLS analyses are presented in Figure 4.18. For five of the local properties (electronegativity, hardness (HARD), ionization potential, polarizability, and electron affinity (EAL)), grid rotation has a very small effect on the predictivity as measured by q2, while electron density (DENS) and electrostatic potential (MEP) exhibit a large variance in q2. This would seem to suggest that when there is a significant contribution from these five local properties, there is much less dependence on the orientation of the grid. For the steric and electrostatic parameters, however, the story is much the same as before: the predictive quality of models that include them is subject to a rather severe dependence on the orientation of the grid.
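
For readers less familiar with the statistic, the sketch below shows how a cross-validated q2 and its spread over grid orientations can be computed. It assumes the usual definition q2 = 1 − PRESS/SS, with SS taken about the mean of the observed activities; the exact convention of the PLS software used in this work may differ, and the function names are hypothetical. The only values taken from the text are the maximum and minimum q2 quoted for the electron density models.

```python
def q_squared(observed, cv_predicted):
    """Cross-validated q2 = 1 - PRESS / SS (assumed convention)."""
    mean_obs = sum(observed) / len(observed)
    press = sum((y - p) ** 2 for y, p in zip(observed, cv_predicted))
    ss = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - press / ss

def q2_spread(q2_values):
    """Minimum, maximum, range and median of q2 over grid orientations."""
    values = sorted(q2_values)
    n = len(values)
    mid = n // 2
    median = values[mid] if n % 2 else 0.5 * (values[mid - 1] + values[mid])
    return {"min": values[0], "max": values[-1],
            "range": values[-1] - values[0], "median": median}

# Predicting the mean for every compound gives q2 = 0; predictions that are
# worse than the mean give a negative q2.
print(q_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
print(q_squared([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))

# Extremes quoted in the text for the electron density models.
print(q2_spread([0.682, -0.012]))
```

A negative q2, such as the average of −0.06593 reported for the combined descriptors, therefore means that the cross-validated predictions are worse than simply predicting the mean activity.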

[Plot: q2 (vertical axis, −0.4 to 0.8) against grid rotation in degrees (−90° to +90° in 15° steps) for DENS, ENEG, HARD, IEL, MEP, POL, and EAL.]

Figure 4.17 Response of q2 to grid rotation for each local property using the isoquinoline data set.


[Plot: q2 (vertical axis, 0 to 1) against grid rotation in degrees (−90° to +90° in 15° steps) for DENS, ENEG, HARD, IEL, MEP, POL, and EAL.]

Figure 4.18 Response of q2 to grid rotation for each local property using the 5-HT1A data set.

The q2 for the PLS analysis using the set of combined local properties also exhibits much less rotation dependence for the 5-HT1A data set than for the neuraminidase inhibitor data set (Figure 4.19).

[Plot: q2 (vertical axis, 0.5 to 1) against grid rotation in degrees (−90° to +90° in 15° steps).]

Figure 4.19 Response of q2 to grid rotation for combined local properties using the 5-HT1A data set.


The predictive capacity (q2) of the PLS regression models using the combined set of local properties seems to suffer somewhat in comparison with the q2 values of the individual local-property PLS regressions, presumably due to the addition of “noise” from poorly-predicting local properties. This is not observed, however, in the case of the set of mutagenic tertiary amides (Section 4.3.5), where the modestly-predictive local properties appear to have an additive effect on the overall predictive capacity, even with the inclusion of the poorly-predicting properties.

4.4 Conclusions

The newly-introduced local properties EAL, χL, ηL, IEL, V, and αL, used previously only in a molecular surface context, have been applied to several comparative molecular field analyses and have proved to be substantially as predictive as the standard CoMFA steric and electrostatic parameters. Indeed, the MEP, analogous to the electrostatic parameter in standard CoMFA, has been found to be the least predictive of the local properties in some of the cases presented here. Although it is a well-established property used in describing Coulombic interactions (which are, in turn, the strongest contributors to intermolecular interaction energies in the gas phase), the MEP is strongly attenuated in polar solvent, such that interactions arising from the other local properties may predominate in the overall intermolecular interaction. In the case where a ligand molecule is “solvated” by inclusion in an enzyme binding pocket, additional local properties offer a rationale for the orientation of binding that goes beyond charge cancellation.

In addition to the advantage gained by having a complement of local electronic properties from which to establish structure-activity relationships, the additional local properties (other than the electron density and the MEP) seem to be exceptionally robust with respect to grid orientation, such that all-orientation grid placement schemes may not be required to extract the best possible PLS q2 value.

The local property 3D-QSAR method presented here is similar to the CoMSIA method103 in that several properties contribute to the overall predictive response of the resulting model. The local properties could readily be adapted to a standard CoMSIA methodology by the application of similarity indices to the local property values at the grid points. The local properties can, in principle, also be used within the Hypothetical Active Site Lattice106,118 technique (HASL), whereby a 3D lattice of points internal to the van der Waals radii of the aligned set of molecules is iteratively optimized in terms of partial (biological, etc.) activities at each lattice point, generating a composite pharmacophore. Application within any of these methods should prove to be an apt use of local electronic properties to describe molecular/macromolecular interactions at the point of close contact. As such, salient electronic features of the binding regions of drug targets may be better described by this ensemble of electronic terms.


Chapter 5

Conclusions and Outlook

5.1 Conclusions

The computational methods described here give drug researchers a means of applying quantum-mechanically-derived local electronic properties to in silico high-throughput screening schemes in such a way as not only to predict and classify by various physical properties and biological activities, but also to describe in chemical terms the nature of the observed activity as a function of surface properties. These properties may also be visualized by mapping them onto an isodensity surface, making the identification of important functional moieties readily accessible to everyone involved in the drug design pathway. Large sets of 2D structures or SMILES strings may be evaluated for potential problems, such as phospholipidosis-inducing potential, in terms that are useful to the medicinal chemist. In this way, rapid screening of compounds can be achieved for any property or activity that can be explained by local properties. The same pool of data can then be used to identify the regions at or near the molecular surface that affect activity the most. The major drawback in the use of local properties lies in their application to the surface-integral models, where the interpretation of the model is less than straightforward. However, when applied to a 3D-QSAR scheme, the individual contributions of the local properties become evident and may be easily interpreted by identifying the regions that contribute most to function.


Figure 5.1 The contact MEP surface for 5-acetamido-1,3,4-thiadiazole-2-sulfonamide bound to human carbonic anhydrase II (from the RCSB Protein Data Bank: ID=1YDA)

5.2 Outlook

Given the interest in describing the interactions between a drug and its target in as much detail as is feasible, the addition of new electronic terms to the traditional combination of steric and electrostatic parameters in the calculation of binding constants and free energies, as in automated docking routines, may provide better model predictivity as well as new insights into the drug-binding process itself. Borrowing from the idea of building up a free energy of binding model from the contributions of molecular fragments119, and from the same sort of fragment-based approach to identifying the portions of the binding region most important to protein-ligand binding, it may be possible to construct a binding energy model based on the contributions of close-contact surface properties. In essence, the idea is to treat the free energy of binding (ΔGbind) as a sort of surface-integral-based solvation free energy in which the contact free energy term (ΔGcontact) is the energy associated with the “solvation” of the ligand by the substrate (see Equation 5.1).


$$\Delta G_{bind} = \Delta G_{contact} + \Delta G_{conf}^{lig} + \Delta G_{conf}^{substr} + \Delta G_{solv} \qquad (5.1)$$

Only the portion of the ligand surface that is in close proximity to the surface of the enzyme binding pocket is taken to represent the binding surface. The model is constructed by taking the bound ligand/binding-pocket atoms from crystal structures (along with associated binding data such as experimentally determined Kd or Ki values), adding hydrogens, and optimizing the geometries of these hydrogens while keeping the heavy atoms fixed. Close-contact regions are identified by means of sums of van der Waals radii or by overlap of isodensity surfaces and recorded. The substrate atoms are then removed, leaving only the ligand in a geometry that is very near that of the bound structure. The local properties are then calculated for the molecule and, taking only the close-contact portion of the surface, a regression model of binding energy is constructed from either 1) statistical descriptors of the close-contact surface, or 2) a surface-integral treatment of the close-contact surface. One approximation made in this procedure is the neglect of the changes in conformation of both the ligand and the substrate upon binding, but several authors120-122 note that the bound conformations of ligands are inevitably low-energy geometries, so that these changes may not contribute greatly to the overall free energy of binding as defined here. The conformational free energy of the substrate cannot be evaluated by this method and is assumed to be small and relatively constant among proteins. The free energy of solvation term and the contact free energy term in Equation 5.1 are inversely related: as the ligand enters the binding pocket, the same portion of its molecular surface becomes solvated by the binding pocket and desolvated with respect to bulk solution. Since the solvation free energy model described in Chapter 2, Section 2.3.2, is calculated by a surface-integral approach as well, it is a function of the solvent-exposed surface.
The prediction of the free energy of protein/ligand binding by local surface properties then becomes a matter of including a small portion of the binding pocket of the protein in the initial handling of the ligand. Further, it may be possible to predict the maximum possible free energy of binding123 for a given ligand without a crystal structure, using the previously described model, and to estimate the amount of contact surface area and the actual contact surface regions from a given binding constant.
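
The close-contact step of this procedure can be illustrated with a minimal sketch. It covers only the sum-of-van-der-Waals-radii criterion, not the isodensity-overlap alternative, and several things in it are assumptions rather than part of the proposed method: the Bondi radii, the 0.5 Å tolerance, the atom tuple layout, and the function names are hypothetical. The conversion of a dissociation constant to a free energy uses the standard relation ΔG = RT ln Kd with Kd expressed relative to 1 M.

```python
import math

# Bondi van der Waals radii in Å (an assumed choice; the text names no set).
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)

def close_contact_atoms(ligand, pocket, tolerance=0.5):
    """Indices of ligand atoms lying within the sum of van der Waals radii
    (plus an assumed tolerance in Å) of any binding-pocket atom.
    Each atom is a tuple (element, x, y, z)."""
    contacts = []
    for i, (element_l, *xyz_l) in enumerate(ligand):
        for element_p, *xyz_p in pocket:
            cutoff = VDW_RADII[element_l] + VDW_RADII[element_p] + tolerance
            if math.dist(xyz_l, xyz_p) <= cutoff:
                contacts.append(i)
                break  # one contact is enough to flag this ligand atom
    return contacts

def binding_free_energy(kd_molar, temperature=298.15):
    """Free energy of binding in kcal/mol from a dissociation constant."""
    return R_KCAL * temperature * math.log(kd_molar)

# Hypothetical coordinates: one ligand oxygen near a pocket nitrogen,
# one ligand carbon far from it.
ligand = [("O", 0.0, 0.0, 0.0), ("C", 10.0, 0.0, 0.0)]
pocket = [("N", 3.0, 0.0, 0.0)]
print(close_contact_atoms(ligand, pocket))
print(round(binding_free_energy(1e-9), 2))
```

Only the surface elements belonging to the flagged atoms would then enter the statistical descriptors or the surface integral used in the regression.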


Chapter 6

Summary

Of great interest to the pharmaceutical industry is the elucidation of a set of chemical/physical properties modulating the relationship between chemical structure and pharmacological activity that could be used to predict activity based solely on chemical structure. It is then necessary only to discover the particular set of molecular descriptors that adequately describes the activity to be predicted. Since the point of contact for all drugs lies inevitably with the molecular surface of both the drug and the drug target, a descriptive model of the molecular surface is needed. The nature of this surface is electronic, and quantum mechanical methods are those that describe the electronic structure of the molecule. Local properties defined at points on the molecular surface, such as the molecular electrostatic potential (MEP), have been used to describe strong non-covalent interactions that are based primarily on charge. Recently, additional local properties have been described which complement the MEP and provide a more complete description of the local electronic environment at the molecular surface. This work describes the implementation of six local electronic properties calculated using Parasurf (electron affinity EAL, electronegativity χL, hardness ηL, ionization potential IEL, molecular electrostatic potential MEP, and polarizability αL) in three principal methods of quantitative structure-activity (QSAR) and structure-property (QSPR) prediction for use in virtual high-throughput screening.


The first of these methods involves the construction of surface-integral models, which relate physical properties to the sum of the individual contributions of the local surface properties, as determined by the statistical technique of multiple regression. Similar regression models for activity have also been constructed from statistical measures of the local properties, such as maxima, minima, ranges, etc. The predicted properties may then be mapped onto the molecular surface as local properties, themselves, to expose surface regions that relate to the observed activity. So, this method provides not only a predicted property value, but allows for the visualization of the property surface.
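
As a rough illustration of the two kinds of descriptor mentioned above, the sketch below computes area-weighted surface sums of a local property and simple statistical measures of it, and then evaluates a linear model in those terms. The area weighting, the function names, the toy values, and the coefficients are all assumptions made for the example; they are not taken from the models reported in this work, whose terms are listed in Table A1.

```python
def surface_integral(values, areas, power=1):
    """Approximate the integral of f(r)**power over the surface as a sum
    over surface elements: sum_i f_i**power * A_i (assumed form)."""
    return sum((v ** power) * a for v, a in zip(values, areas))

def statistical_descriptors(values):
    """Simple statistical measures of a local property on the surface."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return {"max": max(values), "min": min(values),
            "range": max(values) - min(values),
            "mean": mean, "variance": variance}

def predict_property(coefficients, intercept, terms):
    """Evaluate a multiple-regression model: intercept + sum_k c_k * term_k."""
    return intercept + sum(c * t for c, t in zip(coefficients, terms))

# Hypothetical local-property values on three surface elements and their areas.
mep = [1.0, -2.0, 3.0]
areas = [0.5, 0.5, 1.0]

terms = [surface_integral(mep, areas, power=1),
         surface_integral(mep, areas, power=2)]
print(terms)
print(statistical_descriptors(mep)["range"])
print(predict_property([2.0, 0.1], 1.0, terms))  # made-up coefficients
```

In a real model the coefficients would come from a multiple regression of the experimental property against such terms for a training set of molecules.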

Representation of a local property surface (MEP) used to construct surface-integral models.

Seven such models have been constructed for the prediction of 1) the n-octanol/water partition coefficient logP, 2) the free energy of solvation in water ΔGsolv.(H2O), 3) the free energy of solvation in n-octanol ΔGsolv.(oct.), 4) the acid dissociation constant pKa, 5) boiling point Tb, 6) the glass transition temperature of organic LED materials Tg, and 7) water solubility logS.

The second statistical method employing the local properties uses support vector machines to classify drug compounds as inducers of phospholipidosis. Drug-induced phospholipidosis is an undesirable side effect of, primarily, cationic amphiphilic drugs that causes lysosomal bodies, which contain large deposits of undegraded phospholipids, to aggregate inside the cells of the lungs, liver, kidneys, corneas, and brain. Their presence often coincides with adverse clinical effects such as inflammatory reactions and fibrosis. The support vector machines use the statistical measures of the local properties as descriptors for the classification of a test set of compounds, based on the model constructed from a training set with the same number of compounds.

Representative local property field (EAL) using CoMFA methodology.

The third statistical method to which the local properties were applied was the 3D-QSAR method of Comparative Molecular Field Analysis (CoMFA), which involves aligning a set of molecules and determining, by a partial least squares analysis, the relationship between biological activity and the steric and electrostatic potentials at each of a set of grid points surrounding the structures. It has been argued that steric and electrostatic fields alone do not present an adequate representation of drug-receptor interactions, so additional physicochemical properties are required. In our formulation, the standard CoMFA electrostatic parameter corresponds to the MEP and the steric parameter corresponds to the electron density, with the additional local properties augmenting these basic parameters. Five structure-activity data sets were examined and our method produced q2 values comparable to, if not better than, the reported standard CoMFA values. In addition, it was noted that the individual local properties consistently gave better q2 values than either the MEP or electron density parameters and proved to be significantly less sensitive to grid orientation. This method effectively expands the descriptive vocabulary of 3D-QSAR and is better able to reveal important intermolecular interactions not elucidated by shape and charge fields alone.

The computational methods described here give drug researchers a means of applying quantum-mechanically-derived local electronic properties to in silico high-throughput screening schemes in such a way as not only to predict and classify by various physical properties and biological activities, but also to describe in chemical terms the nature of the observed activity as a function of surface properties. These properties may also be visualized by mapping them onto an isodensity surface, making the identification of important functional moieties readily accessible to everyone involved in the drug design pathway.


Chapter 7

Zusammenfassung (Summary)

The elucidation of physicochemical properties that directly relate chemical structure to pharmacological activity, and thus allow activity to be estimated solely from the structural information of a substance, is of essential importance to the pharmaceutical industry. If this succeeds, only a suitable set of molecular descriptors is then needed to describe adequately the activity to be predicted. The interface between a drug and the active site of its target molecule is inevitably determined by the molecular surfaces of both, which is why a model for describing these surfaces is needed. Since these surfaces are electronic in nature, quantum chemical methods are used to describe them. Until now, local properties on the molecular surface, such as the molecular electrostatic potential (MEP), have been used to describe strong, charge-based, non-covalent interactions. Recently, additional local properties have been integrated into this approach; they complement the information provided by the MEP and thus lead to a more complete description of the local electronic environment at the molecular surface. This work describes the integration of six different local electronic properties, namely electron affinity EAL, electronegativity χL, hardness ηL, ionization potential IEL, molecular electrostatic potential MEP, and polarizability αL, into three principal methods of quantitative structure-activity (QSAR) and structure-property (QSPR) prediction for use in high-throughput screening (HTS) applications.

Representation of an MEP surface for the construction of surface-integral models.

The first of these methods involves the construction of surface-integral models. These relate physical properties to the sum of the individual contributions of the local properties on the surface, as determined by multiple regression. Comparable regression models for determining activity were also constructed using statistical quantities such as maxima, minima, and ranges. The predicted properties can then be mapped onto the molecular surface as local properties in order to reveal the regions associated with the observed activity. This method therefore not only predicts property values but also lends itself to the visualization of a property surface. In total, seven such models were constructed for the prediction of the following properties: 1) the n-octanol/water partition coefficient (logP), 2) the free energy of solvation in water ΔGsolv.(H2O), 3) the free energy of solvation in n-octanol ΔGsolv.(oct.), 4) the acid dissociation constant pKa, 5) the boiling point Tb, 6) the glass transition temperature of organic LED materials Tg, and 7) the water solubility logS.

The second statistical method involving the local properties uses support vector machines to classify drug compounds as causes of phospholipidosis. Drug-induced phospholipidosis is an undesirable side effect of mostly cationic amphiphilic drugs that leads to the aggregation of lysosomes containing high concentrations of undegraded phospholipids, which in the lungs, liver, kidneys, cornea, and brain frequently leads to harmful effects such as inflammation and fibrosis. The support vector machines use the statistical measures of the local properties as descriptors for classifying a selection of compounds after a preceding training procedure.

Representation of the spatial distribution of a local property (EAL) generated using CoMFA methodology.

The third statistical method into which the local properties were integrated is the 3D-QSAR method of Comparative Molecular Field Analysis (CoMFA). Starting from a group of mutually aligned molecules, the CoMFA method uses partial least squares analysis on a set of grid points to establish a relationship between biological activity and the steric and electrostatic potentials. It has been suggested that fields based on sterics and electrostatics alone do not adequately represent the drug-receptor interaction, so that additional chemical and physical properties must be taken into account. In the approach developed here, the MEP takes the place of the electrostatic CoMFA parameter and the electron density that of the steric parameter, with the additional local properties supplementing these basic parameters. Five structure-activity data sets were examined with this method and gave q2 values comparable to, or even better than, the literature values obtained with standard CoMFA. In addition, the individual local properties consistently gave better q2 values than those calculated from the MEP or electron density parameters alone, and they proved less sensitive to grid orientation. The method presented in this work extends the capabilities of 3D-QSAR and is far better able to reveal important information about intermolecular interactions that could not previously be explained by shape fields and charge distributions alone.

The computational methods described here make it possible to apply quantum-mechanically generated local electronic properties in in silico HTS, so that not only can physical properties and biological activities be predicted and classified, but a chemical description of the nature of the observed property can also be obtained as a function of surface information. These properties can then be mapped graphically onto a molecular surface and displayed, which makes the identification of important functional sites readily accessible to everyone involved in the drug development process.


Appendix A

Data Sets

Table A1 Nonlinear regression terms used in calculating surface-integral models. Each local-property product $f(r)$ listed below gives six consecutively numbered terms: $[f(r)]^{1/2}$, $f(r)$, $[f(r)]^{3/2}$, $[f(r)]^{2}$, $[f(r)]^{5/2}$, and $[f(r)]^{3}$.

Numbers   f(r)
1-6       $V(r)$
7-12      $IE_L(r)$
13-18     $EA_L(r)$
19-24     $\alpha_L(r)$
25-30     $\eta_L(r)$
31-36     $V(r) \cdot IE_L(r)$
37-42     $V(r) \cdot EA_L(r)$
43-48     $V(r) \cdot \alpha_L(r)$
49-54     $V(r) \cdot \eta_L(r)$
55-60     $IE_L(r) \cdot EA_L(r)$
61-66     $IE_L(r) \cdot \alpha_L(r)$
67-72     $IE_L(r) \cdot \eta_L(r)$
73-78     $EA_L(r) \cdot \alpha_L(r)$
79-84     $EA_L(r) \cdot \eta_L(r)$
85-90     $\alpha_L(r) \cdot \eta_L(r)$
91-96     $V(r) \cdot IE_L(r) \cdot EA_L(r)$
97-102    $V(r) \cdot IE_L(r) \cdot \alpha_L(r)$
103-108   $V(r) \cdot IE_L(r) \cdot \eta_L(r)$
109-114   $V(r) \cdot EA_L(r) \cdot \alpha_L(r)$
115-120   $V(r) \cdot EA_L(r) \cdot \eta_L(r)$
121-126   $IE_L(r) \cdot EA_L(r) \cdot \alpha_L(r)$
127-132   $IE_L(r) \cdot EA_L(r) \cdot \eta_L(r)$
133-138   $IE_L(r) \cdot \alpha_L(r) \cdot \eta_L(r)$
139-144   $EA_L(r) \cdot \alpha_L(r) \cdot \eta_L(r)$
145-150   $V(r) \cdot EA_L(r) \cdot \alpha_L(r) \cdot \eta_L(r)$

Table A2 The logP data set. No. Compound Exp. Calc.

1 glutamine -3.64 -3.16 2 citric acid -1.72 -0.59 3 phenylalanine -1.52 -1.79 4 tryptophan -1.06 -1.12 5 1,3-propanediol -1.04 -1.02 6 maleic acid-hydrazide -0.84 -0.43 7 N-formylcyclobutane carboxamide -0.70 0.19 8 allopurinol -0.55 1.23 9 2,2-dimethylpropionic acid-hydrazide -0.35 0.41 10 3-fluoropropanol -0.28 -0.14 11 thiamphenicol -0.27 1.19 12 2',3'-didesoxyadenosine -0.22 1.17 13 3-mesylphenyl urea -0.12 -0.95 14 imidazole -0.08 0.26 15 caffeine -0.07 0.91 16 o-methyl THPO -0.04 0.82 17 5,6-dihydro-2-methyl-1,4-oxathiin-3-carboxylic acid 0.04 0.32 18 mercaptoacetic acid 0.09 0.62 19 6-methylthioinosine 0.09 1.85 20 merbarone 0.14 2.64 21 atenolol 0.16 1.68 22 o-methylbenzoyl hydrazine 0.22 1.01 23 pentoxifylline 0.29 1.49 24 nikethamide 0.33 0.10 25 p-hydroxybenzamide 0.33 2.02 26 2,2-dichloroethanol 0.37 1.27 27 antipyrine 0.38 1.87 28 1-acetyl-N-(4-fluorophenyl)hydrazine carboxamide 0.42 -0.11 29 sulpiride 0.42 0.68 30 piperazine-2-carboxanilide 0.48 0.76 31 fluconazole 0.50 3.06 32 acetaminophen 0.51 0.20 33 2-amino-5-methoxy benzimidazole 0.57 0.99 34 sotalol 0.59 0.98 35 glutaric acid dimethyl ester 0.62 0.07 36 3-(5-nitro-2-furanyl)-2-propenoic amide 0.65 0.84 37 0.70 -0.23 38 1-acethyl-6-dimethyl-7-methoxymitosene 0.72 0.86 39 2-azacycloheptanthione 0.75 2.11 40 N-(2-benzoyl-oxyacetyl)-2-carboxyazetidine 0.79 0.37 41 chloropentazide 0.84 1.89 42 4-pyridinebutylamine 0.86 2.24 43 procainamide 0.88 1.52 44 tiapride 0.90 0.68 45 4-methylthiazole 0.97 1.40 46 6-cyanoquinoxaline 1.01 2.46 47 syringic acid 1.04 -0.02

119

48 1-phenyl-3-cyanoguanidine 1.05 2.06 49 m-acetylaminoacetophenone 1.10 0.98 50 acetylsalicylic acid 1.19 0.63 51 benzaldehydesemicarbazone 1.27 0.71 52 4-oxo-4-phenylbutanoic acid 1.30 1.00 53 2-phenylethanol 1.36 1.62 54 carocainide 1.38 1.43 55 3-bromobenzenesulfonamide 1.39 1.30 56 bromochloromethane 1.41 1.10 57 trimethylacetic acid 1.47 0.81 58 o-fluorophenylacetic acid 1.50 1.19 59 2-(2,6-dichloro-4-hydroxyphenylimino)imidazolidine 1.52 1.72 60 N-phenyl-4-aminophenylsufonamide 1.55 0.54 61 hydrocortisone 1.55 1.21 62 tryptamine 1.55 1.57 63 acetophenone 1.58 1.39 64 p-(N,N-dimethylcarbamate)-N,N-dimethylcarbamate, benzyl ester 1.59 1.48 65 prednisolone 1.60 1.82 66 1-dodecansulfonic acid 1.60 2.46 67 2-methylquinoxaline 1.61 2.72 68 3,5-dimethoxyphenol 1.64 0.74 69 bromazepam 1.65 2.89 70 indole-3-ethanolcarbamate 1.69 1.44 71 3-indolylpropionic acid 1.75 1.84 72 pindolol 1.75 2.09 73 propylene 1.77 1.54 74 2-oxoisopropyl-5-phenyl-5'-ethylbarbituric acid 1.79 3.11 75 4-dimethylamino-thieno(2,3-D)pyrimidine 1.82 2.09 76 dexamethasone 1.83 1.28 77 2-acetyl-oxyethyl benzoate 1.85 1.40 78 4-chloroaniline 1.88 1.53 79 N-methyl-2,3-dimethylphenyl carbamate 1.95 1.72 80 o-methylphenoxyacetic acid 1.98 0.95 81 acetic acid-m-methoxybenzoate 2.02 1.42 82 quinoline 2.03 2.69 83 1,1'-dioxo-3-cyclohexen-3-yl-1,2,4-benzothiadiazine 2.05 1.67 84 3,4-dimethylacetanilide 2.10 1.71 85 indole 2.14 1.59 86 mexilitene 2.15 2.32 87 griseofulvin 2.18 2.37 88 carbamazepine 2.19 2.08 89 hydrocortisone acetate 2.19 2.88 90 o-methylbenzaldehyde 2.26 1.48 91 4-bromoaniline 2.26 1.52 92 2,6-mimethoxypyridine 2.30 1.35 93 thiophene-2-carboxylic acid, ethyl ester 2.33 1.50 94 21-desoxybetamethasone 2.35 2.11 95 thiosalicylic acid 2.39 1.98 96 butyl gallate 2.41 1.14 97 1-pyrrol-2-yl-pentanone 2.42 1.55 98 4-phenylbutyric acid 2.42 2.04 99 5,5'-diphenylhydantoin 2.47 2.49 100 8-trifluoromethylquinoline 2.50 1.28 101 3-chlorophenol 2.50 1.93 102 lorazepam 2.51 3.04 103 
2,17-dihydroxy-3-oxolactone-7,21-dicarboxy-pregan-4-ene 2.54 3.65 104 di-isopyramide 2.58 3.65 105 N-benzyl-N-formylaniline 2.62 2.62 106 5,6-diazaphenanthrene 2.71 3.47

120

107 lormetazepam 2.72 3.58
108 1-methyl-1,3-dihydro-5-(2-fluorophenyl)-7-chloro-1,4-benzodiazepin-2-one 2.75 3.57
109 diazepam 2.79 3.73
110 3-butyl-R,S-1-(3H)-isobenzofuranone 2.80 2.57
111 2-anilino-1,4-naphthoquinone 2.84 2.33
112 4-aminobiphenyl 2.86 2.90
113 quinidine 2.88 4.55
114 chlorobenzene 2.89 1.95
115 dihydromorphanthridine 2.90 3.67
116 p-phenoxyaniline 2.93 2.67
117 3-bromoquinoline 3.03 2.96
118 octanoic acid 3.05 2.04
119 deoxycorticosterone acetate 3.08 3.42
120 alprenolol 3.10 3.38
121 indecainide 3.11 4.09
122 N-(3,4-dichlorophenyl)difluoroacetamide 3.18 2.61
123 benzophenone 3.18 2.67
124 p-fluorotoluene 3.20 1.71
125 testosterone 3.29 3.37
126 1-(3,4-dichlorophenyl)-2-isopropylaminoethanol 3.32 3.65
127 3-methoxy-4-cyclohexylmethoxyphenylacetic acid 3.35 3.28
128 naphthalene 3.37 3.29
129 anthraquinone 3.39 2.06
130 1,2-dichlorobenzene 3.43 2.14
131 prometrin 3.51 2.37
132 4,7-dichloroquinoline 3.57 3.80
133 9-(N-((N,N’-diethylamino)acetyl)amino)fluorene 3.64 4.66
134 3,5-dichlorophenol 3.68 2.15
135 indigo 3.72 2.55
136 flecainide 3.78 2.38
137 3,4-dimethylchlorobenzene 3.82 3.30
138 diflubenzuron 3.83 2.14
139 estradiol 4.01 3.55
140 1-(4-cyclohexylphenyl)-3-methoxy-3-methylurea 4.08 3.61
141 1-phenyl-1-benzyl-2-methyl-3-(N,N-dimethylamino)-propanoic acid, propyl ester 4.18 4.48
142 2,6-dimethylnaphthalene 4.31 3.89
143 aminopyrene 4.31 4.15
144 fluphenazine 4.36 3.87
145 1,3-dimethylnaphthalene 4.42 4.08
146 1,3-dithiolan-2-ylidine-propanoic acid, dibutyl ester 4.60 4.32
147 1,2,4,5-tetrachlorobenzene 4.60 5.32
148 propafenone 4.63 3.53
149 bifonazole 4.77 5.82
150 aprindine 4.86 6.00
151 diethylstilbestrol 5.07 4.56
152 fluoranthene 5.16 4.93
153 trifluopromazine 5.19 4.18
154 clotrimazole 5.20 5.25
155 teflubenzuron 5.39 3.79
156 hexaflumuron 5.43 3.77
157 2,4,4'-trichlorobiphenyl 5.62 5.07
158 2,4,5-trichlorobiphenyl 5.90 4.50
159 thioridazine 5.90 5.47
160 phenylanthracene 6.01 6.33
161 flufenoxuron 6.16 5.10
162 1,3,7,8-tetrachlorodibenzodioxin 6.30 6.92
163 chlorfluazuron 6.63 5.09


164 1,2,3,6,7-pentachlorodibenzodioxin 6.74 8.00
165 linoleic acid 7.05 5.90
166 palmitic acid 7.17 5.26
167 3,3',4,4',5,5'-hexachlorobiphenyl 7.41 6.46
168 stearic acid 8.23 6.15

Table A3 The free energy of solvation in n-octanol data set.
No. Compound Exp. Calc.

1 methane 0.51 -0.67
2 ethane -0.64 -1.19
3 propane -1.26 -1.77
4 cyclopropane -1.60 -0.98
5 2-methylpropane -1.45 -2.17
6 2,2-dimethylpropane -1.74 -2.36
7 n-butane -1.86 -2.22
8 cyclopentane -2.65 -3.49
9 n-pentane -2.45 -2.80
10 n-hexane -3.01 -3.32
11 cyclohexane -3.46 -3.16
12 methylcyclohexane -3.21 -3.36
13 n-heptane -3.74 -3.90
14 n-octane -4.18 -4.47
15 ethylene -0.27 -0.74
16 propylene -1.14 -1.42
17 2-methylpropene -2.03 -2.04
18 1-butene -1.89 -2.23
19 1-hexene -2.94 -3.37
20 1,3-butadiene -2.10 -2.16
21 acetylene -0.51 -0.61
22 propyne -1.59 -1.25
23 1-pentyne -2.79 -2.55
24 1-hexyne -3.43 -3.02
25 benzene -3.72 -3.92
26 toluene -4.55 -4.50
27 ethylbenzene -5.08 -5.20
28 m-xylene -5.25 -5.11
29 o-xylene -5.07 -4.98
30 p-xylene -5.19 -5.07
31 naphthalene -6.97 -6.97
32 anthracene -10.47 -10.11
33 1,1-difluoroethane -1.13 -2.39
34 tetrafluoromethane 1.50 0.36
35 fluorobenzene -3.87 -5.15
36 chlorotrifluoromethane -1.97 -0.49
37 dichlorodifluoromethane -1.25 -1.68
38 fluorotrichloromethane -2.63 -2.87
39 1,1,2-trichloro-1,2,2-trifluoroethane -2.54 -2.66
40 1-bromo-1-chloro-2,2,2-trifluoroethane -3.27 -4.01


41 bromotrifluoromethane -0.75 -1.44
42 dichloromethane -3.07 -2.44
43 trichloromethane -3.81 -3.15
44 chloroethane -2.58 -2.20
45 1,1,1-trichloroethane -3.69 -4.05
46 1,1,2-trichloroethane -4.53 -3.95
47 1-chloropropane -3.06 -2.93
48 2-chloropropane -2.84 -1.51
49 cis-1,2-dichloroethylene -3.71 -3.33
50 trans-1,2-dichloroethylene -3.61 -2.75
51 trichloroethylene -3.75 -3.29
52 tetrachloroethylene -4.24 -3.97
53 chlorobenzene -5.00 -5.42
54 1,2-dichlorobenzene -6.01 -6.26
55 1,4-dichlorobenzene -5.67 -6.55
56 2,2'-dichlorobiphenyl -9.41 -8.78
57 2,3-dichlorobiphenyl -9.23 -9.98
58 2,2',3'-trichlorobiphenyl -9.12 -9.81
59 bromomethane -2.43 -2.40
60 dibromomethane -4.18 -4.90
61 tribromomethane -5.62 -5.05
62 bromoethane -2.90 -2.93
63 1-bromopropane -3.42 -3.59
64 2-bromopropane -3.40 -2.94
65 1-bromobutane -4.16 -4.41
66 1-bromopentane -4.68 -5.02
67 3-bromopropene -3.30 -4.04
68 bromobenzene -5.46 -5.59
69 1,4-dibromobenzene -7.47 -7.46
70 p-bromotoluene -6.36 -6.14
71 methanol -3.87 -4.13
72 ethanol -4.36 -4.49
73 ethylene glycol -7.44 -6.90
74 1-propanol -5.02 -5.11
75 2-propanol -4.62 -4.93
76 1,1,1-trifluoro-2-propanol -5.12 -6.15
77 1,1,1,3,3,3-hexafluoro-2-propanol -5.76 -3.60
78 1-butanol -5.71 -5.49
79 t-butanol -4.78 -4.75
80 1-pentanol -6.40 -6.13
81 1-hexanol -7.06 -6.67
82 1-heptanol -7.75 -7.19
83 1-octanol -8.13 -7.75
84 allyl alcohol -5.27 -5.34
85 -8.69 -7.46
86 4-bromophenol -10.59 -9.48
87 2-cresol -8.49 -7.89
88 3-cresol -8.20 -8.05
89 4-cresol -8.84 -8.12
90 2,2,2-trifluoroethanol -4.81 -6.75
91 2-methoxyethanol -5.83 -5.14


92 methyl propyl ether -3.63 -3.41
93 methyl isopropyl ether -4.64 -3.48
94 methyl t-butyl ether -3.49 -3.28
95 diethyl ether -2.89 -3.14
96 THF -3.93 -3.61
97 anisole -5.47 -5.82
98 ethyl phenyl ether -5.65 -6.29
99 1,2-dimethoxyethane -4.55 -4.25
100 1,4-dioxane -4.89 -5.23
101 propanal -4.13 -4.29
102 butanal -4.62 -4.54
103 benzaldehyde -6.13 -6.66
104 m-hydroxybenzaldehyde -11.39 -10.71
105 p-hydroxybenzaldehyde -12.36 -10.94
106 acetone -3.15 -3.94
107 2-butanone -3.78 -4.41
108 3,3-dimethyl-2-butanone -4.53 -5.14
109 2-pentanone -4.35 -4.96
110 3-pentanone -4.36 -4.77
111 cyclopentanone -5.01 -5.67
112 2-hexanone -5.02 -5.21
113 2-heptanone -5.65 -5.94
114 2-octanone -6.38 -6.31
115 acetophenone -6.74 -7.14
116 acetic acid -6.35 -5.20
117 propionic acid -6.86 -5.70
118 butyric acid -7.58 -6.27
119 pentanoic acid -8.22 -6.92
120 hexanoic acid -8.82 -7.41
121 4-amino-3,5,6-trichloropyridine-2-carboxylic acid -12.37 -13.33
122 methyl formate -2.82 -5.09
123 methyl acetate -3.54 -4.16
124 ethyl acetate -4.06 -4.70
125 propyl acetate -4.55 -5.18
126 butyl acetate -4.96 -5.76
127 methyl propionate -4.06 -4.69
128 methyl butyrate -4.59 -5.22
129 methyl pentanoate -5.13 -5.93
130 methyl benzoate -7.26 -8.06
131 methylamine -3.78 -3.31
132 ethylamine -4.09 -3.99
133 propylamine -4.77 -4.57
134 butylamine -5.35 -5.00
135 diethylamine -4.75 -4.36
136 dipropylamine -6.02 -5.64
137 trimethylamine -3.60 -2.35
138 piperazine -5.80 -5.91
139 aniline -6.71 -8.18
140 hydrazine -6.48 -7.06
141 morpholine -5.99 -5.33
142 piperidine -6.27 -4.67


143 pyridine -5.34 -4.89
144 2-methylpyridine -6.14 -5.38
145 3-methylpyridine -6.40 -5.60
146 4-methylpyridine -6.60 -5.66
147 2-ethylpyridine -6.40 -5.99
148 2-methylpyrazine -5.87 -6.30
149 2-ethyl-3-methoxypyrazine -6.85 -7.54
150 acetonitrile -3.15 -2.29
151 propionitrile -3.66 -3.14
152 butyronitrile -4.25 -3.67
153 benzonitrile -6.09 -6.54
154 2,6-dichlorobenzonitrile -9.18 -8.05
155 1-propanethiol -3.52 -3.80
156 thiophenol -5.99 -6.68
157 thioanisole -6.47 -6.99
158 dimethyl sulfide -4.24 -2.61
159 diethyl sulfide -4.09 -3.48
160 dipropyl sulfide -3.89 -4.99
161 trimethyl phosphate -7.81 -8.94
162 triethyl phosphate -8.88 -8.78
163 tripropyl phosphate -8.65 -8.22
164 2,2-dichloroethenyl dimethyl phosphate -8.59 -7.89
165 o-ethyl-o'-(4-bromo-2-chlorophenyl) S-propyl phosphorothioate -10.49 -10.80

Table A4 The free energy of solvation in water data set.
No. Compound Exp. Calc.
1 methane 1.98 0.98
2 ethane 1.83 0.85
3 propane 1.96 1.04
4 cyclopropane 0.75 0.07
5 2-methylpropane 2.32 1.42
6 2,2-dimethylpropane 2.50 1.98
7 n-butane 2.08 1.24
8 2,2-dimethylbutane 2.59 2.14
9 cyclopentane 1.20 -0.64
10 n-pentane 2.33 1.44
11 2-methylpentane 2.52 1.81
12 3-methylpentane 2.51 1.73
13 2,4-dimethylpentane 2.88 2.22
14 2,2,4-trimethylpentane 2.85 2.72
15 methylcyclopentane 1.60 -0.13
16 n-hexane 2.49 1.64
17 cyclohexane 1.23 0.74
18 methylcyclohexane 1.71 1.18
19 cis-1,2-dimethylcyclohexane 1.58 1.59
20 n-heptane 2.62 1.84


21 n-octane 2.89 2.02
22 ethylene 1.27 0.86
23 propylene 1.27 0.72
24 2-methylpropene 1.16 0.46
25 1-butene 1.38 0.87
26 2-methyl-2-butene 1.31 0.26
27 3-methyl-1-butene 1.83 1.18
28 1-pentene 1.66 1.14
29 trans-2-pentene 1.34 0.72
30 4-methyl-1-pentene 1.91 1.51
31 cyclopentene 0.56 -0.77
32 1-hexene 1.66 1.35
33 cyclohexene 0.37 -0.58
34 trans-2-heptene 1.66 1.13
35 1-methylcyclohexene 0.67 -0.75
36 1-octene 2.17 1.85
37 1,3-butadiene 0.61 0.77
38 2-methyl-1,3-butadiene 0.68 0.57
39 2,3-dimethyl-1,3-butadiene 0.40 0.37
40 1,4-pentadiene 0.94 0.89
41 1,5-hexadiene 1.01 1.06
42 acetylene -0.01 0.55
43 propyne -0.48 -0.11
44 1-butyne -0.15 0.05
45 1-pentyne -0.16 0.48
46 1-hexyne 0.01 0.61
47 1-heptyne 0.60 0.88
48 1-octyne 0.71 1.05
49 1-nonyne 1.05 1.28
50 vinyl acetate 0.04 0.19
51 benzene -0.89 -0.88
52 toluene -0.76 -0.96
53 1,2,4-trimethylbenzene -0.86 -1.12
54 ethylbenzene -0.61 -0.77
55 m-xylene -0.80 -1.05
56 o-xylene -0.90 -1.01
57 p-xylene -0.80 -1.04
58 propylbenzene -0.53 -0.37
59 butylbenzene -0.40 -0.27
60 t-butylbenzene -0.44 0.23
61 t-amylbenzene -0.18 0.36
62 naphthalene -2.41 -2.43
63 anthracene -4.23 -4.02
64 phenanthrene -4.06 -4.09
65 acenaphthene -3.40 -3.75
66 p-chlorotoluene -1.92 -2.19
67 fluoromethane -0.22 -2.29
68 1,1-difluoroethane -0.11 -3.26
69 trifluoromethane 0.80 -1.15
70 tetrafluoromethane 3.16 2.09
71 hexafluoroethane 3.94 3.33


72 octafluoropropane 4.28 4.80
73 fluorobenzene -0.78 -3.31
74 2-chloro-1,1,1-trifluoroethane 0.05 -1.15
75 chlorofluoromethane -0.77 -0.68
76 chlorodifluoromethane 0.11 1.13
77 chlorotrifluoromethane 2.52 2.93
78 dichlorodifluoromethane 1.69 2.54
79 fluorotrichloromethane 0.82 0.33
80 1,1,2-trichloro-1,2,2-trifluoroethane 1.77 3.05
81 1,1,2,2-tetrachlorodifluoroethane 0.82 2.38
82 chloropentafluoroethane 2.86 3.46
83 1,1-dichlorotetrafluoroethane 2.50 2.75
84 1,2-dichlorotetrafluoroethane 2.31 3.97
85 1-bromo-1-chloro-2,2,2-trifluoroethane -0.13 -1.23
86 bromotrifluoromethane 1.79 -1.16
87 1-bromo-1,2,2,2-tetrafluoroethane 0.52 -0.80
88 chloromethane -0.56 -0.81
89 dichloromethane -1.36 -1.02
90 trichloromethane -1.07 -0.34
91 tetrachloromethane 0.10 -0.38
92 chloroethane -0.63 -0.76
93 1,1-dichloroethane -0.85 -1.42
94 (E)-1,2-dichloroethane -1.73 -1.64
95 1,1,1-trichloroethane -0.25 -1.26
96 1,1,2-trichloroethane -1.95 -1.66
97 1,1,1,2-tetrachloroethane -1.15 -0.57
98 1,1,2,2-tetrachloroethane -2.36 -0.79
99 pentachloroethane -1.36 -0.09
100 hexachloroethane -1.40 0.52
101 1-chloropropane -0.35 -0.43
102 2-chloropropane -0.24 0.49
103 1,2-dichloropropane -1.25 -1.01
104 1,3-dichloropropane -1.90 -1.93
105 1-chlorobutane -0.14 -0.31
106 2-chlorobutane 0.07 0.51
107 1,1-dichlorobutane -0.70 -1.05
108 1-chloropentane -0.07 -0.13
109 2-chloropentane 0.07 0.73
110 3-chloropentane 0.07 0.49
111 chloroethylene 0.49 -0.68
112 cis-1,2-dichloroethylene -1.17 -1.50
113 trans-1,2-dichloroethylene -0.76 -1.32
114 trichloroethylene -0.44 -1.16
115 tetrachloroethylene 0.05 -0.55
116 chlorobenzene -1.01 -1.97
117 o-chlorotoluene -1.15 -1.69
118 1,2-dichlorobenzene -1.36 -2.68
119 1,3-dichlorobenzene -0.98 -2.86
120 1,4-dichlorobenzene -1.01 -3.03
121 2,2'-dichlorobiphenyl -2.73 -2.38
122 2,3-dichlorobiphenyl -2.45 -3.48


123 2,2',3'-trichlorobiphenyl -1.99 -3.34
124 bromotrichloromethane -0.93 -2.28
125 1-chloro-2-bromoethane -1.95 -1.91
126 bromomethane -0.82 -0.49
127 dibromomethane -2.11 -2.39
128 tribromomethane -1.98 -3.00
129 bromoethane -0.70 -0.71
130 1,2-dibromoethane -2.10 -1.58
131 1-bromopropane -0.56 -0.46
132 2-bromopropane -0.48 1.76
133 1,2-dibromopropane -1.94 -0.23
134 1,3-dibromopropane -1.96 -1.75
135 1-bromo-2-methylpropane -0.03 0.31
136 1-bromobutane -0.41 -0.38
137 1-bromoisobutane -0.03 0.73
138 1-bromo-3-methylbutane 0.20 -0.03
139 1-bromopentane -0.08 -0.22
140 3-bromopropene -0.86 -0.42
141 bromobenzene -1.46 -1.53
142 1,4-dibromobenzene -2.30 -1.71
143 p-bromotoluene -1.39 -1.77
144 1-bromo-2-ethylbenzene -1.19 -1.17
145 o-bromocumene -0.85 -0.22
146 methanol -5.07 -4.58
147 ethanol -4.90 -4.86
148 ethylene glycol -9.30 -7.52
149 1-propanol -4.85 -4.57
150 2-propanol -4.75 -4.92
151 1,1,1-trifluoro-2-propanol -4.16 -4.18
152 2,2,3,3-tetrafluoropropanol -4.90 -4.30
153 2,2,3,3,3-pentafluoropropanol -4.15 -5.57
154 1,1,1,3,3,3-hexafluoro-2-propanol -3.76 -0.48
155 2-methyl-1-propanol -4.51 -5.11
156 1-butanol -4.72 -4.34
157 2-butanol -4.61 -3.06
158 t-butanol -4.51 -4.37
159 2-methyl-1-butanol -4.42 -3.27
160 3-methyl-1-butanol -4.42 -3.87
161 2-methyl-2-butanol -4.43 -3.59
162 2,3-dimethyl-1-butanol -3.91 -4.46
163 1-pentanol -4.49 -4.15
164 2-pentanol -4.39 -3.12
165 3-pentanol -4.35 -3.45
166 2-methyl-1-pentanol -3.93 -4.52
167 2-methyl-2-pentanol -3.93 -3.14
168 2-methyl-3-pentanol -3.89 -2.90
169 4-methyl-2-pentanol -3.74 -2.74
170 cyclopentanol -5.49 -5.77
171 1-hexanol -4.36 -3.94
172 3-hexanol -3.68 -3.11
173 cyclohexanol -4.95 -4.82


174 4-heptanol -4.01 -2.72
175 cycloheptanol -5.49 -4.05
176 1-heptanol -4.25 -3.73
177 1-octanol -4.10 -3.54
178 allyl alcohol -5.03 -5.61
179 phenol -6.53 -5.58
180 4-bromophenol -7.10 -6.06
181 4-t-butylphenol -5.92 -4.14
182 2-cresol -5.86 -5.10
183 3-cresol -5.49 -5.76
184 4-cresol -6.12 -5.81
185 2,2,2-trifluoroethanol -4.31 -5.49
186 p-bromophenol -7.13 -6.06
187 2-methoxyethanol -6.77 -4.81
188 dimethoxymethane -2.93 -2.52
189 methyl propyl ether -1.66 -1.51
190 methyl isopropyl ether -2.00 -1.82
191 methyl t-butyl ether -2.21 -0.28
192 diethyl ether -1.75 -2.13
193 ethyl propyl ether -1.81 -1.72
194 dipropyl ether -1.16 -1.41
195 diisopropyl ether -0.53 -0.83
196 di-n-butyl ether -0.83 -1.05
197 THF -3.12 -3.83
198 2-methyltetrahydrofuran -3.30 -3.70
199 anisole -2.45 -1.99
200 ethyl phenyl ether -4.28 -1.81
201 1,1-diethoxyethane -3.27 -3.17
202 1,2-dimethoxyethane -4.84 -3.06
203 1,2-diethoxyethane -3.53 -3.89
204 1,3-dioxolane -4.09 -6.12
205 1,4-dioxane -5.05 -5.18
206 2,2,2-trifluoroethyl vinyl ether -0.12 -1.53
207 1-chloro-2,2,2-trifluoroethyl difluoromethyl ether 0.11 0.01
208 acetaldehyde -3.50 -3.21
209 propanal -3.44 -4.10
210 butanal -3.18 -2.81
211 pentanal -3.03 -3.79
212 hexanal -2.81 -2.50
213 heptanal -2.67 -3.43
214 octanal -2.29 -2.15
215 nonanal -2.07 -3.16
216 trans-2-butenal -4.23 -4.62
217 trans-2-hexenal -3.68 -4.29
218 trans-2-octenal -3.44 -3.93
219 trans,trans-2,4-hexadienal -4.64 -3.52
220 benzaldehyde -4.02 -5.05
221 m-hydroxybenzaldehyde -9.51 -9.02
222 p-hydroxybenzaldehyde -10.47 -9.23
223 acetone -3.80 -3.75
224 2-butanone -3.71 -4.41


225 3-methyl-2-butanone -3.24 -4.02
226 3,3-dimethyl-2-butanone -2.89 -3.65
227 2-pentanone -3.52 -3.99
228 3-pentanone -3.41 -4.22
229 4-methyl-2-pentanone -3.06 -3.67
230 2,4-dimethyl-3-pentanone -2.74 -3.37
231 cyclopentanone -4.68 -4.09
232 2-hexanone -3.41 -3.92
233 2-heptanone -3.04 -3.65
234 4-heptanone -2.93 -3.61
235 2-octanone -2.88 -3.48
236 2-nonanone -2.48 -3.33
237 5-nonanone -2.67 -2.99
238 2-undecanone -2.15 -2.87
239 acetophenone -4.58 -4.94
240 acetic acid -6.70 -5.32
241 propionic acid -6.46 -5.40
242 butyric acid -6.35 -4.73
243 pentanoic acid -6.16 -4.71
244 hexanoic acid -6.21 -4.29
245 4-amino-3,5,6-trichloropyridine-2-carboxylic acid -11.96 -12.75
246 methyl formate -2.78 -4.55
247 ethyl formate -2.65 -4.77
248 propyl formate -2.48 -4.28
249 methyl acetate -3.31 -3.69
250 isopropyl formate -2.02 -4.81
251 isobutyl formate -2.22 -4.84
252 isoamyl formate -2.13 -3.66
253 ethyl acetate -3.08 -3.56
254 propyl acetate -2.85 -3.06
255 isopropyl acetate -2.65 -3.70
256 butyl acetate -2.55 -2.88
257 isobutyl acetate -2.36 -4.84
258 amyl acetate -2.45 -2.62
259 isoamyl acetate -2.21 -2.52
260 hexyl acetate -2.26 -2.44
261 methyl propionate -2.97 -3.60
262 ethyl propionate -2.80 -3.37
263 propyl propionate -2.54 -3.03
264 isopropyl propionate -2.22 -3.38
265 pentyl propionate -1.99 -2.78
266 methyl butyrate -2.84 -3.10
267 ethyl butyrate -2.50 -3.06
268 propyl butyrate -2.28 -2.69
269 methyl pentanoate -2.54 -2.97
270 ethyl pentanoate -2.52 -2.74
271 methyl hexanoate -2.48 -2.61
272 ethyl heptanoate -2.30 -2.32
273 methyl octanoate -2.05 -2.15
274 methyl benzoate -4.28 -5.93
275 methylamine -4.60 -3.98


276 ethylamine -4.61 -4.34
277 propylamine -4.50 -4.02
278 butylamine -4.38 -3.88
279 pentylamine -4.09 -3.64
280 hexylamine -4.04 -3.50
281 dimethylamine -4.28 -3.28
282 diethylamine -4.06 -3.52
283 dipropylamine -3.65 -2.97
284 dibutylamine -3.31 -2.49
285 trimethylamine -3.23 -1.70
286 triethylamine -3.03 -2.38
287 azetidine -5.56 -4.02
288 piperazine -7.40 -8.45
289 N,N'-dimethylpiperazine -7.58 -5.74
290 N-methylpiperazine -7.77 -7.15
291 aniline -5.49 -6.28
292 1,1-dimethyl-3-phenylurea -11.87 -9.31
293 N,N-dimethylaniline -2.90 -4.97
294 ethylenediamine -9.75 -8.93
295 hydrazine -9.30 -9.78
296 2-methoxy-1-ethanamine -6.55 -6.26
297 morpholine -7.17 -6.61
298 N-methylmorpholine -6.34 -5.22
299 N-methylpyrrolidine -3.97 -3.82
300 N-methylpiperidine -3.89 -2.67
301 pyrrolidine -5.47 -4.47
302 piperidine -5.10 -3.98
303 pyridine -4.69 -3.32
304 2-methylpyridine -4.62 -3.49
305 3-methylpyridine -4.77 -3.54
306 4-methylpyridine -4.92 -3.58
307 2-ethylpyridine -4.32 -3.42
308 3-ethylpyridine -4.60 -3.40
309 4-ethylpyridine -4.72 -3.45
310 2,3-dimethylpyridine -4.81 -3.58
311 2,4-dimethylpyridine -4.85 -3.75
312 2,5-dimethylpyridine -4.70 -3.74
313 2,6-dimethylpyridine -4.60 -3.60
314 3,4-dimethylpyridine -5.21 -3.71
315 3,5-dimethylpyridine -4.84 -3.77
316 2-methylpyrazine -5.51 -4.80
317 2-ethylpyrazine -5.45 -4.72
318 2-isobutylpyrazine -5.05 -4.10
319 2-ethyl-3-methoxypyrazine -4.39 -4.57
320 2-isobutyl-3-methoxypyrazine -3.68 -3.67
321 9-methyladenine -13.60 -13.77
322 1-methylthymine -10.40 -11.22
323 methylimidazole -10.25 -7.61
324 N-propylguanidine -10.92 -10.73
325 acetonitrile -3.89 -1.49
326 propionitrile -3.85 -1.73


327 butyronitrile -3.64 -1.48
328 benzonitrile -4.10 -3.87
329 2,6-dichlorobenzonitrile -5.22 -5.11
330 3,5-dibromo-4-hydroxybenzonitrile -9.00 -9.34
331 N,N-dimethylformamide -4.90 -6.63
332 N-methylformamide -10.00 -8.02
333 acetamide -9.72 -8.41
334 (E)-N-methylacetamide -10.00 -7.25
335 (Z)-N-methylacetamide -10.00 -7.42
336 propionamide -9.42 -8.55
337 methanethiol -1.24 -1.82
338 ethanethiol -1.30 -1.22
339 1-propanethiol -1.05 -0.91
340 thiophenol -2.55 -3.03
341 thioanisole -2.73 -2.80
342 dimethyl sulfide -1.54 -2.21
343 diethyl sulfide -1.43 -0.90
344 methyl ethyl sulfide -1.49 -1.60
345 dipropyl sulfide -1.27 -0.51
346 2,2'-dichlorodiethyl sulfide -3.92 -3.44
347 dimethyl disulfide -1.83 -2.67
348 diethyl disulfide -1.63 -1.60
349 trimethyl phosphate -8.70 -8.52
350 triethyl phosphate -7.80 -7.65
351 tripropyl phosphate -6.10 -4.16
352 2,2-dichloroethenyl dimethyl phosphate -6.61 -6.55
353 dimethyl-5-(4-chloro)-bicyclo[3.2.0]-heptyl phosphate -7.28 -8.62
354 o-ethyl-o'-(4-bromo-2-chlorophenyl) S-propyl phosphorothioate -4.09 -5.04
355 hydroquinone -10.77 -10.18
356 1,2,3-trimethoxybenzene -5.40 -6.31
357 1,2-benzenediol -7.62 -9.56
358 1,3-benzenediol -9.67 -9.68
359 o-phenylenediamine -7.19 -10.91
360 m-phenylenediamine -10.26 -12.01
361 2-methylaniline -5.47 -6.41
362 N-methylaniline -4.54 -5.38
363 acetylene anion -73 -66
364 protonated methanol -85 -76
365 protonated dimethyl ether -70 -72
366 protonated 2-propanol -64 -66
367 methanolate ion -95 -92
368 formylate ion -77 -75
369 dimethyl ether carbanion -81 -78
370 phenolate ion -72 -80
371 toluene carbanion -59 -62
372 superoxide -87 -84
373 methyl ammonium ion -70 -79
374 protonated acetamide -66 -63
375 protonated N-methylmethanamine -63 -70
376 protonated N,N-dimethylmethanamine -59 -62
377 pyridinium ion -59 -68


378 ammonium ion -79 -67
379 acetonitrile carbanion -75 -73
380 azide ion -74 -74
381 methylsulfonium ion -74 -74
382 protonated dimethyl sulfide -61 -55
383 1-propanethiolate anion -76 -75
384 thiophenolate ion -67 -66

Table A5 The pKa data set.
No. Compound Exp. Calc.

1 2,3,4,5,6-pentafluoroaniline -0.28 0.26
2 2,3,5,6-tetramethyl-4-nitrobenzeneamine 2.36 2.97
3 2,3-dichloroaniline 1.76 2.03
4 2,4,5-trichloroaniline 1.09 1.33
5 2,4,6-trichloroaniline -0.03 1.38
6 2,4-dibromoaniline 2.30 0.92
7 2,4-dichloroaniline 2.00 2.23
8 2,4-dinitroaniline -4.25 -2.34
9 2,5-dichloroaniline 2.05 2.09
10 2,5-dimethoxyaniline 3.93 6.15
11 2,6-dichloro-4-nitroaniline -2.55 -1.07
12 2,6-dichloroaniline 0.42 2.18
13 2,6-dimethyl-4-nitrobenzeneamine 0.98 2.37
14 2,6-dinitroaniline -5.00 -2.07
15 2-amino-4-nitrophenol 3.10 2.33
16 2-aminobenzoic acid,ethyl ester 2.51 3.52
17 2-aminobenzoic acid 2.14 2.83
18 2-aminobiphenyl 3.83 6.77
19 2-aminophenol 4.84 5.26
20 2-chloro-4-nitroaniline -0.94 -0.64
21 2-methoxy-5-nitroaniline 2.49 2.36
22 2-nitro-4-toluidine 0.40 1.62
23 3,4-dichloroaniline 2.97 2.18
24 3,5-dichloroaniline 2.51 1.99
25 3,5-dimethyl-4-nitrobenzeneamine 2.54 2.22
26 3,5-dinitroaniline 0.30 -0.33
27 3-aminobenzoic acid 3.07 3.17
28 3-aminophenol 4.37 5.24
29 3-bromoaniline 3.58 2.62
30 3-methyl-4-bromoaniline 4.05 3.46
31 3-methyl-4-nitroaniline 1.64 1.70
32 3-nitro-4-toluidine 3.03 2.96
33 3-trifluoromethylaniline 3.49 2.46
34 4-aminobenzoic acid 2.38 1.86
35 4-aminobiphenyl 4.35 6.32


36 4-aminophenol 5.48 5.91
37 4-benzoylaniline 2.24 1.11
38 4-bromoaniline 3.86 4.30
39 4-chloro-2-nitroaniline -1.02 -0.22
40 4-chloro-3-nitrobenzeneamine 1.90 -0.52
41 4-methoxy-2-nitrobenzenamine 0.77 1.03
42 4-methylsulfonylaniline 1.35 0.31
43 4-nitro-2-toluidine 1.04 0.90
44 5-nitro-2-toluidine 2.35 2.65
45 butyl-4-aminobenzoate 2.47 3.98
46 methyl-4-aminobenzoate 2.47 3.48
47 methyl anthranilate 2.23 2.97
48 o-bromoaniline 2.53 1.67
49 p-aminobenzoic acid,ethyl ester 2.51 2.60
50 p-aminosalicylic acid 2.05 2.73
51 propyl-4-aminobenzoate 2.49 4.55
52 p-trifluoromethylaniline 2.45 2.01
53 1,2,2,6,6-pentamethylpiperidine 11.25 9.80
54 1,2,3,4-tetrahydro-2-naphthalenamine 9.93 10.17
55 1-methylpyrrolidine 10.32 8.72
56 2,2,2-trifluoroethylamine 5.70 4.95
57 2,2,6,6-tetramethylpiperidine 11.72 11.19
58 2,2-bipyridine 4.33 3.97
59 2,3,4,5,6-pentachloropyridine -1.00 -0.72
60 2,3,5,6-tetrachloropyridine -0.80 -1.88
61 2,3,5,6-tetramethylpyridine 7.90 7.64
62 2,3-dichloropyridine -0.85 0.70
63 2,3-dimethylpyridine 6.57 6.44
64 2,4,6-collidine 7.43 6.93
65 2,4-dimethylpyridine 6.99 6.51
66 2,5-dimethylpyridine 6.40 6.53
67 2,6-dichloropyridine -2.86 1.44
68 2,6-dimethoxypyridine 1.60 4.41
69 2,6-lutidine 6.60 6.18
70 2-acetylpyridine 2.73 3.02
71 2-amino-5-methylpyridine 7.22 5.93
72 2-aminomethylfuran 8.89 8.22
73 2-benzylpyridine 5.13 6.08
74 2-bromopyridine 0.90 1.91
75 2-chloropyridine 0.49 2.78
76 2-ethylpyridine 5.89 6.38
77 2-fluoropyridine -0.44 3.32
78 2-hydroxypyridine 0.75 3.70
79 2-methoxypyridine 3.06 4.55
80 2-methyl-5-vinylpyridine 5.67 5.94
81 2-methylpiperidine 11.08 10.10
82 2-methylpyridine 6.00 5.69
83 2-methylthiopyridine 3.59 1.71
84 2-phenethylamine 9.96 7.35
85 2-phenylpyridine 4.48 4.81


86 2-phenylpyrrolidine 9.40 9.24
87 2-propylpiperidine 11.00 10.88
88 2-pyridinecarboxyaldehyde 3.80 4.25
89 2-pyridineethanol 5.31 5.13
90 2-pyridinepropanol 5.61 5.81
91 2-t-butylpyridine 5.76 7.26
92 2-vinylpyridine 4.98 4.74
93 3,4-dimethylpyridine 6.46 6.70
94 3,4-methylenedioxyamphetamine 9.67 9.40
95 3,5-dichloropyridine 0.67 1.60
96 3,5-dimethylpyridine 6.15 6.86
97 3-bromopyridine 2.91 1.11
98 3-ethylpyridine 5.56 6.77
99 3-formylpyridine 3.80 3.51
100 3-hydroxypyridine 4.80 4.31
101 3-methoxypyridine 4.91 4.71
102 3-methylpyridine 5.63 6.07
103 3-nitropyridine 1.18 0.51
104 3-phenylpropylamine 10.16 9.18
105 3-phenylpyridine 4.80 5.50
106 3-pyridinemethaneamine 5.96 8.10
107 3-pyridinemethanol 4.90 4.67
108 3-pyridinepropanol 5.47 6.08
109 4,4-bipyridinyl 4.82 5.16
110 4-acetylpyridine 3.59 3.45
111 4-benzylpyridine 5.59 6.78
112 4-bromopyridine 3.78 3.24
113 4-chloropyridine 3.84 3.65
114 4-cyanopyridine 1.90 3.85
115 4-ethylmorpholine 7.67 8.03
116 4-ethylpyridine 5.87 6.72
117 4-formylpyridine 4.77 3.29
118 4-methoxypyridine 6.47 5.05
119 4-methylbenzenemethanamine 9.36 9.22
120 4-methylpyridine 5.98 6.01
121 4-phenylbutylamine 10.36 9.66
122 4-phenylpyridine 5.55 5.33
123 4-propylpyridine 6.05 7.56
124 4-pyridineethanol 5.60 6.88
125 4-pyridinemethanol 5.33 5.56
126 4-pyridinepropanol 5.84 7.11
127 4-t-butylpyridine 5.99 7.84
128 4-vinylpyridine 5.62 5.15
129 5-ethyl-2-methylpyridine 6.51 7.20
130 allylamine 9.70 8.45
131 α-methylbenzeneethanamine 10.13 9.45
132 α-methylbenzenepropanamine 9.79 9.51
133 anabasine 8.70 8.91
134 arecoline 7.16 6.34
135 azetidine 11.29 8.87


136 benzylamine 9.33 7.82
137 bis-(2-chloroethyl)ethylamine 6.57 6.57
138 chlorpheniramine 9.13 7.09
139 cyclohexanamine 10.63 10.50
140 diallylamine 9.29 9.17
141 dibutylamine 11.39 11.89
142 dicyclohexylamine 10.40 11.82
143 diethylamine 11.09 10.22
144 diisopropylamine 11.07 10.99
145 dimethylamine 10.73 9.46
146 dimethylbutylamine 10.19 9.61
147 dinicotinic acid 1.10 3.88
148 diphenhydramine 8.98 6.74
149 dipropylamine 11.00 11.01
150 E-3-nicotinoylacrylic acid 3.82 1.90
151 ethylamine 10.87 10.11
152 ethyldimethylamine 10.16 8.72
153 fenpropidin 10.10 12.19
154 fenpropimorph 6.98 11.76
155 hexamethyleneimine 11.07 10.21
156 isobutylamine 10.68 10.84
157 isonicotinic acid, ethyl ester 1.70 4.11
158 isonicotinic acid, methyl ester 3.45 3.45
159 isonicotinic acid 3.26 2.96
160 isopropylamine 10.63 10.42
161 mescaline 9.56 8.09
162 methadone 8.94 7.28
163 methamphetamine 9.87 9.35
164 methylamine 10.62 9.23
165 methylbutylamine 10.90 10.29
166 morpholine 8.49 7.85
167 moxisylyte 8.72 8.01
168 N-β-dimethylbenzeneethanamine 9.87 8.73
169 N-butylamine 10.78 10.50
170 N-ethylbenzenemethanamine 9.64 9.78
171 nicotine 8.18 8.64
172 nicotinic acid, ethyl ester 3.35 3.51
173 nicotinic acid, methyl ester 3.13 2.81
174 nikethamide 3.50 3.97
175 n-methylbenzeneethanamine 10.08 8.13
176 n-methylbenzylamine 9.54 7.85
177 n-methylmorpholine 7.38 7.72
178 n-methylpiperidine 10.08 8.83
179 N,N-di-2-propenyl-2-propen-1-amine 8.31 8.60
180 N,N-dimethyl-2-(3-pyridyl)ethylamine 8.86 8.46
181 N,N-dimethyl-2-[5-methyl-2-(1-methylethyl)phenoxy]ethanamine 8.66 9.74
182 N,N-dimethyl-(2-pyridine)ethanamine 8.75 7.82
183 N,N-dimethyl-3-pyridylmethylamine 8.00 7.63
184 N,N-dimethylbenzylamine 8.91 7.50


185 orphenadrine 8.91 8.33
186 picolinic acid, methyl ester 2.21 2.49
187 picolinic acid 1.06 2.29
188 piperalin 8.90 8.34
189 piperidine 11.28 9.30
190 p-methoxyamphetamine 9.53 10.04
191 propylamine 10.71 10.09
192 pyridine 5.23 5.27
193 pyrrolidine 11.31 8.97
194 sec-butylamine 10.56 10.80
195 t-butylamine 10.68 10.52
196 triethylamine 10.78 9.64
197 trimethylamine 9.80 8.37
198 tri-N-butylamine 10.89 12.41
199 tripropylamine 10.65 10.91
200 1-acetyl-1H-imidazole 3.60 4.53
201 1-methyl-4-nitro-1H-imidazole -0.53 -2.05
202 1-methyl-5-nitroimidazole 2.13 -0.07
203 1-phenylmethyl-1H-imidazole 6.70 6.39
204 2-(2,4-dimethylphenyl)-5-nitrobenzimidazole 5.29 2.75
205 2-(2-methoxyphenyl)benzimidazole 7.17 4.37
206 2-(2-methylphenyl)-5-nitrobenzimidazole 4.87 2.72
207 2,4,6-pyrimidinetriamine 6.81 2.23
208 2-(4-aminophenylmethyl)-5-chlorobenzimidazole 7.47 4.67
209 2-(4-bromophenylmethyl)-5-chlorobenzimidazole 5.42 7.96
210 2-(4-chlorphenylmethyl)-5-chlorobenzimidazole 4.86 4.82
211 2,4-dimethylquinoline 5.12 6.41
212 2-(4-methoxyphenylmethyl)-5-nitrobenzimidazole 4.26 1.76
213 2-(4-methylphenyl)benzimidazole 6.90 6.27
214 2-(4-methylphenylmethyl)-5-chlorobenzimidazole 7.09 5.70
215 2,6-dimethylquinoline 6.10 6.62
216 2-amino-4,6-dimethylpyrimidine 4.82 4.35
217 2-aminopyrimidine 3.45 2.98
218 2-bromopyrimidine -1.63 1.24
219 2-ethoxypyrimidine 1.27 3.11
220 2-methyl-1H-imidazole 7.85 6.22
221 2-methyl-8-quinolinol 5.55 4.92
222 2-methylquinoline 5.71 5.70
223 2-methylthio-4,6-dimethylpyrimidine 0.59 4.86
224 2-methylthiopyrimidine 6.48 3.72
225 2-phenyl-1H-imidazole -0.68 5.11
226 2-pyrimidinecarboxylic acid, methyl ester 2.12 0.70
227 3-bromoquinoline 2.69 3.66
228 3-methylquinoline 5.17 6.09
229 3-quinolinol 4.28 4.73
230 4,6-dimethylpyrimidine 2.70 5.05
231 4,7-dichloroquinoline 2.80 1.94
232 4-methyl-8-quinolinol 5.56 4.89
233 4-methylpyrimidine 1.91 4.40
234 4-methylquinoline 5.67 5.87


235 4-nitroimidazole -0.05 -0.20
236 5-chloro-8-quinolinol 3.56 2.62
237 5-nitropyrimidine 0.72 -0.44
238 5-quinolinol 5.02 4.37
239 6-bromoquinoline 3.87 1.17
240 6-chloroquinoline 3.85 3.18
241 6-hydroxyquinoline 5.15 3.85
242 6-methoxyquinoline 5.03 4.93
243 6-methylquinoline 5.34 6.06
244 7-bromoquinoline 3.87 3.98
245 7-methoxyquinoline 5.03 4.67
246 7-methylquinoline 5.34 6.00
247 7-quinolinol 5.46 4.56
248 8-chloroquinoline 3.12 3.28
249 8-fluoroquinoline 3.34 4.20
250 8-methoxyquinoline 5.01 4.42
251 8-methylquinoline 5.05 5.81
252 8-quinolinol 4.90 4.21
253 anserine 7.04 8.67
254 benzimidazole 5.53 5.20
255 cimetidine 6.80 6.07
256 cloquintocetmexyl 3.75 3.90
257 fenclorim 4.23 0.07
258 imidazole 6.95 5.97
259 pentostatin 5.20 7.38
260 pilocarpol 6.78 5.75
261 prochloraz 3.80 2.85
262 pyrimethanil 3.52 4.06
263 pyrimidine 1.23 3.61
264 quinoline 4.90 5.16
265 triflumizole 3.70 2.29

Table A6 The glass transition temperature data set [a].
No. Compound Exp. Calc.

1 1-TNATA 386 380
2 2-TNATA 383 395
3 AODF1 353 362
4 AODF2 353 379
5 BMA-1T 359 346
6 BMA-2T 363 357
7 BMA-3T 366 374
8 BMA-4T 371 376
9 BMB-2T 380 360
10 BMB-3T 366 372
11 BNpA-1T 364 364
12 BPAPF 440 422
13 EFPCA 405 393
14 EFPPCA 458 446


15 EM1 407 401
16 EM2 395 366
17 EM3 391 374
18 EM4 372 389
19 EM5 440 424
20 ENPPCA 447 403
21 EPPCA 447 419
22 EtCz2 343 364
23 F1AMB-1T 397 384
24 m-BPD 354 365
25 m-MTDAB 320 344
26 m-MTDAPB 378 387
27 m-MTDATA 348 361
28 m-MTDATz 315 360
29 MPPPCA 456 430
30 MTBDAB 407 451
31 m-TTA 353 399
32 NEFAPQ 389 380
33 NPB 368 351
34 NPCA 396 402
35 NPECAPPP 425 382
36 o-MTDAB 315 337
37 o-MTDAPB 382 387
38 o-MTDATA 349 355
39 o-MTDATz 328 357
40 PAB 401 402
41 PAE3b 388 398
42 PAE3c 412 428
43 PAPA 394 406
44 PATB4a 398 394
45 PATB4d 423 421
46 PATB4e 416 449
47 p-BrTDAB 345 352
48 p-ClTDAB 337 320
49 p-DPA-TDAB 380 388
50 p-FTDAB 327 326
51 PhAMB-1T 357 355
52 PhCz2 363 376
53 p-MTDAB 328 344
54 p-MTDAPB 383 388
55 p-MTDATA 353 362
56 PPACBN 467 422
57 PPATC3e 415 445
58 PPCA 453 431
59 PPPCA 457 440
60 p-TTA 405 398
61 TBB 361 428
62 TBPSF 468 392
63 TCB 399 393
64 TCPB 445 453
65 TCTA 423 414
66 TDAPB 394 377
67 TDATA 362 352
68 TMB-TB 433 430
69 TPD 338 335
70 TPOB 410 383
71 TPTAB1 311 324
72 TPTAB2 319 310
73 TPTE 403 402

[a] Structure codes from Yin, S.; Wang, Y., J. Chem. Inf. Comput. Sci., 2003, 43, 970-977.


Table A7 The aqueous solubility data set (logS).
No. Compound Exp. Calc.

1 1-Bromoheptane -4.431 -3.409
2 1-Bromohexane -3.807 -3.214
3 Acetyl-R-mandelic acid -1.231 -2.342
4 1,1-Diphenylethene -4.436 -4.155
5 Benzo[b]triphenylene -8.222 -5.931
6 1,2,4,5-Tetrafluorobenzene -2.376 -1.534
7 1,3-Butadiene -1.867 -2.157
8 1,4-Dimethylcyclohexane -4.466 -3.070
9 1,4-Pentadiene -2.087 -2.309
10 1,5-Hexadiene -2.687 -2.562
11 1,6-Heptadiene -3.340 -2.836
12 1,6-Heptadiyne -1.747 -2.436
13 1,8-Nonadiyne -2.983 -2.954
14 1-Chloro-2-[2,2-dichloro-1-(4-chlorophenyl)ethyl]benzene -6.506 -6.201
15 1-Anthranol -4.721 -3.929
16 1-Bromo-2-naphthylisothiocyanate -0.319 -4.858
17 1-Bromo-3-chloropropane -1.848 -2.394
18 1-Bromo-3-fluorobenzene -2.666 -3.146
19 1-Butene -2.403 -2.109
20 1-Butyne -1.275 -1.911
21 1-Chloro-2,4-dinitronaphthalene -5.402 -4.238
22 1-Chloro-2-fluorobenzene -2.416 -2.613
23 1-Chloro-3-fluorobenzene -2.346 -2.637
24 1-Chloroheptane -3.996 -3.223
25 1-Ethyl-2-methylbenzene -3.207 -3.156
26 1-Heptene -3.733 -2.934
27 1-Heptyne -3.010 -2.714
28 1-Hexen-3-ol -2.344 -1.952
29 1-Methyl Tetrahydrofuran -1.538 -1.555
30 1-Methyl-1-cyclohexene -3.267 -2.605
31 1-Methylphenanthrene -5.854 -4.358
32 1-Naphthaleneacetic Acid -2.652 -3.037
33 1-Naphthol -3.519 -2.939
34 1-Naphthyl Isothiocyanate -4.602 -4.307
35 1-Nonene -5.053 -3.487
36 1-Nonyne -4.237 -3.263
37 1-Octyne -3.662 -2.997
38 1-Pentene -2.676 -2.392
39 17-Methyltestosterone -3.951 -3.260
40 2,2',3,3',4,4',6-Heptachlorobiphenyl -8.301 -7.524
41 2,2',4,5-Tetrachlorobiphenyl -7.252 -5.963
42 2,2',4,4'-Tetrachlorobiphenyl -6.123 -5.933
43 2,2',3,5'-Tetrachlorobiphenyl -6.562 -6.121
44 2,2',3,4,5'-Pentachlorobiphenyl -7.854 -6.648
45 2,2',3,3',4,4',5,5'-Octachlorobiphenyl -9.000 -7.928
46 2,2',3,4,5,5',6-Heptachlorobiphenyl -9.000 -7.564
47 2,2',3,4,6-Pentachlorobiphenyl -7.432 -6.587
48 2,2',3,3',5,6-Hexachlorobiphenyl -8.523 -7.135
49 2,2,3-Trimethyl-3-pentanol -0.833 -2.237
50 2,2,5,5-Tetramethyl-3-hexyne -3.833 -3.521
51 2,2,5-Trimethyl-3-hexyne -3.618 -3.270
52 2,2-Dimethyl-3-butanol -2.368 -1.938
53 2,2-Dimethyl-3-hexyne -4.143 -2.981
54 2,3',5-Trichlorobiphenyl -6.000 -5.381


55 2,3,4,5,6-Pentachlorophenoxyacetic Acid -3.745 -4.964
56 2,3,4,6-Tetrachlorophenoxyacetic Acid -3.409 -4.382
57 2,3,4-Trichlorophenoxyacetic Acid -3.097 -3.944
58 2,3,5-Trichloro-4-hydroxypyridine -4.286 -3.480
59 2,3,5-Trichlorophenoxyacetic Acid -3.000 -3.889
60 2,3,6-Trichlorophenoxyacetic Acid -2.620 -3.848
61 2,3-Dichlorophenoxyacetic Acid -2.810 -3.296
62 2,3-Dimethyl-1-butanol -2.133 -1.834
63 2,3-Dimethyl-2-pentanol -2.622 -2.132
64 2,3-Dimethyl-3-pentanol -2.595 -2.076
65 2,3-Xylenol -1.427 -2.334
66 2,4,6-Trichlorophenol -2.341 -3.826
67 2,4,6-Trichlorophenoxyacetic Acid -3.013 -3.837
68 2,4-Decadione -2.585 -2.236
69 2,4-Dimethyl-2-pentanol -2.683 -2.305
70 2,4-Dimethyl-3-pentanol -2.448 -2.172
71 2,4-Dimethyl-3-pentanone -3.046 -1.981
72 2,4-Dimethylquinoline -1.942 -3.341
73 2,4-Dinitrobenzoic Acid -1.067 -3.120
74 2,4-Dinitrophenol -2.598 -3.138
75 2,4-Lutidine -1.231 -2.370
76 2,4-Octadione -1.559 -1.644
77 2,5-Dichlorophenoxyacetic Acid -2.616 -3.313
78 2,5-Dimethyl-4-acetaminophenol -2.013 -2.034
79 2,5-Piperazinedione -0.831 -0.271
80 2,5-Xylenol -1.538 -2.407
81 2,6-Dichlorophenoxyacetic Acid -2.152 -3.287
82 2,6-Diethyl-4-acetaminophenol -2.531 -2.602
83 2,6-Diisopropyl-4-acetaminophenol -3.214 -2.977
84 2,6-Dimethyl-4-acetaminophenol -1.911 -2.023
85 2,6-Dimethyl-4-heptanol -3.904 -2.783
86 2,6-Dimethylnaphthalene -4.893 -3.766
87 2,6-Xylenol -1.305 -2.335
88 2-(2-Methyl-4-chlorophenoxy)propionic Acid -2.407 -3.191
89 2-Anthranol -4.328 -3.911
90 2-Chlorophenoxyacetic Acid -2.164 -2.639
91 2-Ethyl-1-butanol -3.152 -2.031
92 2-Ethylnaphthalene -4.291 -3.853
93 2-Fluorobenzyl Chloride -2.541 -2.623
94 2-Heptene -3.816 -2.909
95 2-Heptyne -3.770 -2.733
96 2-Hexanol -2.617 -1.998
97 2-Methyl-1-pentanol -2.976 -2.069
98 2-Methyl-1-pentene -3.033 -2.631
99 2-Methyl-2-hexanol -2.823 -2.241
100 2-Methyl-2-pentanol -2.244 -1.981
101 2-Methyl-3-hexyne -3.745 -2.756
102 2-Methyl-3-pentanol -2.451 -1.876
103 2-Methyl-4-acetaminophenol -1.595 -1.834
104 2-Methyl-4-penten-3-ol -2.260 -1.991
105 2-Methyl-5-t-butylphenol -2.594 -3.068
106 2-Methyldecalin -6.573 -3.702
107 2-Naphthoic Acid -3.886 -2.696
108 2-Naphthyl Isothiocyanate -4.444 -4.289
109 2-Nitrobenzaldehyde -3.878 -2.274
110 2-Pentene -2.538 -2.354
111 2-Thiouracil -2.257 -2.135
112 3,3'-Dichlorobiphenyl-4,4'-diamine -4.910 -3.903
113 3,3'-Dichlorobiphenyl -5.699 -4.777

114 3,3-Diphenylphthalide -4.855 -4.432
115 3,4,5-Trichlorophenoxyacetic Acid -2.939 -3.788
116 3,4,7,8-Tetramethyl-1,10-phenanthroline -5.222 -3.986
117 3,4-Dichlorophenoxyacetic Acid -2.684 -3.133
118 3,4-Xylenol -1.409 -2.408
119 3,5-Dichlorophenoxyacetic Acid -2.362 -3.096
120 3,5-Dinitrobenzoic Acid -2.197 -3.252
121 3,5-Pyridinedicarboxylic Acid -2.194 -1.400
122 3,5-Xylenol -1.398 -2.398
123 3-(5-tert-Butyl-1,3,4-thiadiazol-2-yl)-4-hydroxyl-l -1.877 -3.088
124 3-Bromo-2-nitrobenzoic Acid -2.872 -3.334
125 3-Bromobenzyl Isothiocyanate -3.971 -4.378
126 3-Carboxyphenylisothiocyanate -3.252 -3.038
127 3-Chloro-2-nitrobenzoic Acid -2.632 -2.687
128 3-Chlorobenzyl Isothiocyanate -3.863 -4.139
129 3-Chlorophenoxyacetic Acid -1.898 -2.447
130 3-Cyclohexyl-6-dimethylamino-1-methyl-1,3,5-triazine-2,4-dione -0.883 -2.268
131 3-Ethyl-3-pentanol -2.585 -2.210
132 3-Fluorobenzyl Chloride -2.544 -2.609
133 3-Heptanol -3.208 -2.259
134 3-Hexanol -2.547 -1.987
135 3-Hexanone -2.578 -1.763
136 3-Hexyne -2.167 -2.463
137 3-Hydroxy-5-methyl Isoxazole -0.067 -0.878
138 3-Hydroxyphenyl Isothiocyanate -1.991 -3.174
139 3-Methyl-1-butene -2.732 -2.381
140 3-Methyl-1-pentanol -3.121 -1.897
141 3-Methyl-2,4-pentadione -0.010 -1.132
142 3-Methyl-2-butanone -1.896 -1.479
143 3-Methyl-2-pentanol -2.466 -1.782
144 3-Methyl-2-pentanone -2.425 -1.814
145 3-Methyl-3-hexanol -2.734 -2.191
146 3-Methyl-3-pentanol -2.125 -1.890
147 3-Nitrobenzaldehyde -4.179 -2.293
148 3-Nitrobenzyl Isothiocyanate -4.086 -3.879
149 3-Nitropentane -1.955 -1.983
150 3-Nitrophenyl Isothiocyanate -3.553 -3.783
151 3-Nitrophthalic Acid -1.021 -1.862
152 3-Penten-2-ol -1.730 -1.811
153 3-Pentyl-2,4-pentadione -1.851 -2.006
154 3-Propyl-2,4-pentadione -0.876 -1.436
155 3-Thenoic Acid -1.474 -1.787
156 4,4'-Dimethylbiphenyl -6.000 -4.166
157 4,7-Dimethyl-1,10-phenanthroline -3.971 -3.616
158 4-(Methylthio)phenyl Dipropyl Phosphate -3.386 -3.280
159 4-(4-Chlorophenoxy)butyric Acid -3.290 -2.770
160 4-(2,4,5-Trichlorophenoxy)butyric Acid -3.829 -4.276
161 4-Benzoyl Phenylisothiocyanate -4.854 -4.286
162 4-Bromo-1-butene -2.247 -2.518
163 4-Bromobiphenyl -5.523 -4.708
164 4-Bromophenyl Isothiocyanate -0.268 -3.622
165 4-Carbethoxyphenylisothiocyanate -4.046 -3.612
166 4-Carboxyphenylisothiocyanate -3.975 -3.025
167 4-Chlorobenzyl Isothiocyanate -3.830 -4.152
168 4-Chlorophenyl Phenyl Ether -4.793 -4.151
169 4-Cyanobenzyl Isothiocyanate -3.495 -3.743
170 4-Dimethylaminophenyl Isothiocyanate -4.125 -3.628
171 4-Hexen-3-ol -2.165 -2.008

172 4-Hydroxyphenyl Isothiocyanate -2.668 -3.128
173 4-Methyl-1-pentene -3.244 -2.653
174 4-Methyl-3-pentanone -2.564 -1.813
175 4-Methylbenzaldehyde -1.724 -1.895
176 4-Methylbiphenyl -4.620 -3.924
177 4-Nitrobenzyl Isothiocyanate -3.633 -3.724
178 4-Nitrocatechol -1.571 -2.074
179 4-Nitroresorcinol -3.022 -2.113
180 4-Nonylphenol -4.498 -4.436
181 4-Penten-1-ol -1.924 -1.585
182 4-Penten-3-ol -1.766 -1.764
183 4-Vinyl-1-cyclohexene -3.335 -2.853
184 4-s-Butylphenol -2.194 -3.021
185 4-t-Butylphenol -2.413 -3.041
186 5,5-Dimethyl-2,4-hexadione -1.631 -1.753
187 5,5-Dipropylbarbituric Acid -2.398 -2.667
188 5,6-Dimethyl-2-thiouracil -2.056 -2.734
189 5-Bromo-2-nitrobenzoic Acid -1.521 -3.173
190 5-Bromo-3-tert-butyl-6-methyluracil -2.804 -3.657
191 5-Carboethoxy-2-thiouracil -2.099 -2.792
192 5-Chloro-2-nitrobenzoic Acid -1.319 -2.941
193 5-Ethyl-5-methylbarbituric Acid -1.096 -1.759
194 5-Ethyl-5-N-butylbarbituric Acid -1.638 -2.564
195 5-Ethyl-5-N-heptylbarbituric Acid -3.218 -3.307
196 5-Ethyl-5-N-hexylbarbituric Acid -3.049 -3.048
197 5-Ethyl-5-N-nonylbarbituric Acid -3.462 -3.856
198 5-Ethyl-5-N-octylbarbituric Acid -3.943 -3.582
199 5-Ethyl-5-N-propylbarbituric Acid -1.442 -2.241
200 5-Ethyl-5-pentylbarbituric Acid -2.177 -2.675
201 5-Methyl-2-thiouracil -2.446 -2.342
202 5-Nitro-1,10-phenanthroline -3.917 -3.543
203 6-Amino-2-thiouracil -2.747 -2.341
204 6-Methyl-2,4-heptadione -1.604 -1.897
205 6-Nitrophthalide -2.651 -2.556
206 7-Methylsulfinyl-2-xanthonecarboxylic Acid -5.062 -3.570
207 7-Methylthio-2-xanthonecarboxylic Acid -6.042 -4.030
208 Acetal -0.429 -1.870
209 Acetaminophen Acetate -2.781 -1.906
210 Acetaminophen Butyrate -2.826 -2.473
211 Acetaminophen Hexanoate -4.141 -3.040
212 Acetaminophen Laurate -4.745 -4.701
213 Acetaminophen Octanoate -4.443 -3.602
214 Acetaminophen Palmitate -4.892 -5.809
215 Acetaminophen Propionate -2.811 -2.181
216 Acetaminophen Stearate -4.922 -6.380
217 Adenine -2.119 -1.322
218 Adipic Acid -2.654 -1.194
219 Alachlor -3.261 -3.695
220 Allyl Bromide -1.499 -2.215
221 Ametryn -3.037 -3.408
222 Amikacin -0.500 2.060
223 Amitrole 0.522 -0.679
224 Ampyrone 0.554 -2.290
225 Amyl Acetate -1.877 -2.105
226 Ancymidol -2.596 -3.160
227 Androstenedione -3.699 -3.293
228 Anethole -3.126 -3.103
229 Anthraquinone -5.187 -3.220
230 Arginine 0.019 -0.482

231 Aspirin Phenylalanine Ethyl Ester -3.328 -3.278
232 Azinphos-methyl -4.039 -4.179
233 Barban -4.370 -4.146
234 Bendroflumethiazide -3.590 -3.535
235 Benzamide -0.956 -1.560
236 Benzhydrol -2.553 -3.325
237 Benzidine-2,2'-disulfonic Acid -4.634 -2.876
238 Benzoin -2.850 -3.116
239 Benzophenone -3.125 -3.043
240 Benzoyl-r-mandelic Acid -1.509 -3.408
241 Benzoylprop-ethyl -4.263 -5.408
242 Benzyl Alcohol -0.402 -1.909
243 Benzyl Isothiocyanate -3.137 -3.582
244 Benzylamine -1.533 -1.908
245 Bibenzyl -4.627 -4.296
246 Borneol -2.320 -2.563
247 Bromochloromethane -0.889 -2.139
248 Bromomethionic Acid 1.131 -1.712
249 Butadiyne -4.699 -2.021
250 Butyl Dibutyl Phosphinate -1.717 -1.753
251 CDAA -0.945 -2.781
252 Camphor -1.987 -2.384
253 Caproic Aldehyde -1.302 -1.806
254 Caprylic Aldehyde -2.360 -2.058
255 Carbazole -5.265 -3.330
256 Carbofuran -2.500 -2.378
257 Carbon Disulfide -3.170 -0.794
258 Carbonyl Sulfide -1.682 -1.954
259 Carboxin -3.141 -3.474
260 Carvacrol -2.080 -3.065
261 Chelidonic Acid -1.110 -1.596
262 Chloromethionic Acid -4.812 -1.444
263 Chloroneb -4.413 -3.675
264 Chloropicrin -2.006 -2.818
265 Chlorothalonil -5.647 -4.505
266 Chlorothiazide -3.020 -2.921
267 Chlorpropham -3.380 -3.439
268 Chlorpyrifos-methyl -4.907 -3.113
269 Chlorquinox -5.428 -5.049
270 Cinchonidine -3.168 -3.961
271 Cinchophen -3.193 -4.176
272 Cinnamaldehyde -1.991 -2.166
273 Citraconic Acid 0.779 -0.899
274 Cortisone -3.110 -2.867
275 Cortisone Acetate -4.277 -3.232
276 Cortisone Propionate -4.717 -3.192
277 Cumene Hydroperoxide -1.039 -2.409
278 Cyclobarbital -1.456 -2.508
279 Cycloheptane -3.515 -2.806
280 Cycloheptene -3.164 -2.650
281 Cyclohexanone -2.339 -1.312
282 Cyclooctane -4.152 -2.883
283 Cytosine -1.143 -0.874
284 D-Alanine 0.267 -0.611
285 D-Glutamic Acid -1.219 -0.773
286 DCPA -5.822 -4.796
287 d,l-2-(4-Chlorophenoxy)propionic Acid -2.134 -3.066
288 d,l-Aminooctanoic Acid -4.202 -1.954
289 d,l-Aspartic Acid -1.212 -0.601

290 d,l-Glutamic Acid -0.746 -0.698
291 d,l-Isoleucine -0.778 -1.382
292 d,l-Methionine -0.659 -1.570
293 d,l-Norvaline -0.145 -1.154
294 d,l-Phenylalanine -1.066 -1.995
295 d,l-alpha-Aminobutyric Acid 0.287 -0.904
296 DMPA -4.798 -4.202
297 Dalapon 0.545 -2.391
298 Daminozide -0.205 -0.533
299 Dazomet -3.876 -3.342
300 Decabromodiphenyl Ether -7.585 -6.483
301 Decyl-p-hydroxybenzoate -2.885 -4.198
302 Dexamethasone -3.644 -2.464
303 Dianisidine -3.610 -2.850
304 Dibenzo-18-crown-6 -4.693 -4.413
305 Dibenzothiophene -5.542 -4.281
306 Dibutyl Butyl Phosphonate -2.700 -2.307
307 Dibutyl Ethoxybutyl Phosphate -2.647 -3.030
308 Dibutyl Ethyl Phosphate -1.846 -2.140
309 Dibutyl Ethyl Phosphonate -1.569 -1.917
310 Dibutyl Hydrogen Phosphonate -1.425 -2.025
311 Dibutyl Methyl Phosphate -1.499 -2.039
312 Dibutyl Methyl Phosphonate -1.415 -1.467
313 Dicamba -1.691 -3.214
314 Dichlobenil -4.236 -3.581
315 Dichlofenthion -6.110 -3.750
316 Dichlone -6.357 -3.822
317 Dichlorodifluoromethane -2.635 -1.823
318 Dichlorophen -5.698 -4.484
319 Dichlorprop -2.453 -3.807
320 Dicofol -5.448 -6.544
321 Diethyl Amyl Phosphate -1.476 -1.821
322 Diethyl Butyl Phosphate -1.147 -1.638
323 Diethyl Hexyl Phosphonate -2.666 -1.737
324 Diethyl Trichloromethyl Phosphonate -1.754 -1.426
325 Diethylstilbestrol -4.350 -4.487
326 Digallic Acid -2.809 -2.687
327 Digitoxin -5.293 -4.771
328 Dimethirimol -2.242 -2.497
329 Dinitramine -5.467 -3.654
330 Dinoseb -3.665 -3.960
331 Diphenic Acid -2.284 -2.697
332 Diphenyl Methyl Phosphate -2.121 -2.592
333 Diphenylacetic Acid -3.222 -3.405
334 Diphenylnitrosamine -3.752 -3.623
335 Dixanthogen -4.959 -5.389
336 EPTC -2.703 -3.846
337 Estragole -2.921 -3.159
338 Ethalfluralin -6.222 -4.160
339 Ethoate-methyl -1.457 -2.134
340 Ethohexadiol -2.287 -1.849
341 Ethyl Cinnamate -2.996 -2.928
342 Ethyl Cyanoacetate -0.753 -1.556
343 Ethyl Dibutyl Phosphonate -1.200 -1.917
344 Ethyl Hydrocinnamate -2.909 -3.054
345 Ethyl Phthalate -2.347 -2.715
346 Ethyl Propionate -0.667 -1.437
347 Ethyl m-Isothiocyanobenzoate -3.602 -3.757
348 Ethyl p-Benzoate -2.319 -2.492

349 Ethylidene Chloride -1.292 -2.360
350 Ethylmalonic Acid 0.732 -1.310
351 Eugenol -1.824 -2.627
352 Fenarimol -4.383 -4.917
353 Fenbufen -5.046 -3.578
354 Fensulfothion -2.302 -3.569
355 Flufenamic Acid -4.398 -2.902
356 Fluometuron -3.412 -1.743
357 Fluorobenzene -1.792 -1.930
358 -5.843 -2.861
359 Fumaric Acid -1.220 -0.956
360 Glutamic Acid -1.235 -0.773
361 Glutamine -0.548 -0.337
362 Glyphosate -1.149 0.581
363 Heptanoic Acid -1.665 -1.960
364 Heptyl p-Hydroxybenzoate -2.234 -3.386
365 Hexachloro-1,3-butadiene -4.907 -4.854
366 Hexachlorobenzene -7.770 -5.329
367 Hexadecyl p-Hydroxybenzoate -2.981 -5.818
368 Hexobarbital -2.735 -2.910
369 Hexyl Acetate -2.451 -2.419
370 Hexyl p-Hydroxybenzoate -2.768 -3.114
371 Histidine -0.533 -0.901
372 Hydantoic Acid -0.483 -0.534
373 Ibuprofen -4.367 -3.407
374 Isoamyl Acetate -3.558 -2.124
375 Isoamyl Salicylate -3.157 -3.227
376 Isoamylmalonic Acid 0.543 -1.867
377 Isobutane -3.075 -2.188
378 Isobutyl Isobutyrate -2.403 -2.275
379 Isobutylbenzene -4.123 -3.469
380 Isobutylene -2.329 -2.054
381 Isobutyraldehyde 0.091 -1.351
382 Isoprene -2.026 -2.334
383 Isopropalin -6.491 -4.909
384 Isopropyl Ether -4.608 -2.049
385 Isopropyl tert-Butyl Ether -2.366 -2.395
386 L-Asparagine -0.652 -0.351
387 L-Cystine -3.343 -2.327
388 Glutamic Acid -1.235 -0.773
389 Histidine -0.533 -0.901
390 L-Isoleucine -0.582 -1.428
391 L-Mandelic Acid -0.233 -1.923
392 Lactamide -5.057 -0.578
393 Lenacil -4.592 -2.712
394 Leptophos -5.235 -5.609
395 Levodopa -1.717 -1.389
396 Limonene -4.194 -3.312
397 Linalool -1.987 -2.879
398 MCPA -2.233 -2.945
399 MCPB -3.678 -3.238
400 Maleic Acid 0.579 -0.949
401 Mandelic Acid -5.924 -2.030
402 Meconic Acid -1.377 -2.011
403 Meconin -1.890 -2.402
404 Menthol -2.535 -2.602
405 Methacrylonitrile -2.167 -1.769
406 Methidathion -3.100 -3.376
407 Methionine -0.421 -1.570

408 Methomyl -0.447 -2.070
409 Methotrimeprazine -5.960 -5.427
410 Methyl Benzoate -1.839 -2.217
411 Methyl Butyrate -0.833 -1.595
412 Methyl Dixanthogen -3.939 -4.791
413 Methyl Isopropyl Ether -1.802 -1.599
414 Methyl Oxalate -0.292 -1.380
415 Methyl Propyl Ether -2.130 -1.785
416 Methyl m-Isothiocyanobenzoate -3.565 -3.502
417 Methylamine 1.605 -0.677
418 Methylaniline -1.280 -2.120
419 Methylmalonic Acid 0.760 -1.061
420 Methyltestosterone Acetate -4.854 -2.524
421 Methylthiouracil -2.426 -2.562
422 Metolazone -3.783 -3.828
423 Mirex -6.807 -8.824
424 Monolinuron -2.465 -2.719
425 Mustard Gas -2.363 -2.964
426 Myristic Acid -5.301 -3.878
427 Myristyl Alcohol -6.053 -4.127
428 N',N'-Dimethyl-m-aminophenyl Isothiocyanate -3.710 -3.650
429 Naproxen -4.161 -3.482
430 Naptalam -3.163 -3.639
431 Neburon -4.758 -4.242
432 Neopentyl Alcohol -2.146 -1.800
433 Niridazole -4.962 -2.763
434 Nitralin -5.760 -4.499
435 Nitrilotriacetic Acid -0.510 -0.608
436 Nitroethane -1.917 -1.439
437 Nitroguanidine -1.374 -1.432
438 Nitromethane 0.256 -1.265
439 Nonyl Aldehyde -3.171 -2.326
440 Nonyl p-Hydroxybenzoate -2.316 -3.917
441 Norflurazon -4.035 -2.834
442 Octadecyl-p-hydroxybenzoate -3.079 -6.375
443 Octyl p-Hydroxybenzoate -2.485 -3.670
444 Octylamine -2.810 -2.586
445 Oxamyl 0.106 -2.708
446 Oxanilic Acid -1.302 -2.244
447 Oxycarboxin -2.427 -2.526
448 Palmitic Acid -5.523 -4.424
449 Parathion -4.084 -3.290
450 Pentachlorbenzyl Alcohol -6.147 -4.886
451 Pentachloroethane -2.607 -3.641
452 Pentachlorophenol -4.208 -4.929
453 Pentamethylmelamine -1.958 -1.573
454 Phenacetin -2.350 -2.103
455 Phenetole -2.301 -2.592
456 Phenothiazine -5.097 -4.073
457 Phenoxyacetic Acid -3.959 -1.956
458 Phenyl Isothiocyanate -3.177 -3.392
459 Phenyl Salicylate -3.155 -3.398
460 Phosmet -4.104 -3.897
461 Phthalimide -2.611 -1.919
462 Picloram -2.749 -3.370
463 Pindone -4.107 -3.122
464 Pipemidic Acid -2.975 -2.083
465 Pirimicarb -1.946 -2.624
466 Prednisolone -3.953 -3.105

467 Propyl Acetate -2.453 -1.485
468 Propyl Dixanthogen -5.699 -6.363
469 Propylthiouracil -2.151 -3.138
470 Propyne -1.042 -1.639
471 Propyzamide -4.232 -3.956
472 Protoporphyrin IX -3.721 -7.160
473 Pyrocatechol -5.378 -1.817
474 Quinethazone -5.031 -2.480
475 Quinhydrone -1.730 -1.662
476 Quinidine -3.365 -4.283
477 Resorcinol -5.186 -1.731
478 Rhodanine -1.772 -2.583
479 Saccharin -1.629 -2.192
480 Salicin -0.855 -1.176
481 Salicylanilide -3.589 -3.104
482 Serine 0.607 -0.308
483 Siduron -4.111 -3.365
484 Sucrose 0.793 0.347
485 Sulfapyridine -2.969 -2.776
486 Sulfathiazole -2.835 -3.224
487 Tetradecyl p-Hydroxybenzoate -2.975 -5.321
488 Tetrahydropyran -1.776 -1.659
489 Thiometon -3.091 -3.536
490 Thionazin -2.338 -2.624
491 Threonine -0.089 -0.488
492 Thymine -1.519 -0.938
493 Triallate -4.882 -5.516
494 Tributylamine -3.000 -3.974
495 Tributylphosphine Oxide -0.593 -0.091
496 Trichlorfon -0.223 -2.151
497 Tricyclazole -2.073 -3.165
498 Tridecyl p-Hydroxybenzoate -2.945 -5.050
499 Trietazine -4.060 -3.298
500 Triethyl Phosphate 0.439 -1.100
501 Triethylamine -0.138 -2.265
502 Trimethoprim -2.861 -2.836
503 Trimethyl Phosphate 0.553 -0.629
504 Trimethylamine 0.841 -1.561
505 Triphenylcarbinol -2.260 -4.631
506 Tripropylamine -2.301 -3.176
507 Undecane -7.553 -4.137
508 Undecyl p-Hydroxybenzoate -2.092 -4.488
509 Uric Acid -3.730 -1.188
510 Vanillin -1.140 -1.756
511 Xylene -3.000 -2.849
512 Xylidine -2.040 -2.273
513 α,α,α-Trifluoro-o-toluic Acid -1.598 -1.316
514 2,2'-Bipyridine -1.420 -3.021
515 alpha-1,2,3,4,5,6-Hexachlorocyclohexane -5.163 -4.975
516 alpha-Endosulfan -5.885 -5.455
517 alpha-Hydroxycaproamide -1.081 -1.286
518 beta,beta,beta-Trichlorolactic Acid -2.645 -2.705
519 beta-1,2,3,4,5,6-Hexachlorocyclohexane -6.084 -4.757
520 beta-Alanine -5.213 -0.683
521 beta-Aminobutyric Acid 1.084 -0.488
522 beta-Endosulfan -6.162 -5.641
523 cis-1,2-Dimethylcyclohexane -4.272 -3.055
524 d-Borneol -2.319 -2.530
525 d-Camphoric Acid -1.421 -2.014

526 d-Fenchone -1.851 -2.416
527 d-Limonene -3.996 -3.316
528 dl-2-Octanol -2.036 -2.541
529 epsilon-Aminocaproic Acid -5.415 -1.053
530 4,4'-Bipyridine -1.538 -2.889
531 gamma-Aminobutyric Acid 1.101 -0.651
532 l-Menthone -2.492 -2.562
533 m-Acetoxyphenyl Isothiocyanate -3.114 -3.547
534 m-Acetylphenyl Isothiocyanate -4.328 -3.423
535 m-Biphenyl Isothiocyanate -4.523 -4.709
536 m-Cyanophenyl Isothiocyanate -3.193 -3.542
537 m-Ethoxyphenyl Isothiocyanate -3.420 -3.790
538 m-Fluorobenzoic Acid -1.970 -1.642
539 m-Isopropoxyphenyl Isothiocyanate -3.328 -4.109
540 m-Isothiocyanobenzoic Acid -3.097 -3.297
541 m-Isothiocyanophenyl Isothiocyanate -4.699 -4.109
542 m-Methylphenyl Isothiocyanate -3.848 -3.648
543 m-Terphenyl -5.155 -4.980
544 m-Toluenesulfonamide -1.341 -2.453
545 n-2-Hydroxy-n2,n4,n4,n6,n6-pentamethylmelamine -2.371 -1.802
546 n-Amyl Bromide -3.077 -2.937
547 n-Amyl beta-Ethoxypropionate -2.196 -2.845
548 n-Butyl Chloride -2.025 -2.356
549 n-Butyl Ether -3.592 -2.744
550 n-Butyl beta-Ethoxypropionate -1.639 -2.404
551 n-Butylmalonic Acid 0.437 -1.771
552 n-Capric Acid -3.445 -2.789
553 n-Ethyl beta-Ethoxypropionate -0.421 -1.857
554 n-Hexyl beta-Ethoxypropionate -2.829 -2.936
555 n-Methyl beta-Ethoxypropionate -0.072 -1.758
556 n-Methylolpentamethylmelamine -2.400 -1.666
557 n-Methylolpentamethylmelamine Methyl Ether -2.205 -2.230
558 n-Octyl Bromide -5.063 -3.678
559 n-Propyl beta-Ethoxypropionate -1.017 -2.136
560 n-Propylcyclopentane -4.740 -2.998
561 n-Propylmalonic Acid 0.680 -1.390
562 n-Valeraldehyde -0.867 -1.249
563 n2,n2,n4,n4-Tetramethylmelamine -2.688 -1.332
564 n2,n4,n6-Triethyl-n2,n4,n6-trimethylmelamine -3.703 -2.998
565 n6,n6-Diethyl-n2,n2,n4,n4-tetramethylmelamine -3.507 -2.616
566 o,p'-DDE -6.357 -6.399
567 o-Chlorobenzoic Acid -1.872 -2.191
568 o-Chlorophenol -1.054 -2.587
569 o-Fluorobenzoic Acid -1.289 -1.250
570 o-Nitrophenol -1.745 -2.368
571 o-Phenylphenol -2.386 -3.337
572 o-Terphenyl -5.301 -5.014
573 o-Tolidine -2.213 -3.111
574 o-Toluenesulfonamide -2.023 -2.367
575 o-Toluic Acid -2.060 -1.951
576 o-Toluidine -0.860 -2.042
577 4-(Dodecyloxy)benzoic Acid -2.447 -4.514
578 p-Acetylphenyl Isothiocyanate -0.022 -3.122
579 p-Anisaldehyde -1.502 -1.662
580 p-Biphenyl Isothiocyanate -4.854 -4.705
581 p-Cresol -0.701 -2.243
582 p-Dibromobenzene -4.072 -4.302
583 p-Ethoxyphenyl Isothiocyanate -4.260 -3.718
584 p-Ethylphenol -1.397 -2.588

585 p-Fluorobenzoic Acid -2.067 -1.625
586 p-Methylbenzyl Isothiocyanate -3.796 -3.869
587 p-Phenylenediamine -0.379 -1.196
588 p-Phenylphenol -3.481 -3.323
589 p-Toluenesulfonamide -1.734 -2.350
590 p-Tolyl Isothiocyanate -4.721 -3.638
591 p-tert-Pentylphenol -2.990 -3.250
592 l-Mandelic Acid -0.233 -1.923
593 S-Trioxane 0.288 -0.479
594 tert-Amylbenzene -4.150 -3.552
595 trans-Crotonic Acid 0.000 -1.287
596 trans-Stilbene -5.793 -4.322

Appendix B

The measures of skewness and kurtosis of the distribution of surface property values were added to the set of Parasurf '07 statistical descriptors during the study of phospholipidosis-inducing drugs. These measures are described briefly below.

Skewness, or the third standardized moment, is a measure of the asymmetry of a data distribution, describing the left- or right-handedness of a distribution of values. It is described by the equation:

$$\gamma_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^3}{(N-1)\,\sigma^3}$$

where $\bar{x}$ is the mean, $\sigma$ is the standard deviation, and $N$ is the number of data points. The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right.

Kurtosis, or the fourth standardized moment, is a measure of whether the data distribution is peaked or flat relative to a normal distribution and is described by the equation:

$$\gamma_2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^4}{(N-1)\,\sigma^4}$$

where $\bar{x}$ is the mean, $\sigma$ is the standard deviation, and $N$ is the number of data points. Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.
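The two measures share the same form and differ only in the power applied to the deviations from the mean, so they can be illustrated with a single helper. The following is a minimal sketch and not the Parasurf implementation: the function names are hypothetical, and it assumes that σ is the sample standard deviation (computed with an N − 1 denominator), which the text above does not specify.

```python
import math


def standardized_moment(values, order):
    """Return sum((x_i - mean)**order) / ((N - 1) * sigma**order).

    Assumption: sigma is the sample standard deviation (N - 1 denominator).
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two data points are required")
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    sigma = math.sqrt(variance)
    if sigma == 0.0:
        raise ValueError("all values are identical; the moment is undefined")
    return sum((x - mean) ** order for x in values) / ((n - 1) * sigma ** order)


def skewness(values):
    """Third standardized moment: negative = skewed left, positive = skewed right."""
    return standardized_moment(values, 3)


def kurtosis(values):
    """Fourth standardized moment: larger values indicate heavier tails."""
    return standardized_moment(values, 4)


if __name__ == "__main__":
    symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
    right_skewed = [1.0, 1.0, 1.0, 10.0]
    print(skewness(symmetric), kurtosis(symmetric))
    print(skewness(right_skewed))
```

For the symmetric sample 1, 2, 3, 4, 5 the skewness is zero and the kurtosis is approximately 1.36, while the sample 1, 1, 1, 10 gives a positive skewness, in line with the sign convention described above. Note that under this definition no constant is subtracted, so a normal distribution has a kurtosis close to 3 for large N rather than 0.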

References

1. Kennedy, S. P.; Bormann, B. J. Effective partnering of academic and physician scientists with the pharmaceutical drug development industry. Experimental Biology and Medicine 2006, 231, 1690-1694. 2. Silverman, R. B. The Organic Chemistry of Drug Design and Drug Action. 2nd ed.; Elsevier Academic Press: New York, 2004. 3. Kubinyi, H. Opinion: Drug research: myths, hype, and reality. Nature Reviews Drug Discovery 2003, 2, 665-668. 4. Kubinyi, H. Lectures of the Drug Design Course. http://www.kubinyi.de/lectures.html 5. Hammett, L. P. Effect of structure upon the reactions of organic compounds. Benzene derivatives. Journal of the American Chemical Society 1937, 59, 96-103. 6. Hansch, C.; Maloney, P.; Fujita, T.; Muir, R. M. Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature 1962, 194, 178-80. 7. Fischer, H.; Gottschlich, R.; Seelig, A. Blood-Brain Barrier Permeation: Molecular Parameters Governing Passive Diffusion. Journal of Membrane Biology 1998, 165, 201-211. 8. Overton, E. Osmotic properties of the cells and their importance for toxicology and pharmacology. Zeitschrift für Physikalische Chemie, Stöchiometrie und Verwandtschaftslehre 1897, 22, 189-209. 9. Sangster, J. Octanol-Water Partition Coefficients: Fundamentals and Physical Chemistry. Wiley: New York, 1997; p 79-112. 10. Meylan, W. M.; Howard, P. H. Atom/fragment contribution method for estimating octanol-water partition coefficients. Journal of Pharmaceutical Science 1995, 84, 83-92. 11. Hansch, C.; Steward, A. R.; Anderson, S. M.; Bentley, D. The parabolic dependence of drug action upon lipophilic character as revealed by a study of hypnotics. Journal of Medicinal Chemistry 1968, 11(1), 1-11. 12. Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews 1997, 1997, 3-25. 
13. Clark, T. Quantum Cheminformatics: An Oxymoron? Beilstein Institute Workshop, Chemical Data Analysis in the Large, May 22-26, 2000, Bozen, Italy 2000. 14. Clark, T.; Ford, M.; Essex, J.; Richards, W. G.; Ritchie, D. W. A non-atom-based paradigm for modeling QSAR and QSPR. In QSAR and Molecular Modelling in Rational Design of Bioactive Molecules, Proceedings of the 15th European Symposium on Structure-Activity Relationships (QSAR) and Modelling Istanbul, Turkey, Sept. 5-10, 2004. 15. Monard, G.; Kenneth M. Merz, J. Combined Quantum Mechanical/Molecular Mechanical Methodologies Applied to Biomolecular Systems. Accounts of Chemical Research 1999, 32(10), 904-911. 16. Warshel, A.; Levitt, M. Theoretical studies of enzymic reactions: dielectric, electrostatic and steric stabilization of the carbonium ion in the reaction of lysozyme. Journal of Molecular Biology 1976, 103(2), 227-49.

152

17. Clark, T. Modelling the chemistry: time to break the mould? Euro QSAR 2002, Designing drugs and crop protectants, 111-121. 18. Murray, J. S.; Lane, P.; Brinck, T.; Paulsen, K.; Grice, M. E.; Politzer, P. Relationships of critical constants and boiling points to computed molecular surface properties. Journal of Physical Chemistry 1993, 97(37), 9369-9373. 19. Murray, J. S.; Politzer, P. Statistical analysis of the molecular surface electrostatic potential: an approach to describing noncovalent interactions in condensed phases. Journal of Molecular Structure 1998, 425, 107-114. 20. Murray, J. S.; Ranganathan, S.; Politzer, P. Correlations between the solvent hydrogen bond acceptor parameter β and the calculated molecular electrostatic potential. Journal of Organic Chemistry 1991, 56, 3734-3737. 21. Politzer, P.; Lane, P.; Murray, J. S.; Brinck, T. Investigation of relationships between solute molecule surface electrostatic potentials and solubilities in supercritical fluids. Journal of Physical Chemistry 1992, 96(20), 7938-7943. 22. Politzer, P.; Murray, J. S. Molecular electrostatic potentials and chemical reactivity. In Rev. Comput. Chem., Lipkowitz, K.; Boyd, R. B., Eds. VCH: New York, 1998; Vol. 2, p 273. 23. Politzer, P.; Murray, J. S.; Peralta-Inga, Z. Molecular Surface Electrostatic Potentials in Relation to Noncovalent Interactions in Biological Systems. International Journal of Quantum Chemistry 2001, 85, 676-684. 24. Ehresmann, B.; Martin, B.; Horn, A. H. C.; Clark, T. Local molecular properties and their use in predicting reactivity. Journal of Molecular Modeling 2003, 9, 342- 347. 25. Ehresmann, B.; Groot, M. J. d.; Alex, A.; Clark, T. New Molecular Descriptors Based on Local Properties at the Molecular Surface and a Boiling-Point Model Derived from Them. Journal of Chemical Information and Computational Sciences 2004, 43, 658-668. 26. Sjoberg, P.; Murray, J. S.; Brinck, T.; Politzer, P. A. 
Average local ionization energies on the molecular surfaces of aromatic systems as guides to chemical reactivity. Canadian Journal of Chemistry 1990, 68, 1440-1443. 27. Mulliken, R. S. New electroaffinity scale; together with data on valence states and on valence ionization potentials and electron affinites. Journal of Chemical Physics 1934, 2, 782-93. 28. Mulliken, R. S. Electronic population analysis on LCAO-MO molecular wave functions. II. Overlap populations, bond orders, and covalent bond energies. Journal of Chemical Physics 1955, 23, 1833-40. 29. Pearson, R. G. Density functional theory: electronegativity and hardness. Chemtracts: Inorganic Chemistry 1991, 3(6), 317-33. 30. Schürer, G.; Gedeck, P.; Gottschalk, M.; Clark, T. Accurate parametrized variational calculations of the molecular electronic polarizability by NDDO-based methods. International Journal of Quantum Chemistry 1999, 75, 17. 31. Jäger, R.; Kast, S. M.; Brickmann, J. Parameterization Strategy for the MolFESD Concept: Quantitative Surface Representation of Local Hydrophobicity. Journal of Chemical Information and Computational Sciences 2003, 43, 237-247. 32. Jäger, T.; Schmidt, F.; Schilling, B.; Brickmann, J. Localization and quantification of hydrophobicity; The molecular free energy density (MolFESD) concept and its application to the sweetness recognition. Journal of Computer-Aided Molecular Design 2000, 14, 631-646. 33. Pixner, P.; Heiden, W.; Merx, H.; Möller, A.; Moeckel, G.; Brickmann, J. Empirical Method for the Quantification and Localization of Molecular

153

Hydrophobicity. Journal of Chemical Information and Computational Sciences 1994, 34, 1309-1319. 34. Ehresmann, B.; Groot, M. J. d.; Clark, T. A Surface-Integral Solvation Energy Model: The Local Solvation Energy. Journal of Chemical Information and Computational Sciences 2005, 45, 1053-1060. 35. Clark, T.; Lin, J.-H.; Horn, A. H. C. Parasurf '06, A1; CEPOS InSilico Ltd.: 26 Brookfield Gardens Ryde, Isle of Wight PO33 3NP, 2005. 36. SYBYL 7.0, Tripos Inc.: 1699 South Hanley Rd., St. Louis, Missouri, 63144, USA. 37. Politzer, P.; Weinstein, H. Some relations between electronic distribution and electronegativity. Journal of Chemical Physics 1979, 71, 4218-4220. 38. Koopmans, T. C. The distribution of wave function and characteristic value among the individual electrons of an atom. Physica 1933, 1, 140-113. 39. Lin, J.-H.; Clark, T. An Analytical, Variable Resolution, Complete Description of Static Molecules and Their Intermolecular Binding Properties. Journal of Chemical Information and Modelling 2005, 45(4), 1010-1016. 40. Rivail, J.-L.; Cartier, A. Variational Calculation of Electronic Multipole Molecular Polarizabilites. Molecular Physics 1978, 36, 1085-1097. 41. Rivail, J.-L.; Cartier, A. An Extended Variational Method for Calculating Molecular Multipole Polarizabilities. Chemical Physics Letters 1979, 61, 469-472. 42. Martin, B.; Clark, T. Dispersion treatment for NDDO-based semiempirical MO techniques. International Journal of Quantum Chemistry 2006, 106(5), 1208-1216. 43. Martin, B.; Gedeck, P.; Clark, T. Additive NDDO-based atomic polarizability model. International Journal of Quantum Chemistry 2000, 77(1), 473-497. 44. Schamberger, J.; Gedeck, P.; Martin, B.; Schindler, T.; Hennemann, M.; Horn, A. H. C.; Ehresmann, B.; Clark, T. GEISHA, Erlangen, Germany, 2003. 45. DeLano, W. L. The PyMOL Molecular Graphics System, DeLano Scientific: Palo Alto, CA, USA, 2002. 46. Lombardo, F.; Shalaeva, M. Y.; Tupper, K. A.; Gao, F.; Abraham, M. H. 
ElogPoct: A Tool for Lipophilicity Determination in Drug Discovery. Journal of Medicinal Chemistry 2000, 43, 2922-2928. 47. Mannhold, R.; Cruciani, G.; Dross, K.; Rekker, R. Multivariate analysis of experimental and computational descriptors of molecular lipophilicity. Journal of Computer-Aided Molecular Design 1998, 12, 573-581. 48. Mannhold, R.; van de Waterbeemd, H. Substructure and whole molecule approaches for calculating logP. Journal of Computer-Aided Molecular Design 2001, 15, 337-354. 49. Nadig, G.; Zant, L. C. V.; Dixon, S. L.; Kenneth M. Merz, J. Charge-Transfer Interactions in Macromolecular Systems: A New View of the Protein/Water Interface. Journal of the American Chemical Society 1998, 120(22), 5593-5594. 50. CORINA 3D Structure Generator, Molecular Networks, GmbH: Erlangen, Germany, 2006. 51. Sadowski, J.; Gasteiger, J.; Klebe, G. Comparison of Automatic Three- Dimensional Model Builders Using 639 X-Ray Structures. Journal of Chemical Information and Computational Sciences 1994, 34, 1000-1008. 52. Dewar, M. J. S.; Zoebisch, E. G.; Healy, E. F.; Stewart, J. J. P. Development and use of quantum mechanical molecular models. 76. AM1: a new general purpose quantum mechanical molecular model. Journal of the American Chemical Society 1985, 107(13), 3902-9.

154

53. Clark, T.; Alex, A.; Beck, B.; Burkhardt, F.; Chandrasekhar, J.; Gedeck, P.; Horn, A. H. C.; Hutter, M.; Martin, B.; Rauhut, G.; Sauer, W.; Schindler, T.; Steinke, T. VAMP, 9.0; Accelrys Inc.: San Diego, 2003. 54. Winget, P.; Horn, A. H. C.; Selçuki, C.; Martin, B.; Clark, T. AM1* Parameters for Phosphorous, Sulfur and Chlorine. J. Mol. Model. 2003, 9, 408-414. 55. Rinaldi, D.; Rivail, J.-L. Molecular polarizabilities and dielectric effect of the medium in the liquid state. Theoretical study of the water molecule and its dimers. Theor. Chim. Acta 1973, 32, 57. 56. Rinaldi, D.; Rivail, J.-L. Calculation of molecular electronic polarizabilities. Comparison of different methods. Theor. Chim. Acta 1974, 32, 243-251. 57. TSAR 3.3, 3.3; Oxford Molecular Ltd.: Oxford, England, 2000. 58. Breindl, A.; Beck, B.; Clark, T. Prediction of the n-Octanol/Water Partition Coefficient, logP, Using a Combination of Semiempirical MO-Calculations and a Neural Network. Journal of Molecular Modelling 1997, 3, 142-155. 59. Hansch, C.; Leo, A.; Hoekman, D. Exploring QSAR: Hydrophobic, Electronic, and Steric Constants. The American Chemical Society: Washington, D.C., 1995. 60. Sotomatsu, T.; Nakagawa, Y.; Fujita, T. Quantitative Structure-Activity Studies of Benzoylphenylurea Larvicides. Pesticides Biochem. and Physiol. 1987, 27, 156- 164. 61. Klammt, A.; Schüürmann, G. COSMO: A new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. J. Chem. Soc., Perkin Transactions 1993, 2, 799-805. 62. Schüürmann, G. Prediction of Henry's Law Constant of Benzene Derivatives Using Quantum Chemical Continuum-Solvation Models. Journal of Computational Chemistry 2000, 21, 17-34. 63. Thompson, J. D.; Cramer, C. J.; Truhlar, D. G. New Universal Solvation Model and Comparison of the Accuracy of the SM5.42R, SM5.43R, C-PCM, D-PCM, and IEF-PCM Continuum Solvation Models for Aqueous and Organic Solvation Free Energies and for Vapor Pressures. 
Journal of Physical Chemistry B 2004, 108, 6532-6542. 64. Wang, J.; Wang, W.; Huo, S.; Lee, M.; Kollman, P. A. Solvation Model Based on Weighted Solvent Accessible Surface Area. Journal of Physical Chemistry B 2001, 105, 5055-5067. 65. Tehan, B. G.; Lloyd, E. J.; Wong, M. G.; Pitt, W. R.; Gancia, E.; Manallack, D. T. Estimation of pKa Using Semiempirical Molecular Orbital Methods. Part 2: Application to Amines, Anilines, and Various Nitrogen Containing Heterocyclic Compounds. Quantitative Structure-Activity Relationships 2002, 21. 66. Physical/Chemical Property Database (PHYSPROP), Syracuse Research Corporation, Environmental Research Center: Syracuse, NY, USA. 67. Shirakawa, H.; Louis, E. J.; MacDiarmid, A. G. Synthesis of electrically conducting organic polymers: halogen derivatives of polyacetylene, (CH)x. J. Chem. Soc., Chem. Commun. 1977, 578-580. 68. Thomas, K. R. J.; Lin, J. T.; Tao, Y.-T.; Chuen, C.-H. Quinoxalines Incorporating Triarylamines: Potential Electroluminescent Materials with Tunable Emission Characteristics. Chemistry of Materials 2002, 14, 2796-2802. 69. Thomas, K. R. J.; Lin, J. T.; Tao, Y.-T.; Ko, C.-W. New Star-Shaped Luminescent Triarylamines: Synthesis, Thermal, Photophysical, and Electroluminescent Characteristics. Chemistry of Materials 2002, 14, 1354-1361.


70. Yin, S.; Shuai, Z.; Wang, Y. A Quantitative Structure-Property Relationship Study of the Glass Transition Temperature of OLED Materials. Journal of Chemical Information and Computer Sciences 2003, 43, 970-977.
71. Yalkowsky, S. H.; Dannenfelser, R. M. AQUASOL database of aqueous solubility. In College of Pharmacy, University of Arizona, Tucson, AZ: 2000.
72. ACD/Solubility DB, release 10.0, Advanced Chemistry Development, Inc.: Toronto ON, Canada, 2006.
73. Cheng, A.; Merz, K. M., Jr. Prediction of Aqueous Solubility of a Diverse Set of Compounds Using Quantitative Structure-Property Relationships. Journal of Medicinal Chemistry 2003, 46(17), 3572-3580.
74. Delaney, J. S. ESOL: Estimating Aqueous Solubility Directly from Molecular Structure. Journal of Chemical Information and Computer Sciences 2004, 44(3), 1000-1005.
75. Xie, L.; Liu, H. The Treatment of Solvation by a Generalized Born Model and a Self-Consistent Charge-Density Functional Theory-Based Tight-Binding Model. Journal of Computational Chemistry 2002, 23, 1404-1415.
76. Reasor, M. J. A review of the biology and toxicologic implications of the induction of lysosomal bodies by drugs. Toxicology and Applied Pharmacology 1989, 97, 47-56.
77. Anderson, N.; Borlak, J. Drug-induced phospholipidosis. Federation of European Biochemical Societies Letters 2006, 580, 5533-5540.
78. Reasor, M. J.; Kacew, S. Drug-Induced Phospholipidosis: Are There Functional Consequences? Experimental Biology and Medicine 2001, 226, 825-830.
79. Halliwell, W. H. Cationic amphiphilic drug-induced phospholipidosis. Toxicologic Pathology 1997, 25, 53-60.
80. Fujita, T.; Iwasa, J.; Hansch, C. A new substituent constant, π, derived from partition coefficients. Journal of the American Chemical Society 1964, 86(23), 5175-5180.
81. Coulombe, P. A.; Kan, F. W.; Bendayan, M. Introduction of a high-resolution cytochemical method for studying the distribution of phospholipids in biological tissues. European Journal of Cell Biology 1988, 46(3), 564-76.
82. Bauknecht, H.; Zell, A.; Bayer, H.; Levi, P.; Wagener, M.; Sadowski, J.; Gasteiger, J. Locating biologically active compounds in medium-sized heterogeneous datasets by topological autocorrelation vectors: dopamine and benzodiazepine agonists. Journal of Chemical Information and Computer Sciences 1996, 36(6), 1205-13.
83. Sadowski, J.; Wagener, M.; Gasteiger, J. Assessing similarity and diversity of combinatorial chemistry libraries by spatial autocorrelation functions and neural networks. Angewandte Chemie, Int'l Ed. 1996, 34(24), 2674-7.
84. Boser, B. E.; Guyon, I.; Vapnik, V. N. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory 1992, 5, 144-152.
85. Burges, C. J. C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 1998, 2, 121-167.
86. Cortes, C.; Vapnik, V. Support-Vector Networks. Machine Learning 1995, 20(3), 273-297.
87. Friedman, J. H. Multivariate Adaptive Regression Splines. Annals of Statistics 1991, 19(1), 1-141.
88. Friedman, J. H. Estimating functions of mixed ordinal and categorical variables using adaptive splines. In New Directions in Statistical Data Analysis and Robustness, Morgenthaler, S.; Ronchetti, E.; Stahl, W. A., Eds. Birkhäuser: 1993; pp 73-113.
89. Schölkopf, B.; Sung, K.-K.; Burges, C. J. C.; Girosi, F.; Niyogi, P.; Poggio, T.; Vapnik, V. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. on Signal Processing 1997, 45, 2758-2765.
90. Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. Journal of Chemical Information and Computer Sciences 1988, 28, 31-36.
91. Chang, C.-C.; Lin, C.-J. LIBSVM: a Library for Support Vector Machines. 2003.
92. Cherkassky, V.; Gehring, D.; Mulier, F.; Friedman, J. H.; Masters, T. XTAL Software Package, ver. 5, University of Minnesota Electrical Engineering Dept.: Minnesota, 1995.
93. Tomizawa, K.; Sugano, K.; Yamada, H.; Horii, I. Physicochemical and Cell-Based Approach for Early Screening of Phospholipidosis-Inducing Potential. Journal of Toxicological Sciences 2006, 31(4), 315-324.
94. Ploemen, J.-P. H. T. M.; Kelder, J.; Hafmans, T.; Sandt, H. v. d.; Burgsteden, J. A. v.; Salemink, P. J. M.; Esch, E. v. Use of physicochemical calculation of pKa and ClogP to predict phospholipidosis-inducing potential. Experimental and Toxicologic Pathology 2004, 55, 347-355.
95. Fischer, H.; Kansy, M.; Potthast, M.; Csato, M. Prediction of in vitro phospholipidosis of drugs by means of their amphiphilic properties. In Rational Approaches to Drug Design, Proceedings of the 13th European Symposium on Quantitative Structure-Activity Relationships, Hoeltje, H. D.; Sippl, W., Eds. Prous Science: Barcelona, 2001; pp 286-289.
96. Lipinski, C. A.; Lombardo, F.; Dominy, B. W. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews 2001, 46, 3-26.
97. Cramer, R. D., III; Patterson, D. E.; Bunce, J. D. Comparative Molecular Field Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins. Journal of the American Chemical Society 1988, 110, 5959-5967.
98. Poso, A.; Juvonen, R.; Gynther, J. Comparative molecular field analyses of compounds with CYP2A5 binding affinity. Quantitative Structure-Activity Relationships 1995, 14, 507-511.
99. Geladi, P.; Kowalski, B. Partial least squares regression: A tutorial. Analytica Chimica Acta 1986, 185, 1-17.
100. Gerlach, R. W.; Kowalski, B. R.; Wold, H. O. A. Partial least-squares path modelling with latent variables. Analytica Chimica Acta 1979, 112(4), 417-21.
101. Dijkstra, T. Latent variables in linear stochastic models: Reflections on maximum likelihood and partial least squares methods. 2nd ed.; Sociometric Research Foundation: Amsterdam, The Netherlands, 1985.
102. Green, S. M.; Marshall, G. R. 3D-QSAR: a current perspective. Trends in Pharmacological Sciences 1995, 16(9), 285-91.
103. Klebe, G.; Abraham, U.; Mietzner, T. Molecular similarity indices in a comparative analysis (CoMSIA) of drug molecules to correlate and predict their biological activity. Journal of Medicinal Chemistry 1994, 37, 4130-4146.
104. Clark, T.; Lin, J.-H.; Horn, A. H. C. Parasurf '07, A1; CEPOS InSilico Ltd.: 26 Brookfield Gardens Ryde, Isle of Wight PO33 3NP, 2006.
105. de Jong, S. SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems 1993, 18, 251-263.


106. Guccione, S.; Doweyko, A. M.; Chen, H.; Barretta, G. U.; Balzano, F. 3D-QSAR using 'Multiconformer' alignment: The use of HASL in the analysis of 5-HT1A thienopyrimidinone ligands. Journal of Computer-Aided Molecular Design 2000, 14, 647-657.
107. Allinger, N. L.; Yuh, Y. H.; Lii, J.-H. Molecular Mechanics. The MM3 Force Field for Hydrocarbons. Journal of the American Chemical Society 1989, 111(23).
108. Lanig, H.; Utz, W.; Gmeiner, P. Comparative Molecular Field Analysis of Dopamine D4 Receptor Antagonists Including 3-[4-(4-Chlorophenyl)piperazin-1-ylmethyl]pyrazolo[1,5-a]pyridine (FAUC 113), 3-[4-(4-Chlorophenyl)piperazin-1-ylmethyl]-1H-pyrrolo-[2,3-b]pyridine (L-745,870), and Clozapine. Journal of Medicinal Chemistry 2001, 44, 1151-1157.
109. Wang, R.; Gao, Y.; Liu, L.; Lai, L. All-Orientation Search and All-Placement Search in Comparative Molecular Field Analysis. Journal of Molecular Modeling 1998, 4, 276-283.
110. Zheng, M.; Yu, K.; Liu, H.; Luo, X.; Chen, K.; Zhu, W.; Jiang, H. QSAR analyses on avian influenza virus neuraminidase inhibitors using CoMFA, CoMSIA, and HQSAR. Journal of Computer-Aided Molecular Design 2006, 20, 549-566.
111. Andrews, L. E.; Banks, T. M.; Bonin, A. M.; Clay, S. F.; Gillson, A.-M. E.; Glover, S. A. Mutagenic N-Acyloxy-N-alkoxyamides: Probes for Drug-DNA Interactions. Australian Journal of Chemistry 2004, 57, 377-381.
112. Andrews, L. E.; Bonin, A. M.; Fransson, L. E.; Gillson, A.-M. E.; Glover, S. A. The role of steric effects in the direct mutagenicity of N-acyloxy-N-alkoxyamides. Mutation Research 2006, 605, 51-62.
113. Bonin, A. M.; Glover, S. A.; Hammond, G. P. A comparison of the reactivity and mutagenicity of N-benzoyloxy-N-benzyloxybenzamides. Journal of Organic Chemistry 1998, 63, 9684-9689.
114. Böhm, M.; Stürzebecher, J.; Klebe, G. Three-Dimensional Quantitative Structure-Activity Relationship Analyses Using Comparative Molecular Field Analysis and Comparative Molecular Similarity Indices Analysis To Elucidate Selectivity Differences of Inhibitors Binding to Trypsin, Thrombin, and Factor Xa. Journal of Medicinal Chemistry 1999, 42, 458-477.
115. Tropsha, A.; Cho, S. J. Cross-validated r2 guided region selection for CoMFA studies. Perspectives in Drug Discovery and Design 1998, 12/13/14, 57-69.
116. Kroemer, R. T.; Hecht, P.; Guessregen, S.; Liedl, K. R. Improving the Predictive Quality of CoMFA Models. Perspectives in Drug Discovery and Design 1998, 14, 41-56.
117. Verma, R. P.; Hansch, C. A QSAR study on influenza neuraminidase inhibitors. Bioorganic & Medicinal Chemistry 2006, 14, 982-996.
118. Doweyko, A. M. The hypothetical active site lattice. An approach to modelling active sites from data on inhibitor molecules. Journal of Medicinal Chemistry 1988, 31(7), 1396-406.
119. Andrews, P. R.; Craik, D. J.; Martin, J. L. Functional group contributions to drug-receptor interactions. Journal of Medicinal Chemistry 1984, 27(12), 1648-57.
120. Becker, O. M.; Levy, Y.; Ravitz, O. Flexibility, Conformation Spaces, and Bioactivity. Journal of Physical Chemistry B 2000, 104, 2123-2135.
121. Furnham, N.; Blundell, T. L.; DePristo, M. A.; Terwilliger, T. Is one solution good enough? Nature Structural and Molecular Biology 2006, 13(3), 184-185.
122. Günther, S.; Senger, C.; Michalsky, E.; Goede, A.; Preissner, R. Representation of target-bound drugs by computed conformers: implications for conformational libraries. BMC Bioinformatics 2006, 7, 1-11.


123. Kuntz, I. D.; Chen, K.; Sharp, K. A.; Kollman, P. A. The maximal affinity of ligands. Proceedings of the National Academy of Sciences of the United States of America 1999, 96, 9997-10002.


Curriculum Vitae

Name: Kendall Grant Byler
Birthdate: 21.05.1970
Birthplace: Huntsville, AL, United States

Education

Doctor rerum naturalium 05/03-05/07 Friedrich-Alexander-Universität, Erlangen-Nürnberg Computer-Chemie-Centrum, Prof. Dr. Tim Clark

Master of Science 08/97-12/01 The University of Alabama in Huntsville

Bachelor of Science 08/88-05/93 The University of Alabama in Huntsville

Publications

• Byler, K.; de Groot, M. J.; Clark, T. Support Vector Classification for the Prediction of Phospholipidosis Induction. The 20th Darmstädter Molecular Modelling Workshop, Erlangen, Germany, 2006.
• Byler, K.; Ehresmann, B.; de Groot, M. J.; Clark, T. Surface-Integral QSPR Models: Local Energy Properties. The 19th Darmstädter Molecular Modelling Workshop, Erlangen, Germany, 2005.
• Lawton, R. O.; Alexander, L. D.; Setzer, W. N.; Byler, K. G. Floral essential oil of Guettarda poasana inhibits yeast growth. Biotropica 1993, 25, 483-486.
• Setzer, W. N.; Flair, M. N.; Byler, K. G.; Huang, J.; Thompson, M. A.; Moriarty, D. M.; Lawton, R. O.; Windham-Carswell, D. B. Antimicrobial and cytotoxic activity of crude extracts of Araliaceae from Monteverde, Costa Rica. Brenesia 1992, 38, 123-130.
