
3D-QSAR/QSPR Based Surface-Dependent Modeling Approach Derived From Semi-Empirical Quantum Mechanical Calculations

3D-QSAR/QSPR-basierter, oberflächenabhängiger Modellierungsansatz, abgeleitet von semi-empirischen quantenmechanischen Rechnungen

Submitted to the Faculty of Sciences

of the Friedrich-Alexander-Universität Erlangen-Nürnberg for the award of the doctoral degree Dr. rer. nat.

by Marcel Youmbi Foka from Cameroon

Approved as a dissertation by the Faculty of Sciences / Department of Chemistry and Pharmacy of the Friedrich-Alexander-Universität Erlangen-Nürnberg

Date of the oral examination: 5 December 2018

Chair of the doctoral committee: Prof. Dr. Georg Kreimer

Reviewers: Prof. Dr. Tim Clark

Prof. Dr. Birgit Strodel

Dedication

In memory of my late mother, Lucienne Metiegam, whom the Lord took unto Himself on May 3, 2009.

My mother, light of my life, God rest her soul, had a special respect for my studies. She always encouraged me to move forward. I sincerely regret that she cannot witness the culmination of this work today. Mom, may the Earth of our Ancestors rest lightly upon you!

This work is also a special tribute to Mr. Joseph Tchokoanssi Ngouanbe, who always supported me financially and morally. May he find here the expression of my deep gratitude.

Acknowledgements

I would like to pay tribute to all those who have made any contribution, whether scientific or not, to help carry out this work.

My special thanks go to Prof. Dr. Tim Clark, who gave me the opportunity and the means to work in his research team. I am grateful to him not only for supervising my work but also for his patience and for giving me the opportunity to explore this fascinating topic. As it was not in my area of expertise, I really enjoyed acquiring skills in this research area.

I address my sincere thanks to Dr. Nico Van Eikema Hommes, for his technical assistance.

I thank Dr. Christian Cramer for introducing me to the Val-Mlr and MOE programs, for the help he has always given me, and for becoming a close friend.

I thank Dr. Jr-Hung Lin for helping me gain knowledge of the VAMP, ParaSurf, and Materials Studio programs.

Because I come from a French-speaking country, my level of English is not high. The contribution of Dr. Victoria Jackiw from the Language Centre of the University of Erlangen-Nuremberg, who corrected the English of my thesis, was therefore a great help. I am very grateful to her.

I sincerely thank Mr. Justin Choapoueng Nkue and Mrs. Elise Tchokoanssi Ngouanbe for the brotherly love they have always given me.

While praying for the repose of the souls of Mom Pauline Sikompe, Julienne Tagny, Jean Lonkep, Mom Lydie Teukam, and Alice Nana, allow me to extend my thanks to Mom Anne Kouatchie, Mom Elisabeth Kamgho, Bernadette Mafodjo, Martin Kugoua, Gildas Nkue, Willy Nkue, Anick, Carine, Larissa, Boris, and Cynthia Tchokoanssi, Gisele and Beatrice Youmbi, Paulette, Muriel, Lynn, Nancy, Cindy, and Erwin Choapoueng, Patricia Ngangoua, Family Ngongang, Family Wember, Family Bagnessi, Armelle Nanguep, Fabien Touko, Emmanuel Tchokoanssi, Therese Yommo, Susanne Kamdem, Odette and Paul Bati, Hugues Lengue, Armand Lonkep, Dieudonne Fogaing, Hubert Djapou, Luther Tagny, and Astride Nguetchuessi for the solidarity and for the family love they never withheld from me.

Father Rigoberg Beck was kind enough to help me spiritually and emotionally during the time dedicated to my Ph.D., especially during my mother's illness and after her death.

I will not forget to thank Lißet Prechtel (R.I.P), Family Labbat-Metiogno, Dr. Eva and Radim Beranek, Family Guiffo, Brigitte Wohleben, Ferdinand Kuete, Angelika and Siegfried Balleis, Birgitt Aßmus, Alexandra Wunderlich, all the members of Christlich-Soziale Union (CSU), Odon Fokou, Dr. Pierrette Fofana, Guy Toko, Sylviane Tassing, Maila Dengel, Family Wete, Family Kuate, Family Yaneu, Family Sadjue, Family Tsumbu, Merlin Nkodja, Yanick Modjo, Gervais Gamgmeni, Emerent Prowo, Nathalie Tchamdjou, Sorelle Nsogning, everyone from Hering & Schneider GmbH, Carole Nya, Rosine Niaba, Chanceline Kamdem, Noelia Santos, Rose Kouatchet, Claude Heuyam, and Helene Kankeu for their sincere friendship, which has always united us, their sympathy and their solidarity.

From the bottom of my heart, I thank Simone Rennoch, Käthe and Josef Rennoch, and all the German community of Pommer for the love, affection, and support they always gave me.

It is a real pleasure for me to thank all the people of Computer-Chemie-Centrum in Erlangen (CCC) with whom I enjoyed working during these years of my thesis, particularly Prof. Dr. Paul von Rague Schleyer (R.I.P), Prof. Dr. Bernd Meyer, Prof. Dr. Dirk Zahn, Dr. Harald Lanig, Dr. Pawel Rodziewicz, Dr. Mateuz Wielopolski, Dr. Hakan Kayi, Dr. Ute Seidel, Dr. Jakub Goclon, Dr. Adria Gil Mestres, Dr. Ahmed Elkerdawy, Maximilian Kriebel, Dr. Pavlo Dral, Jürgen Wittman, Dr. Alexander Urban, Dr. Sebastian Schenker, Matthias Wildauer, Dr. Patrick Duchstein, Dr. Theodor Milek, Dr. Frank Beierlein, Tilo Sauertig, Oscar Roja, Heike Thomas, Dr. Christian Wick, Philipp Altmann, Bahanur Becit, Stefano Sansotta, Johannes Träg, Isabelle Schraufstetter, Nadine Scharrer, and not forgetting all the others.

Finally, I would like to thank the German Academic Exchange Service (DAAD), which through its program "DAAD-STIBET Doktorandenabschlussförderung" granted me a scholarship at the end of my studies.

Zusammenfassung (Summary)

This thesis describes several new QSAR/QSPR models for predicting the physico-chemical and biological activities of organic compounds.

New models for calculating the free energy of solvation in water, octanol, and chloroform were developed, based on gas-phase geometries computed by AM1, AM1*, MNDO/d, and PM3 optimization with VAMP. The new models were obtained by combining the pure Coulomb solvation energy, derived from an SCRF calculation, with a surface integral as a function of local quantum mechanical properties on the surface. Although AM1* and MNDO/d have no local polarizability, calculating the solvent effects for these Hamiltonians was made possible by extending the SCRF routine from s- and p-orbitals to d-orbitals. The local properties were calculated with ParaSurf, based either on the isodensity surface or on the spherical-harmonic surface. The models with the best statistical results were calculated with the isodensity surface. Among the Hamiltonians, AM1 gave the best predictions, with (R² = 0.92, MUE = 0.67, RMSD = 0.87), (R² = 0.92, MUE = 0.57, RMSD = 0.73), and (R² = 0.91, MUE = 0.46, RMSD = 0.61) for the solvation energy in water, octanol, and chloroform, respectively. For these solvation models, the contribution of each local property was found to be more than 30% for the molecular electrostatic potential (MEP, V), between 15% and 25% for the local ionization energy (IEL), between 15% and 20% for the local electron affinity (EAL), and between 10% and 18% for the local polarizability (αL, POL) and the hardness (ηL, HARD). This small number of variables helped to reduce considerably the risk of producing overtrained models. The absence of local polarizability for AM1* and MNDO/d had a significant effect, above all for the free energy of solvation in water for neutral and ionic compounds, with an RMSD difference of 6% compared with AM1 and PM3.

The solvation models in water and octanol that were developed with neutral compounds were applied to predict the octanol/water partition coefficient, logPow, for small molecules. For these compounds the models showed very good predictive power, but they appear to be able to calculate logPow for large molecules only to a very limited extent. The chloroform/water partition coefficient, logPcw, was also calculated for a series of small compounds in order to validate the models, and very good statistical results were obtained.

A new mathematical approach was developed, based on classified surface segments of the molecular electrostatic potential (MEP), the local ionization energy (IEL), the local electron affinity (EAL), the local polarizability (αL), the hardness (ηL), the electronegativity (χL, ENEG), and the field normal to the surface (FN), together with their cross-products. This approach differs fundamentally from the earlier polynomial surface-integral model (SIM), whose principle is to integrate MEP, IEL, EAL, αL, and ηL over a molecular surface. The new surface-integral model approach was then used to build logPow models for a very large data set, the LOGKOW database, consisting mainly of neutral, small and large molecules. Starting from the gas-phase geometries obtained by AM1, AM1*, PM3, MNDO, MNDO/d, and PM6 optimization with VAMP, models were generated using either the isodensity surface or the solvent-excluded surface to calculate the descriptors. These models were found to be strongly influenced by the flexibility and rigidity of the compounds, and compounds with a smaller number of rotatable bonds were predicted better. Models calculated on the basis of the solvent-excluded surface gave the smaller deviations. As with the solvation predictions, AM1 gave the best results for the test set (R² = 0.89, MUE = 0.43, RMSE = 0.58), and in about 25 of the 50 equations of the bagging approach it used a smaller number of descriptors, namely 40 of 336 (11.90%). AM1* and MNDO/d, which lack αL, are each based on 252 descriptors and used 39 (15.48%) and 55 (21.83%) of them, respectively, for the individual equations of the bagging approach. Because MEP × FN occurs in all 50 equations for AM1* and MNDO/d, FN was identified as the parameter responsible for compensating for the lack of αL in these Hamiltonians. A close relationship was found between FN and the number of hydrogen-bond donors/acceptors, which was confirmed by the strong dependence of the logPow prediction on these parameters.

The logPow models developed so far were applied to the prediction of phospholipidosis. The data come from Pfizer Global R&D, Amboise, France, and Sandwich, UK. The logPow values obtained with the models based on AM1, AM1*, MNDO, MNDO/d, PM3, and PM6 were combined with the standard ParaSurf descriptors to generate sets of 125 descriptors. These descriptor sets were evaluated with two different machine-learning algorithms (Naive Bayes and Random Forest) to classify compounds with respect to their ability to induce phospholipidosis. The best test-set predictions were produced by the models in which the descriptors were calculated with the solvent-excluded surface. The best model, with an accuracy of 84%, was obtained with PM3 using the Random Forest classification. The Naive Bayes algorithm yielded surface-dependent models, which, however, showed a similarity problem in the confusion matrix. This problem was completely solved by applying the Random Forest classification to groups of descriptors that include the solvent-excluded surface. The isodensity surface gave a Matthews correlation coefficient (MCC) between 0.24 and 0.48, with an average of 0.38, for the Naive Bayes models and between 0.33 and 0.55, with an average of 0.47, for the Random Forest models. For the solvent-excluded surface, the MCC values ranged from 0.33 to 0.57, with an average of 0.47, for Naive Bayes and from 0.50 to 0.68, with an average of 0.57, for Random Forest, which achieved the best prediction quality. Twenty-two of the 69 compounds of the test set were predicted very well by both the Naive Bayes and the Random Forest classification methods.

Abstract

In this thesis, some new QSAR/QSPR models for predicting physico-chemical and biological activities of organic compounds are described.

New solvation models for calculating the solvation free energy in water, octanol, and chloroform have been developed, starting from gas-phase geometries obtained directly from AM1, AM1*, MNDO/d, and PM3 optimization with VAMP. These models were obtained by combining a pure Coulomb free energy of solvation, derived from an SCRF calculation, with a local term calculated as a surface integral of a function of local properties. Although AM1* and MNDO/d do not have local polarizability, the calculation of the solvent effect for these Hamiltonians was made possible by extending the SCRF routine, previously limited to s- and p-orbitals, to d-orbitals. The local properties were calculated with ParaSurf, using either the isodensity or the spherical-harmonic surface. The models with the best statistical performance were obtained with the isodensity surface (iso). Among the Hamiltonians, AM1 provided the best predictions, with statistical performances of (R² = 0.92, MUE = 0.67, RMSD = 0.87), (R² = 0.92, MUE = 0.57, RMSD = 0.73), and (R² = 0.91, MUE = 0.46, RMSD = 0.61) for the solvation free energy in water, octanol, and chloroform, respectively. For these solvation models, the contribution of each local property was found to be more than 30% for the molecular electrostatic potential (MEP, V), between 15% and 25% for the local ionization energy (IEL), between 15% and 20% for the local electron affinity (EAL), and between 10% and 18% for the local polarizability (αL or POL) and the hardness (ηL or HARD). The small number of variables used considerably reduced the risk of generating overfitted models. The lack of αL for AM1* and MNDO/d was significant, especially for the solvation free energy in water for neutral and ionic compounds, with a difference in RMSD of ≈ 6% compared to AM1 and PM3.
The solvation models in water and octanol developed with neutral compounds were applied to calculate the octanol/water partition coefficient, logPow, for small molecules. For these compounds, the models showed very good predictive power, but they appear to be very limited when used to calculate logPow for large molecules. The chloroform/water partition coefficient, logPcw, was also calculated for a set of small compounds in order to validate the models, and very good statistical performances were obtained.

A new approach is presented that is based, among others, on the MEP, IEL, EAL, αL, ηL, the electronegativity (χL or ENEG), the field normal to the surface (FN), and their cross-products, evaluated over a surface divided into bins. It is fundamentally different from the former polynomial surface-integral model (SIM), whose principle was to integrate MEP, IEL, EAL, αL, and ηL across a molecular surface. This approach, called binned SIM, was then used with a very large logPow data set obtained from the LOGKOW database to generate models for accurately predicting logPow for large and small compounds that are mainly present in their neutral forms. Starting from gas-phase geometries obtained from AM1, AM1*, PM3, MNDO, MNDO/d, and PM6 optimization with VAMP, the models were generated using either the iso or the solvent-excluded surface (SES) for calculating the descriptors. These models were found to be strongly influenced by the flexibility and rigidity of the compounds used, and compounds with a small number of rotatable bonds gave the best predictions. Models generated with sets of descriptors calculated with the SES showed better statistical performance. As for the solvation models, AM1 provided the best statistical performance for the test set (R² = 0.89, MUE = 0.43, RMSE = 0.58), and in about 25 of the 50 bagging equations it used a lower number of descriptors, 40 of 336 (11.90%). AM1* and MNDO/d, which lack αL, had 252 descriptors each and used 39 (15.48%) and 55 (21.83%) of them, respectively. Because MEP × FN occurs in all 50 bagging equations for AM1* and MNDO/d, FN was identified as the parameter that compensates for the lack of αL for these Hamiltonians. A close relationship was found between FN and the number of hydrogen-bond donors/acceptors, confirming the strong dependence of the logPow prediction on these parameters.

The logPow models developed previously were applied to gas-phase geometries of sets of phospholipidosis-inducing compounds obtained from Pfizer Global R&D (Amboise Laboratories, France, and Sandwich Laboratories, UK). The logPow values obtained for AM1, AM1*, MNDO, MNDO/d, PM3, and PM6 were added to the standard ParaSurf descriptors, calculated with either the iso or the SES, to generate sets of 125 descriptors. These descriptor sets were used with two machine-learning (ML) algorithms (Naive Bayes and Random Forest) to generate models that classify compounds according to their ability to induce phospholipidosis. When evaluated on the respective test sets, the models generated with descriptors calculated with the SES gave the better predictive performance. The best model, with a predictive power of 84%, was obtained with PM3 through the Random Forest (RF) classifier. The Naive Bayes (NB) algorithm provided surface-dependent models but was faced with a problem of similarity in the confusion matrix. This problem was fully corrected by applying the RF classifier to sets of descriptors obtained with the SES. With the iso, the Matthews Correlation Coefficient (MCC) ranged from 0.24 to 0.48, with an average of 0.38, for NB and from 0.33 to 0.55, with an average of 0.47, for RF. With the SES, the MCC ranged from 0.33 to 0.57, with an average of 0.47, for NB and from 0.50 to 0.68, with an average of 0.57, for RF, which yielded the best prediction quality. Twenty-two of the 69 compounds of the test set were predicted very well by both classifiers.

Contents

Chapter 1: Introduction
1.1 Computer and Life Skills
1.2 Computational and Theoretical Chemistry Challenges
1.3 Computational Chemistry
1.3.1 Selection and Calculation of Descriptors
1.4 QSAR/QSPR Modeling
1.5 Objective and Thesis Outline
1.6 References

Chapter 2: Predicting the Solvation Free Energy Using a Combination of Semi-empirical Self-consistent Reaction Field Calculations and the Local Energy Properties
2.1 Introduction
2.1.1 Surface-Integral Models (SIMs)
2.1.1.1 Isodensity surface
2.1.1.2 Spherical-harmonic surface
2.2 Methods
2.3 Results
2.3.1 Free Energy of Solvation
2.3.1.1 Local Properties
2.3.1.2 Free Energy of Solvation in Water
2.3.1.3 Free Energy of Solvation in Octanol
2.3.1.4 Free Energy of Solvation in Chloroform
2.3.2 The Partition Coefficient: logP
2.3.2.1 The Octanol-Water Partition Coefficient: logPow
2.3.2.1.1 LogPow for Small Molecules
2.3.2.1.2 LogPow for Large Molecules
2.3.2.2 The Chloroform-Water Partition Coefficient: logPcw
2.4 Discussion
2.5 Conclusions
2.6 References

Chapter 3: Binned Surface-Integral Models for Predicting the Octanol-Water Partition Coefficient
3.1 Introduction
3.2 Methods of Calculating
3.2.1 Solvent-Excluded Surface (SES)
3.3 Results
3.3.1 Conformational Dependence
3.3.2 Comparison with Publicly Available logPow models
3.3.3 Variable Importance
3.3.4 Variable Dependence
3.4 Discussion
3.5 Conclusions
3.6 References

Chapter 4: Comparative Study of two Classification Algorithms for the Prediction of Drug-Induced Phospholipidosis
4.1 Introduction
4.2 Methods
4.2.1 Machine Learning Algorithms
4.2.1.1 Naive Bayes
4.2.1.2 Random Forest
4.3 Results
4.3.1 Machine Learning Models
4.3.1.1 Naive Bayes Models
4.3.1.2 Random Forest Models
4.4 Discussion
4.5 Conclusions
4.6 References

Appendix

Chapter 1

Introduction

1.1 Computer and Life Skills

In the past, in some African countries such as Cameroon, certain tasks were difficult to accomplish because people had access only to rudimentary tools and archaic techniques. Our grandparents and parents used stones and mortars to crush corn, rice, millet, and other grains. Today, thanks to advances in science and technology, these same people have seen their work lightened by tools and machinery derived from new technologies. Thus, the trader who once had to add with pen and paper can now do so far more easily and quickly with a calculator. The farmer who could only use a hoe or a pick benefits from machinery and equipment that are accessible and easy to operate. The office secretary who used a typewriter can now type quickly with a computer. Payment statements and staff data, formerly written in large and sometimes cumbersome registers, can easily be generated with this valuable tool. The computer is an electronic machine that operates by automatically reading a set of instructions in sequence, which allows it to execute arithmetic and logic operations on bits. Because of this, the computer has become the tool most often used in all areas of life, particularly in research, health, trade, transport, and communication. In research, and more specifically in chemistry, the computer can be used to draw molecules, plot graphs, and perform statistics, theoretical calculations, and simulations.

1.2 Computational and Theoretical Chemistry Challenges

Quantum chemistry is a science whose main purpose is to clarify the electronic structure of molecules. To achieve this goal, it uses the models and principles of quantum mechanics, the discipline from which it was historically derived and which has more in common with general physics. These models, usually described as quantum-chemical, contribute effectively to the characterization and description of molecules and their various interactions. Computational chemistry, which is related to quantum chemistry, has its origins in the efficient computer implementation of established quantum-chemical models and their application to specific chemical phenomena. In chemical research, it is worthwhile to establish the differences between the computational and quantum directions, which a priori seem to lie in the methodology and in how the results are interpreted. Historically, the first theoretical calculations based on quantum mechanics appear to have been performed by Walter Heitler and Fritz London in 1927. The beginnings of computational and quantum chemistry were strongly influenced by the books of Linus Pauling and E. Bright Wilson [1]; Eyring, Walter, and Kimball [2]; Heitler [3]; and later Coulson [4], which in the following years served chemists as the first references. In 1956, the first ab initio Hartree-Fock calculations were carried out at the Massachusetts Institute of Technology (MIT), using a basis set of Slater orbitals that was applicable only to diatomic molecules. In the following years, the Hückel method, a theoretical approach based on the linear combination of atomic orbitals (LCAO) that determines, with little complexity, the energies of the molecular orbitals of π electrons in conjugated hydrocarbon systems, was applied [5] to molecules ranging from butadiene and benzene to ovalene. In the 1960s, semi-empirical methods such as CNDO [6] emerged to improve on these one-electron methods, which, while effective, still had a narrow scope.

Methodologically, one notable difference is observed between computational and theoretical chemistry. In computational chemistry, a fully developed mathematical method is implemented directly on a computer, yielding programs adapted to the desired methodology that are then used to solve well-defined chemical problems efficiently. In contrast, theoretical chemistry rests on a purely mathematical description of chemistry, expressed through algorithms and computer programs developed by chemists, physicists, and mathematicians. In this way, it contributes to the prediction of atomic and molecular properties and of the reaction paths of chemical reactions. Computational chemistry, which goes hand in hand with computer science, can be defined as a branch of chemistry that addresses scientific problems by applying the concepts of chemistry together with principles of computer science. Theoretical chemistry plays a complementary role: building on some of its results, computational chemistry can determine the structures and properties of molecules and solids with sufficiently powerful computer programs. Computational chemistry bypasses the analytical solution of the quantum n-body problem, which cannot be solved in closed form except for the hydrogen molecular ion; its application in areas such as the design of new drugs has had remarkable success. Computational chemistry calculations range from highly accurate to very approximate. They allow predictions that confirm or extend the results of chemical experiments and, in some cases, give access to chemical phenomena that were previously unobserved.
Each of these methods has a particular characteristic that can be related directly to its principle of operation or its scope. Highly accurate methods are appropriate for the study of small systems and associated applications, while ab initio methods are based entirely on first principles. Empirical and semi-empirical methods introduce additional approximations, in which some elements of the underlying theory are characterized by parameters. For this purpose, they rely on the rational exploitation of experimental results, obtained from acceptable models of atoms or related molecules. However, these empirical or semi-empirical methods generally belong to the group of less accurate methods. Approximations remain fundamental to the development of both ab initio and semi-empirical methods. Most ab initio methods use the Born-Oppenheimer approximation, which simplifies the underlying Schrödinger equation by freezing the nuclei in place during the calculation. In principle, reducing the number of approximations is highly beneficial for ab initio methods, insofar as the solution then converges to the exact solution of the underlying equation. In practice, however, a complete elimination of all approximations is difficult or impossible to achieve, and residual errors inevitably remain. One of the main goals of computational chemistry is to minimize these residual errors while keeping the calculations tractable. Several major areas of computational chemistry can be distinguished: the determination of molecular structures by simulating forces or, more accurately, by quantum mechanical methods that locate stationary points on the energy surface as the positions of the nuclei are varied; the efficient synthesis of compounds supported by appropriate computational techniques; the storage of and search for data on chemical entities (chemical databases); the estimation of relationships or correlations between chemical structures and properties (QSPR/QSAR); and computational approaches to the design of molecules that undergo a specific interaction with other molecules.

1.3 Computational Chemistry

On the health front, the world today is subject to new pandemics and diseases that are a real obstacle to the full development of human beings. To deal with this problem, an active search for new, inexpensive, and readily available drugs has become imperative and a major challenge. A statistical study of new drugs has established that, of a sample of 10,000 molecules synthesized and tested, on average only one has the characteristics of an innovative, commercially viable product. The development cycle of a new drug is generally a particularly long process that can extend over 10 to 15 years of research. During this development, the major objective is to obtain a molecule that not only possesses particular therapeutic properties but also, and above all, produces a minimum of unwanted side effects. Carrying out these syntheses, which are often executed unnecessarily over a long period of time, requires a strong mobilization of human, material, and financial resources. These factors have an enormous influence on the final product, which often reaches the market at a very high price and is therefore not easily accessible to individuals with an average standard of living. To overcome this disadvantage, researchers in the pharmaceutical industry have proposed predicting the properties and activities of molecules in advance, before moving on to the final step of synthesizing them in a laboratory. In computational chemistry, two fields of research have been developed to meet this urgent need: Quantitative Structure-Activity Relationships (QSAR) and Quantitative Structure-Property Relationships (QSPR). Their principal objective is to identify commonalities between molecules in large databases of existing molecules whose properties are known.
Establishing such a relationship has many advantages. On the one hand, it helps to determine the physical and chemical properties and the biological activities of compounds. On the other hand, it contributes to the development of new theories, or it gives a fair idea of the observed phenomena, so that whole families of compounds can be studied comprehensively or new molecules can be designed without relying on data from their synthesis in the laboratory. Today, thanks to the advent of molecular modeling, relationships between the structures of molecules and their properties or activities can be established without much difficulty. In molecular modeling techniques, molecules are usually characterized by a set of descriptors, which are measures or real numbers derived from calculations performed on the molecular structures. This opens the way to establishing a relationship between these descriptors and the modeled properties. However, these methods, whose effectiveness is well established, still face difficulties that are mostly related to the calculation and selection of relevant descriptors.

1.3.1 Selection and Calculation of Descriptors

In recent decades, the thorny problem of how to encode the chemical information extracted from molecular structures as a set of real numbers, or descriptors, has been the focus of much research. Once adequately represented, these descriptors allow traditional modeling techniques to establish a relationship between the chemical information contained in the molecule and a molecular property or activity. Numerical descriptors carry this information as a vector of real values. They can be used for a quantitative assessment of the physico-chemical or structural characteristics of molecules, and more than 3000 kinds of descriptors can currently be evaluated. Descriptors can be determined using either empirical or semi-empirical methods. Using descriptors obtained directly by calculation or prediction, without recourse to experiment, allows predictions to be made while bypassing the synthesis of the molecule, and this is one of the major strengths of modeling. In reality, obtaining data experimentally seems to be the most accessible route, compared with prediction, for properties such as logPow7, αL, or IEL, which constitute a small set of descriptors that can be determined by measurement. However, methods for determining the activity or property of a molecule are sometimes hampered by misunderstanding or misinterpretation of the underlying mechanisms. Thus, the most important step in the early stages of modeling is first to obtain a sufficiently large number of different descriptors and then to select those that have a considerable influence on, or are most relevant to, the model. In modeling, descriptors are generally divided into three main groups: 1D, 2D, and 3D descriptors.

The 1D descriptors, which describe the atomic composition (number and type of atoms) or the mass composition of a molecule, give information on the global properties of the molecule. Because these descriptors are generated directly from the empirical formula of the compound, they cannot distinguish between constitutional isomers.

The 2D descriptors, composed predominantly of constitutional indices (number of single and multiple bonds, number of rings, etc.) or topological indices (the Wiener index8, the Randic index9, the Kier-Hall valence connectivity index10, and the Balaban index11), describe the actual structure of the molecule, including its shape, size, and branching. These descriptors are obtained exclusively from the structural formula of the compound. Their major asset is that they characterize the physical properties of the molecule, but they show some shortcomings when used to describe properties such as biological activity.
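As an illustration of a topological index, the Wiener index can be computed as the sum of the shortest-path bond counts between all pairs of heavy atoms in the hydrogen-suppressed molecular graph. The following is a minimal sketch; the adjacency-list input format and the function name are assumptions made for this example.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path lengths over all unordered atom pairs.

    `adjacency` maps each heavy-atom index to the indices bonded to it
    (hydrogen-suppressed graph; this input format is an assumption).
    """
    total = 0
    for start in adjacency:
        # Breadth-first search gives bond-count distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for nb in adjacency[atom]:
                if nb not in dist:
                    dist[nb] = dist[atom] + 1
                    queue.append(nb)
        total += sum(dist.values())
    return total // 2  # each pair was counted twice

# Two constitutional isomers of C4H10 (carbon skeletons only).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(wiener_index(n_butane), wiener_index(isobutane))
```

The two isomers share the same empirical formula but give different index values, which is the kind of structural distinction that a formula-based descriptor cannot make.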

The 3D descriptors, which comprise geometrical descriptors (the molecular volume, the solvent-accessible surface, and the principal moments of inertia), electronic descriptors (dipole moment, ionization potential, and other energies related to the molecule), and even spectroscopic descriptors, allow a fairly broad description of complex characteristics. These descriptors depend entirely on the 3D geometry, in which the atoms constituting the molecule adopt sufficiently stable relative positions, and they can be generated by empirical or ab initio molecular modeling calculations. Some of these descriptors can be classified by the methodology used to obtain them or by their function. Electronic descriptors, for example, are very often derived from quantum chemical calculations, whereas spectroscopic descriptors characterize molecules through quantities related to spectroscopic measurements, such as vibrational wave functions.

1.4 QSAR/QSPR Modeling

In the late 19th century, Crum-Brown and Fraser12 had the brilliant idea of attempting to model the activity of molecules, which allowed them to highlight the close relationship between the biological activity of a molecule and its chemical constitution. Later, the year 1964 was marked by the advent of the famous “Group Contribution” theory, which became the true trigger for QSAR modeling. The establishment of new learning techniques, initially linear and subsequently nonlinear, led to the rapid development of many methods aimed at the relationship between molecular descriptors and the properties or activities to be predicted. QSAR, or QSPR, is the process of establishing a quantitative link between a defined chemical structure and a well-defined process, generally a biological activity or chemical reactivity. The fundamental principle of the Structure-Activity Relationship (SAR) is the hypothesis that molecules with common features produce similar biological activities. The difficulty commonly encountered in applying this hypothesis lies in deciding which differences at the molecular level govern each type of activity, such as the reaction or biotransformation ability, the solubility, or the target activity. QSARs are predictive models, generated with statistical tools, that relate the biological activity (including desirable therapeutic effects and undesirable side effects) of chemicals (drugs, toxicants, or environmental pollutants) to descriptors representative of the molecule and/or its properties.
The scope of QSAR models has expanded, and the most significant areas of application include risk assessment, the prediction of toxicity, regulatory decisions13, drug discovery, and lead optimization14. Generating a good QSAR model that satisfies the set standards depends entirely on the choice of biological data, descriptors, and appropriate statistical methods. The major goal of any QSAR modeling is to produce statistically robust models that can be used for efficient and reliable prediction of the biological activities of new compounds. Although the quality of a QSAR model depends strongly on the quality of the input data, the selection of descriptors, and the choice of statistical approach, its performance must still be established by validation, which remains the only way to demonstrate the relevance and reliability of a procedure applied in a particular case15. The techniques for selecting the training set compounds16, setting the training set size17, and assessing the effect of variable selection on training set models, which gives a basic idea of the quality of predictions18, seem to be the main parameters to take into account during the validation of a QSAR model. Moreover, one of the most important points in assessing the quality of QSAR models is the development of novel validation parameters19. The prediction of boiling points was one of the first successful applications and has particularly marked the history of QSAR20.

In organic chemistry, compounds that share the same functional groups very often show strong correlations between their structures and the properties to be predicted. Determining the biological activity of a molecule experimentally makes it possible to evaluate the degree of inhibition of a well-defined signal transduction or metabolic pathway. In drug discovery, toxicity is commonly treated as one form of the biological activity of a chemical species. Molecules that inhibit their respective targets successfully and whose toxicity lies sufficiently below the threshold (non-specific activity) are the most appropriate for the application of QSAR techniques, which facilitates their identification. The biological activity, known in pharmacology as pharmacological activity, reflects all the effects, desirable or not, that a drug can cause in the presence of living matter. Given the close relationship between pharmacological activity and the beneficial or adverse effects of drug candidates, it is natural to treat the toxicity of a chemical structure as a type of biological activity.

The logPow, as stated in Lipinski’s Rule of Five21, plays a significant role in the QSAR applications related to the identification of “drug likeness”. The development of a QSAR/QSPR model follows a process consisting of the following steps:

• Query experimental data,
• 2D to 3D conversion (Corina/Concord),
• Quantum chemical calculations (VAMP),
• Descriptors calculation (ParaSurf),
• QSAR/QSPR modeling.

In 1997, Christopher A. Lipinski, working on lipophilicity, formulated a principle called Lipinski’s Rule of Five, which states that many medications are relatively small molecules that very often belong to the family of lipophilic compounds21. This rule, which has seen some popularity in QSAR/QSPR, has been particularly successful in the evaluation of drug likeness, that is, in predicting whether a chemical compound with a certain pharmacological or biological activity is likely to be active in humans when administered orally. Thus, according to Lipinski, an orally active drug should in general violate no more than one of the following basic criteria:

• A ClogP value not greater than 5 (logP units)22,
• A molecular weight not greater than 500 Dalton,
• A number of hydrogen bond donors not greater than 5 (sum of –OH’s and –NH’s),
• A number of hydrogen bond acceptors not greater than 10 (sum of N and O atoms).
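The four criteria can be checked mechanically once the descriptor values are known. The sketch below counts the violated criteria for a compound; the `Descriptors` container, the function name, and the example values are hypothetical, and the acceptance threshold of at most one violation reflects the common reading of the rule rather than anything computed in this work.

```python
from dataclasses import dataclass

@dataclass
class Descriptors:
    clogp: float        # calculated logP (logP units)
    mol_weight: float   # molecular weight in Dalton
    h_donors: int       # sum of -OH and -NH groups
    h_acceptors: int    # sum of N and O atoms

def rule_of_five_violations(d):
    """Return the names of the criteria that are violated."""
    checks = {
        "ClogP > 5": d.clogp > 5,
        "MW > 500": d.mol_weight > 500,
        "H-bond donors > 5": d.h_donors > 5,
        "H-bond acceptors > 10": d.h_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]

# Illustrative, made-up descriptor values (not taken from any data set).
example = Descriptors(clogp=5.6, mol_weight=420.0, h_donors=2, h_acceptors=6)
violations = rule_of_five_violations(example)
print(violations, len(violations) <= 1)
```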

In QSPR, several successful methods for determining logP have been developed. Among them are:

• Atomic based prediction, or atomic contribution (AlogP, MlogP, etc.),
• Fragment based prediction, or group contribution (ClogP, etc.),
• Data mining prediction,
• Molecule mining prediction,
• Estimation of logD (at given pH) from logP and pKa.

The distribution coefficient of a molecule, logD23, is the logarithm of the ratio between the total concentration of the compound (ionized plus un-ionized forms) in the organic phase and its total concentration in the aqueous phase at a given pH. For un-ionizable compounds logD is equal to logP, and for ionizable compounds it approaches logP at pH values where the compound is essentially un-ionized.
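The last method in the list, estimating logD from logP and pKa, can be sketched for the simplest case. The example below assumes a monoprotic compound and assumes that only the neutral form partitions into the organic phase, which is a common simplification and not necessarily the estimation method of ref. 23; the function name and the numerical values are illustrative.

```python
import math

def log_d(log_p, pka, ph, acid=True):
    """logD of a monoprotic compound, assuming only the neutral form
    partitions into octanol (a common simplification)."""
    # Ionized fraction grows with pH for acids and with falling pH for bases.
    exponent = (ph - pka) if acid else (pka - ph)
    return log_p - math.log10(1.0 + 10.0 ** exponent)

# Hypothetical monoprotic acid: logP = 3.0, pKa = 4.5.
for ph in (2.0, 4.5, 7.4):
    print(ph, round(log_d(3.0, 4.5, ph), 2))
```

At a pH well below the pKa of the acid the estimate approaches logP, and at pH = pKa it lies log10(2) below logP because half of the compound is ionized.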

In molecular modeling, descriptors, which encode part of the information contained in the molecule, play a fundamental role; insofar as they relay this information faithfully, they help to predict the property or activity of a molecule effectively. The group contribution technique, one of the first methods applied in the early era of QSAR modeling, remains the main alternative for specific applications such as the characterization of molecules. In 1988, Cramer initiated the development of Comparative Molecular Field Analysis (CoMFA24), a method based on a preliminary alignment of the molecules that places them all in a common orientation suitable for modeling. This method appears to be the most plausible alternative to conventional methods, which still show some shortcomings when applied to the determination of biological activity. Generally, the interaction of a molecule (ligand) with the corresponding receptor is the main factor that governs its biological activity. Determining this interaction precisely, and the nature of its relationship to the activity under investigation, is the key to modeling the biological activity. Compared with other QSAR approaches, the CoMFA method appears to be the most appropriate for applications oriented toward the modeling of protein-ligand interactions.


1.5 Objective and Thesis Outline

As part of this project, we have developed a new method for predicting the solvation free energy of small organic compounds in different solvents, based on a combination of the self-consistent reaction field (SCRF) calculations25 and the local energy properties. Through the SCRF routine, we have extended the calculation of the solvent effect to d-orbitals by an implementation of the multipole approach. New logPow models have been developed, using our recent binned SIM, which is based on binned area descriptors, in contrast to our old polynomial SIM models. For these logPow models, several types of application can be envisaged. The thesis is structured as follows:

Chapter 1 is a general introduction about QSAR/QSPR modeling, including the basic concept and its application to biological activity.

The second chapter of this thesis presents the surface effect on the calculation of the “local surface tension” contribution to the solvation free energy at the molecular surface, the techniques used to determine the solvent effect, the solvation models obtained, the order of importance of each local property on these solvation models, and the problems related to their validations and applications.

Chapter 3 introduces the concept of binned SIM used for the development of our logPow models. We then show how the prediction of the logPow is related to the field normal to the surface (FN), the flexibility/rigidity of a molecule, the number of hydrogen bond donor/acceptor atoms, and the molecular surface.

Chapter 4 is devoted to applications of logPow models previously developed for the classification of phospholipidosis-inducing compounds. We present a prediction of the activity of some drugs that may induce the accumulation of phospholipids in the human body. We stress the ability of two ML algorithms (RF and NB) to predict induction of phospholipidosis. We evaluate the effect of the classification approach and the molecular surface used on the prediction quality.


1.6 References

1. Pauling, L.; Wilson, E. B. Introduction to Quantum Mechanics with Applications to Chemistry. New York: McGraw-Hill, 1935.

2. Eyring, H.; Walter, J.; Kimball, G. E. Quantum Chemistry. New York: John Wiley and Sons, 1944.

3. Heitler, W. Elementary Wave Mechanics with Applications to Quantum Chemistry. Oxford: Clarendon Press, 1945.

4. Coulson, C. A. Valence. Oxford: Clarendon Press, 1952.

5. Streitwieser, A.; Brauman, J. I.; Coulson, C. A. Supplementary Tables of Molecular Orbital Calculations. Oxford: Pergamon Press, 1965.

6. Pople, J. A.; Beveridge, D. L. Approximate Molecular Orbital Theory. New York: McGraw-Hill, 1970.

7. Hansch, C.; Leo, A.; Hoekman, D. Exploring QSAR: Hydrophobic, Electronic and Steric Constants. American Chemical Society: Washington, DC, 1995.

8. Wiener, H. Structural Determination of Paraffin Boiling Points. Journal of the American Chemical Society 1947, 69, 17-20.

9. Randic, M. On Characterization of Molecular Branching. Journal of the American Chemical Society 1975, 97, 6609-6614.

10. Kier, L. B.; Hall, L. H. Molecular Connectivity in Chemistry and Drug Research. New York: Academic Press, 1976.

11. Balaban, A. T. Highly Discriminating Distance-Based Topological Index. Chemical Physics Letters 1982, 89, 399-404.

12. Crum-Brown, A.; Fraser, T. On the Connection between Chemical Constitution and Physiological Action. Transactions of the Royal Society of Edinburgh 1868-69, 25, 151-203.

13. Tong, W.; Hong, H.; Xie, Q.; Shi, L.; Fang, H.; Perkins, R. Assessing QSAR Limitations- A Regulatory Perspective. Current Computer-Aided Drug Design 2005, 2, 195-205.

14. Dearden, J. C. In Silico Prediction of Drug Toxicity. Journal of Computer-Aided Molecular Design 2003, 17, 2-4, 119-127.

15. Roy, K. On Some Aspects of Validation of Predictive Quantitative Structure-Activity Relationship Models. Expert Opin. Drug. Discov. 2007, 2 (12), 1567-1577.

16. Leonard, J. T.; Roy, K. On Selection of Training and Test Sets for the Development of Predictive QSAR Models. QSAR & Combinatorial Science 2006, 25 (3), 235-251.


17. Roy, P. P.; Leonard, J. T.; Roy, K. Exploring the Impact of Size of Training Sets for the Development of Predictive QSAR Models. Chemometrics and Intelligent Laboratory Systems 2008, 90 (1), 31-42.

18. Roy, P. P.; Roy, K. On some aspects of variable selection for partial least squares regression models. QSAR & Combinatorial Science 2008, 27 (3), 302-313.

19. Roy, P. P.; Paul, S.; Mitra, I.; Roy, K. On two Novel Parameters for Validation of Predictive QSAR Models. Molecules 2009, 14 (5), 1660-1701.

20. Rouvray, D. H.; Bonchev, Danail. Chemical graph theory: introduction and fundamentals. Tunbridge Wells, Kent, England: Abacus Press, 1991.

21. Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug. Del. Rev. 2001, 46, 3-26.

22. Leo, A.; Hansch, C.; Elkins, D. Partition Coefficients and their uses. Chem. Rev. 1971, 71 (6), 525-616.

23. Csizmadia, F.; Tsantili, A.; Panderi, I.; Darvas, F. Prediction of Distribution Coefficient from Structure. 1. Estimation Method. Journal of Pharmaceutical Sciences 1997, 86 (7), 865-871.

24. Cramer, R. D.; Patterson, D. E.; Bunce, J. D. Comparative Molecular Field Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins. Journal of the American Chemical Society 1988, 110 (18), 5959-5967.

25. Tomasi, J.; Persico, M. Molecular Interactions in Solution: An Overview of Methods Based on Continuous Distribution of the Solvent. Chem. Rev. 1994, 94, 2027-2097.


Chapter 2

Predicting the Solvation Free Energy using a Combination of Semi-empirical Self-consistent Reaction Field Calculations and the Local Energy Properties


2.1 Introduction

In recent years, quantum chemistry has entered the mainstream for the investigation of organic molecules and reaction mechanisms in the gas phase. This approach, which has become increasingly reliable over time, has proved effective in predicting the electronic properties of organic compounds1. Recently it has become apparent that molecular modeling increasingly needs new paradigms for quantitative structure-activity (QSAR) and structure-property (QSPR) relationships, scoring functions and docking, and other applications in cheminformatics and modelling2. QSAR models are regression models for experimental properties with continuous values, such as aqueous solubility, melting point, blood-brain barrier permeability, hydrophobicity, barrier penetration, lethal concentration, or inhibition constants for enzymes. Relying on new classes of structural descriptors or more powerful statistical models, they extend the Hansch model, which was already established in molecular modeling3,4. Nowadays, the QSAR models being developed tend to focus on the intrinsic relationships between the structures of chemical compounds represented as chemical networks5-8. Through their early work on estimating the physical properties of molecular structures (QSPRs), Hansch and Leo9,10 laid the foundations for computational chemistry. Today, molecular modeling is strongly influenced by models for the octanol-water partition coefficient (logPow)11-13, standard enthalpies of formation14-16, boiling points17, melting points18, and aqueous solubility19,20. These models are based on incremental approaches in which the molecule is split into atoms or groups and each fragment is assigned an additive contribution.

The aim of this work is to develop some new QSPR models for determining the free energy of solvation (ΔGsolv) of organic compounds in different solvents, based on the self- consistent reaction field (SCRF) calculation of the solvent effect and the local energy properties. To address this problem, the first stage of development was to implement the multipole model developed by Clark et al.21 into the SCRF routine22, so that the solvent effects for compounds containing either s,p-orbitals or s,p,d-orbitals could be easily calculated.

In each solution, aqueous or not, there is an interaction between the solute and the solvent that depends on the geometry of the solute. Thus, to calculate ΔGsolv, one must first determine an average conformation for the solute. This problem can be solved by doing a single calculation using either the optimized gas-phase geometry or the optimized liquid-phase geometry23,24. In the SCRF model25-27, the reaction field is integrated directly into the Hamiltonian, so that the two are treated together. Thanks to this feature, SCRF became the most appropriate model for both semi-empirical and ab initio molecular orbital treatments. Within the SCRF framework, a distinction can be made between models that use a cavity and those that do not require a boundary separating the solute from the solvent. The first attempts at a detailed description of the chemical environment were made by Klopman28, Germer29 and Miertus30. Based on Born’s31 formalism, they were able to generate solvation models appropriate for this description. This approach describes the effect of the solvent on the solute by a set of charges, each the negative of the Mulliken net charge32 of the atomic center with which it is associated in the solvation. Truhlar et al.33 proceeded by a dependent SCRF approach, in which the solvent effect is included in the Hamiltonian to form a single unit. Tomasi34 and Olivares del Valle35 outlined the concept of continuum models using an arbitrarily shaped boundary. The main characteristics of these models are their applicability to ab initio formalisms and their ability to predict solvent effects accurately. Solvent effects can also be calculated with several other techniques, such as the Generalized Born (GB) model36-41, the Poisson-Boltzmann (PB) model42,43, and the Conductor-like Screening Model (COSMO)44-46.

In the SCRF approach22 used for this work, the algorithms of Pascual-Ahuir47 and Connolly48 were adapted to define an arbitrarily shaped cavity. In parallel, the approach of Marsili49 was modified using a marching-cube algorithm to generate the surface points. In SCRF theory, the molecular electrostatic potential (MEP) allows a qualitative definition of the nature of the reaction field as well as of the electrostatic interaction free energy. The MEP has a major impact in some areas of computational chemistry, such as drug design50, the simulation of intermolecular interactions51, molecular similarity studies52, and continuum solvent models within molecular orbital (MO) theories51. Because of its use in continuum models, calculating solvent effects through the SCRF approach requires accurate methods that calculate the MEP at the van der Waals surface (or thereabouts) within a short time.

Thus, the MEP53 plays a central role in applications closely related to intermolecular interaction energies, such as QSAR54, QSPR55, prediction of toxicity56, docking57, (continuum) solvation models58-60 and many others. Using results derived from the quantum mechanical calculation, the MEP, V(r), can be obtained by the formula

$$V(\mathbf{r}) = \sum_{i=1}^{n} \frac{Z_i}{\left|\mathbf{R}_i - \mathbf{r}\right|} - \int_{-\infty}^{\infty} \frac{\rho(\mathbf{r}')}{\left|\mathbf{r}' - \mathbf{r}\right|}\, d\mathbf{r}' \tag{2.1}$$

Here V(r) represents the electrostatic potential at any point r, n is the number of atoms in the molecule, $Z_i$ is the nuclear charge of atom i located at $\mathbf{R}_i$, and ρ(r') is the electron density function of the molecule.

The point charges used for calculating the MEP are calculated using either the natural atomic orbital point charge (NAO-PC)61-64 for s,p-orbitals or the multipole model21 for s,p,d- orbitals. With the multipole approach, the MEP is obtained using the equation

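$$V(\mathbf{r}) = \hat{q}\left(\frac{1}{R}\right) - \hat{\mu}_{\alpha}\nabla_{\alpha}\left(\frac{1}{R}\right) + \frac{1}{3}\hat{\Theta}_{\alpha\beta}\nabla_{\alpha}\nabla_{\beta}\left(\frac{1}{R}\right) - \dots \tag{2.2}$$

In this case, $\hat{q}$, $\hat{\mu}_{\alpha}$ and $\hat{\Theta}_{\alpha\beta}$ are the operators for monopole, dipole and quadrupole, respectively. R is the distance between the multipole center and the MEP point, and ∇ is the nabla operator.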
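To make the leading terms of the multipole expansion concrete, the following sketch evaluates only the monopole and dipole contributions of a single center at a given point, in atomic units. It assumes that the gradient acts on the coordinates of the MEP point, so that the dipole term equals μ·R/R³ with R pointing from the multipole center to that point; the quadrupole term is omitted for brevity, and the function name and numerical values are illustrative only.

```python
import math

def multipole_potential(q, mu, center, point):
    """Monopole and dipole contributions to the potential (atomic units).

    Uses -mu . grad(1/R) = mu . Rvec / R**3, where Rvec points from the
    multipole center to the point at which the potential is evaluated.
    The quadrupole and higher terms are not included in this sketch.
    """
    rvec = [p - c for p, c in zip(point, center)]
    r = math.sqrt(sum(x * x for x in rvec))
    monopole = q / r
    dipole = sum(m * x for m, x in zip(mu, rvec)) / r ** 3
    return monopole + dipole

# Hypothetical center with charge -0.5 e and a dipole of 0.3 e*bohr along +z.
v_above = multipole_potential(-0.5, (0.0, 0.0, 0.3), (0, 0, 0), (0, 0, 2.0))
v_below = multipole_potential(-0.5, (0.0, 0.0, 0.3), (0, 0, 0), (0, 0, -2.0))
print(v_above, v_below)
```

The potential is less negative on the side toward which the dipole points, which is the anisotropy that a pure point-charge description of the same center cannot reproduce.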


Up to now, different models have been proposed for calculating ΔGsolv. In 1975, Hine and Mookerjee65 suggested an approach based on the fragment additivity hypothesis. In 1986, Eisenberg and McLachlan described another method that focuses on the solvent-accessible surface area66. In 1997, Hawkins, Cramer and Truhlar67 presented a model for determining ΔGsolv exclusively in water, based on geometry-dependent atomic surface tensions combined with implicit electrostatics. In 2005, Clark et al.68 developed an approach for determining ΔGsolv based entirely on the local properties of the molecule at the molecular surface. In the solvation model presented here, ΔGsolv is represented as the sum of the electrostatic energy (ΔGelec, evaluated with the SCRF technique) and the local surface tension contribution to ΔGsolv at the molecular surface (ΔGsurf, determined using the surface-integral model (SIM) technique).

2.1.1 Surface-Integral Models (SIMs)

In the SIM approach, a physical property is estimated by integrating one or more local properties over the molecular surface, which can be either an isodensity69 (iso) or a spherical-harmonic surface70 (sphh). Surface-integral models are expressed as

$$P = \sum_{i=1}^{n_{tri}} f\left(V^{i}, IE_{L}^{i}, EA_{L}^{i}, \alpha_{L}^{i}, \eta_{L}^{i}\right) \cdot A^{i} \tag{2.3}$$

P is the target property and f is a polynomial function of the five local properties (the electrostatic potential (V), the local ionization energy (IEL), the local electron affinity (EAL), the local polarizability (αL), and the local hardness (ηL)); the summation runs over all ntri triangles that constitute the molecular surface. The superscript i refers to the value of the local property evaluated at the center of surface triangle i, which has area Ai. The polynomial function is determined by multiple linear regression using pre-calculated sums of the individual components of the functions listed in Table A1 of the Appendix.
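The surface sum can be sketched in a few lines. In the example below each triangle carries the five local property values at its center and its area; the dictionary keys, the polynomial `f_example`, its coefficients, and the two-triangle "surface" are all made up for illustration and are not ParaSurf output or fitted values.

```python
def surface_integral(triangles, f):
    """Sum of f(local properties) * area over all surface triangles.

    Each triangle is a dict with keys V, IE, EA, alpha, eta (local property
    values at the triangle center) and A (triangle area). The key names
    are illustrative, not ParaSurf output fields.
    """
    return sum(f(t["V"], t["IE"], t["EA"], t["alpha"], t["eta"]) * t["A"]
               for t in triangles)

# Hypothetical polynomial with made-up coefficients; in the text the
# coefficients come from a multiple linear regression.
def f_example(V, IE, EA, alpha, eta):
    return 0.01 * V + 0.002 * V ** 2 - 0.001 * IE

surface = [
    {"V": -10.0, "IE": 400.0, "EA": -80.0, "alpha": 0.3, "eta": 240.0, "A": 0.5},
    {"V": 5.0, "IE": 420.0, "EA": -90.0, "alpha": 0.25, "eta": 255.0, "A": 0.7},
]
print(surface_integral(surface, f_example))
```

Because the polynomial is linear in its coefficients, the sum of each individual term over the surface can be calculated once per molecule, which is what allows the coefficients to be obtained by multiple linear regression.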

2.1.1.1 Isodensity surface

An iso69 is the set of points in space that share a common value of the electron density, ρ(r). ρ(r) is an estimate of the probability of finding an electron at a particular place. In quantum chemistry, ρ(r) is defined as a function of the space coordinates r, so that ρ(r)dr is the number of electrons in a small volume element dr. In the specific case of closed-shell molecules, ρ(r) can be expressed as a sum of products of the basis functions, φ:

$$\rho(\mathbf{r}) = \sum_{\mu}\sum_{\nu} P_{\mu\nu}\,\phi_{\mu}(\mathbf{r})\,\phi_{\nu}(\mathbf{r}) \tag{2.4}$$

Here P represents the corresponding density matrix.
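The double sum over basis functions can be written out directly. The sketch below evaluates the density at a single point from a density matrix and the values of the basis functions at that point; the 2x2 matrix and the function values are made-up numbers, not the result of any calculation.

```python
def electron_density(P, phi):
    """rho(r) = sum_mu sum_nu P[mu][nu] * phi[mu] * phi[nu].

    P is the density matrix and phi the list of basis-function values
    at the point r (both supplied by the caller).
    """
    n = len(phi)
    return sum(P[mu][nu] * phi[mu] * phi[nu]
               for mu in range(n) for nu in range(n))

# Made-up 2x2 density matrix and basis-function values at one point.
P = [[1.2, 0.4],
     [0.4, 0.8]]
phi = [0.5, 0.25]
print(electron_density(P, phi))
```

An isodensity surface is then the set of points at which this value equals the chosen threshold.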

2.1.1.2 Spherical-harmonic surface

Sphh70 are obtained by fitting to spherical-harmonic expansions, which are based on the following spherical-harmonic function

$$Y_{l}^{m}(\theta,\phi) = \sqrt{\frac{(2l+1)}{4\pi}\,\frac{(l-m)!}{(l+m)!}}\; P_{l}^{m}(\cos\theta)\, e^{im\phi} \tag{2.5}$$

In the above function, the two integers m and l are quantum numbers that give the number and the spatial arrangement of the nodes present in each function70. The values of m are m = -l, -l+1, …, 0, …, l. $P_{l}^{m}(\cos\theta)$ are the associated Legendre functions71,72. θ is the polar angle, measured from the polar axis, and φ is the azimuthal angle, measured within the equatorial plane from an arbitrarily chosen reference direction.
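For readers who want to evaluate the spherical-harmonic function numerically, the following sketch uses the standard upward recurrence for the associated Legendre functions. It is a textbook implementation written for this example, not the routine used in ParaSurf; it assumes θ is the polar angle, includes the Condon-Shortley phase (a convention choice), and obtains negative m from the usual symmetry relation.

```python
import cmath
import math

def assoc_legendre(l, m, x):
    """Associated Legendre function P_l^m(x) for 0 <= m <= l
    (Condon-Shortley phase included; a convention choice)."""
    pmm = 1.0
    if m > 0:
        somx2 = math.sqrt((1.0 - x) * (1.0 + x))
        for k in range(1, m + 1):
            pmm *= -(2 * k - 1) * somx2
    if l == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm
    # Upward recurrence in l at fixed m.
    for ll in range(m + 2, l + 1):
        pmm, pmmp1 = pmmp1, ((2 * ll - 1) * x * pmmp1
                             - (ll + m - 1) * pmm) / (ll - m)
    return pmmp1

def sph_harm(l, m, theta, phi):
    """Y_l^m(theta, phi); theta is the polar angle, phi the azimuth."""
    if m < 0:
        # Symmetry relation Y_l^{-m} = (-1)^m conj(Y_l^m).
        return (-1) ** (-m) * sph_harm(l, -m, theta, phi).conjugate()
    norm = math.sqrt((2 * l + 1) / (4 * math.pi)
                     * math.factorial(l - m) / math.factorial(l + m))
    return norm * assoc_legendre(l, m, math.cos(theta)) * cmath.exp(1j * m * phi)

print(sph_harm(0, 0, 0.3, 1.0))
print(sph_harm(1, 0, 0.0, 0.0))
print(sph_harm(1, 1, math.pi / 2, 0.0))
```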

2.2 Methods

NAO-PC61-64 models are limited to s and p orbitals. A main goal was therefore to develop the multipole model in VAMP73 (the basis of ParaSurf74) for calculating the solvent effect in the SCRF. An atomic multipole model (up to quadrupole) for calculating the electrostatic properties of molecules, based on electron densities derived from MNDO-like NDDO-based semi-empirical molecular orbital (MO) calculations with minimal s,p,d valence basis sets, was implemented in the SCRF routine of VAMP 9.0 (ref. 73). Structures obtained from the literature were converted from 2D structures to 3D MDL SD files using Molecular Networks’ CORINA75,76. Geometries were optimized in the gas phase using the AM1, AM1*, MNDO/d, or PM3 Hamiltonians with VAMP 9.0 (ref. 73). Solvent effects were calculated by default using the SCRF for the ground and excited states, and ΔGelec was determined by summing the solute-solvent interaction energies obtained from the SCRF calculations. The local surface properties were calculated using ParaSurf09 (ref. 74) for either an iso69 or a sphh70, generated with a marching-cube77 or a shrink-wrap algorithm78, respectively. The MEPs were calculated using either the NAO-PC61-64 technique for s,p-orbitals or the multipole model for s,p,d-orbitals. Multiple linear regression analyses with leave-one-out cross-validation were performed with Tsar 3.3 (ref. 79). With this approach, the predictive R²cv values should be close to the corresponding R² values. An isodensity value of 0.008 e⁻ Å⁻³, which is roughly equivalent to a van der Waals surface, was used with the marching-cube algorithm to obtain the iso69 needed to generate the surface-integral models. Spherical-harmonic expansions were fitted at an isodensity value of 0.0003 e⁻ Å⁻³, which corresponds to the default van der Waals surface in ParaSurf74 for a spherical-harmonic fit, to obtain the sphh70 needed to generate regression models derived from the shrink-wrap method. The statistical performance of the models is expressed by the correlation coefficient R, the coefficient of determination R², the leave-one-out cross-validated coefficient R²cv, the mean signed error (MSE), the mean unsigned error (MUE), and the root-mean-square deviation between experiment and prediction (RMSD). They are presented below in the plots of experimental and predicted values of the physical properties and in the summary tables.
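The statistical measures named above can be illustrated with a small leave-one-out example. The sketch below is a simplification: it fits a straight line to a single descriptor instead of the multiple linear regression performed with Tsar, defines errors as predicted minus experimental (a sign convention assumed here), and uses made-up data.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (single descriptor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def loo_predictions(xs, ys):
    """Predict each point from a model fitted to all other points."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a + b * xs[i])
    return preds

def statistics(ys, preds):
    """Mean signed error, mean unsigned error, RMSD and 1 - SS_res/SS_tot."""
    n = len(ys)
    errors = [p - y for p, y in zip(preds, ys)]
    mean_y = sum(ys) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return {
        "MSE": sum(errors) / n,
        "MUE": sum(abs(e) for e in errors) / n,
        "RMSD": math.sqrt(ss_res / n),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Made-up descriptor values and target property values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
a, b = fit_line(x, y)
fitted = statistics(y, [a + b * xi for xi in x])
cross_validated = statistics(y, loo_predictions(x, y))
print(fitted["R2"], cross_validated["R2"])
```

For a least-squares fit the cross-validated value can never exceed the fitted one, so a large gap between the two is a sign that the model depends strongly on individual data points.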

2.3 Results

2.3.1 Free Energy of Solvation

The data sets used for the free energies of hydration (385 species including 12 anions and 11 cations), the free energies of solvation in octanol (168 neutral compounds), and the free energies of solvation in chloroform (87 neutral compounds) were obtained from the University of Minnesota database80 and are presented in Tables A2, A4, and A5 of the Appendix, respectively. ΔGsolv is obtained by the equation

$$\Delta G_{solv} = \Delta G_{surf} + \Delta G_{elec} \tag{2.6}$$

The electrostatic contribution arises from the interaction between the solvent and the solute: the polar or non-uniform charge distribution of the solute polarizes the solvent, and the polarized solvent can in turn distort the solute.

2.3.1.1 Local Properties

The systematic use of quantum mechanical properties allows the descriptors to be calculated; these are in reality statistical variables describing the distribution of V, IEL, EAL, αL, ηL, and the local electronegativity (χL). The local properties calculated at the surface of doxycycline are shown in Figure 2.1.

Figure 2.1. Local property surfaces for doxycycline calculated with ParaSurf. Panels: molecular electrostatic potential, ionization energy, electron affinity, molecular polarizability, hardness, and electronegativity.

2.3.1.2 Free Energy of Solvation in Water

A total data set⁸⁰ of 385 compounds was used to produce a series of eight models for the solvation free energy in water (ΔGsolv(H2O)), using the AM1, AM1*, MNDO/d, and PM3 Hamiltonians for calculating ΔGelec. The iso⁶⁹ and the sphh⁷⁰ are necessary for determining the local properties. The performance of the different models is listed in Table 2.1, which gives the statistics obtained for ΔGsolv(H2O) for the entire data set.

Table 2.1. Statistics for the eight models generated with the entire data set for ΔGsolv(H2O)

Model            Training set
                 MUE    RMSD   R²     R²cv
AM1 (iso)        0.86   1.16   0.99   0.81
AM1 (sphh)       1.04   1.31   0.99   0.82
AM1* (iso)       1.54   2.06   0.98   0.75
AM1* (sphh)      1.49   1.98   0.98   0.71
MNDO/d (iso)     1.62   2.16   0.99   0.68
MNDO/d (sphh)    1.71   2.25   0.98   0.67
PM3 (iso)        1.14   1.47   0.99   0.77
PM3 (sphh)       1.24   1.58   0.99   0.72

For AM1 and PM3, the use of the sphh⁷⁰ reduces the predictive power, increasing the MUE and the RMSD by ≈ 0.1 to 0.2 kcal mol⁻¹. For AM1* and MNDO/d, the change between the iso⁶⁹ and the sphh⁷⁰ is below 0.1 kcal mol⁻¹ and not systematic, so there is no significant change in the predictive power for ΔGsolv(H2O). One of the most significant observations here is the high RMSD values, which range from 1.16 to 2.25 kcal mol⁻¹ (i.e., more than one ΔGsolv unit), with an average of 1.75 kcal mol⁻¹. Paradoxically, despite these high RMSD values the correlation is fairly strong, with R² values of 0.98 to 0.99. This implies that there is a real fitting problem for these models, which could be caused by the data quality, by one of the local properties used, or by the molecular surface. To clarify the role played by each of these factors, a histogram (Figure 2.2) of the RMSD obtained for the different Hamiltonians and surfaces was constructed.


Figure 2.2. Hamiltonians and surfaces versus RMSD.

The histogram shows that, in contrast to the AM1 and PM3 Hamiltonians, the RMSD is high for AM1* and MNDO/d with both the iso⁶⁹ and the sphh⁷⁰. This is probably due to the lack of αL, which is not implemented in ParaSurf⁷⁴ for these Hamiltonians. The histogram thus reveals only the fundamental role played by αL in the models; it gives no clear information about the data quality or the molecular surface. For this reason, the experimental values were plotted against the predicted values of ΔGsolv(H2O) (Figure 2.3).

Figure 2.3. Experimental and calculated ΔGsolv(H2O) using the iso and the AM1 Hamiltonian for the entire training data set. N = 385, MSE = 0.00, MUE = 0.86, RMSD = 1.16, R² = 0.99, R²cv = 0.81.


The experimental and calculated values are listed in Table A2 of the Appendix.

Figure 2.3 clearly shows the role played by each type of molecule (neutral and ionic). The sizeable gap between the ΔGsolv(H2O) of ions and those of neutral molecules leads to the formation of two clusters: one consisting of the neutral molecules, with points close to each other, and the other, far removed, formed by the ionic species. This is probably because the ΔGsolv(H2O) of ions are strongly influenced by large electrostatic contributions. Two widely separated clusters also inflate R², because the variance between the clusters dominates the total variance; this explains how R² ≈ 0.99 can coexist with an RMSD above 1 kcal mol⁻¹. Because data with the two-cluster structure shown in Figure 2.3 are poorly suited to linear regression, all compounds without permanent charges were selected from the data set⁸⁰ and used to generate further models for ΔGsolv(H2O). This subset consists of 362 compounds, all in their neutral forms.

The local contribution to ΔGsolv(H2O) was determined using equation 2.7, which was obtained by performing a multiple linear regression of the five local properties calculated with ParaSurf⁷⁴ for the iso⁶⁹.

\[
\begin{aligned}
f\bigl(\Delta G_{\mathrm{surf}}(\mathrm{H_2O}),\mathrm{neutral}\bigr)(r) ={}
 & -1.5397\times10^{-2}\,[V(r)]^{1/2} - 6.1805\times10^{-8}\,[V(r)]^{3} \\
 & - 5.8558\times10^{-1}\,[\alpha_L(r)]^{3/2} + 6.2882\times10^{-1}\,[\alpha_L(r)]^{3} \\
 & + 9.2847\times10^{-3}\,[\eta_L(r)]^{1/2} + 3.5446\times10^{-8}\,[V(r)\,\mathrm{IE}_L(r)]^{3/2} \\
 & + 1.2702\times10^{-3}\,[V(r)\,\mathrm{EA}_L(r)]^{1/2} - 4.5985\times10^{-5}\,V(r)\,\mathrm{EA}_L(r) \\
 & - 7.2025\times10^{-11}\,[V(r)\,\mathrm{EA}_L(r)]^{5/2} - 4.2172\times10^{-6}\,[V(r)\,\alpha_L(r)]^{5/2} \\
 & - 7.3925\times10^{-15}\,[\mathrm{IE}_L(r)\,\eta_L(r)]^{5/2} - 9.5554\times10^{-6}\,V(r)\,\mathrm{IE}_L(r)\,\alpha_L(r) \\
 & + 3.9030\times10^{-15}\,[V(r)\,\mathrm{IE}_L(r)\,\alpha_L(r)]^{3} + 1.2124\times10^{-8}\,V(r)\,\mathrm{IE}_L(r)\,\eta_L(r) \\
 & - 3.8282\times10^{-19}\,[V(r)\,\mathrm{IE}_L(r)\,\eta_L(r)]^{5/2} + 5.2734\times10^{-23}\,[V(r)\,\mathrm{IE}_L(r)\,\eta_L(r)]^{3} \\
 & + 2.7618\times10^{-10}\,[V(r)\,\mathrm{EA}_L(r)\,\alpha_L(r)]^{5/2} - 1.0478\times10^{-12}\,[V(r)\,\mathrm{EA}_L(r)\,\alpha_L(r)]^{3} \\
 & + 1.3501\times10^{-20}\,[V(r)\,\mathrm{EA}_L(r)\,\eta_L(r)]^{3} + 3.9589\times10^{-24}\,[\mathrm{IE}_L(r)\,\mathrm{EA}_L(r)\,\eta_L(r)]^{3} \\
 & + 6.6954\times10^{-7}\,V(r)\,\mathrm{EA}_L(r)\,\alpha_L(r)\,\eta_L(r) - 0.049879
\end{aligned}
\tag{2.7}
\]

The above equation, obtained with the AM1 Hamiltonian, is linear in its descriptors (of the form ax + by + ... + c) and contains 21 terms plus a constant. V, αL and ηL each play a direct role in the model, that is, they also appear on their own as single-property terms. V appears in 16 of the 21 terms, confirming its significant role in the prediction of intermolecular interaction energies. EAL, ηL, IEL and αL each appear in eight terms.
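How a local function of this kind turns into a molecular contribution can be sketched numerically. The Python code below is a minimal sketch under stated assumptions, not the ParaSurf implementation: it assumes that the surface is available as a set of points with associated area elements and that the surface integral is approximated by the sum of f(r_i) times the area a_i. The function names, the two-term local function and its coefficients are hypothetical placeholders and are not taken from equation 2.7.

```python
import numpy as np

def surface_integral(local_values, areas):
    """Approximate the surface integral of f as sum_i f(r_i) * a_i
    (simple quadrature over surface points; assumed discretization)."""
    local_values = np.asarray(local_values, dtype=float)
    areas = np.asarray(areas, dtype=float)
    if local_values.shape != areas.shape:
        raise ValueError("need one area element per surface point")
    return float(np.sum(local_values * areas))

def example_local_function(V, EA, coeffs):
    """Hypothetical local function with one single-property term, one
    product term and a constant: f = c_v*V + c_vea*V*EA + c0."""
    c_v, c_vea, c0 = coeffs
    V = np.asarray(V, dtype=float)
    EA = np.asarray(EA, dtype=float)
    return c_v * V + c_vea * V * EA + c0

# Three made-up surface points with made-up local properties and areas.
V = [1.0, -1.0, 2.0]
EA = [0.0, 1.0, 1.0]
areas = [1.0, 2.0, 3.0]
f_values = example_local_function(V, EA, coeffs=(2.0, 0.5, 0.1))
dg_surf = surface_integral(f_values, areas)
```

Because the regression is linear in the coefficients, each term of the local function can be integrated over the surface separately, and the resulting integrals then serve as the descriptors of the multiple linear regression.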

To assess the risk of overtraining the models, the free energy data for the neutral compounds were randomized, and 75% of the data were used to construct another model by the same procedure as above. The resulting regression equation consists of 18 terms and gave R² = 0.94 and R²cv = 0.86. The equation of the model obtained from all 362 neutral compounds contains 21 terms, with R² = 0.92 and R²cv = 0.85. Another remarkable fact is that ultimately only five real variables are used,⁶⁸ so over-fitting should not be a problem.
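The subset check described above can be illustrated as follows. This Python sketch is not the procedure used in the thesis: it assumes an ordinary least-squares model with a fixed descriptor set, draws a random 75% subset, refits, and, as an addition that is not described in the text, also scores the 25% that were left out. The function names are hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against y."""
    ss_res = float(np.sum((y - y_hat) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

def random_subset_check(X, y, fraction=0.75, seed=0):
    """Refit on a random subset and report R2 on the subset and on the rest."""
    rng = np.random.default_rng(seed)
    n = len(y)
    order = rng.permutation(n)          # shuffled compound indices
    n_train = int(round(fraction * n))
    train, held = order[:n_train], order[n_train:]
    A_train = np.column_stack([np.ones(n_train), X[train]])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    A_held = np.column_stack([np.ones(len(held)), X[held]])
    return (r_squared(y[train], A_train @ coef),
            r_squared(y[held], A_held @ coef),
            n_train)
```

If the statistics of the refitted model stay close to those of the full model, as reported above (R² of 0.94 against 0.92 and R²cv of 0.86 against 0.85), the fit is unlikely to depend on a few individual compounds.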

The plot of the predicted as a function of the experimental values of ΔGsolv(H2O) for the neutral compounds is shown in Figure 2.4.

Figure 2.4. Best model of ΔGsolv(H2O) for the neutral compounds obtained with the iso and the AM1 Hamiltonian. N = 362, MSE = 0.00, MUE = 0.67, RMSD = 0.87, R² = 0.92, R²cv = 0.85.

There are no strong outliers, which suggests that the model is robust. Moreover, in contrast to the total data set, the data for the neutral compounds are well described by linear regression.

Table A3 of the Appendix contains the experimental and predicted values.

Table 2.2 contains the statistical values for ΔGsolv(H2O) for the neutral compounds.


Table 2.2. Statistical significances of all models obtained with the neutral compounds for ΔGsolv(H2O)

Model            Training set
                 MUE    RMSD   R²     R²cv
AM1 (iso)        0.67   0.87   0.92   0.85
AM1 (sphh)       0.72   0.97   0.90   0.78
AM1* (iso)       0.84   1.10   0.87   0.78
AM1* (sphh)      1.08   1.39   0.80   0.62
MNDO/d (iso)     0.85   1.10   0.87   0.81
MNDO/d (sphh)    1.12   1.44   0.79   0.67
PM3 (iso)        0.72   0.95   0.91   0.85
PM3 (sphh)       0.81   1.10   0.88   0.79

With AM1* and MNDO/d, the sphh⁷⁰ reduces the predictive power: relative to the iso⁶⁹, the RMSD increases by ≈ 0.30 kcal mol⁻¹. For AM1 and PM3 the change in RMSD is smaller, 0.10 to 0.15 kcal mol⁻¹, when the iso⁶⁹ is replaced by the sphh⁷⁰. The average RMSD over all the models is 1.12 kcal mol⁻¹ for the neutral compounds, compared with 1.75 kcal mol⁻¹ for the full set of ionic and neutral compounds. Removing the ionic compounds from the data set thus lowers the average RMSD by ≈ 0.60 kcal mol⁻¹.
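The averages quoted above follow directly from the RMSD columns of Tables 2.1 and 2.2. The short Python check below simply recomputes them from the tabulated values; the variable names are illustrative.

```python
# RMSD values (kcal/mol) copied from Table 2.1 (ionic + neutral compounds).
rmsd_all = {
    "AM1 (iso)": 1.16, "AM1 (sphh)": 1.31,
    "AM1* (iso)": 2.06, "AM1* (sphh)": 1.98,
    "MNDO/d (iso)": 2.16, "MNDO/d (sphh)": 2.25,
    "PM3 (iso)": 1.47, "PM3 (sphh)": 1.58,
}
# RMSD values (kcal/mol) copied from Table 2.2 (neutral compounds only).
rmsd_neutral = {
    "AM1 (iso)": 0.87, "AM1 (sphh)": 0.97,
    "AM1* (iso)": 1.10, "AM1* (sphh)": 1.39,
    "MNDO/d (iso)": 1.10, "MNDO/d (sphh)": 1.44,
    "PM3 (iso)": 0.95, "PM3 (sphh)": 1.10,
}

mean_all = sum(rmsd_all.values()) / len(rmsd_all)
mean_neutral = sum(rmsd_neutral.values()) / len(rmsd_neutral)
reduction = mean_all - mean_neutral  # effect of removing the ionic species

# Change in RMSD on going from the iso to the sphh surface, per Hamiltonian.
surface_effect = {
    h: rmsd_neutral[f"{h} (sphh)"] - rmsd_neutral[f"{h} (iso)"]
    for h in ("AM1", "AM1*", "MNDO/d", "PM3")
}
```

The per-Hamiltonian differences in `surface_effect` are all positive, which is the surface dependence discussed for the neutral compounds.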

The effect of the molecular surface on the models is portrayed in the histogram below (Figure 2.5) for the different models obtained with all molecules (ionic + neutral) and with the neutral molecules selected from the total data set.⁸⁰

Figure 2.5. Hamiltonians and surfaces versus RMSD for ionic + neutral compounds and the neutral compounds exclusively.

According to the histogram, the highest RMSD values are obtained for the models generated with the ionic and neutral compounds together. For the neutral compounds, the RMSD always increases with the sphh⁷⁰. The neutral compounds therefore provide more information about the dependence of the solvation models in water on the molecular surface: with the sphh⁷⁰ the RMSD always increases, and the predictive power decreases accordingly. Solvation models for water generated with the neutral compounds are thus clearly surface-dependent, in contrast to the models obtained with both ionic and neutral compounds.

In order to check the behavior of this new approach with solvents other than water, respective models of the free energy of solvation with octanol (ΔGsolv(octanol)) and chloroform (ΔGsolv(CHCl3)) were generated.

2.3.1.3 Free Energy of Solvation in Octanol

Models for ΔGsolv(octanol) were developed using the 168 compounds from the total data set⁸⁰ whose experimental values of ΔGsolv(octanol) are known.

The regression equation based on the raw data for the local contribution to ΔGsolv(octanol) at the molecular surface using the iso⁶⁹ is shown in equation 2.8 for the AM1 Hamiltonian.

\[
\begin{aligned}
f\bigl(\Delta G_{\mathrm{surf}}(\mathrm{octanol})\bigr)(r) ={}
 & 4.9463\times10^{-3}\,V(r) - 2.7847\times10^{-1}\,[\alpha_L(r)]^{3/2} \\
 & + 1.5926\times10^{-9}\,[\eta_L(r)]^{3} + 1.0587\times10^{-4}\,V(r)\,\mathrm{EA}_L(r) \\
 & - 1.0106\times10^{-2}\,V(r)\,\alpha_L(r) - 4.2449\times10^{-8}\,[V(r)\,\alpha_L(r)]^{3} \\
 & - 1.5672\times10^{-12}\,[V(r)\,\eta_L(r)]^{5/2} - 1.1331\times10^{-17}\,[V(r)\,\mathrm{IE}_L(r)\,\mathrm{EA}_L(r)]^{5/2} \\
 & + 3.1301\times10^{-21}\,[V(r)\,\mathrm{IE}_L(r)\,\mathrm{EA}_L(r)]^{3} + 4.6302\times10^{-8}\,[V(r)\,\mathrm{IE}_L(r)\,\alpha_L(r)]^{3/2} \\
 & + 3.8954\times10^{-23}\,[V(r)\,\mathrm{IE}_L(r)\,\eta_L(r)]^{3} + 3.5242\times10^{-3}\,[V(r)\,\mathrm{EA}_L(r)\,\alpha_L(r)]^{1/2} \\
 & - 2.0364\times10^{-4}\,V(r)\,\mathrm{EA}_L(r)\,\alpha_L(r) - 1.6487\times10^{-6}\,\mathrm{IE}_L(r)\,\alpha_L(r)\,\eta_L(r) \\
 & + 1.3417\times10^{-16}\,[V(r)\,\mathrm{EA}_L(r)\,\alpha_L(r)\,\eta_L(r)]^{5/2} + 0.63929
\end{aligned}
\tag{2.8}
\]

This equation contains 15 terms. V appears in 12 of them and is the dominant property, as for ΔGsolv(H2O). αL, EAL, IEL and ηL appear in eight, six, five and five terms, respectively. As in the solvation model in water for the neutral compounds, V, αL and ηL play direct roles in this model.

Figure 2.6 shows the performance of the model developed for ΔGsolv(octanol) using the AM1 Hamiltonian and the iso⁶⁹.


Figure 2.6. Schematic view of the model of ΔGsolv(octanol) performed with the iso and the AM1 Hamiltonian. N = 168, MSE = 0.00, MUE = 0.57, RMSD = 0.73, R² = 0.92, R²cv = 0.84.

The points lie close together, and none of them deviates significantly from the line y = x. This implies a fairly good correlation between the calculated and the experimental values.

In Table A4 of the Appendix the experimental and predicted values are presented.

Table 2.3 gives the statistics for ΔGsolv(octanol) for the 168 compounds.

Table 2.3. Measures of performance of all models generated for ΔGsolv(octanol)

Model            Training set
                 MUE    RMSD   R²     R²cv
AM1 (iso)        0.57   0.73   0.92   0.84
AM1 (sphh)       0.66   0.91   0.88   0.70
AM1* (iso)       0.89   1.16   0.80   0.61
AM1* (sphh)      0.85   1.17   0.79   0.58
MNDO/d (iso)     0.81   1.09   0.82   0.74
MNDO/d (sphh)    0.87   1.15   0.80   0.60
PM3 (iso)        0.56   0.74   0.92   0.85
PM3 (sphh)       0.64   0.88   0.88   0.79

All the RMSD values are below 1 kcal mol⁻¹ for the AM1 and PM3 Hamiltonians; moreover, the use of the sphh⁷⁰ reduces the predictive power, increasing the RMSD by ≈ 0.15 to 0.20 kcal mol⁻¹ compared to the iso⁶⁹. In contrast, all the RMSD values are above 1 kcal mol⁻¹ for AM1* and MNDO/d, and no significant change is observed in the predictive power of ΔGsolv(octanol) when using the iso⁶⁹ or the sphh⁷⁰. High RMSD values are thus obtained with AM1* and MNDO/d for both surfaces.

2.3.1.4 Free Energy of Solvation in Chloroform

Surface-integral models for ΔGsolv(CHCl3) at the molecular surface were built using 87 compounds extracted from the total data set.⁸⁰

The equation obtained by performing a multiple linear regression with Tsar 3.3⁷⁹ on data containing pre-calculated local properties is given below (equation 2.9) for the AM1 Hamiltonian and the iso⁶⁹.

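\[
\begin{aligned}
f\bigl(\Delta G_{\mathrm{surf}}(\mathrm{CHCl_3})\bigr)(r) ={}
 & 5.5132\times10^{-3}\,V(r) + 1.1206\times10^{-4}\,\mathrm{IE}_L(r) + 6.4912\times10^{-1}\,[\alpha_L(r)]^{3} \\
 & - 1.8137\times10^{-11}\,[V(r)\,\mathrm{EA}_L(r)]^{5/2} + 6.1496\times10^{-15}\,[V(r)\,\eta_L(r)]^{3} \\
 & - 5.0234\times10^{-6}\,[\mathrm{IE}_L(r)\,\alpha_L(r)]^{2} - 3.3045\times10^{-5}\,V(r)\,\mathrm{IE}_L(r)\,\alpha_L(r) \\
 & + 2.4629\times10^{-13}\,[V(r)\,\mathrm{EA}_L(r)\,\alpha_L(r)\,\eta_L(r)]^{2} + 0.47847
\end{aligned}
\tag{2.9}
\]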

The above equation contains eight terms, five of which contain V. αL, IEL, EAL, and ηL appear in four, three, two and two terms, respectively. In this case, V, αL, and IEL play a direct role in the model.

The predicted values of ΔGsolv(CHCl3) are given in Figure 2.7 as a function of the experimental values.


Figure 2.7. Graphical representation of ΔGsolv(CHCl3) for the AM1 Hamiltonian and the iso. N = 87, MSE = 0.00, MUE = 0.46, RMSD = 0.62, R² = 0.91, R²cv = 0.74.

Here, too, the experimental and the predicted values of ΔGsolv(CHCl3) are clearly correlated.

In Table A5 of the Appendix, the respective experimental and calculated values are presented.

Table 2.4 gives the performances of ΔGsolv(CHCl3) for the 87 compounds.

Table 2.4. Performances of all the solvation free energy models in chloroform

Model            Training set
                 MUE    RMSD   R²     R²cv
AM1 (iso)        0.46   0.62   0.91   0.74
AM1 (sphh)       0.59   0.78   0.86   0.70
AM1* (iso)       0.64   0.84   0.84   0.63
AM1* (sphh)      0.70   0.92   0.81   0.60
MNDO/d (iso)     0.49   0.67   0.90   0.81
MNDO/d (sphh)    0.72   0.95   0.80   0.70
PM3 (iso)        0.58   0.83   0.85   0.71
PM3 (sphh)       0.54   0.80   0.86   0.67

With the exception of AM1* and MNDO/d with the sphh⁷⁰, for which the RMSD values are close to 1 kcal mol⁻¹, the RMSD values of all models are clearly below 1 kcal mol⁻¹. The use of the sphh⁷⁰ reduces the predictive power for AM1 and MNDO/d, for which the RMSD increases by ≈ 0.20 kcal mol⁻¹ (AM1) and ≈ 0.30 kcal mol⁻¹ (MNDO/d) compared to the iso⁶⁹. AM1* and PM3 are only weakly affected: there is no significant change in the predictive power of ΔGsolv(CHCl3) when using the iso⁶⁹ or the sphh⁷⁰.

The histogram below (Figure 2.8) gives the variable importance for ΔGsolv(H2O), ΔGsolv(octanol), and ΔGsolv(CHCl3), obtained with the AM1 Hamiltonian and the iso⁶⁹.

Figure 2.8. Summary of the occurrence of each local property in the solvation models generated with AM1 and the iso.

It appears that, for the same Hamiltonian and surface, the models are strongly dominated by the contribution of V. For QSPR models, V seems to be the most important parameter (i.e., most useful information can be stored in the MEP coefficients), but the other local properties also are strongly required.

The solvation models developed above were validated by calculating different partition coefficients.

2.3.2 The Partition Coefficient: logP

The partitioning of an organic solute between two phases, one aqueous and the other organic, is an important parameter on which many phenomena of biological and medicinal chemistry, more precisely drug delivery, binding, and clearance, are extremely dependent. A reliable interpretation of the solvation of organic solutes in organic media would significantly impact conformational analysis in the condensed phase and the ability to predict molecular aggregation. The models obtained from the ΔGsolv(H2O), ΔGsolv(octanol), and ΔGsolv(CHCl3) were validated by calculating the octanol-water and chloroform-water partition coefficient logP as


\[
\log P_{\mathrm{solvent/H_2O}} = -\frac{\Delta G_{\mathrm{solv}}(\mathrm{solvent}) - \Delta G_{\mathrm{solv}}(\mathrm{H_2O})}{2.303\,RT}, \tag{2.10}
\]

where R is the gas constant, T equals 298 K, and ΔGsolv(solvent) − ΔGsolv(H2O) is the transfer free energy of a given solute to the specified solvent from water.
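Equation 2.10 is simple enough to evaluate directly. The Python sketch below assumes that both solvation free energies are given in kcal mol⁻¹ and uses R = 1.987204 × 10⁻³ kcal mol⁻¹ K⁻¹; the function name and the example values are illustrative and are not taken from the data sets.

```python
R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1

def log_p_from_solvation(dg_solvent, dg_water, temperature=298.0):
    """Partition coefficient from solvation free energies (equation 2.10).

    dg_solvent and dg_water are solvation free energies in kcal/mol in the
    organic solvent and in water; a more negative value in the organic
    solvent gives a positive logP.
    """
    return -(dg_solvent - dg_water) / (2.303 * R_KCAL * temperature)

# Made-up example: a solute solvated 2 kcal/mol better in octanol than in water.
example_log_p = log_p_from_solvation(dg_solvent=-6.0, dg_water=-4.0)
```

At 298 K the denominator 2.303RT is about 1.36 kcal mol⁻¹, so an error of about 1.4 kcal mol⁻¹ in the transfer free energy corresponds to an error of roughly one logP unit.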

2.3.2.1 The Octanol-Water Partition Coefficient: logPow

Lipophilicity is a molecular property that plays an active role in the transport of bioactive molecules via their corresponding receptors¹² and in the environmental fate of organic molecules.⁸¹ Lipophilicity, which is usually approximated by the logarithm of the partition coefficient, logP, of a compound determined in the octanol/water system, is therefore a determining factor in QSAR studies.⁸²

2.3.2.1.1 LogPow for Small Molecules

The data set used for calculating the logPow was obtained from the literature.⁸³ It consists of 157 small molecules, listed in Table A6 of the Appendix, which were not included in the data set used to fit the models.

The theoretical logPow obtained using the iso⁶⁹ and the AM1 Hamiltonian is shown graphically in Figure 2.9.


Figure 2.9. Relationship between the theoretical and the experimental logPow for the models generated with the AM1 Hamiltonian and the iso. N = 157, MSE = -0.09, MUE = 0.46, RMSD = 0.59, R² = 0.92.

As shown in Figure 2.9, the models for ΔGsolv(H2O) and ΔGsolv(octanol) show good predictive power when applied to a data set of small compounds.

In Table A6 of the Appendix, the experimental and calculated values of logPow are summarized.

In Table 2.5 the statistics for logPow for the 157 compounds are summarized.

Table 2.5. Performances with a logPow validation

Model            Validation set
                 MSE     MUE    RMSD   R²
AM1 (iso)        -0.09   0.46   0.59   0.92
AM1 (sphh)       -0.04   0.50   0.65   0.90
AM1* (iso)       -0.05   0.61   0.79   0.86
AM1* (sphh)      0.26    0.59   0.84   0.86
MNDO/d (iso)     0.03    0.56   0.75   0.90
MNDO/d (sphh)    0.04    0.41   0.63   0.92
PM3 (iso)        0.04    0.42   0.57   0.93
PM3 (sphh)       -0.22   0.53   0.70   0.90

The RMSD values vary from 0.57 to 0.84 logPow unit. Validating the models by calculating the logPow for a data set of small compounds thus provides a useful and informative assessment of their likely reliability. Very accurate logPow values are obtained with AM1, MNDO/d, and PM3. A comparison of the RMSD values obtained with the iso⁶⁹ and the sphh⁷⁰ shows that for PM3 the use of the sphh⁷⁰ reduces the predictive power, increasing the RMSD by ≈ 0.13 logPow unit. In contrast, with MNDO/d the RMSD decreases by ≈ 0.12 logPow unit compared to the iso⁶⁹. For AM1 and AM1*, there is no significant change in the predictive power of the logPow for small molecules when using the iso⁶⁹ or the sphh⁷⁰.

2.3.2.1.2 LogPow for Large Molecules

The large-molecule data set was obtained from Exploring QSAR.⁸⁴ It consists of 1842 molecules, excluding zwitterions. As for the small molecules, the solvation models developed were used to calculate the logPow for the large molecules. However, the calculated values were very far from the experimental ones. To identify the source of the problem, 80% of the logPow data⁸⁴ for the large molecules were randomly selected and used as a training set to build SIMs, and the remaining 20%, listed in Table A7 of the Appendix, were used as a test set. The models obtained from these multiple linear regression analyses predict much better than the solvation models.

Figure 2.10 shows the performance of the logPow model obtained from the SIM descriptors using the AM1 Hamiltonian and the iso⁶⁹.

Figure 2.10. Good prediction of the logPow values for a test set by multiple linear regression analysis. N = 368, MSE = -0.06, MUE = 0.48, RMSD = 0.62, R² = 0.84, R²cv = 0.77.

The primary information extracted from Figure 2.10 is that no point behaves anomalously, for example by lying far from the line y = x (i.e., linear regression can be fairly accurate for every type of compound).


The experimental and calculated values are listed in Table A7 of the Appendix.

The statistics for the logPow for the data set of large compounds are listed in Table 2.6 below.

Table 2.6. Statistical performances of the logPow models generated

Model            Test set               Training set
                 MUE    RMSD   R²       R²cv
AM1 (iso)        0.48   0.62   0.84     0.77
AM1 (sphh)       0.54   0.71   0.79     0.72
AM1* (iso)       0.65   0.81   0.73     0.12
AM1* (sphh)      0.75   0.98   0.59     0.46
MNDO/d (iso)     0.54   0.70   0.79     0.73
MNDO/d (sphh)    0.56   0.73   0.78     0.70
PM3 (iso)        0.51   0.65   0.82     0.75
PM3 (sphh)       0.54   0.69   0.80     0.73

Table 2.6 shows that the multiple linear regression approach gives more promising results in this small investigation. All the RMSD values are less than one logPow unit; the largest, for AM1* with the sphh⁷⁰, is close to one logPow unit. For AM1*, the use of the sphh⁷⁰ reduces the predictive power, increasing the MUE by ≈ 0.10 logPow unit and the RMSD by ≈ 0.20 logPow unit compared to the iso⁶⁹. For AM1, MNDO/d and PM3, there is no significant change in the predictive power of the logPow for large molecules when using the iso⁶⁹ or the sphh⁷⁰. For AM1*, seven iodine compounds were also removed because of the poor reproducibility of iodine with AM1*.

2.3.2.2 The Chloroform-Water Partition Coefficient: logPcw

Chloroform, a substituted derivative of methane, is an organic solvent with distinctive properties. Because of this, the chloroform-water partition coefficient (logPcw) is useful⁸⁵ for predicting the lipophilicity and biological activity of organic ligands. The data set used for calculating the logPcw was obtained from the literature.⁸⁵,⁸⁶ It consists of 30 compounds, from which the 23 compounds included in the data used to build the models were removed. The remaining seven compounds, listed in Table A8 of the Appendix, were then used for the validation.

In Figure 2.11 the theoretical logPcw obtained using the iso⁶⁹ and the AM1 Hamiltonian is presented.


Figure 2.11. Plot of theoretical logPcw values obtained with the AM1 Hamiltonian and the iso. N = 7, MSE = 0.30, MUE = 0.46, RMSD = 0.54, R² = 0.95.

Here, there is an excellent agreement between the experimental and the calculated logPcw values.

The experimental and calculated values are listed in Table A8 of the Appendix.

Table 2.7 provides the statistical significance of logPcw for the seven compounds.

Table 2.7. Performances with a logPcw validation

Model            Validation set
                 MSE    MUE    RMSD   R²
AM1 (iso)        0.30   0.46   0.54   0.95
AM1 (sphh)       0.15   0.39   0.59   0.91
AM1* (iso)       0.29   0.66   0.81   0.82
AM1* (sphh)      0.93   0.93   1.02   0.94
MNDO/d (iso)     0.69   0.95   1.10   0.77
MNDO/d (sphh)    0.65   0.80   0.88   0.90
PM3 (iso)        0.12   0.73   0.90   0.84
PM3 (sphh)       0.45   0.49   0.58   0.98

The statistics listed in Table 2.7 indicate that the models obtained are reliable enough to predict ΔGsolv(CHCl3) accurately. For AM1*, the use of the sphh⁷⁰ reduces the predictive power, increasing the RMSD by ≈ 0.20 logPcw unit compared to the iso⁶⁹. For MNDO/d and PM3, the use of the sphh⁷⁰ considerably increases the predictive power, whereas there is no substantial change for AM1. One particularity here is that the highest R² (0.98, for PM3) is obtained with the sphh⁷⁰, which also gives R² = 0.94 for AM1*.


2.4 Discussion

As mentioned in the introduction, the main objective of this research project was to develop accurate models for predicting ΔGsolv of compounds in different solvents. To this end, the solvent effects on the calculated ΔGsolv of organic compounds were studied. ΔGsolv is made up of two parts: the electrostatic and the nonpolar contributions. The electrostatic contribution, which depends on the charge method, was evaluated using the SCRF calculations. Solvent effects were found to be more significant for ionic compounds: the ΔGsolv of ions are much larger than those of neutral solutes and are dominated by large electrostatic contributions. The nonpolar part, which depends on the surface method, was evaluated using multiple linear regression.

Calculating the theoretical logPow using the standard formula gives a very good correlation between the experimental and the calculated values for small molecules, as shown in Table 2.5. Here the mean unsigned errors and the RMSDs are uniformly small. The main exception is AM1*, which gave MUE and RMSD values of 0.61 and 0.79 for the iso⁶⁹ and 0.59 and 0.84 for the sphh⁷⁰. However, this approach has problems when applied to large molecules, as was also noted by P. Kollman et al.⁸⁷ A likely reason is that many of the interior atoms of large molecules are completely buried, so that the model cannot provide very accurate ΔGsolv. In addition, the solvation models presented here were built with small molecules. Using large molecules (whose experimental values of ΔGsolv are not easy to obtain) to generate the solvation models might improve the accuracy of the theoretical logP for large molecules.

The major advantage of these solvation models is that they can be used to calculate the logPow or the logPcw accurately for a data set of small molecules. They are therefore applicable to small- and medium-sized compounds in which few atoms are totally buried. Their disadvantage is that this scope is limited: they cannot be applied to a data set of large molecules. It follows that, before the solvation free energies obtained from these models are used as parameters in QSAR studies, one must first establish that the data in question lie within the range for which the models are reliable.

To avoid the difficulties mentioned above, it would be better to calculate the logPow for large molecules by linear regression. This gives, in general, a good correlation between the experimental and the calculated values. Table 2.8 compares the theoretical and the simulated logPow obtained with AM1 and AM1* for large molecules.


Table 2.8. Comparison of the performances of the logPow obtained for large molecules (compounds of the test set)

Model          logPow(calc)                      logPow(sim)
               MSE      MUE    RMSD   R²         MSE     MUE    RMSD   R²
AM1 (iso)      -0.36    0.80   1.15   0.53       -0.06   0.48   0.62   0.84
AM1 (sphh)     -0.017   1.47   3.37   0.25       -0.06   0.54   0.71   0.79
AM1* (iso)     -0.47    1.72   2.77   0.08       -0.02   0.65   0.81   0.73
AM1* (sphh)    -0.14    1.82   2.66   0.11       -0.09   0.75   0.98   0.59

The experimental, the theoretical and the predicted values of logPow obtained using the AM1 Hamiltonian, the iso⁶⁹ and the sphh⁷⁰ for the large molecules sulfanilamide and 3,5-diiodosalicylic acid are listed in Table 2.9.

Figure 2.12. Sulfanilamide.

Figure 2.13. 3,5-Diiodosalicylic acid.

Table 2.9. Comparison of logPow values obtained from different models for selected compounds of the test set

Compound                   Experiment   Model         Water-octanol models   SIMlogPow models
Sulfanilamide              -0.62        AM1 (iso)     3.77                   0.003
                                        AM1 (sphh)    -5.09                  -0.96
                                        AlogPs        -0.16 (single value)
3,5-Diiodosalicylic acid   4.56         AM1 (iso)     9.48                   3.32
                                        AM1 (sphh)    43.45                  3.33
                                        AlogPs        3.13 (single value)

For sulfanilamide, the deviations between the experimental and the theoretical logPow are 4.39 and 4.47 for the iso⁶⁹ and the sphh⁷⁰, respectively; for the logPow predicted using multiple linear regression, these deviations are 0.62 and 0.34, respectively.

For 3,5-diiodosalicylic acid, the deviations between the experimental and the theoretical logPow are 4.92 and 38.9 for the iso⁶⁹ and the sphh⁷⁰, respectively; for the logPow predicted using multiple linear regression, these deviations are 1.24 and 1.23, respectively.

Compared to the logPow obtained from the SIMs, the logPow of these compounds calculated directly from the solvation models are significantly worse because of the reason mentioned previously. There is not a big difference between the logPow obtained from the SIMlogPow models and those obtained from a publicly available model (AlogPs). The surface plays an important role in calculating the nonpolar contribution of ΔGsolv and in predicting the logPow. The logPow obtained by linear regression is thus surface-method dependent as is the nonpolar part of ΔGsolv.

Three main factors (the local properties, the data quality, and the molecular surface) seem to govern the predictive ability of these solvation models. The absence of αL for AM1* and MNDO/d affects the calculations performed with these Hamiltonians, which generally give higher RMSD values than AM1 and PM3. Because these Hamiltonians are particularly suited to compounds that require d orbitals, their atomic parameters may also be a source of the large errors obtained. AM1* is an extension of the AM1 semi-empirical technique; in its parameterization, the addition of heats of reaction can have a considerable impact on the heats of formation, which should thus be treated with some caution and weighted less heavily.

Although V dominates consistently, as shown in Figure 2.8, all five local properties together give sufficient information for a full description of a molecule and of its intermolecular binding properties. V and αL are the properties that always appear on their own in each of the solvation models for water, octanol and chloroform. In contrast, the other local properties (EAL, IEL and ηL) appear on their own only in particular models and otherwise in combination with other local properties. Within an SCRF approach, V and αL therefore seem to be the main factors responsible for the interaction between the dissolved molecule and the solvent. One particularity is observed for water: V and αL each appear twice as single-property terms (equation 2.7), whereas in the models for octanol (equation 2.8) and chloroform (equation 2.9) they appear only once in this form. This may be because water can undergo specific interactions (Lewis base, hydrogen-bond donor, hydrogen-bond acceptor, etc.) that differ from those of the organic solvents.

The average correlation coefficients (R²) of the solvation models developed are 0.87, 0.85, and 0.85 for water (neutral molecules), octanol and chloroform, respectively. This shows that these models are sufficiently robust to be used for deriving further properties. Another highlight is that their correlation coefficients, presented in Tables 2.2, 2.3, and 2.4, have similar magnitudes for the same Hamiltonian and surface (i.e., the values do not differ greatly). This supports the reliability of these models, because each Hamiltonian is characterized by its own atomic parameters, which only in exceptional cases are the same for different Hamiltonians. The MUE values range from 0.67 to 1.12 kcal mol⁻¹ for ΔGsolv(H2O), 0.56 to 0.89 kcal mol⁻¹ for ΔGsolv(octanol), and 0.46 to 0.72 kcal mol⁻¹ for ΔGsolv(CHCl3). Compared to some currently available models, such as the SMn solvation models of Truhlar et al.,⁸⁸,⁸⁹ for which the MUE is 0.49 kcal mol⁻¹, or the GB/PSA models of Friesner et al.,⁹⁰ which give an MUE of 0.60 kcal mol⁻¹, the present models are not as uniformly accurate. Since the models are sufficiently reliable for predicting solvation free energies within a specific theoretical framework, the fact that they also depend on the gas-phase electron density allows an indirect inclusion of solute polarization via α.⁶⁸ These models still face another major problem, the lack of physical interpretation, which remains a characteristic feature of QSAR and QSPR models derived from surface integrals. Finding an adequate physical interpretation for activity or property models is currently a priority in the prediction of desirable properties.⁶⁸

2.5 Conclusions

Combining the pure Coulomb ΔGsolv from SCRF calculations with a local term calculated as the surface integral of a function of local properties leads to a robust model for the heats of solvation that can be validated by checking its performance in predicting logPow or logPcw. The error estimates obtained for individual compounds from their different calculated logPow values helped to identify clearly the types of compounds for which the models are less reliable. The results vary with the Hamiltonian and the surface used but are quite good according to the correlation coefficients. The quality of the surface used for calculating the molecular descriptors plays an important role in predicting the target property. In general, the iso69 gives better results than the sphh70. V and αL play special roles in the solvation models on their own, but also play significant roles in combination with one or more of the other local properties. From the results obtained, ΔGsolv can be considered a local property for certain kinds of organic compounds, and the local ΔGsolv(H2O) seems to be the main parameter for the study of intermolecular interaction sites. Good agreement with experimental results has been obtained, especially for the AM1 Hamiltonian, which gives the most accurate results. More importantly, although it does not have atomic parameters for compounds containing s-, p-, and d-orbitals, it was observed that, in general, the lowest RMSD and the highest R2 are obtained with AM191, which is the standard NDDO92-based semi-empirical MO theory, and the iso69. The question remains: Is AM1 the most appropriate Hamiltonian for such QSPR studies?

2.6 References

1. Hehre, W.; Radom, L.; Schleyer, P. v. R.; Pople, J. A. Ab Initio Molecular Orbital Theory; Wiley: New York, 1986.

2. Clark, T. Modelling the chemistry: Time to Break the Mould? EuroQSAR 2002: Designing drugs and crop protectants; Ford, M., Dearden, J., Eds.; 2003; pp 111-121.

3. Hansch, C.; Maloney, P. P.; Fujita, T.; Muir, R. M. Nature. 1962, 194, 178.

4. Hansch, C.; Fujita, T. J. Am. Chem. Soc. 1964, 86, 1616.

5. Ivanciuc, T.; Ivanciuc, O.; Klein, D. J. Posetic quantitative superstructure/activity relationships (QSSARs) for chlorobenzenes. J. Chem. Inf. Model. 2005, 45, 870-879.

6. Ivanciuc, T.; Ivanciuc, O.; Klein, D. J. Modeling the bioconcentration factors and bioaccumulation factors of polychlorinated biphenyls with posetic quantitative super-structure/activity relationships (QSSAR). Mol. Divers. 2006, 10, 133-145.

7. Ivanciuc, T.; Ivanciuc, O.; Klein, D. J. Prediction of environmental properties for chlorophenols with posetic quantitative super-structure/property relationships (QSSPR). Int. J. Mol. Sci. 2006, 7, 358-374.

8. Gonzalez-Diaz, H.; Gonzalez-Diaz, Y.; Santana, L.; Ubeira, F. M.; Uriarte, E. Proteomics, networks and connectivity indices. Proteomics. 2008, 8, 750-778.

9. Leo, A.; Hansch, C.; Elkins, D. Partition Coefficients And Their Uses. Chem. Rev. 1971, 71, 524-616.

10. Hansch, C.; Leo, A. Exploring QSAR: Fundamentals and Applications in Chemistry and Biology; American Chemical Society: Washington, DC, 1995.

11. Leo, A. J. ClogP; Daylight Chemical Information Systems: Irvine, CA, 1991.

12. Viswanadhan, V. N.; Reddy, M. R.; Bacquet, R. J.; Erion, D. M. Assessment of Methods Used for Predicting Lipophilicity: Application to Nucleosides and Nucleoside Bases. J. Comput. Chem. 1993, 9, 1019-1026.

13. Klopman, G.; Li, J.-Y.; Wang, S.; Dimayuga, M. Computer Automated logP Calculations Based on an Extended Group Contribution Approach. J. Chem. Inf. Comput. Sci. 1994, 34, 752-781.

14. Benson, S. W. Thermochemical Kinetics, 2nd ed.; Wiley: New York, 1976.

15. Clark, T.; McKervey, M. A. Saturated Hydrocarbons. In Comprehensive Organic Chemistry; Barton, D. H. R.; Ollis, W. D., Eds. Pergamon Press: Oxford, 1979; Vol. 1, Chapter 2, pp 37-120.

16. Cohen, N.; Benson, S. W. In Chemistry of alkanes and cycloalkanes., Patai, S.; Rappoport, Z., Eds. Wiley: Chichester, 1992; Chapter 6, p 215.

17. Stein, S. E.; Brown, R. L. Estimation of normal boiling points from group contributions. J. Chem. Inf. Comput. Sci. 1994, 34, 581-587.

18. Constantinou, L.; Gani, R. New group contribution method for estimating properties of pure compounds. AIChE J. 1994, 40, 237-244.

19. Klopman, G.; Wang, S.; Balthasar, D. M. Estimation of aqueous solubility of organic molecules by the group contribution approach. Application to the study of biodegradation. J. Chem. Inf. Comput. Sci. 1992, 32, 474-482.

20. Kuhne, R.; Ebert, R.-U.; Kleint, F.; Schmidt, G.; Schuurmann, G. Group contribution methods to estimate water solubility of organic chemicals. Chemosphere. 1995, 30, 2061-2077.

21. Horn, A. H. C.; Lin, Jr-H.; Clark, T. Theor. Chem. Acc. 2005, 114, 159-168.

22. Rauhut, G.; Clark, T.; Steinke, T. J. Am. Chem. Soc. 1993, 115, 9174-9181.

23. Tomasi, J.; Persico, M. Chem. Rev. 1994, 94, 2027.

24. Cramer, C. J.; Truhlar, D. G. Chem. Rev. 1999, 99, 2161.

25. (a) Huron, M.-J.; Claverie, P. J. Phys. Chem. 1972, 76, 2123. (b) Huron, M.-J.; Claverie, P. J. Phys. Chem. 1974, 78, 1853, 1862.

26. (a) Gomez-Jeria, J. S.; Contreras, R. R. Int. J. Quantum Chem. 1986, 15, 591. (b) Gomez-Jeria, J. S.; Morales-Lagos, D. J. Phys. Chem. 1990, 94, 3790. (c) Morales-Lagos, D.; Gomez-Jeria, J. S. J. Phys. Chem. 1991, 95, 5308.

27. (a) Fox, T.; Rösch, N.; Zauhar, R. J. J. Comput. Chem. 1993, 14, 253. (b) Fox, T.; Rösch, N. Chem. Phys. Lett. 1992, 191, 33. (c) Zauhar, R. J.; Morgan, R. S. J. Comput. Chem. 1988, 9, 171.

28. (a) Klopman, G. Chem. Phys. Lett. 1967, 1, 200. (b) Klopman, G.; Andreozzi, P. Theor. Chim Acta. 1980, 55, 77.

29. (a) Germer, H. A. Theor. Chim. Acta. 1974, 34, 145. (b) Germer, H. A. Theor. Chim. Acta. 1974, 35, 273.

30. (a) Miertus, S.; Kysel, O. Chem. Phys. 1977, 21, 27, 33, 47. (b) Duben, A. J.; Miertus, S. Theor. Chim. Acta. 1981, 60, 327.

31. Born, M. Z. Phys. 1920, 1, 45.

32. Mulliken, R. S. J. Chem. Phys. 1955, 23, 1833.

33. (a) Cramer, C. J.; Truhlar, D. G. J. Am. Chem. Soc. 1991, 113, 8305. (b) Cramer, C. J.; Truhlar, D. G. J. Am. Chem. Soc. 1991, 113, 8552. (c) Cramer, C. J.; Truhlar, D. G. Science. 1992, 256, 213. (d) Cramer, C. J.; Truhlar, D. G. J. Comput. Chem. 1992, 13, 1089.

34. (a) Miertus, S.; Scrocco, E.; Tomasi, J. Chem. Phys. 1981, 55, 117. (b) Miertus, S.; Tomasi, J. Chem. Phys. 1982, 65, 239. (c) Bonaccorsi, R.; Cimiraglia, R.; Tomasi, J. J. Comput. Chem. 1983, 4, 567. (d) Tomasi, J.; Alagona, G.; Bonaccorsi, R.; Ghio, G. In Modelling of Structure and Properties of Molecules., Maksic, Z. B., Ed. Wiley: New York, 1987; p 330. (e) Floris, F.; Tomasi, J. J. Comput. Chem. 1989, 10, 616. (f) Bonaccorsi, R.; Cammi, R.; Tomasi, J. J. Comput. Chem. 1991, 12, 301. (g) Tomasi, J.; Bonaccorsi, R.; Cammi, R.; Olivares del Valle, F. J. J. Mol. Struct. (THEOCHEM) 1991, 234, 401. (h) Tunon, I.; Silla, E.; Tomasi, J. J. Phys. Chem. 1992, 96, 9043.

35. (a) Aguilar, M. A.; Olivares del Valle, F. J. Chem. Phys. 1989, 129, 439. (b) Aguilar, M. A.; Olivares del Valle, F. J. Chem. Phys. 1989, 138, 327. (c) Olivares del Valle, F. J.; Tomasi, J. Chem. Phys. 1991, 150, 139. (d) Aguilar, M. A.; Olivares del Valle, F. J.; Tomasi, J. Chem. Phys. 1991, 150, 151. (e) Tolosa, S.; Esperilla, J. J.; Olivares del Valle, F. J. J. Comput. Chem. 1990, 11, 576. (f) Cammi, R.; Olivares del Valle, F. J.; Tomasi, J. Chem. Phys. 1988, 122, 63. (g) Olivares del Valle, F. J.; Aguilar, M. A. J. Comput. Chem. 1992, 13, 115.

36. Hawkins, G. D.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. 1996, 100, 19 824-19 839.

37. Cramer, C. J.; Truhlar, D. G. J. Comput.-Aided Mol. Des. 1992, 6, 629-666.

38. Hawkins, G. D.; Cramer, C. J.; Truhlar, D. G. Chem. Phys. Lett. 1995, 246, 122-129.

39. Jayaram, B.; Sprous, D.; Beveridge, D. L. J. Phys. Chem. B 1998, 102, 9571-9576.

40. Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. J. Am. Chem. Soc. 1990, 112, 6127-6129.

41. Beveridge, D. L.; Dicapua, F. M. Annu. Rev. Biophys. Chem. 1989, 18, 431-492.

42. Sitkoff, D.; Sharp, K. A.; Honig, B. J. Phys. Chem. 1994, 98, 1978-1988.

43. Luo, R.; Moult, J.; Gilson, K. J. Phys. Chem. B 1997, 101, 11 226-11 236.

44. Dolney, D. M.; Hawkins, G. D.; Winget, P.; Liotard, D.; Cramer, C. J.; Truhlar, D. G. J. Comput. Chem. 2000, 340-366.

45. Klamt, A.; Schüürmann, G. J. Chem. Soc., Perkin Trans. 1993, 2, 799-805.

46. Barone, V.; Cossi, M.; Tomasi, J. J. Comput. Chem. 1998, 19, 404-417.

47. (a) Pascual-Ahuir, J. L.; Silla, E.; Tomasi, J.; Bonaccorsi, R. J. Comput. Chem. 1987, 8, 778. (b) Pascual-Ahuir, J. L.; Silla, E. J. Comput. Chem. 1990, 11, 1047. (c) Silla, E.; Tunon, I.; Pascual-Ahuir, J. L. J. Comput. Chem. 1991, 12, 1077.

48. (a) Connolly, M. L. J. Appl. Crystallogr. 1983, 16, 548. (b) Connolly, M. L. Molecular Surface Program, QCPE No. 429.

49. Marsili, M. In Physical Property Prediction in Organic Chemistry., Jochum, C.; Hicks, M. G.; Sunkel, J., Eds.; Springer: Berlin, 1988; p 249.

50. (a) Petrongolo, C.; Tomasi, J. Int. J. Quantum Chem. Symp. 1975, 2, 181. (b) Loew, G. H.; Berkowitz, D. S. J. Med. Chem. 1975, 18, 656. (c) Mazurek, A. D.; Weinstein, H.; Osman, R.; Topiol, S.; Ebersole, B. J. Quantum Biol. Symp. 1984, 11, 183.

51. (a) Pullman, B. Intermolecular Interactions: From Diatomics to Biopolymers, Wiley, Chichester, UK, 1978. (b) Kaplan, I. P. Theory of Molecular Interactions, Elsevier, Amsterdam, 1968. (c) Besler, B. H.; Merz, K. M.; Kollman, P. A. J. Comp. Chem. 1990, 11, 431.

52. (a) Carbo, R.; Leyda, L.; Arnau, M. Int. J. Quantum Chem. 1980, 17, 1185. (b) Burt, C.; Huxley, P.; Richards, W. G. J. Comp. Chem. 1990, 11, 117.

53. Politzer, P.; Murray, J. S. Molecular electrostatic potentials and chemical reactivity. In Rev. Comput. Chem., Lipkowitz, K.; Boyd, R. B., Eds. VCH: New York, 1998; Vol. 2, pp 273–312.

54. Murray, J. S.; Politzer, P. The use of the molecular electrostatic potential in medicinal chemistry. In Quantum medicinal chemistry., Carloni, P.; Alber, F.; Mannhold, R.; Kubinyi, H.; Folkers, G., Eds. Wiley-VCH: New York, 2003; Vol. 17, pp 233–254.

55. Ehresmann, B.; de Groot, M. J.; Alex, A.; Clark, T. J. Chem. Inf. Comput. Sci. 2004, 43, 658–668.

56. Podlipnik, C.; Koller, J. Croat. Chem. Acta. 1998, 71, 689–696.

57. Reynolds, C. A.; Richards, W. G.; Goodford, P. J. J Chem Soc, Perkin Trans II 1988, 551–556.

58. Miertus, S.; Scrocco, E.; Tomasi, J. J Chem Soc, Perkin Trans 1981, 1439–1443.

59. Wong, M. W.; Frisch, M. J.; Wiberg, K. B. J. Am. Chem. Soc. 1991, 113, 4776–4782.

60. Zou, J.; Yu, Y.; Shang, Z. J Chem Soc, Perkin Trans 2001, 1439–1443.

61. Rauhut, G.; Clark, T. J. Comput. Chem. 1993, 14, 503–509.

62. Beck, B.; Rauhut, G.; Clark, T. J. Comput. Chem. 1994, 15, 1064–1073.

63. Göller, A. H.; Horn, A. H. C.; Clark, T. (unpublished).

64. Horn, A. H. C. Ph.D Thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg 1994.

65. Hine, J.; Mookerjee, P. K. J. Org. Chem. 1975, 40, 292-298.

66. Eisenberg, D.; McLachlan, A. D. Nature. 1986, 319, 199-203.

67. Hawkins, G. D.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. B 1997, 101, 7147-7157.

68. Ehresmann, B.; de Groot, M. J.; Clark, T. J. Chem. Inf. Model. 2005, 45, 1053-1060.

69. Lin, J.-H.; Clark, T. An Analytical, Variable Resolution, Complete Description of Static Molecules and Their Intermolecular Binding Properties. J. Chem. Inf. Model. 2005, 45, 1010-1016.

70. Cai, W.; Shao, X.; Maigret, B. Protein-ligand recognition using spherical harmonic molecular surfaces: towards a fast and efficient filter for large virtual throughput screening. Journal of Molecular Graphics and Modelling 2002, 20, 313-328.

71. Max, N. L.; Getzoff, E. D. Spherical Harmonic Molecular Surfaces. IEEE Comput. Graphics Appl. 1988, 8, 42-50.

72. Duncan, B. S.; Olson, A. J. Approximation and Characterization of Molecular Surfaces; Scripps Institute: San Diego, California, 1995.

73. Clark, T.; Alex, A.; Beck, B.; Burhardt, F.; Chandrase, J.; Gedeck, P.; Horn, A. H. C.; Hutter, M.; Martin, B.; Rauhut, G.; Sauer, W.; Schindler, T.; Steinke, T. VAMP, 9.0; Accelrys Inc.; San Diego, 2003.

74. Clark, T.; Lin, J.-H.; Horn, A. H. C. ParaSurf 1.0, Computer-Chemie-Centrum, University of Erlangen, Erlangen, 2004.

75. CORINA 3D Structure Generator, Molecular Networks, GmbH: Erlangen, Germany, 2006.

76. Sadowski, J.; Gasteiger, J.; Klebe, G. Comparison of Automatic Three-Dimensional Model Builders Using 639 X-Ray Structures. Journal of Chemical Information and Computational Sciences 1994, 34, 1000-1008.

77. Heiden, W.; Goetze, T.; Brickmann, J. Fast generation of molecular surfaces from 3D data fields with an enhanced “marching cube” algorithm. J. Comput. Chem. 1993, 14, 246-250.

78. Cai, W.; Zhang, M.; Maigret, B. New approach for representation of molecular surface. J. Comput. Chem. 1998, 19, 1805-1815.

79. Tsar 3.3, Oxford Molecular Ltd.: Oxford, 2000.

80. Li, J.; Zhu, T.; Hawkins, G. D.; Winget, P.; Liotard, D. A.; Cramer, C. J.; Truhlar, D. G. Extension of the platform of applicability of the SM5.42R universal solvation model. Theor. Chem. Acc. 1999, 103, 9-63.

81. Lyman, W. J.; Reehl, W. F.; Rosenblatt, D. H. Handbook of Chemical Property Estimation Methods. American Chemical Society: Washington, DC, 1990.

82. Hansch, C.; Leo, A.; Mekapati, S. B.; Kurup, A. QSAR and ADME. Bioorg. Med. Chem. 2004, 12, 3391-3400.

83. Soskic, M. J. Chem. Inf. Model. 2005, 45, 930-938.

84. Hansch, C.; Leo, A.; Hoekman, D. Exploring QSAR: Hydrophobic, Electronic, and Steric Constants. The American Chemical Society: Washington, DC, 1995.

85. Reynolds, C. H. J. Chem. Inf. Comput. Sci. 1995, 35, 738.

86. Giesen, D. J.; Chambers, C. C.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. B 1997, 101, 2061-2069.

87. Wang, J.; Wang, W.; Huo, S.; Lee, M.; Kollman, P. A. J. Phys. Chem. B 2001, 105, 5055-5067.

88. Thompson, J. D.; Cramer, C. J.; Truhlar, D. G. New Universal Solvation Model and Comparison of the Accuracy of the SM5.42R, SM5.43R, C.-PCM, and IEF-PCM Continuum Solvation Models for Aqueous and Organic Solvation Free Energies and for Vapor Pressures. J. Phys. Chem. A 2004, 108, 6532-6542.

89. Giesen, D. J.; Chambers, C. C.; Cramer, C. J.; Truhlar, D. G. Solvation Model for Chloroform Based on Class IV Atomic Charges. J. Phys. Chem. B 1997, 101, 2061-2069.

90. Tannor, D. J.; Marten, B.; Murphy, R.; Friesner, R. A.; Sitkoff, D.; Nicholls, A.; Honig, B.; Ringnalda, M.; Goddard, W. A. Accurate First Principles Calculation of Molecular Charge Distributions and Solvation Energies from Ab Initio Quantum Mechanics and Continuum Dielectric Theory. J. Am. Chem. Soc. 1994, 116, 11875-82.

91. Dewar, M. J. S.; Zoebisch, E. G.; Healy, E. F.; Stewart J. J. P. Development and use of quantum mechanical molecular models. 76. AM1: a new general purpose quantum mechanical molecular model. J. Am. Chem. Soc. 1985, 107, 3902-3909. Holder, A. J. AM1, Encyclopedia of Computational Chemistry; Schleyer, P. v. R., Allinger, N. L., Clark, T., Gasteiger, J., Kollman, P. A., Schaefer, H. F., III., Schreiner, P. R., Eds.; Wiley: Chichester, 1998; pp 8-11.

92. Pople, J. A.; Santry, D. P.; Segal, G. A. J. Chem. Phys. 1965, 43, S129-S135.

Chapter 3

Binned Surface-Integral Models for Predicting the Octanol-Water Partition Coefficient

3.1 Introduction

In science, and more specifically in chemistry, the use of certain assumptions or concepts can help either to compare results or to plan future experiments1. Hydrophobicity, which can be defined as the ability of an organic molecule to dissolve in non-aqueous substances such as oils and fats, is such a concept. Hydrophobic molecules are generally nonpolar, and in an aqueous medium they cluster together; the interaction between nonpolar molecules and the aqueous solvent is described by both hydrophobic hydration and hydrophobic interaction2-5. Hydrophobic effects play a fundamental role in many reaction mechanisms that take place in aqueous medium. Through phenomena such as hydrophobic interactions between the receptor and ligand during transport through a membrane, or other factors related to the pharmacokinetic properties of the molecule6,7, lipophilicity exerts considerable influence on the biological activity of ligands, the transport of bioactive molecules through biological membranes8, and the environmental fate of organic molecules9. Hydrophobicity is commonly used to describe the free energy change that occurs when a drug is moved from an aqueous phase to a lipid bilayer10. It can also play a key role in the binding of a substrate to its macromolecular active site11. In molecular modeling, hydrophobicity is an important parameter for the design and development of QSAR models12. Quantitatively, lipophilicity is associated with the logarithm of the partition coefficient, logPow, of a compound in the n-octanol/water system11.
Because lipophilicity plays a key role in the transport of ligands, ligand binding, and the development of QSAR models necessary to predict biological activity13-21, the development of standard methods that calculate logPow effectively is currently an active research area of considerable financial interest. Many methods based on different approaches have been developed; in particular, atomic or molecular fragment-based group additivity, combined with linear regression on a database of compounds with known partition constants, has produced the successful ClogP18 and AlogP14 models13-17. In the latter case, specific values are determined for structural equivalents.

The surface-integral method, commonly called surface-integral models22 (SIMs), is a technique for obtaining the target property by integrating a function of the local properties over the entire surface of the molecule. It belongs to the group of QSPR models and can indicate which characteristics contribute to the modeled physical property. In the previous chapter, we used SIMs to obtain the local surface-tension contribution to the solvation free energy. Similarly, in some previous works, solvation energies have been determined explicitly through surface integrals evaluated as equivalents of volume integrals23. Local properties24, which are usually defined using semi-empirical molecular orbital (MO) theory, are generally used for generating solvation models in different solvents25. These models, far from being a simple integral of a local solvation function, always contain a large constant, and hence the local function cannot be interpreted as a local energy of solvation. This is clearly the case for models derived from least-squares fitting, where the training data are sufficiently similar that the model only needs to describe relatively small deviations from the constant. To generate a SIM for logPow efficiently, one must first define the local hydrophobicity, whose integral is logPow. The main objective of the present study is to establish a new approach for generating SIMs for logPow that relies on a true local hydrophobicity. This means clearly defining a true local hydrophobicity for which the addition of a constant to the surface integral is not essential, with the local properties obtained from semi-empirical MO calculations.
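The idea of a SIM can be illustrated with a minimal numerical sketch. It assumes that the molecular surface has already been triangulated and that each triangle carries its area and the local property values at its center. The function `local_hydrophobicity`, its coefficients, and the toy surface are hypothetical and are not taken from this work.

```python
from typing import Callable, Dict, List, Tuple

# One surface element: (area, local properties at the triangle center).
SurfaceElement = Tuple[float, Dict[str, float]]


def surface_integral(
    surface: List[SurfaceElement],
    local_function: Callable[[Dict[str, float]], float],
    constant: float = 0.0,
) -> float:
    """Approximate P = constant + integral of f(local properties) over the surface."""
    return constant + sum(area * local_function(props) for area, props in surface)


def local_hydrophobicity(props: Dict[str, float]) -> float:
    # Hypothetical functional form, used only to make the sketch executable.
    return 0.02 - 0.001 * abs(props["MEP"])


# Toy surface with three triangles (made-up areas and MEP values).
toy_surface: List[SurfaceElement] = [
    (10.0, {"MEP": -5.0}),
    (20.0, {"MEP": 0.0}),
    (5.0, {"MEP": 10.0}),
]
toy_logp = surface_integral(toy_surface, local_hydrophobicity)
```

With `constant=0.0` the predicted property is the pure surface integral, which is the situation a true local hydrophobicity is meant to achieve.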

The protonation state naturally affects logPow, which is basically the difference between the free energies of solvation of a compound in water and in water-saturated n-octanol26. In most cases, the logPow of the neutral compound is used. In other words, logPow is measured with the compound in a buffered solution, where it is un-ionized or zwitterionic. Chloroform and n-hexane27 are other solvents that can be used to define the lipophilic environment of a molecule. n-Hexane, which is a fully nonpolar solvent, tends to be more hydrophobic than n-octanol28, which in contrast is a hydrogen-bond donor and acceptor. However, because more data are available for logPow than for the partition coefficients between water and other organic phases, n-octanol has been widely adopted as the reference solvent for lipophilicity. The logPow is widely used in predicting transmembrane transport properties, protein binding, receptor affinity, and pharmacological activity of molecules20. The reliability of predicted logPow values can have a considerable effect on the design process because there is a strong relationship between this process and the predicted structures20. Some research groups, rather than relying on the most common approach of using structural fragments for calculating logPow, have turned to molecular properties for generating logPow models17,19,20,21. Following this line of research, Klopman established a method that relies on partial charges calculated with MINDO/329. Bodor and co-workers, with their original BlogP model, extended the use of molecular properties for calculating logPow20. Generally, determining logPow values experimentally is not a complicated process.
The logPow can be measured in the laboratory using well-developed techniques30, such as correlation with chromatographic retention times31 and automated titration with potentiometric measurements32, which seem to be used more often than the original shake-flask method. These techniques give fairly reliable logPow values, which can be used with confidence to generate logPow models. Several research laboratories have made logPow databases available to the scientific community, among them the PHYSPROP33, Beilstein34, and LOGKOW35 databases. These databases have been used with considerable success, together with a variety of descriptors and techniques such as multiple linear regression36, support vector machines37, partial least-squares38, neural networks, and ensembles thereof39, to generate logPow models.
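The relation between logPow and the two solvation free energies mentioned above can be written out as a short sketch. It uses the standard thermodynamic conversion logPow = (ΔGsolv(water) − ΔGsolv(octanol)) / (RT ln 10); the function name, the default temperature of 298.15 K, and the example energies are assumptions made for illustration only.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def log_p_from_solvation(
    dg_water: float, dg_octanol: float, temperature: float = 298.15
) -> float:
    """logPow from solvation free energies (kcal/mol) in water and n-octanol.

    A more negative free energy in octanol than in water gives a positive logPow.
    """
    return (dg_water - dg_octanol) / (R_KCAL * temperature * math.log(10.0))


# Made-up example: the solute is 2 kcal/mol better solvated in octanol.
example_logp = log_p_from_solvation(dg_water=-5.0, dg_octanol=-7.0)
```

At room temperature, one logPow unit corresponds to roughly 1.36 kcal mol-1, which puts the MUE values of the solvation models into perspective.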

This work presents an extension of the binned SIM approach developed by Clark et al.40, which originally used the AM1 Hamiltonian and the isodensity surface (iso)42. Further Hamiltonians and the solvent-excluded surface (SES)41 are added, and another local property, the local electronegativity, is introduced.

3.2 Methods of Calculating

The LOGKOW database35 provided 37,783 logPow values, including 23,479 compounds collected from the literature. For these compounds, the values suggested by LOGKOW were checked, and an average value was used for those with problematic entries. Comparing the structures of these compounds with those stored in SciFinder or PubChem yields 11,500 compounds whose structures are virtually identical to those in the different databases and therefore conform to the standards. Among them, all duplicate entries, compounds with permanent charges or unpaired electrons, compounds containing atoms for which parameters are not available in VAMP43, and all zwitterionic compounds were removed, so that 10,813 compounds (9241 with rotatable bonds and 1572 rigid) were finally kept for this work. The SMILES strings of these compounds were converted into 3D structures with CORINA44. The Molecular Operating Environment (MOE)45 was used to calculate the number of rotatable bonds and the number of hydrogen-bond donor/acceptor atoms. A set of 368 compounds obtained randomly from 20% of the data46, used as a test set in chapter 2, was used as an external validation set.
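The filtering steps described above can be sketched as follows. The record layout (keys such as `net_charge` or `is_zwitterion`) and the helper names are hypothetical; in practice these flags would come from a cheminformatics toolkit, and duplicate detection would require canonical SMILES strings.

```python
from typing import Dict, List, Set, Tuple


def curate(records: List[Dict], supported_elements: Set[str]) -> List[Dict]:
    """Drop duplicates, charged species, radicals, zwitterions, unsupported atoms."""
    seen_smiles: Set[str] = set()
    kept: List[Dict] = []
    for rec in records:
        if rec["smiles"] in seen_smiles:
            continue  # duplicate entry
        seen_smiles.add(rec["smiles"])
        if rec["net_charge"] != 0 or rec["unpaired_electrons"] != 0:
            continue  # permanent charge or radical
        if rec["is_zwitterion"]:
            continue
        if not set(rec["elements"]) <= supported_elements:
            continue  # element without parameters in the chosen Hamiltonian
        kept.append(rec)
    return kept


def split_by_flexibility(records: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Return (rigid, flexible) compounds based on the rotatable-bond count."""
    rigid = [r for r in records if r["rotatable_bonds"] == 0]
    flexible = [r for r in records if r["rotatable_bonds"] > 0]
    return rigid, flexible


def make_record(smiles, elements, rotatable_bonds, net_charge=0,
                unpaired_electrons=0, is_zwitterion=False) -> Dict:
    return {"smiles": smiles, "elements": elements,
            "rotatable_bonds": rotatable_bonds, "net_charge": net_charge,
            "unpaired_electrons": unpaired_electrons,
            "is_zwitterion": is_zwitterion}


# Made-up toy input; the element set is an illustrative subset only.
toy_records = [
    make_record("c1ccccc1", ["C", "H"], 0),
    make_record("c1ccccc1", ["C", "H"], 0),             # duplicate
    make_record("C[N+](C)(C)C", ["C", "H", "N"], 0, net_charge=1),
    make_record("CCCCO", ["C", "H", "O"], 3),
    make_record("C[Se]C", ["C", "H", "Se"], 0),         # unsupported element
]
curated = curate(toy_records, {"H", "C", "N", "O"})
rigid_set, flexible_set = split_by_flexibility(curated)
```

For the data set used here, the corresponding split is 1572 rigid and 9241 flexible compounds, which together give the 10,813 compounds retained.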

The geometries of the 3D molecular structures were optimized in the gas phase using the AM1, AM1*, MNDO, MNDO/d, PM3, and PM6 Hamiltonians, all of which are implemented in VAMP43. Starting from the optimized geometries, the molecular descriptors were calculated with ParaSurf1047 after first specifying the type of molecular surface, which can be the default iso42 or the SES41. The local properties obtained from the ParaSurf calculation are the molecular electrostatic potential (MEP), local ionization energy (IEL), local electron affinity (EAL), local hardness (HARD), local polarizability (POL), local electronegativity (ENEG), and the field normal to the surface (FN)48.

Bagging is a commonly used modern technique whose major asset is that it mitigates effects such as noise and outliers in the training set. With this method, a reasonable estimate of the test-set performance can be obtained during model generation, with a test set that can be as large as the training set49. The models were therefore generated with the bagging version of stepwise multiple linear regression50. Ninety-five percent of the critical F-value51 was calculated first and then used as a stopping criterion with forward and backward stepping; this avoids overtraining while ensuring that the variables included in the models are significant. Judging by the results obtained, this strategy generally leads to sufficiently robust models. Seventy-five percent of the compounds in the total training set were first selected at random, and 50 independent bagging samples were subsequently generated for each model, so that each compound was used repeatedly in both the training and test sets40. The statistical performance for the test and training sets is expressed in terms of the square of the correlation coefficient (R2), the mean unsigned error (MUE), and the root mean square error (RMSE). The final equation for each model was obtained as the arithmetic average of the coefficients of the 50 individual equations.
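The bagging procedure and the three statistics can be sketched as follows. This is a simplified illustration, not the implementation used in this work: a one-descriptor least-squares line stands in for stepwise multiple linear regression, each bag is assumed to draw 75% of the compounds without replacement, and the function names and toy data are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a + b*x (simple stand-in for stepwise MLR)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bagged_fit(xs, ys, n_bags=50, train_fraction=0.75, seed=0):
    """Return averaged (intercept, slope) and mean out-of-bag prediction per compound."""
    rng = random.Random(seed)
    n = len(xs)
    n_train = round(train_fraction * n)
    intercepts, slopes = [], []
    oob: Dict[int, List[float]] = {}
    for _ in range(n_bags):
        train = sorted(rng.sample(range(n), n_train))
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        intercepts.append(a)
        slopes.append(b)
        in_train = set(train)
        for i in range(n):
            if i not in in_train:  # compound acts as test data for this bag
                oob.setdefault(i, []).append(a + b * xs[i])
    coefficients = (sum(intercepts) / n_bags, sum(slopes) / n_bags)
    oob_prediction = {i: sum(p) / len(p) for i, p in oob.items()}
    return coefficients, oob_prediction


def fit_statistics(measured: List[float], predicted: List[float]):
    """Return (R2, MUE, RMSE); R2 is the squared Pearson correlation."""
    n = len(measured)
    errors = [p - m for m, p in zip(measured, predicted)]
    mue = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_m, mean_p = sum(measured) / n, sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_m * var_p), mue, rmse


# Toy data: one descriptor, target = 0.5 + 0.3*x plus a small alternating error.
xs = [float(i) for i in range(20)]
ys = [0.5 + 0.3 * x + 0.1 * (-1) ** i for i, x in enumerate(xs)]
(coef_a, coef_b), oob_pred = bagged_fit(xs, ys)
ids = sorted(oob_pred)
r2, mue, rmse = fit_statistics([ys[i] for i in ids], [oob_pred[i] for i in ids])
```

Averaging the coefficients over all bags mirrors the way the final model equation is obtained, while the out-of-bag predictions provide the test-set statistics without a separate hold-out set.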

3.2.1 Solvent-Excluded Surface (SES)

The SES41 encloses the portion of the molecule that is not accessible to a solvent probe sphere as the latter rolls over the molecular surface41. The SES comprises two different parts, the convex and the concave surfaces. The convex surface is the contact surface of the molecule, namely, the region of the van der Waals (vdW) surface that is in direct contact with the solvent probe molecules. The concave surface is the reentrant part of the molecular surface. It is formed by the inward-facing part of the solvent probe sphere when the probe is in contact with two or more atoms simultaneously. The SES41, as shown in Figure 3.141, depends entirely on the types of atoms contained in the protein molecules and the nature of the corresponding solvent probe molecules.

Figure 3.1. The blue curve is the portion of the molecular structure that represents the vdW surface, and the red circle depicts the solvent probe molecule. The curves in red and black represent the solvent-accessible surface (SAS) and the SES, respectively. Reprinted from 41.

3.3 Results

The SIM24 approach used in the previous chapter is based on an integration over the molecular surface of a polynomial expansion of the local properties obtained from ParaSurf (including cross terms), with the integrals fitted to the target property. Although this approach can provide models in which the constant is small, it very often leads to an equation with a large constant and a local function that describes only relatively small deviations from this constant. The interpretation of this local function as a local hydrophobicity therefore depends strongly on the constant, which should not be significant. To achieve this, a new and entirely different approach was used. It consists of binning the local properties and their cross-products. For all training compounds in a given set, the maxima and minima of each property are determined, and their median values are used as the outer binning thresholds. The resulting intermediate range is divided into 10 bins of equal width, which gives a set of 12 bins and 11 thresholds. Together with the seven local properties and their 21 cross-products for AM1, MNDO, PM3, and PM6, or the six local properties and their 15 cross-products for AM1* and MNDO/d, this leads to 336 or 252 new surface-bin descriptors, respectively. Each set of descriptors was used in a stepwise multiple linear regression to fit the experimental logPow values.
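The binning scheme can be sketched for a single local property as follows. The sketch assumes that each molecular surface is given as a list of (area, property value) pairs and that a surface-bin descriptor is the surface area falling into a bin; both the data layout and this reading of the descriptor are assumptions, and the function names and toy values are hypothetical.

```python
from bisect import bisect_right
from statistics import median
from typing import List, Tuple

SurfacePoints = List[Tuple[float, float]]  # (area, local property value)


def bin_thresholds(training_surfaces: List[SurfacePoints],
                   n_inner_bins: int = 10) -> List[float]:
    """Outer thresholds = medians of per-compound minima and maxima."""
    lower = median(min(v for _, v in s) for s in training_surfaces)
    upper = median(max(v for _, v in s) for s in training_surfaces)
    width = (upper - lower) / n_inner_bins
    return [lower + k * width for k in range(n_inner_bins + 1)]


def bin_areas(surface: SurfacePoints, thresholds: List[float]) -> List[float]:
    """Surface area per bin: one bin below, one above, and the inner bins."""
    areas = [0.0] * (len(thresholds) + 1)
    for area, value in surface:
        areas[bisect_right(thresholds, value)] += area
    return areas


def descriptor_count(n_properties: int, n_bins: int = 12) -> int:
    """Bins for every local property and every pairwise cross-product."""
    n_cross = n_properties * (n_properties - 1) // 2
    return (n_properties + n_cross) * n_bins


# Made-up training surfaces for one property.
toy_training = [
    [(1.0, -10.0), (2.0, 0.0), (1.0, 10.0)],
    [(1.0, -20.0), (1.0, 20.0)],
    [(3.0, -30.0), (1.0, 30.0)],
]
toy_thresholds = bin_thresholds(toy_training)
toy_descriptors = bin_areas(toy_training[0], toy_thresholds)
```

`descriptor_count(7)` and `descriptor_count(6)` reproduce the 336 and 252 descriptors quoted above, since seven properties give 21 pairwise cross-products and six give 15.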

For the total of 10,813 compounds, the descriptors obtained with the AM1 Hamiltonian and the SES give, with the binned SIM approach, a stepwise multiple linear regression model with R2 (test) = 0.89, RMSE (test) = 0.58 logPow units, and MUE (test) = 0.43 logPow units. Figure 3.2 shows predicted versus measured values for the out-of-bag test-set predictions.

Figure 3.2. Graphical representation of the measured versus predicted logPow values for the test set obtained with the AM1 Hamiltonian and the SES.

For the entire data set, 40 of the 336 descriptors are used in about 25 of the 50 bagging equations. In this set of descriptors, MEP × EAL and EAL × FN bins appear in each of the 50 bagging equations. Based on the sum of the absolute values of the coefficients, the most important descriptors listed in decreasing order are EAL × ENEG, HARD × ENEG, EAL × FN, IEL × EAL, and IEL. The average value of the constant is 0.15.

According to Figure 3.2, there are some strong outliers in the model. The SMILES strings of these outliers were collected and submitted to MOE45 for a conformational search using the MMFF94x force field. The lowest-energy geometries obtained from MOE45 were optimized, and their heats of formation were compared to those obtained from the geometries used for generating the model. It appears that the protonation states used for model building are more stable than those obtained from MOE45. The largest deviations observed for these compounds may be related either to problems with the geometry optimization or to particular problems with the AM1 Hamiltonian. The different protonation states of 1H-purine, 2,6,8-tris(methylsulfonyl)-, one of the outliers, are shown in Figure 3.3.

CORINA’s protonation state: Heat of Formation = -89.16 kcal mol-1
MOE’s protonation state: Heat of Formation = -68.48 kcal mol-1

Figure 3.3. Protonation states of 1H-purine, 2,6,8-tris(methylsulfonyl)- generated with CORINA and MOE.

Because of their lower heats of formation, the single CORINA conformations were used for generating all of the models.

3.3.1 Conformational Dependence

In QSPR, the relationship between the prediction and the different conformations of a molecule represented as a 3D molecular structure [40] remains a frequently discussed subject. It has been shown previously for the solvation free energy models that the models obtained can reproduce the logPow values of small molecules with very good accuracy. In this study on predicting logPow, it emerged, as shown in Figure 3.4, that for compounds with rotatable bonds the prediction error depends on the number of rotatable bonds in the molecule. The numbers of rotatable bonds for flexible compounds were calculated with MOE [45] and sorted in increasing order, except for macrocyclic compounds, in which MOE did not designate the aliphatic carbons in the ring as rotatable.


Figure 3.4. Error of prediction obtained with AM1 and the SES versus the number of rotatable bonds.

For small numbers of rotatable bonds, the error increases with the number of rotatable bonds. However, the variance is larger for larger numbers of rotatable bonds because only a few compounds have many rotatable bonds, which lowers the quality of sampling. This linear growth of the deviation is closely related to the increase in the uncertainty of the conformation. Some compounds in the logPow data set used here help to avoid this kind of problem: compounds without significant conformational flexibility (rigid compounds) allow models with maximum performance to be generated. A plot of predicted versus measured values for the out-of-bag test set predictions obtained for the single conformation of the rigid compounds with the AM1 Hamiltonian and the SES [41] is shown in Figure 3.5.
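The kind of grouping behind a plot of error against the number of rotatable bonds can be sketched as follows. The record layout, the function name and the numbers are hypothetical; the count per group is reported alongside the mean unsigned error because sparsely populated groups (here the one with nine rotatable bonds) give unreliable estimates.

```python
from collections import defaultdict

def error_by_rotatable_bonds(records):
    """Return {n_rotatable: (count, mean unsigned error)} from (n_rotatable, error) pairs."""
    groups = defaultdict(list)
    for n_rotatable, error in records:
        groups[n_rotatable].append(abs(error))
    return {n: (len(errs), sum(errs) / len(errs)) for n, errs in sorted(groups.items())}

# Hypothetical (number of rotatable bonds, signed prediction error) pairs.
records = [(0, 0.2), (0, -0.4), (1, 0.5), (1, -0.3), (2, 0.8), (2, -0.6), (9, 1.5)]
summary = error_by_rotatable_bonds(records)

for n_rotatable, (count, mean_error) in summary.items():
    print(n_rotatable, count, round(mean_error, 2))
```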


Figure 3.5. Predicted values of logPow obtained with the AM1 Hamiltonian and the SES versus the measured ones for a test set of the rigid compounds.

Figure 3.5 shows virtually no strong outliers, and there is a fairly good correlation between predicted and experimental logPow, unlike the model obtained with the entire data set. Removing the flexible compounds increases the test-set R² by about 0.06 (0.89 to 0.95) and decreases the test-set RMSE by ≈ 0.12 logPow units (0.58 to 0.46; Table 3.1).

The statistics for all three models generated with the AM1 Hamiltonian and the SES for the training and out-of-bag test sets are listed in Table 3.1.

Table 3.1. Statistical significance of all models generated with the AM1 Hamiltonian and the SES for the single conformation

| Set | Full set MUE | Full set RMSE | Full set R² | Flexible MUE | Flexible RMSE | Flexible R² | Rigid MUE | Rigid RMSE | Rigid R² |
|---|---|---|---|---|---|---|---|---|---|
| Training set | 0.42 | 0.56 | 0.89 | 0.43 | 0.57 | 0.88 | 0.30 | 0.40 | 0.96 |
| Test set | 0.43 | 0.58 | 0.89 | 0.44 | 0.59 | 0.87 | 0.33 | 0.46 | 0.95 |

Using only the rigid compounds yields an RMSE of 0.46 and 0.40 for the test and training sets, respectively. With the flexible compounds, the RMSE is 0.59 for the test set and 0.57 for the training set. The performance of the models increases markedly when the flexible compounds are removed from the data set. Because of the uncertainty of the conformation, the flexible compounds have an RMSE that is 0.13 (test set) to 0.17 (training set) logPow units higher than that of the rigid compounds for the model generated with AM1 and the SES.
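The RMSE gaps between the subsets can be read directly from Table 3.1. The short sketch below computes them for both splits; the dictionary layout and the helper name are illustrative choices, while the numbers are those of Table 3.1.

```python
# RMSE values (logPow units) copied from Table 3.1 (AM1, SES).
rmse_table = {
    "full": {"training": 0.56, "test": 0.58},
    "flexible": {"training": 0.57, "test": 0.59},
    "rigid": {"training": 0.40, "test": 0.46},
}

def rmse_gap(subset_a, subset_b, split):
    """RMSE of subset_a minus RMSE of subset_b for the given split, rounded to 2 decimals."""
    return round(rmse_table[subset_a][split] - rmse_table[subset_b][split], 2)

for split in ("training", "test"):
    print(split, rmse_gap("flexible", "rigid", split), rmse_gap("full", "rigid", split))
```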

Combining the full data set, the AM1* Hamiltonian and the SES gives a set of descriptors that, used with binned SIM and stepwise multiple linear regression, yields a model with R² (test) = 0.87, RMSE (test) = 0.61 and MUE (test) = 0.46. Figure 3.6 is a plot of predicted versus measured values for the out-of-bag test set predictions obtained.

Figure 3.6. Measured values of logPow compared to the predicted ones obtained with the AM1* Hamiltonian and the SES for the test set.

Here, 39 of the 252 descriptors are used in about 25 of the 50 bagging equations, in contrast to AM1, where 40 of the 336 descriptors are used. For this set of descriptors, the MEP × FN bin appears in each of the 50 bagging equations. Ranked by the sum of the absolute values of the coefficients, the most important descriptors are, in decreasing order, MEP × FN, IEL × FN, FN, EAL × FN, and ENEG × FN. The average value of the constant is 0.21. Three significant outliers were observed for this model.

The performances for all models generated with the AM1* Hamiltonian for training and out-of-bag test sets are summarized in Table 3.2.

Table 3.2. Performances of all models generated with the AM1* Hamiltonian for the single conformation

| Surface | Set | Full set MUE | Full set RMSE | Full set R² | Flexible MUE | Flexible RMSE | Flexible R² | Rigid MUE | Rigid RMSE | Rigid R² |
|---|---|---|---|---|---|---|---|---|---|---|
| SES | Training set | 0.45 | 0.60 | 0.88 | 0.50 | 0.61 | 0.86 | 0.32 | 0.42 | 0.96 |
| SES | Test set | 0.46 | 0.61 | 0.87 | 0.47 | 0.62 | 0.86 | 0.35 | 0.48 | 0.94 |
| Iso | Training set | 0.48 | 0.63 | 0.86 | 0.49 | 0.64 | 0.85 | 0.38 | 0.49 | 0.94 |
| Iso | Test set | 0.49 | 0.65 | 0.85 | 0.50 | 0.68 | 0.83 | 0.41 | 0.55 | 0.92 |


The use of the SES leads to a decrease in the RMSE and an increase in the R² for both the training and the test sets in the models generated with the full data set, the flexible compounds and the rigid compounds. Averaged over these three data sets, the training-set RMSE is 0.54 with the SES and 0.59 with the iso, and the training-set R² is 0.90 and 0.88, respectively. For the test set, the average RMSE is 0.57 with the SES and 0.63 with the iso, and the average R² is 0.89 and 0.87, respectively.
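The averages quoted above are plain arithmetic means over the full, flexible and rigid data sets of Table 3.2. The following sketch reproduces them; the dictionary layout and the helper name are illustrative, while the values are copied from the table.

```python
# RMSE and R2 from Table 3.2 (AM1*), ordered as full set, flexible, rigid.
table_3_2 = {
    ("SES", "training"): {"RMSE": [0.60, 0.61, 0.42], "R2": [0.88, 0.86, 0.96]},
    ("SES", "test"): {"RMSE": [0.61, 0.62, 0.48], "R2": [0.87, 0.86, 0.94]},
    ("iso", "training"): {"RMSE": [0.63, 0.64, 0.49], "R2": [0.86, 0.85, 0.94]},
    ("iso", "test"): {"RMSE": [0.65, 0.68, 0.55], "R2": [0.85, 0.83, 0.92]},
}

def mean_2dp(values):
    """Arithmetic mean rounded to two decimals, as reported in the text."""
    return round(sum(values) / len(values), 2)

averages = {
    key: {metric: mean_2dp(values) for metric, values in metrics.items()}
    for key, metrics in table_3_2.items()
}

for key, result in averages.items():
    print(key, result)
```

The same helper applies unchanged to Tables 3.3 to 3.6.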

The complete data set, the MNDO/d Hamiltonian and the SES provide a set of descriptors that, used with the binned SIM, give a stepwise multiple linear regression model with R² (test) = 0.86, RMSE (test) = 0.63 and MUE (test) = 0.47. The correlation between the measured and predicted values for the out-of-bag test set predictions obtained with MNDO/d and the SES is shown in Figure 3.7.

Figure 3.7. Correlation between the measured and the predicted values of logPow for the test set obtained with the MNDO/d Hamiltonian and the SES.

For this model, 55 of the 252 descriptors are used in about 25 of the 50 bagging equations, considerably more than for AM1* (39 of 252). Among these binned descriptors, the MEP × FN, MEP × ENEG, FN, and HARD × ENEG bins are present in each of the 50 bagging equations. Ranked by the sum of the absolute values of the coefficients, the most important descriptors are, in decreasing order, MEP × FN, FN, HARD × FN, IEL × FN, and ENEG × FN. The average value of the constant is 2.31 × 10⁻². About five points lie far from the line y = x.


The statistical details about the performances of all models generated with the MNDO/d Hamiltonian for training and out-of-bag test sets are given in Table 3.3.

Table 3.3. Statistical details on the performances of all models generated with the MNDO/d Hamiltonian for the single conformation

| Surface | Set | Full set MUE | Full set RMSE | Full set R² | Flexible MUE | Flexible RMSE | Flexible R² | Rigid MUE | Rigid RMSE | Rigid R² |
|---|---|---|---|---|---|---|---|---|---|---|
| SES | Training set | 0.46 | 0.61 | 0.87 | 0.48 | 0.63 | 0.85 | 0.33 | 0.44 | 0.95 |
| SES | Test set | 0.47 | 0.63 | 0.86 | 0.49 | 0.64 | 0.85 | 0.36 | 0.50 | 0.94 |
| Iso | Training set | 0.49 | 0.65 | 0.85 | 0.50 | 0.66 | 0.84 | 0.36 | 0.49 | 0.94 |
| Iso | Test set | 0.50 | 0.67 | 0.85 | 0.51 | 0.68 | 0.83 | 0.40 | 0.54 | 0.93 |

As seen previously with AM1*, the use of the SES leads to a decrease in the RMSE and an increase in the R² for both the training and the test sets of the models generated with the full data set, the flexible compounds and the rigid compounds. Averaged over these three data sets, the training-set RMSE is 0.56 with the SES and 0.60 with the iso, and the training-set R² is 0.89 and 0.88, respectively. For the test set, the average RMSE is 0.59 with the SES and 0.63 with the iso, and the average R² is 0.88 and 0.87, respectively.

The full data set (flexible and rigid compounds), with the MNDO Hamiltonian and the SES, gives a set of descriptors that, used with the binned SIM approach, yields a stepwise multiple linear regression model with R² (test) = 0.88, RMSE (test) = 0.60 and MUE (test) = 0.46. The predicted logPow values obtained for the out-of-bag test set are plotted against the measured values in Figure 3.8.


Figure 3.8. Plot of the predicted against the measured values of logPow for the test set obtained with the MNDO Hamiltonian and the SES.

In this case, 51 of the 336 descriptors are used in about 25 of the 50 bagging equations. For this set of binned descriptors, the MEP × FN, MEP × EAL, EAL × POL, and MEP × ENEG bins appear in each of the 50 bagging equations. Ranked by the sum of the absolute values of the coefficients, the most important descriptors are, in decreasing order, HARD × FN, MEP × EAL, MEP × FN, MEP × ENEG, and IEL × ENEG. The average value of the constant is 5.44 × 10⁻². About four points are outliers here.

Table 3.4 gives the statistics for all three models generated with the MNDO Hamiltonian for the training and out-of-bag test sets.

Table 3.4. Statistical description of all models generated with the MNDO Hamiltonian for the single conformation

| Surface | Set | Full set MUE | Full set RMSE | Full set R² | Flexible MUE | Flexible RMSE | Flexible R² | Rigid MUE | Rigid RMSE | Rigid R² |
|---|---|---|---|---|---|---|---|---|---|---|
| SES | Training set | 0.44 | 0.58 | 0.88 | 0.46 | 0.60 | 0.87 | 0.32 | 0.43 | 0.95 |
| SES | Test set | 0.46 | 0.60 | 0.88 | 0.47 | 0.62 | 0.86 | 0.35 | 0.48 | 0.94 |
| Iso | Training set | 0.46 | 0.61 | 0.87 | 0.48 | 0.63 | 0.85 | 0.34 | 0.45 | 0.95 |
| Iso | Test set | 0.48 | 0.63 | 0.86 | 0.49 | 0.65 | 0.84 | 0.37 | 0.50 | 0.94 |

With the SES there is a decrease in the RMSE and an increase in the R² for both the training and the test sets of the models generated with the full data set and the flexible compounds. There is no change in the R² for the rigid compounds. Averaged over the three data sets, the training-set RMSE is 0.54 with the SES and 0.56 with the iso, and the training-set R² is 0.90 and 0.89, respectively. For the test set, the average RMSE is 0.57 with the SES and 0.59 with the iso, and the average R² is 0.89 and 0.88, respectively.

The entire data set with the PM3 Hamiltonian and the SES provides a set of descriptors that, used with the binned SIM approach, leads to a stepwise multiple linear regression model with R² (test) = 0.87, RMSE (test) = 0.60 and MUE (test) = 0.46. The correlation between the measured and predicted values for the out-of-bag test set predictions obtained with PM3 and the SES is shown in Figure 3.9.

Figure 3.9. Comparison of measured and predicted values of logPow for the test set obtained with the PM3 Hamiltonian and the SES.

The use of the entire data set, the PM3 Hamiltonian and the SES yields a model in which 45 of the 336 descriptors are used in about 25 of the 50 bagging equations. In this case, the POL × FN, IEL × FN, and FN bins appear in each of the 50 bagging equations. Ranked by the average of the absolute values of the coefficients, the most important descriptors are, in decreasing order, IEL × FN, FN, HARD × FN, POL × FN, and EAL × ENEG. The average value of the constant is 5.49 × 10⁻². One point lies somewhat apart from the others, which are close together, and is an outlier.

Table 3.5 lists the statistics for the three models generated with the PM3 Hamiltonian for the training and out-of-bag test sets.


Table 3.5. Predictive powers of all models generated with the PM3 Hamiltonian for the single conformation

| Surface | Set | Full set MUE | Full set RMSE | Full set R² | Flexible MUE | Flexible RMSE | Flexible R² | Rigid MUE | Rigid RMSE | Rigid R² |
|---|---|---|---|---|---|---|---|---|---|---|
| SES | Training set | 0.45 | 0.59 | 0.88 | 0.46 | 0.59 | 0.87 | 0.34 | 0.45 | 0.95 |
| SES | Test set | 0.46 | 0.60 | 0.87 | 0.47 | 0.61 | 0.86 | 0.37 | 0.51 | 0.94 |
| Iso | Training set | 0.50 | 0.66 | 0.85 | 0.51 | 0.67 | 0.83 | 0.40 | 0.52 | 0.93 |
| Iso | Test set | 0.52 | 0.68 | 0.84 | 0.53 | 0.69 | 0.82 | 0.44 | 0.59 | 0.92 |

Generating the models with the SES leads to a decrease in the RMSE and an increase in the R² for both the training and the test sets for the full data set, the flexible compounds and the rigid compounds. Averaged over these three data sets, the training-set RMSE is 0.54 with the SES and 0.62 with the iso, and the training-set R² is 0.90 and 0.87, respectively. For the test set, the average RMSE is 0.57 with the SES and 0.65 with the iso, and the average R² is 0.89 and 0.86, respectively.

The total data set with the PM6 Hamiltonian and the SES gives a set of descriptors that, used with the binned SIM approach, produces a stepwise multiple linear regression model with R² (test) = 0.86, RMSE (test) = 0.64 and MUE (test) = 0.47. The regression of the measured logPow against the predicted values for the out-of-bag test set is shown in Figure 3.10.

Figure 3.10. Regression of measured logPow against predicted values obtained for the test set with the PM6 Hamiltonian and the SES.


Using the set of 10,813 compounds gives a model in which 43 of the 336 descriptors are used in about 25 of the 50 bagging equations. For this set of binned descriptors, the IEL × HARD and IEL × ENEG bins appear in each of the 50 bagging equations. Ranking by the sum of the absolute values of the coefficients gives the descriptors EAL × ENEG, IEL × ENEG, IEL × EAL, MEP × FN, and FN in decreasing order of importance. The average value of the constant is -0.17. Three points lie very far from the others; they are significant outliers.

Table 3.6 is a summary of the performances of all models generated with the PM6 Hamiltonian for training and out-of-bag test sets.

Table 3.6. Summary of the performances of all models generated with the PM6 Hamiltonian for the single conformation

| Surface | Set | Full set MUE | Full set RMSE | Full set R² | Flexible MUE | Flexible RMSE | Flexible R² | Rigid MUE | Rigid RMSE | Rigid R² |
|---|---|---|---|---|---|---|---|---|---|---|
| SES | Training set | 0.46 | 0.61 | 0.87 | 0.47 | 0.62 | 0.86 | 0.36 | 0.48 | 0.94 |
| SES | Test set | 0.47 | 0.64 | 0.86 | 0.48 | 0.64 | 0.85 | 0.38 | 0.52 | 0.93 |
| Iso | Training set | 0.48 | 0.63 | 0.86 | 0.49 | 0.64 | 0.85 | 0.37 | 0.50 | 0.94 |
| Iso | Test set | 0.49 | 0.66 | 0.85 | 0.50 | 0.66 | 0.84 | 0.41 | 0.55 | 0.93 |

Generating the models with the SES yields a decrease in the RMSE and an increase in the R² for both the training and the test sets of the models generated with either the full data set or the flexible compounds. There is no change in the R² for the rigid compounds. Averaged over the three data sets, the training-set RMSE is 0.57 with the SES and 0.59 with the iso, and the training-set R² is 0.89 and 0.88, respectively. For the test set, the average RMSE is 0.60 with the SES and 0.62 with the iso, and the average R² is 0.88 and 0.87, respectively.

In chemistry, as in physics, two atoms interact in a way that may be attractive or repulsive depending on their nature. Thus, a hydrogen atom in the presence of an electronegative atom, such as oxygen, nitrogen, or fluorine, from another molecule or chemical group forms an attractive interaction with it, commonly called hydrogen bonding. An electronegative atom, whether bonded to a hydrogen atom or not, can always bond with another hydrogen and is therefore always a hydrogen bond acceptor, whereas a hydrogen atom attached to an electronegative atom automatically acts as a hydrogen bond donor [52]. Hydrogen bonding strongly affects the logPow prediction, as does the number of rotatable bonds.

Figure 3.11 relates the prediction error to the number of hydrogen bond donor/acceptor atoms.


Figure 3.11. Correlation between the error of prediction obtained with AM1 and the SES and the number of hydrogen bond donor/acceptor atoms obtained with MOE.

For the complete data set, the standard deviation of the prediction error rises essentially linearly with the number of hydrogen bond donor/acceptor atoms; the two quantities are correlated.
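An essentially linear rise of this kind can be quantified with an ordinary least-squares line through the per-group standard deviations. The sketch below uses the closed-form solution; the function name and the four data points are hypothetical and are not read from Figure 3.11.

```python
def least_squares_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept (closed form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical standard deviation of the error per number of donor/acceptor atoms.
n_hbond_atoms = [0, 1, 2, 3]
error_std = [0.3, 0.4, 0.5, 0.6]

slope, intercept = least_squares_line(n_hbond_atoms, error_std)
print(round(slope, 3), round(intercept, 3))
```

The slope is then the increase in the standard deviation (in logPow units) per additional donor/acceptor atom.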

The SMILES strings of the LOGKOW database [35] were compared with those of the data [46] used as a test set in the previous chapter, stored in PubChem or ChemSpider. Forty-eight compounds found in both data sets were removed from the external validation data set. The remaining 320 compounds were then used for external validation. The performances of all 11 models generated with the full data set on the external validation set are shown in Table 3.7.
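The overlap removal described above (48 compounds removed, 320 kept) amounts to a set-membership test on SMILES strings. The sketch below assumes that both collections have already been canonicalized with the same toolkit, since plain string comparison cannot recognize two different SMILES spellings of the same molecule; the function name and the example strings are illustrative only.

```python
def split_external_set(external_smiles, training_smiles):
    """Return (kept, removed): external entries absent from / present in the training data."""
    training = set(training_smiles)
    kept = [s for s in external_smiles if s not in training]
    removed = [s for s in external_smiles if s in training]
    return kept, removed

# Hypothetical, already canonicalized SMILES strings.
training_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
external_smiles = ["CCO", "CCN", "c1ccccc1O", "CC(=O)O"]

kept, removed = split_external_set(external_smiles, training_smiles)
print(len(kept), len(removed))
```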

Table 3.7. Performances of all models generated with the full set on the external validation set for the single conformation

| Model | MUE | RMSE | R² |
|---|---|---|---|
| AM1 (SES) | 0.35 | 0.49 | 0.89 |
| AM1* (iso) | 0.41 | 0.55 | 0.86 |
| AM1* (SES) | 0.40 | 0.55 | 0.86 |
| MNDO/d (iso) | 0.36 | 0.51 | 0.87 |
| MNDO/d (SES) | 0.36 | 0.50 | 0.88 |
| MNDO (iso) | 0.37 | 0.53 | 0.86 |
| MNDO (SES) | 0.36 | 0.51 | 0.88 |
| PM3 (iso) | 0.42 | 0.59 | 0.83 |
| PM3 (SES) | 0.36 | 0.51 | 0.88 |
| PM6 (iso) | 0.40 | 0.55 | 0.85 |
| PM6 (SES) | 0.41 | 0.57 | 0.85 |

The best model obtained with the binned SIM gives an R² of 0.89, compared with 0.84 for the best polynomial SIM model. The new binned SIM approach helps not only in describing a true local hydrophobicity and in allowing a test set as large as the training set, but also in obtaining models whose statistical performances are better than those generated with the old polynomial SIM approach.

3.3.2 Comparison with Publicly Available logPow Models

The full, flexible and rigid data sets were evaluated with other publicly available logPow prediction tools. These were AlogPS2.1 [53] and the SlogP and logP_o/w models available in MOE [45]. The performances are summarized in Table 3.8.

Table 3.8. Performances of other publicly available logPow models on our full, flexible and rigid sets for the single conformation

Full = full test (validation) set (10,813 compounds); Flexible = test or validation set for flexible compounds (9241 compounds); Rigid = test or validation set for rigid compounds (1572 compounds).

| Model | Full MUE | Full RMSE | Full R² | Flexible MUE | Flexible RMSE | Flexible R² | Rigid MUE | Rigid RMSE | Rigid R² |
|---|---|---|---|---|---|---|---|---|---|
| AlogPs | 0.30 | 0.43 | 0.94 | 0.30 | 0.44 | 0.93 | 0.27 | 0.40 | 0.96 |
| SlogP | 0.55 | 0.76 | 0.80 | 0.56 | 0.76 | 0.79 | 0.52 | 0.72 | 0.88 |
| logP_o/w | 0.56 | 0.82 | 0.79 | 0.58 | 0.84 | 0.77 | 0.45 | 0.66 | 0.88 |

The performances of the publicly available models AlogPS2.1, logP_o/w and SlogP differ strikingly. The logP_o/w and SlogP models from MOE [45] give similar R² values, whereas AlogPS2.1 gives very good R² values. One disadvantage of the AlogPS2.1 model is that, for our data set of 10,813 compounds, 122 compounds could not be calculated because their SMILES strings were not accepted.

3.3.3 Variable Importance

The variable importance indicates the effect of a descriptor in a specific model. It is obtained by summing the absolute values of the coefficients of the binned descriptors derived from a given property or from a cross-product of two properties.
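The summation described above can be sketched as follows. The representation of a model as a mapping from (property, bin index) to coefficient is an assumption made for illustration, as are the function name and the coefficient values; whether the thesis also sums over the 50 bagging equations is not stated here, so the sketch treats a single equation.

```python
from collections import defaultdict

def importance_by_property(bin_coefficients):
    """Sum |coefficient| over all bins of the same property or cross-product."""
    totals = defaultdict(float)
    for (prop, _bin_index), coefficient in bin_coefficients.items():
        totals[prop] += abs(coefficient)
    # Most important property first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical bin coefficients of one regression equation.
bin_coefficients = {
    ("EAL x ENEG", 0): 0.5,
    ("EAL x ENEG", 1): -0.75,
    ("IEL", 0): 0.25,
    ("MEP", 0): -0.125,
}

ranking = importance_by_property(bin_coefficients)
print(ranking)
```

Taking absolute values prevents positive and negative bin coefficients of the same property from cancelling each other.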

This importance measured for a model obtained from the full data set, with the AM1 Hamiltonian and the SES, is shown in Figure 3.12.


Figure 3.12. Importance of the different basic descriptors as the arithmetic addition of the absolute values of the corresponding bin coefficients obtained with the AM1 Hamiltonian and the SES.

The EAL, ENEG, IEL, HARD, and FN as well as the cross-products between them are the most important descriptors. No single local property dominates, but the MEP does not play a significant role.

Figure 3.13 is a graphical representation of the variable importance obtained for a model generated with the full data set, the AM1* Hamiltonian and the SES.

Figure 3.13. Quantitative significance of the role played by the different basic descriptors obtained as the arithmetic addition of the absolute values of the corresponding bin coefficients for the AM1* Hamiltonian and the SES.


The MEP, FN, IEL, EAL, and ENEG as well as the cross-products between them are the most important descriptors. No single local property dominates, but the local hardness does not play a significant role.

3.3.4 Variable Dependence

The dependence of the predicted AM1 and AM1* logPow values on the input descriptors was investigated by systematically changing one of them, the FN. The compounds selected for this purpose are listed in Tables 3.9 and 3.10. They are flexible compounds with one to four rotatable bonds. Changing the calculated value of the integrated absolute FN over the surface (|F(N)|), the integrated FN over the surface for all negative values (F(N – ve)), and the integrated FN over the surface for all positive values (F(N + ve)) has a dominant influence on the predicted logPow. Table 3.9 shows the dependence of the calculated logPow on the F(N – ve) obtained with the AM1 Hamiltonian and the SES [41].

Table 3.9. Dependence of logPow prediction on the F(N – ve) for AM1 and the SES

| Compound | Exp. logP | F(N – ve) 1 (kcal Å mol⁻¹) | logP 1 | F(N – ve) 2 (kcal Å mol⁻¹) | logP 2 | Number of rotatable bonds |
|---|---|---|---|---|---|---|
| Benzonitrile, 2,6-dimethyl-, N-oxide | 2.74 | -228.5 | 9.84 | -1319 | 7.32 | 1 |
| Dibenzo[b,z][1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46]hexadecaoxacyclooctatetracontin, 6,7,9,10,12,13,15,16,18,19,21,22,24,25,32,33,35,36,38,39,41,42,44,45,47,48,50,51-octacosahydro- | 0.52 | -731.2 | 5.15 | -7172 | 3.95 | not assigned |
| Dibenzo[b,q][1,4,7,10,13,16,19,22,25,28]decaoxacyclotriacontin, 2,20-bis(1,1-dimethylethyl)-6,7,9,10,12,13,15,16,23,24,26,27,29,30,32,33-hexadecahydro- | 3.32 | -460.0 | 7.45 | -4779 | 6.66 | not assigned |
| Aziridine, 1,1',1''-phosphinothioylidynetris- | 0.53 | -1432 | -4.30 | -3142 | -2.37 | 3 |
| 1H-purine, 2,6,8-tris(methylsulfonyl)- | 3.58 | -561.3 | -0.81 | -5476 | -0.60 | 3 |
| Sulfur, pentafluorophenyl- | 3.36 | -407.3 | 3.80 | -1343 | 3.63 | 1 |
| Acetic acid, 3-(pentafluorothio)phenoxy- | 2.78 | -475.8 | 3.40 | -2554 | 3.02 | 4 |


Table 3.10 gives the dependence of the predicted logPow on the F(N – ve) when the AM1* Hamiltonian and the SES [41] are used.

Table 3.10. Dependence of logPow prediction on the F(N – ve) for AM1* and the SES

| Compound | Exp. logP | F(N – ve) 1 (kcal Å mol⁻¹) | logP 1 | F(N – ve) 2 (kcal Å mol⁻¹) | logP 2 | Number of rotatable bonds |
|---|---|---|---|---|---|---|
| Benzonitrile, 2,6-dimethyl-, N-oxide | 2.74 | -237.9 | 3.13 | -1320 | 2.29 | 1 |
| Dibenzo[b,z][1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46]hexadecaoxacyclooctatetracontin, 6,7,9,10,12,13,15,16,18,19,21,22,24,25,32,33,35,36,38,39,41,42,44,45,47,48,50,51-octacosahydro- | 0.52 | -739.9 | 5.88 | -7282 | 4.41 | not assigned |
| Dibenzo[b,q][1,4,7,10,13,16,19,22,25,28]decaoxacyclotriacontin, 2,20-bis(1,1-dimethylethyl)-6,7,9,10,12,13,15,16,23,24,26,27,29,30,32,33-hexadecahydro- | 3.32 | -472.3 | 7.93 | -4788 | 7.01 | not assigned |
| Aziridine, 1,1',1''-phosphinothioylidynetris- | 0.53 | -1042 | -1.40 | -2315 | -1.12 | 3 |
| 1H-purine, 2,6,8-tris(methylsulfonyl)- | 3.58 | -470.6 | 0.88 | -4886 | -0.31 | 3 |
| Sulfur, pentafluorophenyl- | 3.36 | -414.4 | 6.62 | -1312 | 1.98 | 1 |
| Acetic acid, 3-(pentafluorothio)phenoxy- | 2.78 | -470 | 7.24 | -2514 | 3.24 | 4 |

A direct link exists between the change of a parameter and the logPow prediction. Carbon substitution, in general, increases hydrophobicity, while heteroatoms decrease it. Decreasing the value of the F(N – ve) generally improves the logPow prediction: the deviation from the experimental value decreases for all seven compounds with AM1 (Table 3.9) and for five of the seven compounds with AM1* (Table 3.10).
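The effect on the prediction error can be checked directly against Table 3.9. The sketch below assumes that the first numeric column of the table is the experimental logP and that "logP 1" and "logP 2" are the predictions before and after changing F(N – ve); the two macrocycle names are abbreviated for readability.

```python
# (experimental logP, predicted logP 1, predicted logP 2) from Table 3.9 (AM1, SES).
# Assumption: the first numeric column of the table is the experimental value.
table_3_9 = {
    "Benzonitrile, 2,6-dimethyl-, N-oxide": (2.74, 9.84, 7.32),
    "Dibenzo[b,z]...octatetracontin macrocycle": (0.52, 5.15, 3.95),
    "Dibenzo[b,q]...triacontin macrocycle": (3.32, 7.45, 6.66),
    "Aziridine, 1,1',1''-phosphinothioylidynetris-": (0.53, -4.30, -2.37),
    "1H-purine, 2,6,8-tris(methylsulfonyl)-": (3.58, -0.81, -0.60),
    "Sulfur, pentafluorophenyl-": (3.36, 3.80, 3.63),
    "Acetic acid, 3-(pentafluorothio)phenoxy-": (2.78, 3.40, 3.02),
}

def unsigned_errors(row):
    """Unsigned error of both predictions relative to experiment, rounded to 2 decimals."""
    experimental, predicted_1, predicted_2 = row
    return round(abs(predicted_1 - experimental), 2), round(abs(predicted_2 - experimental), 2)

errors = {name: unsigned_errors(row) for name, row in table_3_9.items()}
improved = [name for name, (error_1, error_2) in errors.items() if error_2 < error_1]

for name, (error_1, error_2) in errors.items():
    print(f"{name}: {error_1} -> {error_2}")
```

The same check can be applied to the AM1* values of Table 3.10.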

In the previous tables, compounds without a number designation for rotatable bonds are macrocycles for which MOE [45] did not assign a number. Figures 3.14 and 3.15 show some of the other compounds listed in these tables.


Figure 3.14. (a) Sulfur, pentafluorophenyl- (b) Aziridine, 1,1',1''-phosphinothioylidynetris-.


Figure 3.15. (a) Acetic acid, 3-(pentafluorothio)phenoxy- (b) Benzonitrile, 2,6-dimethyl-, N-oxide.

Hydrogen bonding is a key parameter for the assessment of possible three-dimensional structures of proteins and nucleic bases. Very often in these macromolecules, hydrogen bonds form between different parts of the same molecule. This internal bonding allows the molecule to adopt a specific conformation that is favorable to its biochemical or physiological role. Therefore, the FN, which appears as one of the most important descriptors, is closely related to hydrogen bonding and can consequently affect the logPow prediction. Figure 3.16 relates the |F(N)| to the number of hydrogen bond donor/acceptor atoms.


Figure 3.16. Average value of the |F(N)| obtained with AM1 and the SES versus the number of hydrogen bond donor/acceptor atoms.

For the full data set, there is a strong relationship between the value of the |F(N)| and the number of hydrogen bond donor/acceptor atoms: the |F(N)| increases with the number of hydrogen bond donor/acceptor atoms.

3.4 Discussion

The main objective of this study was to generate logPow models based mainly on the true local hydrophobicity that can be used to predict accurately the logPow values for a set of structurally diverse compounds. Our investigation established that some properties, such as the number of rotatable bonds and the number of hydrogen bond donor/acceptor atoms (Figures 3.4 and 3.11), strongly affect the logPow prediction. LogPow values for compounds with a large number of rotatable bonds or a large number of hydrogen bond donor/acceptor atoms are poorly estimated. For the special case of the AM1 Hamiltonian and the SES [41] (solvex), the R² of the model obtained from the flexible compounds is about 0.08 lower than that of the model generated with the rigid compounds (Table 3.1), suggesting a particular uncertainty about the conformations of the flexible compounds. Generally, models generated from the rigid compounds, which have restricted conformations, show very good statistical performance, unlike those obtained with the flexible compounds. This may be because the real conformations of the flexible compounds are not known, and the single conformation used was derived from a 2D → 3D conversion with CORINA [44]. Using an average conformation for these flexible compounds, or other statistical approaches, may help to improve the predictive power of the models generated with the flexible compounds. Another important aspect is the relationship between the number of hydrogen bond donor/acceptors of a compound and the RMSE, which increase linearly relative to one another (Figure 3.11). This may be explained by the nature of octanol, which is itself a hydrogen bond donor/acceptor compound [28]. This unfavorable similarity [20] between octanol and compounds containing a large number of hydrogen bond donor/acceptors leads to a repulsive interaction between these compounds and octanol. Therefore, compounds with a high number of hydrogen bond donor/acceptor atoms tend to be less hydrophobic.

For the solvation models, the RMSD histogram was dominated by AM1* and MNDO/d compared with AM1 and PM3; this does not seem to be the case for the logPow models. Figure 3.17 shows the RMSE for the logPow models generated with the total data set for AM1, AM1*, MNDO/d, MNDO, PM3, and PM6.

Figure 3.17. Hamiltonian versus RMSE for the total data set.

According to the histogram above, the RMSE values, in decreasing order, are those for PM3(iso), MNDO/d(iso), PM6(iso), AM1*(iso), PM6(solvex), MNDO(iso), MNDO/d(solvex), AM1*(solvex), MNDO(solvex), PM3(solvex), and finally AM1(solvex). The use of the FN seems to compensate for the lack of the POL for AM1* and MNDO/d that was previously strongly manifested for the solvation models, thus playing a significant role in predictions with these Hamiltonians. As noted in the comments on Figures 3.6 and 3.7, the MEP × FN bin appears in all of the 50 bagging equations and is first on the list of the most important descriptors, ranked by the sum of the absolute values of the coefficients, for AM1* and MNDO/d. FN also appears in a single form and in all cross terms for these two Hamiltonians. The effect of the molecular surface on the logPow prediction is also evident. All the models generated with the SES have RMSE values lower than those obtained with the iso for the same Hamiltonian. The binned SIM approach is entirely a surface-dependent method; the use of the SES always increases the predictive power of the model generated, thus decreasing the value of the RMSE.
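The ordering in Figure 3.17 and the surface effect can be checked with a few lines, as sketched below. The values are the full-set test RMSEs from Tables 3.1 to 3.6; using the test-set rather than the training-set values is an assumption that is consistent with the order given in the text. The key layout is illustrative.

```python
# Full-set test RMSE (logPow units) from Tables 3.1 to 3.6, keyed by (Hamiltonian, surface).
test_rmse = {
    ("AM1", "SES"): 0.58,
    ("AM1*", "SES"): 0.61, ("AM1*", "iso"): 0.65,
    ("MNDO/d", "SES"): 0.63, ("MNDO/d", "iso"): 0.67,
    ("MNDO", "SES"): 0.60, ("MNDO", "iso"): 0.63,
    ("PM3", "SES"): 0.60, ("PM3", "iso"): 0.68,
    ("PM6", "SES"): 0.64, ("PM6", "iso"): 0.66,
}

# Highest RMSE first; models with equal RMSE keep their insertion order.
ranking = sorted(test_rmse, key=test_rmse.get, reverse=True)

# For every Hamiltonian that has both surfaces, compare SES with iso.
ses_always_lower = all(
    test_rmse[(hamiltonian, "SES")] < test_rmse[(hamiltonian, "iso")]
    for (hamiltonian, surface) in test_rmse
    if surface == "iso"
)

print(ranking[0], ranking[-1], ses_always_lower)
```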

Among the Hamiltonians containing the POL, AM1, with 40 of the 336 descriptors (11.90%), uses the fewest descriptors in 25 of the 50 bagging equations, compared with MNDO, PM3, and PM6 with 51 (15.18%), 45 (13.39%), and 43 (12.80%), respectively. For MNDO/d and AM1*, 55 (21.83%) and 39 (15.48%) of the 252 binned descriptors are used, respectively, when the model is generated with the full data set and the SES. The major differences observed in the use of the binned descriptors may be due to the different parameterization approaches used for these Hamiltonians.
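The percentages of descriptors used follow from the counts reported for each Hamiltonian with the full data set and the SES, as the short sketch below shows. Only the dictionary layout is an illustrative choice.

```python
# (descriptors used, descriptors available) per Hamiltonian, full data set with the SES.
descriptor_usage = {
    "AM1": (40, 336),
    "MNDO": (51, 336),
    "PM3": (45, 336),
    "PM6": (43, 336),
    "MNDO/d": (55, 252),
    "AM1*": (39, 252),
}

percent_used = {
    name: round(100 * used / available, 2)
    for name, (used, available) in descriptor_usage.items()
}

for name, percent in percent_used.items():
    print(f"{name}: {percent:.2f}%")
```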

The constants for the models obtained with the SES and our different Hamiltonians are 0.15, 0.21, 0.023, 0.054, 0.055, and -0.17 (for AM1, AM1*, MNDO/d, MNDO, PM3, and PM6, respectively). All of these values are very close to zero, suggesting an equal probability of distribution in water and octanol when the compound is presumed to have no surface. This supports the hypothesis that a function closely related to a true local hydrophobicity can effectively be integrated into a SIM, leading to models that represent a true local hydrophobicity.
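If the constant is read as the predicted logPow of a compound with no surface, as the paragraph above suggests, the corresponding octanol/water partition ratio is 10 raised to that constant. The sketch below converts the six constants; the assignment of constants to Hamiltonians follows the order in which they are reported earlier in this chapter, and the variable names are illustrative.

```python
# Average regression constants of the SES models, in logPow units.
constants = {
    "AM1": 0.15,
    "AM1*": 0.21,
    "MNDO/d": 0.023,
    "MNDO": 0.054,
    "PM3": 0.055,
    "PM6": -0.17,
}

# logPow = log10(P), so a constant c corresponds to a partition ratio P = 10**c.
partition_ratio = {name: 10 ** c for name, c in constants.items()}

for name, ratio in partition_ratio.items():
    print(f"{name}: P = {ratio:.2f}")
```

A ratio of exactly 1 would mean equal distribution between octanol and water; all six values lie within a factor of about 1.7 of this.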

Universal logP models were applied to the total data set of the LOGKOW database [35] for comparison. Except for AlogPS2.1, which with an R² of 0.94 performs better than the binned SIM logPow models developed here, the universal models SlogP and logP_o/w, with R² values of 0.80 and 0.79, respectively, perform worse than the binned SIM logPow models, whose R² values range from 0.84 to 0.89. The large difference between AlogPS2.1 and our models may be due in part to the 122 molecules that were not taken into account in evaluating this model. However, the largest absolute error is lower for our models, confirming the robustness of our binned SIMs. One common difficulty here is that it is not possible to know whether some compounds present in the LOGKOW database [35] are also present in the data sets used to generate the logP_o/w, SlogP, and AlogPS2.1 models. Therefore, the comparison made here cannot be considered a true comparison.

The compounds used for the model generation are completely different from those used in the validation set, so the validation of the models obtained with the binned SIM approach is a genuine external validation.

The FN, as one of the most important descriptors, was analyzed in order to detect a correlation with the conformation of the molecule. A close relationship between the FN and the number of hydrogen bond donor/acceptor atoms was observed, as shown in Figure 3.16, suggesting that one of the two parameters is sufficient to influence the logPow prediction. The FN within a molecule depends strongly on the molecule's conformation.

For the models generated, some compounds behave in a specific way with respect to some Hamiltonians, including, among others, a marked distance from the other points of the curve. This behavior may be due to incompatibilities between the molecule and some parameters related either to VAMP43 or to characteristics or properties specific to the molecule.

1H-purine, 2,6,8-tris(methylsulfonyl)- (Figure 3.18) is an outlier for all the Hamiltonians. After the optimization of the geometry, the two CH3-S-O2 groups are no longer bonded to the central atom group, as shown below. This compound has a problem with geometry optimization in VAMP43, and its logPow value is therefore always poorly reproduced. This problem can be solved by doing a single point calculation instead of a full optimization.


Before optimization After optimization

Figure 3.18. 1H-purine, 2,6,8-tris(methylsulfonyl)-.

1H-imidazole, 2,4,5-triiodo- (Figure 3.19a) is the strongest outlier for the AM1 Hamiltonian when the descriptors are calculated with the SES. This compound contains three iodine atoms. Because iodine is non-polar, it tends to be less soluble in octanol and water, which are polar solvents. A possible interaction between the NH group and an iodine atom may also contribute.

Phenol, 2,6-bis(1-methylpropyl)-4-nitro-, [S-(R*,R*)]- (Figure 3.19b) is the strongest outlier when the geometry is optimized with the PM6 Hamiltonian and the descriptors are calculated with the SES. In this compound, the nitro group is attached to the benzene ring in the para position and, because of its strong attraction for electrons, delocalizes the π-electrons of the ring to satisfy its charge deficiency. Moreover, this para position of the nitro group would place positive charges on two adjacent carbons, which is an unfavorable situation. These effects seem to be strongly manifested with the PM6 Hamiltonian, and all of these factors destabilize the compound. Its logPow is therefore poorly reproduced.


(a) (b)

Figure 3.19. (a) 1H-imidazole, 2,4,5-triiodo- (b) Phenol, 2,6-bis(1-methylpropyl)-4-nitro- , [S-(R*,R*)]-.

Methane, tetrabromo- (Figure 3.20a) is an outlier for the PM6 Hamiltonian. This can be either due to the fact that this compound has an optimization problem with PM6 or because of the parameterization process of the atomic parameters of bromine for the PM6 Hamiltonian.

Dibenzo[b,z][1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46]hexadecaoxacyclooctatetracontin, 6,7,9,10,12,13,15,16,18,19,21,22,24,25,32,33,35,36,38,39,41,42,44,45,47,48,50,51-octacosahydro- (Figure 3.20b), which is a macrocycle, is an outlier for all the Hamiltonians because of its flexibility. In addition, its logPow value was measured under conditions that are not defined.

(a) (b)

Figure 3.20. (a) Methane, tetrabromo- (b) Dibenzo[b,z][1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46]hexadecaoxacyclooctatetracontin, 6,7,9,10,12,13,15,16,18,19,21,22,24,25,32,33,35,36,38,39,41,42,44,45,47,48,50,51-octacosahydro-.

The macrocycle 1,4,7,10,13,16,19,22,25,28,31-benzundecaoxacyclotritriacontin, 2,3,5,6,8,9,11,12,14,15,17,18,20,21,23,24,26,27,29,30-eicosahydro- (Figure 3.21a) is the strongest outlier for MNDO and MNDO/d. This compound probably has an optimization problem with the MNDO and MNDO/d Hamiltonians.

Dibenzo[b,q][1,4,7,10,13,16,19,22,25,28]decaoxacyclotriacontin, 2,20-bis(1,1-dimethylethyl)-6,7,9,10,12,13,15,16,23,24,26,27,29,30,32,33-hexadecahydro- (Figure 3.21b) is also a macrocycle. It is the strongest outlier for all the Hamiltonians because of its flexibility and probably because its experimental logPow value was obtained under conditions that are not defined.

(a) (b)

Figure 3.21. (a) 1,4,7,10,13,16,19,22,25,28,31-Benzundecaoxacyclotritriacontin, 2,3,5,6,8,9,11,12,14,15,17,18,20,21,23,24,26,27,29,30-eicosahydro- (b) Dibenzo[b,q][1,4,7,10,13,16,19,22,25,28]decaoxacyclotriacontin, 2,20-bis(1,1-dimethylethyl)-6,7,9,10,12,13,15,16,23,24,26,27,29,30,32,33-hexadecahydro-.

For AM1*, 1,3,5-triazine, 2,4,6-tris(trichloromethyl)- (Figure 3.22a) is an outlier, probably because of the presence of three CCl3 groups, which possibly destabilized the compound during the optimization.

For PM6, 1,3-benzenedicarboxamide, 5-(acetylmethylamino)-N,N'-bis(2,3-dihydroxypropyl)-2,4,6-triiodo- (Figure 3.22b) is an outlier. This compound contains three iodine atoms and probably has an optimization problem with the PM6 Hamiltonian. Iodine is essentially the most electropositive halogen and, in the presence of a polar solvent, tends to form a charge-transfer complex with it. The high electron density on the iodine atom can also be responsible for the poor reproducibility of the logPow of compounds containing iodine. Another important aspect, as mentioned previously, may be the possible interaction between the iodine atom and an NH group.


(a) (b)

Figure 3.22. (a) 1,3,5-Triazine, 2,4,6-tris(trichloromethyl)- (b) 1,3-Benzenedicarboxamide, 5-(acetylmethylamino)-N,N'-bis(2,3-dihydroxypropyl)-2,4,6-triiodo-.

1H-imidazole, 2-nitro-1-(2,3,5-tri-O-benzoyl-β-D-ribofuranosyl)- (Figure 3.23) is an outlier for all the Hamiltonians because of its flexibility and because its experimental logPow value was obtained under undefined conditions.

Figure 3.23. 1H-imidazole, 2-nitro-1-(2,3,5-tri-O-benzoyl-β-D-ribofuranosyl)-.

Hydrophobic surfaces obtained from MOE45 for 1,3,5-triazine, 2,4,6-tris(trichloromethyl)- and 1H-purine, 2,6,8-tris(methylsulfonyl)-, optimized with AM1* and with the descriptors calculated with the iso, are shown below (Figure 3.24):


1,3,5-Triazine, 2,4,6-tris(trichloromethyl)-: logP(exp): 3.75, logP(calc): 8.66
1H-purine, 2,6,8-tris(methylsulfonyl)-: logP(exp): 3.58, logP(calc): 0.32

Figure 3.24. Schematic view of the contribution of local surface areas for 1,3,5-triazine, 2,4,6-tris(trichloromethyl)- and 1H-purine, 2,6,8-tris(methylsulfonyl)-. The color scheme is green for the hydrophobic surface, violet for the hydrophilic surface and white for neutral.

3.5 Conclusions

Here, robust logPow models for predicting the octanol-water partition coefficient for a set of very large compounds were developed, based on 336 or 252 new surface-bin descriptors generated entirely from the binning of the local properties and their cross-products, for the AM1, PM3, MNDO, and PM6 Hamiltonians or for the AM1* and MNDO/d Hamiltonians, respectively. The descriptors used to capture the chemical information were calculated with either the iso42 or the SES41. The analyses show that all the models developed are surface-dependent, with the SES41 always predicting better than the iso42. This “surface” factor mostly governs the approachability of the solvent and the extent of interaction with the solvent. The models obtained from the new binned SIM approach presented here predict better than the models obtained from the polynomial SIM approach. These models therefore seem to be well suited to many QSAR/QSPR applications, such as the prediction of protein binding, receptor affinity, or pharmacological activity of compounds. The FN plays a key role in predictions with the AM1* and MNDO/d Hamiltonians and seems to be the most appropriate descriptor to compensate for the lack of the POL, which is not implemented in ParaSurf for these Hamiltonians. The electrostatic potential-derived atomic charges calculated with AM1 agree better than those of the other Hamiltonians. This is probably the reason why AM1 predicts logPow better than the other Hamiltonians. The best performance was obtained with the AM1 Hamiltonian and the SES41, with an R2 of 0.89. Generally, the partition coefficient describes the distribution, which can be disproportionate, of a molecule immersed in a solvent composed of two phases, one aqueous and the other organic. If the organic phase is octanol, the partition coefficient reflects the hydrophobicity of the molecule.
When the compound is in contact with the aqueous phase of the solvent, a repulsive force is created between the non-polar molecule and the aqueous solvent. This interaction describes the hydrophobic effect or hydrophobic hydration2-5. For applications in medicinal chemistry, the study of this hydrophobic effect is of paramount importance for the development of drugs, because it predicts the possible interactions between nonpolar regions of drugs and their receiving environments. Through this research project, a new approach has been developed that can help in the relative estimation of lipophilicity, which is necessary for QSAR studies. The technique presented can be regarded as a purely thermodynamic method with a sufficiently broad scope, covering, among others, all neutral molecules whose atomic parameters are implemented in VAMP43. Thus, in contrast to fragment and molecular property-based approaches, it is now possible to estimate logPow values for some compounds without resorting to their experimental values.


3.6 References

1. Jäger, R.; Schmidt, F.; Schilling, B.; Brickmann, J. Localization and quantification of hydrophobicity: The molecular free energy density (MolFESD) concept and its application to sweetness recognition. Journal of Computer-Aided Molecular Design. 2000, 14, 631-646.

2. Blokzijl, W.; Engberts, J. B. F. N. Hydrophobe Effekte: Ansichten und Tatsachen. Angew. Chem. 1993, 105, 1610-1648.

3. Tanford, C. The Hydrophobic Effect: Formation of Micelles and Biological Membranes; Wiley: New York, 1973.

4. Creighton, E. Protein-Structures and Molecular Properties; Freeman: New York, 1993.

5. Abraham, D. J.; Kellogg, G. E. 3D QSAR in Drug Design: Theory, Methods and Applications; Kubinyi, H., Ed.; Escom: Leiden, the Netherlands, 1993; p 506.

6. Hansch, C.; Leo, A. J. Substituent Constants for Correlation Analysis in Chemistry and Biology; Wiley: New York, 1979.

7. Martin, Y. C. Quantitative Drug Design: A Critical Introduction; Marcel Dekker, Inc.: San Diego, 1978.

8. Hansch, C.; Leo, A. Exploring QSAR: Fundamentals and Application in Chemistry and Biology; American Chemical Society: Washington, DC, 1995.

9. Lyman, W. J.; Reehl, W. F.; Rosenblatt, D. H. Handbook of Chemical Property Estimation Methods. American Chemical Society: Washington, DC, 1990.

10. Hansch, C.; Dunn, W. J., III J. Pharm. Sci. 1972, 61, 1-19.

11. Essex, J. W.; Reynolds, C. A.; Graham, W. R. Theoretical Determination of Partition Coefficients. J. Am. Chem. Soc. 1992, 114, 3634-3639.

12. Hansch, C.; Leo, A.; Mekapati, S. B.; Kurup, A. QSAR and ADME. Bioorg. Med. Chem. 2004, 12, 3391-3400.

13. Leo, A. Calculating Log Poct from Structure. Chem. Rev. 1993, 30, 1283-1306.

14. Ghose, A. K.; Crippen, G. M. Atomic Physicochemical Parameters for Three-Dimensional Structure-Directed Quantitative Structure-Activity Relationships I. Partition Coefficients as a Measure of Hydrophobicity. J. Comput. Chem. 1986, 7, 565-577.

15. Klopman, G.; Li, J.-Y.; Wang, S.; Dimayuga, M. Computer Automated log P Calculations Based on an Extended Group Contribution Approach. J. Chem. Inf. Comput. Sci. 1994, 34, 752-781 and references therein.

16. Rekker, R. F.; Mannhold, R. Calculation of Drug Lipophilicity, VCH: New York, 1992.

17. Klopman, G.; Iroff, L. Calculation of Partition Coefficients by the Charge Density Method. J. Comput. Chem. 1981, 2, 157-160.

18. Leo, A. J. ClogP; Daylight Chemical Information Systems: Irvine, CA, 1991.


19. Klopman, G.; Namboodiri, K.; Schochet, M. Simple Method of Computing the Partition Coefficient. J. Comput. Chem. 1985, 6, 28-38.

20. Bodor, N.; Gabanyi, Z.; Wong, C.-K. A New Method for the Estimation of Partition Coefficient. J. Am. Chem. Soc. 1989, 111, 3783-3786.

21. Waller, C. L. A Three-Dimensional Technique for the Calculation of Octanol-Water Partition Coefficient. Quant. Struct.-Act. Relat. 1994, 13, 172-176.

22. Pixner, P.; Heiden, W.; Merx, H.; Moeckel, G.; Moeller, A.; Brickmann, J. Empirical Method for the Quantification and Localization of Molecular Hydrophobicity. J. Chem. Inf. Comp. Sci. 1994, 34, 1309-1319.

23. Abraham, R. J.; Hudson, B. D.; Kermode, M. W.; Nines, J. R. A General Calculation of Molecular Solvation Energies. J. Chem. Soc. Faraday Trans. 1988, 84, 1911-1917.

24. Ehresmann, B.; de Groot, M. J.; Alex, A.; Clark, T. New Molecular Descriptors Based on Local Properties at the Molecular Surface and a Boiling-Point Model Derived from Them. J. Chem. Inf. Comp. Sci. 2004, 43, 658-668.

25. Ehresmann, B.; de Groot, M. J.; Clark, T. Surface-Integral QSPR Models: Local Energy Properties. J. Chem. Inf. Model. 2005, 45, 1053-1060.

26. Sangster, J. Octanol-Water Partition Coefficients: Fundamentals and Physical Chemistry, John Wiley & Sons Ltd: Chichester, 1997; Vol.2.

27. (a) Leo, A.; Hansch, C.; Elkins, D. Partition Coefficients and their uses. Chem. Rev. 1971, 71 (6), 525-616. (b) Leahy, D. E.; Taylor, P. J.; Wait, A. R.; Model Solvent Systems for QSAR Part, I. Propylene Glycol Dipelargonate (PGDP). A new Standard Solvent for use in Partition Coefficient Determination. Quant. Struct.-Act. Relat. 1989, 8 (1), 17-31.

28. Schulte, J.; Dürr, J.; Ritter, S.; Hauthal, W. H.; Quitzsch, K.; Maurer, G. Partition Coefficients for Environmentally Important, Multifunctional Organic Compounds in Hexane + Water. J. Chem. Eng. Data 1998, 43 (1), 69-73.

29. Bingham, R. C.; Dewar, M. J. S.; Lo, D. H. Ground States of Molecules. XXV. MINDO/3. An Improved Version of the MINDO Semiempirical SCF-MO Method. J. Am. Chem. Soc. 1975, 97, 1285-1293.

30. Dearden, J. C.; Bresnen, G. M. The Measurement of Partition Coefficients. QSAR Com. Sci. 2006, 7 (3), 133-144.

31. Valko, K. Application of high-performance liquid chromatography based measurements of lipophilicity to model biological distribution. J. Chromatogr. A 2004, 1037 (1-2), 299-310.

32. Takacs-Novak, K.; Avdeef, A. Interlaboratory study of log P determination by shake-flask and potentiometric methods. J. Pharm. Biomed. Anal. 1996, 14 (11), 1405-1413.

33. The physical properties database (PHYSPROP). Syracuse research corporation.

34. CrossFire Beilstein, Elsevier: Frankfurt, 2009; Vol. 7.1.


35. Sangster, J. LOGKOW: A databank of evaluated octanol-water partition coefficients (Log P). Sangster Research Laboratories: Montreal, Quebec, accessed 11/23/2008.

36. Nys, G. G.; Rekker, R. F. The concept of hydrophobic fragmental constants (f-values). II. Extension of its applicability to the calculation of lipophilicities of aromatic and hetero-aromatic structures. Chim. Ther. 1974, 9, 361-374.

37. Hughes, L. D.; Palmer, D. S.; Nigsch, F.; Mitchell, J. B. Why are some properties more difficult to predict than others? A study of QSPR models of solubility, melting point, and Log P. J. Chem. Inf. Model. 2008, 48 (1), 220-232.

38. Liu, R.; Zhou, D. Using Molecular Fingerprint as Descriptors in the QSPR study of Lipophilicity. J. Chem. Inf. Model. 2008, 48 (3), 542-549.

39. (a) Breindl, A.; Beck, B.; Clark, T.; Glen, R. C. Prediction of the n-octanol/Water Partition Coefficient, logP, Using a Combination of Semiempirical MO-Calculations and a Neural Network. J. Mol. Model. 1997, 3, 142-155. (b) Tetko, I. V.; Tanchuk, V. Y.; Villa, A. E. Prediction of n- octanol/water partition coefficients from PHYSPROP database using artificial neural networks and E-state indices. J. Chem. Inf. Comp. Sci. 2001, 41, 1407-1421.

40. Kramer, C.; Beck, B.; Clark, T. A Surface-Integral Model for Log Pow . J. Chem. Inf. Model. 2010, 50 (3), 429-436.

41. Pan, Q.; Tai, X.-C. Model the Solvent-Excluded Surface of 3D Protein Molecular Structures Using Geometric PDE-Based Level-Set Method. Commun. Comput. Phys. 2009, 6, 777-792.

42. Meyer, A. Y. The size of molecules. Chem. Soc. Rev. 1985, 15, 449-475.

43. Clark, T.; Alex, A.; Beck, A.; Burkhardt, F.; Chandrasekhar, J.; Gedeck, P.; Horn, A. H. C.; Hutter, M.; Martin, B.; Rauhut, G.; Sauer, W.; Schindler, T.; Steinke, T. VAMP 8.2; accelrys Inc.: Erlangen: San Diego, USA, 2002.

44. CORINA 3.4; Molecular Networks Inc: Erlangen, Germany, 2006.

45. Labute, P. Molecular Operating Environment, 2008. 10; Chemical Computing Group: Montreal, Quebec, Canada, 2008.

46. Hansch, C.; Leo, A.; Hoekman, D. Exploring QSAR: Hydrophobic, Electronic, and Steric Constants. The American Chemical Society: Washington, DC, 1995.

47. ParaSurf10, CEPOS InSilico Ltd.: Erlangen, Germany, 2010.

48. Clark, T.; Byler, K. G.; de Groot, M. J., Biological Communication via Molecular Surfaces. In Molecular Interactions Bringing Chemistry to life; Proceedings of the International Beilstein Workshop, Bozen, Italy, May 15-19, 2006 (Logos Verlag:), Berlin, 2008; pp 129-146.

49. Polikar, R. Ensemble based systems in decision making. IEEE Circ. Sys. Mag. 2006, 03/06, 21-45.

50. Efroymson, M. A. Multiple regression analysis. In Mathematical Methods for Digital Computers; Ralston, A.; Wilf, H. S., Eds.; Wiley: New York, 1960; Vol. 1, pp 191-203.


51. Kramer, C.; Tautermann, C. S.; Livingstone, D. J.; Salt, D. W.; Whitley, D. C.; Beck, B.; Clark, T. Sharpening the Toolbox of Computational Chemistry: A New Approximation of Critical F-Values for Multiple Linear Regression. J. Chem. Inf. Model. 2009, 49 (1), 28-34.

52. Campbell, N. A.; Brad, W.; Robin, J. H. Biology: Exploring Life. Boston, Massachusetts: Pearson Prentice Hall, 2006.

53. www.vcclab.org


Chapter 4

Comparative Study of Two Classification Algorithms for the Prediction of Drug-Induced Phospholipidosis

4.1 Introduction

In daily life, some people experience itching of the eyes or skin after taking a drug, which in most cases is the manifestation of an allergy to the drug consumed. Likewise, the presence of some cationic amphiphilic drugs in the human body can cause a side effect commonly referred to as phospholipidosis1 (PPL). It manifests itself physically as a markedly extensive accumulation of phospholipids in lamellar and concentric forms within cells of the body. Typically, many processes that regulate the life cycle of the entire cell occur in the cellular environment, and this metabolism can undergo significant changes under the influence of phospholipids or enzymes. Thus, before a chemical can be considered to have all the properties that ensure proper biological activity, it should undergo important intermediate steps of evaluation, which include a considerable number of tests, such as pharmacodynamics, toxicity, metabolism, excretion, and mutagenicity. Nevertheless, the design of sufficient and effective drugs capable of producing good biological activity remains a major problem and a challenge.

Our goal in this work is the generation of new models for predicting drugs that induce PPL, using two machine learning (ML) techniques.

PPL, as shown in Figure 4.1 below1, arises from the crowding of phospholipids inside living cells, followed by the generation of concentric lamellar bodies, also called inclusion bodies or lysosomal myeloid bodies2.

Figure 4.1. Lamellar inclusion bodies as depicted by electron microscopy of cells. Adapted from 1.

The lungs, liver, eyes, kidneys, cornea, and nervous and lymphatic systems are the parts of the body most often affected by this pathological manifestation. The onset of PPL is largely reversible. It is related to the amount of drug released in the body3: it occurs only if a sufficiently high dosage is administered and disappears soon after systemic metabolism.

The beginning of drug discovery was strongly influenced by the high-speed physico-chemical method used for determining PPL. This assay was essentially based on a quantitative analysis of the drug-phospholipid complex formed4. Because it rests fundamentally on the principles of structure-activity relationships (SAR), the screening method for PPL can be accelerated considerably if computational filters are used. Building on this hypothesis, SAR models facilitate the systematic screening of virtual, computer-generated libraries of compounds5,6. To ensure that the molecules have the properties required to achieve the goal, the virtual screening of these libraries must be carried out with a fairly reliable and accurate method. Such a method can be established by means of quantitative structure-activity relationships (QSAR). The ability to develop effective rules that accurately predict the pharmacokinetic properties of drugs remains a major concern and an important asset for the design and development of drugs. The application of these rules not only opens a rational pathway to drug discovery7 but can also, to some extent, lead to a systematic decrease in project failures that are consistently associated with pharmacokinetic problems8,9. With this aim in mind, the active search for comprehensive and reliable approaches to optimize the pharmacokinetic properties of drug compounds8 has grown considerably. For this purpose, many studies of PPL have been carried out with different methods and techniques10,17. Here, we predict PPL induction as a function of the octanol/water partition coefficient (logPow), the molecular descriptors calculated with ParaSurf, the van der Waals’ surface, and two ML techniques, Naive Bayes (NB) and Random Forest (RF).

Drug-induced PPL is a phenomenon that is characterized by a sporadic occurrence of phospholipids in intracellular lysosomes. It has not yet been proven experimentally that a close relationship exists between in vitro drug-induced PPL and the drug’s side effects in humans. However, this does not eliminate the hypothesis that drug-induced PPL creates a favorable environment for the emergence of toxicity. Thanks to the scientific and technical progress of recent decades, the detection of drug-induced PPL is particularly effective and efficient when it is performed with electron microscopy and quantitative PCR. In parallel, the detection of drug-induced PPL in HepG2 cells18 is quite relevant when performed with a high-throughput LipidTox assay and a fluorescent lipophilic dye. The fluorescent probe technique is a rather practical test, which can be done within a short period of time on a relatively small sample of compounds. SAR is the approach most commonly used in computational chemistry for the determination of drug-induced PPL. With it, chemical compounds are classified as active or inactive according to whether or not they induce effects via a biological receptor. The Hansch model19,20 is the fundamental basis of, and an exceptional reference for, modern SAR and QSAR models. Its basic principle is that it expresses a physico-chemical property qualitatively and quantitatively through a linear statistical correlation with steric, electronic, and hydrophobic indices of chemical structures21. In order to extend the Hansch model, SAR and QSAR models have made use of new classes of structural descriptors or sufficiently powerful statistical models. 
From a qualitative point of view, a numerical descriptor can be likened to a digital representation of some essential molecular features, such as empirical indices (Hammett and Taft substituent constants), physical properties (logPow, dipole moment, or aqueous solubility), the number of substructures or substituents, graph descriptors22-24, topological indices25-27, connectivity indices28,29, electrotopological indices30,31, geometrical descriptors (molecular surface and volume), quantum indices (atomic charges, HOMO and LUMO energies)32,33, and molecular fields (steric, electrostatic, and hydrophobic)34.

4.2 Methods

We used a data set of 144 compounds listed in Table A9 of the Appendix, which when assayed were positive for PPL induction as determined by transmission electron microscopy35, provided by Anne Tilloy-Ellul (Pfizer Global R&D, Amboise Laboratories, France) and Marcel de Groot (Pfizer Global R&D, Sandwich Laboratories, UK). The Pfizer set of 144 canonical SMILES36 was converted to 3D structures with CORINA37,38. Geometries were optimized in the gas phase using the AM1, AM1*, MNDO, MNDO/d, PM3, or PM6 Hamiltonian in VAMP 11 (ref. 39). For AM1* and MNDO, diclofenac was not optimized because Na is not parameterized in VAMP for these Hamiltonians. The 124 molecular descriptors listed in Table A10 of the Appendix were calculated with ParaSurf10alpha40, using the default isodensity surface41 (iso) or the solvent-excluded surface42 (SES). The logPow values of these compounds were calculated using the binned SIM models developed in the preceding chapter and added to ParaSurf's40 standard descriptors, giving a set of 125 descriptors. Ceftazidime (which was duplicated), cephaloridine, and paraquat, which contain quaternary nitrogen, were removed because they cannot be neutralized and they give logPow values of large magnitude compared to those stored in PubChem. Figures 4.2-4.4 show these compounds with their logPow values predicted using the AM1 Hamiltonian and the SES42.

Figure 4.2. Ceftazidime predicted logPow: -12.72, XlogP3 (PubChem): 0.4.

Figure 4.3. Paraquat predicted logPow: -37.64, XlogP3-AA (PubChem): 1.7.


Figure 4.4. Cephaloridine predicted logPow: -13.52, XlogP3 (PubChem): 1.9.

One copy each of the duplicated carbon tetrachloride and valproic acid entries was removed. A class label of +1 is assigned to active compounds (inducing PPL), and a label of -1 to inactive compounds (not inducing PPL). The data set of 138 compounds, described by the set of 125 descriptors, was randomized and then divided into two sets, 50% for the training set and the remainder for assessing the generalization performance. The training set of 69 compounds thus consists of 44 positives and 25 negatives, and the test set of 69 compounds consists of 37 positives and 32 negatives. All the models obtained from the training set were evaluated on their respective test (or validation) set, using 10-fold cross-validation.
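The randomization and 50/50 division described above can be sketched as follows. This is only an illustration: the function name `split_half` and the seed are assumptions, and the actual shuffling procedure used in this work is not specified. The class counts (81 positives and 57 negatives in total) follow from the numbers given in the text.

```python
import random


def split_half(n_compounds, seed=0):
    """Shuffle compound indices and split them 50/50 into training and test sets.

    The seed is a hypothetical choice; a different seed gives a different split.
    """
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    half = n_compounds // 2
    return indices[:half], indices[half:]


# 138 compounds: 44 + 37 = 81 labelled +1 and 25 + 32 = 57 labelled -1.
labels = [1] * 81 + [-1] * 57
train_idx, test_idx = split_half(len(labels))

train_pos = sum(1 for i in train_idx if labels[i] == 1)
test_pos = sum(1 for i in test_idx if labels[i] == 1)
print(len(train_idx), len(test_idx), train_pos, test_pos)
```

A purely random split does not fix the number of positives in each half, which is consistent with the unequal class counts reported for the training and test sets.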

4.2.1 Machine Learning Algorithms

Today, predicting the biological activity of drugs has been facilitated by the development of software that is easily accessible and easy to use. In molecular modeling, models can thus be developed within a short period of time by extracting rules and functions from large data sets with common methods and algorithms such as ML. ML is a fairly extensive area of artificial intelligence15, which includes, among others, decision trees, k-nearest neighbors, lazy learning, Bayesian methods, Gaussian processes, artificial neural networks, artificial immune systems, support vector machines, and kernel algorithms. ML algorithms are characterized by their use of computational and statistical methods to predict new properties through the systematic extraction of information from experimental data. All SAR models based on ML algorithms are efficiently generated using the ML software weka (http://www.cs.waikato.ac.nz/ml/weka/)43,44.

4.2.1.1 Naive Bayes

The basic principle of the Bayes classifier rests on Bayes' theorem, formulated by the British mathematician Thomas Bayes (1702-1761). Any classification made with the NB classifier is therefore a probabilistic classification with strong (naive) assumptions of independence, constructed from the conditional model45

\[ p(C \mid F_1, \ldots, F_n) \]

over a dependent class variable $C$ with a relatively small number of outcomes or classes, conditional on several feature variables $F_1$ through $F_n$. Using Bayes' theorem, the model can be written as

\[ p(C \mid F_1, \ldots, F_n) = \frac{p(C)\, p(F_1, \ldots, F_n \mid C)}{p(F_1, \ldots, F_n)} . \tag{4.1} \]

Under the naive assumption that the features are conditionally independent given the class, the conditional distribution over the class variable $C$ becomes

\[ p(C \mid F_1, \ldots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C) . \tag{4.2} \]

Here, $Z$ is a scaling factor that depends only on $F_1, \ldots, F_n$. For this model, the associated classifier is the classify function given below:

\[ \mathrm{classify}(f_1, \ldots, f_n) = \operatorname*{argmax}_{c}\; p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c) . \tag{4.3} \]
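The classify function of Eq. 4.3 can be illustrated with a minimal sketch. It assumes that each descriptor follows a Gaussian distribution within each class, which is an assumption of this example and not a statement about how the models in this work were built. The function names and the toy data are hypothetical. Logarithms are used so that the product over many descriptors does not underflow.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels):
    """Estimate p(C) and a per-class mean and variance for every feature."""
    grouped = defaultdict(list)
    for x, c in zip(samples, labels):
        grouped[c].append(x)
    model = {}
    for c, rows in grouped.items():
        prior = len(rows) / len(samples)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small constant avoids a zero variance for constant features.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[c] = (prior, stats)
    return model


def classify(model, x):
    """Return argmax_c of log p(C=c) + sum_i log p(F_i = f_i | C=c)."""
    def log_score(c):
        prior, stats = model[c]
        score = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_score)


# Hypothetical two-descriptor toy data, for illustration only.
samples = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (4.0, 5.0), (4.2, 4.8), (3.8, 5.2)]
labels = [-1, -1, -1, 1, 1, 1]
model = fit_gaussian_nb(samples, labels)
print(classify(model, (1.1, 2.0)), classify(model, (4.1, 5.0)))
```

The scaling factor Z of Eq. 4.2 is omitted because it is the same for every class and does not change the argmax.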

4.2.1.2 Random Forest

Suggested by Breiman46, RF is a classification method that uses a collection of unpruned trees to determine the output class of a given observation. It is a collection of tree predictors formulated as

\[ h(x; \Theta_k), \quad k = 1, \ldots, K \]

where $x$ is the observed input (covariate) vector of length $p$ with associated random vector $X$, and the $\Theta_k$ are independent identically distributed (iid) random vectors.

In order to optimize the performance of this learning technique, Breiman introduced the notion of the margin function for a set of classifiers $\{h_1(x), h_2(x), \ldots, h_K(x)\}$ as follows:

\[ mg(X, Y) = a_k\, I(h_k(X) = Y) - \max_{j \neq Y} a_k\, I(h_k(X) = j) \tag{4.4} \]

in which $a_k$ denotes the average over the classifiers $k$ and $I$ the indicator function.

For any classification carried out, the sign of the margin indicates the outcome of the vote:

If $mg(X, Y) > 0$, then the set of classifiers votes for the correct classification. If $mg(X, Y) < 0$, then it votes for the incorrect classification.
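The margin function of Eq. 4.4 can be made concrete with a short sketch. The function name `margin` and the vote list are hypothetical. The votes stand for the class predicted by each tree for one observation.

```python
def margin(votes, true_label):
    """Fraction of votes for the true class minus the largest fraction
    of votes received by any other class (Eq. 4.4 for one observation)."""
    n = len(votes)
    correct = sum(1 for v in votes if v == true_label) / n
    other_classes = {v for v in votes if v != true_label}
    # If no tree votes for another class, the second term is zero.
    best_other = max(
        (sum(1 for v in votes if v == j) / n for j in other_classes),
        default=0.0,
    )
    return correct - best_other


# Hypothetical votes of four trees for one compound.
votes = [1, 1, 1, -1]
print(margin(votes, 1), margin(votes, -1))
```

With these votes the margin is positive when the true label is +1 and negative when it is -1, matching the two cases distinguished above.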

The NB and the RF algorithms are fully implemented in weka43,44.

4.3 Results

PPL, as seen in Figure 4.5 (ref. 47), is an anomaly caused by a lysosomal overload, characterized by the successive deposit of layers of phospholipids in tissues, producing lamellar and concentric bodies.

Figure 4.5. (A) Lipid-filled laminated bodies. Transmission electron micrograph of lysosomal lamellar bodies (LLB) of PPL in kidney tissue. (B) Crystallin. Transmission electron micrograph of LLB of PPL in lung tissue. (C) Zebra. Transmission electron micrograph of LLB of PPL in soft tissue. Reprinted from 47.


4.3.1 Machine Learning Models

In order to optimize the biological activity, the selectivity, or the physico-chemical properties of molecular compounds, the use of SAR and QSAR models based on ML algorithms, over time, seems to be of paramount importance for the development of drugs.

The results obtained by applying the two ML classifiers to our database are given in the following sections. Among these results, those obtained from a set of descriptors calculated with the SES42 for the test set are listed in the form of confusion matrices, where the values (-1) and (+1) are assigned to non-induction and induction of PPL, respectively.

4.3.1.1 Naive Bayes Models

NB is a Bayesian classifier that is particularly well known for its use in anti-spam filters. It is generally well suited to the classification of objects into binary categories.

Implementing the NB classifier on the set of descriptors obtained from the different Hamiltonians and surfaces gives the results listed below, which are enumerated starting with the one obtained with AM1 and the SES:

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      54    78.2609 %
Incorrectly Classified Instances    15    21.7391 %

=== Confusion Matrix ===

  a  b   ← classified as
 21 11 |  a = -1
  4 33 |  b = 1

Figure 4.6. Confusion matrix for the test set obtained from the NB classification model using the descriptors calculated with AM1 and the SES.

Running the NB algorithm on the training set yields a model that, evaluated on the respective test set, gives a prediction accuracy of 66% for negatives and 89% for positives. The overall accuracy is 78%. Of the 32 negative compounds, 21 are correctly classified, and of the 37 positive compounds, 33 are correctly classified. The accuracies for the positive and negative compounds differ by ≈ 23%. This difference is quite high, so the confusion matrix is not well balanced, although the overall accuracy is higher than 75%.
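The percentages quoted above follow directly from the four entries of the confusion matrix in Figure 4.6. The short sketch below recomputes them. The function name `class_accuracies` is hypothetical.

```python
def class_accuracies(tn, fp, fn, tp):
    """Per-class and overall accuracy from a 2x2 confusion matrix."""
    negatives = tn / (tn + fp)   # correctly classified share of the -1 class
    positives = tp / (fn + tp)   # correctly classified share of the +1 class
    overall = (tn + tp) / (tn + fp + fn + tp)
    return negatives, positives, overall


# Entries of the confusion matrix for AM1 and the SES (test set).
neg_acc, pos_acc, overall = class_accuracies(tn=21, fp=11, fn=4, tp=33)
print(round(neg_acc * 100), round(pos_acc * 100), round(overall * 100))
print("difference:", round(pos_acc * 100) - round(neg_acc * 100))
```

The same function can be applied to the other confusion matrices in this section to compare the balance between the two classes.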

The performances of all models generated by running a NB classifier on a set of descriptors calculated with AM1 are given in Table 4.1.

Table 4.1. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with AM1

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   35             14             11              9               71
Iso, test set       30             20             12              7               72
SES, training set   33             13             12              11              67
SES, test set       33             21             11              4               78

When the models generated with the AM1 Hamiltonian are applied to the test sets, the overall accuracy obtained is 78% for the SES and 72% for the iso. There is an increase of the accuracy by ≈ 6% with the SES, compared to the iso. For the training set there is a decrease of the accuracy by ≈ 4% when replacing the iso with the SES.

Classifying the set of descriptors calculated with AM1* and the SES with the NB algorithm gives the following model:

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      50    72.4638 %
Incorrectly Classified Instances    19    27.5362 %

=== Confusion Matrix ===

  a  b   <-- classified as
 20 12 |  a = -1
  7 30 |  b = 1

Figure 4.7. Schematic of the confusion matrix obtained by applying the model generated by training the NB on the descriptors of the training set generated with AM1* and the SES on the respective test set.

Training the NB algorithm on the training set gives a model that, when used to classify the corresponding test set, yields prediction accuracies of 63% for negatives and 81% for positives. The overall accuracy is 72%: 20 of the 32 negative compounds and 30 of the 37 positive compounds are correctly classified. The difference of ≈ 18% between the accuracies of the positive and negative compounds is still rather high, so the confusion matrix shows only a slight similarity.

Table 4.2 gives the performances of the models obtained when the set of descriptors calculated with AM1* is used with the NB classifier.

Table 4.2. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with AM1*

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   35             12             12              9               69
Iso, test set       28             18             14              9               67
SES, training set   34             15             9               10              72
SES, test set       30             20             12              7               72

Modeling with the AM1* Hamiltonian gives an overall test-set accuracy of 72% with the SES and 67% with the iso, so the SES increases the overall accuracy by ≈ 5%. For the SES, the same overall accuracy is obtained for the training and test sets; the two differ only in how the accuracy is distributed between the negative and positive compounds. For the training set, the SES model is ≈ 3% more accurate than the iso model. The main observation is that all the NB models built on descriptors calculated with AM1* are surface-dependent.

Chemical information attained through the descriptors obtained from MNDO and the SES was used with the NB classifier to generate the model in Figure 4.8.

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      54    78.2609 %
Incorrectly Classified Instances    15    21.7391 %

=== Confusion Matrix ===

  a  b   <-- classified as
 21 11 |  a = -1
  4 33 |  b = 1

Figure 4.8. Representation of the confusion matrix of the test set obtained by training the NB on the chemical information generated with MNDO and the SES.

The NB classifier trained on the properties fitted to the SES yields a model with prediction accuracies of 66% for the negative compounds and 89% for the positive compounds, and an overall accuracy of 78%. Here, 21 of 32 negative and 33 of 37 positive compounds are correctly classified. The resulting difference of ≈ 23% between the accuracies of the negative and positive compounds is large, which explains the poor similarity in the above confusion matrix despite the good overall performance.

The predictive powers of all models obtained by performing a NB classification on a set containing chemical information obtained with MNDO are given in Table 4.3.

Table 4.3. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with MNDO

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   35             14             10              9               72
Iso, test set       32             19             13              5               74
SES, training set   35             15             9               9               74
SES, test set       33             21             11              4               78

Considering the test set models, it appears that the overall accuracy is 78% for the SES and 74% for the iso. There is an increase in the accuracy by ≈ 4% with the SES, compared to the iso. For the training set, there is also an increase in the overall accuracy by ≈ 2% when comparing the model generated with the SES to the one obtained with the iso. As with the AM1* Hamiltonian, the NB classifier provides some models that are surface-dependent for the MNDO Hamiltonian.

The descriptors of the training and test sets obtained from MNDO/d and the ParaSurf calculations, subjected to a NB classification, allow the creation of a model with the statistical performances below:

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      50    72.4638 %
Incorrectly Classified Instances    19    27.5362 %

=== Confusion Matrix ===

  a  b   <-- classified as
 23  9 |  a = -1
 10 27 |  b = 1

Figure 4.9. Representation of the confusion matrix for the descriptors of the test set generated with MNDO/d and the SES subjected to the NB classification.

The NB using the descriptors generated with the MNDO/d Hamiltonian and the SES provides prediction accuracies of 72% and 73% for the negative and positive compounds, respectively, of the test set. Twenty-three of 32 negative compounds are correctly predicted, and 27 of 37 positive compounds, resulting in an overall accuracy of 72%. Although the overall accuracy is below 75%, the difference of ≈ 1% between the accuracies of the negative and positive compounds is relatively small, which justifies the presence of this good similarity in the confusion matrix.

The statistical details about the performances of all models generated by running a NB classification on a set of descriptors obtained with MNDO/d are given in Table 4.4.

Table 4.4. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with MNDO/d

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   39             13             12              5               75
Iso, test set       31             17             15              6               70
SES, training set   32             13             12              12              65
SES, test set       27             23             9               10              72

With the NB algorithm, the test-set descriptors generated with the SES yield an overall accuracy of 72%, compared with 70% for those obtained with the iso, an increase of ≈ 2% when the iso is replaced with the SES. In contrast, for the training set the accuracy decreases by ≈ 10% when the iso is replaced with the SES. For the MNDO/d Hamiltonian, the SES thus increases the accuracy for the test set but decreases it for the training set, in comparison to the iso.

With the NB, the training set obtained with the PM3 Hamiltonian and the SES gives a model that, applied to the test set, shows the statistical performances below:

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      46    66.6667 %
Incorrectly Classified Instances    23    33.3333 %

=== Confusion Matrix ===

  a  b   <-- classified as
 16 16 |  a = -1
  7 30 |  b = 1

Figure 4.10. Representation of the predictive powers obtained with the NB trained on descriptors calculated with PM3 and the SES for negative and positive compounds of the test set in the form of a confusion matrix.

The NB classifier trained on the set of 125 descriptors gives a model that correctly predicts 16 of the 32 negative compounds (an accuracy of 50%) and 30 of the 37 positive ones (an accuracy of 81%). The overall accuracy is 67%, and the difference between the predictive performances for the negative and positive compounds is ≈ 31%, which is very high. The confusion matrix therefore shows little or no similarity, and the overall performance is also poor.

Table 4.5 below describes statistically all the models generated by performing a NB classification on a set of descriptors obtained with PM3.

Table 4.5. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with PM3

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   41             10             15              3               74
Iso, test set       35             7              25              2               61
SES, training set   36             13             12              8               71
SES, test set       30             16             16              7               67

Running the models obtained through a NB classifier on the test set yields overall accuracies of 67% for the SES and 61% for the iso. There is an increase in the accuracy by ≈ 6% with the SES, compared to the iso. In contrast, for the training set, the SES decreases the predictivity by ≈ 3% compared to the iso.

Performing the NB probabilistic classification on a set of descriptors created with PM6 and the SES leads to the model described below:

Scheme: weka.classifiers.bayes.NaiveBayes
Correctly Classified Instances      50    72.4638 %
Incorrectly Classified Instances    19    27.5362 %

=== Confusion Matrix ===

  a  b   <-- classified as
 20 12 |  a = -1
  7 30 |  b = 1

Figure 4.11. The confusion matrix obtained by performing the NB classification on the set of descriptors obtained with PM6 and the SES for the test set.

The NB algorithm applied to the descriptors of the training set yields a model that gives accuracies of 63% and 81% for the negative and positive compounds, respectively, when used to classify the compounds previously selected as the test set. With 20 of 32 negative and 30 of 37 positive compounds correctly classified, the model has an overall accuracy of 72%, which is below 75%. The difference between the accuracies of the negative and positive predictions is ≈ 18%, so the confusion matrix lacks similarity.

Table 4.6 lists all statistical values for the models realized by applying a NB classifier on a set of descriptors obtained with PM6.

Table 4.6. Performances of the training and test set models generated by the NB classifier on a set of descriptors obtained with PM6

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   38             11             14              6               71
Iso, test set       28             20             12              9               70
SES, training set   38             13             12              6               74
SES, test set       30             20             12              7               72

The models generated from a set of descriptors calculated with PM6 give overall accuracies of 72% and 70% for the SES and the iso, respectively, when evaluated on their respective test sets. Changing from the iso to the SES therefore increases the accuracy by ≈ 2%. The training set behaves similarly, with an increase in the accuracy of ≈ 3%. All the NB models built on descriptors calculated with PM6 are surface-dependent, and the use of the SES increases their predictive power compared to the iso.

In order to check the effect of another classifier on our data, the RF algorithm was performed on the different sets of descriptors previously used, and the results obtained are presented below.

4.3.1.2 Random Forest Models

RF is an ensemble classification technique that combines a collection of tree predictors. Each tree is parameterized by the values of a random vector sampled independently and with the same distribution for all trees in the forest [46], and the class is assigned by a vote over the trees.
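The ensemble idea can be sketched in a few lines of Python. This is a drastically simplified illustration, not the WEKA implementation: each "tree" is a one-split stump rather than an unpruned tree, the random vector is reduced to a bootstrap sample plus one randomly chosen descriptor, and the data are hypothetical. The defaults `n_trees=10` and `seed=1` only mirror the `-I 10` and `-S 1` options of the scheme used in this work.

```python
import random
from collections import Counter


def train_stump(X, y, feature):
    """Pick the threshold and orientation on one feature with the fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in X}):
        for sign in (1, -1):
            preds = [sign if row[feature] > threshold else -sign for row in X]
            errors = sum(p != label for p, label in zip(preds, y))
            if best is None or errors < best[0]:
                best = (errors, threshold, sign)
    _, threshold, sign = best
    return feature, threshold, sign


def train_forest(X, y, n_trees=10, seed=1):
    """Train each stump on its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feature = rng.randrange(n_features)          # random descriptor
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(sign if row[f] > t else -sign for f, t, sign in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data; -1 = non-inducer, +1 = inducer.
X_toy = [[1, 1], [2, 1.5], [1.5, 2], [8, 9], [9, 8], [8.5, 8.5]]
y_toy = [-1, -1, -1, 1, 1, 1]
toy_forest = train_forest(X_toy, y_toy)
print(predict_forest(toy_forest, [0.0, 0.0]), predict_forest(toy_forest, [10.0, 10.0]))
```

Because every member sees a different resampling of the data, individual errors tend to be outvoted by the other members of the ensemble.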

The models for the PPL classifications generated by a 10-fold cross-validation run through a RF classifier on the descriptors of the training and test sets calculated with the different Hamiltonians and surfaces are given below, starting with the one obtained with AM1 and the SES:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      52    75.3623 %
Incorrectly Classified Instances    17    24.6377 %

=== Confusion Matrix ===

  a  b   <-- classified as
 23  9 |  a = -1
  8 29 |  b = 1

Figure 4.12. Confusion matrix for the test set obtained using RF, the AM1 Hamiltonian and the SES.

RF trained on the set of descriptors gives accuracies of 72% for the negative compounds and 78% for the positive ones. The difference between the two is ≈ 6%, which is small and indicates a good similarity in the confusion matrix. For this model, in which 23 of the 32 negative compounds and 29 of the 37 positive ones are correctly classified, an overall accuracy of 75% is obtained. Compared to NB, the difference between the predictive powers for negative and positive compounds decreases markedly, from 23% to 6%, which considerably improves the similarity of the confusion matrix.

The performances of the models generated when a RF classifier is run on a set of descriptors calculated with AM1 are given in Table 4.7.

Table 4.7. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with AM1

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   32             14             11              12              67
Iso, test set       32             18             14              5               72
SES, training set   33             16             9               11              71
SES, test set       29             23             9               8               75

With the AM1 Hamiltonian, performing the RF approach on a set of descriptors calculated with the SES leads to an increase in the accuracies of the training and test sets by ≈ 4%, and ≈ 3%, respectively, when a comparison is made with the calculations performed with the iso. The RF classifier provides some models that are entirely surface-dependent.

A RF approach trained on a set of descriptors calculated with AM1* and the SES generates the model below:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      54    78.2609 %
Incorrectly Classified Instances    15    21.7391 %

=== Confusion Matrix ===

  a  b   <-- classified as
 26  6 |  a = -1
  9 28 |  b = 1

Figure 4.13. Representation of the confusion matrix obtained when the model generated by training a RF classifier on the descriptors calculated with AM1* and the SES is applied to the test set.

An overall accuracy of 78% is obtained for the test set when the RF model built on the selected training set is applied. Twenty-six of the 32 negative compounds and 28 of the 37 positive ones are correctly predicted, leading to accuracies of 81% and 76% for the negative and positive compounds, respectively. The gap of ≈ 5% between the predictivities of the negative and positive compounds is small, which is reflected in a very good similarity in the confusion matrix. For the same set of descriptors, RF improves the predictive power relative to NB and corrects the poor similarity of the confusion matrix observed with NB: the difference between the predictivities of negative and positive compounds drops from 18% to 5%.

Table 4.8 presents the performances of the training and test set models obtained by applying the RF algorithm to a set of descriptors calculated with AM1*.

Table 4.8. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with AM1*

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   31             12             12              13              63
Iso, test set       29             17             15              8               67
SES, training set   34             14             10              10              71
SES, test set       28             26             6               9               78

The overall accuracy is ≈ 78% for the SES and ≈ 67% for the iso when the models trained with the RF classifier are applied to the descriptors of the test set for AM1*. The SES thus increases the accuracy by ≈ 11% compared to the iso. The same trend is observed for the training set models, where the use of the SES increases the accuracy by ≈ 8%; therefore, all the models generated here are surface-dependent.

Applying RF classification to the descriptors obtained from the MNDO and SES calculations gives a model whose characteristics are given below:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      52    75.3623 %
Incorrectly Classified Instances    17    24.6377 %

=== Confusion Matrix ===

  a  b   <-- classified as
 25  7 |  a = -1
 10 27 |  b = 1

Figure 4.14. Confusion matrix representing the statistical distribution of the predictivity of the test set obtained by training the RF classifier on descriptors calculated with MNDO and the SES.

This RF model has an overall accuracy of 75%, with accuracies of 78% for the negative compounds and 73% for the positive ones. Twenty-five of the 32 negative compounds and 27 of the 37 positive ones are correctly classified. The small difference of ≈ 5% between the accuracies of the positive and negative compounds accounts for the good similarity in the confusion matrix. This difference is reduced from 23% to 5% when the NB is replaced by the RF classifier.

The predictive powers of the training and test set models obtained by performing a RF classification on a set containing chemical information gained from MNDO calculations are listed in Table 4.9.

Table 4.9. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with MNDO

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   34             14             10              10              71
Iso, test set       30             20             12              7               72
SES, training set   36             16             8               8               76
SES, test set       27             25             7               10              75

The overall accuracy for the test set is 75% when generating the descriptors with the SES and 72% for the iso. There is an increase in the accuracy by ≈ 3% with the SES, compared to the iso. Using the descriptors of the training set obtained with the SES for model generation helps in increasing the accuracy by ≈ 5%. The training and test set models are both surface-dependent.

Combining the different tree predictors derived from an application of the RF algorithm to a set of descriptors obtained with MNDO/d and the SES gives the model presented below:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      55    79.7101 %
Incorrectly Classified Instances    14    20.2899 %

=== Confusion Matrix ===

  a  b   <-- classified as
 25  7 |  a = -1
  7 30 |  b = 1

Figure 4.15. Statistical distribution of the predictivity of negative and positive compounds of the test set obtained by training the RF classifier on descriptors calculated with MNDO/d and the SES in a confusion matrix.

A set of descriptors obtained from the MNDO/d Hamiltonian and the SES, classified with the RF algorithm, gives a model with an overall accuracy of 80% and accuracies of 78% and 81% for the negative and positive compounds, respectively. Twenty-five of the 32 negative compounds and 30 of the 37 positive compounds are correctly predicted. Only a small gap of ≈ 3% exists between the predictive powers for the negative and positive compounds, so the confusion matrix shows a very good similarity together with a high overall accuracy. With NB, the difference between the predictivities of positive and negative compounds is even smaller (≈ 1%), but the overall accuracy is only 72%; RF significantly improves the overall performance while maintaining a low difference of ≈ 3% between the predictivities of positive and negative compounds.

The statistical information on the performances of the models for the training and test sets generated by training a RF classifier on a set obtained with MNDO/d are presented in Table 4.10.

Table 4.10. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with MNDO/d

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   33             11             14              11              64
Iso, test set       28             23             9               9               74
SES, training set   29             13             12              15              61
SES, test set       30             25             7               7               80

The overall accuracies are 80% for the SES and 74% for the iso when the models obtained with MNDO/d are evaluated by a RF classification of their respective test sets. There is an increase in the accuracy by ≈ 6% when the iso is replaced by the SES for the calculation of the descriptors for the test set. In contrast, there is a decrease of ≈ 3% for the training set when the model obtained with the SES is compared to the one generated with the iso.

Training the RF classifier on the descriptors calculated with PM3 and the SES generates the following model:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      58    84.0580 %
Incorrectly Classified Instances    11    15.9420 %

=== Confusion Matrix ===

  a  b   <-- classified as
 25  7 |  a = -1
  4 33 |  b = 1

Figure 4.16. Configuration of the confusion matrix for the test set obtained with RF for a set of descriptors generated with the PM3 Hamiltonian and the SES.

Mapping the local properties onto the SES for each compound yields a set of descriptors. This set of descriptors, together with the additional logPow, is used to train the RF classifier and gives a model in which seven compounds are wrongly predicted as positive and four are wrongly predicted as negative. This is the best model, with an overall accuracy of 84% and accuracies of 78% and 89% for the negative and positive compounds, respectively. The difference between the accuracies of negative and positive compounds is ≈ 11%, so the similarity in the confusion matrix is good. Compared to the model obtained with the NB classifier, RF significantly improves the overall performance and the similarity of the confusion matrix, reducing the difference between the predictivities of positive and negative compounds from 31% to 11%.

Table 4.11 describes statistically the training and test set models generated by performing a RF classification technique on a set obtained with PM3.

Table 4.11. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with PM3

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   31             13             12              13              64
Iso, test set       30             23             9               7               77
SES, training set   35             15             10              9               72
SES, test set       33             25             7               4               84

When the models are tested, the overall accuracy is 84% for the SES and 77% for the iso, an increase of ≈ 7% with the SES. For the training set, the SES gives an accuracy ≈ 8% higher than the iso. RF classification with geometries optimized using the PM3 Hamiltonian thus yields models that are surface-dependent.

Training the RF classifier, a collection of unpruned trees, on the descriptors calculated with PM6 and the SES yields a model whose performance is described below:

Scheme: weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Correctly Classified Instances      53    76.8116 %
Incorrectly Classified Instances    16    23.1884 %

=== Confusion Matrix ===

  a  b   <-- classified as
 21 11 |  a = -1
  5 32 |  b = 1

Figure 4.17. RF predictions for the test set via a confusion matrix when the descriptors are calculated with PM6 and the SES.

With a 10-fold cross-validation scheme run through RF, a classification model for drugs inducing PPL is obtained in which 21 compounds are correctly classified as non-inducers of PPL and 32 as inducers. For this model, the accuracies are 66% for the negative compounds, 86% for the positive ones, and 77% overall. The large difference of ≈ 20% between the accuracies of the negative and positive compounds leads to a weak similarity in the confusion matrix. PM6 is thus an exception among the Hamiltonians. Compared to NB, RF increases the overall accuracy. However, unlike the previous cases, the difference between the predictivities of the positive and negative compounds increases, from 18% to 20%.

Table 4.12 presents the statistical values for the training and test set models realized by applying a RF algorithm on a set of descriptors obtained with PM6.

Table 4.12. Performances of the training and test set models generated by the RF classifier on a set of descriptors obtained with PM6

Model               True positive  True negative  False positive  False negative  Accuracy (%)
Iso, training set   32             15             10              12              68
Iso, test set       34             19             13              3               77
SES, training set   37             12             13              7               71
SES, test set       32             21             11              5               77

This is a particular situation where the overall accuracy is 77% for both the SES and iso when the models are evaluated on their respective test sets. For the training set, there is an increase of ≈ 3% with the SES, compared to the iso. Performing the RF classification on a set of descriptors generated with PM6 yields models in which the predictive powers are unchanged when applied to their respective test sets using either the SES or the iso. The major difference between the two models generated based on the descriptors calculated with the PM6 Hamiltonian, the SES and the iso is their predictive powers for negative and positive compounds.

Figure 4.18 below gives a general description of the predictive powers of the models obtained by training a NB or RF classifier on the descriptors of the training set calculated with the different Hamiltonians and surfaces when evaluated on their respective test sets.


Figure 4.18. Distribution of the accuracies of the models trained with NB and RF and evaluated on the test sets with the different Hamiltonians and surfaces. (I) and (S) represent iso and the SES, respectively.

The histogram shows that, with the NB algorithm, the accuracy increases for every Hamiltonian when the models are generated with the descriptors calculated with the SES. The same is observed for the RF, except for the models generated with the PM6 Hamiltonian, where the predictive power is the same whether the descriptors are calculated with the SES or the iso. For the AM1* Hamiltonian, the descriptors calculated with the iso produce models that have the same predictive power on their test sets with both the NB and the RF algorithms.
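The trends read from the histogram can be checked against the test-set accuracies of Tables 4.1 to 4.12. A short Python sketch; the dictionary layout and the function name are illustrative choices, while the numbers are the tabulated accuracies in percent:

```python
HAMILTONIANS = ["AM1", "AM1*", "MNDO", "MNDO/d", "PM3", "PM6"]

# Test-set accuracies (%) taken from Tables 4.1 to 4.12.
TEST_ACCURACY = {
    ("NB", "iso"): [72, 67, 74, 70, 61, 70],
    ("NB", "SES"): [78, 72, 78, 72, 67, 72],
    ("RF", "iso"): [72, 67, 72, 74, 77, 77],
    ("RF", "SES"): [75, 78, 75, 80, 84, 77],
}


def ses_gain(classifier):
    """Accuracy change (percentage points) when the iso is replaced by the SES."""
    iso = TEST_ACCURACY[(classifier, "iso")]
    ses = TEST_ACCURACY[(classifier, "SES")]
    return {h: s - i for h, i, s in zip(HAMILTONIANS, iso, ses)}


for name in ("NB", "RF"):
    print(name, ses_gain(name))
```

A gain of zero appears only for RF with PM6, the single case in which the two surfaces give the same test-set accuracy.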

The Matthews Correlation Coefficient (MCC) [48] is an important parameter for evaluating the predictive power of a model. It is calculated using the standard formula

\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4.5}
\]

Here, TP is the number of compounds correctly classified as inducers of PPL, and TN the number of compounds correctly classified as non-inducers of PPL. FP and FN are related to the number of compounds wrongly classified as inducers and non-inducers of PPL, respectively. In order to measure the predictivity of each classifier, the MCC value of the test set for each model generated with the NB and the RF classifiers was calculated and the values obtained are reported in Table 4.13.
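As a worked check, the MCC can be evaluated in Python from the test-set counts of the preceding tables. The function name is illustrative, and returning zero for a vanishing denominator is a common convention assumed here rather than something stated in this work:

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Test-set counts from Tables 4.1 (NB, AM1) and 4.11 (RF, PM3, SES).
mcc_nb_am1_iso = matthews_cc(tp=30, tn=20, fp=12, fn=7)
mcc_nb_am1_ses = matthews_cc(tp=33, tn=21, fp=11, fn=4)
mcc_rf_pm3_ses = matthews_cc(tp=33, tn=25, fp=7, fn=4)
print(round(mcc_nb_am1_iso, 2), round(mcc_nb_am1_ses, 2), round(mcc_rf_pm3_ses, 2))
```

Unlike the overall accuracy, the MCC penalizes a model that performs well on one class only, which makes it a useful complement to the class-wise accuracies discussed above.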

Table 4.13. Calculated values of the MCC for the test set of each model obtained with a 10-fold cross-validation

Hamiltonian and surface  NB    RF    Hamiltonian and surface  NB    RF
AM1 (iso)                0.45  0.45  AM1 (SES)                0.57  0.50
AM1* (iso)               0.33  0.33  AM1* (SES)               0.45  0.57
MNDO (iso)               0.48  0.45  MNDO (SES)               0.57  0.51
MNDO/d (iso)             0.39  0.48  MNDO/d (SES)             0.45  0.59
PM3 (iso)                0.24  0.53  PM3 (SES)                0.33  0.68
PM6 (iso)                0.39  0.55  PM6 (SES)                0.45  0.54
Average                  0.38  0.47  Average                  0.47  0.57

As seen in Table 4.13, the MCC values of the models generated with the RF and the NB algorithms increase with the SES for AM1, AM1*, MNDO, MNDO/d, and PM3. The only exception is the pair of models generated by RF classification on the descriptors obtained with PM6 for the iso and the SES: although the two models have the same overall accuracy (77%), as mentioned above, the MCC value decreases by ≈ 0.01 with the SES. With the NB, however, the MCC value for PM6 increases on going from the iso to the SES. The use of the SES improves the average MCC by ≈ 0.10 for the RF (from 0.47 to 0.57) and by ≈ 0.09 for the NB (from 0.38 to 0.47) when compared to the iso. The average MCC obtained with RF is greater than that obtained with NB for both surfaces.
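The column averages of Table 4.13 can be recomputed from the individual entries. A brief Python sketch; the data layout is an illustrative choice and the values are the rounded MCCs listed in the table, in the order AM1, AM1*, MNDO, MNDO/d, PM3, PM6:

```python
# Rounded MCC values from Table 4.13.
MCC_VALUES = {
    ("NB", "iso"): [0.45, 0.33, 0.48, 0.39, 0.24, 0.39],
    ("RF", "iso"): [0.45, 0.33, 0.45, 0.48, 0.53, 0.55],
    ("NB", "SES"): [0.57, 0.45, 0.57, 0.45, 0.33, 0.45],
    ("RF", "SES"): [0.50, 0.57, 0.51, 0.59, 0.68, 0.54],
}

mcc_average = {key: sum(vals) / len(vals) for key, vals in MCC_VALUES.items()}

for clf in ("NB", "RF"):
    gain = mcc_average[(clf, "SES")] - mcc_average[(clf, "iso")]
    print(f"{clf}: iso {mcc_average[(clf, 'iso')]:.3f}, "
          f"SES {mcc_average[(clf, 'SES')]:.3f}, gain {gain:.3f}")
```

Because the inputs are already rounded to two decimals, the averages obtained this way can differ in the last digit from averages computed on unrounded MCC values.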

The predictions obtained from the best models are listed in Table A11 of the Appendix.

The effect of each of the classifiers (NB and RF) on the compounds of the test set was investigated by sorting in increasing order the number of correct classifications over 12 models. Here the number of times the majority prediction was made for each compound over all of the 12 runs is calculated and listed in Table 4.14 below. LogP (calc) are the values of the logP of the best model obtained with the PM3 Hamiltonian and the SES. LogP (XLogP3), H-bond donor, and H-bond acceptor values are obtained from the predicted properties stored in PubChem, ChemSpider (Predicted-ACD/Labs Properties) or SciFinder, denoted with the references a, b, and c, respectively.

Table 4.14. Number of correct classifications of 12 models by the NB and the RF classifiers for all the Hamiltonians and surfaces

Compound  logP (calc)  logP  H-bond donor  H-bond acceptor  Inductance  NB  RF

Not predictive
Methadone  3.56  3.9a  0a  2a  -1  0  0
Clociguanil  2.92  3.081b,c  4b,c  6b,c  1  0
Doxapram  3.13  3.3a  0a  3a  -1  0
Procaine  2.27  2.256b,c  2b,c  4b,c  -1  0
Bicalutamide  3.55  4.135b,c  2b,c  6b,c  -1  0
Diflunisal  3.19  3.652b,c  2b,c  3b,c  -1  0
Suramin  4.34  1.5a  12a  23a  1  0

Very slightly predictive
Colchicine  2.08  1.066b,c  1b,c  7b,c  -1  1  1
Sulindac  3.12  3.4a  1a  5a  -1  1
Bilirubin  4.52  2.9a  6a  6a  1  1
Rolitetracycline  2.34  1.657c  6c  11c  -1  1
Tacrine  2.62  2.7a  1a  2a  -1  1
Gemfibrozil  3.24  3.8a  1a  3a  -1  2
Etoposide  0.16  0.275b,c  3b,c  13b,c  -1  2
Suramin  4.34  1.5a  12a  23a  1  2
Flutamide  2.99  3.3a  1a  6a  -1  2
Clociguanil  2.92  3.081b,c  4b,c  6b,c  1  1
Doxapram  3.13  3.3a  0a  3a  -1  1
Tunicamycin  -0.35  -0.3a  11a  16a  1  1
Bilirubin  4.52  2.9a  6a  6a  1  2
Abacavir  1.29  1.158b,c  4b,c  7b,c  -1  2

Slightly predictive
Amiodarone  7.34  7.6a  0a  4a  1  4
Ketoconazole  3.74  4.043b,c  0b,c  8b,c  1  4  4
Amikacin  -5.67  -5.262b,c  17b,c  18b,c  1  4
Tacrine  2.62  2.7a  1a  2a  -1  4
Carbamazepine  2.56  2.5a  1a  1a  -1  5
Rolitetracycline  2.34  1.657c  6c  11c  -1  5

Predictive
Amikacin  -5.67  -5.262b,c  17b,c  18b,c  1  6
Tunicamycin  -0.35  -0.3a  11a  16a  1  6
Chlorpromazine  5.26  5.2a  0a  3a  1  7
Carbamazepine  2.56  2.5a  1a  1a  -1  7
Abacavir  1.29  1.158b,c  4b,c  7b,c  -1  7
Valproic_acid  2.40  2.8a  1a  2a  -1  7
AC-3579  3.37  2.154b,c  0b,c  5b,c  1  8
Tocainide  1.80  0.808b,c  3b,c  3b,c  1  7
Procaine  2.27  2.256b,c  2b,c  4b,c  -1  7
Hydrazine  -0.90  -1.5a  2a  2a  -1  7
Demeclocycline  1.10  0.942c  7c  10c  -1  7
Chlortetracycline  2.18  1.323b,c  7b,c  10b,c  -1  8
Gemfibrozil  3.24  3.8a  1a  3a  -1  8
Tobramycin  -4.54  -4.224b,c  15b,c  14b,c  1  8
WY-14643  3.73  3.958b,c  2b,c  5b,c  -1  8
Doxycycline  1.88  1.777b,c  7b,c  10b,c  -1  8

Very predictive
Tobramycin  -4.54  -4.224b,c  15b,c  14b,c  1  9
Tocainide  1.80  0.808b,c  3b,c  3b,c  1  9
Etoposide  0.16  0.275b,c  3b,c  13b,c  -1  9
Amiodarone  7.34  7.6a  0a  4a  1  9
Sulindac  3.12  3.4a  1a  5a  -1  9
Famotidine  -0.65  -0.64b  8b  9b  -1  9
Acetaminophen  0.60  0.5a  2a  2a  -1  10
Carbon tetrachloride  2.53  2.8a  0a  0a  -1  10
Fenfluramine  4.55  3.554b,c  1b,c  1b,c  1  10
Promethazine  3.72  4.8a  0a  3a  1  10
Quinacrine  6.12  6a  1a  4a  1  10
Doxycycline  1.88  1.777b,c  7b,c  10b,c  -1  10
Chlortetracycline  2.18  1.323b,c  7b,c  10b,c  -1  10
Caffeine  0.48  -0.1a  0a  3a  -1  11
Dibucaine  4.60  4.759b,c  1b,c  5b,c  1  11  11
Galactosamine  -2.57  -2.8a  5a  6a  -1  11  11
Hypoglicin-A  0.60  -2.5a  2a  3a  -1  11  11
Methyldopa  0.64  0.676b  4b  5b  -1  11  11
Piroxicam  0.96  0.588c  2c  7c  -1  11  11
Zileuton  1.62  1.6a  2a  3a  -1  11  11
Acetaminophen  0.60  0.5a  2a  2a  -1  11
Demeclocycline  1.10  0.942c  7c  10c  -1  11
Hydrazine  -0.90  -1.5a  2a  2a  -1  11
Valproic_acid  2.40  2.8a  1a  2a  -1  11
Diflunisal  3.19  3.652b,c  2b,c  3b,c  -1  11
Bicalutamide  3.55  4.135b,c  2b,c  6b,c  -1  11
AC-3579  3.37  2.154b,c  0b,c  5b,c  1  11
Tetracaine  2.81  3.7a  1a  4a  1  11
Methotrexate  0.37  -0.446b,c  7b,c  13b,c  -1  11
Gentamicin  -2.72  -1.887b  11b  12b  1  11

Highly predictive
Chloroquine  4.79  4.6a  1a  3a  1  12  12
Cyclizine  4.33  3.6a  0a  2a  1  12  12
Desipramine  4.09  3.972b,c  1b,c  2b,c  1  12  12
Hydroxyzine  3.45  3.7a  1a  4a  1  12  12
Chlorcyclizine  5.11  4.5a  0a  2a  1  12  12
Clomipramine  5.16  5.2a  0a  2a  1  12  12
Emetine  4.49  4.7a  1a  6a  1  12  12
Lysergide  3.13  3a  1a  2a  1  12  12
AY-9944  6.08  6.395b,c  2b,c  2b,c  1  12  12
Norchlorcyclizine  4.08  3.4a  1a  2a  1  12  12
Nortriptyline  4.51  4.5a  1a  1a  1  12  12
Pheniramine  3.54  2.8a  0a  2a  1  12  12
Phentermine  2.76  2.200b,c  2b,c  1b,c  1  12  12
Tamoxifen  6.34  7.1a  0a  2a  1  12  12
Temozolomide  -0.14  -1.1a  1a  5a  -1  12  12
Homochlorcyclizine  5.58  4.2a  0a  2a  1  12  12
Quinidine  2.40  2.823b,c  1b,c  4b,c  1  12  12
SDZ-200125  3.45  3.351b,c  0b,c  5b,c  1  12  12
Stavudine  -0.19  -0.647b,c  2b,c  6b,c  -1  12  12
Trifluperazine  5.16  5a  0a  7a  1  12  12
Triparanol  6.98  6.2a  1a  3a  1  12  12
Netilmicin  -2.12  -1.840b,c  11b,c  12b,c  1  12  12
Methotrexate  0.37  -0.446b,c  7b,c  13b,c  -1  12
Tetracaine  2.81  3.7a  1a  4a  1  12
WY-14643  3.73  3.958b,c  2b,c  5b,c  -1  12
Carbon tetrachloride  2.53  2.8a  0a  0a  -1  12
Famotidine  -0.65  -0.64b  8b  9b  -1  12
Fenfluramine  4.55  3.554b,c  1b,c  1b,c  1  12
Promethazine  3.72  4.8a  0a  3a  1  12
Quinacrine  6.12  6a  1a  4a  1  12
Gentamicin  -2.72  -1.887b  11b  12b  1  12
Caffeine  0.48  -0.1a  0a  3a  -1  12
Chlorpromazine  5.26  5.2a  0a  3a  1  12
Flutamide  2.99  3.3a  1a  6a  -1  12

The number of correct predictions ranges from zero to 12, suggesting that, over the 12 runs, the two classes were not predicted equally well.

Most of the compounds correctly predicted as actives by both the NB and RF classifiers in all 12 models generated for each classifier have physical properties in the following ranges:

2 ≤ logP ≤ 5, 0 ≤ H-bond donors ≤ 2, 1 ≤ H-bond acceptors ≤ 7,

thus respecting Lipinski’s condition⁴⁹, which states that an orally active drug should have a logP value not greater than 5 (logP units), no more than five hydrogen bond donors, and no more than 10 hydrogen bond acceptors.
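The three criteria quoted above can be checked mechanically. The following is a minimal sketch, not part of the original workflow: the `Compound` class and the `lipinski_violations` function are hypothetical names introduced for illustration, and the property values are the literature logP, H-bond donor, and H-bond acceptor entries listed in Table 4.14.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Hypothetical container for the three properties discussed in the text."""
    name: str
    logp: float
    hbd: int  # number of hydrogen bond donors
    hba: int  # number of hydrogen bond acceptors


def lipinski_violations(c):
    """Return the names of the criteria (of the three used here) that c violates."""
    violations = []
    if c.logp > 5:
        violations.append("logP")
    if c.hbd > 5:
        violations.append("H-bond donors")
    if c.hba > 10:
        violations.append("H-bond acceptors")
    return violations


# Property values as listed in Table 4.14 (footnote markers omitted).
triparanol = Compound("Triparanol", 6.2, 1, 3)
tamoxifen = Compound("Tamoxifen", 7.1, 0, 2)
ay9944 = Compound("AY-9944", 6.395, 2, 2)
netilmicin = Compound("Netilmicin", -1.840, 11, 12)

for compound in (triparanol, tamoxifen, ay9944, netilmicin):
    print(compound.name, lipinski_violations(compound))
```

Only the three criteria considered in this chapter are tested; the molecular-weight criterion of the full rule of five is deliberately left out.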


Among the compounds correctly classified as inducers of PPL in all 12 runs with each classifier, triparanol, tamoxifen, and AY-9944 have computed logP values well above 5 (logP units), but their numbers of hydrogen bond donors and acceptors do not exceed five and 10, respectively. They therefore violate only one of the three Lipinski criteria investigated here.

A special observation was made for netilmicin, which was correctly predicted as an inducer of PPL in all 12 runs by both the NB and the RF algorithms, although it has a logP < 0 and 11 hydrogen bond donors and 12 hydrogen bond acceptors.

Compounds correctly classified as inactive have physical properties that are characterized as follows:

logP < 0, 1 ≤ H-bond donors ≤ 2, 5 ≤ H-bond acceptors ≤ 6.

The statistical description of the number of correct predictions realized by the NB and the RF classifiers for the 69 compounds of the randomly chosen test set is shown in Figure 4.19.

Figure 4.19. Distribution of the percentage of correct classifications by the NB and the RF for all the compounds of the test set.

The histogram shows that NB has a higher percentage of compounds predicted correctly in all 12 runs (class C12), that is, NB predicts correctly more consistently, but also higher percentages for the always-misclassified class C0 and for C1, C2, and C6. For class C12, both classifiers have a percentage above 35%. The two algorithms give the same percentage, just over 5%, for class C7, but RF has higher percentages than NB for C4, C5, C8, C9, C10, and C11. Here, NB and RF each provide 11 prediction classes of unequal amplitudes.


4.4 Discussion

This investigation suggests that logP, together with the numbers of H-bond donors and H-bond acceptors, is among the most important physical properties for predicting whether a compound induces PPL. Triparanol, tamoxifen, and AY-9944, with computed logP values of 6.2ᵃ, 7.1ᵃ, and 6.395ᵇ,ᶜ, respectively, were consistently predicted correctly by both the NB and RF classifiers, although they violate one of the three criteria of Lipinski’s rule of five⁴⁹ (computed logP, H-bond donors, H-bond acceptors) chosen for our investigation. As seen in the previous chapter, the logP of compounds with a large number of rotatable bonds is difficult to predict. This may apply to these three compounds, which are very flexible: triparanol, tamoxifen, and AY-9944 have 10ᵃ, 8ᵃ, and 8ᵃ rotatable bonds, respectively. The same seems to hold for netilmicin, which has eightᵃ rotatable bonds and falls outside all three of the property ranges analysed.

The accuracies of the training set models lie between 61% and 76%, which indicates that the models are reasonably robust. The data set contains unequal numbers of active and inactive compounds, so the classes are unbalanced. According to Lowe et al.¹, accuracy alone is not sufficient to assess prediction quality when the classes are unbalanced, and the MCC⁴⁸ is helpful for estimating it. In theory, the MCC⁴⁸ lies between -1 and 1, where -1 corresponds to perfect anticorrelation, zero to the equivalent of random guessing, and 1 to perfect correlation. The calculated MCC⁴⁸ values ranged from 0.24 to 0.68. A value close to zero would indicate an uninformative prediction; all the MCC⁴⁸ values obtained here were above zero, implying that there is no anticorrelation and no uninformative prediction, but a fairly reliable correlation. Referring to the MCC⁴⁸ values listed in Table 4.13, the best average values of 0.57 and 0.47 were obtained for the models generated by applying the RF and NB classifiers, respectively, to sets of descriptors calculated with the SES. The RF classification thus improves the MCC⁴⁸ by ≈ 0.1 compared to NB. RF provides the best prediction quality when applied to descriptors generated with the SES, with an average MCC⁴⁸ of 0.57, which is close to the values obtained by Lowe et al.¹ (0.532 with their E-Dragon descriptor set and 0.539 for their combination of descriptors).
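For reference, the MCC discussed above is usually written in terms of the four entries of a binary confusion matrix. The symbols below are the conventional ones and are an assumption of this sketch, since the chapter does not define them explicitly: TP and TN are the numbers of correctly predicted inducers and non-inducers, FP is the number of non-inducers predicted as inducers, and FN is the number of inducers predicted as non-inducers.

```latex
\[
\mathrm{MCC} =
\frac{TP \cdot TN - FP \cdot FN}
     {\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
\]
```

With this definition, MCC = 1 requires FP = FN = 0, MCC = -1 requires TP = TN = 0, and a value of zero is obtained when TP · TN = FP · FN, which is the behaviour expected from random guessing.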

In Figure 4.19, the percentages above 35% obtained for class C12 by the two classifiers suggest a certain reliability of NB and RF in predicting compounds that can induce PPL. There were 25 cases (≈ 36%) for RF and 31 cases (≈ 45%) for NB in which compounds were predicted correctly in all 12 runs, and several cases in which compounds were predicted incorrectly. The large differences in percentage between the prediction classes lead to the hypothesis that certain molecules are particularly difficult to predict; the compounds of classes C0, C1, C2, C4, and C5 are the ones concerned. The most consistently well-predicted compounds in our randomly chosen test set and the most frequently misclassified compounds are listed in Table 4.14. Methadone, for example, was misclassified in every run by both the NB and RF algorithms, and doxapram in at least 11 of the 12 runs.
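The prediction classes C0 to C12 used in this discussion are simply a tally of how many of the 12 runs classified each compound correctly. A minimal sketch of that bookkeeping is given below; `class_percentages` is a hypothetical helper, and the list `example` contains made-up counts for eight compounds, not data from this work. Only the last two lines use figures quoted in the text (25 and 31 compounds out of 69).

```python
from collections import Counter


def class_percentages(correct_counts, n_runs=12):
    """Percentage of compounds falling into each class C0..C<n_runs>.

    correct_counts holds, for every compound, the number of runs in which
    it was classified correctly.
    """
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    tally = Counter(correct_counts)
    total = len(correct_counts)
    return {f"C{k}": 100.0 * tally.get(k, 0) / total for k in range(n_runs + 1)}


# Hypothetical counts for eight compounds (illustration only).
example = [12, 12, 12, 0, 7, 11, 12, 1]
distribution = class_percentages(example)
print(distribution["C12"], distribution["C0"])

# Shares of the 69 test-set compounds in class C12, as quoted in the text.
share_rf = 100.0 * 25 / 69
share_nb = 100.0 * 31 / 69
print(round(share_rf), round(share_nb))
```

Classes that never occur, such as C3 in the example, are reported with a percentage of zero, so the two classifiers can be compared class by class.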

Figures 4.20 and 4.21 give more details about the abilities of RF and NB in predicting active and inactive compounds, using the created set of 125 descriptors.


 

Figure 4.20. Histogram of active compounds correctly predicted by NB and RF classifiers.

For active compounds, RF predicted better than NB for C1, C10, and C11. In contrast, NB was better than RF for C6, C9, and C12, and the two were equal for C0, C2, C4, C7, and C8. There is a large gap between the percentage for C12 (more than 30%) and those for the other prediction classes, which range between 3% and 6%; this implies that the two classifiers predict most active compounds correctly and consistently. Here, there are ten prediction classes for both RF and NB.

 

Figure 4.21. Histogram of inactive compounds correctly predicted by NB and RF classifiers.


RF predicted inactive compounds better than NB for C4, C5, C8, and C9, and equally well for C7, C10, and C11. In contrast, NB predicted better than RF for C0, C1, C2, and C12. Here, the percentages of the different classes vary, with class C11 dominating, followed by class C12. In contrast to Figure 4.20, there is no very large gap between the percentages of the different classes. Another important point is that there are 11 prediction classes for RF and only seven for NB.

At first glance, the major observation provided by Figures 4.19, 4.20, and 4.21 is the constant dominance of NB over RF for class C12. From this, NB would seem to predict better than RF, which contrasts with the information provided in Table 4.13. However, these figures contain other prediction classes in which the percentages of correctly predicted compounds differ. The differences between the prediction percentages of NB and RF for class C12 are therefore offset by the other classes, which is consistent with the previously computed MCC⁴⁸ values and supports the hypothesis that the prediction quality of RF is better than that of NB. For active compounds, which are more numerous (37) than inactive compounds (32), NB and RF both provide 10 prediction classes. In contrast, for inactive compounds, RF gives 11 prediction classes and NB seven. This permits us to hypothesize that NB predicts in relation to the proportion of elements present in a class. The 69 compounds used in our test set are dissimilar. No particular correlation could be found between the size of the data set (i.e., the number of chemicals) and the percentage of correctly predicted compounds. We therefore suggest that the percentage of test-set compounds predicted correctly is connected to the diversity of the chemicals tested. The use of dissimilar compounds provides a good test for the algorithm. In general, structures that have a positive assay result for PPL induction are over-represented in our test set; models trained on more active than inactive compounds can therefore be expected to predict actives better than inactives.

The best predictivity (≈ 84%) was obtained by performing an RF classification on a set of descriptors computed with the SES, and the corresponding confusion matrix is well balanced:

  a  b   ← classified as
 25  7 | a = -1
  4 33 | b = 1

Only seven compounds are incorrectly predicted as actives and only four are wrongly predicted as inactives.

Except for PM6, which with RF gives a constant accuracy of 77% with both the iso and the SES, all the models generated are surface-dependent for both the NB and RF algorithms. This specific behaviour could be due to the atomic parameters of PM6. However, the MCC⁴⁸ values obtained when applying the RF classification to the descriptor sets generated with the iso and the SES are 0.55 and 0.54, respectively. This difference of ≈ 0.01 is probably related to the difference between the two descriptor sets in predicting compounds as inducers or non-inducers of PPL, which leads to different predictivities for negative and positive compounds.

Models presenting good similarities in their confusion matrices are those in which there is a small difference between the predictive powers for compounds classified as inducers and as non-inducers of PPL. Most of these models are generated by performing the RF classification on sets of descriptors calculated with the SES. It clearly appears that the use of ML algorithms with high-quality data or descriptor sets can lead to improved predictions. According to Svetnik et al.⁵⁰, RF performs significantly better than linear trees and, among others, provides the most appropriate approach for similar classification problems (no need for descriptor or variable selection), thereby itself providing the specificity needed to select appropriate descriptors. Many studies have been conducted on the prediction of PPL induction with different data sets. Could the use of a larger data set produce better predictivity, as mentioned by Lowe et al.¹? In this work, we used the standard 10-fold cross-validation available in Weka. Increasing the number of cross-validation folds may have a significant effect on predictivity.
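To make the cross-validation step concrete, the sketch below shows one simple way of building stratified k-fold splits in plain Python. It is an illustration of the general technique, not Weka's implementation, whose internal shuffling and stratification may differ; `stratified_folds` and `train_test_splits` are hypothetical names, and the label list merely reuses the class sizes of the test set (37 actives, 32 inactives) as example input.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds,
        # continuing where the previous class stopped to keep fold sizes even.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for i, test_indices in enumerate(folds):
        train_indices = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_indices, test_indices


# Example labels: 1 = inducer, -1 = non-inducer (class sizes used for illustration).
labels = [1] * 37 + [-1] * 32
splits = list(train_test_splits(labels, k=10))
print([len(test) for _, test in splits])
```

Changing `k` in this sketch corresponds to changing the number of folds; each compound is used exactly once for testing regardless of the value chosen.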

4.5 Conclusions

We presented a comparison of two ML techniques for the prediction of drug-induced PPL, based on the calculated ParaSurf descriptors and the logPow. Judging by the accuracies and the calculated MCC⁴⁸ values, the performance obtained with the descriptors in our data set supports the hypothesis that ParaSurf descriptors and the logPow contain sufficient information for ML algorithms to construct accurate and simple predictive models that determine whether a given molecule is active or inactive. Although the RF algorithm does not require an extra step for the selection of appropriate descriptors, its application to descriptors calculated with the SES generally leads to robust models that, when evaluated on their respective test sets, yield not only good predictive powers, with overall accuracies generally greater than or equal to 75%, but also confusion matrices presenting good similarities. With RF, the use of the SES improves the predictive power of the models and corrects the similarity problem of the confusion matrix. RF produces the best models, with an average MCC⁴⁸ of 0.57 in a 10-fold cross-validation. We obtained different accuracies for the different models, suggesting that some individual molecules might be hard to predict with some Hamiltonians. Furthermore, NB yields models that, when evaluated on their respective test sets, are entirely surface-dependent and predict more compounds correctly in all runs than RF does. Given its ability to generate entirely surface-dependent classification models, the NB classifier can be regarded as a van der Waals surface-dependent classifier. One of the points motivating the use of the Bayesian approach is that, due to the penalization of complex models⁵¹, overfitting and overtraining problems are considerably reduced.
For a specific class (active or inactive), RF seems able to predict regardless of the distribution of elements within the class, unlike NB, which tends to predict relative to the proportion of items in the class. Because the set of descriptors created here provides enough chemical information to generate sufficiently robust models, we can say with some conviction that this work provides a new approach for predicting drugs that induce PPL. The scientific contribution of the results presented is their possible use as an indicator to study the performance of classification methods in real systems relevant to drug discovery. From our comparative study of the NB and RF algorithms, it appears that RF is the more appropriate classifier for some QSAR studies.


4.6 References

1. Lowe, R.; Glen, R. C.; Mitchell, J. B. O. Predicting Phospholipidosis Using Machine Learning. Mol. Pharmaceutics 2010, 7 (5), 1708-1714.

2. Anderson, N.; Borlak, J. Drug-induced phospholipidosis. FEBS Lett. 2006, 580, 5533-5540.

3. Abe, A.; Hiraoka, M.; Shayman, J. A. A role for lysosomal phospholipase A2 in drug induced phospholipidosis. Drug. Metab. Lett. 2007, 1, 49-53.

4. Vitovic, P.; Alakoskela, J. M.; Kinnunen, P. K. J. Assessment of drug-lipid complex formation by a high-throughput Langmuir balance and correlation to phospholipidosis. J. Med. Chem. 2008, 51, 1842-1848.

5. Ohlstein, E. H.; Ruffolo, R. R.; Elliot, J. D. Drug discovery in the next millennium. Ann. Rev. Pharmacol. Toxicol. 2000, 40, 177-191.

6. Rodrigues, A. D. Preclinical in the age of high-throughput screening. Pharm. Res. 1997, 14, 1504-1510.

7. Navia, M. A.; Chaturvedi, P. R. Design principles for orally bioavailable drugs. Drug. Disc. Today. 1996, 1, 179-189.

8. Gaviraghi, G.; Barnaby, R. J.; Pellegatti, M. Pharmaco-kinetic challenges in lead optimization. In Pharmacokinetic Optimization in Drug Research., Testa, B.; van de Waterbeemd, H.; Folkers, G.; Guy, R., Eds. Wiley-VCH: Basel, Switzerland, 2001; pp. 1-14.

9. Kennedy, T. Managing the drug discovery/development interface. Drug. Disc. Today. 1997, 2, 436-444.

10. Pelletier, D. J.; Gehlhaar, D.; Tilloy-Ellul, A.; Johnson, T. O.; Greene, N. Evaluation of a published in silico model and construction of a novel Bayesian model for predicting phospholipidosis inducing potential. J. Chem. Inf. Model. 2007, 47, 1196-1205.

11. Ploemen, J. P.; Kelder, J.; Hafmans, T.; van de Sandt, H.; van Burgsteden, J. A.; Saleminki, P. J.; van Esch, E. Use of physicochemical calculation of pKa and ClogP to predict phospholipidosis-inducing potential: A case study with structural related piperazines. Exp. Toxicol. Pathol. 2004, 55, 347-355.

12. Sawada, H.; Takami, K.; Asahi, S. A toxicogenomic approach to drug-induced phospholipidosis: Analysis of its induction mechanism and establishment of a novel in vitro screening system. Toxicol. Sci. 2005, 83, 282-292.

13. Atienzar, F.; Gerets, H.; Dufrane, S.; Tilmant, K.; Cornet, M.; Dhalluin, S.; Ruty, B.; Rose, G.; Canning, M. Determination of phospholipidosis potential based on gene expression analysis in HepG2 cells. Toxicol. Sci. 2007, 96, 101-114.

14. Kasahara, T.; Tomita, K.; Murano, H.; Harada, T.; Tsubakimoto, K.; Ogihara, T.; Ohnishi, S.; Kakinuma, C. Establishment of an in vitro high-throughput screening assay for detecting phospholipidosis-inducing potential. Toxicol. Sci. 2006, 90, 133-141.

15. Ivanciuc, O. Weka machine learning for predicting the phospholipidosis inducing potential. Curr. Top. Med. Chem. 2008, 8, 1691-1709.


16. Tomizawa, K.; Sugano, K.; Yamada, H.; Horii, I. Physicochemical and cell-based approach for early screening of phospholipidosis-inducing potential. J. Toxicol. Sci. 2006, 31, 315-324.

17. Hanumegowda, U. M.; Wenke, G.; Regueiro-Ren, A.; Yordanova, R.; Corradi, J. P.; Adams, S. P. Phospholipidosis as a function of Basicity, Lipophilicity, and Volume of Distribution of Compounds. Chem. Res. Toxicol. 2010, 23, 749-755.

18. Nioi, P.; Perry, B. K.; Wang, E. J.; Gu, Y. Z.; Snyder, R. D. In vitro detection of drug-induced phospholipidosis using gene expression and fluorescent phospholipid based methodologies. Toxicol. Sci. 2007, 99 (1), 162-173.

19. Hansch, C.; Fujita, T. ρ-σ-π analysis. A method for the correlation of biological activity and chemical structure. J. Am. Chem. Soc. 1964, 86, 1616-1626.

20. Fujita, T.; Iwasa, J.; Hansch, C. A new substituent constant, π, derived from partition coefficients. J. Am. Chem. Soc. 1964, 86, 5175-5180.

21. Bonchev, D.; Rouvray, D. H. Chemical Graph Theory. Introduction and Fundamentals. Abacus Press/Gordon & Breach Science Publishers: New York, 1991.

22. Trinajstic, N. Chemical Graph Theory. CRC Press: Boca Raton, FL, 1992.

23. Ivanciuc, O. Graph Theory in chemistry. In Handbook of Chemoinformatics, Gasteiger, J., Ed. Wiley-VCH: Weinheim, 2003; Vol. 1, pp 103-138.

24. Bonchev, D. Information Theoretic Indices for Characterization of Chemical Structure. Research Studies Press: Chichester, UK, 1983.

25. Balaban, A. T.; Ivanciuc, O. Historical development of topological indices. In Topological Indices and Related Descriptors in QSAR and QSPR, Devillers, J.; Balaban, A. T., Eds. Gordon and Breach Science Publishers: Amsterdam, 1999; pp 21-57.

26. Ivanciuc, O. Topological indices. In Handbook of Chemoinformatics, Gasteiger, J., Ed. Wiley-VCH: Weinheim, 2003; Vol. 3, pp 981-1003.

27. Kier, L. B.; Hall, L. H. Molecular Connectivity in Chemistry and Drug Research. Academic Press: New York, 1976.

28. Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis. Research Studies Press: Letchworth, 1986.

29. Kier, L. B.; Hall, L. H. Molecular Structure Description. The Electrotopological State. Academic Press: San Diego, 1999.

30. Ivanciuc, O. Electrotopological state indices. In Molecular Drug Properties. Measurement and Prediction, Mannhold, R., Ed. Wiley-VCH: Weinheim, 2008; pp 85-109.

31. Todeschini, R.; Consonni, V. Descriptors from molecular geometry. In Handbook of Chemoinformatics, Gasteiger, J., Ed. Wiley-VCH: Weinheim, 2003; Vol. 3, pp 1004-1033.

32. Jurs, P. Quantitative structure-property relationships. In Handbook of Chemoinformatics, Gasteiger, J., Ed. Wiley-VCH: Weinheim, 2003; Vol. 3, pp 1314-1335.


33. Ivanciuc, O. 3D QSAR models. In QSPR/QSAR Studies by Molecular Descriptors, Diudea, M. V., Ed. Nova Science Publishers: Huntington, NY, 2001; pp 233-280.

34. Zhang, S.; Golbraikh, A.; Oloff, S.; Kohn, H.; Tropsha, A. A novel automated lazy learning QSAR (ALL-QSAR) approach: Method development, applications, and virtual screening of chemical databases using validated ALL-QSAR models. J. Chem. Inf. Model. 2006, 46, 1984-1995.

35. Coulombe, P.A.; Kan, F. W.; Bendayon, M. Introduction of a high-resolution cytochemical method for studying the distribution of phospholipidosis in biological tissues. European Journal of Cell Biology 1988, 46 (3), 564-576.

36. Weininger, D. SMILES, a Chemical Language and Information System. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31-36.

37. CORINA 3D Structure Generator; Molecular Networks, GmbH: Erlangen, Germany, 2006.

38. Sadowski, J.; Gasteiger, J.; Klebe, G. Comparison of Automatic Three-Dimensional Model Builders using 639 X-Ray structures. J. Chem. Inf. Comput. Sci. 1994, 34, 1000-1008.

39. Clark, T.; Alex, A.; Beck, A.; Burkhardt, F.; Chandrasekhar, J.; Gedeck, P.; Horn, A. H. C.; Hutter, M.; Martin, B.; Rauhut, G.; Sauer, W.; Schindler, T.; Steinke, T. VAMP 8.2; accelrys Inc.: Erlangen: San Diego, USA, 2002.

40. ParaSurf10, CEPOS InSilico Ltd.: Erlangen, Germany, 2010.

41. Meyer, A. Y. The size of molecules. Chem. Soc. Rev. 1985, 15, 449-475.

42. Pan, Q.; Tai, X.-C. Model the Solvent-Excluded Surface of 3D Protein Molecular Structures Using Geometric PDE-Based Level-Set Method. Commun. Comput. Phys. 2009, 6, 777-792.

43. Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann: San Francisco, 2005; p 525.

44. Reasor, M. J.; Kacew, S. Drug-induced phospholipidosis: Are there functional consequences? Exp. Biol. Med. 2001, 226, 825-830.

45. Frank, E.; Trigg, L.; Holmes, G.; Witten, I. H. Naïve Bayes for regression. Mach. Learn. 2000, 41 (1), 5-15.

46. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5-32.

47. Chatman, L. A.; Morton, D.; Johnson, T. O.; Anway, S. D. A Strategy for Risk Management of Drug-Induced Phospholipidosis. Toxicol. Pathol. 2009, 37 (7).

48. Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta. 1975, 405, 442-451.

49. Leo, A.; Hansch, C.; Elkins, D. Partition Coefficients and their uses. Chem. Rev. 1971, 71 (6), 525-616.


50. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958.

51. Sorich, M. J.; Miners, J. O.; McKinnon, R. A.; Winkler, D. A.; Burden, F. R.; Smith, P. A. Comparison of Linear and Nonlinear Classification Algorithms for the Prediction of Drug and Chemical Metabolism by Human UDP-Glucuronosyltransferase Isoforms. J. Chem. Inf. Comput. Sci. 2003, 43, 2019-2024.


Appendix


Table A1. Component terms used in the SIM Number Term 1 V (r) 2 V (r) 3 3 []V (r) 2 4 []V (r) 2 5 5 []V (r) 2 6 []V (r) 3 7 IEL (r)

8 IEL (r) 3 9 []2 IEL (r) 10 []2 IEL (r) 5 11 []2 IEL (r) 12 []3 IEL (r) 13 EAL (r)

14 EAL (r) 3 15 []2 EAL (r) 16 []2 EAL (r) 5 17 []2 EAL (r) 18 []3 EAL (r) 19 α L (r) α 20 L (r) 3 21 []α 2 L (r) 22 []α 2 L (r) 5 23 []α 2 L (r) 24 []α 3 L (r) 25 η L (r) η 26 L (r) 3 27 []η 2 L (r) 28 []η 2 L (r) 5 29 []η 2 L (r) 30 []η 3 L (r) 31 V (r).IEL (r)


32 V (r).IEL (r) 3 33 []2 V (r).IEL (r) 34 []2 V (r).IEL (r) 5 35 []2 V (r).IEL (r) 36 []3 V (r).IEL (r) 37 V (r).EAL (r)

38 V (r).EAL (r) 3 39 []2 V (r).EAL (r) 40 []2 V (r).EAL (r) 5 41 []2 V (r).EAL (r) 42 []3 V (r).EAL (r) 43 α V (r). L (r) α 44 V (r). L (r) 3 45 []α 2 V (r). L (r) 46 []α 2 V (r). L (r) 5 47 []α 2 V (r). L (r) 48 []α 3 V (r). L (r) 49 η V (r). L (r) η 50 V (r). L (r) 3 51 []η 2 V (r). L (r) 52 []η 2 V (r). L (r) 5 53 []η 2 V (r). L (r) 54 []η 3 V (r). L (r) 55 IEL (r).EAL (r)

56 IEL (r).EAL (r) 3 57 []2 IEL (r).EAL (r) 58 []2 IE L (r).EAL (r) 5 59 []2 IEL (r).EAL (r) 60 []3 IE L (r).EAL (r) 61 α IEL (r). L (r) α 62 IEL (r). L (r)


3 63 []α 2 IEL (r). L (r) 64 []α 2 IEL (r). L (r) 5 65 []α 2 IEL (r). L (r) 66 []α 3 IEL (r). L (r) 67 η IEL (r). L (r) η 68 IEL (r). L (r) 3 69 []η 2 IEL (r). L (r) 70 []η 2 IEL (r). L (r) 5 71 []η 2 IEL (r). L (r) 72 []η 3 IEL (r). L (r) 73 α EAL (r). L (r) α 74 EAL (r). L (r) 3 75 []α 2 EAL (r). L (r) 76 []α 2 EAL (r). L (r) 5 77 []α 2 EAL (r). L (r) 78 []α 3 EAL (r). L (r) 79 η EAL (r). L (r) η 80 EAL (r). L (r) 3 81 []η 2 EAL (r). L (r) 82 []η 2 EAL (r). L (r) 5 83 []η 2 EAL (r). L (r) 84 []η 3 EAL (r). L (r) 85 α η L (r). L (r) α η 86 L (r). L (r) 3 87 []α η 2 L (r). L (r) 88 []α η 2 L (r). L (r) 5 89 []α η 2 L (r). L (r) 90 []α η 3 L (r). L (r) 91 V (r).IEL (r).EAL (r)

92 V (r).IEL (r).EAL (r) 3 93 []2 V (r).IEL (r).EAL (r) 94 []2 V (r).IE L (r).EAL (r)


5 95 []2 V (r).IEl (r).EAL (r) 96 []3 V (r).IE L (r).EAL (r) 97 α V (r).IEL (r). L (r) α 98 V (r).IEL (r). L (r) 3 99 []α 2 V (r).IEL (r). L (r) 100 []α 2 V (r).IE L (r). L (r) 5 101 []α 2 V (r).IEL (r). L (r) 102 []α 3 V (r).IEL (r). L (r) 103 η V (r).IEL (r). L (r) η 104 V (r).IEL (r). L (r) 3 105 []η 2 V (r).IEL (r). L (r) 106 []η 2 V (r).IEL (r). L (r) 5 107 []η 2 V (r).IEL (r). L (r) 108 []η 3 V (r).IEL (r). L (r) 109 α V (r).EAL (r). L (r) α 110 V (r).EAL (r). L (r) 3 111 []α 2 V (r).EAL (r). L (r) 112 []α 2 V (r).EAL (r). L (r) 5 113 []α 2 V (r).EAL (r). L (r) 114 []α 3 V (r).EAL (r). L (r) 115 η V (r).EAL (r). L (r) η 116 V (r).EAL (r). L (r) 3 117 []η 2 V (r).EAL (r). L (r) 118 []η 2 V (r).EAL (r). L (r) 5 119 []η 2 V (r).EAL (r). L (r) 120 []η 3 V (r).EAL (r). L (r) 121 α IEL (r).EAL (r). L (r) α 122 IEL (r).EAL (r). L (r) 3 123 []α 2 IEL (r).EAL (r). L (r) 124 []α 2 IE L (r).EAL (r). L (r) 5 125 []α 2 IEL (r).EAL (r). L (r)


126 []α 3 IE L (r).EAL (r). L (r) 127 η IEL (r).EAL (r). L (r) η 128 IEL (r).EAL (r). L (r) 3 129 []η 2 IEL (r).EAL (r). L (r) 130 []η 2 IEL (r).EAL (r). L (r) 5 131 []η 2 IEL (r).EAL (r). L (r) 132 []η 3 IEL (r).EAL (r). L (r) 133 α η IEL (r). L (r). L (r) α η 134 IEL (r). L (r). L (r) 3 135 []α η 2 IEL (r). L (r). L (r) 136 []α η 2 IEL (r). L (r). L (r) 5 137 []α η 2 IEL (r). L (r). L (r) 138 []α η 3 IEL (r). L (r). L (r) 139 α η EAL (r). L (r). L (r) α η 140 EAL (r). L (r). L (r) 3 141 []α η 2 EAL (r). L (r). L (r) 142 []α η 2 EAL (r). L (r). L (r) 5 143 []α η 2 EAL (r). L (r). L (r) 144 []α η 3 EAL (r). L (r). L (r) 145 α η V (r).EAL (r). L (r). L (r) α η 146 V (r).EAL (r). L (r). L (r) 3 147 []α η 2 V (r).EAL (r). L (r). L (r) 148 []α η 2 V (r).EAL (r). L (r). L (r) 5 149 []α η 2 V (r).EAL (r). L (r). L (r) 150 []α η 3 V (r).EAL (r). L (r). L (r)


Δ −1 Table A2. Gsolv (H 2O) (kcal mol ) for the entire data set calculated with AM1 and the iso No Compound Exp Calc 1 Methane 2 0.8524 2 Ethane 1.83 0.981 3 1.96 1.259 4 Cyclopropane 0.75 0.955 5 2-Methylpropane 2.32 1.645 6 2,2-Dimethylpropane 2.5 2.079 7 n-Butane 2.08 1.521 8 2,2-Dimethylbutane 2.59 2.252 9 Cyclopentane 1.2 0.801 10 n-Pentane 2.33 1.747 11 2-Methylpentane 2.52 2.098 12 3-Methylpentane 2.51 2.059 13 2,4-Dimethylpentane 2.88 2.482 14 2,2,4-Trimethylpentane 2.85 2.854 15 Methylcyclopentane 1.6 1.334 16 n-Hexane 2.49 1.997 17 Cyclohexane 1.23 1.678 18 Methylcyclohexane 1.71 2.04 cis-1,2- 19 Dimethylcyclohexane 1.58 2.305 20 n-Heptane 2.62 2.228 21 n-Octane 2.89 2.46 22 Ethylene 1.27 0.6656 23 Propylene 1.27 0.779 24 2-Methylpropene 1.16 0.823 25 1-Butene 1.38 0.986 26 2-Methyl-2-butene 1.31 0.966 27 3-Methyl-1-butene 1.83 1.416 28 1-Pentene 1.66 1.254 29 trans-2-Pentene 1.34 1.125 30 4-Methyl-1-pentene 1.91 1.732 31 Cyclopentene 0.56 0.1955 32 1-Hexene 1.68 1.506 33 Cyclohexene 0.37 0.648 34 trans-2-Heptene 1.66 1.62 35 1-Methylcyclohexene 0.67 0.756 36 1-Octene 2.17 1.971 37 1,3-Butadiene 0.61 0.252 38 2-Methyl-1,3-butadiene 0.68 0.402 2,3-Dimethyl-1,3- 39 butadiene 0.4 0.505 40 1,4-Pentadiene 0.94 1.031 41 1,5-Hexadiene 1.01 1.153 42 Acetylene -0.01 0.283 43 Propyne -0.31 0.016

121

Appendix

44 1-Butyne -0.16 0.298 45 1-Pentyne 0.01 0.567 46 1-Hexyne 0.29 0.765 47 1-Heptyne 0.6 0.977 48 1-Octyne 0.71 1.179 49 1-Nonyne 1.05 1.373 50 Butenyne 0.04 -0.296 51 Benzene -0.87 -1.27206 52 Toluene -0.89 -1.0833 53 1,2,4-Trimethylbenzene -0.86 -0.8755 54 Ethylbenzene -0.8 -0.5145 55 m-Xylene -0.84 -0.9824 56 o-Xylene -0.9 -1.0254 57 p-Xylene -0.8 -0.9412 58 Propylbenzene -0.53 -0.282 59 Butylbenzene -0.4 0.041 60 t-Butylbenzene -0.44 -0.043 61 t-Amylbenzene -0.18 0.152 62 Naphtalene -2.39 -2.8586 63 Anthracene -4.23 -4.313 64 Phenanthrene -3.95 -4.188 65 Acenaphtene -3.15 -3.1016 66 p-Chlorotoluene -1.92 -2.3758 67 Fluoromethane -0.22 -0.543 68 1,1-Difluoroethane -0.11 -0.232 69 Trifluoromethane -0.81 -0.076 70 Tetrafluoromethane 3.11 2.384 71 3.94 3.046 72 Octafluoropropane 4.28 5.498 73 Fluorobenzene -0.78 -1.9676 2-Chloro-1,1,1- 74 Trifluoroethane 0.05 -0.723 75 Chlorofluoromethane -0.77 -0.366 76 Chlorodifluoromethane -0.5 0.034 77 Chlorotrifluoromethane 2.52 0.866 78 Dichlorodifluoromethane 1.69 1.491 79 Fluorotrichloromethane 0.82 0.10197 1,1,2-Trichloro-1,2,2- 80 trifluoroethane 1.77 -0.2529 1,1,2,2- 81 Tetrachlorodifluoroethane 0.82 1.488 82 Chloropentafluoroethane 2.87 3.591 1,1- 83 Dichlorotetrafluoroethane 2.51 2.085 1,2- 84 Dichlorotetrafluoroethane 2.32 0.5602 85 1-Bromo-1-chloro-2,2,2- -0.13 -1.5307

122

Appendix

trifluoroethane 86 Bromotrifluoromethane 1.79 -2.442 1-Bromo-1,2,2,2- 87 tetrafluoroethane 0.52 -1.0819 88 Chloromethane -0.56 -0.8307 89 Dichloromethane -1.41 -0.5013 90 Trichloromethane -1.07 -0.0448 91 Tetrachloromethane 0.1 -1.031 92 Chloroethane -0.63 -0.7603 93 1,1-Dichloroethane -0.85 -0.4106 94 (E)-1,2-Dichloroethane -1.73 -0.9465 95 1,1,1-Trichloroethane -0.25 -0.0542 96 1,1,2-Trichloroethane -1.95 -0.3544 97 1,1,1,2-Tetrachloroethane -1.15 0.653 98 1,1,2,2-Tetrachloroethane -2.36 1.392 99 Pentachloroethane -1.36 0.2649 100 Hexachloroethane -1.41 -0.4145 101 1-Chloropropane -0.27 -0.8321 102 2-Chloropropane -0.25 0.048 103 1,2-Dichloropropane -1.25 -0.489 104 1,3-Dichloropropane -1.9 -1.4707 105 1-Chlorobutane -0.14 -0.7407 106 2-Chlorobutane 0.07 0.394 107 1,1-Dichlorobutane -0.7 -0.4029 108 1-Chloropentane -0.07 -0.6889 109 2-Chloropentane 0.07 0.58 110 3-Chloropentane 0.04 0.467 111 Chloroethylene -0.59 -1.0348 112 cis-1,2-Dichloroethylene -1.17 -1.2622 trans-1,2- 113 Dichloroethylene -0.76 -1.3745 114 Trichloroethylene -0.44 -1.3712 115 Tetrachloroethylene 0.05 -1.67 116 Chlorobenzene -1.12 -2.4015 117 o-Chlorotoluene -1.15 -1.47909 118 1,2-Dichlorobenzene -1.36 -3.163 119 1,3-Dichlorobenzene -0.98 -3.315 120 1,4-Dichlorobenzene -1.01 -3.829 121 2,2'-Dichlorobiphenyl -2.73 -1.8674 122 2,3-Dichlorobiphenyl -2.45 -3.2653 123 2,2,3'-Trichlorobiphenyl -1.99 -2.6484 124 Bromotrichloromethane -0.93 -0.5003 125 1-Chloro-2-bromoethane -1.95 -1.06448 126 Bromomethane -0.82 -1.4264 127 Dibromomethane -2.11 -3.307 128 Tribromomethane -2.13 -2.334 129 Bromoethane -0.7 -0.9601

130 1,2-Dibromoethane -2.1 -2.638
131 1-Bromopropane -0.56 -1.0035
132 2-Bromopropane -0.48 -1.109
133 1,2-Dibromopropane -1.94 -0.06
134 1,3-Dibromopropane -1.96 -1.658
135 1-Bromo-2-methylpropane -0.03 -0.3031
136 1-Bromobutane -0.41 -0.8639
137 1-Bromoisobutane -0.03 0.353
138 1-Bromo-3-methylbutane 0.21 -0.8331
139 1-Bromopentane -0.08 -0.77535
140 3-Bromopropene -0.86 -1.2995
141 Bromobenzene -1.46 -2.0111
142 1,4-Dibromobenzene -2.3 -1.3769
143 p-Bromotoluene -1.39 -1.9557
144 1-Bromo-2-ethylbenzene -1.19 -0.296
145 o-Bromocumene -0.85 -1.6731
146 Methanol -5.11 -4.736
147 Ethanol -5.01 -4.786
148 Ethylene glycol -9.3 -6.5
149 1-Propanol -4.83 -4.448
150 2-Propanol -4.76 -4.731
151 1,1,1-Trifluoro-2-propanol -4.16 -3.172
152 2,2,3,3-Tetrafluoropropanol -4.88 -2.869
153 2,2,3,3,3-Pentafluoropropanol -4.15 -4.318
154 Hexafluoro-2-propanol -3.76 -1.554
155 2-Methyl-1-propanol -4.52 -3.636
156 1-Butanol -4.72 -4.221
157 2-Butanol -4.58 -2.0371
158 t-Butyl alcohol -4.51 -4.69
159 2-Methyl-1-butanol -4.42 -1.9162
160 3-Methyl-1-butanol -4.42 -3.829
161 2-Methyl-2-butanol -4.43 -2.553
162 2,3-Dimethyl-1-butanol -3.91 -2.762
163 1-Pentanol -4.47 -4.036
164 2-Pentanol -4.39 -2.618
165 3-Pentanol -4.35 -3.478
166 2-Methyl-1-pentanol -3.93 -2.876
167 2-Methyl-2-pentanol -3.93 -2.1991
168 2-Methyl-3-pentanol -3.89 -2.548
169 4-Methyl-2-pentanol -3.74 -2.2558
170 Cyclopentanol -5.49 -4.408
171 1-Hexanol -4.36 -3.762
172 3-Hexanol -4.07 -3.066

173 Cyclohexanol -5.48 -4.05
174 4-Heptanol -4.01 -2.889
175 Cycloheptanol -5.49 -3.682
176 1-Heptanol -4.24 -3.79
177 1-Octanol -4.09 -3.715
178 Allyl alcohol -5.03 -5.25
179 Phenol -6.62 -5.426
180 4-Bromophenol -7.13 -5.649
181 4-t-Butylphenol -5.92 -4.062
182 2-Cresol -5.87 -5.034
183 3-Cresol -5.49 -5.349
184 4-Cresol -6.14 -5.328
185 2,2,2-Trifluoroethanol -4.31 -5.638
186 p-Bromophenol -7.13 -5.649
187 2-Methoxyethanol -6.77 -5.193
188 Dimethoxymethane -2.93 -2.0455
189 Methyl propyl ether -1.66 -3.146
190 Methyl isopropyl ether -2.01 -2.1025
191 Methyl t-butyl ether -2.21 -1.0785
192 Diethyl ether -1.76 -3.473
193 Ethyl propyl ether -1.81 -3.185
194 Dipropyl ether -1.16 -2.874
195 Diisopropyl ether -0.53 -1.1811
196 Di-n-butyl ether -0.83 -2.411
197 Tetrahydrofuran -3.47 -3.974
198 2-Methyltetrahydrofuran -3.3 -3.599
199 Anisole -2.45 -3.789
200 Ethyl phenyl ether -4.28 -3.857
201 1,1-Diethoxyethane -3.27 -3.583
202 1,2-Dimethoxyethane -4.83 -3.667
203 1,2-Diethoxyethane -3.52 -3.334
204 1,3-Dioxolane -4.1 -3.522
205 1,4-Dioxane -5.05 -5.707
206 2,2,2-Trifluoroethyl vinyl ether -0.12 -0.26
207 1-Chloro-2,2,2-trifluoroethyl difluoromethyl ether 0.11 -1.247
208 Acetaldehyde -3.5 -3.8434
209 Propanal -3.44 -3.4101
210 Butanal -3.18 -3.731
211 Pentanal -3.03 -3.2441
212 Hexanal -2.81 -3.5882
213 Heptanal -2.67 -2.9576
214 Octanal -2.29 -3.3372
215 Nonanal -2.08 -2.7208
216 trans-2-Butenal -4.23 -4.946

217 trans-2-Hexenal -3.68 -4.271
218 trans-2-Octenal -3.44 -3.6682
219 trans,trans-2,4-Hexadienal -4.63 -5.096
220 Benzaldehyde -4.02 -4.703
221 m-Hydroxybenzaldehyde -9.51 -8.772
222 p-Hydroxybenzaldehyde -10.48 -9.164
223 Acetone -3.85 -4.258
224 2-Butanone -3.64 -4.269
225 3-Methyl-2-butanone -3.24 -3.992
226 3,3-Dimethylbutanone -2.89 -3.681
227 2-Pentanone -3.53 -3.763
228 3-Pentanone -3.41 -2.9221
229 4-Methyl-2-pentanone -3.06 -3.23
230 2,4-Dimethyl-3-pentanone -2.74 -2.22485
231 Cyclopentanone -4.68 -4.163
232 2-Hexanone -3.29 -3.6473
233 2-Heptanone -3.04 -3.2754
234 4-Heptanone -2.93 -2.7046
235 2-Octanone -2.88 -3.1327
236 2-Nonanone -2.49 -2.904
237 5-Nonanone -2.67 -4.155
238 2-Undecanone -2.16 -2.4185
239 Acetophenone -4.58 -5.172
240 Acetic acid -6.7 -6.535
241 Propionic acid -6.47 -6.747
242 Butyric acid -6.36 -6.021
243 Pentanoic acid -6.16 -6.02
244 Hexanoic acid -6.21 -5.607
245 4-Amino-3,5,6-trichloropyridine-2-carboxylic acid -11.96 -10.202
246 Methyl formate -2.78 -3.62848
247 Ethyl formate -2.65 -2.8942
248 Propyl formate -2.48 -2.6296
249 Methyl acetate -3.31 -3.40406
250 Isopropyl formate -2.02 -3.34648
251 Isobutyl formate -2.22 -2.8746
252 Isoamyl formate -2.13 -2.5209
253 Ethyl acetate -3.1 -3.43298
254 Propyl acetate -2.86 -4.0639
255 Isopropyl acetate -2.65 -3.9699
256 Butyl acetate -2.55 -3.7066
257 Isobutyl acetate -2.36 -3.2861
258 Amyl acetate -2.45 -3.5504
259 Isoamyl acetate -2.21 -3.5564

260 Hexyl acetate -2.26 -3.39161
261 Methyl propionate -2.93 -3.9529
262 Ethyl propionate -2.8 -3.8199
263 Propyl propionate -2.54 -3.4933
264 Isopropyl propionate -2.22 -3.3816
265 Amyl propionate -1.99 -2.95397
266 Methyl butyrate -2.83 -3.5128
267 Ethyl butyrate -2.5 -3.2304
268 Propyl butyrate -2.28 -3.0632
269 Methyl pentanoate -2.57 -3.6008
270 Ethyl pentanoate -2.52 -3.3553
271 Methyl hexanoate -2.49 -3.06611
272 Ethyl heptanoate -2.3 -3.01596
273 Methyl octanoate -2.04 -2.5765
274 Methyl benzoate -4.28 -7.585
275 Methylamine -4.56 -4.88
276 Ethylamine -4.5 -3.977
277 Propylamine -4.39 -3.957
278 Butylamine -4.29 -3.712
279 Pentylamine -4.1 -3.445
280 Hexylamine -4.03 -3.274
281 Dimethylamine -4.29 -3.78
282 Diethylamine -4.07 -3.312
283 Dipropylamine -3.66 -2.728
284 Dibutylamine -3.31 -2.507
285 Trimethylamine -3.23 -3.457
286 Triethylamine -3.03 -2.717
287 Azetidine -5.56 -3.865
288 Piperazine -7.38 -9.97
289 N,N'-Dimethylpiperazine -7.58 -8.016
290 N-Methylpiperazine -7.77 -6.872
291 Aniline -5.49 -6.56
292 1,1-Dimethyl-3-phenylurea -9.63 -10.28
293 N,N-Dimethylaniline -3.58 -6.278
294 Ethylenediamine -9.75 -9.839
295 Hydrazine -9.3 -8.355
296 2-Methoxy-1-ethanamine -6.55 -6.846
297 Morpholine -7.17 -6.925
298 N-Methylmorpholine -6.34 -5.822
299 N-Methylpyrrolidine -3.98 -3.829
300 N-Methylpiperidine -3.89 -3.572
301 Pyrrolidine -5.48 -3.111
302 Piperidine -5.11 -4.442
303 Pyridine -4.7 -4.119
304 2-Methylpyridine -4.63 -3.518
305 3-Methylpyridine -4.77 -3.562

306 4-Methylpyridine -4.94 -4.049
307 2-Ethylpyridine -4.33 -3.0412
308 3-Ethylpyridine -4.6 -3.34
309 4-Ethylpyridine -4.74 -3.832
310 2,3-Dimethylpyridine -4.83 -3.4115
311 2,4-Dimethylpyridine -4.86 -3.5107
312 2,5-Dimethylpyridine -4.72 -3.4217
313 2,6-Dimethylpyridine -4.6 -3.3403
314 3,4-Dimethylpyridine -5.22 -3.643
315 3,5-Dimethylpyridine -4.84 -3.943
316 2-Methylpyrazine -5.52 -5.398
317 2-Ethylpyrazine -5.46 -5.142
318 2-Isobutylpyrazine -5.05 -4.393
319 2-Ethyl-3-methoxypyrazine -4.4 -5.577
320 2-Isobutyl-3-methoxypyrazine -3.68 -2.9336
321 9-Methyladenine -13.6 -13.59
322 1-Methylthymine -10.4 -10.303
323 Methylimidazole -10.25 -8.597
324 N-Propylguanidine -10.92 -10.564
325 Acetonitrile -3.89 -2.2333
326 Propionitrile -3.85 -2.23622
327 Butyronitrile -3.64 -2.1477
328 Benzonitrile -4.1 -4.308
329 2,6-Dichlorobenzonitrile -5.22 -5.581
330 3,5-Dibromo-4-hydroxybenzonitrile -9 -4.716
331 N,N-Dimethylformamide -4.9 -5.3387
332 N-Methylformamide -10 -9.728
333 Acetamide -9.71 -10.934
334 (E)-N-Methylacetamide -10 -8.721
335 (Z)-N-Methylacetamide -10 -9.189
336 Propionamide -9.42 -10.428
337 Methanethiol -1.24 -1.889
338 Ethanethiol -1.3 -1.5487
339 1-Propanethiol -1.05 -1.4264
340 Thiophenol -2.55 -3.274
341 Thioanisole -2.73 -2.984
342 Dimethyl sulfide -1.54 -1.821
343 Diethyl sulfide -1.43 -1.143
344 Methyl ethyl sulfide -1.49 -1.4987
345 Dipropyl sulfide -1.27 -1.1159
346 2,2'-Dichlorodiethyl sulfide -3.92 -2.6346
347 Dimethyl disulfide -1.83 -3.355
348 Diethyl disulfide -1.63 -2.575

349 Trimethyl phosphate -8.7 -7.6882
350 Triethyl phosphate -7.8 -7.6335
351 Tripropyl phosphate -6.1 -5.865
352 2,2-Dichloroethenyl dimethyl phosphate -6.61 -5.229
353 Dimethyl-5-(4-chloro)-bicyclo[3.2.0]-heptyl phosphate -7.28 -5.993
354 o-Ethyl-o'-(4-bromo-2-chlorophenyl) S-propyl phosphorothioate -4.09 -6.316
355 Hydroquinone -10.77 -9.466
356 1,2,3-Trimethoxybenzene -5.4 -3.8749
357 1,2-Benzenediol -7.62 -8.649
358 1,3-Benzenediol -9.67 -9.133
359 o-Phenylenediamine -7.19 -10.124
360 m-Phenylenediamine -10.26 -12.101
361 2-Methylaniline -5.56 -6.495
362 N-Methylaniline -4.68 -5.918
363 Acetylene anion -73 -73.669
364 Protonated methanol -87 -83.049
365 Protonated dimethyl ether -70 -69.636
366 Protonated 2-propanol -64 -62.664
367 Oxonium ion -105 -104.865
368 Methanolate ion -98 -97.655
369 Formylate ion -77 -79.832
370 Dimethyl ether carbanion -81 -75.585
371 Phenolate ion -75 -77.695
372 Toluene carbanion -59 -59.715
373 Hydrogen anion -101 -98.014
374 Methyl ammonium ion -73 -71.676
375 Protonated N-Methylmethanamine -66 -64.57
376 Protonated N,N-Dimethylmethanamine -59 -58.822
377 Pyridinium ion -58 -60.615
378 Ammonium ion -81 -85.215
379 Acetonitrile carbanion -75 -73.352
380 Amide ion -95 -97.509
381 Azide ion -74 -76.391
382 Methylsulfonium ion -74 -72.051
383 Protonated dimethyl sulfide -61 -64.492
384 1-Propanethiolate anion -76 -76.615
385 Thiophenolate ion -65 -64.2845

Table A3. ΔGsolv(H2O) (kcal mol⁻¹) for the neutral compounds calculated with AM1 and the iso
No Compound Exp Calc
1 Methane 2 0.8071
2 Ethane 1.83 0.901
3 Propane 1.96 1.137
4 Cyclopropane 0.75 0.914
5 2-Methylpropane 2.32 1.475
6 2,2-Dimethylpropane 2.5 1.92
7 n-Butane 2.08 1.36
8 2,2-Dimethylbutane 2.59 2.067
9 Cyclopentane 1.2 0.6328
10 n-Pentane 2.33 1.571
11 2-Methylpentane 2.52 1.887
12 3-Methylpentane 2.51 1.843
13 2,4-Dimethylpentane 2.88 2.222
14 2,2,4-Trimethylpentane 2.85 2.608
15 Methylcyclopentane 1.6 1.02
16 n-Hexane 2.49 1.797
17 Cyclohexane 1.23 1.385
18 Methylcyclohexane 1.71 1.731
19 cis-1,2-Dimethylcyclohexane 1.58 1.975
20 n-Heptane 2.62 2.016
21 n-Octane 2.89 2.215
22 Ethylene 1.27 0.6094
23 Propylene 1.27 0.714
24 2-Methylpropene 1.16 0.822
25 1-Butene 1.38 0.867
26 2-Methyl-2-butene 1.31 0.958
27 3-Methyl-1-butene 1.83 1.21
28 1-Pentene 1.66 1.094
29 trans-2-Pentene 1.34 0.972
30 4-Methyl-1-pentene 1.91 1.516
31 Cyclopentene 0.56 0.1813
32 1-Hexene 1.68 1.329
33 Cyclohexene 0.37 0.565
34 trans-2-Heptene 1.66 1.405
35 1-Methylcyclohexene 0.67 0.709
36 1-Octene 2.17 1.749
37 1,3-Butadiene 0.61 0.2453
38 2-Methyl-1,3-butadiene 0.68 0.431
39 2,3-Dimethyl-1,3-butadiene 0.4 0.552
40 1,4-Pentadiene 0.94 0.793
41 1,5-Hexadiene 1.01 0.953
42 Acetylene -0.01 -0.1367
43 Propyne -0.31 -0.316
44 1-Butyne -0.16 -0.054

45 1-Pentyne 0.01 0.219
46 1-Hexyne 0.29 0.405
47 1-Heptyne 0.6 0.634
48 1-Octyne 0.71 0.846
49 1-Nonyne 1.05 1.042
50 Butenyne 0.04 -0.6119
51 Benzene -0.87 -1.2157
52 Toluene -0.89 -1.0219
53 1,2,4-Trimethylbenzene -0.86 -0.703
54 Ethylbenzene -0.8 -0.6633
55 m-Xylene -0.84 -0.8851
56 o-Xylene -0.9 -0.8467
57 p-Xylene -0.8 -0.8711
58 Propylbenzene -0.53 -0.396
59 Butylbenzene -0.4 -0.162
60 t-Butylbenzene -0.44 0.008
61 t-Amylbenzene -0.18 0.18
62 Naphthalene -2.39 -2.8235
63 Anthracene -4.23 -4.352
64 Phenanthrene -3.95 -4.32
65 Acenaphthene -3.15 -3.2067
66 p-Chlorotoluene -1.92 -2.0604
67 Fluoromethane -0.22 -2.1649
68 1,1-Difluoroethane -0.11 -0.989
69 Trifluoromethane -0.81 -1.074
70 Tetrafluoromethane 3.11 1.669
71 Hexafluoroethane 3.94 2.921
72 Octafluoropropane 4.28 4.362
73 Fluorobenzene -0.78 -2.67
74 2-Chloro-1,1,1-trifluoroethane 0.05 -0.278
75 Chlorofluoromethane -0.77 -0.802
76 Chlorodifluoromethane -0.5 0.262
77 Chlorotrifluoromethane 2.52 2.622
78 Dichlorodifluoromethane 1.69 2.018
79 Fluorotrichloromethane 0.82 0.6426
80 1,1,2-Trichloro-1,2,2-trifluoroethane 1.77 3.272
81 1,1,2,2-Tetrachlorodifluoroethane 0.82 1.111
82 Chloropentafluoroethane 2.87 3.456
83 1,1-Dichlorotetrafluoroethane 2.51 2.211
84 1,2-Dichlorotetrafluoroethane 2.32 3.915
85 1-Bromo-1-chloro-2,2,2-trifluoroethane -0.13 -0.4821
86 Bromotrifluoromethane 1.79 0.815
87 1-Bromo-1,2,2,2-tetrafluoroethane 0.52 0.341
88 Chloromethane -0.56 -1.1878
89 Dichloromethane -1.41 -0.6411
90 Trichloromethane -1.07 -0.6326
91 Tetrachloromethane 0.1 -0.999

92 Chloroethane -0.63 -1.1942
93 1,1-Dichloroethane -0.85 -0.9339
94 (E)-1,2-Dichloroethane -1.73 -1.5757
95 1,1,1-Trichloroethane -0.25 -1.2597
96 1,1,2-Trichloroethane -1.95 -1.1746
97 1,1,1,2-Tetrachloroethane -1.15 -1.0587
98 1,1,2,2-Tetrachloroethane -2.36 -0.257
99 Pentachloroethane -1.36 -1.0926
100 Hexachloroethane -1.41 -0.929
101 1-Chloropropane -0.27 -0.98153
102 2-Chloropropane -0.25 0.321
103 1,2-Dichloropropane -1.25 -0.7996
104 1,3-Dichloropropane -1.9 -1.60189
105 1-Chlorobutane -0.14 -0.84263
106 2-Chlorobutane 0.07 0.511
107 1,1-Dichlorobutane -0.7 -0.6147
108 1-Chloropentane -0.07 -0.6884
109 2-Chloropentane 0.07 0.708
110 3-Chloropentane 0.04 -0.4214
111 Chloroethylene -0.59 -0.594
112 cis-1,2-Dichloroethylene -1.17 -0.8462
113 trans-1,2-Dichloroethylene -0.76 -0.8634
114 Trichloroethylene -0.44 -1.1249
115 Tetrachloroethylene 0.05 -1.429
116 Chlorobenzene -1.12 -2.0034
117 o-Chlorotoluene -1.15 -1.3306
118 1,2-Dichlorobenzene -1.36 -2.2624
119 1,3-Dichlorobenzene -0.98 -2.2158
120 1,4-Dichlorobenzene -1.01 -2.548
121 2,2'-Dichlorobiphenyl -2.73 -2.51954
122 2,3-Dichlorobiphenyl -2.45 -3.2246
123 2,2,3'-Trichlorobiphenyl -1.99 -3.1211
124 Bromotrichloromethane -0.93 -1.661
125 1-Chloro-2-bromoethane -1.95 -2.1
126 Bromomethane -0.82 0.0297
127 Dibromomethane -2.11 -1.646
128 Tribromomethane -2.13 -2.796
129 Bromoethane -0.7 -0.80407
130 1,2-Dibromoethane -2.1 -1.998
131 1-Bromopropane -0.56 -0.57778
132 2-Bromopropane -0.48 1.318
133 1,2-Dibromopropane -1.94 -1.3061
134 1,3-Dibromopropane -1.96 -1.8931
135 1-Bromo-2-methylpropane -0.03 1.136
136 1-Bromobutane -0.41 -0.5448
137 1-Bromoisobutane -0.03 -0.1121
138 1-Bromo-3-methylbutane 0.21 -0.1958

139 1-Bromopentane -0.08 -0.4
140 3-Bromopropene -0.86 -0.4982
141 Bromobenzene -1.46 -1.9521
142 1,4-Dibromobenzene -2.3 -1.3538
143 p-Bromotoluene -1.39 -2.1592
144 1-Bromo-2-ethylbenzene -1.19 0.492
145 o-Bromocumene -0.85 -0.6293
146 Methanol -5.11 -4.897
147 Ethanol -5.01 -5.078
148 Ethylene glycol -9.3 -6.527
149 1-Propanol -4.83 -4.73
150 2-Propanol -4.76 -4.893
151 1,1,1-Trifluoro-2-propanol -4.16 -3.223
152 2,2,3,3-Tetrafluoropropanol -4.88 -3.7019
153 2,2,3,3,3-Pentafluoropropanol -4.15 -5.353
154 Hexafluoro-2-propanol -3.76 -1.908
155 2-Methyl-1-propanol -4.52 -4.372
156 1-Butanol -4.72 -4.502
157 2-Butanol -4.58 -2.1465
158 t-Butyl alcohol -4.51 -4.798
159 2-Methyl-1-butanol -4.42 -2.2947
160 3-Methyl-1-butanol -4.42 -4.064
161 2-Methyl-2-butanol -4.43 -2.759
162 2,3-Dimethyl-1-butanol -3.91 -3.582
163 1-Pentanol -4.47 -4.336
164 2-Pentanol -4.39 -2.595
165 3-Pentanol -4.35 -3.785
166 2-Methyl-1-pentanol -3.93 -3.687
167 2-Methyl-2-pentanol -3.93 -2.391
168 2-Methyl-3-pentanol -3.89 -2.893
169 4-Methyl-2-pentanol -3.74 -2.281
170 Cyclopentanol -5.49 -4.634
171 1-Hexanol -4.36 -4.053
172 3-Hexanol -4.07 -3.367
173 Cyclohexanol -5.48 -4.373
174 4-Heptanol -4.01 -3.216
175 Cycloheptanol -5.49 -3.883
176 1-Heptanol -4.24 -4.176
177 1-Octanol -4.09 -4.098
178 Allyl alcohol -5.03 -5.842
179 Phenol -6.62 -5.857
180 4-Bromophenol -7.13 -6.401
181 4-t-Butylphenol -5.92 -4.125
182 2-Cresol -5.87 -4.862
183 3-Cresol -5.49 -5.797
184 4-Cresol -6.14 -5.809
185 2,2,2-Trifluoroethanol -4.31 -6.47

186 p-Bromophenol -7.13 -6.401
187 2-Methoxyethanol -6.77 -4.719
188 Dimethoxymethane -2.93 -1.5904
189 Methyl propyl ether -1.66 -2.8
190 Methyl isopropyl ether -2.01 -1.644
191 Methyl t-butyl ether -2.21 -0.6949
192 Diethyl ether -1.76 -3.236
193 Ethyl propyl ether -1.81 -2.925
194 Dipropyl ether -1.16 -2.579
195 Diisopropyl ether -0.53 -1.0426
196 Di-n-butyl ether -0.83 -2.1359
197 Tetrahydrofuran -3.47 -3.734
198 2-Methyltetrahydrofuran -3.3 -3.432
199 Anisole -2.45 -3.1637
200 Ethyl phenyl ether -4.28 -3.1649
201 1,1-Diethoxyethane -3.27 -2.8988
202 1,2-Dimethoxyethane -4.83 -2.8075
203 1,2-Diethoxyethane -3.52 -2.6382
204 1,3-Dioxolane -4.1 -3.122
205 1,4-Dioxane -5.05 -4.876
206 2,2,2-Trifluoroethyl vinyl ether -0.12 -0.35
207 1-Chloro-2,2,2-trifluoroethyl difluoromethyl ether 0.11 -1.363
208 Acetaldehyde -3.5 -3.7743
209 Propanal -3.44 -2.949
210 Butanal -3.18 -3.2128
211 Pentanal -3.03 -2.57359
212 Hexanal -2.81 -2.9175
213 Heptanal -2.67 -2.2493
214 Octanal -2.29 -2.5851
215 Nonanal -2.08 -1.9666
216 trans-2-Butenal -4.23 -4.596
217 trans-2-Hexenal -3.68 -3.5553
218 trans-2-Octenal -3.44 -3.0859
219 trans,trans-2,4-Hexadienal -4.63 -4.2269
220 Benzaldehyde -4.02 -4.391
221 m-Hydroxybenzaldehyde -9.51 -9.066
222 p-Hydroxybenzaldehyde -10.48 -9.198
223 Acetone -3.85 -4.074
224 2-Butanone -3.64 -4.234
225 3-Methyl-2-butanone -3.24 -3.797
226 3,3-Dimethylbutanone -2.89 -3.3775
227 2-Pentanone -3.53 -3.69
228 3-Pentanone -3.41 -3.1188
229 4-Methyl-2-pentanone -3.06 -3.3107
230 2,4-Dimethyl-3-pentanone -2.74 -2.20966
231 Cyclopentanone -4.68 -3.966
232 2-Hexanone -3.29 -3.6411

233 2-Heptanone -3.04 -3.2661
234 4-Heptanone -2.93 -2.7957
235 2-Octanone -2.88 -3.1127
236 2-Nonanone -2.49 -2.929
237 5-Nonanone -2.67 -4.143
238 2-Undecanone -2.16 -2.4784
239 Acetophenone -4.58 -4.974
240 Acetic acid -6.7 -6.446
241 Propionic acid -6.47 -6.731
242 Butyric acid -6.36 -5.924
243 Pentanoic acid -6.16 -5.945
244 Hexanoic acid -6.21 -5.513
245 4-Amino-3,5,6-trichloropyridine-2-carboxylic acid -11.96 -10.62
246 Methyl formate -2.78 -3.9017
247 Ethyl formate -2.65 -3.0931
248 Propyl formate -2.48 -2.7386
249 Methyl acetate -3.31 -3.39886
250 Isopropyl formate -2.02 -3.41405
251 Isobutyl formate -2.22 -2.8939
252 Isoamyl formate -2.13 -2.379
253 Ethyl acetate -3.1 -3.38374
254 Propyl acetate -2.86 -3.8644
255 Isopropyl acetate -2.65 -4.012
256 Butyl acetate -2.55 -3.5154
257 Isobutyl acetate -2.36 -3.741
258 Amyl acetate -2.45 -3.3177
259 Isoamyl acetate -2.21 -3.3588
260 Hexyl acetate -2.26 -3.1331
261 Methyl propionate -2.93 -3.9138
262 Ethyl propionate -2.8 -3.7573
263 Propyl propionate -2.54 -3.3838
264 Isopropyl propionate -2.22 -3.4825
265 Amyl propionate -1.99 -2.8459
266 Methyl butyrate -2.83 -3.3729
267 Ethyl butyrate -2.5 -3.0897
268 Propyl butyrate -2.28 -2.88715
269 Methyl pentanoate -2.57 -3.4129
270 Ethyl pentanoate -2.52 -3.1943
271 Methyl hexanoate -2.49 -2.9004
272 Ethyl heptanoate -2.3 -2.8331
273 Methyl octanoate -2.04 -2.403
274 Methyl benzoate -4.28 -6.3513
275 Methylamine -4.56 -4.84
276 Ethylamine -4.5 -4.099
277 Propylamine -4.39 -4.149
278 Butylamine -4.29 -3.904
279 Pentylamine -4.1 -3.673

280 Hexylamine -4.03 -3.5
281 Dimethylamine -4.29 -3.548
282 Diethylamine -4.07 -3.237
283 Dipropylamine -3.66 -2.652
284 Dibutylamine -3.31 -2.439
285 Trimethylamine -3.23 -3.177
286 Triethylamine -3.03 -3.018
287 Azetidine -5.56 -4.231
288 Piperazine -7.38 -10.238
289 N,N'-Dimethylpiperazine -7.58 -7.921
290 N-Methylpiperazine -7.77 -6.94
291 Aniline -5.49 -6.306
292 1,1-Dimethyl-3-phenylurea -9.63 -10.59
293 N,N-Dimethylaniline -3.58 -5.564
294 Ethylenediamine -9.75 -10.117
295 Hydrazine -9.3 -9.103
296 2-Methoxy-1-ethanamine -6.55 -6.739
297 Morpholine -7.17 -6.857
298 N-Methylmorpholine -6.34 -5.702
299 N-Methylpyrrolidine -3.98 -3.865
300 N-Methylpiperidine -3.89 -3.585
301 Pyrrolidine -5.48 -3.34
302 Piperidine -5.11 -4.614
303 Pyridine -4.7 -4.052
304 2-Methylpyridine -4.63 -3.716
305 3-Methylpyridine -4.77 -3.713
306 4-Methylpyridine -4.94 -4.176
307 2-Ethylpyridine -4.33 -3.4
308 3-Ethylpyridine -4.6 -3.509
309 4-Ethylpyridine -4.74 -3.96
310 2,3-Dimethylpyridine -4.83 -3.678
311 2,4-Dimethylpyridine -4.86 -3.819
312 2,5-Dimethylpyridine -4.72 -3.76
313 2,6-Dimethylpyridine -4.6 -3.677
314 3,4-Dimethylpyridine -5.22 -3.809
315 3,5-Dimethylpyridine -4.84 -4.172
316 2-Methylpyrazine -5.52 -4.656
317 2-Ethylpyrazine -5.46 -4.449
318 2-Isobutylpyrazine -5.05 -3.628
319 2-Ethyl-3-methoxypyrazine -4.4 -5.173
320 2-Isobutyl-3-methoxypyrazine -3.68 -2.9403
321 9-Methyladenine -13.6 -14.007
322 1-Methylthymine -10.4 -10.295
323 Methylimidazole -10.25 -8.87
324 N-Propylguanidine -10.92 -10.18
325 Acetonitrile -3.89 -2.5948
326 Propionitrile -3.85 -2.4743

327 Butyronitrile -3.64 -2.2009
328 Benzonitrile -4.1 -4.117
329 2,6-Dichlorobenzonitrile -5.22 -4.563
330 3,5-Dibromo-4-hydroxybenzonitrile -9 -8.614
331 N,N-Dimethylformamide -4.9 -5.3883
332 N-Methylformamide -10 -9.26
333 Acetamide -9.71 -11.07
334 (E)-N-Methylacetamide -10 -8.562
335 (Z)-N-Methylacetamide -10 -9.602
336 Propionamide -9.42 -10.807
337 Methanethiol -1.24 -1.787
338 Ethanethiol -1.3 -1.5067
339 1-Propanethiol -1.05 -1.3026
340 Thiophenol -2.55 -3.11
341 Thioanisole -2.73 -2.886
342 Dimethyl sulfide -1.54 -1.786
343 Diethyl sulfide -1.43 -1.3491
344 Methyl ethyl sulfide -1.49 -1.676
345 Dipropyl sulfide -1.27 -1.0457
346 2,2'-Dichlorodiethyl sulfide -3.92 -3.214
347 Dimethyl disulfide -1.83 -2.83
348 Diethyl disulfide -1.63 -1.8783
349 Trimethyl phosphate -8.7 -10.507
350 Triethyl phosphate -7.8 -9.063
351 Tripropyl phosphate -6.1 -4.806
352 2,2-Dichloroethenyl dimethyl phosphate -6.61 -4.648
353 Dimethyl-5-(4-chloro)-bicyclo[3.2.0]-heptyl phosphate -7.28 -6.7174
354 o-Ethyl-o'-(4-bromo-2-chlorophenyl) S-propyl phosphorothioate -4.09 -5.69
355 Hydroquinone -10.77 -10.377
356 1,2,3-Trimethoxybenzene -5.4 -3.5128
357 1,2-Benzenediol -7.62 -9.02
358 1,3-Benzenediol -9.67 -9.45
359 o-Phenylenediamine -7.19 -9.892
360 m-Phenylenediamine -10.26 -11.444
361 2-Methylaniline -5.56 -6.083
362 N-Methylaniline -4.68 -5.339

Table A4. ΔGsolv(octanol) (kcal mol⁻¹) calculated with PM3 and the iso
No Compound Exp Calc
1 Methane 0.51 0.00188
2 Ethane -0.64 -0.4052
3 Propane -1.26 -1.092
4 Cyclopropane -1.6 -0.5115
5 2-Methylpropane -1.45 -1.689
6 2,2-Dimethylpropane -1.74 -2.095
7 n-Butane -1.86 -1.736
8 Cyclopentane -2.65 -2.132
9 n-Pentane -2.45 -2.408
10 n-Hexane -3.01 -3.065
11 Cyclohexane -3.46 -2.782
12 Methylcyclohexane -3.21 -3.206
13 n-Heptane -3.74 -3.738
14 n-Octane -4.18 -4.38
15 Ethylene -0.27 -0.5058
16 Propylene -1.14 -1.1119
17 2-Methylpropene -2.03 -1.669
18 1-Butene -1.89 -1.93
19 1-Hexene -2.94 -3.3
20 1,3-Butadiene -2.1 -2.402
21 Acetylene -0.51 -1.5961
22 Propyne -1.59 -2.203
23 1-Pentyne -2.79 -3.357
24 1-Hexyne -3.43 -4.003
25 Benzene -3.72 -3.802
26 Toluene -4.55 -4.292
27 Ethylbenzene -5.08 -4.844
28 m-Xylene -5.25 -4.782
29 o-Xylene -5.07 -4.691
30 p-Xylene -5.19 -4.775
31 Naphthalene -6.97 -7.246
32 Anthracene -10.47 -10.881
33 1,1-Difluoroethane -1.13 -0.9462
34 Tetrafluoromethane 1.5 1.476
35 Fluorobenzene -3.87 -4.381
36 Chlorodifluoromethane -1.97 -2.56
37 Dichlorodifluoromethane -1.25 -0.6303
38 Fluorotrichloromethane -2.63 -3.281
39 1,1,2-Trichloro-1,2,2-trifluoroethane -2.54 -2.871
40 1-Bromo-1-chloro-2,2,2-trifluoroethane -3.27 -4.07
41 Bromotrifluoromethane -0.75 -0.34358
42 Dichloromethane -3.07 -3.162
43 Trichloromethane -3.81 -4.555

44 Chloroethane -2.58 -3.385
45 1,1,1-Trichloroethane -3.69 -3.98
46 1,1,2-Trichloroethane -4.53 -3.746
47 1-Chloropropane -3.06 -3.903
48 2-Chloropropane -2.84 -2.093
49 cis-1,2-Dichloroethylene -3.71 -3.092
50 trans-1,2-Dichloroethylene -3.61 -2.543
51 Trichloroethylene -3.75 -3.116
52 Tetrachloroethylene -4.24 -3.844
53 Chlorobenzene -5 -5.555
54 1,2-Dichlorobenzene -6.01 -5.781
55 1,4-Dichlorobenzene -5.67 -6.16
56 2,2'-Dichlorobiphenyl -9.41 -8.99
57 2,3-Dichlorobiphenyl -9.23 -9.932
58 2,2,3'-Trichlorobiphenyl -9.12 -10.174
59 Bromomethane -2.43 -2.562
60 Dibromomethane -4.18 -3.92
61 Tribromomethane -5.62 -5.433
62 Bromoethane -2.9 -2.63
63 1-Bromopropane -3.42 -2.943
64 2-Bromopropane -3.4 -3.464
65 1-Bromobutane -4.16 -3.881
66 1-Bromopentane -4.68 -4.59
67 3-Bromopropene -3.3 -3.546
68 Bromobenzene -5.46 -6.563
69 1,4-Dibromobenzene -7.47 -7.553
70 p-Bromotoluene -6.36 -7.149
71 Methanol -3.87 -4.24
72 Ethanol -4.36 -4.729
73 Ethylene glycol -7.44 -7.7
74 1-Propanol -5.02 -5.282
75 2-Propanol -4.62 -4.892
76 1,1,1-Trifluoro-2-propanol -5.12 -4.713
77 Hexafluoro-2-propanol -5.76 -4.801
78 2-Methyl-1-propanol -4.78 -5.609
79 1-Butanol -5.71 -5.85
80 1-Pentanol -6.4 -6.578
81 1-Hexanol -7.06 -7.092
82 1-Heptanol -7.75 -8.082
83 1-Octanol -8.13 -8.835
84 Allyl alcohol -5.27 -5.795
85 Phenol -8.69 -7.491
86 2-Cresol -8.49 -7.441
87 3-Cresol -8.2 -8.028
88 4-Cresol -8.84 -8.036

89 2,2,2-Trifluoroethanol -4.81 -6.601
90 p-Bromophenol -10.59 -10.177
91 2-Methoxyethanol -5.83 -5.976
92 Methyl propyl ether -3.63 -3.898
93 Methyl isopropyl ether -4.64 -3.569
94 Methyl t-butyl ether -3.49 -3.683
95 Diethyl ether -2.89 -3.757
96 Tetrahydrofuran -3.93 -3.744
97 Anisole -5.47 -5.517
98 Ethyl phenyl ether -5.65 -6.012
99 1,2-Dimethoxyethane -4.55 -4.826
100 1,4-Dioxane -4.89 -5.028
101 Propanal -4.13 -4.096
102 Butanal -4.62 -4.894
103 Benzaldehyde -6.13 -7.323
104 m-Hydroxybenzaldehyde -11.39 -11.059
105 p-Hydroxybenzaldehyde -12.36 -10.662
106 Acetone -3.15 -3.682
107 2-Butanone -3.78 -4.401
108 3,3-Dimethylbutanone -4.53 -4.773
109 2-Pentanone -4.35 -4.689
110 3-Pentanone -4.36 -5.15
111 Cyclopentanone -5.01 -4.553
112 2-Hexanone -5.02 -5.157
113 2-Heptanone -5.65 -5.834
114 2-Octanone -6.38 -6.445
115 Acetophenone -6.74 -7.294
116 Acetic acid -6.35 -6.093
117 Propionic acid -6.86 -5.789
118 Butyric acid -7.58 -6.361
119 Pentanoic acid -8.22 -6.883
120 Hexanoic acid -8.82 -7.478
121 4-Amino-3,5,6-trichloropyridine-2-carboxylic acid -12.37 -12.821
122 Methyl formate -2.82 -4.885
123 Methyl acetate -3.54 -4.449
124 Ethyl acetate -4.06 -4.732
125 Propyl acetate -4.55 -5.362
126 Butyl acetate -4.96 -5.922
127 Methyl propionate -4.06 -4.338
128 Methyl butyrate -4.59 -5.195
129 Methyl pentanoate -5.13 -5.512
130 Methyl benzoate -7.26 -9.43
131 Methylamine -3.78 -2.665
132 Ethylamine -4.09 -3.056
133 Propylamine -4.77 -3.817

134 Butylamine -5.35 -4.436
135 Diethylamine -4.75 -4.726
136 Dipropylamine -6.02 -5.969
137 Trimethylamine -3.6 -3.792
138 Piperazine -5.8 -7.24
139 Aniline -6.71 -6.509
140 1,1-Dimethyl-3-phenylurea -13.12 -10.481
141 Hydrazine -6.48 -4.205
142 Morpholine -5.99 -5.477
143 Piperidine -6.27 -5.154
144 Pyridine -5.34 -5.106
145 2-Methylpyridine -6.14 -5.572
146 3-Methylpyridine -6.4 -5.599
147 4-Methylpyridine -6.6 -5.783
148 2-Ethylpyridine -6.4 -6.122
149 2-Methylpyrazine -5.87 -5.344
150 2-Ethylpyrazine -6.4 -5.886
151 2-Ethyl-3-methoxypyrazine -6.85 -6.644
152 9-Methyladenine -13.56 -12.888
153 Acetonitrile -3.15 -3.5473
154 Propionitrile -3.66 -3.851
155 Butyronitrile -4.25 -4.228
156 Benzonitrile -6.09 -7.448
157 2,6-Dichlorobenzonitrile -9.18 -7.326
158 1-Propanethiol -3.52 -2.524
159 Thiophenol -5.99 -5.537
160 Thioanisole -6.47 -7.907
161 Diethyl sulfide -4.09 -4.015
162 Dipropyl sulfide -3.89 -5.332
163 Dimethyl disulfide -4.24 -4.001
164 Trimethyl phosphate -7.81 -8.173
165 Triethyl phosphate -8.88 -7.976
166 Tripropyl phosphate -8.65 -8.904
167 2,2-Dichloroethenyl dimethyl phosphate -8.59 -9.513
168 o-Ethyl-o'-(4-bromo-2-chlorophenyl) S-propyl phosphorothioate -10.49 -9.792
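
The water and octanol solvation free energies tabulated above can be related to an octanol/water partition coefficient through the standard thermodynamic transfer relation, logPow = (ΔGsolv(H2O) − ΔGsolv(octanol)) / (RT ln 10). The following is a minimal sketch of that arithmetic, assuming T = 298.15 K; the function name `log_pow` and the choice of temperature are illustrative assumptions, and nothing here is meant to reproduce how the logPow values in the later tables were actually obtained. The two input pairs are the experimental (Exp) entries for methanol and benzene from Tables A3 and A4.

```python
import math

# Gas constant in kcal mol^-1 K^-1 (so that free energies can be given in kcal mol^-1).
R_KCAL = 1.987204e-3


def log_pow(dg_water, dg_octanol, temperature=298.15):
    """Estimate logPow from solvation free energies (kcal mol^-1).

    Hypothetical helper: uses the transfer free energy from water to
    octanol, divided by RT ln(10). The temperature is an assumption.
    """
    return (dg_water - dg_octanol) / (R_KCAL * temperature * math.log(10.0))


# Experimental values copied from Tables A3 (water) and A4 (octanol).
methanol = log_pow(dg_water=-5.11, dg_octanol=-3.87)
benzene = log_pow(dg_water=-0.87, dg_octanol=-3.72)

print(f"Methanol logPow estimate: {methanol:.2f}")
print(f"Benzene  logPow estimate: {benzene:.2f}")
```

A compound that is better solvated in octanol than in water (more negative ΔGsolv in octanol) gives a positive logPow, as for benzene; the reverse holds for methanol.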

Table A5. ΔGsolv(CHCl3) (kcal mol⁻¹) calculated with AM1 and the iso
No Compound Exp Calc
1 Cyclohexane -4.45 -3.419
2 n-Octane -5.25 -4.99
3 Benzene -4.64 -4.588
4 Toluene -5.48 -5.281
5 Ethylbenzene -5.84 -5.858
6 m-Xylene -5.86 -6.029
7 o-Xylene -6.23 -5.921
8 Naphthalene -7.89 -8.172
9 Phenanthrene -10.9 -11.582
10 Fluorobenzene -4.25 -4.698
11 Dichlorodifluoromethane -1.55 -0.8468
12 Fluorotrichloromethane -2.62 -2.71
13 Chlorobenzene -5.45 -5.754
14 1,4-Dichlorobenzene -6.32 -6.767
15 Bromobenzene -6.07 -6.298
16 Methanol -3.32 -3.283
17 Ethanol -3.94 -4.018
18 Ethylene glycol -5.98 -4.636
19 1-Propanol -4.41 -4.396
20 2-Propanol -4.28 -4.544
21 2-Methyl-1-propanol -4.48 -5.095
22 1-Butanol -5.28 -5.009
23 1-Pentanol -5.9 -5.589
24 1-Hexanol -6.67 -6.142
25 1-Heptanol -7.53 -6.824
26 Allyl alcohol -4.34 -4.913
27 Phenol -7.14 -6.365
28 2-Cresol -7.55 -6.444
29 3-Cresol -6.7 -7.073
30 4-Cresol -7.59 -7.119
31 2,2,2-Trifluoroethanol -3.03 -4.1799
32 p-Bromophenol -8.59 -7.94
33 Diethyl ether -4.32 -5
34 Diisopropyl ether -3.78 -5.129
35 Anisole -6.24 -6.267
36 Ethyl phenyl ether -7.16 -7.028
37 1,4-Dioxane -6.21 -5.452
38 Acetaldehyde -3.65 -3.7784
39 Benzaldehyde -7.09 -7.047
40 p-Hydroxybenzaldehyde -10.3 -9.028
41 Acetone -4.42 -4.666
42 2-Butanone -5.43 -5.599
43 Acetophenone -7.81 -8.122
44 Acetic acid -4.74 -5.133
45 Propionic acid -5.37 -5.899

46 Butyric acid -5.99 -5.93
47 Pentanoic acid -6.61 -6.608
48 Hexanoic acid -7.51 -6.992
49 Methyl acetate -4.9 -4.767
50 Ethyl acetate -5.58 -5.449
51 Propyl acetate -6.35 -6.336
52 Butyl acetate -6.71 -6.744
53 Amyl acetate -7.36 -7.316
54 Methyl propionate -5.48 -5.484
55 Methyl pentanoate -6.68 -6.462
56 Methyl hexanoate -7.24 -6.911
57 Methyl benzoate -7.81 -9.103
58 Methylamine -3.17 -3.961
59 Ethylamine -4.02 -4.222
60 Propylamine -4.73 -4.96
61 Butylamine -5.35 -5.543
62 Dimethylamine -3.69 -4.317
63 Diethylamine -5.23 -5.7
64 Trimethylamine -3.9 -5.085
65 Aniline -7.34 -6.985
66 1,1-Dimethyl-3-phenylurea -13.64 -12.187
67 Hydrazine -7.46 -5.937
68 Morpholine -6.72 -6.598
69 Piperidine -6.37 -6.554
70 Pyridine -6.45 -6.232
71 2-Methylpyridine -6.98 -6.913
72 3-Methylpyridine -7.35 -6.707
73 4-Methylpyridine -7.5 -7.047
74 2,6-Dimethylpyridine -7.74 -7.809
75 2-Methylpyrazine -6.99 -7.196
76 2-Ethylpyrazine -7.72 -7.79
77 9-Methyladenine -12.51 -13.384
78 1-Methylthymine -9.71 -10.476
79 Acetonitrile -4.44 -3.509
80 Benzonitrile -7.22 -7.184
81 Acetamide -7.05 -8.17
82 Thiophenol -7.61 -7.253
83 Thioanisole -5.98 -7.656
84 Diethyl sulfide -6.4 -5.462
85 Trimethyl phosphate -9.74 -9.232
86 Triethyl phosphate -10.9 -11.085
87 Tripropyl phosphate -11.11 -11.38
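
Tables of this kind (experimental versus calculated values) are usually summarized with a few error statistics. The sketch below shows one way to compute the mean signed error, mean unsigned error, and root-mean-square error from (Exp, Calc) pairs. It uses only the first five rows of Table A5 as sample input, and the function name `error_statistics` is a hypothetical helper, so the numbers it prints describe that five-row subset and not the full table or any statistic reported in the thesis.

```python
import math

# First five rows of Table A5 as (compound, Exp, Calc), in kcal mol^-1.
rows = [
    ("Cyclohexane", -4.45, -3.419),
    ("n-Octane", -5.25, -4.99),
    ("Benzene", -4.64, -4.588),
    ("Toluene", -5.48, -5.281),
    ("Ethylbenzene", -5.84, -5.858),
]


def error_statistics(data):
    """Return (MSE, MUE, RMSE) for Calc - Exp over the given rows.

    MSE here is the mean *signed* error, MUE the mean unsigned error.
    """
    errors = [calc - exp for _, exp, calc in data]
    n = len(errors)
    mse = sum(errors) / n
    mue = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mse, mue, rmse


mse, mue, rmse = error_statistics(rows)
print(f"MSE = {mse:.3f}, MUE = {mue:.3f}, RMSE = {rmse:.3f} kcal/mol")
```

To evaluate a whole table, the same function can be applied to all of its rows; each row parses unambiguously because the first token is the index and the last two tokens are Exp and Calc.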

Table A6. The theoretical logPow for small data set calculated with PM3 and the iso
No Compound Exp Calc
1 n-Nonane 5.65 5.9077
2 n-Undecane 6.54 7.12721
3 n-Dodecane 6.8 7.74502
4 n-Tridecane 7.56 8.35184
5 n-Tetradecane 8 8.9506
6 2,3-Dimethylbutane 3.85 3.7274
7 3,3-Dimethylheptane 5.19 5.55666
8 Hept-1-ene 3.99 4.0997
9 Non-1-ene 5.15 5.31994
10 Isobutene 2.34 2.04912
11 But-2-yne 1.46 1.37414
12 Nonan-1-ol 3.67 3.90915
13 Undecan-1-ol 4.72 5.10374
14 Dodecan-1-ol 5.13 5.72595
15 Tridecan-1-ol 5.82 6.3313
16 Tetradecan-1-ol 6.36 6.93739
17 Hexan-2-ol 1.76 1.6966
18 Heptan-2-ol 2.31 2.3364
19 Heptan-3-ol 2.24 2.13486
20 Octan-2-ol 2.9 2.9271
21 Octan-4-ol 2.68 2.76001
22 2-Methylpropan-2-ol 0.35 0.11286
23 3-Methylbutan-2-ol 1.28 1.84904
24 2,3-Dimethylbutan-2-ol 1.48 1.34263
25 Ethyl-n-butyl ether 2.03 2.02127
26 Decan-2-one 3.73 3.04143
27 5-Methylhexan-2-one 1.88 1.25248
28 5-Methyloctan-2-one 2.92 2.40016
29 Heptanoic acid 2.41 2.36425
30 Octanoic acid 3.05 2.91318
31 Decanoic acid 4.09 4.07771
32 Tetradecanoic acid 6.1 6.44255
33 Hexadecanoic acid 7.17 7.62204
34 Octadecanoic acid 8.23 8.80607
35 Eicosanoic acid 9.29 10.01429
36 3-Methylbutanoic acid 1.16 1.37854
37 2-Ethylbutanoic acid 1.68 1.92453
38 2-Methylpentanoic acid 1.8 1.91867
39 2-Propylpentanoic acid 2.75 3.20339
40 2-Ethylhexanoic acid 2.64 3.22025
41 n- 2.67 2.66774
42 n-Butyl pentanoate 3.36 3.3828
43 Ethyl isobutanoate 1.55 1.32709

44 1-Cyanopropane 0.53 0.87102
45 1-Cyanobutane 1.12 1.46055
46 1-Cyanopentane 1.66 2.01255
47 2-Cyanopropane 0.46 0.89997
48 2-Cyanobutane 1.1 1.3073
49 1-Cyano-2-methylpropane 1.1 1.48722
50 2-Cyano-2-methylpropane 1.08 1.43805
51 n-Heptylamine 2.57 3.01944
52 n-Octylamine 3.09 3.61674
53 n-Nonylamine 3.6 4.22576
54 Isopropylamine 0.26 0.97179
55 Isobutylamine 0.73 0.98132
56 sec-Butylamine 0.74 0.86406
57 tert-Butylamine 0.4 1.38293
58 Diisopropylamine 1.16 1.50972
59 Butanamide -0.21 0.08575
60 Ethylthiol 1.18 1.98829
61 n-Propylthiol 1.81 1.4209
62 n-Butylthiol 2.28 2.0519
63 1-Fluorobutane 2 1.82412
64 1-Fluoropentane 2.33 2.38038
65 1-Chlorohexane 3.66 3.7241
66 1-Chloroheptane 4.15 4.29098
67 1-Chlorooctane 4.73 4.82671
68 1-Bromohexane 3.8 3.60502
69 1-Bromoheptane 4.36 4.24012
70 1-Bromooctane 4.89 4.79637
71 1-Bromodecane 6 5.89041
72 Isopropylbenzene 3.66 3.73341
73 1,2,3-Trimethylbenzene 3.66 3.30057
74 1,3,5-Trimethylbenzene 3.59 3.58295
75 2-Ethyltoluene 3.53 3.44326
76 4-Ethyltoluene 3.63 3.74418
77 4-Isopropyltoluene 4.1 4.35056
78 1,2,3,4-Tetramethylbenzene 3.98 3.75078
79 1,2,3,5-Tetramethylbenzene 4.04 3.93979
80 1,2,4,5-Tetramethylbenzene 4 3.96346
81 n-Pentylbenzene 4.9 5.03192
82 Pentamethylbenzene 4.56 4.25434
83 n-Hexylbenzene 5.52 5.6416

84 n-Octylbenzene 6.34 6.86044
85 n-Decylbenzene 7.4 8.08068
86 2,4-Dimethylphenol 2.3 1.80214
87 2,5-Dimethylphenol 2.33 1.84465
88 2,6-Dimethylphenol 2.36 1.85564
89 3,4-Dimethylphenol 2.23 2.06304
90 3,5-Dimethylphenol 2.35 1.70467
91 2-Ethylphenol 2.47 1.95311
92 3-Ethylphenol 2.4 2.42215
93 4-Ethylphenol 2.58 1.96484
94 2,4,6-Trimethylphenol 2.97 2.46979
95 2,3,6-Trimethylphenol 2.67 2.25579
96 1,2,3-Trichlorobenzene 4.05 3.9296
97 1,2,4-Trichlorobenzene 4.02 3.40647
98 1,3,5-Trichlorobenzene 4.19 3.05873
99 1,2,3,4-Tetrachlorobenzene 4.64 3.94045
100 1,2,3,5-Tetrachlorobenzene 4.65 3.54198
101 1,2,4,5-Tetrachlorobenzene 4.6 2.91003
102 Pentachlorobenzene 5.18 3.48343
103 Hexachlorobenzene 5.73 3.69628
104 1,2-Dibromobenzene 3.64 3.55591
105 1,3-Dibromobenzene 3.75 3.38148
106 Dimethyl ether 0.1 -0.59729
107 Methyl nonanoate 3.87 3.86972
108 Methyl decanoate 4.41 4.48803
109 Methylthiol 0.65 1.21071
110 Cyclooctane 4.45 4.34961
111 Cyclooctanol 2.39 2.48884
112 Cyclohexanone 0.81 0.15757
113 Cyclododecanone 4.1 3.07294
114 Formic acid -0.54 -0.13045
115 Formamide -1.51 -0.78198
116 N,N-Dimethylacetamide -0.77 -0.2653
117 N,N-Diethylacetamide 0.34 1.61291
118 N,N-Dimethylpropanamide -0.11 0.57384
119 Methyl acrylate 0.8 0.10993
120 Ethyl acrylate 1.32 0.74196
121 n-Butyl acrylate 2.36 2.1043
122 Isobutyl acrylate 2.22 2.18408
123 Methyl methacrylate 1.38 0.99818
124 Ethyl methacrylate 1.94 1.34087
125 n-Butyl methacrylate 2.88 2.64553

146

Appendix

126 Isobutyl methacrylate 2.66 2.73953 127 1,4-Dichlorobutane 2.24 1.92607 128 Diethoxymethane 0.84 1.93911 129 Cyclohexanediol 0.16 0.0601 130 1,2-Ethanediol -1.36 -0.19861 131 2,3-Butanediol -0.92 -0.52767 132 Ethoxyethanol -0.32 0.50715 133 Butoxyethanol 0.83 1.88129 134 1,4-Butanediol -0.83 -0.64346 135 1-Methylcyclohexanol 1.33 1.56176 136 Tetrahydropyran 0.95 1.05607 4- 137 Methylcyclohexanone 1.38 0.81276 138 5-Phenylpentan-2-one 2.42 1.57421 139 Glutaric acid -0.29 -0.58483 Cyclohexanecarboxylic 140 acid 1.96 1.78748 141 Tripropylamine 2.79 3.76331 142 Cyclohexylamine 1.49 1.96117 Methyl 4- 143 phenylbutanoate 2.77 2.38038 144 Dimethyl adipate 1.03 0.72481 145 4-Chlorobutyronitrile 0.56 0.71822 146 Octanenitrile 2.73 3.14946 147 4-Phenylbutyramid 1.41 2.57019 148 4-Chlorobutanol 0.85 0.6867 149 Cyclopropylamine 0.07 0.18542 4-tert- 150 Butylcyclohexanol 3.06 3.57936 151 Bromochloromethane 1.41 0.63313 152 Difluoromethane 0.2 0.13859 153 Thymol 3.3 3.02897 154 Bibenzyl 4.79 5.03558 155 Crotonic acid 0.72 0.25138 156 Acrylamide -0.67 -0.71089 157 1,2-Dibutoxyethane 2.48 4.78567


Table A7. LogPow (SIM) for large data set calculated with AM1 and the iso

No  Compound  Exp  Calc
1  Ethane  1.81  1.454
2  Cyclopropane  1.72  1.213
3  Butane  2.89  2.374
4  1-Pentyne  1.98  2.112
5  1,5-Hexadiene  2.87  2.768
6  Allylbenzene  3.23  3.43
7  1,3,5-Trimethylbenzene  3.42  3.35
8  1-Methylnaphthalene  3.87  3.694
9  Acenaphthene  3.92  3.651
10  2,6-Dimethylnaphthalene  4.31  4.079
11  1,4-Dimethylnaphthalene  4.37  4.138
12  1,7-Dimethylnaphthalene  4.44  4.19
13  1,8-Dimethylnaphthalene  4.26  3.985
14  Hexamethylbenzene  4.61  4.053
15    4.18  3.869
16  1-Methyl-9H-fluorene  4.97  4.15
17  Bibenzyl  4.79  4.845
18  Pyrene  4.88  4.707
19  Fluoranthene  5.16  4.759
20  Benz(A)anthracene  5.79  5.388
21  Ethyleneoxide  -0.3  0.4616
22  Ethanol  -0.31  0.1637
23  Furan  1.34  1.122
24  Butanol  0.88  0.8264
25  3-Pentanol  1.21  1.234
26  Cyclohexanol  1.23  1.269
27  3,3-Dimethyl-2-butanol  1.48  1.65
28  m-Methylphenol  1.96  1.896
29  2,6-Dimethylphenol  2.36  1.945
30  2,6-Dimethylcyclohexanol  2.38  1.851
31  Octanol  3  2.739
32  3-Phenyl-propanol  1.88  2.358
33  TR-2-Phenylcyclopropylcarbinol  1.95  1.819
34  p-tert-Butylphenol  3.31  3.151
35  Adamantane  2.14  2.375
36  1-Dodecanol  5.13  4.493
37  Phenyl-p-tolylcarbinol  3.13  3.809
38  Acetone  -0.24  0.2081
39  2-Butanone  0.29  0.7136
40  1-Hexen-5-one  1.02  1.238
41  4-Cyclopropyl-2-butanone  1.5  1.265
42  Phenylacetaldehyde  1.78  1.904
43  1-Phenylpropan-2-one  1.44  2.018
44  2-Nonanone  3.14  2.813
45  n-Propylformate  0.83  0.4846
46  Ethylacetate  0.73  0.2221
47  Ethylpropionate  1.21  0.4737
48  Phenylacetate  1.49  1.604
49  2-Propylpentanoic acid  2.75  1.452
50  p-Tolylacetate  2.11  2.14
51  Ethylbenzoate  2.64  2.113
52  beta-Phenylpropionic acid  1.84  1.565
53  Methyl-3-phenylpropionate  2.32  2.519
54  Decanoic acid  4.09  2.574
55  (2-Propan-2-ylphenyl)acetate  2.78  2.858
56  1-Acetoxynaphthalene  2.78  2.684
57  Propionitrile  0.16  0.9268
58  Isopropylamine  0.26  -1.58E-02
59  Methylbutylamine  1.33  1.555
60  Amylamine  1.49  1.063
61  Triethylamine  1.45  1.965
62  Dipropylamine  1.67  2.095
63  Hexylamine  2.06  1.595
64  m-Toluidine  1.4  1.297
65  Propyl-isobutylamine  2.07  2.36
66  N,N-Dimethylaniline  2.31  2.51
67  N-Ethylaniline  2.16  2.257
68  N-Methyl-p-toluidine  2.15  1.876
69  N-Propylaniline  2.45  2.462
70  N,N,4-Trimethylaniline  2.81  2.592
71  N,N-Dimethyl-1-phenylmethanamine  1.98  2.642
72  Tripropylamine  2.79  3.582
73  1-Phenylbutan-2-amine  2.28  2.348
74  Diphenylmethylamine  3.9  4.029
75  N-Methylacetamide  -1.05  -0.8345
76  Benzamide  0.64  0.8224
77  Acetanilide  1.16  1.428
78  N-Methylformanilide  1.09  1.583
79  Phenylacetamide  0.45  0.5849
80  Allyl-isopropyl-acetamide  1.14  1.032
81  2-Propan-2-ylpentanamide  1.48  1.421
82  Cinnamamide  1.43  1.548
83  4-propan-2-ylbenzamide  2.14  1.869
84  Isobutyranilide  1.95  1.812
85  4-Phenylbutanamide  1.41  1.733
86  [(E)-2-Nitrovinyl]benzene  2.11  1.972
87  Carbon tetrachloride  2.83  3.581
88  Difluoromethane  0.2  -0.3114
89  Methylenechloride  1.25  1.579
90  Methylchloride  0.91  0.6398
91  Tetrachloroethylene  3.4  3.271
92  Trichloroethylene  2.61  2.348
93  1,1-Difluoroethylene  1.24  0.7709
94  Ethylchloride  1.43  1.536
95  Ethylbromide  1.61  2.435
96  1-Chloropropane  2.04  2.368
97  2-Chloropropane  1.9  1.873
98  1-Bromopropane  2.1  3.096
99  1-Bromopentane  3.37  3.836
100  1,3-Dibromobenzene  3.75  2.707
101  Iodobenzene  3.25  2.606
102  alpha-Chlorotoluene  2.3  2.322
103  1-Methylpentachlorocyclohexane  4.04  4.254
104  1,4-Dimethyltetrachlorocyclohexane  4.4  4.794
105  3-Chlorobiphenyl  4.58  4.284
106  Adenosine  -1.05  -1.053
107  2,6-Diaminopurine 2'-Deoxyriboside  -0.52  -1.962
108  deoxyuridine  -1.62  -0.7789
109  3'-Fluoro-2',3'-dideoxyuridine  -0.49  1.094
110  3'-Deoxy-3'-fluorothymidine  -0.28  0.3543
111  2-amino-6-bromo-2',3'-dideoxypurine  0.34  0.5567
112  8-Azaadenine  -0.96  1.11
113  9-Butylpurin-6-amine  1.25  0.6148
114  9-Pentylpurin-6-amine  1.79  1.009
115  Cytosine  -1.73  -0.1575
116  Hypoxanthine  -1.11  -0.1446
117  4-Nitropyrazole  -0.59  6.35E-02
118  5-Chlorouracil  -0.35  0.1955
119  1-Methyluracil  -1.2  -0.3744
120  Tribromoethene  3.2  3.2
121  2,2,2-Trichloroethanol  1.42  2.502
122  Bromoacetic acid  0.41  0.5277
123  Fluoroacetamide  -1.05  -1.239
124  Hydroxyacetic acid  -1.11  -1.206
125  2-Fluoroethanol  -0.76  -0.2858
126  2-Chloroethanol  -0.06  0.7047
127  2-Bromoethanol  0.18  1.337
128  4-Nitropyrazole  0.59  1.102
129  Pyrazole  0.26  5.11E-02
130  Imidazoline-2-thiol  -0.66  8.41E-02
131  alpha-Hydroxypropionic acid  -0.72  -0.8975
132  Glycerol  -1.76  -1.664
133  Barbituric acid  -1.47  -0.7828
134  2-Methyl-4,5-dihydro-1H-imidazole  0.52  -0.6484
135  N-Nitrosothiomorpholine  0.4  0.8287
136  N-Nitrosomorpholine  -0.44  0.2063
137  3,5,6-Trichloro-2-pyridinol  3.21  2.737
138  2,3-Dichloropyridine  2.11  2.3
139  3,5-Dichloropyridine  2.56  2.404
140  Uric acid  -2.17  -0.5138
141  Pyridine  0.65  1.046
142  Acetylacetone  0.4  -0.1067
143  D-Valerolactone  -0.35  0.1463
144  1,3-Diacetyl-urea  -0.68  -2.305
145  N-Methylmorpholine  -0.33  3.34E-02
146  2,4,6-Tribromophenol  4.13  3.898
147  3-Cyanopyridine  0.23  1.401
148  4-Cyanopyridine  0.46  1.519
149  m-Iodonitrobenzene  2.94  1.967
150  o-Dinitrobenzene  1.69  2.017
151  p-Dinitrobenzene  1.46  1.902
152  o-Fluorophenol  1.71  1.19
153  p-Fluorophenol  1.77  1.123
154  3,4-Dichlorobenzenesulfonamide  1.44  1.722
155  p-Fluoroaniline  1.15  0.576
156  o-Chloroaniline  1.9  1.733
157  o-Nitroaniline  1.85  1.038
158  p-Nitroaniline  1.39  1.017
159  2-Picoline  1.11  1.436
160  Phenylhydroxylamine  0.79  0.8007
161  p-Aminophenol  0.04  0.2716
162  Piconol  0.06  0.4649
163  Phenylsulfamide  0.4  0.736
164  Sulfanilamide  -0.62  2.60E-02
165  3-Methyl-N-nitrosopiperidine  0.99  1.345
166  Diethylacetal  0.84  1.125
167  3,5-Diiodosalicylic acid  4.56  3.321
168  Phenylisothiocyanate  3.28  2.262
169  p-Bromobenzoic acid  2.86  2.044
170  m-Iodobenzoic acid  3.13  1.569
171  7-Azaindole  1.82  1.097
172  o-Phenyleneurea  1.12  0.9036
173  2-Chloro-4-aminobenzoic acid  1.33  0.8486
174  m-Nitrobenzamide  0.77  0.378
175  p-Nitrobenzamide  0.82  0.5428
176  Benzaldoxime  1.75  1.649
177  o-Hydroxybenzamide  1.28  3.74E-02
178  p-Hydroxybenzamide  0.33  -3.49E-03
179  p-Aminobenzoic acid  0.83  0.3551
180  3-Hydroxy-4-aminobenzoic acid  0.5  -0.1222
181  p-Nitrobenzyl alcohol  1.26  1.146
182  Benzoylhydrazine  0.19  -9.23E-02
183  4-Pyridineacetamide  -0.65  -9.63E-02
184  Phenylurea  0.83  0.3346
185  Methylphenylsulfoxide  0.55  1.093
186  Salicyl alcohol  0.73  0.9729
187  2,4-Diaminobenzoic acid  -0.31  -0.7212
188  Thiophene-2-carboxylic acid ethylester  2.33  2.032
189  3-Methylsulfanylaniline  1.45  0.5749
190  m-Methylbenzenesulfonamide  0.85  0.8004
191  4-Pyridineethaneamine  -0.01  0.5096
192  1-H-2-Methoxytetrachlorocyclohexane  2.99  3.521
193  Sulfaguanidine  -1.22  -1.08
194  1,3-Diallylurea  0.64  0.3586
195  1-Nitrosotriethylurea  1.54  1.037
196  4,5,7-Trichloro-2-(trifluoromethyl)benzimidazole  3.78  4.189
197  1,2,4-Benzothiadiazine-1,1-dioxide-3-trifluoromethyl-6-chloro  1.65  2.38
198  6-Nitro-2-(trifluoromethyl)-1H-benzimidazole  2.68  3.092
199  2-(Trifluoromethyl)-1H-benzimidazole  2.67  2.666
200  m-Cyanobenzoic acid  1.48  1.205
201  p-Cyanoformanilide  1.08  1.249
202  4-Chloro beta-nitrostyrene  2.44  2.362
203  6-Chloro-2-methylsulfanyl-1H-benzimidazole  3.22  2.59
204  2-Aminoquinazoline-4-one  0.6  0.7154
205  2-(Fluorophenyl)acetate  1.76  1.407
206  3-Fluorophenylacetate  1.74  1.346
207  3-Chlorophenylacetate  2.32  2.284
208  2-Bromophenylacetate  2.2  2.419
209  m-Iodophenylacetic acid  2.62  1.64
210  2-(2,4,6-Tribromophenoxy)ethanol  3.42  3.925
211  m-Nitroacetophenone  1.42  1.553
212  2-(2-Fluorophenoxy)acetic acid  1.39  1.009
213  2-(3-Chlorophenoxy)acetic acid  2.03  1.857
214  2-(2-Nitrophenoxy)acetic acid  0.97  0.8384
215  1-(3,3,3-Trifluoroethoxy)pentachlorocyclohexane  4.06  4.468
216  S-Phenylethanethioate  2.23  2.104
217  p-Fluoroacetanilide  1.47  0.4283
218  2-Hydroxyacetophenone  1.92  1.404
219  4-Methoxybenzoic acid  2.74  0.8878
220  Iso-phthalamide  -0.21  -0.7384
221  N-Methyl-2-fluorophenylcarbamate  1.25  0.9564
222  N-Methyl-3-fluorophenylcarbamate  1.48  1.3
223  N-Methyl-4-fluorophenylcarbamate  1.28  1.201
224  2-Chlorophenylcarbamate,o-methyl  2.13  1.894
225  N-Methyl-3-bromophenylcarbamate  2.25  1.861
226  N-Methyl-2-iodophenylcarbamate  1.94  1.946
227  1,2,4-Benzothiadiazine-1,1-dioxide-3-methyl-6-amino-7-chloro  0.63  1.198
228  Methyl 3-hydroxybenzoate  1.89  1.216
229  m-Hydroxyphenylacetic acid  0.85  0.6542
230  o-Hydroxyphenylacetic acid  0.85  0.5124
231  (2-Hydroxyphenoxy)acetic acid  0.85  0.6071
232  p-Methylsulfonylbenzoic acid  0.67  0.6285
233  p-Methoxyformanilide  1.03  1.195
234  o-Methoxybenzamide  0.84  1.015
235  p-Methoxybenzamide  0.86  0.7343
236  2-Pyridine propanamide  -0.27  0.2265
237  2-  1.16  2.016
238  2-Pyridinepropanol  0.58  1.108
239  4-Pyridinepropanol  0.58  1.318
240  p-Ethylbenzenesulfonamide  1.31  1.145
241  2-Chloroquinoline  2.71  2.781
242  8-Chloroquinoline  2.44  2.612
243  5-Nitroquinoline  1.86  2.029
244  7-Nitroquinoline  1.82  1.876
245  Phenoxyacetic acid,3-cyano-4-chloro  1.56  2.391
246  Quinoline  2.03  2.173
247  4-Hydroxyquinoline  0.75  1.729
248  8-Quinolinol  2.02  1.462
249  p-Trifluoromethylphenylacetic acid  2.45  1.251
250  (4-Cyanophenoxy)-acetic acid  0.93  1.23
251  4-Aminoquinoline  1.63  1.465
252  3-Aminoquinoline  1.63  1.324
253  Acetylsalicylic acid  1.19  0.6613
254  p-Acetylformanilide  0.94  0.838
255  3-Acetamidobenzoic acid  1.32  0.3003
256  5,6-Dimethyl-1H-benzimidazole  2.35  1.603
257  p-N-Acetylaminobenzamide  0.01  3.88E-02
258  p-Methoxyphenylacetic acid  1.42  0.8824
259  o-Methylphenoxyacetic acid  1.98  1.432
260  p-Methylphenoxyacetic acid  1.86  1.539
261  2-(3-Methoxyphenoxy)acetic acid  1.38  1.646
262  Phenoxyacetic acid,m-methylsulfonyl  0.01  0.7186
263  3-(4-Chlorophenyl)-1,1-dimethylurea  1.94  1.805
264  Ethyl 2-aminobenzoate  2.57  1.284
265  N-methyl-o-tolylcarbamate  1.46  1.608
266  N-phenyl ethylcarbamate  2.3  1.816
267  1,1-Dimethyl-3-p-nitrophenylurea  1.51  1.677
268  1-Phenyl-3-ethylthiourea  1.42  3.381
269  1,3-Dimethyl-1-phenylurea  1.02  0.9155
270  2-Pyridinebutanol  0.86  1.59
271  4-Pyridinebutanamine  0.86  1.343
272  5-Ethyl-5-isopropylbarbituric acid  0.97  0.4037
273  N-Methyl-5-butylbarbituric acid  1.1  1.055
274  3-Isopropylthio-4-amino-6-isopropyl-1,2,4-triazine-5-one  2.06  1.747
275  1,3-Dibutylurea  1.4  -0.798
276  (3,4,5-Trichlorophenyl)hydrazono]cyano-acetic methyl  5.22  4.301
277  Butanenitrile,2[(3,4-dichlorophenyl)hydrazono]-3-oxo  4.56  3.482
278  Methyl(2Z)-2-[(3-chlorophenyl)hydrazinylidene]-2-cyano acetate  3.56  3.03
279  5-Methyl-8-quinolinol  2.37  1.708
280  4-Methyl-5,8-dihydroxyquinoline  1.59  1.215
281  2-N,N-Dime-6(5-NO2-2-furyl)-1,3-thiazine-4-one  0.55  2.568
282  3-Methio-4-amine-6-phenyl-1,2,4-triazine-5-one  1.66  1.259
283  Benzoylacetone  2.52  1.905
284  1,2,4-Benzothiadiazine-1,1-dioxide-3-methyl-6-acetylamino-7-chloro  0.53  0.8317
285  Phenoxyacetic acid,3-acetamido-4-chloro  0.75  0.8014
286  o-ethyl phenoxyacetic acid  2.53  1.966
287  5,5-Diallylbarbituric acid  1.05  1.012
288  p-Aminohippuric acid,methyl ester  -0.23  0.376
289  N-(4-Ethoxyphenyl)acetamide  1.58  1.708
290  Fuscaric acid  0.68  1.797
291  3,5-Dimethyl-4-hydroxyacetanilide  1.11  1.167
292  3-Ethyl-4-hydroxyacetanilide  1.31  1.495
293  3,5-Dimethylphenyl methylcarbamate  2.23  1.998
294  N-Methyl-3-ethoxyphenylcarbamate  1.75  1.726
295  N-Methyl-4-ethoxyphenylcarbamate  1.63  1.695
296  4-Sulfamylbenzoic acid,propyl ester  1.75  1.035
297  2-(3'-Pyridyl) piperidine  0.97  1.481
298  Nikethamide  0.33  1.414
299  4-Pyridinepentanol  1.39  2.116
300  5-tert-Butyl-5-ethyl-1,3-diazinane-2,4,6-trione  1.73  0.874
301  1-Isothiocyanonaphthalene  4.34  3.466
302  (2-Chloro-4-trifluoromethylphenyl)hydrazono]cyano-acetic acid methyl  4.66  3.677
303  (2,4,5-Trichlorophenyl)hydrazono]-cyano-acetic acid ethyl ester  5.21  4.49
304  (3-Chloro-phenyl)hydrazono] cyano-acetic acid ethyl ester  3.94  3.414
305  (2-Chlorophenyl)hydrazono] cyano-acetic acid ethyl ester  3.38  2.813
306  8-Dimethylaminoquinoline  2.73  2.178
307  3-Acetyl phenyl dimethylcarbamate  1.18  1.774
308  Sulfisoxazole  1.01  1.26
309  2,3,5-Trimethyl-4-hydroxyacetanilide  0.82  0.5573
310  (2-Propan-2-yloxyphenyl) N-methylcarbamate  1.52  1.831
311  3-Isopropoxyphenyl N-methylcarbamate  1.96  2.109
312  1,7-Phenanthroline  2.51  2.426
313  Phenazine  2.84  3.035
314  1-Naphthyl N-methylcarbamate  2.36  2.621
315  o-Phenoxyaniline  2.46  3.014
316  4,4'-Diamino diphenyl ether  1.36  1.499
317  5,6-Dihydro-2-methyl-1,4-oxathiin-3-carboxamide  2.14  2.314
318  2-Iodophenyl-beta-D-glucopyranoside  0.27  1.079
319  (o-sec-Butylphenoxy)acetic acid  3.32  2.752
320  2-sec-Butylphenyl N-methylcarbamate  2.78  2.477
321  N-Methyl-3-methyl-6-isopropyl phenylcarbamate  2.84  2.515
322  N-Methyl-4-sec-butylphenylcarbamate  3.2  2.866
323  N-Methyl-4-tert-butylphenylcarbamate  3.06  2.799
324  N-Methyl-4-butoxyphenylcarbamate  2.86  2.454
325  2-I-Pentoxy-4-aminobenzoic acid  2.3  1.852
326  2-Aminophenyl beta-d-glucopyranoside  -1.23  -1.047
327  Thioxanthone  3.99  2.616
328  (4-Isothiocyanatophenyl)-phenyldiazene  5.55  4.135
329  4-Isothiocyanaodiphenylether  4.75  4.142
330  4-Isothiocyanodiphenylsulfoxide  4.4  2.757
331  1-Aminoacridine  2.47  2.696
332  p-Hydroxybenzophenone  3.07  2.975
333  m-Phenoxybenzoic acid  3.91  2.96
334  2-(3-Methylphenyl) acetic acid  1.95  1.362
335  3-Chlorophenyl 4-aminosalicylate  3.9  2.888
336  4-Bromophenyl 4-aminosalicylate  3.46  3.103
337  4,4-Dimethyl-3-oxo-2-[(2,4,5-trichlorophenyl) hydrazono]pentanenitrile  5.86  5.105
338  Ethyl 3-oxo-5-phenylpentanoate  2.52  2.269
339  Aminopyrine  1  1.971
340  3,4-Diethoxyphenylcarbamate,o-ethyl  2.5  2.483
341  Probenecid  3.21  2.097
342  4,4'-Diisothiocyanatebiphenyl  5.5  4.352
343  4-Isothiocyanobenzophenone  4.88  3.949
344  (4-Methylphenyl) 4-amino-2-hydroxybenzoate  3.38  2.533
345  5-Benzyl-3-furanmethanol acetate  3.24  3.222
346  3,5-Di-propyl-4-hydroxyacetanilide  2.67  2.765
347  Diethofencarb  2.82  2.529
348  2-Isothiocyanoanthracene  5.7  4.66
349  Apigenin  1.74  2.559
350  4-Aminosalicylic acid,2,6-dimethylphenyl ester  3.88  2.86
351  Cycloheximide  0.55  6.70E-02
352  Metoprolol  1.88  2.566
353  Medazepam  4.41  4.641
354  Tetrazepam  2.76  3.578
355  Propranolol  2.98  3.657
356  2,6-Diphenylpyridine  4.82  4.843
357  Chlorpromazine  5.19  5.059
358  Diphenhydramine  3.27  4.806
359  Mepyramine  3.27  3.284
360  Estrone  3.13  3.325
361  Acebutolol  1.71  1.376
362  Isothazine  4.77  5.972
363  Benzo(A)pyrene  5.97  5.712
364  o,o-diethyl-o-phenylphosphate  1.64  1.907
365  [Methoxy(methyl)phosphoryl]oxymethane  -0.66  -0.1839
366  o,S-Dime-N-BU-phosphoramidothioate  0.94  0.2227
367  Dicapthon  3.58  2.841
368  2-Phenoxy-1,3,2-dioxaphospholane  1.42  1.202

Table A8. The theoretical logPcw calculated with AM1 and the iso

No  Compound  Exp  Calc
1  p-Toluic acid  1.4  1.48
2  1-Naphthol  1.2  1.96
3  iso-Butyl alcohol  0.3  1.03
4  m-Chlorophenol  1.0  1.76
5  Propionamide  -1.4  -1.55
6  p-Hexylpyridine  5.0  5.34
7  p-Chlorophenol  1.0  0.57
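The agreement between the Exp and Calc columns of Table A8 can be summarised numerically. The following is a minimal Python sketch that copies the seven rows of the table and computes the mean absolute error and the root-mean-square error. The choice of these two error measures and the names `TABLE_A8` and `error_summary` are assumptions made for illustration; they are not taken from the thesis.

```python
import math

# (compound, experimental logPcw, calculated logPcw), copied from Table A8
TABLE_A8 = [
    ("p-Toluic acid", 1.4, 1.48),
    ("1-Naphthol", 1.2, 1.96),
    ("iso-Butyl alcohol", 0.3, 1.03),
    ("m-Chlorophenol", 1.0, 1.76),
    ("Propionamide", -1.4, -1.55),
    ("p-Hexylpyridine", 5.0, 5.34),
    ("p-Chlorophenol", 1.0, 0.57),
]


def error_summary(rows):
    """Return (MAE, RMSE) of calculated minus experimental values."""
    residuals = [calc - exp for _, exp, calc in rows]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return mae, rmse


mae, rmse = error_summary(TABLE_A8)
print(f"MAE = {mae:.3f} log units, RMSE = {rmse:.3f} log units")
```

With only seven compounds these numbers describe this small set alone and say little about the general quality of the model.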


Table A9. The total data set of 144 PPL compounds used (training and test set)

No  Compound  Inductance
1  17 alpha-Ethinylestradiol  Negative
2  1-Amino-4-octylpiperazine  Positive
3  1-Chloro-10,11-dehydroamitriptyline  Positive
4  1-Chloroamitriptyline  Positive
5  6-Hydroxy-dopamine  Positive
6  Abacavir  Negative
7  Amineptine  Negative
8  Amodiaquine  Positive
9  Azaserine  Negative
10  Azithromycin  Positive
11  Bilirubin  Positive
12  Brompheniramine  Positive
13  Caffeine  Negative
14  Carbamazepine  Negative
15  Carbon Tetrachloride  Negative
16  Ceftazidime  Negative
17  Chloroquine  Positive
18  Chlorpromazine  Positive
19  Chlortetracycline  Negative
20  Ciprofibrate  Negative
21  Clociguanil  Positive
22  Clofibrate  Negative
23  Colchicine  Negative
24  Cyclizine  Positive
25  Cyproterone Acetate  Negative
26  Desipramine  Positive
27  (d-)H-4,4-bis-Diethylaminoethoxy-diethylphenylethane  Positive
28  Dibucaine  Positive
29  Erythromycin  Positive
30  Etoposide  Negative
31  Felbamate  Negative
32  Fenofibrate  Negative
33  Fluoxetine  Positive
34  Galactosamine  Negative
35  Gemfibrozil  Negative
36  Gentamicin  Positive
37  Hydroxyzine  Positive
38  Hypoglycine A  Negative
39  Iprindole  Positive
40  Ketoconazole  Positive
41  Lysergide  Positive
42  Methotrexate  Negative
43  Methyldopa  Negative
44  Norchlorcyclizine  Positive
45  Nortriptyline  Positive
46  Phenacetin  Positive
47  Pheniramine  Positive
48  Phenobarbital  Negative
49  Phentermine  Positive
50  Physostigmine  Negative
51  Piroxicam  Negative
52  Quinine  Positive
53  R-800  Positive
54  RMI10393  Positive
55  SC-45864  Positive
56  Stilbamidine  Positive
57  Sulindac  Negative
58  Suramin  Positive
59  Tamoxifen  Positive
60  Temozolomide  Negative
61  Tetracaine  Positive
62  Thioacetamide  Negative
63  Tilorone  Positive
64  Tobramycin  Positive
65  Tocainide  Positive
66  Trimeprazine  Positive
67  Trimethoprim sulfamethoxazole  Positive
68  Trospectomycin sulfate  Positive
69  Valproic Acid  Negative
70  WY-14643  Negative
71  Zidovudine  Negative
72  Zileuton  Negative
73  3-Methylcholanthrene  Negative
74  AC 3579  Positive
75  Acetaminophen  Negative
76  Amikacin  Positive
77  Amiodarone  Positive
78  Amitriptyline  Positive
79  Anticoman  Negative
80  Aricept  Negative
81  AY-25329  Negative
82  AY-9944  Positive
83  Bicalutamide  Negative
84  Boxidine  Positive
85  Bupropion  Negative
86  Cephaloridine  Positive
87  Chlorcyclizine  Positive
88  Chloroform  Negative
89  Chlorphentermine  Positive
90  Citalopram  Positive
91  Clindamycin  Positive
92  Clomipramine  Positive
93  Clozapine  Positive
94  Cocaine  Positive
95  Coralgil  Positive
96  Dantrolene  Negative
97  Demeclocycline  Negative
98  Desferal  Negative
99  Dibekacin  Positive
100  Diclofenac  Negative
101  Diflunisal  Negative
102  di-Isobutamide  Positive
103  Doxapram  Negative
104  Doxycycline  Negative
105  Emetine  Positive
106  Ethyl fluclozepate  Positive
107  Famotidine  Negative
108  Fenfluramine  Positive
109  Flutamide  Negative
110  Homochlorocyclizine  Positive
111  Hydrazine  Negative
112  Hydroxyurea  Negative
113  IA3  Positive
114  Imipramine  Positive
115  Indoramin  Positive
116  L-ethionine  Negative
117  Maprotiline  Positive
118  Meclizine  Positive
119  Mesoridazine  Positive
120  Metformin  Negative
121  Methadone  Negative
122  Methapyrilene  Negative
123  Mianserin  Positive
124  Netilmicin  Positive
125  Noxiptiline  Positive
126  Paraquat  Positive
127  Perhexiline  Positive
128  Procaine  Negative
129  Promethazine  Positive
130  Propranolol  Positive
131  Quinacrine  Positive
132  Quinidine  Positive
133  Rolitetracycline  Negative
134  SDZ_200-125  Positive
135  SKF-14336-D  Positive
136  Stavudine  Negative
137  Tacrine  Negative
138  Trifluperazine  Positive
139  Triparanol  Positive
140  Tunicamycin  Positive
141  Zimelidine  Positive
142  Ceftazidime  Negative
143  Carbon tetrachloride  Negative
144  Valproic acid  Negative


Table A10. The 124 Descriptors calculated with ParaSurf10alpha

Molecular Electrostatic Potential Descriptors

Descriptor  Symbol  Description
$V_{max}$  MEPmax  Maximum (most positive) MEP value
$V_{min}$  MEPmin  Minimum (most negative) MEP value
$\bar{V}_{+}$  meanMEP+  Mean of the positive MEP values
$\bar{V}_{-}$  meanMEP-  Mean of the negative MEP values
$\bar{V}$  meanMEP  Mean of all MEP values
$\Delta V$  MEP-range  MEP range
$\sigma_{+}^{2}$  MEPvar+  Total variance in the positive MEP values
$\sigma_{-}^{2}$  MEPvar-  Total variance in the negative MEP values
$\sigma_{tot}^{2}$  MEPvartot  Total variance in the MEP
$\nu$  MEPbalance  MEP balance parameter
$\sigma_{tot}^{2}\nu$  Var*balance  Product of the total variance in the MEP and the MEP balance parameter
$\gamma_{1}^{V}$  MEPskew  Skewness of the MEP distribution
$\gamma_{2}^{V}$  MEPkurt  Kurtosis of the MEP distribution
$\int V$  MEPint  Integrated MEP over the surface

Local Ionization Energy Descriptors

Descriptor  Symbol  Description
$IE_{L}^{max}$  IELmax  Maximum value of the local ionization energy
$IE_{L}^{min}$  IELmin  Minimum value of the local ionization energy
$\overline{IE}_{L}$  IELbar  Mean value of the local ionization energy
$\Delta IE_{L}$  IEL-range  Range of the local ionization energy
$\sigma_{IE}^{2}$  IELvar  Variance of the local ionization energy
$\gamma_{1}^{IE_{L}}$  IELskew  Skewness of the local ionization energy distribution
$\gamma_{2}^{IE_{L}}$  IELkurt  Kurtosis of the local ionization energy distribution
$\int IE_{L}$  IELint  Integrated local ionization energy over the surface

Local Electron Affinity Descriptors

Descriptor  Symbol  Description
$EA_{L}^{max}$  EALmax  Maximum of the local electron affinity
$EA_{L}^{min}$  EALmin  Minimum of the local electron affinity
$\overline{EA}_{L+}$  EALbar+  Mean of the positive values of the local electron affinity
$\overline{EA}_{L-}$  EALbar-  Mean of the negative values of the local electron affinity
$\overline{EA}_{L}$  EALbar  Mean value of the local electron affinity
$\Delta EA_{L}$  EAL-range  Range of the local electron affinity
$\sigma_{EA+}^{2}$  EALvar+  Variance in the local electron affinity for all positive values
$\sigma_{EA-}^{2}$  EALvar-  Variance in the local electron affinity for all negative values
$\sigma_{EAtot}^{2}$  EALvartot  Sum of the positive and negative variances in the local electron affinity
$\nu_{EA}$  EALbalance  Local electron affinity balance parameter
$\delta A_{EA}^{+}$  EALfraction+  Fraction of the surface area with positive local electron affinity
$A_{EA}^{+}$  EALarea+  Surface area with positive local electron affinity
$\gamma_{1}^{EA_{L}}$  EALskew  Skewness of the local electron affinity distribution
$\gamma_{2}^{EA_{L}}$  EALkurt  Kurtosis of the local electron affinity distribution
$\int EA_{L}$  EALint  Integrated local electron affinity over the surface

Local Electronegativity

Descriptor  Symbol  Description
$\chi_{L}^{max}$  ENEGmax  Maximum value of the local electronegativity
$\chi_{L}^{min}$  ENEGmin  Minimum value of the local electronegativity
$\bar{\chi}_{L}$  ENEGbar  Mean value of the local electronegativity
$\Delta\chi_{L}$  ENEGrange  Range of the local electronegativity
$\sigma_{\chi}^{2}$  ENEGvar  Variance in the local electronegativity
$\gamma_{1}^{\chi_{L}}$  ENEGskew  Skewness of the local electronegativity distribution
$\gamma_{2}^{\chi_{L}}$  ENEGkurt  Kurtosis of the local electronegativity distribution
$\int \chi_{L}$  ENEGint  Integrated local electronegativity over the surface

Local Hardness

Descriptor  Symbol  Description
$\eta_{L}^{max}$  HARDmax  Maximum value of the local hardness
$\eta_{L}^{min}$  HARDmin  Minimum value of the local hardness
$\bar{\eta}_{L}$  HARDbar  Mean value of the local hardness
$\Delta\eta_{L}$  HARDrange  Range of the local hardness
$\sigma_{\eta}^{2}$  HARDvar  Variance in the local hardness
$\gamma_{1}^{\eta_{L}}$  HARDskew  Skewness of the local hardness distribution
$\gamma_{2}^{\eta_{L}}$  HARDkurt  Kurtosis of the local hardness distribution
$\int \eta_{L}$  HARDint  Integrated local hardness over the surface

Local Polarizability Descriptors

Descriptor  Symbol  Description
$\alpha_{L}^{max}$  POLmax  Maximum value of the local polarizability
$\alpha_{L}^{min}$  POLmin  Minimum value of the local polarizability
$\bar{\alpha}_{L}$  POLbar  Mean value of the local polarizability
$\Delta\alpha_{L}$  POL-range  Range of the local polarizability
$\sigma_{\alpha}^{2}$  POLvar  Variance in the local polarizability
$\gamma_{1}^{\alpha_{L}}$  POLskew  Skewness of the local polarizability distribution
$\gamma_{2}^{\alpha_{L}}$  POLkurt  Kurtosis of the local polarizability distribution
$\int \alpha_{L}$  POLint  Integrated local polarizability over the surface

Additional Descriptors

Descriptor  Symbol  Description
$\mu$  dipole  Dipole moment
$\mu_{D}$  dipden  Dipolar density
$\alpha$  polarizability  Molecular electronic polarizability
MW  MWt  Molecular weight
G  globularity  Globularity
A  totalarea  Molecular surface area
VOL  volume  Molecular Volume
Qsum  Sum of the VESPA electronic potential on (N, O, P, S, hal, F, Cl, Br, I, H) atoms
Estate  Analogous to the Kier&Hall Estate using the bond order between atom i and j instead of the distance for (N, O, P, S, F, Cl, Br, I, hal) atoms
Estate2  Analogous to the Kier&Hall Estate using $r_{ij}$ to describe the distance between atom i and j for (N, O, P, S, F, Cl, Br, I, hal) atoms
LocPol  Local polarity: All absolute deviations from the mean ESP for each surface point summed up and divided by the number of surface points
CovHBac  Covalent hydrogen bond acidity: $E_{HOMO}$(molecule) - $E_{LUMO}$(water)
CovHBbas  Covalent hydrogen bond basicity: $E_{LUMO}$(molecule) - $E_{HOMO}$(water)
EsHBac  Electrostatic hydrogen bond acidity: Most negative formal charge (molecule) - Most positive formal charge (water)
EsHBbas  Electrostatic hydrogen bond basicity: Most positive formal charge on hydrogen (molecule) - most negative formal charge (water)
CohIndex  Cohesive index: ($N_{acc} \times N_{don}^{0.5}$)/total surface
HoLu
LewDon  Lewis Donor
LewAcc  Lewis Acceptor
Nacc  Number of H-bond acceptors
Ndon  Number of H-bond donors
Naryl  Number of aromatic rings
Npos  Number of positive
Nneg  Number of negative
$F_{N}^{max}$  FNmax  Maximum value of the electrostatic field normal to the surface
$F_{N}^{min}$  FNmin  Minimum value of the electrostatic field normal to the surface
$\Delta F_{N}$  FNrange  Range of the field normal to the surface
$\bar{F}_{N}$  FNmean  Mean value of the field normal to the surface
$\sigma_{F}^{2}$  FNvartot  Variance in the field normal to the surface
$\sigma_{F+}^{2}$  FNvar+  Variance in the field normal to the surface for all positive values
$\sigma_{F-}^{2}$  FNvar-  Variance in the field normal to the surface for all negative values
$\nu_{F}$  FNbal  Normal field balance parameter
$\gamma_{1}^{F_{N}}$  FNskew  Skewness of the field normal to the surface
$\gamma_{2}^{F_{N}}$  FNkurt  Kurtosis of the field normal to the surface
$\int F_{N}$  FNint  Integrated field normal to the surface over the surface
$\int F_{N}^{+}$  FN+  Integrated field normal to the surface over the surface for all positive values
$\int F_{N}^{-}$  FN-  Integrated field normal to the surface over the surface for all negative values
$\int |F_{N}|$  FNabs  Integrated absolute field normal to the surface over the surface
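Several of the descriptors in Table A10 are simple statistics of a local property sampled at discrete surface points. The following Python sketch illustrates how a few of them could be evaluated from a plain list of surface values. It is a sketch under stated assumptions, not the ParaSurf implementation: the function names are hypothetical, the variances are assumed to be population variances (division by the number of points), and the balance parameter is assumed to take the commonly used form $\nu = \sigma_{+}^{2}\sigma_{-}^{2}/(\sigma_{tot}^{2})^{2}$, which the table itself does not state. Only the LocPol and CohIndex expressions and the definition of the total variance as the sum of the positive and negative variances are taken from the table.

```python
import math


def surface_descriptors(values):
    """Summary statistics of a local property sampled at surface points.

    `values` is a plain list of property values, one per surface point
    (hypothetical input format chosen for this illustration).
    """
    n = len(values)
    mean = sum(values) / n
    pos = [v for v in values if v > 0]
    neg = [v for v in values if v < 0]

    def variance(xs):
        # Assumed population variance; returns 0.0 for an empty subset.
        if not xs:
            return 0.0
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    var_pos, var_neg = variance(pos), variance(neg)
    var_tot = var_pos + var_neg  # sum of positive and negative variances
    # Assumed form of the balance parameter (not given in Table A10).
    balance = var_pos * var_neg / var_tot ** 2 if var_tot > 0 else 0.0
    # LocPol: mean absolute deviation from the mean over all surface points.
    loc_pol = sum(abs(v - mean) for v in values) / n
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean,
        "range": max(values) - min(values),
        "var_pos": var_pos,
        "var_neg": var_neg,
        "var_tot": var_tot,
        "balance": balance,
        "loc_pol": loc_pol,
    }


def cohesive_index(n_acc, n_don, total_surface):
    """CohIndex as written in Table A10: (Nacc x Ndon^0.5) / total surface."""
    return n_acc * math.sqrt(n_don) / total_surface


# Toy surface values, invented purely to exercise the functions.
demo = surface_descriptors([-2.0, -1.0, 1.0, 3.0])
print(demo)
print(cohesive_index(2, 4, 100.0))
```

With the assumed form, the balance parameter reaches its largest value, 0.25, when the positive and negative variances are equal.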


Table A11. Results of the best predictions obtained

Compound  Inductance  AM1 (solvex, NB)  MNDO (solvex, NB)  AM1* (solvex, RF)  MNDO/d (solvex, RF)  PM3 (solvex, RF)  PM6 (iso, RF)
Abacavir  -1  -1  -1  1  1  1  1
Bilirubin  1  -1  -1  -1  -1  -1  1
Caffeine  -1  -1  -1  -1  -1  -1  -1
Carbamazepine  -1  -1  -1  1  -1  -1  1
Chloroquine  1  1  1  1  1  1  1
Chlorpromazine  1  1  1  1  1  1  1
Chlortetracycline  -1  -1  -1  -1  -1  -1  1
Clociguanil  1  -1  -1  -1  -1  1  -1
Colchicine  -1  1  1  1  1  1  1
Cyclizine  1  1  1  1  1  1  1
Desipramine  1  1  1  1  1  1  1
Dibucaine  1  1  1  1  1  1  1
Etoposide  -1  1  1  -1  -1  -1  1
Galactosamine  -1  -1  -1  -1  -1  -1  -1
Gemfibrozil  -1  1  -1  -1  -1  -1  1
Gentamicin  1  1  1  1  1  1  1
Hydroxyzine  1  1  1  1  1  1  1
Hypoglicin-A  -1  -1  -1  -1  -1  -1  1
Ketoconazole  1  1  -1  -1  1  -1  1
Lysergide  1  1  1  1  1  1  1
Methotrexate  -1  -1  -1  -1  -1  -1  -1
Methyldopa  -1  -1  -1  -1  -1  -1  -1
Norchlorcyclizine  1  1  1  1  1  1  1
Nortriptyline  1  1  1  1  1  1  1
Pheniramine  1  1  1  1  1  1  1
Phentermine  1  1  1  1  1  1  1
Piroxicam  -1  -1  -1  -1  -1  -1  -1
Sulindac  -1  1  1  -1  1  -1  -1
Suramin  1  -1  1  -1  -1  -1  -1
Tamoxifen  1  1  1  1  1  1  1
Temozolomide  -1  -1  -1  -1  -1  -1  -1
Tetracaine  1  1  1  1  1  1  1
Tobramycin  1  1  1  1  1  1  1
Tocainide  1  1  -1  1  -1  1  -1
WY-14643  -1  -1  -1  -1  -1  -1  -1
Zileuton  -1  -1  -1  -1  -1  -1  -1
AC-3579  1  1  1  -1  1  1  1
Acetaminophen  -1  -1  -1  1  -1  -1  -1
Amikacin  1  1  1  -1  -1  1  1
Amiodarone  1  -1  1  -1  1  1  1
AY-9944  1  1  1  1  1  1  1
Bicalutamide  -1  1  1  -1  -1  -1  -1
Carbon tetrachloride  -1  -1  -1  -1  -1  -1  1
Chlorcyclizine  1  1  1  1  1  1  1
Clomipramine  1  1  1  1  1  1  1
Demeclocycline  -1  -1  -1  -1  -1  -1  1
Diflunisal  -1  1  1  -1  -1  -1  1
Doxapram  -1  1  1  -1  1  1  1
Doxycycline  -1  -1  -1  -1  -1  1  -1
Emetine  1  1  1  1  1  1  1
Famotidine  -1  -1  -1  -1  -1  -1  -1
Fenfluramine  1  1  1  1  1  1  1
Flutamide  -1  -1  1  -1  -1  -1  -1
Homochlorcyclizine  1  1  1  1  1  1  1
Hydrazine  -1  -1  -1  -1  1  -1  -1
Methadone  -1  1  1  1  1  1  1
Netilmicin  1  1  1  1  1  1  1
Procaine  -1  1  1  -1  -1  1  -1
Promethazine  1  1  1  1  1  1  1
Quinacrine  1  1  1  -1  -1  1  1
Quinidine  1  1  1  1  1  1  1
Rolitetracycline  -1  1  1  -1  -1  -1  1
SDZ-200-125  1  1  1  1  1  1  1
Stavudine  -1  -1  -1  -1  -1  -1  -1
Tacrine  -1  1  1  1  1  1  -1
Trifluperazine  1  1  1  1  1  1  1
Triparanol  1  1  1  1  1  1  1
Tunicamycin  1  1  1  -1  -1  -1  1
Valproic acid  -1  -1  -1  -1  -1  -1  -1
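Each prediction column of Table A11 can be compared with the Inductance column by counting agreements and disagreements. The Python sketch below does this for the first five rows of the table only and for two of the six prediction columns, so its output describes that small excerpt and not the full table. The function names are hypothetical, and reading 1 as the positive class and -1 as the negative class is an assumption based on the Positive/Negative labels of Table A9.

```python
# First five rows of Table A11:
# (compound, observed, AM1 (solvex, NB) prediction, PM6 (iso, RF) prediction)
ROWS = [
    ("Abacavir", -1, -1, 1),
    ("Bilirubin", 1, -1, 1),
    ("Caffeine", -1, -1, -1),
    ("Carbamazepine", -1, -1, 1),
    ("Chloroquine", 1, 1, 1),
]


def confusion_counts(observed, predicted):
    """Count TP, FP, TN and FN, treating 1 as positive and -1 as negative."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for obs, pred in zip(observed, predicted):
        if pred == 1:
            counts["TP" if obs == 1 else "FP"] += 1
        else:
            counts["TN" if obs == -1 else "FN"] += 1
    return counts


def accuracy(counts):
    """Fraction of compounds whose predicted class matches the observed one."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


observed = [row[1] for row in ROWS]
am1_counts = confusion_counts(observed, [row[2] for row in ROWS])
pm6_counts = confusion_counts(observed, [row[3] for row in ROWS])
print("AM1 (solvex, NB):", am1_counts, accuracy(am1_counts))
print("PM6 (iso, RF):", pm6_counts, accuracy(pm6_counts))
```

Extending `ROWS` with the remaining compounds and columns would give the corresponding counts for the whole table.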
